2. CRISP Value Process
1. Understand the business
2. Understand the data
3. Prepare the data
4. Model
5. Evaluate
6. Deploy
3. Data to Big Data
Data are individual units of information. We store more and more data, which leads to Big Data.
4. Big Data definitions, 1989
Erik Larson, Harper's Magazine: 'The keepers of big data say they do it for the consumer's benefit. But data have a way of being used for purposes other than originally intended.'
(Reality today: private data is becoming commoditized.)
5. Big Data definition, 2001
Doug Laney, Gartner, 2001: '3-D Data Management: Controlling Data Volume, Velocity and Variety'
Source: avyamuthanna.files.wordpress.com/2013/01/picture-11.png
6. Big Data today: it's all about value
Big Data is any data that is expensive to manage and hard to extract value from.
(Source: Michael Franklin, Director of the AMPLab (Algorithms, Machines and People Lab), University of California, Berkeley)
Extracting value out of big data is all about predicting the future based on observations of the past.
7. Big Data: the four V's (Volume, Velocity, Variety, Veracity): https://www.ibmbigdatahub.com/infographic/four-vs-big-data
9. • up to 75 control devices in each BMW
• ~1,000 individual configurations possible
• ~1 GByte of functional software and 15 GByte of data in the car
• ~2,000 customer functions implemented
• ~12,000 error memory entries for onboard diagnostics
• up to 60,000 diagnosis processes daily, worldwide
• centralized data storage and organization
• data fusion and data mining for quality assurance and a better understanding of realistic environments
Source: Bitkom BMW keynote talk
10. Big Data Sources: car black boxes
Tracking the data in a car can have benefits, but it comes with security and privacy challenges (see the lecture on ethical challenges).
source: Los Angeles Times
11. IoT Sensor Data example: Gas Turbine
A gas turbine has up to 1,000 sensors.
• Each sensor can (theoretically) produce data in the millisecond range.
• Example of a real-life setup:
  • averages are stored per second (history kept for one year)
  • often a long history is available, e.g. back to the year 2000 in 5-minute averages
12. Theoretical data stream storage, gas turbine example
Realistic scenario: store (timestamp, value) tuples.
• new sensors will be introduced, and existing sensors might change
Theoretical raw stream: one (timestamp, value) tuple of 64 Byte × 1,000 sensors gives
• 20 ms: 64 kByte
• 1 s: 3.2 MByte
• 1 day: 276 MByte
• 1 year: 100.9 GByte
Reality:
• 1 year stored in 1 s averages: 200 GByte
• 10 years stored in 5-minute averages: ~7 TByte
• × 100 engines in one data center: ~10 TByte per year
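The per-interval figures can be reproduced with a short back-of-envelope script. Below is a minimal sketch in Python, assuming (as stated above) 64-byte tuples and 1,000 sensors; real historians add averaging and compression on top, which is why the long-horizon figures above differ from a naive raw-stream extrapolation.

# Back-of-envelope storage estimate, assuming (as on the slide) that one
# (timestamp, value) tuple takes 64 bytes and the turbine has 1,000 sensors.
TUPLE_BYTES = 64
SENSORS = 1_000

def stream_bytes(sample_interval_s: float, duration_s: float) -> float:
    """Storage for one tuple per sensor per sampling interval."""
    return (duration_s / sample_interval_s) * SENSORS * TUPLE_BYTES

print(stream_bytes(0.02, 0.02) / 1e3, "kByte per 20 ms sample")      # 64.0
print(stream_bytes(0.02, 1.0) / 1e6, "MByte per second of raw data")  # 3.2
# Longer horizons (days, years) and coarser averages (1 s, 5 min) follow by
# changing the arguments; compression and averaging reduce the totals further.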
13. Big Data Landscape - Data Lake Architecture
Components overview and terminology
14. The data lake is one part in the overall data-to-value path
Source: Raw (big) data typically comes from many different sources (e.g. Twitter, WWW, social media, sensors, mobile payments, transactions, transport, video, pictures, voice) and has many different data types: mostly structured, semi-structured, or unstructured.
Manage: A data lake is a storage repository that holds a vast amount of (big) data in its native format and provides intelligent (semi-structured) access until the data is needed.
Value: The value of the data is delivered via enterprise systems / UX components, with the overall goal of enabling data-driven decisions.
15. Stages of Data in a Data Lake – High-Level Architecture
The data flow and the technology, tools, and programming used depend on the data type and the final application layer. Simplified data lake data path (what / how per stage):
• Data Sources: business systems
• Ingestion: file transfer, RDB import, REST APIs; stream or batch transfer
• Raw Data: initial raw data storage, on distributed storage (e.g. Hadoop)
• Transform / Curate: cleansing and transformation for a purpose, on distributed storage (e.g. Hadoop)
• Enriched Data: add semantics; searchable, anonymized, …; stored in databases fit for purpose
• Delivery: semantic data access via on-request data services
• Applications & Visualizations
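To make the stages concrete, here is an illustrative sketch (not from the slides; the file name and column names are assumptions) of the raw → curated → enriched path, with pandas standing in for the distributed storage and transform technology.

# Illustrative data path: ingest -> curate -> enrich, on toy tabular data.
import pandas as pd

def ingest(path: str) -> pd.DataFrame:
    """Ingestion: load the raw dump as delivered (file transfer / RDB import)."""
    return pd.read_csv(path)  # hypothetical raw sensor export

def curate(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform / curate: cleanse and type the data for a purpose."""
    curated = raw.dropna(subset=["timestamp", "value"]).copy()
    curated["timestamp"] = pd.to_datetime(curated["timestamp"])
    return curated

def enrich(curated: pd.DataFrame) -> pd.DataFrame:
    """Enriched data: add a searchable daily aggregate (a simple semantic)."""
    return (curated.set_index("timestamp")
                   .resample("1D")["value"].mean()
                   .reset_index(name="daily_mean"))

# Delivery: expose enrich(curate(ingest("readings.csv"))) on request,
# e.g. behind a REST endpoint.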
16. Exemplary high-level walk-through to extract, store, and deliver trend information
• Data source: WWW sources.
• Information retrieval: large-scale web crawlers download all links found; the saved webpages form the (semi-)unstructured or raw data.
• Data lake (storing and mining relevant information): search and mine the data to extract semantics (relevance); a structured (graph) database of trends allows for easy access to the clean, structured data.
• Final presentation: trend report, drill-down boards, relevant internet webpages for the topic.
source: trends.google
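A hypothetical mini version of the information-retrieval step could look like the sketch below (library choices and the start URL are illustrative, not from the slides; a real large-scale crawler adds politeness, deduplication, and scheduling).

# Download one page, save it raw, and collect outgoing links for the frontier.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def fetch_and_extract(url: str) -> list[str]:
    html = requests.get(url, timeout=10).text
    with open("saved_page.html", "w", encoding="utf-8") as f:
        f.write(html)  # "saved webpages" = the raw data layer of the lake
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

frontier = fetch_and_extract("https://example.com")  # placeholder start URL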
17. A data-to-value architecture is composed of many building blocks
From raw data input to valuable data output, the stack comprises: data sources and data ingestion; data storage; data access / pipelines; value delivery; and the business application. These are flanked by data governance, a functional layer, and the deployment / physical layer. Depending on the data type and the final business application, different elements are utilized.
18. A data lake is often a fundamental part of the data-to-value stack and focuses on the technical management of big data
The same building blocks as on the previous slide apply (data sources and ingestion, data storage, data access / pipelines, value delivery, business application, data governance, functional layer, deployment / physical); data lake architectures (often) focus on the technical layers of this stack, from raw data input through data storage and data access / pipelines.
19. Data lake high-level architecture, with different possibilities to store, process, and deliver valuable information
Depending on the data type and the final business application, different elements are utilized:
• Data Sources: unstructured (text, emails, documents; video, media; voice, music, sound), semi-structured (XML, JSON, sensor / IoT data), structured (databases, ERP core)
• Data Ingestion: stream, batch, hybrid
• Data Storage: relational (row-based, column-based), non-relational (graph DB, document DB, key-value)
• Data Access / Pipelines: stream, batch, interactive
• Value Delivery: descriptive, predictive, prescriptive; visualizations, interfaces
• Business Application: reporting; operational, tactical, strategic
• Data Governance: availability, data security, compliance & controls, roles & responsibility, data quality, …
• Functional Layer
• Deployment: on-premise, cloud, hybrid; application life cycle
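As a toy sketch (not from the slides) of why the storage choice matters: row-based layouts keep a record together and favor point lookups, while column-based layouts keep a column together and favor aggregations.

# Same records, two layouts (illustrative data).
rows = [  # row-based: one record kept together
    {"order_id": 1, "customer": "A", "amount": 10.0},
    {"order_id": 2, "customer": "B", "amount": 25.0},
]
columns = {  # column-based: one column kept together
    "order_id": [1, 2],
    "customer": ["A", "B"],
    "amount": [10.0, 25.0],
}
print(rows[1])                 # point lookup favors the row layout
print(sum(columns["amount"]))  # column scan / aggregation favors the columnar layout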
20. The design of a data pipeline / data lake depends on the business, technical, and non-functional requirements
• Business requirements: why do we need this?
• Data requirements: which data are needed?
• Technical requirements: how to realize it?
• Non-functional requirements: what constraints apply?
These requirements are applied against the architecture layers of the previous slide.
21. The design of a data pipeline / data lake depends on the business, technical, and non-functional requirements – example questions to be answered
Business requirements (why do we need this?):
• Who is the customer (internal, external)?
• How does it help, and in which situation / process?
• Which value do we expect?
• If we improve quality by x%, which benefit do we expect?
• How do we visualize / serve the results / integrate them back?
Non-functional requirements (what constraints?):
• Which service level must the solution meet (on request, 99% uptime)?
• Where is the data allowed to be stored (e.g. GDPR)?
• Who has access to the application / data?
• How is the support organized?
• Which security level is granted?
Technical requirements (how to realize it?):
• How does the application provide the result (e.g. which technical interface)?
• How is the data stored, and what are the latency requirements for read / write?
• How do we ensure a test / productive setup?
• Where do we compute, and with which libraries?
• Which algorithms serve the requirements best?
22. For each layer in the data stack, many different vendors and applications exist
Layers: data storage; data access / pipelines; value delivery; business application; functional layer; deployment / physical.
Focus: managing big data and data pipelines
• infrastructure and hardware for Big Data
• Big Data distributions (e.g. Hadoop)
• components for data management (distributed data systems, in-memory databases, …)
Focus: extracting value
• full business SaaS services
• tool boxes for visualization
• workflow enablement
23. Nearly all technical Big Data / data lake stacks are based on the (open-source) Hadoop ecosystem
Component – Description:
• HDFS – The Hadoop Distributed File System.
• Mahout – Machine learning on the HDFS system.
• Zookeeper – A centralized service for maintaining synchronization and group services.
• Yarn – Hadoop's resource manager and job scheduler.
• HBase – The Hadoop database.
• Pig – A high-level data-flow language and execution framework for parallel computation.
• Spark SQL – A module for structured and semi-structured data processing.
• Hive – A data warehouse infrastructure supporting data summarization, query, and analysis.
• Sqoop – A tool to move data from RDBMS to Hadoop.
• Flume – A service for moving log data into Hadoop.
Stack layers (ingestion, storage, data access / pipelines): Flume brings in unstructured or semi-structured data and Sqoop brings in structured data; both land in HDFS (Hadoop Distributed File System) / HBase. On top sit the MapReduce framework, Apache Oozie (workflow), Hive (DW system), Pig Latin (data analysis), and Mahout (machine learning), with Zookeeper coordinating across all layers.
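For a feel of the data access layer, here is a minimal Spark SQL sketch (the HDFS path, file, and column names are assumptions for illustration, not from the slides): semi-structured JSON is read from distributed storage and queried with SQL.

# Read semi-structured data from (assumed) HDFS and query it via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datalake-demo").getOrCreate()
events = spark.read.json("hdfs:///raw/events.json")  # hypothetical HDFS path
events.createOrReplaceTempView("events")
daily = spark.sql(
    "SELECT date, COUNT(*) AS n FROM events GROUP BY date ORDER BY date"
)
daily.show()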
24. Nearly all Big Data / data lakes are based on the (open-source) Hadoop ecosystem; however, only enterprise Big Data platforms ensure professional management
Component – Description:
• Ambari – An open operational framework for provisioning, managing, and monitoring Apache Hadoop clusters.
• HDFS – The Hadoop Distributed File System.
• Zookeeper – A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
• Yarn – Hadoop's resource manager and job scheduler.
• HBase – The Hadoop database.
• Pig – A high-level data-flow language and execution framework for parallel computation.
• Spark SQL – A module for structured and semi-structured data processing.
• Hive – A data warehouse infrastructure supporting data summarization, query, and analysis.
• Sqoop – A tool to move data from RDBMS to Hadoop.
• Flume – A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into Hadoop.
• Kafka – A high-throughput, distributed, publish-subscribe messaging system.
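As a hedged sketch of Kafka's publish-subscribe model, using the kafka-python client (broker address, topic name, and payload are assumptions): a producer publishes messages to a topic, and any number of consumers read the stream independently.

# Publish and consume one message on an (assumed) local Kafka broker.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("turbine-logs", b'{"sensor": 42, "value": 0.73}')
producer.flush()

consumer = KafkaConsumer("turbine-logs", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)  # each subscriber reads the stream independently
    break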
25. Visualization tool examples for data scientists
(some practical tools/libraries; the purpose defines the tool)
Purpose – Example – Description:
• General purpose – Excel – Most often used by nearly everybody for visualization, due to its mighty capabilities and penetration.
• Web – D3.js + derivatives – A JavaScript library for manipulating documents based on data, with a focus on interactive data visualizations in web browsers.
• Rapid prototyping – Python (Matplotlib), R (Shiny) – Open-source programming languages with active community participation and quick results; must-have know-how for a data scientist.
• Professional visual exploration – Tableau, Qlik, MS Power BI – Professional interactive visualization tools focused on quick insights, with the goal of providing business intelligence (BI) for an enterprise.
Focus in stack: visualization.
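A minimal example of the rapid-prototyping category, using Matplotlib (the data values are made up for illustration):

# Plot a simple (illustrative) trend line.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
mentions = [120, 135, 180, 160]  # hypothetical trend counts

plt.plot(months, mentions, marker="o")
plt.title("Trend report (illustrative data)")
plt.xlabel("Month")
plt.ylabel("Mentions")
plt.show()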
26. Libraries / Algorithms / Programming / Tools
(some practical tools/libraries; the purpose defines the selection)
Purpose – Example – Description:
• General purpose – Excel – The world's most used tool for data processing / calculation purposes, with mighty capabilities (mostly not known).
• Statistics / machine learning – Python + R – The two most important languages for data science (many more exist).
• (Big) data processing – Spark + SQL – Query languages and stream/batch processing programming paradigms with easy access to managed big data (many more exist).
• Tool providers, statistics/ML – SAS, RapidMiner, KNIME, Matlab, … – Professional tools with the goal of providing packaged, maintained, and easily consumable analytics for professional and citizen data scientists.
Focus in stack: functional layer, data pipelines.
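To illustrate why Python is listed under statistics / machine learning, a few lines of scikit-learn (toy dataset, illustrative only) give a trained and evaluated model:

# Train and evaluate a simple classifier on a built-in toy dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")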