This document summarizes a presentation given by Michael Ger, Dr. Andreas Pawlik, and Dr. Seunghan Han of NorCom and Hortonworks about the DaSense data science platform. DaSense is designed to help researchers developing autonomous vehicle systems run simulations and test algorithms on large datasets more efficiently, using distributed high-performance computing resources. By moving work from a desktop workstation to a large compute cluster, it aims to shorten experiments that previously took a working day or longer to hours or minutes. DaSense provides tools for building end-to-end data science pipelines for tasks such as data filtering, model training, evaluation, and analysis.
7. NorCom Information Technology AG
NorCom at a Glance
NorCom IT AG: Munich, Nuremberg, San Jose; Est. '89, IPO '99
EAGLE: Document Based Collaboration for Large Enterprises
DaSense: Logistics, Analyses & Simulation for Sensor Test & Field Data
8. Data Tsunami is Coming
− Development
  − Few development locations worldwide
  − Some test vehicles (<100)
  − Raw camera sensor data
  − Parameter sweeps / training of neural networks
− Testing Phase
  − Many locations worldwide
  − Lots of test vehicles
  − Compressed data (video)
  − Verification
− Field
  − All around the world (with many regulators)
  − Hundreds of thousands of connected cars
  − Triggered data
  − Machine learning
Data rates:
− Next generation: 2 GB/s per vehicle
− Current generations: 350 MB/s per vehicle
− Connected cars: mainly mobile
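To put the per-vehicle data rates into perspective, the following minimal sketch converts them into a daily volume. The two rates come from the slide; the 8 hours of recording per day is an assumption made only for illustration.

```python
def daily_volume_tb(rate_bytes_per_s: float, hours_per_day: float) -> float:
    """Data recorded by one vehicle per day, in terabytes (1 TB = 1e12 bytes)."""
    return rate_bytes_per_s * hours_per_day * 3600 / 1e12


CURRENT_GEN = 350e6  # 350 MB/s per vehicle (from the slide)
NEXT_GEN = 2e9       # 2 GB/s per vehicle (from the slide)
HOURS = 8            # assumed recording time per day, not from the slide

for name, rate in [("current generations", CURRENT_GEN), ("next generation", NEXT_GEN)]:
    print(f"{name}: {daily_volume_tb(rate, HOURS):.1f} TB per vehicle per day")
# current generations: 10.1 TB per vehicle per day
# next generation: 57.6 TB per vehicle per day
```

Under this assumption a single next-generation test vehicle already produces tens of terabytes per day, which is the motivation for moving analysis to where the data is stored.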
10. Big Data is not a new Phenomenon
− 2003: Google Distributed File System paper, followed by the MapReduce paper
− 2004: Doug Cutting adds DFS and MapReduce to Nutch
− 2006: Hadoop is born from Nutch
− 2008: Facebook launches Hive
− 2009: Yahoo! uses Hadoop to sort one terabyte in 62 seconds
− 2010: Spark paper
− 2017: Apache Hadoop 3.0 (Innovation, Operation, Stability)
11. Leverage Big Data Technology
− DaSense Technology
  − Automotive Formats
  − Geo-Distributed Analyses
  − Engineer Self Service
− Enterprise Level Implementation
  − Security & Access Control
  − IT Process Integration
− Open-Source Hadoop Technology:
  − Scalable
  − Cost Effective
  − Flexible
  − Fast Access
  − Resilient
12. DaSense: Data Science Platform
Orchestration platform for data driven innovation
- Connect to multiple heterogeneous clusters
- Prepare interactive data analysis environments
- Build analytics apps
13. DaSense - An orchestration platform for data driven innovation
[Diagram labels: DaSense; Enterprise IT; Cross-cutting Concerns; Data Analytics; Big Data; Big Data Tools]
14. DaSense: Cross Distribution, Cross Architecture, Cross Countries
[Diagram labels: DaSense; On Premise; Standard Processing; Performance & GPU Processing; Cloud; Elastic Compute; Distributed Storage; Distributed Query]
15. DaSense: Processes for Department and Enterprise IT
[Diagram labels: Mission Critical; Data Science; Engineer; fast "one time" results; specify; PROD; TRAIN; DEV; Security/SLA; Agile/DevOps; Data Downsampling & Tokenisation; Approval; Department IT; Enterprise IT]
16. DaSense: Key Architectural Aspects
− Workspace & Codespace: Organize your code, analytic libraries and images in your desired environment
− Reporting & Collaboration: Stay connected with project members
− DaSense App: Package your analytics as DaSense App. Deploy and launch with ease
− Multi-User management & RBAC
− Multi-Cluster Job Orchestration Service
− Analytic Runtime Management Service
− Multi-Cluster Binding Service
− . . .
− On Premise / Cloud
[Diagram labels: DaSense Manages; DaSense Supports]
17. DaSense in Action
Let's DaSense
- Connect to multiple heterogeneous clusters
- Prepare interactive data analysis environments
- Build analytics apps
20. Use various analytic environment stacks
e.g. Work in "R"
Use your integrated development environment of choice
e.g. Work in "Python"
23. Example Use Cases
Durability
§ Fleet statistics
§ Combined analysis of logs and traces
Event Search
§ Event search in traces
§ Cut-Outs/Snippets
§ Root-Cause Analysis Diagnostic Data
Simulation/Optimization
ECU integration
§ Network analysis
§ CAN, FlexRay
Root Cause Analysis
§ Particle Emission
PowerTrain
§ Gear shift quality
Anomaly detection
§ Deviation from reference
Image Processing
§ Similarity Search
§ Classification
§ Automated annotation
… and many more
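As an illustration of the "deviation from reference" use case, here is a minimal sketch that flags the samples of a measured trace that differ from a reference trace by more than a tolerance. The function name, the sample values, and the fixed tolerance are assumptions for illustration; the slides do not describe how DaSense implements anomaly detection.

```python
def deviations(signal, reference, tolerance):
    """Return the indices where |signal - reference| exceeds the tolerance."""
    if len(signal) != len(reference):
        raise ValueError("signal and reference must have the same length")
    return [
        i
        for i, (s, r) in enumerate(zip(signal, reference))
        if abs(s - r) > tolerance
    ]


# Hypothetical traces, e.g. a sensor value sampled at fixed intervals.
reference = [0.0, 1.0, 2.0, 3.0, 4.0]
measured = [0.1, 1.0, 2.9, 3.1, 4.0]

print(deviations(measured, reference, tolerance=0.5))  # [2]
```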
24. Use Case: Simulation/Optimization
Development and Validation of Autonomous Driving Algorithms
- Schedule simulations on a Hadoop Cluster
- Integrate GPUs for Deep Learning
- Build end-to-end Data Science pipelines
25. Test-driven algorithm development => iterative process
algorithm development => Algorithm v0.1 => testing on the data => algorithm development => Algorithm v0.2 => …
Apply: Execute algorithm on real world big data
software in the loop, "virtual test drive"
https://www.cityscapes-dataset.com/examples/#videos
26. Local Development
[Diagram labels: New algorithms; Parameter sets; C/C++; GPU]
Limited to local workstation CPU
Limited to small local data set
Limited ingest of new data
Data transfer through office networks very slow
=> Very slow development cycle
=> Limited test coverage
27. Big Data Cluster Development
[Diagram labels: New algorithms; Parameter sets; C/C++; GPU]
Fast datacenter-grade Network
Selectable Access to all Data
Speed only Limited by Cluster Size
28. Implementation Step 1: Containerize Algorithms
DaSense container service: put algorithm inside a Docker container
=> make simulations portable and reproducible
=> facilitate resource sharing through resource isolation
- Run non-native Hadoop code on Hadoop
- light-weight (= fast)
- resource isolation
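The containerization step can be sketched as follows. This is not the DaSense container service, whose interface the slides do not show; it is only a small helper that assembles a standard `docker run` command line with CPU and memory limits (resource isolation) and a read-only data mount. The image name, the paths, and the algorithm arguments `--threshold` and `--window` are hypothetical.

```python
import shlex


def docker_run_command(image, data_dir, params, cpus=2, memory="4g"):
    """Build the argv list for running one containerized simulation."""
    cmd = [
        "docker", "run", "--rm",            # remove the container when it exits
        "--cpus", str(cpus),                # CPU limit for this container
        "--memory", memory,                 # memory limit for this container
        "--volume", f"{data_dir}:/data:ro", # mount the input data read-only
        image,
    ]
    # Everything after the image name is passed to the containerized algorithm.
    for key, value in sorted(params.items()):
        cmd += [f"--{key}", str(value)]
    return cmd


cmd = docker_run_command(
    "example/lane-detector:0.1",   # hypothetical image
    "/mnt/recordings/drive_001",   # hypothetical data directory
    {"threshold": 0.4, "window": 5},
)
print(shlex.join(cmd))
```

Because the image pins the algorithm together with its dependencies, the same command should give the same result on a workstation and on a cluster node, which is what makes the simulations portable and reproducible.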
29. Implementation Step 2: Execute in the Hadoop cluster
DaSense CPU/GPU scheduler: organize jobs in the Hadoop cluster
https://www.cityscapes-dataset.com/examples/#videos
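The idea of organizing CPU and GPU jobs on shared nodes can be illustrated with a toy first-fit scheduler. This is an assumption-laden sketch and not the DaSense scheduler or the Hadoop resource manager: the node and job names, the resource counts, and the first-fit policy are all chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpus: int
    free_gpus: int
    jobs: list = field(default_factory=list)


@dataclass
class Job:
    name: str
    cpus: int
    gpus: int = 0


def schedule(jobs, nodes):
    """Place each job on the first node with enough free CPUs and GPUs.

    Returns the names of the jobs that could not be placed.
    """
    pending = []
    for job in jobs:
        for node in nodes:
            if node.free_cpus >= job.cpus and node.free_gpus >= job.gpus:
                node.free_cpus -= job.cpus
                node.free_gpus -= job.gpus
                node.jobs.append(job.name)
                break
        else:  # no node had enough free resources
            pending.append(job.name)
    return pending


nodes = [Node("cpu-node", free_cpus=8, free_gpus=0),
         Node("gpu-node", free_cpus=4, free_gpus=1)]
jobs = [Job("simulate-a", cpus=4), Job("train-net", cpus=2, gpus=1),
        Job("simulate-b", cpus=4), Job("train-net-2", cpus=2, gpus=1)]

pending = schedule(jobs, nodes)
print({n.name: n.jobs for n in nodes}, "pending:", pending)
```

In this example the second training job stays pending because the only GPU is already taken; a real scheduler would queue it until resources are released.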
30. Results
[Bar chart: Time [minutes] for Sequential run (desktop) vs. Parallel run (cluster)]
Moving algorithm development from a desktop workstation into a cluster:
• using a Hadoop cluster with 6 nodes
• running 50 evaluation configurations (e.g. 5 datasets, 10 parameter sets)
=> Experiments that used to take a working day to run (7 hours) can now be run "within a coffee break" (17 minutes)
One cycle per working day => one cycle per coffee break
Based on work published in Proceedings of the 25th Aachen Kolloquium, Abthoff et al., 2016
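The structure of such an experiment can be sketched as a parameter sweep: every dataset is combined with every parameter set, and the resulting independent configurations are evaluated in parallel. In the sketch below a local thread pool stands in for the cluster, and the dataset names, the parameter values, and the `evaluate` function are placeholders. Only the counts (5 datasets, 10 parameter sets) and the timings (7 hours, 17 minutes) are taken from the slide.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

datasets = [f"dataset_{i}" for i in range(5)]                # hypothetical names
param_sets = [{"threshold": 0.1 * k} for k in range(1, 11)]  # hypothetical values


def evaluate(config):
    """Placeholder for one evaluation run of the algorithm."""
    dataset, params = config
    return dataset, params["threshold"]


# 5 datasets x 10 parameter sets = 50 independent configurations.
configs = list(product(datasets, param_sets))

# A local pool stands in for the cluster; the configurations do not
# depend on each other, so they can run side by side.
with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(evaluate, configs))

speedup = (7 * 60) / 17  # sequential minutes / parallel minutes, from the slide
print(f"{len(results)} configurations, speedup about {speedup:.1f}x")
# 50 configurations, speedup about 24.7x
```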
31. Build End-to-End Data Science Pipelines
algorithm development => Algorithm v0.1 => testing on the data => algorithm development => Algorithm v0.2 => …
Search: select test data
e.g. "select all data of crossings in bad weather"
Apply: Execute algorithm on real world big data
Filter & Summarize: process results & compile statistics
e.g. "find all cases where this algorithm missed a stop sign and count how often this happened"
https://www.cityscapes-dataset.com/examples/#videos
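The three pipeline stages (Search, Apply, Filter & Summarize) can be sketched on a handful of in-memory records. Everything here is hypothetical: the `Frame` fields, the sample data, and the stand-in detector exist only to show how the stages connect, and they do not reflect the DaSense data model or any real algorithm.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    frame_id: int
    scene: str         # e.g. "crossing", "highway"
    weather: str       # e.g. "rain", "clear"
    stop_sign: bool    # ground-truth annotation
    brightness: float  # toy feature used by the stand-in detector


def search(frames):
    """Search: select all data of crossings in bad weather."""
    bad_weather = {"rain", "fog", "snow"}
    return [f for f in frames if f.scene == "crossing" and f.weather in bad_weather]


def detect_stop_sign(frame):
    """Stand-in for the algorithm under test: fires only on bright frames."""
    return frame.brightness > 0.5


def apply(frames):
    """Apply: execute the algorithm on the selected data."""
    return [(f, detect_stop_sign(f)) for f in frames]


def summarize(results):
    """Filter & Summarize: find missed stop signs and count them."""
    missed = [f.frame_id for f, detected in results if f.stop_sign and not detected]
    return {"evaluated": len(results), "missed_stop_signs": len(missed), "missed_ids": missed}


frames = [
    Frame(1, "crossing", "rain", True, 0.8),
    Frame(2, "crossing", "fog", True, 0.3),
    Frame(3, "highway", "rain", False, 0.6),
    Frame(4, "crossing", "clear", True, 0.2),
    Frame(5, "crossing", "snow", False, 0.4),
]

report = summarize(apply(search(frames)))
print(report)
# {'evaluated': 3, 'missed_stop_signs': 1, 'missed_ids': [2]}
```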
32. Summary
− FASTER
Tasks that used to take a day can be computed "in a coffee break"
− BETTER
Test coverage for verification purposes is much higher
Shorter iterations = better algorithms
− CHEAPER
Shared cluster CPUs/GPUs lead to higher hardware utilization
Storage costs a fraction of traditional solutions
Storage: Raw data; Indices; Processed data; Trained models
Search / Analysis: Queries; Statistical analysis; Report generation
Simulation / Evaluation: Parameter optimization; Model performance tests
Machine learning: Classification; Clustering; Deep learning
[Diagram labels: Data Ingest; Algorithms; Queries; Evaluation; Reports]