What’s New in the Berkeley Data Analytics Stack
1. What’s Next for the Berkeley Data Analytics Stack
UC BERKELEY
Michael Franklin
July 20 2015
Data Science Summit
SF
2. The Berkeley AMPLab
80+ Students, Postdocs, Faculty and Staff from:
Databases, Machine Learning, Systems, Security, and Networking
Mission Statement: Making Sense of Data at Scale by Integrating:
• Algorithms – Machine Learning, Statistical Methods,
• Machines – Cluster and Cloud Computing
• People – Crowdsourcing and Human Computation
Franklin, Jordan, Stoica, Patterson, Shenker, Recht, Katz, Joseph, Goldberg, Mahoney, Popa, Gonzalez
3. AMPLab: A Public/Private Partnership
NSF CISE Expedition Award:
Part of 2012 White House Big Data Initiative
DARPA XData Program
DoE/Lawrence Berkeley National Lab
And these Industrial Sponsors:
4. Berkeley Data Analytics Stack (Apache and BSD open source)
[Stack diagram, top layer to bottom layer]
In-House Applications: Cancer Genomics, Energy Debugging, Smart Buildings
Access and Interfaces: Spark Streaming, Shark, BlinkDB, GraphX, MLlib, MLBase, SparkR, SampleClean
Processing Engine: Spark
Storage: Tachyon; HDFS, S3, …
Resource Virtualization: Mesos, Yarn
Also shown: Velox Model Serving
6. AMPLab Unification Philosophy
Don’t specialize MapReduce – Generalize it!
Two additions to Hadoop MR can enable all the
models shown earlier!
1. General Task DAGs
2. Data Sharing
For Users:
Fewer Systems to Use
Less Data Movement
[Diagram: Spark Streaming, GraphX, SparkSQL, MLbase, …]
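The two additions named on this slide, general task DAGs and data sharing, can be illustrated with a toy sketch in plain Python. This is not Spark code: the `TaskGraph` class, its method names, and the example tasks are all hypothetical, chosen only to show that two jobs reading the same intermediate result trigger a single computation of it.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class TaskGraph:
    """Toy task DAG: each task is a function of its parents' outputs."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Tuple[Callable[..., object], List[str]]] = {}
        self.shared: Dict[str, object] = {}  # in-memory results shared across jobs
        self.runs: Dict[str, int] = {}       # how many times each task really ran

    def add(self, name: str, fn: Callable[..., object],
            deps: Sequence[str] = ()) -> None:
        self.tasks[name] = (fn, list(deps))

    def compute(self, name: str) -> object:
        # Data sharing: reuse a result that an earlier job already produced.
        if name in self.shared:
            return self.shared[name]
        fn, deps = self.tasks[name]
        result = fn(*[self.compute(dep) for dep in deps])
        self.runs[name] = self.runs.get(name, 0) + 1
        self.shared[name] = result
        return result


graph = TaskGraph()
graph.add("load", lambda: list(range(1, 11)))
graph.add("squares", lambda xs: [x * x for x in xs], deps=["load"])
graph.add("total", lambda sq: sum(sq), deps=["squares"])      # job 1
graph.add("largest", lambda sq: max(sq), deps=["squares"])    # job 2

total = graph.compute("total")
largest = graph.compute("largest")
```

Both jobs depend on `squares`, yet `graph.runs["squares"]` stays at 1 because the second job reads the shared in-memory result instead of recomputing it.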
7. In-Memory Dataflow System
M. Zaharia, M. Chowdhury, M. Franklin, I. Stoica, S. Shenker, “Spark: Cluster Computing
with Working Sets,” USENIX HotCloud, 2010.
• Developed in AMPLab and its predecessor the
RADLab
• Alternative to Hadoop MapReduce
• 10-100x speedup for ML and interactive queries
• Central component of the BDAS Stack
• “Graduated” to Apache Foundation -> Apache
Spark
17. Velox Model Serving System
Decompose personalized predictive models: [Crankshaw, Bailis, Gonzalez et al. CIDR’15]
• Split: Personalization Model (Online) and Feature Model (Batch)
• Feature Caching
• Approx. Features
• Online Updates
• Active Learning
Order-of-magnitude reductions in prediction latencies.
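One way to read the split on this slide is as a shared feature model, computed in batch and cached, combined with a small per-user model that is updated online. The sketch below is a toy under stated assumptions, not Velox's API: the linear form of the prediction, the gradient-step update, and every name (`ITEM_FACTORS`, `features`, `predict`, `observe`) are hypothetical and chosen for illustration.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

# Hypothetical stand-in for a feature model trained offline in batch.
ITEM_FACTORS: Dict[str, Tuple[float, ...]] = {
    "item_a": (1.0, 0.0),
    "item_b": (0.0, 1.0),
}


@lru_cache(maxsize=1024)
def features(item_id: str) -> Tuple[float, ...]:
    """Feature caching: repeated lookups for the same item hit the cache."""
    return ITEM_FACTORS[item_id]


# Per-user personalization weights, updated online.
user_weights: Dict[str, List[float]] = {}


def predict(user: str, item: str) -> float:
    weights = user_weights.setdefault(user, [0.0, 0.0])
    return sum(w * f for w, f in zip(weights, features(item)))


def observe(user: str, item: str, rating: float, lr: float = 0.5) -> None:
    """Online update: one gradient step on the squared error for this user."""
    feats = features(item)
    error = rating - predict(user, item)
    weights = user_weights[user]
    for i, f in enumerate(feats):
        weights[i] += lr * error * f


observe("alice", "item_a", 4.0)
observe("alice", "item_a", 4.0)
score = predict("alice", "item_a")
```

Only the small per-user weight vector changes at serving time; the feature model is left untouched between batch retraining runs, which is what makes the per-request work cheap in this sketch.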
18. BDAS: Latest Developments
[Stack diagram, top layer to bottom layer]
In-house Apps: Cancer Genomics, Energy Debugging, Smart Buildings
Access and Interfaces: Spark Streaming, SparkSQL, BlinkDB, GraphX, MLlib, MLBase, SampleClean, G-OLA, SparkR, MLPipelines
Processing Engine: Spark Core
Storage: Tachyon, Succinct; HDFS, S3, Ceph, …
Resource Virtualization: Mesos, Hadoop Yarn
Also shown: Velox, Splash
• MLPipelines -> KeystoneML
– Alpha release
– End-to-end pipelines in vision, speech, and NLP
– Horizontal scalability to 100s of machines and
multi-terabyte datasets
19. What is KeystoneML?
Software framework for building scalable end-to-end machine
learning pipelines.
Helps us explore how to build systems for robust, scalable, end-to-end
advanced analytics workloads and the patterns that emerge.
Example pipelines that achieve state-of-the-art results on large
scale datasets in computer vision, NLP, and speech - fast.
Previewed at AMP Camp 5 and on AMPLab Blog as “ML
Pipelines”
Public release last month! http://keystone-ml.org/
20. How does it fit with BDAS?
[Diagram]
Batch Model Training: KeystoneML, alongside MLlib, GraphX, and ml-matrix, on Spark
Real Time Serving: Velox Model Server
http://amplab.github.io/velox-modelserver
21. Example: Image Classification
Images (VOC2007)
Pipeline passed to .fit( ): Resize -> Grayscale -> SIFT -> PCA -> Fisher Vector -> Linear Regression -> MaxClassifier
Fitted pipeline: Resize -> Grayscale -> SIFT -> PCA Map -> Fisher Encoder -> Linear Model -> MaxClassifier
• Achieves performance of Chatfield et al., 2011
• Embarrassingly parallel featurization and evaluation
• 15 min on a modest cluster
• 5K examples, 40K features, 20 classes
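The slide shows estimator stages (PCA, Fisher Vector, Linear Regression) turning into fixed transformations once the pipeline is fit. A minimal sketch of that pattern in plain Python follows. It does not use KeystoneML's API: `Estimator`, `MeanCenter`, `Pipeline`, and the toy stages standing in for image operations are all hypothetical.

```python
from typing import Callable, List, Sequence, Union


class Estimator:
    """A stage that must see training data before it can transform anything."""

    def fit(self, data: Sequence[float]) -> Callable[[float], float]:
        raise NotImplementedError


class MeanCenter(Estimator):
    """Toy stand-in for a learned stage such as PCA: subtracts the training mean."""

    def fit(self, data: Sequence[float]) -> Callable[[float], float]:
        mean = sum(data) / len(data)
        return lambda x: x - mean


Stage = Union[Estimator, Callable[[float], float]]


class Pipeline:
    def __init__(self, stages: Sequence[Stage]) -> None:
        self.stages = list(stages)

    def fit(self, data: Sequence[float]) -> Callable[[float], float]:
        fitted: List[Callable[[float], float]] = []
        current = list(data)
        for stage in self.stages:
            # Estimators are fit on the data as transformed by earlier stages.
            fn = stage.fit(current) if isinstance(stage, Estimator) else stage
            fitted.append(fn)
            current = [fn(x) for x in current]

        def apply(x: float) -> float:
            for fn in fitted:
                x = fn(x)
            return x

        return apply


pipeline = Pipeline([
    abs,                           # fixed transformer (stand-in for Grayscale)
    MeanCenter(),                  # estimator, becomes a fixed map after fit
    lambda x: 1 if x > 0 else 0,   # fixed classifier (stand-in for MaxClassifier)
])
model = pipeline.fit([-2.0, 4.0, 6.0])
```

After fitting, `model` contains only fixed functions, so applying it to new inputs is independent per example, which matches the slide's point about embarrassingly parallel featurization and evaluation.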
23. Research Direction: Automatic Resource Estimation
Long, complicated pipelines.
» Just a composition of dataflows!
How long will this thing take to run?
When do I cache?
» Pose as a constrained optimization
problem.
Enables Efficient Hyperparameter Tuning
(ref. E. Sparks et al. “Automating Model Search for
Large Scale Machine Learning”, SOCC, Aug 2015)
[Pipeline diagram. Nodes: Resize, Grayscale, SIFT, PCA, Fisher Vector, LCS, PCA, Fisher Vector, Block Linear Solver, Weighted Block Linear Solver, Top 5 Classifier]
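The "When do I cache?" question can be posed as a small constrained optimization. The sketch below is a toy formulation, not the method of the cited paper: it assumes each node's compute time, output size, and number of downstream reads are known, ignores the cost of recomputing uncached ancestors, and searches all subsets by brute force. The node names and numbers are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, NamedTuple, Tuple


class Node(NamedTuple):
    seconds: float   # time to compute this node's output once
    size_gb: float   # memory needed to keep the output cached
    uses: int        # how many times downstream stages read the output


def total_time(nodes: Dict[str, Node], cached: FrozenSet[str]) -> float:
    # A cached node is computed once; an uncached node once per read.
    return sum(n.seconds * (1 if name in cached else n.uses)
               for name, n in nodes.items())


def best_cache_set(nodes: Dict[str, Node],
                   budget_gb: float) -> Tuple[FrozenSet[str], float]:
    """Minimize total time subject to the cached outputs fitting in memory."""
    names = list(nodes)
    best: FrozenSet[str] = frozenset()
    best_time = total_time(nodes, best)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if sum(nodes[n].size_gb for n in subset) > budget_gb:
                continue  # violates the memory constraint
            candidate = frozenset(subset)
            t = total_time(nodes, candidate)
            if t < best_time:
                best, best_time = candidate, t
    return best, best_time


# Hypothetical measurements for three pipeline stages.
nodes = {
    "sift": Node(seconds=100.0, size_gb=8.0, uses=3),
    "pca": Node(seconds=20.0, size_gb=2.0, uses=3),
    "fisher": Node(seconds=50.0, size_gb=6.0, uses=10),
}
best, best_time = best_cache_set(nodes, budget_gb=8.0)
```

With an 8 GB budget this toy picks the PCA and Fisher outputs over the single most expensive stage, since together they avoid more recomputation within the same memory. Brute force is only workable for a handful of nodes; a real system would need a scalable solver and measured or estimated costs.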
26. Summary
• AMPLab project
• Cross-disciplinary team, Industry engagement
• Open Source development and community
building
• BDAS philosophy: Unification
• Spark + SQL + Graphs + ML + …
• After graduating Mesos, Tachyon & Spark,
we are moving up the stack to support
declarative and real-time Machine
Learning and analytics.
27. To find out more or get involved:
amplab.berkeley.edu
franklin@berkeley.edu
Thanks to NSF CISE Expeditions in Computing, DARPA XData,
Founding Sponsors: Amazon Web Services, Google, IBM, and SAP,
the Thomas and Stacy Siebel Foundation,
all our industrial sponsors and partners, and all the members of the AMPLab Team.
Editor's Notes
Connect to political bias story
Spark batch analytics vs low-latency serving system