SlideShare uma empresa Scribd logo
1 de 39
MLlib: Scalable Machine Learning on Spark
Xiangrui Meng
1
Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao
Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi,
Joseph Gonzalez, Michael Franklin, Michael I. Jordan, Tim Kraska, etc.
What is MLlib?
2
What is MLlib?
MLlib is a Spark subproject providing machine learning
primitives:
• initial contribution from AMPLab, UC Berkeley
• shipped with Spark since version 0.8
• 33 contributors (March 2014)
3
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive Bayes
• regression: generalized linear regression (GLM)
• collaborative filtering: alternating least squares (ALS)
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal
component analysis (PCA)
4
Why MLlib?
5
scikit-learn?
Algorithms:
• classification: SVM, nearest neighbors, random forest, …
• regression: support vector regression (SVR), ridge regression,
Lasso, logistic regression, …
• clustering: k-means, spectral clustering, …
• decomposition: PCA, non-negative matrix factorization (NMF),
independent component analysis (ICA), …
6
Mahout?
Algorithms:
• classification: logistic regression, naive Bayes, random forest, …
• collaborative filtering: ALS, …
• clustering: k-means, fuzzy k-means, …
• decomposition: SVD, randomized SVD, …
7
Vowpal Wabbit?H2O?
R?
MATLAB?
Mahout?
Weka?
scikit-learn?
LIBLINEAR?
8
GraphLab?
Why MLlib?
9
• It is built on Apache Spark, a fast and general
engine for large-scale data processing.
• Run programs up to 100x faster than
Hadoop MapReduce in memory, or
10x faster on disk.
• Write applications quickly in Java, Scala, or Python.
Why MLlib?
10
Word count
val counts = sc.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word, 1L))
.reduceByKey(_ + _)
Gradient descent
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.zeros(d)
for (i <- 1 to numIterations) {
val gradient = points.map { p =>
(1 / (1 + exp(-p.y * w.dot(p.x)) - 1) * p.y * p.x
).reduce(_ + _)
w -= alpha * gradient
}
12
// Load and parse the data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(‘ ').map(_.toDouble)).cache()
// Cluster the data into two classes using KMeans.
val clusters = KMeans.train(parsedData, 2, numIterations = 20)
// Compute the sum of squared errors.
val cost = clusters.computeCost(parsedData)
println("Sum of squared errors = " + cost)
k-means (scala)
13
k-means (python)
# Load and parse the data
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line:
array([float(x) for x in line.split(' ‘)])).cache()
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations = 10,
runs = 1, initialization_mode = "kmeans||")
# Evaluate clustering by computing the sum of squared errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
cost = parsedData.map(lambda point: error(point))
.reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))
14
Dimension reduction
+ k-means
// compute principal components
val points: RDD[Vector] = ...
val mat = RowMatrix(points)
val pc = mat.computePrincipalComponents(20)
// project points to a low-dimensional space
val projected = mat.multiply(pc).rows
// train a k-means model on the projected data
val model = KMeans.train(projected, 10)
Why MLlib?
• It ships with Spark as
a standard component.
16
Spark community
One of the largest open source projects in big data:
• 170+ developers contributing
• 30+ companies contributing
• 400+ discussions per month on the mailing list
Out for dinner?
• Search for a restaurant and make a reservation.
• Start navigation.
• Food looks good? Take a photo and share.
18
Why smartphone?
Out for dinner?
• Search for a restaurant and make a reservation. (Yellow Pages?)
• Start navigation. (GPS?)
• Food looks good? Take a photo and share. (Camera?)
19
Why MLlib?
A special-purpose device may be better at one aspect
than a general-purpose device. But the cost of context
switching is high:
• different languages or APIs
• different data formats
• different tuning tricks
20
Spark SQL + MLlib
// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingTable = sql("""
SELECT e.action,
u.age,
u.latitude,
u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib.
val training = trainingTable.map { row =>
val features = Vectors.dense(row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model = SVMWithSGD.train(training)
Streaming + MLlib
// collect tweets using streaming
// train a k-means model
val model: KMmeansModel = ...
// apply model to filter tweets
val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0)))
val statuses = tweets.map(_.getText)
val filteredTweets =
statuses.filter(t => model.predict(featurize(t)) == clusterNumber)
// print tweets within this particular cluster
filteredTweets.print()
GraphX + MLlib
// assemble link graph
val graph = Graph(pages, links)
val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices
// load page labels (spam or not) and content features
val labelAndFeatures: RDD[(Long, (Double, Seq((Int, Double)))] = ...
val training: RDD[LabeledPoint] =
labelAndFeatures.join(pageRank).map {
case (id, ((label, features), pageRank)) =>
LabeledPoint(label, Vectors.sparse(features ++ (1000, pageRank))
}
// train a spam detector using logistic regression
val model = LogisticRegressionWithSGD.train(training)
Why MLlib?
• Built on Spark’s lightweight yet powerful APIs.
• Spark’s open source community.
• Seamless integration with Spark’s other components.
24
• Comparable to or even better than other libraries
specialized in large-scale machine learning.
Logistic regression
25
Logistic regression - weak scaling
• Full dataset: 200K images, 160K dense features.
• Similar weak scaling.
• MLlib within a factor of 2 of VW’s wall-clock time.
MLlib
MLlib
26
Logistic regression - strong scaling
• Fixed Dataset: 50K images, 160K dense features.
• MLlib exhibits better scaling properties.
• MLlib is faster than VW with 16 and 32 machines.
MLlib
MLlib
27
Collaborative filtering
28
Collaborative filtering
• Recover a rating matrix from a
subset of its entries.?
?
?
?
?
29
30
Alternating least squares
ALS - wall-clock time
• Dataset: scaled version of Netflix data (9X in size).
• Cluster: 9 machines.
• MLlib is an order of magnitude faster than Mahout.
• MLlib is within factor of 2 of GraphLab.
System Wall-clock time (seconds)
MATLAB 15443
Mahout 4206
GraphLab 291
MLlib 481
31
Implementation of ALS
• broadcast everything
• data parallel
• fully parallel
32
Broadcast everything
• Master loads (small)
data file and initializes
models.
• Master broadcasts
data and initial models.
• At each iteration,
updated models are
broadcast again.
• Works OK for small
data.
• Lots of communication
overhead - doesn’t
scale well.
• Ships with Spark
ExamplesMaster
Workers
Ratings
Movie
Factors
User
Factors
Data parallel
• Workers load data
• Master broadcasts
initial models
• At each iteration,
updated models are
broadcast again
• Much better scaling
• Works on large
datasets
• Works well for smaller
models. (low K)
Master
Workers
Ratings
Movie
Factors
User
Factors
Ratings
Ratings
Ratings
Fully parallel
• Workers load data
• Models are
instantiated at workers.
• At each iteration,
models are shared via
join between workers.
• Much better scalability.
• Works on large
datasets
• Works on big models
(higher K)
Master
Workers
RatingsMovie
FactorsUser
Factors
RatingsMovie
FactorsUser
Factors
RatingsMovie
FactorsUser
Factors
RatingsMovie
FactorsUser
Factors
Implementation of ALS
• broadcast everything
• data parallel
• fully parallel
• block-wise parallel
• Users/products are partitioned into blocks and join is
based on blocks instead of individual user/product.
36
New features in v1.0
• Sparse data support
• Classification and regression tree (CART)
• Tall-and-skinny SVD and PCA
• L-BFGS
• Binary classification evaluation
37
Contributors
Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai, Evan
Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden Karau, Hossein
Falaki, Jey Kottalam, Cheng Lian, Marek Kolodziej, Mark Hamstra,
Martin Jaggi, Martin Weindel, Matei Zaharia, Nick Pentreath, Patrick
Wendell, Prashant Sharma, Reynold Xin, Reza Zadeh, Sandy Ryza,
Sean Owen, Shivaram Venkataraman, Tor Myklebust, Xiangrui Meng,
Xinghao Pan, Xusen Yin, Jerry Shao, Ryan LeCompte
38
Interested?
• Website: http://spark.apache.org
• Tutorials: http://ampcamp.berkeley.edu
• Spark Summit: http://spark-summit.org
• Github: https://github.com/apache/spark
• Mailing lists: user@spark.apache.org
dev@spark.apache.org
39

Mais conteĂşdo relacionado

Mais procurados

CS267_Graph_Lab
CS267_Graph_LabCS267_Graph_Lab
CS267_Graph_Lab
JaideepKatkar
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Flink Forward
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
MLconf
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark Summit
 

Mais procurados (20)

Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache Spark
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
 
CS267_Graph_Lab
CS267_Graph_LabCS267_Graph_Lab
CS267_Graph_Lab
 
Om nom nom nom
Om nom nom nomOm nom nom nom
Om nom nom nom
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache Spark
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
 
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
 
Josh Patterson MLconf slides
Josh Patterson MLconf slidesJosh Patterson MLconf slides
Josh Patterson MLconf slides
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 

Destaque

Presentation2 2
Presentation2 2Presentation2 2
Presentation2 2
bobird
 
Cloud Based Digital Signage Framework
Cloud Based Digital Signage FrameworkCloud Based Digital Signage Framework
Cloud Based Digital Signage Framework
truDigital Signage
 
MLconf NYC Justin Basilico
MLconf NYC Justin BasilicoMLconf NYC Justin Basilico
MLconf NYC Justin Basilico
MLconf
 
Relatorio Global de Competitividade 2016
Relatorio Global de Competitividade 2016Relatorio Global de Competitividade 2016
Relatorio Global de Competitividade 2016
Fundação Dom Cabral - FDC
 
Why User Experience?
Why User Experience?Why User Experience?
Why User Experience?
Interakt
 
Vasona Networks @ Telco Vision 2013
Vasona Networks @ Telco Vision 2013Vasona Networks @ Telco Vision 2013
Vasona Networks @ Telco Vision 2013
vasonanetworks
 

Destaque (20)

Presentation2 2
Presentation2 2Presentation2 2
Presentation2 2
 
Cloud Based Digital Signage Framework
Cloud Based Digital Signage FrameworkCloud Based Digital Signage Framework
Cloud Based Digital Signage Framework
 
How to book_transfer_quick_engl
How to book_transfer_quick_englHow to book_transfer_quick_engl
How to book_transfer_quick_engl
 
2016.12.01 CP LaPrimaire.org crÊÊ 1er parti politique indÊpendant et ÊphÊmère
2016.12.01 CP LaPrimaire.org crÊÊ 1er parti politique indÊpendant et ÊphÊmère2016.12.01 CP LaPrimaire.org crÊÊ 1er parti politique indÊpendant et ÊphÊmère
2016.12.01 CP LaPrimaire.org crÊÊ 1er parti politique indÊpendant et ÊphÊmère
 
MLconf NYC Justin Basilico
MLconf NYC Justin BasilicoMLconf NYC Justin Basilico
MLconf NYC Justin Basilico
 
Aplicaçþes financeiras em períodos turbulentos
Aplicaçþes financeiras em períodos turbulentosAplicaçþes financeiras em períodos turbulentos
Aplicaçþes financeiras em períodos turbulentos
 
Bifiform in News & Social Media - Analytics Research
Bifiform in News & Social Media - Analytics ResearchBifiform in News & Social Media - Analytics Research
Bifiform in News & Social Media - Analytics Research
 
Inovação e excelência operacional características da empresa ambidestra
Inovação e excelência operacional características da empresa ambidestraInovação e excelência operacional características da empresa ambidestra
Inovação e excelência operacional características da empresa ambidestra
 
BigML Fall 2016 Release
BigML Fall 2016 ReleaseBigML Fall 2016 Release
BigML Fall 2016 Release
 
6 dicas de livros para uma nova liderança
6 dicas de livros para uma nova liderança6 dicas de livros para uma nova liderança
6 dicas de livros para uma nova liderança
 
Relatorio Global de Competitividade 2016
Relatorio Global de Competitividade 2016Relatorio Global de Competitividade 2016
Relatorio Global de Competitividade 2016
 
Gett, Uber & Yandex Taxi в социальных медиа
Gett, Uber & Yandex Taxi в социальных медиаGett, Uber & Yandex Taxi в социальных медиа
Gett, Uber & Yandex Taxi в социальных медиа
 
Why User Experience?
Why User Experience?Why User Experience?
Why User Experience?
 
Cyber safety 101
Cyber safety 101Cyber safety 101
Cyber safety 101
 
Vasona Networks @ Telco Vision 2013
Vasona Networks @ Telco Vision 2013Vasona Networks @ Telco Vision 2013
Vasona Networks @ Telco Vision 2013
 
From Quantified Self to Quantified Surgery
From Quantified Self to Quantified SurgeryFrom Quantified Self to Quantified Surgery
From Quantified Self to Quantified Surgery
 
GeoVantage Agriculture Imagery
GeoVantage Agriculture ImageryGeoVantage Agriculture Imagery
GeoVantage Agriculture Imagery
 
Inside3DPrinting_johnhornick
Inside3DPrinting_johnhornickInside3DPrinting_johnhornick
Inside3DPrinting_johnhornick
 
JSONModel Lightning Talk
JSONModel Lightning TalkJSONModel Lightning Talk
JSONModel Lightning Talk
 
How Ecosystem Economics™ Predicts the Winners in the Digital Age
How Ecosystem Economics™ Predicts the Winners in the Digital AgeHow Ecosystem Economics™ Predicts the Winners in the Digital Age
How Ecosystem Economics™ Predicts the Winners in the Digital Age
 

Semelhante a MLconf NYC Xiangrui Meng

Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
DataWorks Summit
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
Xiangrui Meng
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
Chao Chen
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
Cloudera, Inc.
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
Alexander Ulanov
 

Semelhante a MLconf NYC Xiangrui Meng (20)

Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
Device status anomaly detection
Device status anomaly detectionDevice status anomaly detection
Device status anomaly detection
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013
 
2014.06.24.what is ubix
2014.06.24.what is ubix2014.06.24.what is ubix
2014.06.24.what is ubix
 
maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 

Mais de MLconf

Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
MLconf
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
MLconf
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
MLconf
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
MLconf
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
MLconf
 

Mais de MLconf (20)

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious Experience
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the Cheap
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data Collection
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of ML
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

MLconf NYC Xiangrui Meng

  • 1. MLlib: Scalable Machine Learning on Spark Xiangrui Meng 1 Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi, Joseph Gonzalez, Michael Franklin, Michael I. Jordan, Tim Kraska, etc.
  • 3. What is MLlib? MLlib is a Spark subproject providing machine learning primitives: • initial contribution from AMPLab, UC Berkeley • shipped with Spark since version 0.8 • 33 contributors (March 2014) 3
  • 4. What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive Bayes • regression: generalized linear regression (GLM) • collaborative filtering: alternating least squares (ALS) • clustering: k-means • decomposition: singular value decomposition (SVD), principal component analysis (PCA) 4
  • 6. scikit-learn? Algorithms: • classification: SVM, nearest neighbors, random forest, … • regression: support vector regression (SVR), ridge regression, Lasso, logistic regression, … • clustering: k-means, spectral clustering, … • decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), … 6
  • 7. Mahout? Algorithms: • classification: logistic regression, naive Bayes, random forest, … • collaborative filtering: ALS, … • clustering: k-means, fuzzy k-means, … • decomposition: SVD, randomized SVD, … 7
  • 10. • It is built on Apache Spark, a fast and general engine for large-scale data processing. • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. • Write applications quickly in Java, Scala, or Python. Why MLlib? 10
  • 11. Word count val counts = sc.textFile("hdfs://...") .flatMap(line => line.split(" ")) .map(word => (word, 1L)) .reduceByKey(_ + _)
  • 12. Gradient descent val points = spark.textFile(...).map(parsePoint).cache() var w = Vector.zeros(d) for (i <- 1 to numIterations) { val gradient = points.map { p => (1 / (1 + exp(-p.y * w.dot(p.x)) - 1) * p.y * p.x ).reduce(_ + _) w -= alpha * gradient } 12
  • 13. // Load and parse the data. val data = sc.textFile("kmeans_data.txt") val parsedData = data.map(_.split(‘ ').map(_.toDouble)).cache() // Cluster the data into two classes using KMeans. val clusters = KMeans.train(parsedData, 2, numIterations = 20) // Compute the sum of squared errors. val cost = clusters.computeCost(parsedData) println("Sum of squared errors = " + cost) k-means (scala) 13
  • 14. k-means (python) # Load and parse the data data = sc.textFile("kmeans_data.txt") parsedData = data.map(lambda line: array([float(x) for x in line.split(' ‘)])).cache() # Build the model (cluster the data) clusters = KMeans.train(parsedData, 2, maxIterations = 10, runs = 1, initialization_mode = "kmeans||") # Evaluate clustering by computing the sum of squared errors def error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)])) cost = parsedData.map(lambda point: error(point)) .reduce(lambda x, y: x + y) print("Sum of squared error = " + str(cost)) 14
  • 15. Dimension reduction + k-means // compute principal components val points: RDD[Vector] = ... val mat = RowMatrix(points) val pc = mat.computePrincipalComponents(20) // project points to a low-dimensional space val projected = mat.multiply(pc).rows // train a k-means model on the projected data val model = KMeans.train(projected, 10)
  • 16. Why MLlib? • It ships with Spark as a standard component. 16
  • 17. Spark community One of the largest open source projects in big data: • 170+ developers contributing • 30+ companies contributing • 400+ discussions per month on the mailing list
  • 18. Out for dinner? • Search for a restaurant and make a reservation. • Start navigation. • Food looks good? Take a photo and share. 18
  • 19. Why smartphone? Out for dinner? • Search for a restaurant and make a reservation. (Yellow Pages?) • Start navigation. (GPS?) • Food looks good? Take a photo and share. (Camera?) 19
  • 20. Why MLlib? A special-purpose device may be better at one aspect than a general-purpose device. But the cost of context switching is high: • different languages or APIs • different data formats • different tuning tricks 20
  • 21. Spark SQL + MLlib // Data can easily be extracted from existing sources, // such as Apache Hive. val trainingTable = sql(""" SELECT e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userId = e.userId""") // Since `sql` returns an RDD, the results of the above // query can be easily used in MLlib. val training = trainingTable.map { row => val features = Vectors.dense(row(1), row(2), row(3)) LabeledPoint(row(0), features) } val model = SVMWithSGD.train(training)
  • 22. Streaming + MLlib // collect tweets using streaming // train a k-means model val model: KMmeansModel = ... // apply model to filter tweets val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0))) val statuses = tweets.map(_.getText) val filteredTweets = statuses.filter(t => model.predict(featurize(t)) == clusterNumber) // print tweets within this particular cluster filteredTweets.print()
  • 23. GraphX + MLlib // assemble link graph val graph = Graph(pages, links) val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices // load page labels (spam or not) and content features val labelAndFeatures: RDD[(Long, (Double, Seq((Int, Double)))] = ... val training: RDD[LabeledPoint] = labelAndFeatures.join(pageRank).map { case (id, ((label, features), pageRank)) => LabeledPoint(label, Vectors.sparse(features ++ (1000, pageRank)) } // train a spam detector using logistic regression val model = LogisticRegressionWithSGD.train(training)
  • 24. Why MLlib? • Built on Spark’s lightweight yet powerful APIs. • Spark’s open source community. • Seamless integration with Spark’s other components. 24 • Comparable to or even better than other libraries specialized in large-scale machine learning.
  • 26. Logistic regression - weak scaling • Full dataset: 200K images, 160K dense features. • Similar weak scaling. • MLlib within a factor of 2 of VW’s wall-clock time. MLlib MLlib 26
  • 27. Logistic regression - strong scaling • Fixed Dataset: 50K images, 160K dense features. • MLlib exhibits better scaling properties. • MLlib is faster than VW with 16 and 32 machines. MLlib MLlib 27
  • 29. Collaborative filtering • Recover a rating matrix from a subset of its entries.? ? ? ? ? 29
  • 31. ALS - wall-clock time • Dataset: scaled version of Netflix data (9X in size). • Cluster: 9 machines. • MLlib is an order of magnitude faster than Mahout. • MLlib is within factor of 2 of GraphLab. System Wall-clock time (seconds) MATLAB 15443 Mahout 4206 GraphLab 291 MLlib 481 31
  • 32. Implementation of ALS • broadcast everything • data parallel • fully parallel 32
  • 33. Broadcast everything • Master loads (small) data file and initializes models. • Master broadcasts data and initial models. • At each iteration, updated models are broadcast again. • Works OK for small data. • Lots of communication overhead - doesn’t scale well. • Ships with Spark ExamplesMaster Workers Ratings Movie Factors User Factors
  • 34. Data parallel • Workers load data • Master broadcasts initial models • At each iteration, updated models are broadcast again • Much better scaling • Works on large datasets • Works well for smaller models. (low K) Master Workers Ratings Movie Factors User Factors Ratings Ratings Ratings
  • 35. Fully parallel • Workers load data • Models are instantiated at workers. • At each iteration, models are shared via join between workers. • Much better scalability. • Works on large datasets • Works on big models (higher K) Master Workers RatingsMovie FactorsUser Factors RatingsMovie FactorsUser Factors RatingsMovie FactorsUser Factors RatingsMovie FactorsUser Factors
  • 36. Implementation of ALS • broadcast everything • data parallel • fully parallel • block-wise parallel • Users/products are partitioned into blocks and join is based on blocks instead of individual user/product. 36
  • 37. New features in v1.0 • Sparse data support • Classification and regression tree (CART) • Tall-and-skinny SVD and PCA • L-BFGS • Binary classification evaluation 37
  • 38. Contributors Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai, Evan Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden Karau, Hossein Falaki, Jey Kottalam, Cheng Lian, Marek Kolodziej, Mark Hamstra, Martin Jaggi, Martin Weindel, Matei Zaharia, Nick Pentreath, Patrick Wendell, Prashant Sharma, Reynold Xin, Reza Zadeh, Sandy Ryza, Sean Owen, Shivaram Venkataraman, Tor Myklebust, Xiangrui Meng, Xinghao Pan, Xusen Yin, Jerry Shao, Ryan LeCompte 38
  • 39. Interested? • Website: http://spark.apache.org • Tutorials: http://ampcamp.berkeley.edu • Spark Summit: http://spark-summit.org • Github: https://github.com/apache/spark • Mailing lists: user@spark.apache.org dev@spark.apache.org 39