SlideShare uma empresa Scribd logo
1 de 55
Baixar para ler offline
Recent Developments in
Spark MLlib and Beyond
Xiangrui Meng
What is MLlib?
What is MLlib?
MLlib is an Apache Spark component focusing on
machine learning:!
• initial contribution from AMPLab, UC Berkeley
• shipped with Spark since version 0.8 (Sep 2013)
• 40 contributors
3
What is MLlib?
Algorithms:!
• classification: logistic regression, linear support vector machine
(SVM), naive Bayes, classification tree
• regression: generalized linear models (GLMs), regression tree
• collaborative filtering: alternating least squares (ALS), non-negative
matrix factorization (NMF)
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal
component analysis (PCA)
4
Why MLlib?
scikit-learn
Algorithms:!
• classification: SVM, nearest neighbors, random forest, …
• regression: support vector regression (SVR), ridge
regression, Lasso, logistic regression, …!
• clustering: k-means, spectral clustering, …
• decomposition: PCA, non-negative matrix factorization
(NMF), independent component analysis (ICA), …
6
It is built on Apache Spark, a fast and general engine
for large-scale data processing.
• Run programs up to 100x faster than 

Hadoop MapReduce in memory, or 

10x faster on disk.
• Write applications quickly in Java, Scala, or Python.
Why MLlib?
7
Spark philosophy
Make life easy and productive for data scientists:
• well documented, expressive APIs
• powerful domain specific libraries
• easy integration with storage systems
• … and caching to avoid data movement
8
Word count (Scala)
val counts = sc.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word, 1L))
.reduceByKey(_ + _)
9
Gradient descent
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.zeros(d)
for (i <- 1 to numIterations) {
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w.dot(p.x)) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= alpha * gradient
}
w w ↵ ·
nX
i=1
g(w; xi, yi)
10
K-means (scala)
// Load and parse the data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(‘ ').map(_.toDouble)).cache()
!
// Cluster the data into two classes using KMeans.
val clusters = KMeans.train(parsedData, 2, numIterations = 20)
!
// Compute the sum of squared errors.
val cost = clusters.computeCost(parsedData)
println("Sum of squared errors = " + cost)
11
K-means (python)
# Load and parse the data
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line:
array([float(x) for x in line.split(' ‘)])).cache()
!
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations = 10,
runs = 1, initialization_mode = "kmeans||")
!
# Evaluate clustering by computing the sum of squared errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
!
cost = parsedData.map(lambda point: error(point))
.reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))
12
Dimension reduction

+ k-means
// compute principal components
val points: RDD[Vector] = ...
val mat = RowMatrix(points)
val pc = mat.computePrincipalComponents(20)
!
// project points to a low-dimensional space
val projected = mat.multiply(pc).rows
!
// train a k-means model on the projected data
val model = KMeans.train(projected, 10)
13
Collaborative filtering
// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
case Array(user, item, rate) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
!
// Build the recommendation model using ALS
val numIterations = 20
val model = ALS.train(ratings, 1, 20, 0.01)
!
// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
(user, product)
}
val predictions = model.predict(usersProducts)
14
Why MLlib?
It ships with Spark as 

a standard component.
15
Community
Spark community
One of the largest open source projects in big data:!
• 190+ developers contributing (40 for MLlib)
• 50+ companies contributing
• 400+ discussions per month on the mailing list
17
Spark community
The most active project in the Hadoop ecosystem:
JIRA 30 day summary
Spark Map/Reduce YARN
created 296 30 100
resolved 203 28 69
18
Unification
Out for dinner?!
• Search for a restaurant and make a reservation.
• Start navigation.
• Food looks good? Take a photo and share.
20
Why smartphone?
Out for dinner?!
• Search for a restaurant and make a reservation. (Yellow Pages?)
• Start navigation. (GPS?)
• Food looks good? Take a photo and share. (Camera?)
21
Why MLlib?
A special-purpose device may be better at one
aspect than a general-purpose device. But the cost
of context switching is high:
• different languages or APIs
• different data formats
• different tuning tricks
22
Spark SQL + MLlib
// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingTable = sql("""
SELECT e.action,
u.age,
u.latitude,
u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
!
// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib.
val training = trainingTable.map { row =>
val features = Vectors.dense(row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
!
val model = SVMWithSGD.train(training)
23
Streaming + MLlib
// collect tweets using streaming
!
// train a k-means model
val model: KMmeansModel = ...
!
// apply model to filter tweets
val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0)))
val statuses = tweets.map(_.getText)
val filteredTweets =
statuses.filter(t => model.predict(featurize(t)) == clusterNumber)
!
// print tweets within this particular cluster
filteredTweets.print()
24
GraphX + MLlib
// assemble link graph
val graph = Graph(pages, links)
val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices
!
// load page labels (spam or not) and content features
val labelAndFeatures: RDD[(Long, (Double, Seq((Int, Double)))] = ...
val training: RDD[LabeledPoint] =
labelAndFeatures.join(pageRank).map {
case (id, ((label, features), pageRank)) =>
LabeledPoint(label, Vectors.sparse(features ++ (1000, pageRank))
}
!
// train a spam detector using logistic regression
val model = LogisticRegressionWithSGD.train(training)
25
Why MLlib?
• built on Spark’s lightweight yet powerful APIs
• Spark’s open source community
• seamless integration with Spark’s other components
26
• comparable to or even better than other libraries
specialized in large-scale machine learning
Why MLlib?
In MLlib’s implementations, we care about!
• scalability
• performance
• user-friendly documentation and APIs
• cost of maintenance
27
v1.0
What’s new in MLlib
• new user guide and code examples
• API stability
• sparse data support
• regression and classification tree
• distributed matrices
• tall-and-skinny PCA and SVD
• L-BFGS
• binary classification model evaluation
29
MLlib user guide
• v0.9 (link)
• a single giant page
• mixed code snippets in different languages
• v1.0 (link)
• The content is more organized.
• Tabs are used to separate code snippets in different
languages.
• Unified API docs.
30
Code examples
Code examples that can be used as templates to
build standalone applications are included under the
“examples/” folder with sample datasets:
• binary classification (linear SVM and logistic regression)
• decision tree
• k-means
• linear regression
• naive Bayes
• tall-and-skinny PCA and SVD
• collaborative filtering
31
Code examples
MovieLensALS: an example app for ALS on MovieLens data.
Usage: MovieLensALS [options] <input>
!
--rank <value>
rank, default: 10
--numIterations <value>
number of iterations, default: 20
--lambda <value>
lambda (smoothing constant), default: 1.0
--kryo
use Kryo serialization
--implicitPrefs
use implicit preference
<input>
input paths to a MovieLens dataset of ratings
32
API stability
• Following Spark core, MLlib is guaranteeing binary
compatibility for all 1.x releases on stable APIs.
• For changes in experimental and developer APIs,
we will provide migration guide between releases.
33
Sparse data support
“large-scale sparse problems”
Sparse datasets appear almost everywhere in the
world of big data, where the sparsity may come from
many sources, e.g.,
• feature transformation: 

one-hot encoding, interaction, and bucketing,
• large feature space: n-grams,
• missing data: rating matrix,
• low-rank structure: images and signals.
35
Sparsity is almost everywhere
The Netflix Prize:!
• number of users: 480,189
• number of movies: 17,770
• number of observed ratings: 100,480,507
• sparsity = 1.17%
36
Sparsity is almost everywhere
rcv1.binary (test)!
• number of examples: 677,399
• number of features: 47,236
• sparsity: 0.15%
• storage: 270GB (dense) or 600MB (sparse)
37
Exploiting sparsity
• In Spark v1.0, MLlib adds support for sparse input
data in Scala, Java, and Python.
• MLlib takes advantage of sparsity in both storage
and computation in
• linear methods (linear SVM, logistic regression, etc)
• naive Bayes,
• k-means,
• summary statistics.
38
Sparse representation of features
39
dense : 1. 0. 0. 0. 0. 0. 3.
sparse :
8
><
>:
size : 7
indices : 0 6
values : 1. 3.
Exploiting sparsity in k-means
Training set:!
• number of examples: 12 million
• number of features: 500
• sparsity: 10%
!
Not only did we save 40GB of storage by switching to the sparse
format, but we also received a 4x speedup.
dense sparse
storage 47GB 7GB
time 240s 58s
40
Implementation of sparse k-means
Algorithm:!
• For each point, find its closest center.



• Update cluster centers.
li = arg min
j
kxi cjk2
2
cj =
P
i,li=j xj
P
i,li=j 1
41
Implementation of sparse k-means
The points are usually sparse, but the centers are most likely to be
dense. Computing the distance takes O(d) time. So the time
complexity is O(n d k) per iteration. We don’t take any advantage of
sparsity on the running time. However, we have
kx ck2
2 = kxk2
2 + kck2
2 2hx, ci
Computing the inner product only needs non-zero elements. So we
can cache the norms of the points and of the centers, and then only
need the inner products to obtain the distances. This reduce the
running time to O(nnz k + d k) per iteration.
42
Exploiting sparsity in linear methods
The essential part of the computation in a gradient-
based method is computing the sum of gradients. For
linear methods, the gradient is sparse if the feature
vector is sparse. In our implementation, the total
computation cost is linear on the input sparsity.
43
Acknowledgement
For sparse data support, special thanks go to!
• David Hall (breeze),
• Sam Halliday (netlib-java),
• and many others who joined the discussion.
44
Decision tree
Decision tree
• a major feature introduced in MLlib v1.0
• supports both classification and regression
• supports both continuous features and categorical
features
• scales well on massive datasets
46
Decision tree
Come to to learn more:!
• Scalable distributed decision trees in Spark MLlib
• Manish Made, Hirakendu Das, 

Evan Sparks, Ameet Talwalkar
47
More features to explore
• distributed matrices
• tall-and-skinny PCA and SVD
• L-BFGS
• binary classification model evaluation
48
Beyond v1.0
Next release (v1.1)
Spark has a 3-month release cycle:!
• July 25: cut-off for new features
• August 1: QA period starts
• August 15+: RC’s and voting
50
MLlib roadmap for v1.1
• Standardize interfaces!
• problem / formulation / algorithm / parameter set / model
• towards building an ecosystem
• Train multiple models in parallel!
• large-scale vs. medium-scale
• Statistical analysis!
• more descriptive statistics
• more sampling and bootstrapping
• hypothesis testing
51
MLlib roadmap for v1.1
• Learning algorithms!
• Non-negative matrix factorization (NMF)
• Sparse SVD
• Multiclass support in decision tree
• Latent Dirichlet allocation (LDA)
• …
• Optimization algorithms!
• Alternating direction method of multipliers (ADMM)
• Accelerated gradient methods
52
Beyond v1.1?
What to expect from the community (and maybe you):!
• scalable implementations of well known and well
understood machine learning algorithms
• user-friendly documentation and consistent APIs
• better integration with Spark SQL, Streaming, and GraphX
• addressing practical machine learning pipelines
53
Contributors
Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai,
Evan Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden
Karau, Hossein Falaki, Jey Kottalam, Cheng Lian, Marek
Kolodziej, Mark Hamstra, Martin Jaggi, Martin Weindel, Matei
Zaharia, Nick Pentreath, Patrick Wendell, Prashant Sharma,
Reynold Xin, Reza Zadeh, Sandy Ryza, Sean Owen, Shivaram
Venkataraman, Tor Myklebust, Xiangrui Meng, Xinghao Pan,
Xusen Yin, Jerry Shao, Sandeep Singh, Ryan LeCompte
54
Interested?
• : http://spark-summit.org
• Website: http://spark.apache.org
• Tutorials: http://ampcamp.berkeley.edu
• Github: https://github.com/apache/spark
• Mailing lists: user@spark.apache.org

dev@spark.apache.org
55

Mais conteúdo relacionado

Mais procurados

Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindingsDmitriy Lyubimov
 
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Universitat Politècnica de Catalunya
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Sparksscdotopen
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalArvind Surve
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Antti Haapala
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with SparkMd. Mahedi Kaysar
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
H2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymH2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymSri Ambati
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahoutsscdotopen
 
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMachine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMark Lefevre, CQF
 
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 

Mais procurados (20)

Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindings
 
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Spark
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
 
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
H2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymH2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas Nykodym
 
Kx for wine tasting
Kx for wine tastingKx for wine tasting
Kx for wine tasting
 
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahout
 
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMachine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
 
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
 
Deep Learning for Computer Vision: Optimization (UPC 2016)
Deep Learning for Computer Vision: Optimization (UPC 2016)Deep Learning for Computer Vision: Optimization (UPC 2016)
Deep Learning for Computer Vision: Optimization (UPC 2016)
 

Destaque

Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengSpark Summit
 
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Spark Summit
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlibXiangrui Meng
 
Computational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local AdsComputational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local Adssoupsranjan
 
Exploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog RetrievalExploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog RetrievalRunwei Qiang
 
(2016 07-19) providing click predictions in real-time at scale
(2016 07-19) providing click predictions in real-time at scale(2016 07-19) providing click predictions in real-time at scale
(2016 07-19) providing click predictions in real-time at scaleLawrence Evans
 
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganMulti Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganSpark Summit
 
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu VatsBuilding a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu VatsSpark Summit
 
Presentación final
Presentación finalPresentación final
Presentación finaldocentecis
 
8 Truths About Exercising presented by Terry Febrey
8 Truths About Exercising presented by Terry Febrey8 Truths About Exercising presented by Terry Febrey
8 Truths About Exercising presented by Terry FebreyTerry Febrey
 
医用画像情報イントロダクション Ver.1 0_20160726
医用画像情報イントロダクション Ver.1 0_20160726医用画像情報イントロダクション Ver.1 0_20160726
医用画像情報イントロダクション Ver.1 0_20160726Tatsuaki Kobayashi
 
効果的なXPの導入を目的とした プラクティス間の相互作用の分析
効果的なXPの導入を目的とした プラクティス間の相互作用の分析効果的なXPの導入を目的とした プラクティス間の相互作用の分析
効果的なXPの導入を目的とした プラクティス間の相互作用の分析Makoto SAKAI
 
Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy
Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy
Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy Ogilvy Consulting
 
PEDIDO DE PROVIDÊNCIA 814
PEDIDO DE PROVIDÊNCIA 814PEDIDO DE PROVIDÊNCIA 814
PEDIDO DE PROVIDÊNCIA 814vereadoreduardo
 
The sps code of conduct 2011
The sps code of conduct 2011The sps code of conduct 2011
The sps code of conduct 2011bambangsaja
 
Information wants to be free
Information wants to be freeInformation wants to be free
Information wants to be freeEric Tachibana
 
Recommendation Letter - Xiuting
Recommendation Letter - XiutingRecommendation Letter - Xiuting
Recommendation Letter - XiutingXiuting Hao
 
Excel dad6 8
Excel dad6 8Excel dad6 8
Excel dad6 8daalt209
 

Destaque (20)

Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
 
Training
TrainingTraining
Training
 
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
 
Computational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local AdsComputational Advertising in Yelp Local Ads
Computational Advertising in Yelp Local Ads
 
Exploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog RetrievalExploiting Ranking Factorization Machines for Microblog Retrieval
Exploiting Ranking Factorization Machines for Microblog Retrieval
 
(2016 07-19) providing click predictions in real-time at scale
(2016 07-19) providing click predictions in real-time at scale(2016 07-19) providing click predictions in real-time at scale
(2016 07-19) providing click predictions in real-time at scale
 
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganMulti Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
 
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu VatsBuilding a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
 
Presentación final
Presentación finalPresentación final
Presentación final
 
8 Truths About Exercising presented by Terry Febrey
8 Truths About Exercising presented by Terry Febrey8 Truths About Exercising presented by Terry Febrey
8 Truths About Exercising presented by Terry Febrey
 
医用画像情報イントロダクション Ver.1 0_20160726
医用画像情報イントロダクション Ver.1 0_20160726医用画像情報イントロダクション Ver.1 0_20160726
医用画像情報イントロダクション Ver.1 0_20160726
 
効果的なXPの導入を目的とした プラクティス間の相互作用の分析
効果的なXPの導入を目的とした プラクティス間の相互作用の分析効果的なXPの導入を目的とした プラクティス間の相互作用の分析
効果的なXPの導入を目的とした プラクティス間の相互作用の分析
 
Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy
Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy
Day 1 Reflection at #SXSW 2013 -- #SXSWOgilvy
 
PEDIDO DE PROVIDÊNCIA 814
PEDIDO DE PROVIDÊNCIA 814PEDIDO DE PROVIDÊNCIA 814
PEDIDO DE PROVIDÊNCIA 814
 
8i standby
8i standby8i standby
8i standby
 
The sps code of conduct 2011
The sps code of conduct 2011The sps code of conduct 2011
The sps code of conduct 2011
 
Information wants to be free
Information wants to be freeInformation wants to be free
Information wants to be free
 
Recommendation Letter - Xiuting
Recommendation Letter - XiutingRecommendation Letter - Xiuting
Recommendation Letter - Xiuting
 
Excel dad6 8
Excel dad6 8Excel dad6 8
Excel dad6 8
 

Semelhante a Recent Developments in Spark MLlib and Beyond

Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondDataWorks Summit
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Libraryjeykottalam
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemAdarsh Pannu
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsDatabricks
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streamingAdam Doyle
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015cdmaxime
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Spark Summit
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Sparkfelixcss
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache SparkSimon Lia-Jonassen
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedChao Chen
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 

Semelhante a Recent Developments in Spark MLlib and Beyond (20)

Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui Meng
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating System
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Spark
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 

Último

JS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIJS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIIvo Andreev
 
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.Sharon Liu
 
Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesSoftwareMill
 
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmonyelliciumsolutionspun
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorShane Coughlan
 
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfTobias Schneck
 
Webinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.pptWebinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.pptkinjal48
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilVICTOR MAESTRE RAMIREZ
 
eAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionseAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionsNirav Modi
 
Introduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntroduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntelliSource Technologies
 
AI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyAI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyRaymond Okyere-Forson
 
Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024Mind IT Systems
 
How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?AmeliaSmith90
 
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Jaydeep Chhasatia
 
Why Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdfWhy Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdfBrain Inventory
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
 
Fields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxFields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxJoão Esperancinha
 
Enterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze IncEnterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze Incrobinwilliams8624
 
IA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeIA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeNeo4j
 
ERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxAutus Cyber Tech
 

Último (20)

JS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIJS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AI
 
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
 
Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retries
 
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS Calculator
 
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
 
Webinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.pptWebinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.ppt
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-Council
 
eAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionseAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspections
 
Introduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntroduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptx
 
AI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyAI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human Beauty
 
Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024
 
How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?
 
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
 
Why Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdfWhy Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdf
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Fields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxFields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptx
 
Enterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze IncEnterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze Inc
 
IA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeIA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG time
 
ERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptx
 

Recent Developments in Spark MLlib and Beyond

  • 1. Recent Developments in Spark MLlib and Beyond Xiangrui Meng
  • 3. What is MLlib? MLlib is an Apache Spark component focusing on machine learning:! • initial contribution from AMPLab, UC Berkeley • shipped with Spark since version 0.8 (Sep 2013) • 40 contributors 3
  • 4. What is MLlib? Algorithms:! • classification: logistic regression, linear support vector machine (SVM), naive Bayes, classification tree • regression: generalized linear models (GLMs), regression tree • collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF) • clustering: k-means • decomposition: singular value decomposition (SVD), principal component analysis (PCA) 4
  • 6. scikit-learn Algorithms:! • classification: SVM, nearest neighbors, random forest, … • regression: support vector regression (SVR), ridge regression, Lasso, logistic regression, …! • clustering: k-means, spectral clustering, … • decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), … 6
  • 7. It is built on Apache Spark, a fast and general engine for large-scale data processing. • Run programs up to 100x faster than 
 Hadoop MapReduce in memory, or 
 10x faster on disk. • Write applications quickly in Java, Scala, or Python. Why MLlib? 7
  • 8. Spark philosophy Make life easy and productive for data scientists: • well documented, expressive APIs • powerful domain specific libraries • easy integration with storage systems • … and caching to avoid data movement 8
  • 9. Word count (Scala) val counts = sc.textFile("hdfs://...") .flatMap(line => line.split(" ")) .map(word => (word, 1L)) .reduceByKey(_ + _) 9
  • 10. Gradient descent val points = spark.textFile(...).map(parsePoint).cache() var w = Vector.zeros(d) for (i <- 1 to numIterations) {   val gradient = points.map { p =>     (1 / (1 + exp(-p.y * w.dot(p.x)) - 1) * p.y * p.x   ).reduce(_ + _)   w -= alpha * gradient } w w ↵ · nX i=1 g(w; xi, yi) 10
  • 11. K-means (scala) // Load and parse the data. val data = sc.textFile("kmeans_data.txt") val parsedData = data.map(_.split(‘ ').map(_.toDouble)).cache() ! // Cluster the data into two classes using KMeans. val clusters = KMeans.train(parsedData, 2, numIterations = 20) ! // Compute the sum of squared errors. val cost = clusters.computeCost(parsedData) println("Sum of squared errors = " + cost) 11
  • 12. K-means (python) # Load and parse the data data = sc.textFile("kmeans_data.txt") parsedData = data.map(lambda line: array([float(x) for x in line.split(' ‘)])).cache() ! # Build the model (cluster the data) clusters = KMeans.train(parsedData, 2, maxIterations = 10, runs = 1, initialization_mode = "kmeans||") ! # Evaluate clustering by computing the sum of squared errors def error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)])) ! cost = parsedData.map(lambda point: error(point)) .reduce(lambda x, y: x + y) print("Sum of squared error = " + str(cost)) 12
  • 13. Dimension reduction
 + k-means // compute principal components val points: RDD[Vector] = ... val mat = RowMatrix(points) val pc = mat.computePrincipalComponents(20) ! // project points to a low-dimensional space val projected = mat.multiply(pc).rows ! // train a k-means model on the projected data val model = KMeans.train(projected, 10) 13
  • 14. Collaborative filtering // Load and parse the data val data = sc.textFile("mllib/data/als/test.data") val ratings = data.map(_.split(',') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) }) ! // Build the recommendation model using ALS val numIterations = 20 val model = ALS.train(ratings, 1, 20, 0.01) ! // Evaluate the model on rating data val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) } val predictions = model.predict(usersProducts) 14
  • 15. Why MLlib? It ships with Spark as 
 a standard component. 15
  • 17. Spark community One of the largest open source projects in big data:! • 190+ developers contributing (40 for MLlib) • 50+ companies contributing • 400+ discussions per month on the mailing list 17
  • 18. Spark community The most active project in the Hadoop ecosystem: JIRA 30 day summary Spark Map/Reduce YARN created 296 30 100 resolved 203 28 69 18
  • 20. Out for dinner?! • Search for a restaurant and make a reservation. • Start navigation. • Food looks good? Take a photo and share. 20
  • 21. Why smartphone? Out for dinner?! • Search for a restaurant and make a reservation. (Yellow Pages?) • Start navigation. (GPS?) • Food looks good? Take a photo and share. (Camera?) 21
  • 22. Why MLlib? A special-purpose device may be better at one aspect than a general-purpose device. But the cost of context switching is high: • different languages or APIs • different data formats • different tuning tricks 22
  • 23. Spark SQL + MLlib // Data can easily be extracted from existing sources, // such as Apache Hive. val trainingTable = sql(""" SELECT e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userId = e.userId""") ! // Since `sql` returns an RDD, the results of the above // query can be easily used in MLlib. val training = trainingTable.map { row => val features = Vectors.dense(row(1), row(2), row(3)) LabeledPoint(row(0), features) } ! val model = SVMWithSGD.train(training) 23
  • 24. Streaming + MLlib // collect tweets using streaming ! // train a k-means model val model: KMmeansModel = ... ! // apply model to filter tweets val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0))) val statuses = tweets.map(_.getText) val filteredTweets = statuses.filter(t => model.predict(featurize(t)) == clusterNumber) ! // print tweets within this particular cluster filteredTweets.print() 24
  • 25. GraphX + MLlib // assemble link graph val graph = Graph(pages, links) val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices ! // load page labels (spam or not) and content features val labelAndFeatures: RDD[(Long, (Double, Seq((Int, Double)))] = ... val training: RDD[LabeledPoint] = labelAndFeatures.join(pageRank).map { case (id, ((label, features), pageRank)) => LabeledPoint(label, Vectors.sparse(features ++ (1000, pageRank)) } ! // train a spam detector using logistic regression val model = LogisticRegressionWithSGD.train(training) 25
  • 26. Why MLlib? • built on Spark’s lightweight yet powerful APIs • Spark’s open source community • seamless integration with Spark’s other components 26 • comparable to or even better than other libraries specialized in large-scale machine learning
  • 27. Why MLlib? In MLlib’s implementations, we care about! • scalability • performance • user-friendly documentation and APIs • cost of maintenance 27
  • 28. v1.0
  • 29. What’s new in MLlib • new user guide and code examples • API stability • sparse data support • regression and classification tree • distributed matrices • tall-and-skinny PCA and SVD • L-BFGS • binary classification model evaluation 29
  • 30. MLlib user guide • v0.9 (link) • a single giant page • mixed code snippets in different languages • v1.0 (link) • The content is more organized. • Tabs are used to separate code snippets in different languages. • Unified API docs. 30
  • 31. Code examples Code examples that can be used as templates to build standalone applications are included under the “examples/” folder with sample datasets: • binary classification (linear SVM and logistic regression) • decision tree • k-means • linear regression • naive Bayes • tall-and-skinny PCA and SVD • collaborative filtering 31
  • 32. Code examples MovieLensALS: an example app for ALS on MovieLens data. Usage: MovieLensALS [options] <input> ! --rank <value> rank, default: 10 --numIterations <value> number of iterations, default: 20 --lambda <value> lambda (smoothing constant), default: 1.0 --kryo use Kryo serialization --implicitPrefs use implicit preference <input> input paths to a MovieLens dataset of ratings 32
  • 33. API stability • Following Spark core, MLlib is guaranteeing binary compatibility for all 1.x releases on stable APIs. • For changes in experimental and developer APIs, we will provide migration guide between releases. 33
  • 35. “large-scale sparse problems” Sparse datasets appear almost everywhere in the world of big data, where the sparsity may come from many sources, e.g., • feature transformation: 
 one-hot encoding, interaction, and bucketing, • large feature space: n-grams, • missing data: rating matrix, • low-rank structure: images and signals. 35
  • 36. Sparsity is almost everywhere The Netflix Prize:! • number of users: 480,189 • number of movies: 17,770 • number of observed ratings: 100,480,507 • sparsity = 1.17% 36
  • 37. Sparsity is almost everywhere rcv1.binary (test)! • number of examples: 677,399 • number of features: 47,236 • sparsity: 0.15% • storage: 270GB (dense) or 600MB (sparse) 37
  • 38. Exploiting sparsity • In Spark v1.0, MLlib adds support for sparse input data in Scala, Java, and Python. • MLlib takes advantage of sparsity in both storage and computation in • linear methods (linear SVM, logistic regression, etc) • naive Bayes, • k-means, • summary statistics. 38
  • 39. Sparse representation of features 39 dense : 1. 0. 0. 0. 0. 0. 3. sparse : 8 >< >: size : 7 indices : 0 6 values : 1. 3.
  • 40. Exploiting sparsity in k-means Training set:! • number of examples: 12 million • number of features: 500 • sparsity: 10% ! Not only did we save 40GB of storage by switching to the sparse format, but we also received a 4x speedup. dense sparse storage 47GB 7GB time 240s 58s 40
  • 41. Implementation of sparse k-means Algorithm:! • For each point, find its closest center.
 
 • Update cluster centers. li = arg min j kxi cjk2 2 cj = P i,li=j xj P i,li=j 1 41
  • 42. Implementation of sparse k-means The points are usually sparse, but the centers are most likely to be dense. Computing the distance takes O(d) time. So the time complexity is O(n d k) per iteration. We don’t take any advantage of sparsity on the running time. However, we have kx ck2 2 = kxk2 2 + kck2 2 2hx, ci Computing the inner product only needs non-zero elements. So we can cache the norms of the points and of the centers, and then only need the inner products to obtain the distances. This reduce the running time to O(nnz k + d k) per iteration. 42
  • 43. Exploiting sparsity in linear methods The essential part of the computation in a gradient- based method is computing the sum of gradients. For linear methods, the gradient is sparse if the feature vector is sparse. In our implementation, the total computation cost is linear on the input sparsity. 43
  • 44. Acknowledgement For sparse data support, special thanks go to! • David Hall (breeze), • Sam Halliday (netlib-java), • and many others who joined the discussion. 44
  • 46. Decision tree • a major feature introduced in MLlib v1.0 • supports both classification and regression • supports both continuous features and categorical features • scales well on massive datasets 46
  • 47. Decision tree Come to to learn more:! • Scalable distributed decision trees in Spark MLlib • Manish Made, Hirakendu Das, 
 Evan Sparks, Ameet Talwalkar 47
  • 48. More features to explore • distributed matrices • tall-and-skinny PCA and SVD • L-BFGS • binary classification model evaluation 48
  • 50. Next release (v1.1) Spark has a 3-month release cycle:! • July 25: cut-off for new features • August 1: QA period starts • August 15+: RC’s and voting 50
  • 51. MLlib roadmap for v1.1 • Standardize interfaces! • problem / formulation / algorithm / parameter set / model • towards building an ecosystem • Train multiple models in parallel! • large-scale vs. medium-scale • Statistical analysis! • more descriptive statistics • more sampling and bootstrapping • hypothesis testing 51
  • 52. MLlib roadmap for v1.1 • Learning algorithms! • Non-negative matrix factorization (NMF) • Sparse SVD • Multiclass support in decision tree • Latent Dirichlet allocation (LDA) • … • Optimization algorithms! • Alternating direction method of multipliers (ADMM) • Accelerated gradient methods 52
  • 53. Beyond v1.1? What to expect from the community (and maybe you):! • scalable implementations of well known and well understood machine learning algorithms • user-friendly documentation and consistent APIs • better integration with Spark SQL, Streaming, and GraphX • addressing practical machine learning pipelines 53
  • 54. Contributors Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai, Evan Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden Karau, Hossein Falaki, Jey Kottalam, Cheng Lian, Marek Kolodziej, Mark Hamstra, Martin Jaggi, Martin Weindel, Matei Zaharia, Nick Pentreath, Patrick Wendell, Prashant Sharma, Reynold Xin, Reza Zadeh, Sandy Ryza, Sean Owen, Shivaram Venkataraman, Tor Myklebust, Xiangrui Meng, Xinghao Pan, Xusen Yin, Jerry Shao, Sandeep Singh, Ryan LeCompte 54
  • 55. Interested? • : http://spark-summit.org • Website: http://spark.apache.org • Tutorials: http://ampcamp.berkeley.edu • Github: https://github.com/apache/spark • Mailing lists: user@spark.apache.org
 dev@spark.apache.org 55