1. About Me
• Postdoc in AMPLab
• Led initial development of MLlib
• Technical Advisor for Databricks
• Assistant Professor at UCLA
• Research interests include scalability and ease-of-use issues in statistical machine learning
4. MLbase and MLlib
MLbase aims to simplify development and deployment of scalable ML pipelines.
The MLbase stack, from top to bottom:
• MLOpt (experimental): declarative layer to automate hyperparameter tuning
• MLI, Pipelines (testbeds): APIs to simplify ML development (tables, matrices, optimization, ML pipelines)
• MLlib (production): Spark's core ML library
• Apache Spark
5. History of MLlib
Initial Release
• Developed by MLbase team in AMPLab (11 contributors)
• Scala, Java
• Shipped with Spark v0.8 (Sep 2013)
15 months later…
• 80+ contributors from various organizations
• Scala, Java, Python
• Latest release part of Spark v1.1 (Sep 2014)
6. What’s in MLlib?
Collaborative Filtering for Recommendation
• Alternating Least Squares
Prediction
• Lasso
• Ridge Regression
• Logistic Regression
• Decision Trees
• Naïve Bayes
• Support Vector Machines
Clustering
• K-Means
Optimization
• Gradient Descent
• L-BFGS
Many Utilities
• Random data generation
• Linear algebra
• Feature transformations
• Statistics: testing, correlation
• Evaluation metrics
7. Benefits of MLlib
• Part of Spark
• Integrated data analysis workflow
• Free performance gains
Apache Spark stack: Spark SQL, Spark Streaming, MLlib, GraphX
8. Benefits of MLlib
• Part of Spark
• Integrated data analysis workflow
• Free performance gains
• Scalable
• Python, Scala, Java APIs
• Broad coverage of applications & algorithms
• Rapid improvements in speed & robustness
9. History and Overview
Example Applications
Use Cases
Distributed ML Challenges
Code Examples
Ongoing Development
15. Clustering with K-Means
Data distributed by instance (point/row)
Smart initialization
Limited communication
(# clusters << # instances)
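The "limited communication" point can be sketched in plain Python (a hypothetical standalone toy, not MLlib's implementation): each partition ships only k partial sums and counts to the driver, regardless of how many points it holds.

```python
# One distributed k-means iteration, simulated locally.
# Communication per partition is O(k), not O(# points in the partition).

points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
partitions = [points[:3], points[3:]]          # data distributed by row
centers = [[0.0, 0.0], [5.0, 5.0]]             # k = 2 current centers

def nearest(p):
    # Index of the closest center to point p (squared Euclidean distance).
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

# Map side: each partition computes partial sums and counts per center.
partials = []
for part in partitions:
    sums = [[0.0, 0.0] for _ in centers]
    counts = [0] * len(centers)
    for p in part:
        c = nearest(p)
        counts[c] += 1
        sums[c] = [s + x for s, x in zip(sums[c], p)]
    partials.append((sums, counts))

# Reduce side: combine the k partials from every partition, recompute centers.
tot_sums = [[0.0, 0.0] for _ in centers]
tot_counts = [0] * len(centers)
for sums, counts in partials:
    for c in range(len(centers)):
        tot_counts[c] += counts[c]
        tot_sums[c] = [a + b for a, b in zip(tot_sums[c], sums[c])]
centers = [[s / n for s in sums] for sums, n in zip(tot_sums, tot_counts)]
```

Only the (sum, count) pairs cross partition boundaries, which is why k << # instances makes the algorithm scale.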
16. K-Means: Scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map { x =>
  Vectors.dense(x.split(' ').map(_.toDouble))
}.cache()

// Cluster data into 5 classes using KMeans.
val clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

// Evaluate clustering error.
val cost = clusters.computeCost(parsedData)
println("Sum of squared errors = " + cost)
17. K-Means: Python
from numpy import array
from pyspark.mllib.clustering import KMeans

# Load and parse data.
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(
    lambda line: array([float(x) for x in line.split(' ')])).cache()

# Cluster data into 5 classes using KMeans.
clusters = KMeans.train(parsedData, k=5, maxIterations=20)

# Evaluate clustering error: squared distance to the nearest center.
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sum([x ** 2 for x in (point - center)])

cost = (parsedData.map(lambda point: error(point))
                  .reduce(lambda x, y: x + y))
print("Sum of squared errors = " + str(cost))
19. Recommendation
Collaborative filtering
Goal: Recommend movies to users
Challenges:
• Defining similarity
• Dimensionality: millions of users / items
• Sparsity
20. Recommendation
Solution: Assume ratings are determined by a small number of latent factors,
so the ratings matrix factors as R ≈ U × M.
25M users, 100K movies ⇒ 2.5 trillion potential ratings
With 10 factors/user ⇒ 250M parameters
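The dimensionality argument works out as a quick back-of-the-envelope check (the slide's 250M counts the user factors; including the movie factors adds another 1M):

```python
users = 25_000_000
movies = 100_000
factors = 10

# Full ratings matrix: one entry per (user, movie) pair.
potential_ratings = users * movies            # 2.5 trillion

# Factored model: one length-10 vector per user and per movie.
parameters = factors * (users + movies)       # ~251 million
```

The factored model is about four orders of magnitude smaller than the full matrix, which is what makes it fittable from sparse data.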
22. Recommendation with Alternating Least Squares (ALS)
Algorithm: alternately update the user factors and the movie factors (R ≈ U × M)
Factors can be updated in parallel
Must be careful about communication
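The alternating update can be illustrated with a hypothetical pure-Python toy: rank-1 factorization of a small, fully observed matrix, without the regularization, missing-entry handling, and distribution that MLlib's ALS provides. With one set of factors held fixed, each factor on the other side has a closed-form least-squares solution, and the reconstruction error can only decrease.

```python
# Toy rank-1 alternating least squares on a fully observed 3x3 ratings matrix.
R = [[5.0, 4.0, 1.0],
     [4.0, 5.0, 1.0],
     [1.0, 1.0, 5.0]]

n_users, n_items = len(R), len(R[0])
u = [1.0] * n_users   # user factors
v = [1.0] * n_items   # item factors

def sse():
    # Sum of squared reconstruction errors for the current factors.
    return sum((R[i][j] - u[i] * v[j]) ** 2
               for i in range(n_users) for j in range(n_items))

err_before = sse()
for _ in range(20):
    # Fix item factors, solve for each user factor in closed form:
    # u_i = sum_j R_ij v_j / sum_j v_j^2
    for i in range(n_users):
        u[i] = sum(R[i][j] * v[j] for j in range(n_items)) / sum(x * x for x in v)
    # Fix user factors, solve for each item factor symmetrically.
    for j in range(n_items):
        v[j] = sum(R[i][j] * u[i] for i in range(n_users)) / sum(x * x for x in u)
err_after = sse()
```

Note that each user update reads all item factors (and vice versa); in the distributed setting that is exactly the communication one must be careful about.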
23. Recommendation with Alternating Least Squares (ALS)
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Load and parse the data.
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
// (rank = 10 latent factors, regularization parameter lambda = 0.01).
val model = ALS.train(ratings, rank = 10, iterations = 20, lambda = 0.01)

// Evaluate the model on the rating data.
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
24. ALS: Today’s ML Exercise
• Load 1M/10M ratings from MovieLens
• Specify YOUR ratings on examples
• Split examples into training/validation
• Fit a model (Python or Scala)
• Improve model via parameter tuning
• Get YOUR recommendations
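The split step above can be sketched with plain Python (a hypothetical standalone snippet; in the exercise the ratings would be an RDD, and a model would be fit with ALS.train on the training part and scored on the validation part for each candidate rank and lambda):

```python
import random

# Toy (user, movie, rating) triples standing in for the MovieLens data.
ratings = [(u, m, float((u + m) % 5 + 1)) for u in range(100) for m in range(10)]

# Shuffle, then hold out 20% of the examples for validation.
random.seed(42)
random.shuffle(ratings)
cut = int(0.8 * len(ratings))
training, validation = ratings[:cut], ratings[cut:]
```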
25. History and Overview
Example Applications
Ongoing Development
Performance
New APIs
26. Performance
[Chart: ALS runtime (minutes, 0-50) vs. number of ratings (0M-800M) on Amazon Reviews data with 16 nodes; MLlib vs. Mahout]
On a dataset with 660M users, 2.4M items, and 3.5B ratings,
MLlib runs in 40 minutes with 50 nodes
27. Performance
Steady performance gains
[Chart: speedup of Spark 1.1 over Spark 1.0 for Decision Trees, ALS, K-Means, Logistic Regression, Ridge Regression]
~3X speedups on average
28. Algorithms
In Spark 1.2
• Random Forests: ensembles of Decision Trees
• Boosting
Under development
• Topic modeling
• Many others!
29. ML Pipelines
Typical ML workflow:
Training Data → Feature Extraction → Model Training → Model Testing ← Test Data
30. ML Pipelines
Typical ML workflow is complex:
Training Data → Feature Extraction → Model Training → Model Testing ← Test Data
Other Training Data → Feature Extraction → Other Model Training
31. ML Pipelines
Typical ML workflow is complex.
Pipelines in 1.2 (alpha release)
• Easy workflow construction
• Standardized interface for model tuning
• Testing & failing early
Inspired by MLbase / Pipelines Project (see Evan’s talk)
Collaboration with Databricks
MLbase / MLOpt aims to autotune these pipelines
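The "easy workflow construction" idea can be sketched in plain Python (a hypothetical toy, much simpler than the real spark.ml API): a pipeline is an ordered list of stages, each transforming the dataset, with a final stage producing a model. Every name below is invented for illustration.

```python
# Hypothetical pipeline sketch: stages are plain functions over rows.

def tokenize(rows):
    # (label, text) -> (label, list of words)
    return [(label, text.lower().split()) for label, text in rows]

def count_features(rows):
    # (label, words) -> (label, single numeric feature: word count)
    return [(label, len(words)) for label, words in rows]

def train_threshold_model(rows):
    # Toy "model": predict 1 if the feature exceeds the mean feature value.
    mean = sum(f for _, f in rows) / len(rows)
    return lambda f: 1 if f > mean else 0

pipeline = [tokenize, count_features]

data = [(1, "spark makes ML pipelines composable and testable"),
        (0, "short text")]
for stage in pipeline:
    data = stage(data)
model = train_threshold_model(data)
```

Because each stage has the same dataset-in, dataset-out shape, stages can be swapped, tuned, and type-checked early, which is the point of the standardized interface above.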
32. Datasets
Further integration with SparkSQL
ML pipelines require Datasets
• Handle many data types (features)
• Keep metadata about features
• Select subsets of features for different parts of the pipeline
• Join groups of features
ML Dataset = SchemaRDD
Inspired by MLbase / MLI API
33. Resources
MLlib Programming Guide
spark.apache.org/docs/latest/mllib-guide.html
Databricks training info
databricks.com/spark-training
Spark user lists & community
spark.apache.org/community.html
edX MOOC on Scalable Machine Learning
www.edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
4-day BIDS minicourse in January