Netflix is the world’s largest streaming service, with over 80 million members worldwide. Machine learning algorithms are used to recommend relevant titles to users based on their tastes.
At Netflix, we use Apache Spark to power our recommendation pipeline. Stages in the pipeline, such as label generation, data retrieval, feature generation, training, and validation, are based on the Spark ML PipelineStage framework. While this gives developers the flexibility to develop individual components as encapsulated pipeline stages, we find that coordination across stages can provide significant performance gains.
In this talk, we discuss how our Spark-based machine learning pipeline has been improved over the years. Techniques such as predicate pushdown and wide-transformation minimization have led to significant run-time improvements and resource savings.
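As a sketch of the kind of optimization the abstract mentions, the snippet below (illustrative only; the S3 path and column name are hypothetical) shows how expressing a filter as a DataFrame operation lets Catalyst push the predicate down into the Parquet scan, so data outside the window is never read:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PredicatePushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()

    // Hypothetical snapshot location and date column, for illustration.
    // Because the filter is a column expression on a Parquet source,
    // Catalyst pushes it into the scan instead of filtering afterwards.
    val snapshot = spark.read
      .parquet("s3://example-bucket/snapshot")
      .filter(col("dateint") === 20170101)

    snapshot.explain() // the scan node reports PushedFilters
    spark.stop()
  }
}
```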
2. Netflix Scale
▪ Started streaming videos 10 years ago
▪ > 100M members
▪ > 190 countries
▪ > 1000 device types
▪ A third of peak US downstream traffic
3. Recommendation System: Ideal State
Turn on Netflix, and the absolute best content for you would automatically start playing.
4. Title Ranking
Everything is a Recommendation: Row Selection & Ordering
Recommendations are driven by machine learning algorithms.
Over 80% of what members watch comes from our recommendations.
5. Running Experiments
• Try an idea offline using historical data to see if it would have made better recommendations
• If it would, deploy a live A/B test to see if it performs well in production
6. Running Experiments
[Diagram: the Offline Experiment loop (Design Experiment → Collect Label Dataset → Offline Feature Generation → Model Training → Compute Validation Metrics → Model Testing → Design a New Experiment to Test Out Different Ideas); promising models move to the Online System via Online A/B Testing]
7. Feature Generation: Feature Computation
[Diagram, steps numbered 1 to 6: Label Data and Fact Data are read from the S3 Snapshot; Feature Encoders declare Required Features, which determine the Required Data; the encoders compute Features, which are combined with labels into Labeled Features and fed to Model Training, producing the Feature Model]
8. Version 1: RDD-Based Feature Generation
• RDD: Resilient Distributed Dataset
• Our first version was written when only RDD operations were available
• Opacity
▪ Data are opaque
▪ Computation is opaque
9. Version 1: RDD-Based Feature Generation
[Diagram: Label Data and Required Data flow from the S3 Snapshot as an RDD of POJOs; Feature Encoders consume Maps of Data POJOs to compute Features, producing Labeled Features for Model Training and the Feature Model]
10. Version 1: RDD-Based Feature Generation
RDD operations are low level; you are responsible for performance optimization.
RDD operations work on whole objects, even if only one field is required.
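A minimal sketch of this contrast (the class and field names are hypothetical, not our actual schema): the RDD version deserializes every whole object just to read one field, while the DataFrame version lets Spark prune down to a single column.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical fact record, for illustration only.
case class ViewingFact(profileId: Long, videoId: Long, country: String, durationMs: Long)

object RddVsDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-vs-df").getOrCreate()
    import spark.implicits._

    val facts = Seq(ViewingFact(1L, 10L, "US", 1000L)).toDS()

    // RDD style: each ViewingFact is fully materialized just to read videoId.
    val idsRdd = facts.rdd.map(_.videoId)

    // DataFrame style: column pruning means only the videoId column is touched.
    val idsDf = facts.select("videoId")

    idsDf.explain()
    spark.stop()
  }
}
```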
11. Version 2: Using DataFrame
• DataFrame: Structured Data Organized into Named Columns
• Transparency
▪ Data are structured
▪ Computations are planned based on common patterns
12. Version 2: Using DataFrame
The Spark SQL optimizer, Catalyst, optimizes DataFrame operations.
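A small illustration of Catalyst at work, using made-up toy data: the filter is written after the join, but the optimized plan pushes it below the join so fewer rows are shuffled.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CatalystSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("catalyst-sketch").getOrCreate()
    import spark.implicits._

    // Toy datasets, for illustration only.
    val labels   = Seq((1L, 1.0), (2L, 0.0)).toDF("videoId", "label")
    val features = Seq((1L, 0.3), (2L, 0.7)).toDF("videoId", "score")

    // The filter appears after the join in the code, but Catalyst can
    // evaluate it on the labels side before the join happens.
    val joined = labels.join(features, "videoId").filter(col("label") > 0.5)

    joined.explain(true) // optimized plan shows the pushed-down filter
    spark.stop()
  }
}
```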
13. Version 2: Using DataFrame
[Diagram: the Version 1 pipeline, shown again as the starting point; Label Data and Required Data flow from the S3 Snapshot as an RDD of POJOs through Maps of Data POJOs and Feature Encoders to Features, Labeled Features, Model Training, and the Feature Model]
14. Version 2: Using DataFrame
[Diagram: Label Data and Required Data now flow from the S3 Snapshot as Structured Data in a DataFrame; Feature Encoders still consume Maps of Data POJOs to compute Features, producing Structured Labeled Features for Model Training and the Feature Model]
15. Version 2: Using DataFrame
~3x run time gain in feature generation
▪ 50 ~ 80 executors
▪ ~3 cores per executor
▪ ~24GB per executor
16. Version 2: Using DataFrame
Let’s take a look at the physical plan of the DataFrame read from the snapshot...
17. Version 2: Using DataFrame (with RDD[Row])
We take the RDD[Row] from the DataFrame and create a new DataFrame by manipulating the Row objects.
[Diagram: S3 Snapshot → Structured Data in DataFrame → Deduping Logic with Row Manipulation]
18. Version 2: Using DataFrame (with RDD[Row])
Even though the new DataFrame, created from RDD[Row], has columns with the same names, Spark treats it as unrelated to the original.
19. Version 2: Using DataFrame (with RDD[Row])
Manipulations on Row objects are completely opaque, blocking the optimizer from moving operations around.
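A hedged sketch of this anti-pattern (column names and dedupe logic are invented for illustration): the round trip through RDD[Row] hides the lineage, so Catalyst cannot push the later filter below the row-level dedupe step.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col

object RowBarrierSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-barrier").getOrCreate()
    import spark.implicits._

    val snapshot = Seq((1L, "US"), (1L, "US"), (2L, "CA")).toDF("videoId", "country")

    // Opaque step: Spark only sees an arbitrary function over Row objects.
    val dedupedRdd = snapshot.rdd
      .map(r => (r.getLong(0), r)) // key by videoId
      .reduceByKey((a, _) => a)    // keep one row per key
      .values

    val deduped = spark.createDataFrame(dedupedRdd, snapshot.schema)

    // This filter cannot be moved before the RDD step, even though it
    // could have been evaluated on the raw snapshot.
    deduped.filter(col("country") === "US").explain(true)
    spark.stop()
  }
}
```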
20. Version 3: Column Operations To The Rescue
Most of the operations are essentially column(s) to column(s).
21. Version 3: Column Operations To The Rescue
Most of the operations are essentially column(s) to column(s).
Possible replacements for row manipulations:
▪ Spark SQL Functions
▪ User-Defined Functions
▪ Catalyst Expressions
22. Version 3: Column Operations To The Rescue
Spark SQL Functions
(org.apache.spark.sql.functions)
▪ Built-in
▪ Highly efficient
▪ Internal data structure
▪ Code generation
▪ Supports rule-based optimization
▪ A variety of categories
▪ Aggregation
▪ Collection
▪ Math
▪ String
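For instance (toy data, illustrative column names), a distinct count per key can be written entirely with built-in aggregation and collection functions, keeping user code off the hot path:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, size}

object BuiltinFunctionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("builtins").getOrCreate()
    import spark.implicits._

    val plays = Seq((1L, 10L), (1L, 11L), (2L, 10L)).toDF("profileId", "videoId")

    // Built-in functions operate on Spark's internal row format
    // and benefit from whole-stage code generation.
    val distinctCounts = plays
      .groupBy("profileId")
      .agg(size(collect_set("videoId")).as("distinctVideos"))

    distinctCounts.show()
    spark.stop()
  }
}
```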
23. Version 3: Column Operations To The Rescue
User-Defined Functions (UDFs)
▪ Scala functions with certain types
▪ Highly flexible
▪ Data encoding/decoding required
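A minimal UDF sketch (the normalization logic and column names are invented): note that each input is decoded from Spark's internal format and the result re-encoded, and Catalyst treats the function body as a black box.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-sketch").getOrCreate()
    import spark.implicits._

    // A plain Scala function wrapped as a UDF: flexible, but opaque
    // to the optimizer and subject to encode/decode overhead per row.
    val normalizeCountry = udf((c: String) =>
      if (c == null) "UNKNOWN" else c.toUpperCase)

    val df = Seq((1L, "us"), (2L, null)).toDF("profileId", "country")
    df.withColumn("country", normalizeCountry(col("country"))).show()
    spark.stop()
  }
}
```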
24. Version 3: Column Operations To The Rescue
User-Defined Catalyst Expressions
▪ Flexible
▪ User defines the operations
▪ Efficient
▪ Internal data structure
▪ Code generation possible
25. Version 3: Column Operations To The Rescue
[Diagram: the Version 2 pipeline (S3 Snapshot → Structured Data in DataFrame → Feature Encoders → Features → Structured Labeled Features → Model Training → Feature Model), with the Row Manipulation step highlighted]
26. Version 3: Column Operations To The Rescue
[Diagram: the same pipeline, with the Row Manipulation step replaced by Catalyst Expressions]
27. Version 3: Column Operations To The Rescue
We replaced the row manipulation with a Catalyst expression:

case class RemoveDuplications(child: Expression) extends UnaryExpression {
  ...
}

[Diagram: S3 Snapshot → Structured Data in DataFrame → Catalyst Expression]
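The body of RemoveDuplications is elided on the slide. Below is a hedged guess at what such an expression could look like for an array column, operating directly on Spark's internal ArrayData with no encode/decode step. The UnaryExpression contract is internal and Spark-version dependent (newer versions also require withNewChildInternal), so treat this as a sketch, not the actual implementation:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types.{ArrayType, DataType}

// Sketch only: deduplicates an array column using Spark's internal
// ArrayData representation. CodegenFallback keeps the example short;
// a production expression could implement doGenCode instead.
case class RemoveDuplications(child: Expression) extends UnaryExpression
    with CodegenFallback {

  // The output array type matches the input array type.
  override def dataType: DataType = child.dataType

  override def nullSafeEval(input: Any): Any = {
    val elementType = child.dataType.asInstanceOf[ArrayType].elementType
    val elements = input.asInstanceOf[ArrayData].toObjectArray(elementType)
    new GenericArrayData(elements.distinct)
  }
}

// Usage (sketch): new Column(RemoveDuplications(col("ids").expr))
```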
29. Version 3: Column Operations To The Rescue
Physical Plan with Column Operations
30. Version 3: Column Operations To The Rescue
~2x run time gain compared to version 2
▪ 50 ~ 80 executors
▪ ~3 cores per executor
▪ ~24GB per executor
31. Conclusions
▪ Time Travel in Offline Training
▪ Fact logging + offline feature generation
▪ Optimization
▪ Remove “black boxes”
▪ Prefer high-level DataFrame APIs
▪ Prefer column operations over row manipulations