Slide 4: Netflix Scale
> 81M members
> 190 countries
> 1000 device types
> 3B hours/month
> 36% of peak US downstream traffic
Slide 5: Goal
Help members find content to watch and enjoy, to maximize member satisfaction and retention
Slide 6: Everything is a Recommendation
Rows and ranking: over 80% of what people watch comes from our recommendations
Recommendations are driven by machine learning
Slide 13: System Architecture
Offline: process data (batch learning)
Nearline: process events (model evaluation, online learning; asynchronous)
Online: process requests (real-time)
[Architecture diagram spanning the OFFLINE, NEARLINE, and ONLINE tiers: member actions (Play, Rate, Browse...) flow from the UI Client through a User Event Queue and Event Distribution into Offline Data and Nearline Computation (Netflix.Manhattan); machine learning algorithms run in Offline Computation (model training) and Nearline Computation, publishing Models via Netflix.Hermes; Online Computation and the Online Data Service back an Algorithm Service that returns query results and recommendations.]
More details on Netflix Techblog
Slide 14: Where to place components?
Example: Matrix Factorization
Offline: collect a sample of play data; run a batch learning algorithm like SGD to produce the factorization; publish the video factors
Nearline: solve for the user factors; compute user-video dot products; store the scores in a cache
Online: apply presentation-context filtering; serve recommendations
[Same architecture diagram as slide 13.]
Offline: X ≈ U Vᵗ
Nearline: solve A uᵢ = b for the user factors, then compute scores sᵢⱼ = uᵢ · vⱼ
Online: filter to items with sᵢⱼ > t
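The placement of the matrix factorization steps can be sketched end to end. This is an illustrative NumPy sketch with made-up data and names, not Netflix's implementation; ALS-style closed-form updates stand in for whatever batch algorithm (e.g. SGD) is actually used offline.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Offline: batch-learn the factorization X ≈ U Vᵗ from a sample of
# play data, then "publish" the video factors V. U is discarded; user
# factors are recomputed nearline as events arrive.
X = rng.random((50, 20))          # toy user-video interaction sample
k = 5                             # latent dimension
U = rng.random((50, k))
V = rng.random((20, k))
lam = 0.1                         # L2 regularization
for _ in range(10):               # alternating least-squares updates
    U = X @ V @ np.linalg.inv(V.T @ V + lam * np.eye(k))
    V = X.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(k))

# --- Nearline: on a user event, solve A uᵢ = b for that user's factors
# against the published V, then cache the scores sᵢⱼ = uᵢ · vⱼ.
def solve_user_factors(x_i, V, lam=0.1):
    A = V.T @ V + lam * np.eye(V.shape[1])
    b = V.T @ x_i
    return np.linalg.solve(A, b)

u = solve_user_factors(X[0], V)
scores = V @ u                    # sᵢⱼ for every video j, stored in a cache

# --- Online: presentation-context filtering (here just a score
# threshold sᵢⱼ > t) and serving the top results.
t = scores.mean()
recommended = [j for j in np.argsort(-scores)[:5] if scores[j] > t]
```

The point of the split is that the expensive factorization runs on the batch tier, the per-user solve is cheap enough for the event-driven nearline tier, and the online tier only filters cached scores.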
Slide 16: Example development process
Experimentation environment: Idea → Data → Offline modeling (R, Python, MATLAB, …), iterating until there is a final model
Production environment (A/B test): implement in the production system (Java, C++, …) → actual output
Gaps between the two environments: data discrepancies, code discrepancies, missing post-processing logic, performance issues
Slide 17: Solution: Share and lean towards production
Developing machine learning is iterative
Need a short pipeline to rapidly try ideas
Want to see the output of the complete system
So make the application easy to experiment with
Share components between online, nearline, and offline
Use the real code whenever possible
Have well-defined interfaces and formats that allow you to go off the beaten path
Slide 18: Shared Engine
Avoid dual implementations
[Diagram: experiment code and production code both built on a shared engine of models, features, algorithms, …]
Slide 20: Make algorithms and models extensible and modular
Algorithms often need to be tailored for a specific application
Treating an algorithm as a black box is limiting
Better to make algorithms extensible and modular to allow for customization
Separate models and algorithms
Many algorithms can learn the same model (e.g. a linear binary classifier)
Many algorithms can be trained on the same types of data
Support composing algorithms
[Diagram contrasting a fused design (Data + Parameters → Model) with a separated one (Data + Parameters → Algorithm → Model)]
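The model/algorithm separation can be made concrete. In this minimal Python sketch (all names invented; not a Netflix or Foundry API), two different training algorithms, mistake-driven perceptron updates and logistic-loss SGD, both produce the same model type, so serving and evaluation code depends only on the model.

```python
import math

class LinearBinaryClassifier:
    """The learned artifact: just weights and a bias."""
    def __init__(self, weights, bias=0.0):
        self.weights, self.bias = weights, bias

    def score(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def predict(self, x):
        return 1 if self.score(x) > 0 else -1

def train_perceptron(data, dim, epochs=20):
    """Algorithm 1: perceptron updates."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            if LinearBinaryClassifier(w, b).predict(x) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return LinearBinaryClassifier(w, b)

def train_logistic_sgd(data, dim, epochs=200, lr=0.1):
    """Algorithm 2: SGD on the logistic loss."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            g = -y / (1.0 + math.exp(y * z))   # d(logistic loss)/dz
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return LinearBinaryClassifier(w, b)

# Both trainers yield the same model type on the same data format.
data = [([2.0, 1.0], 1), ([1.5, 2.0], 1),
        ([-1.0, -2.0], -1), ([-2.0, -0.5], -1)]
for trainer in (train_perceptron, train_logistic_sgd):
    model = trainer(data, dim=2)
    assert all(model.predict(x) == y for x, y in data)
```

Swapping the algorithm never touches the code that consumes the model, which is the customization point the slide argues for.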
Slide 21: Provide building blocks
Don’t start from scratch
Linear algebra: vectors, matrices, …
Statistics: distributions, tests, …
Models, features, metrics, ensembles, …
Loss, distance, kernel, … functions
Optimization, inference, …
Layers, activation functions, …
Initializers, stopping criteria, …
Domain-specific components
Build abstractions on familiar concepts and make the software put them together
Slide 22: Example: Tailoring Random Forests
Using the Cognitive Foundry: http://github.com/algorithmfoundry/Foundry
Use a custom tree split
Customize to run it for an hour
Report a custom metric each iteration
Inspect the ensemble
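The Foundry example is Java; the same extensibility points can be sketched in Python: a pluggable split criterion, a custom stopping condition (e.g. a time budget), a per-iteration metric hook, and an ensemble you can inspect afterwards. This illustrates the idea only; it is not the Foundry API, and it grows decision stumps rather than full trees for brevity.

```python
import random

def stump_split(data, criterion):
    """Find the best single-feature threshold under a pluggable criterion."""
    best = None
    for f in range(len(data[0][0])):
        for x, _ in data:
            t = x[f]
            left = [y for xi, y in data if xi[f] <= t]
            right = [y for xi, y in data if xi[f] > t]
            score = criterion(left, right)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best[1], best[2]

def misclassification(left, right):
    """One criterion; swap in Gini, entropy, or something custom."""
    def err(ys):
        if not ys:
            return 0
        majority = max(set(ys), key=ys.count)
        return sum(1 for y in ys if y != majority)
    return err(left) + err(right)

def train_forest(data, criterion, stop, on_iteration=None):
    ensemble, rng, i = [], random.Random(0), 0
    while not stop(i, ensemble):            # pluggable stopping condition
        sample = [rng.choice(data) for _ in data]   # bootstrap sample
        f, t = stump_split(sample, criterion)
        left = [y for x, y in sample if x[f] <= t]
        right = [y for x, y in sample if x[f] > t]
        ensemble.append((f, t,
                         max(set(left or [0]), key=(left or [0]).count),
                         max(set(right or [0]), key=(right or [0]).count)))
        if on_iteration:
            on_iteration(i, ensemble)       # custom per-iteration metric hook
        i += 1
    return ensemble                         # plain data: easy to inspect

def predict(ensemble, x):
    votes = [(lo if x[f] <= t else hi) for f, t, lo, hi in ensemble]
    return max(set(votes), key=votes.count)

data = [([0.0], 0), ([0.2], 0), ([0.8], 1), ([1.0], 1)]
stop_after_5 = lambda i, ens: i >= 5        # replace with a time budget, etc.
forest = train_forest(data, misclassification, stop_after_5)
```

Because the criterion, stopping rule, and iteration hook are parameters rather than internals, each of the slide's four customizations is a one-line change at the call site.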
Slide 24: Putting learning in an application
[Diagram: the application feeds input through Feature Encoding into a Machine Learned Model (ℝᵈ → ℝᵏ), whose output goes through Output Decoding back to the application, with a "?" at each boundary]
At each boundary: is this application or model code?
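One common answer is to package the encoding and decoding with the model, so the application deals only in domain objects on both sides. A minimal sketch, with invented feature and label names:

```python
class RankedModel:
    """Bundles feature encoding, the ℝᵈ → ℝᵏ map, and output decoding."""
    def __init__(self, weights, labels):
        self.weights = weights          # rows of a k x d linear map
        self.labels = labels            # decoding: output index -> label

    def encode(self, item):
        """Feature encoding: domain object -> ℝᵈ vector."""
        return [item["popularity"], item["predicted_rating"]]

    def decode(self, scores):
        """Output decoding: ℝᵏ scores -> best domain label."""
        best = max(range(len(scores)), key=lambda k: scores[k])
        return self.labels[best]

    def apply(self, item):
        x = self.encode(item)
        scores = [sum(w * xi for w, xi in zip(row, x))
                  for row in self.weights]
        return self.decode(scores)

model = RankedModel(weights=[[1.0, 0.0], [0.0, 1.0]],
                    labels=["trending", "for_you"])
assert model.apply({"popularity": 0.9, "predicted_rating": 0.3}) == "trending"
assert model.apply({"popularity": 0.1, "predicted_rating": 0.8}) == "for_you"
```

The trade-off: keeping encoding/decoding in the model keeps experimentation and production consistent, at the cost of the model knowing about application types.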
Slide 25: Example: Simple ranking system
High-level API: List<Video> rank(User u, List<Video> videos)
Example model description file:
{
  "type": "ScoringRanker",
  "scorer": {
    "type": "FeatureScorer",
    "features": [
      {"type": "Popularity", "days": 10},
      {"type": "PredictedRating"}
    ],
    "function": {
      "type": "Linear",
      "bias": -0.5,
      "weights": {
        "popularity": 0.2,
        "predictedRating": 1.2,
        "predictedRating*popularity": 3.5
      }
    }
  }
}
This description composes a Ranker, a Scorer, Features, a Linear function, and feature transformations.
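A sketch of how a rank() implementation might interpret such a description file, building the linear function over base features and "*"-separated interaction terms. The video fields used by the feature extractors here are stand-ins, not Netflix's actual feature definitions.

```python
def make_feature(spec):
    """Map a feature spec to (name, extractor). Fields are illustrative."""
    if spec["type"] == "Popularity":
        return "popularity", lambda v: v["plays_last_n_days"][spec["days"]]
    if spec["type"] == "PredictedRating":
        return "predictedRating", lambda v: v["predicted_rating"]
    raise ValueError(spec["type"])

def make_scorer(desc):
    features = dict(make_feature(f) for f in desc["scorer"]["features"])
    fn = desc["scorer"]["function"]
    def score(video):
        values = {name: f(video) for name, f in features.items()}
        total = fn["bias"]
        for term, weight in fn["weights"].items():
            product = 1.0
            for factor in term.split("*"):   # e.g. predictedRating*popularity
                product *= values[factor]
            total += weight * product
        return total
    return score

desc = {
    "type": "ScoringRanker",
    "scorer": {
        "type": "FeatureScorer",
        "features": [{"type": "Popularity", "days": 10},
                     {"type": "PredictedRating"}],
        "function": {
            "type": "Linear",
            "bias": -0.5,
            "weights": {"popularity": 0.2, "predictedRating": 1.2,
                        "predictedRating*popularity": 3.5},
        },
    },
}

def rank(videos, desc):
    score = make_scorer(desc)
    return sorted(videos, key=score, reverse=True)

videos = [
    {"id": "a", "plays_last_n_days": {10: 0.9}, "predicted_rating": 0.2},
    {"id": "b", "plays_last_n_days": {10: 0.1}, "predicted_rating": 0.9},
]
ranked = rank(videos, desc)
```

Because the ranker, scorer, features, and function are all named in data, an experiment can swap any of them without touching the serving code.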
Slide 26: Recommendation 5: Max out a single machine before distributing your algorithms
Slide 27: Problem: Your great new algorithm doesn’t scale
Want to run your algorithm on larger data
Temptation to go distributed
Spark/Hadoop/etc. seem to make it easy
But building distributed versions of non-trivial ML algorithms is hard
It often means changing the algorithm or making lots of approximations
So try to squeeze as much as possible out of a single machine first
You have far more communication bandwidth through memory than over the network
You will be surprised how far one machine can go
Example: Amazon announced today an X1 instance type with 2 TB of memory and 128 virtual CPUs
Slide 28: How?
Profile your code and think about memory and cache layout
Small changes can have a big impact
Example: Transposing a matrix can drop computation from 100 ms to 3 ms
Go multicore
Algorithms like Hogwild! for SGD-type optimization can make this very easy
Use specialized resources like GPUs (or TPUs?)
Only go distributed once you’ve optimized along these dimensions (often you won’t need to)
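The transpose example can be illustrated (the 100 ms → 3 ms figure is from the slide; the actual speedup depends on matrix size and hardware). This NumPy sketch contrasts a column traversal that strides across rows with a transpose-then-contiguous traversal; both compute the same sums, but the second walks memory sequentially in its inner loop.

```python
import numpy as np

X = np.arange(4_000_000, dtype=np.float64).reshape(2000, 2000)

def column_sums_strided(X):
    # Inner access pattern jumps a full row width per element.
    return [X[:, j].sum() for j in range(X.shape[1])]

def column_sums_transposed(X):
    # Transpose (and copy) once; each former column is now contiguous.
    Xt = np.ascontiguousarray(X.T)
    return [Xt[j].sum() for j in range(Xt.shape[0])]

assert column_sums_strided(X) == column_sums_transposed(X)
```

Profiling tells you whether a hot loop has this access pattern; the fix is a one-time layout change, not a distributed rewrite.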
Slide 29: Example: Training Neural Networks
Level 1: Machines in different AWS regions
Level 2: Machines in the same AWS region
Simple: grid search; better: Bayesian optimization using Gaussian Processes
Mesos, Spark, etc. for coordination
Level 3: Highly optimized, parallel CUDA code on GPUs
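The "simple" level-2 option can be sketched as an exhaustive grid search: every hyperparameter combination is an independent evaluation, which is what makes it easy to farm out with Mesos, Spark, etc. The objective below is a stand-in for "train a network with these settings and return its validation loss".

```python
from itertools import product

def validation_loss(lr, reg):
    # Stand-in for training a network with (lr, reg) and validating it.
    return (lr - 0.01) ** 2 + (reg - 0.001) ** 2

grid = {
    "lr": [0.001, 0.01, 0.1],
    "reg": [0.0001, 0.001, 0.01],
}

# Every (lr, reg) pair can run on a different machine; results are
# independent. Bayesian optimization instead chooses the next point
# based on results so far, so it needs fewer (but sequential) runs.
results = {(lr, reg): validation_loss(lr, reg)
           for lr, reg in product(grid["lr"], grid["reg"])}
best = min(results, key=results.get)
```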
Slide 31: Machine Learning and Testing
Temptation: use validation metrics to test software
When things work and metrics go up, this seems great
When metrics don’t improve, was it the code, the data, the metric, the idea, …?
Slide 32: Reality of Testing
Machine learning code involves intricate math and logic
Rounding issues, corner cases, …
Is that a + or a −? (The math, or the paper, could be wrong.)
Solution: unit test
Testing metric code is especially important
Test the whole system: unit testing alone is not enough
At a minimum, compare output across versions to catch unexpected changes
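Both points can be sketched together: unit tests for a metric's corner cases, plus a golden-output comparison against the previous version. precision_at_k here is an illustrative metric, not Netflix's evaluation code, and the golden values are made up for the example.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    if k <= 0 or not recommended:
        return 0.0
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / len(top_k)

# Unit tests for the metric, including the corner cases where sign flips
# and off-by-one errors tend to hide.
assert precision_at_k(["a", "b", "c"], {"a", "c"}, k=2) == 0.5
assert precision_at_k(["a", "b", "c"], {"a", "c"}, k=10) == 2 / 3  # k > list
assert precision_at_k([], {"a"}, k=5) == 0.0                       # no recs
assert precision_at_k(["a"], set(), k=1) == 0.0                    # none relevant

# System-level check: evaluate a fixed input and diff against output
# recorded from the prior version, failing on unexpected changes.
fixed_input = (["a", "b", "c", "d"], {"b", "d"})
golden_v1 = {1: 0.0, 2: 0.5, 4: 0.5}   # recorded from the prior version
current = {k: precision_at_k(*fixed_input, k) for k in golden_v1}
assert current == golden_v1
```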
Slide 34: Two ways to solve computational problems
Software Development: know solution → write code → compile code → test code → deploy code
Machine Learning: know relevant data → develop algorithmic approach → train model on data using algorithm → validate model with metrics → deploy model
(The machine learning steps may themselves involve software development)
Slide 35: Take-aways for building machine learning software
Building machine learning software is an iterative process
Make experimentation easy
Take a holistic view of the application where you are placing learning
Design your algorithms to be modular
Optimize how your code runs on a single machine before going distributed
Testing can be hard but is worthwhile
Slide 36: Thank You
Justin Basilico
jbasilico@netflix.com
@JustinBasilico
We’re hiring