As a data-driven company, we use machine learning algorithms and A/B tests to drive all of the content recommendations for our members. To improve the quality of our personalized recommendations, we first try an idea offline using historical data. Ideas that improve our offline metrics are then pushed to A/B tests, which are measured through statistically significant improvements in core metrics such as member engagement, satisfaction, and retention. At the heart of such offline analyses is historical fact data used to generate the features required by the machine learning model, for example a member's viewing history or the videos in their My List.
Building a fact store at an ever-evolving Netflix scale is non-trivial. We must capture enough fact data to cover the stratification needs of various experiments and guarantee that the data we serve is temporally accurate. In this talk, we will present the key requirements, the evolution of our fact store design, its implementation, its scale, and our learnings.
We will also take a deep dive into fact vs. feature logging, design tradeoffs, infrastructure performance, reliability, and the query API for the store. We use Spark and Scala extensively, along with a variety of compression techniques, to store and retrieve data efficiently.
3. Recommendations at Netflix
● Personalized Homepage for each member
○ Goal: Quickly help members find content they'd like to watch
○ Risk: Member may lose interest and abandon the service
○ Challenge: Recommendations at Scale
5. Experimentation Cycle @ Netflix
Design a new experiment to test out different ideas.
[Diagram: the experimentation cycle. In the offline experiment phase we design the experiment, collect label data, run offline feature generation, and do model training and model validation against metrics; ideas that win offline move to the online system for online A/B testing.]
6. ML Feature Engineering - Architectural View
[Diagram: in the online system, microservices feed online feature generation, which produces features for online scoring; in the offline experiment path, logged facts feed offline feature generation, which produces features for model training. Both paths use shared feature encoders, and trained models are deployed back to the online system.]
7. What is a Fact?
● Fact
  ○ Input data for feature encoders; used to construct a feature (see the sketch below)
  ○ Examples: viewing history of a member, My List of a member
● Historical version of a fact
  ○ Rewindable - state of the world at that time
● Temporal
  ○ Facts are temporal, i.e. they change with time
  ○ Each online scoring service uses the latest value of a fact
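To make the fact/feature distinction concrete, here is a minimal Scala sketch; the fact type, encoder trait, and feature names are hypothetical and not taken from the deck.

```scala
// Minimal sketch (hypothetical types, not the deck's actual interfaces): a fact is raw
// input data; a feature encoder turns facts into model features.
case class ViewingHistoryFact(memberId: Long, videoIds: Seq[Long], logTime: Long)

trait FeatureEncoder[F] {
  def encode(fact: F): Map[String, Double]
}

// Example encoder deriving simple counting features from the viewing-history fact.
object ViewingHistoryEncoder extends FeatureEncoder[ViewingHistoryFact] {
  def encode(fact: ViewingHistoryFact): Map[String, Double] = Map(
    "videos_watched"          -> fact.videoIds.size.toDouble,
    "distinct_videos_watched" -> fact.videoIds.distinct.size.toDouble
  )
}
```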
9. Fact Logging - Pull Architecture
Pull: capture snapshots of key facts from the fact microservices for stratified member sets.
● Daily snapshots of key facts
● Storage
  ○ S3 & Parquet
● API to access the data
  ○ RDD & DataFrames (a snapshot-read sketch follows this list)
● Cons
  ○ Lacks temporal accuracy
  ○ Load on microservices
  ○ Missing experiment-specific facts
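For the pull path, a snapshot read might look like the following; the bucket name, path layout, and partition column are assumptions for illustration, not the actual scheme.

```scala
// Minimal sketch (hypothetical bucket and path layout): reading one day's snapshot of a
// fact from S3 as Parquet, via both the DataFrame and RDD access paths the slide lists.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fact-snapshot-read").getOrCreate()

// DataFrame access to a single daily snapshot partition.
val viewingHistory = spark.read
  .parquet("s3://fact-snapshots/viewing_history/dateint=20180601/")

// The same data as an RDD of rows, for code paths that predate Spark SQL.
val viewingHistoryRdd = viewingHistory.rdd

viewingHistory.printSchema()
```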
10. Fact Logging - Push Architecture
● Compute engines themselves control what to log
● Stratification
● Temporal accuracy
[Diagram: compute services embed the Fact Logger and push facts into the Fact Store; a Fact Fetcher and Fact Transformer serve those facts to ML workflows for feature generation and model training.]
11. Fact Logger
[Diagram: the Fact Logger is a library embedded in both precompute and live compute services; it writes to the fact stream, the base fact tables, and stratification data.]
● Library
● Facts
  ○ User related
  ○ Video related
  ○ Computation specific
● Serialization
● Stratification Service
● Fact Stream (a logging sketch follows this list)
● Storage
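The deck lists the logger's responsibilities without showing its interface; the sketch below is an assumed shape of a fact-logger call, with every type and method name hypothetical: consult the stratification service, then push the serialized fact onto the fact stream.

```scala
// Minimal sketch (all names hypothetical): the shape of a fact-logger library call.
// A compute service decides what to log; the logger consults the stratification
// service, then pushes the serialized fact onto the fact stream.
trait StratificationService {
  def shouldLog(memberId: Long): Boolean // e.g. member is in a stratified sample
}

trait FactStream {
  def publish(key: String, payload: Array[Byte]): Unit // e.g. a Kafka topic
}

class FactLogger(strat: StratificationService, stream: FactStream) {
  // Serialization is kept abstract here; the real library would use a shared codec.
  def log(memberId: Long, factType: String, serializedFact: Array[Byte]): Unit =
    if (strat.shouldLog(memberId))
      stream.publish(s"$factType/$memberId", serializedFact)
}
```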
12. Fact Logging - Scalability
● 5-10x increase in data through Kafka
● SLA impact; cost increase
● Compression - 70% decrease (a compression sketch follows this list)
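One way to get a size reduction of the kind quoted is to compress the serialized payload before it goes onto the stream; the deck does not name its codecs, so the GZIP choice below is an assumption.

```scala
// Minimal sketch (the deck does not name its codecs; GZIP is assumed here):
// compressing a serialized fact payload before publishing, and decompressing on read.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

def compress(payload: Array[Byte]): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val gzip  = new GZIPOutputStream(bytes)
  gzip.write(payload)
  gzip.close()
  bytes.toByteArray
}

def decompress(payload: Array[Byte]): Array[Byte] = {
  val gzip = new GZIPInputStream(new ByteArrayInputStream(payload))
  val out  = new ByteArrayOutputStream()
  val buf  = new Array[Byte](4096)
  var n = gzip.read(buf)
  while (n != -1) { out.write(buf, 0, n); n = gzip.read(buf) }
  gzip.close()
  out.toByteArray
}
```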
13. Storage & Access
Fact Store
Fact Transformer
Deduplication
Precompute Live Compute
Fact Logger
● Pipeline load
○ Repeated facts
● Aggressive or not
○ Loss threshold
Conditional push
● Spark Job
○ Fact pointers
○ SLA
#DevSAIS11
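A deduplication pass of the kind described could be written as a Spark job along the following lines; the schema (memberId, factType, payload, logTime), the paths, and the pointer representation are assumptions for illustration.

```scala
// Minimal sketch (hypothetical schema and layout): a Spark job that deduplicates
// repeated facts. Within each (memberId, factType), only the first occurrence of a
// payload keeps the value; later occurrences become pointers to that first logTime.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("fact-dedup").getOrCreate()

// expected columns: memberId, factType, payload, logTime
val facts = spark.read.parquet("s3://fact-store/raw/dateint=20180601/")

val firstSeen = Window
  .partitionBy("memberId", "factType", "payload")
  .orderBy("logTime")

val deduped = facts
  .withColumn("firstLogTime", first("logTime").over(firstSeen))
  .withColumn("isPointer", col("logTime") =!= col("firstLogTime"))
  // pointers carry no payload, only the logTime of the row that does
  .withColumn("payload", when(col("isPointer"), lit(null)).otherwise(col("payload")))

deduped.write.mode("overwrite").parquet("s3://fact-store/deduped/dateint=20180601/")
```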
14. API Lookback
For each member, the lookback index stores, per fact type, either an inline value or a pointer to a value logged earlier:

Member ID   My List    View History   Thumbs
122312      Value      Pointer        Value
254637      Pointer    Pointer        Pointer
Member n    Pointer    Value          Pointer

[Diagram: pointers resolve into per-fact-type value tables (My List, View History, Thumbs), each split into partitions 1..m of values written at earlier times (log_time - x, log_time - y, log_time - z).]
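Resolving a lookback entry then amounts to returning the inline value or chasing the pointer into the partitioned value tables; the types below are hypothetical.

```scala
// Minimal sketch (hypothetical types): resolving a lookback entry that is either an
// inline value or a pointer to a value logged earlier in a partitioned fact table.
sealed trait FactEntry
case class InlineValue(payload: Array[Byte])                        extends FactEntry
case class Pointer(factType: String, partition: Int, logTime: Long) extends FactEntry

trait FactValueTable {
  // looks up the payload written at logTime in the given partition of a fact type
  def fetch(factType: String, partition: Int, logTime: Long): Array[Byte]
}

def resolve(entry: FactEntry, table: FactValueTable): Array[Byte] = entry match {
  case InlineValue(payload)          => payload
  case Pointer(factType, part, time) => table.fetch(factType, part, time)
}
```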
15. Storage & Access
Fact Store
Fact Transformer
Read API
Precompute Live Compute
Fact Logger
● Query performance
○ Slow moving facts
● Point query
○ Connector
● Query time reduction
○ Hours to minutes
Deduplication
Conditional push
Write
Read
#DevSAIS11
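A point-query Read API of this kind might expose something like the call below, letting feature-generation jobs fetch only the members and fact types they need as of a given log time; the interface, column names, and path layout are assumptions, not the actual API.

```scala
// Minimal sketch (hypothetical interface): a point-query Read API over the fact store.
// Feature-generation jobs ask for specific members, a fact type, and an as-of time
// instead of scanning full snapshots, which is where the hours-to-minutes win comes from.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

class FactStoreReadApi(spark: SparkSession, basePath: String) {
  def read(memberIds: Seq[Long], factType: String, asOfLogTime: Long): DataFrame =
    spark.read.parquet(s"$basePath/$factType/")
      .where(col("memberId").isin(memberIds: _*) && col("logTime") <= asOfLogTime)
      // keep only the most recent fact per member at or before the as-of time
      .groupBy("memberId")
      .agg(max(struct(col("logTime"), col("payload"))).as("latest"))
      .select(col("memberId"), col("latest.logTime"), col("latest.payload"))
}
```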
16. Performance: Storage
• Partitioning scheme (a write-layout sketch follows this list)
  – Noisy neighbor
• Storage format
  – Exploratory vs production
• Fast & slow lane
  – Lookback limit
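As a sketch of the partitioning idea, writing fact-store output partitioned by fact type and date keeps a single high-volume fact type (the "noisy neighbor") from crowding the files every other reader has to scan; the layout below is illustrative, not the actual scheme.

```scala
// Minimal sketch (hypothetical layout): partitioning fact-store output by fact type
// and date so readers can prune to only the fact types and days they need.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fact-store-write").getOrCreate()

// expected columns include factType and dateint alongside the fact payload
val deduped = spark.read.parquet("s3://fact-store/deduped/")

deduped.write
  .partitionBy("factType", "dateint") // readers prune to the fact types they need
  .mode("append")
  .parquet("s3://fact-store/partitioned/")
```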
17. Performance: Spark reads
• Bloom filters
  – Reduce scan
• Cache access
  – EVCache, Spectator
• MapPartitions vs UDF (a mapPartitions sketch follows this list)
  – Eager vs lazy
  – SPARK-11438, SPARK-11469, SPARK-20586
[Diagram: the application calls the ML library, which reads through the Read API.]
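The MapPartitions-vs-UDF point is about where per-row work and per-partition setup happen; the sketch below, with a stand-in fetch function, shows the lazy-iterator pattern that pays setup once per partition and avoids materializing rows eagerly.

```scala
// Minimal sketch (hypothetical fetch logic): mapPartitions lets per-partition setup
// (a cache client, a bloom filter) be built once, while the rows flow through the
// iterator lazily instead of being paid for row by row inside a UDF.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mapPartitions-vs-udf").getOrCreate()
import spark.implicits._

val memberIds = (1L to 100000L).toDS()

val resolved = memberIds.mapPartitions { ids =>
  // expensive per-partition setup happens once here (e.g. open a cache connection)
  val factFetcher: Long => String = id => s"facts-for-$id" // hypothetical stand-in
  // keep the result an iterator: mapping lazily avoids materializing the partition
  ids.map(id => (id, factFetcher(id)))
}

resolved.show(5)
```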
18. Future Work
• Structured with schema evolution
– Best of both (POJO & Spark SQL), Iceberg
• Streaming vs Batch
– Multiple lanes, accountability, independent scale
• Duplication
– Storage vs Runtime cost