The document describes the process of developing and productionising a recommendation engine for BBC Sounds. It discusses:
1) The initial challenge of replacing an outsourced recommendation engine and prototyping a new one using factorisation machines. Qualitative user tests showed improved recommendations over the external provider.
2) Productionising the system on Google Cloud Platform, with Apache Airflow for workflow orchestration, Apache Beam for efficient data processing, and precomputed recommendations served at 1500 requests/second with low latency.
3) Initial A/B tests found a 59% increase in interactions overall and a 103% increase for under-35s with the new recommendation engine. Ongoing work includes optimising costs and API performance.
From Idea to Production: BBC's Recommender Engine
1. From an idea to production
Recommender for BBC Sounds
Tatiana Al-Chueyr
Principal Data Engineer at Datalab
MLOps London, 28 September 2021 @tati_alchueyr
7. @tati_alchueyr
BBC Datalab
Vision
For the BBC to be a leader in Machine Learning that
delights audiences and prioritises the needs of
individuals and society over corporations and states.
Mission
To develop and deploy Machine Learning at BBC scale
so that teams can tailor services to individuals whilst
upholding our editorial values.
8. @tati_alchueyr
BBC Datalab: the hummingbirds squad
The knowledge in this presentation is the result of lots of teamwork within one squad of a larger team and an even broader organisation.
Squad team members, current and previous: Darren Mundy, David Hollands, Richard Bownes, Marc Oppenheimer, Bettina Hermant, Tatiana Al-Chueyr, Jana Eggink
20. @tati_alchueyr
The prototype: 1-2 months of work
● Collected data (quick-and-dirty™ scripts)
● Compared existing Python Factorisation Machines libraries (winner: LightFM)
● Trained the model and predicted recommendations (quick-and-dirty™ scripts)
● Implemented a qualitative experiment tool
● Recruited volunteers to join the qualitative experiment
● Ran the qualitative experiment, comparing:
○ External provider recommendations
○ Our own Factorisation Machines-powered recommendations
21. @tati_alchueyr
Qualitative experiment: how
Who
● ~30 test users recruited
○ Internal BBC employees
○ Under 35
How
● Two sets of 9 recommendations each:
○ External provider
○ Internal factorisation machines
● Users, without knowing the origin of the recommendations, had to:
○ choose “the best”, “both”, or “neither”
○ explain why
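A minimal sketch of how such blind judgements could be tallied. The labels, helper name and data are invented for illustration; the talk does not describe the actual analysis tooling:

```python
# Tally blind side-by-side judgements: each session records which set the
# participant preferred ("external" or "internal"), or "both"/"neither".
from collections import Counter

def summarise(judgements):
    tally = Counter(judgements)
    total = sum(tally.values())
    # share of sessions where the internal recs were judged at least as good
    at_least_as_good = (tally["internal"] + tally["both"]) / total
    return tally, at_least_as_good

tally, share = summarise(["internal", "both", "external", "internal"])
print(tally, share)
```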
24. @tati_alchueyr
Productionising machine learning
The many components of a production ML system; the ML Code is only a small box surrounded by:
Configuration, Data Collection and Transformation, Feature Extraction, Data Verification, Machine Resource Management, Serving Infrastructure, Monitoring, Process Management Tools, Analysis Tools
Image copied from a presentation by Googler @mpyeager
25. @tati_alchueyr
Tech stack
● Google Cloud Platform
● Python as our main programming language
○ LightFM library for the model: very handy Dataset class and multi-threaded processing
● Apache Airflow (Composer) for workflow orchestration
● Apache Beam (Dataflow) for parallelisable data processing
● Redis to store our pre-computed recommendations
● Kubernetes for running the API and CPU-intensive tasks
● Terraform for infrastructure as code
27. @tati_alchueyr
Machine learning workflow
Input:
● User activity data
● Content metadata
Processing:
● Machine Learning model training
● Predict recommendations
● Business rules, part I (non-personalised):
○ Recency
○ Availability
○ Excluded masterbrands
○ Excluded genres
● Business rules, part II (personalised):
○ Already-seen items
○ Local radio (if not consumed previously)
○ Specific language (if not consumed previously)
○ Episode picking from a series
○ Diversification (1 episode per brand/series)
Output:
● Recommendations
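The two rule stages can be sketched as plain filter functions. The item fields, the recency threshold and the subset of rules shown here are assumptions for illustration, not the production implementation:

```python
# Sketch of the two-stage business-rule filtering over candidate items.
from datetime import datetime, timedelta

def non_personalised_rules(items, excluded_masterbrands, excluded_genres,
                           max_age_days=30):
    """Part I: catalogue-wide rules, independent of the user."""
    now = datetime.utcnow()
    return [
        it for it in items
        if it["available"]                                          # availability
        and now - it["published"] <= timedelta(days=max_age_days)   # recency
        and it["masterbrand"] not in excluded_masterbrands
        and it["genre"] not in excluded_genres
    ]

def personalised_rules(items, seen_ids):
    """Part II (partial): drop already-seen items, keep 1 episode per brand."""
    picked, seen_brands = [], set()
    for it in items:                      # items arrive ranked by the model
        if it["id"] in seen_ids or it["brand"] in seen_brands:
            continue
        seen_brands.add(it["brand"])      # diversification
        picked.append(it)
    return picked
```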
28. @tati_alchueyr
Steps to be done in the workflows, before the API
(Same workflow diagram as slide 27.)
31. @tati_alchueyr
Recommendation API: load performance
Goal: 1500 requests/s with p95 responses < 60 ms

                               On the fly   Precomputed   Precomputed
Concurrent load test (req/s)       50            50           1500
Success percentage               63.88%        100%          100%
p50 latency (success)           323.78 ms     1.68 ms       4.75 ms
p95 latency (success)           939.28 ms     3.21 ms      57.53 ms
p99 latency (success)           979.24 ms     4.51 ms      97.49 ms
Max successful requests/s          23            50           1500

Machine type: c2-standard-8, Python 3.7, Sanic workers: 7, prediction threads: 1, vCPU cores: 7, memory: 15 Gi, deployment replicas: 1
32. @tati_alchueyr
Strategies to serve recommendations
A. On the fly: the API holds the model, user activity and content metadata, and predicts and applies rules at request time.
B. Precompute: the API retrieves pre-computed recommendations from a cache.
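Strategy B can be sketched as follows. A plain dict stands in for Redis so the sketch is self-contained; the key scheme and helper names are hypothetical:

```python
# Sketch of the precompute strategy: a workflow writes ranked recommendations
# per user, and the API only does a key lookup at request time.
import json

cache = {}  # stand-in for Redis; in production: redis.Redis(...).set/.get

def precompute(user_id, ranked_items):
    """Called by the batch workflow after rules are applied."""
    cache[f"recs:{user_id}"] = json.dumps(ranked_items)

def serve(user_id, limit=9):
    """Called by the API: O(1) lookup, no model inference on the request path."""
    raw = cache.get(f"recs:{user_id}")
    return json.loads(raw)[:limit] if raw else []
```

Moving inference off the request path is what turns the ~320 ms p50 of on-the-fly serving into a few milliseconds in the load tests above.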
33. @tati_alchueyr
Steps to be done in the workflows, before the API
(Same workflow diagram as slide 27, with the output now being precomputed recommendations.)
39. @tati_alchueyr
Limitation of Apache Airflow
● Good for orchestrating tasks
● Not good for processing data within an Airflow worker
○ Separation of concerns: orchestration versus runtime data processing
42. @tati_alchueyr
Limitation of Apache Airflow
Issue: depending on the volume of data, a single PythonOperator task which usually takes 10 min could take almost 3 h!
Consequences: overall delay; blocked worker.
43. @tati_alchueyr
Limitation of Apache Airflow
Time estimations (in seconds) to predict recommendations using a c2-standard-30 instance (30 vCPU and 120 GB RAM)
44. @tati_alchueyr
Limitation of Apache Airflow
2 h to predict recommendations for 10k users. What about 5 million users, or more?
45. @tati_alchueyr
Limitation of Apache Airflow: solutions
Delegating processing to other services:
● Tasks which scale vertically (better hardware)
○ Airflow Compute Engine (Virtual Machine) Operator (GceInstanceStartOperator)
○ Airflow Kubernetes Pod Operator (GKEPodOperator)
● Tasks which scale horizontally (can be split and distributed across multiple nodes)
○ Airflow Dataflow Operator (Google Dataflow, Apache Beam)
○ Airflow Dataproc Operator (Google Dataproc, Apache Spark & Hadoop)
50. @tati_alchueyr
Apache Beam: overview of Dataflow job
Parallel processing “effortlessly”
Image from the book “Google Cloud Platform In Action” by JJ Geewax, Chapter 20
53. @tati_alchueyr
Adoption of Apache Beam & Dataflow
“Serverless” parallel processing of 41,258,135 items (27.32 GB) with
Python in 1min 24s using 10 default workers
54. @tati_alchueyr
s/PythonOperator/DataflowOperator
Replacing a PythonOperator in Cloud Composer with a DataflowOperator running a Beam pipeline within Dataflow reduced computation time by almost one order of magnitude:

Document type          PythonOperator   DataflowOperator   Performance gain
episode                    60 min             6 min              90%
availability episode       12 min             5 min              58%
61. @tati_alchueyr
Overall architecture
DATA SOURCES
● Availability, metadata and episode data to support a snapshot view of Sounds content
● Sounds user activity data for content consumption (signed-in only)
DATA PLATFORM
● Dedicated stream of UAS (encrypted) into a Sounds data lake
ML PLATFORM
● Dedicated Sounds metadata snapshot built intra-day
● User activity features processed intra-day
● Both sources merged into a feature set to support the re-training of the Sounds recommender
RECOMMENDER
● Build cluster for intra-day build/training of the model
● Dedicated service for pre-computing recommendations
● Serving layer that exposes the recommender to Unirecs; Unirecs serves recommendations to end users
63. @tati_alchueyr
Initial A/B test results
● +59% increase in interactions with the Recommended for You rail
● +103% increase in interactions for under-35s
66. @tati_alchueyr
Choosing the level of abstraction
Customisation x easiness of adoption, from most custom to most managed:
● None: build custom workflows and train your own models using Keras, Sklearn, PyTorch and others
● Open source / proprietary ready-to-use tools (e.g. TFX)
● Cloud built-in proprietary tools