In this talk, we present how we used Spark, Databricks, Airflow and MLflow to process big data and build a pipeline of both ML (XGBoost) and statistical models that maximizes revenue in one of our core products, the “Offer Wall”. The Offer Wall is a mobile product integrated into existing apps that suggests tasks users can perform in exchange for in-app currency. The problem gets even more interesting when you consider that some tasks take 15 minutes while others can take up to two weeks, forcing us to make revenue-determining decisions under uncertainty all of the time. The solution we developed leverages Databricks and Spark’s strengths in machine learning and big data, together with their MLflow and Airflow integrations, allowing us to deliver a production-grade solution with short development time between experiments.
5. This is Fyber
● We’re builders: 40% of 300+ employees focused on technology and product
● We’re app people: building solutions that app developers love
● We’re publicly traded: FRA: FBEN
● We’re global: 7 offices (San Francisco, New York, London, Berlin, Tel Aviv, Seoul, Beijing)
6. How big is our Big Data?
● 25B auctions per day
● 800B bid requests per day
● 200M DAU
● 15K+ apps
● 300TB generated monthly
● 300 user-level dimensions
● 80+ reported dimensions (on real-time reporting)
● 60+ reported metrics
9. Offer Wall
● Integrated into a user’s application
● Contains offers that users can complete in order to proceed within a game
● Gives the user an option to win an “in-app” claim / reward
10. Motivation
● Increase our user engagement
● Maximize revenues for our clients (publishers)
11. Challenges
● Our data is too big for ordinary frameworks (~hundreds of millions of events)
● Delayed feedback conversions
○ Conversions with a long delay present a challenge to models, yet they can have a high monetary value
12. Nature Of Delayed Conversions
● User A: click on January 1st, 2020; conversion on January 3rd, 2020; conversion value $2 (2 days of delay)
● User B: click on January 1st, 2020; conversion on January 10th, 2020; conversion value $3.5 (9 days of delay)
● User C: click on January 1st, 2020; conversion on February 1st, 2020; conversion value $40 (30 days of delay)
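The delays above follow directly from the click and conversion timestamps; a minimal stdlib sketch, using the dates from users A and B above:

```python
from datetime import date

def delay_days(click: date, conversion: date) -> int:
    """Days elapsed between a click and its eventual conversion."""
    return (conversion - click).days

# User A: clicked January 1st, converted January 3rd
user_a = delay_days(date(2020, 1, 1), date(2020, 1, 3))
# User B: clicked January 1st, converted January 10th
user_b = delay_days(date(2020, 1, 1), date(2020, 1, 10))
```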
18. What we found
This paper formulates two main aspects of delayed feedback in display advertising.
Instead of directly calculating:
● P(Conversion | Impression)
we can factor it and calculate:
● P(Click | Impression)
● P(Conversion | Click)
● P(Conversion | Impression) = P(Click | Impression) * P(Conversion | Click)
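The factorization above can be estimated from simple event counts; a minimal sketch (the counts below are made up for illustration, not our production numbers):

```python
def conversion_rate_via_clicks(impressions: int, clicks: int, conversions: int) -> float:
    """Estimate P(Conversion | Impression) as P(Click | Impression) * P(Conversion | Click)."""
    p_click = clicks / impressions             # P(Click | Impression)
    p_conv_given_click = conversions / clicks  # P(Conversion | Click)
    return p_click * p_conv_given_click

rate = conversion_rate_via_clicks(impressions=1000, clicks=100, conversions=10)
```

The product is numerically the same as conversions / impressions; the benefit of the factorization is that the two factors can be modeled separately, each with its own features and feedback delay.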
19. Solution Principles
● Don’t try to predict everything at once: use different tools for different problems
● We need a framework that can handle all of our needs:
○ Big data aggregation
○ ML modeling
○ Testing and visualization
○ Debugging and troubleshooting
23. XGBoost with Spark in Databricks #1
● XGBoost4J is a project that is constantly being updated and stabilized
○ The latest stable release: September 2020
● We use it to perform distributed training on our big data
● It can be added directly from the Maven repository
● It integrates easily with the Spark ML framework (MLlib)
● Databricks allows us to use it pretty easily, and that was one of the main reasons for choosing it
24. XGBoost with Spark in Databricks #2
(Code screenshots, per the Databricks XGBoost4J documentation: relevant imports, data preprocessing, and VectorAssembler usage)
25. XGBoost with Spark in Databricks #3
(Code screenshots: XGBoost4J “X, Y” definition, model instantiation with a parameter Map, distributed training, model train and transform)
26. XGBoost - Handling Missing Data
● XGBoost knows how to handle missing values within a dataset
● In tree-based algorithms, branch directions for missing values are learned during training
● You can tell XGBoost to treat a sentinel value (e.g. -999) as if it were missing, via the missing-value flag
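Conceptually, telling XGBoost that -999 means “missing” is equivalent to mapping that sentinel to NaN before training, so the learned default branch handles it. A stdlib sketch of the sentinel mapping (the feature matrix is a made-up example):

```python
import math

SENTINEL = -999.0  # the sentinel value our data uses to mark "no data"

def sentinel_to_nan(rows, sentinel=SENTINEL):
    """Replace the sentinel with NaN so tree learners route it via the learned default branch."""
    return [[math.nan if v == sentinel else v for v in row] for row in rows]

cleaned = sentinel_to_nan([[1.0, -999.0], [3.5, 2.0]])
```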
27. A word about MLeap
● One of our technical challenges was how to save the pipelines / models trained in Spark (Databricks)
● We looked for a solution that provides model export / import for both online and offline prediction modes
● MLeap provides all of the above
● Databricks has great documentation about it, which made this even easier
● We also wrote a short blog post on how to create synergy between Spark and MLeap
28. Conversion Prediction Model #1
● Some conversions arrive with a delay (e.g. a 14-day delay)
● By predicting the number of conversions before they all arrive, we make our model faster and better
● For this purpose we model this flow as a Poisson process
● A Poisson process is mostly used to count occurrences of events that happen at a certain rate, but at random
(Diagram: events 0, 1, 2, …, K arriving over time)
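Counting conversions in a window as a Poisson process means the probability of seeing exactly k conversions at rate λ over time t follows the Poisson pmf; a stdlib sketch (the rate and window below are illustrative, not our production values):

```python
import math

def poisson_pmf(k: int, rate: float, t: float) -> float:
    """P(exactly k events in a window of length t, with events arriving at `rate` per unit time)."""
    mu = rate * t  # expected number of events in the window
    return math.exp(-mu) * mu ** k / math.factorial(k)

# e.g. probability of seeing zero conversions in one day, at 2 conversions/day on average
p_zero = poisson_pmf(0, rate=2.0, t=1.0)
```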
29. Conversion Prediction Model #2
● A single event within a Poisson process can be modeled using the exponential distribution
● Probability estimation using the exponential distribution is straightforward to calculate: P = 1 / (1 + e^(-λx))
● λ = 1 / (average time to convert from click), x = elapsed time since the click
● Using only these two parameters, we can calculate the probability that each user’s click becomes a conversion
(Plot: probability density curves for λ = 0.5, 0.1, 1.5)
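The estimator on this slide is small enough to write as one function; λ and x are the only inputs, as described above (the 48-hour average delay below is an illustrative value, not our production one):

```python
import math

def conversion_probability(elapsed: float, avg_time_to_convert: float) -> float:
    """Slide's estimator: P = 1 / (1 + e^(-lambda * x)), with lambda = 1 / avg click-to-conversion delay."""
    lam = 1.0 / avg_time_to_convert  # lambda, the conversion rate
    return 1.0 / (1.0 + math.exp(-lam * elapsed))

# right after the click (x = 0) the estimator gives 0.5, and it grows as time elapses
p_now = conversion_probability(elapsed=0.0, avg_time_to_convert=48.0)
p_later = conversion_probability(elapsed=96.0, avg_time_to_convert=48.0)
```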
30. Airflow & Databricks Scheduling
● Airflow is a platform to programmatically author, schedule and monitor workflows
● Our data pipeline is complex, with several dependencies affecting each other
● The Databricks Airflow operator to the rescue!
● Databricks has great documentation about the Airflow Databricks operator
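A DAG that wires a Databricks job into Airflow typically looks like the configuration sketch below; the DAG name, cluster spec, notebook path and schedule are hypothetical placeholders, not our production setup:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="offer_wall_training",      # hypothetical DAG name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    train_model = DatabricksSubmitRunOperator(
        task_id="train_xgboost_model",
        databricks_conn_id="databricks_default",
        new_cluster={                  # placeholder cluster spec
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 4,
        },
        notebook_task={"notebook_path": "/Shared/train_model"},  # placeholder path
    )
```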
32. A/B Testing #1
Our best practices:
● Decide on one dominant KPI, and 2-3 supporting ones
● Build proper analysis tools for analyzing the tests
● Run the A/B test on a (small) portion of traffic
● Analyze results using Databricks dashboards & scheduling capabilities
33. A/B Testing #2
From events to A/B testing with Databricks:
● Read raw events using Spark
● Aggregate raw data into results, and save them periodically using the Databricks jobs scheduler
● Use SQL, built-in widgets and visualization libraries (e.g. Bokeh) to build a dashboard
● Again, use Databricks jobs to run the report every couple of hours and share the link with colleagues
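The aggregate-then-report step can be illustrated with a stdlib sketch; in practice this runs as a Spark job, and the event fields and variant names below are made up for illustration:

```python
from collections import defaultdict

def aggregate_by_variant(events):
    """Roll raw events up into per-variant totals for the A/B dashboard."""
    totals = defaultdict(lambda: {"conversions": 0, "revenue": 0.0})
    for event in events:
        bucket = totals[event["variant"]]
        bucket["conversions"] += 1
        bucket["revenue"] += event["revenue"]
    return dict(totals)

report = aggregate_by_variant([
    {"variant": "control", "revenue": 2.0},
    {"variant": "test", "revenue": 3.5},
    {"variant": "test", "revenue": 40.0},
])
```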
38. Main Insights
● Exploratory data analysis is crucial
● There’s a high chance the first experiment will go wrong. That’s OK, keep going
● Late conversions = late results
● Work is not done once deployment is done
● Post-deployment tools are crucial, especially if other teams are supporting your models
40. Summary
■ Fyber Overview
■ Offer Wall Overview
■ Our Use-Case Motivation
■ Our solution - how we explored it, what we wanted to achieve
■ A/B testing in a nutshell
■ Main Insights
41. Feel free to reach out!
Daniel Hen, Data Scientist: Email | Linkedin | Medium | GitHub
Michael Winer, Data Science & BI lead: Email | Linkedin