In this talk, we present how we used Spark, Databricks, Airflow and MLflow to process big data and build a pipeline of both ML (XGBoost) and statistical models that maximizes revenue in one of our core products, the “Offer Wall”. The Offer Wall is a mobile product integrated into existing apps that suggests tasks users can perform in exchange for in-app currency. The problem gets even more interesting when you consider that some tasks take 15 minutes while others can take up to two weeks, forcing us to make revenue-determining decisions under uncertainty all of the time. The solution we developed leverages Databricks and Spark’s strengths in machine learning and big data, together with their MLflow and Airflow integrations, allowing us to deliver a production-grade solution with short development time between experiments.
5. This is Fyber
● We’re builders: 40% of 300+ employees focused on technology and product
● We’re app people: building solutions that app developers love
● We’re publicly traded: FRA: FBEN
● We’re global: 7 offices (San Francisco, New York, London, Berlin, Tel Aviv, Seoul, Beijing)
6. How big is our Big Data?
● 25B auctions per day
● 800B bid requests per day
● 200M DAU
● 15K+ apps
● 300TB generated monthly
● 300 user-level dimensions
● 80+ reported dimensions (on real-time reporting)
● 60+ reported metrics
9. Offer Wall
● Integrated into a user’s application
● Contains offers that users can complete in order to proceed within a game
● Gives the user an option to win an “in-app” claim / reward
10. Motivation
● Increase our user engagement
● Maximize revenues for our clients (publishers)
11. Challenges
● Our data is too big for ordinary frameworks (~hundreds of millions of events)
● Delayed feedback conversions
○ Conversions with a long delay present a challenge to models, yet they can have a high monetary value
12. Nature Of Delayed Conversions
● User A: click on January 1st, 2020; conversion on January 3rd, 2020; conversion value $2 (2 days of delay)
● User B: click on January 1st, 2020; conversion on January 10th, 2020; conversion value $3.5 (9 days of delay)
● User C: click on January 1st, 2020; conversion on February 1st, 2020; conversion value $40 (30 days of delay)
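The delays above follow directly from the click and conversion timestamps; a minimal stdlib sketch, using the dates from users A and B above:

```python
from datetime import date

def delay_days(click: date, conversion: date) -> int:
    """Days elapsed between a click and its eventual conversion."""
    return (conversion - click).days

# User A: clicked January 1st, converted January 3rd
user_a = delay_days(date(2020, 1, 1), date(2020, 1, 3))
# User B: clicked January 1st, converted January 10th
user_b = delay_days(date(2020, 1, 1), date(2020, 1, 10))
```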
18. What we found
This paper formulates two main aspects of delayed feedback in display advertising.
Instead of directly calculating:
● P(Conversion | Impression)
we can factor it and calculate:
● P(Click | Impression)
● P(Conversion | Click)
● P(Conversion | Impression) = P(Click | Impression) * P(Conversion | Click)
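The factorization above can be estimated from simple event counts; a minimal sketch (the counts below are made up for illustration, not our production numbers):

```python
def conversion_rate_via_clicks(impressions: int, clicks: int, conversions: int) -> float:
    """Estimate P(Conversion | Impression) as P(Click | Impression) * P(Conversion | Click)."""
    p_click = clicks / impressions             # P(Click | Impression)
    p_conv_given_click = conversions / clicks  # P(Conversion | Click)
    return p_click * p_conv_given_click

rate = conversion_rate_via_clicks(impressions=1000, clicks=100, conversions=10)
```

The product is numerically the same as conversions / impressions; the benefit of the factorization is that the two factors can be modeled separately, each with its own features and feedback delay.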
19. Solution Principles
● Don’t try to predict everything at once: use different tools for different problems
● We need a framework that can handle all of our needs:
○ Big data aggregation
○ ML modeling
○ Testing and visualization
○ Debugging and troubleshooting
23. XGBoost with Spark in Databricks #1
● XGBoost4J is a project that is constantly being updated and stabilized
○ The latest stable release: September 2020
● We use it to perform distributed training on our big data
● It can be added directly from the Maven repository
● It integrates easily with the Spark ML framework (MLlib)
● Databricks allows us to use it pretty easily, and that was one of the main reasons for choosing it
24. XGBoost with Spark in Databricks #2
(Code screenshots, per the Databricks XGBoost4J documentation: relevant imports, data preprocessing, and VectorAssembler usage)
25. XGBoost with Spark in Databricks #3
(Code screenshots: XGBoost4J “X, Y” definition, model instantiation with a parameter Map, distributed training, model train and transform)
26. XGBoost - Handling Missing Data
● XGBoost knows how to handle missing values within a dataset
● In tree-based algorithms, branch directions for missing values are learned during training
● You can tell XGBoost to treat a sentinel value (e.g. -999) as if it were missing, via the missing-value flag
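Conceptually, telling XGBoost that -999 means “missing” is equivalent to mapping that sentinel to NaN before training, so the learned default branch handles it. A stdlib sketch of the sentinel mapping (the feature matrix is a made-up example):

```python
import math

SENTINEL = -999.0  # the sentinel value our data uses to mark "no data"

def sentinel_to_nan(rows, sentinel=SENTINEL):
    """Replace the sentinel with NaN so tree learners route it via the learned default branch."""
    return [[math.nan if v == sentinel else v for v in row] for row in rows]

cleaned = sentinel_to_nan([[1.0, -999.0], [3.5, 2.0]])
```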
27. A word about MLeap
● One of our technical challenges was how to save the pipelines / models trained in Spark (Databricks)
● We looked for a solution that provides model export / import for both online and offline prediction modes
● MLeap provides all of the above
● Databricks has great documentation about it, which made this even easier
● We also wrote a short blog post on how to create synergy between Spark and MLeap
28. Conversion Prediction Model #1
● Some conversions arrive with a delay (e.g. a 14-day delay)
● By predicting the number of conversions before they all arrive, we make our model faster and better
● For this purpose we model this flow as a Poisson process
● A Poisson process is mostly used to count occurrences of events that happen at a certain rate, but at random
(Diagram: events 0, 1, 2, …, K arriving over time)
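Counting conversions in a window as a Poisson process means the probability of seeing exactly k conversions at rate λ over time t follows the Poisson pmf; a stdlib sketch (the rate and window below are illustrative, not our production values):

```python
import math

def poisson_pmf(k: int, rate: float, t: float) -> float:
    """P(exactly k events in a window of length t, with events arriving at `rate` per unit time)."""
    mu = rate * t  # expected number of events in the window
    return math.exp(-mu) * mu ** k / math.factorial(k)

# e.g. probability of seeing zero conversions in one day, at 2 conversions/day on average
p_zero = poisson_pmf(0, rate=2.0, t=1.0)
```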
29. Conversion Prediction Model #2
● A single event within a Poisson process can be modeled using the exponential distribution
● Probability estimation using the exponential distribution is straightforward to calculate: P = 1 / (1 + e^(-λx))
● λ = 1 / (average time to convert from click), x = elapsed time since the click
● Using only these two parameters, we can calculate the probability that each user’s click becomes a conversion
(Plot: probability density curves for λ = 0.5, 0.1, 1.5)
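The estimator on this slide is small enough to write as one function; λ and x are the only inputs, as described above (the 48-hour average delay below is an illustrative value, not our production one):

```python
import math

def conversion_probability(elapsed: float, avg_time_to_convert: float) -> float:
    """Slide's estimator: P = 1 / (1 + e^(-lambda * x)), with lambda = 1 / avg click-to-conversion delay."""
    lam = 1.0 / avg_time_to_convert  # lambda, the conversion rate
    return 1.0 / (1.0 + math.exp(-lam * elapsed))

# right after the click (x = 0) the estimator gives 0.5, and it grows as time elapses
p_now = conversion_probability(elapsed=0.0, avg_time_to_convert=48.0)
p_later = conversion_probability(elapsed=96.0, avg_time_to_convert=48.0)
```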
30. Airflow & Databricks Scheduling
● Airflow is a platform to programmatically author, schedule and monitor workflows
● Our data pipeline is complex, with several dependencies affecting each other
● The Databricks Airflow operator to the rescue!
● Databricks has great documentation about the Airflow Databricks operator
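A DAG that wires a Databricks job into Airflow typically looks like the configuration sketch below; the DAG name, cluster spec, notebook path and schedule are hypothetical placeholders, not our production setup:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="offer_wall_training",      # hypothetical DAG name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    train_model = DatabricksSubmitRunOperator(
        task_id="train_xgboost_model",
        databricks_conn_id="databricks_default",
        new_cluster={                  # placeholder cluster spec
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 4,
        },
        notebook_task={"notebook_path": "/Shared/train_model"},  # placeholder path
    )
```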
32. A/B Testing #1
Our best practices:
● Decide on one dominant KPI, and 2-3 supporting ones
● Build proper analysis tools for analyzing the tests
● Run the A/B test on a (small) portion of traffic
● Analyze results using Databricks dashboards & scheduling capabilities
33. A/B Testing #2
From events to A/B testing with Databricks:
● Read raw events using Spark
● Aggregate raw data into results, and save them periodically using the Databricks jobs scheduler
● Use SQL, built-in widgets and visualization libraries (e.g. Bokeh) to build a dashboard
● Again, use Databricks jobs to run the report every couple of hours and share the link with colleagues
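The aggregate-then-report step can be illustrated with a stdlib sketch; in practice this runs as a Spark job, and the event fields and variant names below are made up for illustration:

```python
from collections import defaultdict

def aggregate_by_variant(events):
    """Roll raw events up into per-variant totals for the A/B dashboard."""
    totals = defaultdict(lambda: {"conversions": 0, "revenue": 0.0})
    for event in events:
        bucket = totals[event["variant"]]
        bucket["conversions"] += 1
        bucket["revenue"] += event["revenue"]
    return dict(totals)

report = aggregate_by_variant([
    {"variant": "control", "revenue": 2.0},
    {"variant": "test", "revenue": 3.5},
    {"variant": "test", "revenue": 40.0},
])
```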
38. Main Insights
● Exploratory data analysis is crucial
● There’s a high chance the first experiment will go wrong. That’s OK, keep going
● Late conversions = late results
● Work is not done once deployment is done
● Post-deployment tools are crucial, especially if other teams are supporting your models
40. Summary
■ Fyber Overview
■ Offer Wall Overview
■ Our Use-Case Motivation
■ Our solution - how we explored it, what we wanted to achieve
■ A/B testing in a nutshell
■ Main Insights
41. Feel free to reach out!
Daniel Hen, Data Scientist: Email | Linkedin | Medium | GitHub
Michael Winer, Data Science & BI lead: Email | Linkedin