Offer Wall Revenue Uplift with Spark, XGBoost and Statistics
November 19th, 2020 | Data+AI Summit | Michael Winer & Daniel Hen
About us
Daniel Hen
Data Scientist
Michael Winer
Data Science & BI lead
Agenda
01 Fyber Overview
02 Business Use Case
03 Solution Exploration
04 Solution
05 A/B Testing
06 Main Insights
Fyber Overview
This is Fyber
● We’re builders: 40% of 300+ employees focused on technology and product
● We’re app people: building solutions that app developers love
● We’re publicly traded: FRA: FBEN
● We’re global: 7 offices (San Francisco, New York, London, Berlin, Tel Aviv, Seoul, Beijing)
How big is our Big Data?
● 25B auctions per day
● 200M DAU
● 800B bid requests per day
● 15K+ apps
● 300TB generated monthly
● 300 user-level dimensions
● 80+ reported dimensions (on real-time reporting)
● 60+ reported metrics
Why are we here
October 2019: Marshal 100%
Marshal increased the Offer Wall revenues by 11%
Fyber Overview | Business Use Case | Solution Exploration | Solution | A/B Testing | Conclusions + Summary
Business Use Case
Offer Wall
● Integrated into a user’s application
● Contains offers that users can complete in order to progress within a game
● Gives the user an option to win an “in-app” claim / reward
Motivation
● Increase our user engagement
● Maximize revenues for our clients (publishers)
Challenges
● Our data is too big for ordinary frameworks (~hundreds of millions of events)
● Delayed feedback conversions
○ Conversions with a long delay present a challenge to models, yet they can have a high monetary value
Nature Of Delayed Conversions
● User A: click on January 1st, 2020; conversion on January 3rd, 2020 (2 days of delay); conversion value $2
● User B: click on January 1st, 2020; conversion on January 10th, 2020 (9 days of delay); conversion value $3.5
● User C: click on January 1st, 2020; conversion on February 1st, 2020 (31 days of delay); conversion value $40
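The three example clicks can be replayed in a few lines to see how much conversion value is still hidden at a given point in time. This is a minimal sketch: the dates and values are copied from the slide, while the names `EXAMPLES`, `delay_days` and `observed_value` are illustrative and not taken from the talk.

```python
from datetime import date

# (click date, conversion date, conversion value in USD), copied from the slide.
EXAMPLES = {
    "User A": (date(2020, 1, 1), date(2020, 1, 3), 2.0),
    "User B": (date(2020, 1, 1), date(2020, 1, 10), 3.5),
    "User C": (date(2020, 1, 1), date(2020, 2, 1), 40.0),
}


def delay_days(click, conversion):
    """Whole days between a click and the conversion it led to."""
    return (conversion - click).days


def observed_value(examples, days_since_click):
    """Total conversion value already visible `days_since_click` days after the click."""
    return sum(
        value
        for click, conversion, value in examples.values()
        if delay_days(click, conversion) <= days_since_click
    )


delays = {user: delay_days(click, conv) for user, (click, conv, _) in EXAMPLES.items()}
for window in (7, 14, 31):
    print(window, observed_value(EXAMPLES, window))
```

With these three clicks, a one-week window sees only the $2 conversion out of $45.5 in total value, which is why a model that ignores late conversions undervalues an offer.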
Multi-Armed Bandit - Vanilla Setting
Multi-Armed Bandit - Delayed Feedback Setting
Solution Exploration
Existing Environment
Delayed Feedback
Big Data - mostly Tabular
Time Series (event based)
Where we looked
● Literature review (arXiv, Papers with Code)
● Kaggle
What we found
This paper formulates two main aspects of feedback in display advertising.
Instead of directly calculating:
● P(Conversion | Impression)
We can calculate:
● P(click) = P(Click | Impression)
● P(conversion) = P(Conversion | Click)
● P(Conversion | Impression) = P(click) * P(conversion)
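The factorization above can be written as a tiny function. This is a sketch only: the function name and the example probabilities are hypothetical and do not come from the talk, and the product relies on a conversion only being possible after a click.

```python
def conversion_given_impression(p_click, p_conversion_given_click):
    """P(Conversion | Impression) = P(Click | Impression) * P(Conversion | Click)."""
    for p in (p_click, p_conversion_given_click):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_click * p_conversion_given_click


# Hypothetical numbers: a 4% click-through rate and a 25% click-to-conversion rate.
p = conversion_given_impression(0.04, 0.25)
print(p)
```

Splitting the problem this way lets each factor be estimated by a different tool: the click term from fast feedback, the conversion term from a model that accounts for delay.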
Solution Principles
● Don’t try to predict all at once - Use different tools for different problems
● We need a framework that is able to deal with our needs:
○ Big Data Aggregation
○ ML Modeling
○ Testing and Visualization
○ Debugging and Troubleshooting
The Chosen Solution
Architecture High-Level Overview
CTR Prediction Model
● Spark support
● Handling missing data
● Great performance with (big) tabular data
XGBoost with Spark in Databricks #1
● XGBoost4J is a project that is constantly being updated and stabilized
○ The latest stable release: Sept. 2020
● We use it to perform distributed training on our big data
● It can be added directly from the Maven repository
● It integrates easily with the Spark ML framework (MLlib)
● Databricks makes it easy to use, which was one of the main reasons for choosing it
XGBoost with Spark in Databricks #2
[Code screenshot, annotated: Relevant Imports, Data preprocessing, Vector Assembler. Source: Databricks XGBoost4J Documentation]
XGBoost with Spark in Databricks #3
[Code screenshot, annotated: XGBoost4J “X, Y” definition, XGBoost4J Model Instantiation (with Map), Distributed Training, Model Train, Model Transform]
XGBoost - Handling Missing Data
● XGBoost handles missing values within a dataset natively
● In tree-based algorithms, branch directions for missing values are learned during training
● You can tell XGBoost to treat a value (-999) as if it were a missing value. Example: [Code screenshot: Missing Value Flag]
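To make the idea of a learned branch direction concrete, here is a deliberately simplified, pure-Python illustration for a single split. It is not XGBoost's implementation (which works on gradient statistics); it uses plain squared error, and the names `sse` and `best_default_direction` as well as the toy rows are hypothetical. Rows whose feature equals the missing-value flag are tried on both sides of the split, and the side with the lower loss becomes the default direction.

```python
def sse(labels):
    """Sum of squared errors of labels around their mean (0.0 for an empty list)."""
    if not labels:
        return 0.0
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels)


def best_default_direction(rows, threshold, missing_flag=-999.0):
    """Pick the branch ('left' or 'right') that rows with a missing feature should follow.

    rows: list of (feature_value, label); feature_value == missing_flag means "missing".
    """
    left = [y for x, y in rows if x != missing_flag and x < threshold]
    right = [y for x, y in rows if x != missing_flag and x >= threshold]
    missing = [y for x, y in rows if x == missing_flag]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"


# Toy data: the rows with a missing feature look like the high-value rows.
rows = [(1.0, 0.0), (2.0, 0.0), (8.0, 1.0), (9.0, 1.0), (-999.0, 1.0), (-999.0, 1.0)]
print(best_default_direction(rows, threshold=5.0))
```

Without the flag, -999 would simply compare as a very small number and always fall on the "less than" side of every split, regardless of what the labels say.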
A word about MLeap
● One of our technical challenges was how to save the pipeline / models, which were trained in Spark (Databricks)
● We looked for a solution that provides model export / import for both online and offline prediction modes
● MLeap provides all of the above
● Databricks has great documentation about it, which made this even easier
● We also wrote a short blog post on how to create synergy between Spark and MLeap
Conversion Prediction Model #1
● Some conversions will arrive with a delay (e.g. a 14-day delay)
● By predicting the number of conversions before they all arrive, we make our model faster and better
● For this purpose we treat this flow as a Poisson process
● A Poisson process is mostly used where we count the occurrences of events that happen at a certain rate, but at random
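For a Poisson process with a constant rate, the number of events in a time window follows the Poisson distribution. The sketch below evaluates its probability mass function; the function name and the example rate are assumptions made for illustration and are not figures from the talk.

```python
import math


def poisson_pmf(k, rate, window):
    """P(exactly k events in `window`) for a Poisson process with `rate` events per time unit."""
    mean = rate * window
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Hypothetical example: conversions arriving at 2 per day, observed for 1 day.
for k in range(5):
    print(k, round(poisson_pmf(k, rate=2.0, window=1.0), 4))
```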
Conversion Prediction Model #2
● A single event within a Poisson process can be modeled using the Exponential Distribution
● Probability estimation using the Exponential Distribution is straightforward to calculate:
1 / ( 1 + e^(-x*λ) )
● λ = 1 / (Avg. time to convert from click)
● x = Elapsed time from click
● Using only these 2 parameters, we can calculate a probability for each user’s click to become a conversion
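A minimal sketch of this calculation, with hypothetical function names. `slide_probability` evaluates the expression exactly as written on the slide. For comparison, `exponential_cdf` evaluates the textbook cumulative distribution function of the exponential distribution, 1 - e^(-λx), the probability that an exponentially distributed waiting time is at most x. The slides do not state which of the two is used in production, so both are shown side by side.

```python
import math


def slide_probability(elapsed, avg_time_to_convert):
    """The slide's expression: 1 / (1 + e^(-x * lambda)), with lambda = 1 / average delay."""
    lam = 1.0 / avg_time_to_convert
    return 1.0 / (1.0 + math.exp(-elapsed * lam))


def exponential_cdf(elapsed, avg_time_to_convert):
    """Textbook exponential CDF: P(waiting time <= x) = 1 - e^(-lambda * x)."""
    lam = 1.0 / avg_time_to_convert
    return 1.0 - math.exp(-lam * elapsed)


# Hypothetical input: conversions take 5 days on average; evaluate a few elapsed times.
for days in (0, 1, 5, 14):
    print(days, round(slide_probability(days, 5.0), 3), round(exponential_cdf(days, 5.0), 3))
```

Both functions increase with elapsed time and approach 1, but they differ at the start: the slide's expression equals 0.5 at x = 0, whereas the exponential CDF equals 0.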
[Figure: exponential distribution curves for λ = 0.5, λ = 0.1 and λ = 1.5; x-axis from 0 to 5, y-axis "Probability Density" from 0.0 to 1.5]
Airflow & Databricks Scheduling
● Airflow is a platform to programmatically author, schedule and monitor workflows
● Our data pipeline is complex, as there are several dependencies affecting each other
● Databricks Airflow Operator to the rescue!
● Databricks has great documentation about it (Airflow Databricks Operator)
A/B Testing
A/B Testing #1
Our Best Practices
● Decide on one dominant KPI, and 2-3 supporting ones
● Build proper tools for analyzing the tests
● Run an A/B test with a (small) portion of traffic
● Analyze results using Databricks Dashboards & Scheduling capabilities
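One common way to send only a small, stable portion of traffic into a test is to hash the user ID into a bucket. The sketch below is an assumption about how such a split could look, not the mechanism described in the talk; `assign_variant`, the 10% share and the variant names are all hypothetical.

```python
import hashlib


def assign_variant(user_id, test_name, test_share=0.10, variants=("A", "B")):
    """Deterministically map a user to 'control' or to one of the test variants."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform value in [0, 1)
    if bucket >= test_share:
        return "control"
    # Split the test share evenly between the variants.
    index = min(int(bucket / test_share * len(variants)), len(variants) - 1)
    return variants[index]


print(assign_variant("user-42", "offer-wall-ranking"))
```

Because the assignment depends only on the test name and the user ID, a user keeps the same variant across sessions, and changing the test name reshuffles users for the next experiment.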
A/B Testing #2
From Events to A/B testing with Databricks
● Read raw events using Spark
● Aggregate raw data to results, and save periodically using the Databricks jobs scheduler
● Use SQL, built-in widgets and visual libraries (e.g. Bokeh) to build a dashboard
● Again, use Databricks Jobs to run the report every couple of hours and share the link with colleagues
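The aggregation step can be pictured with a small pure-Python stand-in for the Spark job. The event schema (`variant`, `type`, `value`) and the function name are assumptions for illustration; the talk does not show the actual event format or KPIs.

```python
from collections import defaultdict


def aggregate_by_variant(events):
    """Roll raw events up to one summary row per A/B variant."""
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0, "revenue": 0.0})
    for event in events:
        row = totals[event["variant"]]
        if event["type"] == "impression":
            row["impressions"] += 1
        elif event["type"] == "click":
            row["clicks"] += 1
        elif event["type"] == "conversion":
            row["revenue"] += event["value"]
    summary = {}
    for variant, row in totals.items():
        ctr = row["clicks"] / row["impressions"] if row["impressions"] else 0.0
        summary[variant] = {**row, "ctr": ctr}
    return summary


# Hypothetical raw events.
sample_events = [
    {"variant": "A", "type": "impression"},
    {"variant": "A", "type": "impression"},
    {"variant": "A", "type": "click"},
    {"variant": "A", "type": "conversion", "value": 2.0},
    {"variant": "B", "type": "impression"},
]
summary = aggregate_by_variant(sample_events)
print(summary)
```

In Spark the same roll-up would typically be a group-by on the variant column, with the result saved to a table that the dashboard queries.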
A/B Testing #3
From Events to A/B testing with Databricks
[Screenshots: Notebook, Scheduling, Notifications]
A/B Testing - Summary Statistics
Variant Main_KPI KPI_2 KPI_3 KPI_4 KPI_5
C 1.001 29.839 8.289 0.673 0.047
B 1.02 31.585 10.285 0.606 0.061
A 0.975 32.261 25.819 0 0.14
Model Analysis
[Figures: CTR predictions distribution for Model A, Model B and Model C]
Main Insights
● Exploratory Data Analysis is crucial
● There’s a high chance that the first experiment will go wrong. It’s OK, keep on
● Late conversions = late results
● Work is not done once deployment is done
● Post-deployment tools are crucial, especially if other teams are supporting your models
Post-Deployment Tools Using Databricks
Summary
■ Fyber Overview
■ Offer Wall Overview
■ Our Use-Case Motivation
■ Our solution - how we explored it, what we wanted to achieve
■ A/B testing in a nutshell
■ Main Insights
Feel free to reach out!
Daniel Hen
Data Scientist
Michael Winer
Data Science & BI lead
Email | Linkedin |
Medium | GitHub
Email | Linkedin
Q&A
THANK YOU!
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed Feedback Environment

  • 1. Offer Wall Revenue Uplift with Spark, XGBoost and Statistics. November 19th, 2020 | Data+AI Summit | Michael Winer & Daniel Hen
  • 2. About us: Daniel Hen, Data Scientist; Michael Winer, Data Science & BI Lead
  • 3. Agenda: 01 Fyber Overview; 02 Business Use Case; 03 Solution Exploration; 04 Solution; 05 A/B Testing; 06 Main Insights
  • 5. This is Fyber
      ● We’re builders: 40% of 300+ employees focused on technology and product
      ● We’re app people: building solutions that app developers love
      ● We’re publicly traded: FRA: FBEN
      ● We’re global: 7 offices (San Francisco, New York, London, Berlin, Tel Aviv, Seoul, Beijing)
  • 6. How big is our Big Data?
      ● 25B auctions per day
      ● 800B bid requests per day
      ● 200M DAU
      ● 15K+ apps
      ● 300TB generated monthly
      ● 300 user-level dimensions
      ● 80+ reported dimensions (on real-time reporting)
      ● 60+ reported metrics
  • 7. Why are we here: October 2019, Marshal 100%. Marshal increased the Offer Wall revenues by 11%.
  • 9. Offer Wall
      ● Integrated in a user’s application
      ● Contains offers that users can complete in order to progress within a game
      ● Gives the user an option to win an “in-app” claim / reward
  • 10. Motivation
      ● Increase our user engagement
      ● Maximize revenues for our clients (publishers)
  • 11. Challenges
      ● Our data is too big for ordinary frameworks (~hundreds of millions of events)
      ● Delayed Feedback Conversions
          ○ Conversions with a long delay present a challenge to models; however, they can have a high monetary value
  • 12. Nature of Delayed Conversions
      ● User A: click January 1st, 2020; conversion January 3rd, 2020; conversion value $2 (2 days of delay)
      ● User B: click January 1st, 2020; conversion January 10th, 2020; conversion value $3.5 (9 days of delay)
      ● User C: click January 1st, 2020; conversion February 1st, 2020; conversion value $40 (about 30 days of delay)
  • 13. Multi-Armed Bandit - Vanilla Setting
  • 14. Multi-Armed Bandit - Delayed Feedback Setting
  • 16. Existing Environment
      ● Delayed Feedback
      ● Big Data, mostly tabular
      ● Time Series (event based)
  • 18. What we found
      This paper formulates two main aspects of feedback in Display Advertising. Instead of directly calculating:
      ● P(Conversion | Impression)
      we can calculate:
      ● P(click) = P(Click | Impression)
      ● P(conversion) = P(Conversion | Click)
      ● P(Conversion | Impression) = P(click) * P(conversion)
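The factorization on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the offer names and all probabilities are hypothetical. It also assumes a conversion can only happen after a click, which is what lets the two conditional probabilities be multiplied.

```python
def conversion_given_impression(p_click: float, p_conversion_given_click: float) -> float:
    """P(Conversion | Impression) = P(Click | Impression) * P(Conversion | Click)."""
    for name, p in (("p_click", p_click),
                    ("p_conversion_given_click", p_conversion_given_click)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {p}")
    return p_click * p_conversion_given_click


# Hypothetical numbers, for illustration only:
# each entry is (P(click), P(conversion | click)).
offers = {
    "offer_a": (0.10, 0.02),
    "offer_b": (0.04, 0.10),
}
scores = {name: conversion_given_impression(*p) for name, p in offers.items()}
best_offer = max(scores, key=scores.get)
print(best_offer, scores)
```

Splitting the estimate this way lets each factor come from a different tool, which matches the talk's later structure: a CTR model for the click term and a delay-aware model for the conversion term.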
  • 19. Solution Principles
      ● Don’t try to predict all at once: use different tools for different problems
      ● We need a framework that is able to deal with our needs:
          ○ Big Data Aggregation
          ○ ML Modeling
          ○ Testing and Visualization
          ○ Debugging and Troubleshooting
  • 22. CTR Prediction Model
      ● Spark Support
      ● Handling Missing Data
      ● Great performance with (Big) Tabular Data
  • 23. XGBoost with Spark in Databricks #1
      ● XGBoost4J is a project which is being constantly updated and stabilized
          ○ The latest stable release: Sept. 2020
      ● We use it to perform distributed training on our big data
      ● Can be added directly from the Maven repository
      ● Easily integrates with the Spark ML framework (MLlib)
      ● Databricks allows us to use it pretty easily, and that was one of the main reasons for choosing it
  • 24. XGBoost with Spark in Databricks #2 (code screenshot; annotated steps: Relevant Imports, Data Preprocessing, Vector Assembler; reference: Databricks XGBoost4J Documentation)
  • 25. XGBoost with Spark in Databricks #3 (code screenshot; annotated steps: XGBoost4J “X, Y” definition, XGBoost4J Model Instantiation (with Map), Model Train, Model Transform, Distributed Training)
  • 26. XGBoost - Handling Missing Data
      ● XGBoost knows how to handle missing values within a dataset
      ● In tree-based algorithms, branch directions for missing values are learned during training
      ● You can tell XGBoost to treat a specific value (e.g. -999) as if it were a missing value (the slide’s code screenshot highlights this Missing Value Flag)
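The idea of learning a branch direction for missing values can be sketched with a toy example in plain Python. This is not XGBoost code and not the authors' pipeline: the helper names and the data are hypothetical, and squared error stands in for the gradient-based gain a real booster would use. XGBoost itself exposes a `missing` parameter for the sentinel value.

```python
import math

MISSING_FLAG = -999.0  # sentinel from the slide; treated as "missing"


def to_missing(value, flag=MISSING_FLAG):
    """Map the sentinel to NaN so later code sees a true missing value."""
    return math.nan if value == flag else value


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_default_direction(xs, ys, threshold):
    """For a fixed split `x < threshold`, try sending the missing rows left
    and then right, and keep the direction with the lower squared error."""
    left = [y for x, y in zip(xs, ys) if not math.isnan(x) and x < threshold]
    right = [y for x, y in zip(xs, ys) if not math.isnan(x) and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if math.isnan(x)]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"


# Hypothetical toy data: rows carrying the flag behave like high-x rows.
raw_x = [1.0, 2.0, -999.0, 8.0, 9.0, -999.0]
ys = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
xs = [to_missing(x) for x in raw_x]
direction = best_default_direction(xs, ys, threshold=5.0)
print(direction)
```

Without the flag, -999 would be treated as a very small number and always fall on the low side of any split; marking it as missing lets the split choose the side that fits the data.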
  • 27. A word about MLeap
      ● One of our technical challenges was how to save the pipeline / models, which were trained in Spark (Databricks)
      ● We looked for a solution that provides model export / import for both online and offline prediction modes
      ● MLeap provides all of the above
      ● Databricks has great documentation about it, which made this even easier
      ● We also wrote a short blog post on how to create synergy between Spark and MLeap
  • 28. Conversion Prediction Model #1
      ● Some conversions will arrive with a delay (e.g. a 14-day delay)
      ● By predicting the number of conversions before they all arrive, we make our model faster and better
      ● For this purpose we look at this flow as a Poisson process
      ● A Poisson process is mostly used where we count the occurrences of events that happen at a certain rate, but at random
      (Figure: diagram with labels 0, 1, 2, K)
  • 29. Conversion Prediction Model #2
      ● The waiting time for a single event within a Poisson process can be modeled using the Exponential Distribution
      ● The probability estimate is straightforward to calculate; the slide gives it as 1 / (1 + e^(-x·λ))
      ● λ = 1 / (average time to convert from click); x = elapsed time from click
      ● Using only these 2 parameters, we can calculate a probability for each user’s click to become a conversion
      (Figure: exponential probability density curves; legend values 0.5, 0.1, 1.5; x axis from 0 to 5)
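The two parameters on this slide can be turned into a small calculation. The sketch below is an illustration under stated assumptions, not the authors' production code: the function names and the 14-day average delay are hypothetical. It evaluates the expression exactly as written on the slide, which has the logistic form and equals 0.5 at x = 0, next to the textbook exponential CDF, 1 - e^(-λx), which starts at 0. The transcript does not say which of the two was used in practice.

```python
import math


def rate_from_mean_delay(avg_days_to_convert: float) -> float:
    """lambda = 1 / (average time from click to conversion)."""
    if avg_days_to_convert <= 0:
        raise ValueError("avg_days_to_convert must be positive")
    return 1.0 / avg_days_to_convert


def slide_probability(elapsed_days: float, lam: float) -> float:
    """Expression as written on the slide: 1 / (1 + e^(-x * lambda))."""
    return 1.0 / (1.0 + math.exp(-elapsed_days * lam))


def exponential_cdf(elapsed_days: float, lam: float) -> float:
    """Textbook exponential CDF: P(delay <= x) = 1 - e^(-lambda * x) for x >= 0."""
    if elapsed_days < 0:
        return 0.0
    return 1.0 - math.exp(-lam * elapsed_days)


lam = rate_from_mean_delay(14.0)  # hypothetical average delay of 14 days
for days in (0, 2, 9, 30):
    # Elapsed days, slide expression, textbook CDF.
    print(days, round(slide_probability(days, lam), 3), round(exponential_cdf(days, lam), 3))
```

Both functions increase with elapsed time, so either one gives older clicks more weight than recent ones; they differ mainly in the value assigned to very recent clicks.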
  • 30. Airflow & Databricks Scheduling | Airflow Databricks Operator
      ● Airflow is a platform to programmatically author, schedule and monitor workflows
      ● Our data pipeline is complex, as there are several dependencies affecting each other
      ● Databricks Airflow Operator to the rescue!
      ● Databricks has great documentation about it
  • 32. A/B Testing #1: Our Best Practices
      ● Decide on one dominant KPI, and 2-3 supporting ones
      ● Build proper tools for analyzing the tests
      ● Run an A/B test with a (small) portion of traffic
      ● Analyze results using Databricks Dashboards & Scheduling capabilities
  • 33. A/B Testing #2: From Events to A/B Testing with Databricks
      ● Read raw events using Spark
      ● Aggregate raw data to results, and save periodically using the Databricks jobs scheduler
      ● Use SQL, built-in widgets and visual libraries (e.g. bokeh) to build a dashboard
      ● Again, use Databricks Jobs to run the report every couple of hours and share the link with colleagues
  • 34. A/B Testing #3: From Events to A/B Testing with Databricks (screenshots: Notebook, Scheduling, Notifications)
  • 35. A/B Testing - Summary Statistics
      Variant | Main_KPI | KPI_2  | KPI_3  | KPI_4 | KPI_5
      C       | 1.001    | 29.839 | 8.289  | 0.673 | 0.047
      B       | 1.02     | 31.585 | 10.285 | 0.606 | 0.061
      A       | 0.975    | 32.261 | 25.819 | 0     | 0.14
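A summary table like this is usually read as relative differences between variants. The sketch below computes them for the Main_KPI column, using the values from the slide. Treating variant C as the baseline is an assumption made here for illustration; the slide does not say which variant is the control, and the function name is hypothetical.

```python
# Main_KPI values copied from the summary table on the slide.
main_kpi = {"A": 0.975, "B": 1.02, "C": 1.001}


def relative_uplift(variant_value: float, baseline_value: float) -> float:
    """Relative change of a variant versus the baseline, as a fraction."""
    if baseline_value == 0:
        raise ValueError("baseline_value must be non-zero")
    return (variant_value - baseline_value) / baseline_value


BASELINE = "C"  # assumption: the control variant is not named on the slide
uplift = {
    name: relative_uplift(value, main_kpi[BASELINE])
    for name, value in main_kpi.items()
    if name != BASELINE
}
for name, value in sorted(uplift.items()):
    print(f"{name} vs {BASELINE}: {value:+.2%}")
```

These are point estimates only. Deciding whether a difference is real would also need sample sizes and variances, which the slide does not report.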
  • 36. Model Analysis (figures: CTR prediction distributions for Model A, Model B and Model C)
  • 38. Main Insights
      ● Exploratory Data Analysis is crucial
      ● There’s a high chance that the first experiment will go wrong. It’s OK, keep going
      ● Late conversions = late results
      ● Work is not done once deployment is done
      ● Post-deployment tools are crucial, especially if other teams are supporting your models
  • 40. Summary
      ■ Fyber Overview
      ■ Offer Wall Overview
      ■ Our Use-Case Motivation
      ■ Our solution: how we explored it, what we wanted to achieve
      ■ A/B testing in a nutshell
      ■ Main Insights
  • 41. Feel free to reach out! Daniel Hen, Data Scientist; Michael Winer, Data Science & BI Lead. Links on the slide: Email | LinkedIn | Medium | GitHub; Email | LinkedIn
  • 44. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.