In this session, PayPal presents the techniques it used to retain merchants with machine learning models built on the SparkML platform. Retaining merchants translates directly into dollar value, so it was critical for us to identify a model that trains on our data and predicts merchant behavior, giving us insights that help prevent merchant churn. We also take a deep dive into how we captured the right signals while filtering out noise that could skew the predictions, and into some of the challenges we faced in scaling this solution. Lastly, we show how SparkML orchestrated the stages of the pipeline we built, enabling us to perform feature engineering and to train, validate, and cross-validate models at scale across our different data samples.
2. Who are we?
• Data Engineers – ETL pipelines using Spark
• Like all great projects, we started from a hack!
• Data Engineering to Machine Learning
3. Agenda
1. Scale at PayPal
2. Understanding Merchant Churn
3. Machine Learning Workflow
4. Learnings
5. Spark ML
4. Scale at PayPal
• 200 Countries
• 25 Currencies
• 19 Million Merchants
• 237 Million Active Users
• 8 Billion Transactions per Year
• 6 Billion Events per Day
5. Understanding Merchant Churn
Compliance Use Case for CLAC
Story, Triggers, Impact, Insights
• Increase in compliance limitations for CLAC in 2017
• Regulations mandate that merchants complete compliance verification
• Applicable to merchants exceeding $$ in a 12-month period
• Might lead to the merchant’s account being suspended
• Merchant not aware of limitation
• Merchant did not understand how to resolve limitation
• High impact for small merchants
• Biggest churn driver for CLAC in 2017
• $M in payments
6. Churn Recovery Efforts
Existing pipeline
• Limited Success
• Reactive process
• Account managers reach out to merchants already churned
• Reversing the limitation and relaunching merchants takes time
• Large set of merchants for reach-outs
Flow: New Merchants -> Merchants Get Limitation -> Merchants Churn -> Account Manager -> Relaunched Merchants
7. Churn Recovery Efforts
Enhanced pipeline
• Proactive Process
• Use a machine learning pipeline to predict the time to reach $$
• Reach out to merchants before limitation is reached
• Mitigate restriction and churn
Flow: New Merchants -> Merchants Likely to Get Limitation -> Account Manager -> Merchants complete regulation
ML Model: Predict Revenue and Timelines
8. ML Platform
Data, Models, Integration, Channel
• Metadata (Segment, Geo, Capacity, Priority, Channel, etc.)
• Model 1, Model 2, Model N
• Salesforce Alerts, Salesforce SSO, E-mail, …
• Feedback Data (Optimization & Learnings)
• Performance Tracking
10. We’re here
Learning to do Machine learning
• Explore data: Let’s analyze what kind of data we have
• Stop churn: We’re done & merchants are happy!
11. Select Training Data
Ask questions
• What datasets do we use for training the model?
• Should we focus only on initial transactions?
• What data is relevant for new merchants?
• Should we consider inflation and currency conversion?
• Which merchants should we use to train the model?
13. Data Transformation Strategies
Transforming data into features
• Raw Features: Merchant Profile
• Binning: Transaction & Revenue Data
• Trendlines: Weekly trends in transactions
• Binarization: Payment Methods / Cross Border
• Seasonality: Tune weights for Transaction data
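Two of the strategies on this slide, binning and binarization, can be illustrated with a minimal plain-Python sketch. This is not the presenters' code: the bin edges, the field names (`monthly_revenue`, `cross_border_txn_count`), and the sample merchant are hypothetical, chosen only to show the idea. Spark ML offers `Bucketizer` and `Binarizer` transformers for the same purpose; whether the talk used them is not stated.

```python
from bisect import bisect_right

# Hypothetical bin edges for monthly revenue; real edges would come from the data.
REVENUE_BIN_EDGES = [100.0, 1_000.0, 10_000.0]


def bin_index(value, edges):
    """Return the index of the bin that value falls into (0 .. len(edges)).

    A value equal to an edge is placed in the bin to the right of that edge.
    """
    return bisect_right(edges, value)


def binarize(value, threshold=0.0):
    """Return 1.0 if value is strictly above threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


# Hypothetical merchant record, for illustration only.
merchant = {"monthly_revenue": 2_500.0, "cross_border_txn_count": 3}

features = {
    # Continuous revenue becomes a small categorical bucket id.
    "revenue_bin": bin_index(merchant["monthly_revenue"], REVENUE_BIN_EDGES),
    # "Has any cross-border payment" becomes a 0/1 flag.
    "has_cross_border": binarize(merchant["cross_border_txn_count"]),
}
print(features)
```

Binning trades precision for robustness to outliers in revenue data, while binarization reduces a count to a yes/no signal when only presence matters.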
14. We’re here
Learning to do Machine learning
• Explore data: Let’s analyze what kind of data we have
• Data Prep: Let’s prepare the data for machine learning
• Stop churn: We’re done & merchants are happy!
18. Data Preparation
Indicator variables
• Attribute X takes several types: Type 1, Type 2, Type 3
• Count the occurrences of each type
• Calculate the most active type
• Result: 1 feature, Most Active Type
• E.g. Gender, Monthly transaction count
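The reduction described on this slide (count each type of an attribute, then keep only the most active type as a single feature) can be sketched in plain Python. The event values and the alphabetical tie-breaking rule are assumptions made for illustration; the slide does not say how ties were handled.

```python
from collections import Counter


def most_active_type(events):
    """Collapse a list of observed types into one feature: the most frequent type.

    Ties are broken alphabetically (an assumption) so the result is deterministic.
    Returns None when there are no events.
    """
    counts = Counter(events)
    if not counts:
        return None
    return min(counts, key=lambda t: (-counts[t], t))


# Hypothetical payment-method events for one merchant.
payment_events = ["card", "bank", "card", "balance", "card", "bank"]
feature = most_active_type(payment_events)
print(feature)  # card
```

Collapsing several per-type counts into one categorical feature keeps the feature vector small, at the cost of discarding the relative volumes of the other types.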
22. We’re here
Learning to do Machine learning
• Explore data: Let’s analyze what kind of data we have
• Data Prep: Let’s prepare the data for machine learning
• Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model
• Stop churn: We’re done & merchants are happy!
23. Model selection
Choosing the right label (the right ‘y’)
• Candidate labels: Week, Quarter, Year, No. of days
• Problem type: Classification or Regression
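One way to read this slide: the same underlying quantity, the number of days until a merchant reaches the threshold, can serve directly as a regression target or be bucketed into a coarse horizon for classification. The sketch below assumes that reading; the bucket boundaries (7, 90, 365 days) and the `beyond_year` class are hypothetical and not taken from the talk.

```python
def horizon_label(days_to_threshold):
    """Map a day count (regression target) to a coarse class label.

    Boundaries are illustrative assumptions, not values from the presentation.
    """
    if days_to_threshold < 0:
        raise ValueError("days_to_threshold must be non-negative")
    if days_to_threshold <= 7:
        return "week"
    if days_to_threshold <= 90:
        return "quarter"
    if days_to_threshold <= 365:
        return "year"
    return "beyond_year"


# The raw day counts would be the 'y' for regression;
# the bucketed labels would be the 'y' for classification.
day_counts = [5, 45, 200, 400]
labels = [horizon_label(d) for d in day_counts]
print(labels)  # ['week', 'quarter', 'year', 'beyond_year']
```

A coarse label is easier to predict and to act on (for example, "reach out this quarter"), while the raw day count carries more information but is harder to estimate accurately.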
24. Model selection
Choosing the right model (classification)
• Logistic Regression: low accuracy
• Decision tree: accuracy improved; overfitting; add more categorical features
• Naïve Bayes: accuracy improved; overfitting persisted
• Gradient boosting tree: accuracy improved; overfitting reduced; high time to train
• Random forests: accuracy improved; overfitting reduced; low time to train
25. We’re here
Learning to do Machine learning
• Explore data: Let’s analyze what kind of data we have
• Data Prep: Let’s prepare the data for machine learning
• Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model
• Cross validation & hyperparameter tuning: Fine-tune and re-verify the model
• Stop churn: We’re done & merchants are happy!
26. Hyper-parameter tuning and Cross validation
Hyper-parameter values for Random Forest model
• Number of trees: 5, 10, 15, 20, 25
• Max Bins: 5, 10, 20, 30
• Impurity: Gini, Entropy
• Max Depth: 5, 10, 20, 30
• Feature Subset Strategy: auto
• Folds: 3
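To get a feel for the cost of this search, the grid can be enumerated in plain Python. The values are the ones listed on the slide; the dictionary keys are illustrative names, and the assumption is an exhaustive grid search with one model fit per combination per fold. In Spark ML such a grid is typically expressed with `ParamGridBuilder` and evaluated with `CrossValidator`; the sketch only counts the work involved.

```python
from itertools import product

# Values taken from the slide; key names are illustrative.
param_grid = {
    "num_trees": [5, 10, 15, 20, 25],
    "max_bins": [5, 10, 20, 30],
    "impurity": ["gini", "entropy"],
    "max_depth": [5, 10, 20, 30],
    "feature_subset_strategy": ["auto"],
}
NUM_FOLDS = 3

# Every combination of one value per hyper-parameter.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Assuming exhaustive search: one fit per combination per fold.
total_fits = len(combinations) * NUM_FOLDS
print(len(combinations), total_fits)  # 160 480
```

With 160 combinations and 3 folds, an exhaustive search trains 480 models, which is why time to train appears among the model selection criteria.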
27. Hyper-parameter tuning and Cross validation
How do we measure whether we have the right model?
• Accuracy
• AUC ROC
• Precision
• Recall
• AUC PR
• F1
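Four of these metrics follow directly from the confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions; the counts are made-up numbers, not results from the talk. AUC ROC and AUC PR are omitted because they need ranked scores rather than counts at a single threshold.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for an imbalanced problem: 60 positives out of 1000 merchants.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=930)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example accuracy is 0.97 even though a third of the positives are missed (recall is about 0.67), which illustrates why accuracy alone can be misleading when the positive class is rare.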
28. Hyper-parameter tuning and Cross validation
[Chart: Model comparison. Accuracy, auROC and auPR on a 0 to 1 scale for Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree and Random Forests]
29. Hyper-parameter tuning and Cross validation
[Chart: Model comparison. Accuracy, auROC, auPR and Best F1 on a 0 to 1 scale for Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree and Random Forests]
30. We’re here
Learning to do Machine learning
• Explore data: Let’s analyze what kind of data we have
• Data Prep: Let’s prepare the data for machine learning
• Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model
• Cross validation & hyperparameter tuning: Fine-tune and re-verify the model
• Pipeline and learnings: View the final pipeline and learnings
• Stop churn: We’re done & merchants are happy!
31. Pipeline and Learnings
Final pipeline view
• Data: Transaction Data, Merchant Profile, Customer Behavior, Demo-social
• Merchants: Existing Merchants, New Merchants
• Algorithms: Revenue Prediction Algo, Timeline Prediction Algo
• ML Models: Timeline Prediction, Revenue Prediction
• Time-based merchant selection
• Channel: Salesforce Alerts, Email Notification
32. Pipeline and Learnings
Learnings
• Hypothesis testing
• Outlier removal
• Hyperparameter tuning
• Categorical features vs continuous features
• Time to train
• Accuracy
33. We’re here
Learning to do Machine learning
• Explore data: Let’s analyze what kind of data we have
• Data Prep: Let’s prepare the data for machine learning
• Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model
• Cross validation & hyperparameter tuning: Fine-tune and re-verify the model
• Pipeline and learnings: View the final pipeline and learnings
• Stop churn: We’re done & merchants are happy!
34. Thank you, SparkML!
You’re awesome ..
• Spark ETL -> Spark ML
• Supports many models out of the box
• Scalable for large data
• Easy cross-validation
• Extensive feature transformation suite
...many more