Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Aniket Kulkarni

Who are we?
• Data Engineers – ETL pipelines using Spark
• Like all great projects, we started from a hack!
• Data Engineering to Machine Learning

Agenda
1. Scale at PayPal
2. Understanding Merchant Churn
3. Machine Learning Workflow
4. Learnings
5. Spark ML

Scale at PayPal
• 200 Countries
• 25 Currencies
• 19 Million Merchants
• 237 Million Active Users
• 8 Billion Transactions per Year
• 6 Billion Events per Day

Understanding Merchant Churn
Compliance Use Case for CLAC
Story, Triggers, Impact, Insights
• Increase in compliance limitations for CLAC in 2017
• Regulations mandate that merchants complete compliance verification
• Applicable to merchants exceeding $$ in a 12-month period
• Might lead to the merchant's account being suspended
• Merchant not aware of the limitation
• Merchant did not understand how to resolve the limitation
• High impact for small merchants
• Biggest churn driver for CLAC in 2017
• $M in payments

Churn Recovery Efforts
Existing pipeline
• Limited Success
• Reactive process
• Account managers reach out to merchants already churned
• Reversing a limitation and relaunching merchants takes time
• Large set of merchants for reach-outs
Flow: New Merchants -> Merchants Get Limitation -> Merchants Churn -> Account Manager -> Relaunched Merchants

Churn Recovery Efforts
Enhanced pipeline
• Proactive Process
• Use a machine learning pipeline to predict the time to reach $$
• Reach out to merchants before limitation is reached
• Mitigate restriction and churn
Flow: New Merchants -> ML Model (Predict Revenue and Timelines) -> Merchants Likely to Get Limitation -> Account Manager -> Merchants complete regulation
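The proactive step above (reach out before the limitation is hit) can be sketched in plain Python. This is a minimal illustration, not the talk's implementation: the threshold value, the field names and the idea of a constant predicted daily revenue are all assumptions, since the slides redact the real amount as "$$" and do not show the model's output format.

```python
from math import ceil
from typing import Optional

# Hypothetical placeholder: the slides redact the real regulatory amount as "$$".
THRESHOLD = 10_000.0


def days_to_threshold(received: float, predicted_daily_revenue: float,
                      threshold: float = THRESHOLD) -> Optional[int]:
    """Whole days until cumulative receipts reach the threshold.

    Returns 0 if the threshold is already reached and None if the
    predicted revenue is not positive (the threshold is never reached).
    """
    if received >= threshold:
        return 0
    if predicted_daily_revenue <= 0:
        return None
    return ceil((threshold - received) / predicted_daily_revenue)


def merchants_to_contact(merchants, lead_days: int):
    """Ids of merchants expected to hit the threshold within lead_days, soonest first."""
    due = []
    for m in merchants:
        days = days_to_threshold(m["received"], m["predicted_daily_revenue"])
        if days is not None and days <= lead_days:
            due.append((days, m["id"]))
    return [merchant_id for _, merchant_id in sorted(due)]


# Toy data; in the talk's setting the daily revenue would come from the ML model.
merchants = [
    {"id": "m1", "received": 7000.0, "predicted_daily_revenue": 100.0},
    {"id": "m2", "received": 9500.0, "predicted_daily_revenue": 250.0},
    {"id": "m3", "received": 1000.0, "predicted_daily_revenue": 0.0},
]
print(merchants_to_contact(merchants, lead_days=14))  # ['m2']
```

Sorting by days remaining puts the most urgent merchants first, which matters when the reach-out list is large.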

ML Platform
[Diagram: Data -> Models -> Integration -> Channel]
• Metadata (Segment, Geo, Capacity, Priority, Channel, etc.)
• Models: Model 1, Model 2, Model N
• Salesforce Alerts, Salesforce SSO, E-mail, …
• Feedback Data (Optimization & Learnings)
• Performance Tracking

Learning to do Machine learning
We're here: Explore data
• Explore data: Let's analyze what kind of data we have
• Stop churn: We're done & merchants are happy!

Select Training Data
Ask questions
• Which datasets should we use to train the model?
• Should we focus only on initial transactions?
• What data is relevant for new merchants?
• Should we consider inflation and currency conversion?
• Which merchants should we use to train the model?

Data
Analyze our datasets
Categories: Payments, Activity, Consumers, Merchants
• Demographics
• Consumer Spending
• Low/mid/high shopper
• Country
• Visits data
• Payment attempts
• Successful transactions
• Account Identity
• Currency
• Country
• Industry
• Cross border
• PayPal products
• Transaction Amount
• Frequency
• New users
• Repeat users
• Transaction Type

Data Transformation Strategies
Transforming data into features
• Raw Features: Merchant Profile
• Binning: Transaction & Revenue Data
• Trendlines: Weekly trends in transactions
• Binarization: Payment Methods / Cross Border
• Seasonality: Tune weights for Transaction data
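Three of the strategies above (binning, binarization and trendlines) can be illustrated with a few lines of standard-library Python. This is a sketch under stated assumptions: the bin edges, the threshold and the helper names are hypothetical, and the slides do not say which transformers were actually used. Spark ML ships `Bucketizer` and `Binarizer` for the first two.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Binning: index of the bucket that `value` falls into.

    `edges` are sorted split points; a value equal to an edge goes
    to the bucket on its right.
    """
    return bisect_right(edges, value)


def binarize(value, threshold=0.0):
    """Binarization: 1 if value is strictly above the threshold, else 0."""
    return 1 if value > threshold else 0


def weekly_trend(counts):
    """Trendline: least-squares slope of weekly transaction counts."""
    n = len(counts)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


revenue_edges = [100, 1000, 10000]           # hypothetical bin edges
print(bin_index(250, revenue_edges))         # 1
print(binarize(3))                           # 1 (e.g. has cross-border payments)
print(weekly_trend([10, 20, 30, 40]))        # 10.0
```

A positive slope means the merchant's weekly volume is growing, which is the kind of signal a time-to-threshold model would need.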

We're here: Data Prep
• Explore data: Let's analyze what kind of data we have
• Data Prep: Let's prepare the data for machine learning
• Stop churn: We're done & merchants are happy!

Feature Engineering
Transforming data into features
• Multiple Source Stitching
• Indicator Variables
• Normalization
• Feature Selection
• Outlier Removal

Data Preparation
Multiple source stitching
Industry & Sub-industry enrichment
• Source 1: Coverage 20%
• Source 2: Coverage 30%
• Source 3: Coverage 30%
• Stitch attribute values based on accuracy
• Enriched feature: Coverage 70%
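The stitching idea can be sketched as a priority coalesce: for each merchant, take the value from the most accurate source that has one. The toy data below is made up to reproduce the coverage figures on the slide (20%, 30%, 30%, stitched to 70%). The accuracy ordering of the sources and the function names are assumptions.

```python
def stitch(sources_by_accuracy, merchant_ids):
    """Pick, per merchant, the value from the first (most accurate) source that has one."""
    enriched = {}
    for merchant_id in merchant_ids:
        for source in sources_by_accuracy:
            value = source.get(merchant_id)
            if value is not None:
                enriched[merchant_id] = value
                break
    return enriched


def coverage(feature, merchant_ids):
    """Fraction of merchants that have a value for the feature."""
    return sum(1 for m in merchant_ids if m in feature) / len(merchant_ids)


merchant_ids = [f"m{i}" for i in range(10)]

# Hypothetical sources, listed from most to least accurate.
source_1 = {"m0": "Retail", "m1": "Retail"}
source_2 = {"m1": "Apparel", "m2": "Travel", "m3": "Food"}
source_3 = {"m4": "Gaming", "m5": "Retail", "m6": "Travel"}

industry = stitch([source_1, source_2, source_3], merchant_ids)

print(coverage(source_1, merchant_ids))  # 0.2
print(coverage(industry, merchant_ids))  # 0.7
print(industry["m1"])                    # Retail (source 1 wins the conflict)
```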

Data Preparation
Indicator variables
• Attribute X takes the values Type 1, Type 2, Type 3
• Count the occurrences of each type
• Result: 3 features (Type 1 count, Type 2 count, Type 3 count)
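As a minimal sketch of this first variant, one count feature per type can be built with `collections.Counter`. The type names and the feature naming scheme are hypothetical.

```python
from collections import Counter


def type_count_features(events, types):
    """One count feature per known type; types never seen get a count of 0."""
    counts = Counter(events)
    return {f"{t}_count": counts.get(t, 0) for t in types}


# Hypothetical values of Attribute X observed for one merchant.
events = ["type1", "type3", "type1", "type2", "type1"]
print(type_count_features(events, ["type1", "type2", "type3"]))
# {'type1_count': 3, 'type2_count': 1, 'type3_count': 1}
```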

Data Preparation
Indicator variables
• Attribute X takes the values Type 1, Type 2, Type 3
• Count the occurrences of each type and calculate the most active type
• Result: 1 feature (Most Active Type)
• E.g. Gender, Monthly transaction count
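The second variant collapses the per-type counts into a single categorical feature. A small sketch follows; the tie-breaking rule (alphabetical) is an assumption, since the slides do not say how ties are handled.

```python
from collections import Counter


def most_active_type(events):
    """The type with the highest count; ties are broken alphabetically (an assumption)."""
    if not events:
        return None
    counts = Counter(events)
    best = max(counts.values())
    return min(t for t, c in counts.items() if c == best)


print(most_active_type(["type1", "type3", "type1", "type2"]))  # type1
print(most_active_type(["type2", "type1"]))                    # type1 (tie)
```

Compared with three count features, this keeps the feature space smaller at the cost of discarding the counts themselves.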

Data Preparation
Indicator variables
• Calculate buckets for Attribute X and assign an indicator
• Result: attribute bucket indicator
• E.g. Age, Income
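The bucket-indicator variant can be sketched as bucketing followed by one-hot encoding. The bucket edges and feature names below are hypothetical.

```python
from bisect import bisect_right


def bucket_indicators(value, edges, name):
    """One 0/1 indicator per bucket; exactly one of them is 1."""
    index = bisect_right(edges, value)
    return {f"{name}_bucket_{i}": int(i == index) for i in range(len(edges) + 1)}


age_edges = [25, 40, 60]  # hypothetical bucket boundaries
print(bucket_indicators(34, age_edges, "age"))
# {'age_bucket_0': 0, 'age_bucket_1': 1, 'age_bucket_2': 0, 'age_bucket_3': 0}
```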

Data Preparation
Hypothesis testing
• Chi-square Selector (pValue)
• All features -> Top 30 features
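To show what a chi-square selection step computes, here is a hand-rolled sketch for binary features and a binary label. It is not Spark's `ChiSqSelector`; it only illustrates the idea of ranking features by the p-value of a chi-square independence test and keeping the top k. The feature names and data are made up, and k is 1 here rather than the 30 on the slide.

```python
from math import erfc, sqrt


def chi_square_2x2(feature, label):
    """Pearson chi-square statistic and p-value for two 0/1 sequences."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in (0, 1)]
    stat = 0.0
    for i in (0, 1):
        for j in (0, 1):
            expected = row[i] * col[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


def top_k_features(columns, label, k):
    """Names of the k features with the smallest p-value against the label."""
    ranked = sorted(columns, key=lambda name: chi_square_2x2(columns[name], label)[1])
    return ranked[:k]


label = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "noise":        [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
    "cross_border": [1, 1, 1, 0, 0, 0, 0, 1],  # mostly agrees with the label
}
print(top_k_features(columns, label, k=1))  # ['cross_border']
```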

Data Preparation
Outliers
• Dormant Merchants
• Restriction placed to not receive funds
• Account locked
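Outlier removal along these lines amounts to filtering the training set on a few account flags. A minimal sketch, assuming hypothetical boolean fields `dormant`, `receive_restricted` and `locked` on each merchant record:

```python
def remove_outliers(merchants):
    """Drop merchants that are dormant, restricted from receiving funds, or locked."""
    return [
        m for m in merchants
        if not (m["dormant"] or m["receive_restricted"] or m["locked"])
    ]


merchants = [
    {"id": "m1", "dormant": False, "receive_restricted": False, "locked": False},
    {"id": "m2", "dormant": True,  "receive_restricted": False, "locked": False},
    {"id": "m3", "dormant": False, "receive_restricted": True,  "locked": False},
    {"id": "m4", "dormant": False, "receive_restricted": False, "locked": True},
]
print([m["id"] for m in remove_outliers(merchants)])  # ['m1']
```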

We're here: Model selection
• Explore data: Let's analyze what kind of data we have
• Data Prep: Let's prepare the data for machine learning
• Model selection: Let's discuss the approach to decide the 'y' and choose a model
• Stop churn: We're done & merchants are happy!

Model selection
Choosing the right label (the right 'y')
• Candidate labels: Week, Quarter, Year, No. of days
• Problem types: Classification, Regression
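One way to read this slide is that the same target, the time until a merchant reaches the threshold, can be framed either as a regression label (number of days) or as a classification label (a coarse time bucket). The sketch below shows that reading. The bucket cut-offs and the extra "beyond_year" class are assumptions, not taken from the talk.

```python
def regression_label(days_to_threshold):
    """Regression framing: predict the number of days directly."""
    return float(days_to_threshold)


def classification_label(days_to_threshold):
    """Classification framing: predict a coarse time bucket (hypothetical cut-offs)."""
    if days_to_threshold <= 7:
        return "week"
    if days_to_threshold <= 91:
        return "quarter"
    if days_to_threshold <= 365:
        return "year"
    return "beyond_year"


for days in (5, 45, 200):
    print(days, regression_label(days), classification_label(days))
```

Coarser buckets are easier to predict but give account managers less precise timing, so the choice of 'y' trades accuracy against usefulness.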

Model selection
Choosing the right model (Classification)
• Logistic Regression: low accuracy
• Decision tree: accuracy improved; overfitting
• Naïve Bayes: added more categorical features; accuracy improved; overfitting persisted
• Gradient boosting tree: accuracy improved; overfitting reduced; high time to train
• Random forests: accuracy improved; overfitting reduced; low time to train

We're here: Cross validation & hyperparameter tuning
• Explore data: Let's analyze what kind of data we have
• Data Prep: Let's prepare the data for machine learning
• Model selection: Let's discuss the approach to decide the 'y' and choose a model
• Cross validation & hyperparameter tuning: Fine-tune and re-verify the model
• Stop churn: We're done & merchants are happy!

Hyper-parameter tuning and Cross validation
Hyper-parameter values for Random Forest model
• Number of trees: 5, 10, 15, 20, 25
• Max Bins: 5, 10, 20, 30
• Impurity: Gini, Entropy
• Max Depth: 5, 10, 20, 30
• Feature Subset Strategy: auto
• Folds: 3
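To get a feel for the size of this search, the grid can be expanded with `itertools.product`. The sketch below does no training; it only counts the work a grid search with 3-fold cross-validation implies. The parameter names follow Spark ML's `RandomForestClassifier`; mapping the slide's labels onto them is an assumption.

```python
from itertools import product

# Values from the slide; keys use Spark ML's RandomForestClassifier parameter names.
param_grid = {
    "numTrees": [5, 10, 15, 20, 25],
    "maxBins": [5, 10, 20, 30],
    "impurity": ["gini", "entropy"],
    "maxDepth": [5, 10, 20, 30],
    "featureSubsetStrategy": ["auto"],
}
num_folds = 3


def expand_grid(grid):
    """All combinations of the grid values, one dict per combination."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


combinations = expand_grid(param_grid)
total_fits = len(combinations) * num_folds

print(len(combinations))  # 160
print(total_fits)         # 480
```

With 160 combinations and 3 folds, the search trains 480 models, which is why time to train matters when choosing between gradient boosting and random forests.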

How do we measure that we have the right model?
• Accuracy
• AUC ROC
• Precision
• Recall
• AUC PR
• F1
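Most of these metrics follow directly from the confusion counts, and AUC ROC can be read as the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below computes them in plain Python on made-up labels and scores. It is only an illustration of the definitions; in Spark the equivalent numbers would come from an evaluator rather than hand-written code.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc_roc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}

print(auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```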

Model comparison
[Chart: Accuracy, auROC and auPR on a 0 to 1 scale for Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree and Random Forests]

Model comparison
[Chart: Accuracy, auROC, auPR and Best F1 on a 0 to 1 scale for Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree and Random Forests]

We're here: Pipeline and learnings
• Explore data: Let's analyze what kind of data we have
• Data Prep: Let's prepare the data for machine learning
• Model selection: Let's discuss the approach to decide the 'y' and choose a model
• Cross validation & hyperparameter tuning: Fine-tune and re-verify the model
• Pipeline and learnings: View the final pipeline and learnings
• Stop churn: We're done & merchants are happy!

Pipeline and Learnings
Final pipeline view
• Inputs: Transaction Data, Merchant Profile, Customer Behavior, Demo-social
• Merchant groups: Existing Merchants, New Merchants
• Models: Revenue Prediction Algo (ML Model), Timeline Prediction Algo (ML Model)
• Outputs: Revenue Prediction, Timeline Prediction
• Time-based merchant selection
• Channel: Salesforce Alerts, Email Notification

Pipeline and Learnings
Learnings
• Hypothesis testing
• Outlier removal
• Hyperparameter tuning
• Categorical features vs continuous features
• Time to train
• Accuracy

Thank you, SparkML!
You're awesome...
• Spark ETL -> Spark ML
• Supports many models out of the box
• Scalable for large data
• Easy cross-validation
• Extensive feature transformation suite
• ...and many more
