Merchant Churn Prediction using SparkML at PayPal
Who are we?
• Data Engineers – ETL pipelines using Spark
• Like all great projects, we started from a hack!
• Data Engineering to Machine Learning
2
Agenda
1. Scale at PayPal
2. Understanding Merchant Churn
3. Machine Learning Workflow
4. Learnings
5. Spark ML
3
Scale at PayPal
• 200 Countries
• 25 Currencies
• 19 Million Merchants
• 237 Million Active Users
• 8 Billion Transactions per Year
• 6 Billion Events per Day
4
Understanding Merchant Churn
Compliance Use Case for CLAC
Story, Triggers, Impact, Insights
• Increase in compliance limitations for CLAC in 2017
• Regulations mandate that merchants complete compliance verification
• Applicable to merchants exceeding $$ in a 12-month period
• Might lead to the merchant's account being suspended
• Merchant not aware of limitation
• Merchant did not understand how to resolve limitation
• High impact for small merchants
• Biggest churn driver for CLAC in 2017
• $M in payments
5
Churn Recovery Efforts
Existing pipeline
• Limited success
• Reactive process
• Account managers reach out to merchants who have already churned
• Reversing the limitation and relaunching merchants takes time
• Large set of merchants for reach-outs
Flow: New Merchants -> Merchants Get Limitation -> Merchants Churn -> Account Manager -> Relaunched Merchants
6
Churn Recovery Efforts
Enhanced pipeline
• Proactive process
• Use a machine learning pipeline to predict the time to reach $$
• Reach out to merchants before the limitation is reached
• Mitigate restriction and churn
Flow: New Merchants -> ML Model (Predict Revenue and Timelines) -> Merchants Likely to Get Limitation -> Account Manager -> Merchants complete regulation
7
ML Platform
Layers: Data, Models, Integration, Channel
• Metadata (Segment, Geo, Capacity, Priority, Channel, etc.)
• Models: Model 1, Model 2, Model N
• Channels: Salesforce Alerts, Salesforce SSO, E-mail, …
• Feedback Data (Optimization & Learnings)
• Performance Tracking
8
So where do we start from?
9
We’re here
Learning to do Machine learning
• Explore data: Let’s analyze what kind of data we have
• Stop churn: We’re done & merchants are happy!
10
Select Training Data
Ask questions
• What datasets do we use for training the model?
• Should we focus only on initial transactions?
• What data is relevant for new merchants?
• Should we consider inflation and currency conversion?
• Which merchants should we use to train the model?
11
Data
Analyze our datasets
• Consumers: Demographics, Consumer Spending, Low/mid/high shopper, Country
• Activity: Visits data, Payment attempts, Successful transactions
• Merchants: Account Identity, Currency, Country, Industry, Cross border, PayPal products
• Payments: Transaction Amount, Frequency, New users, Repeat users, Transaction Type
12
13
Data Transformation Strategies
Transforming data into features
• Raw Features: Merchant Profile
• Binning: Transaction & Revenue Data
• Trendlines: Weekly trends in transactions
• Binarization: Payment Methods / Cross Border
• Seasonality: Tune weights for Transaction data
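As an illustration of the "Trendlines" strategy, a weekly trend can be summarized as the least-squares slope of transaction counts over week index. This is a minimal plain-Python sketch, not the speakers' Spark code; the function name `weekly_trend_slope` and the sample counts are hypothetical.

```python
def weekly_trend_slope(weekly_counts):
    """Least-squares slope of counts against week index 0, 1, 2, ...

    A positive slope means transactions are growing week over week.
    Returns 0.0 when there are fewer than two weeks of data.
    """
    n = len(weekly_counts)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(weekly_counts) / n
    # Covariance of (week, count) and variance of week, both unnormalized.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(weekly_counts))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


# Hypothetical merchant whose weekly transaction count grows by 2 each week.
growing = weekly_trend_slope([10, 12, 14, 16])
flat = weekly_trend_slope([7, 7, 7, 7])
```

The slope becomes a single numeric feature per merchant, which is one possible way to turn a time series into a model input.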
We’re here
Learning to do Machine learning
• Explore data: Let’s analyze what kind of data we have
• Data prep: Let’s prepare the data for machine learning
• Stop churn: We’re done & merchants are happy!
14
15
Feature Engineering
Transforming data into features
Multiple Source Stitching
Indicator Variables
Normalization
Feature Selection
Outlier Removal
Data Preparation
Multiple source stitching
Industry & Sub-industry enrichment
• Source 1: Coverage 20%
• Source 2: Coverage 30%
• Source 3: Coverage 30%
• Stitch attribute values based on accuracy
• Enriched feature: Coverage 70%
16
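The stitching idea can be sketched as follows: for each merchant, take the attribute value from the most accurate source that has one. This is a plain-Python illustration, not the speakers' implementation; the merchant IDs, industry values, and the assumption that sources are already ranked by accuracy are all hypothetical. The sample data is chosen so the coverage figures match the slide.

```python
# Hypothetical industry lookups, ordered from most to least accurate.
SOURCES = [
    {"m1": "Retail", "m2": "Travel"},                      # source 1: 20%
    {"m2": "Tourism", "m3": "Food", "m4": "Retail"},       # source 2: 30%
    {"m5": "Software", "m6": "Retail", "m7": "Food"},      # source 3: 30%
]
MERCHANTS = [f"m{i}" for i in range(1, 11)]  # ten merchants in total


def stitch(merchant_id, sources=SOURCES):
    """Return the value from the first (most accurate) source that has one."""
    for source in sources:
        value = source.get(merchant_id)
        if value is not None:
            return value
    return None


def coverage(values):
    """Fraction of entries that have a non-missing value."""
    values = list(values)
    return sum(v is not None for v in values) / len(values)


enriched = {m: stitch(m) for m in MERCHANTS}
enriched_coverage = coverage(enriched.values())
```

Because the sources overlap only partly, the stitched feature covers more merchants than any single source, and conflicts (merchant `m2` here) resolve in favor of the higher-ranked source.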
Data Preparation
Indicator variables
• Attribute X takes the values Type 1, Type 2, Type 3
• Count each type separately
• Result: 3 features (Type 1 count, Type 2 count, Type 3 count)
17
Data Preparation
Indicator variables
• Attribute X takes the values Type 1, Type 2, Type 3
• Count each type, then calculate the most active type
• Result: 1 feature (Most Active Type)
• E.g. Gender, Monthly transaction count
18
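The two indicator-variable options on the preceding slides (one count feature per type, or a single "most active type" feature) can be sketched in plain Python. The function names and sample events are hypothetical, and the alphabetical tie-break is an assumption added for determinism.

```python
from collections import Counter

TYPES = ("Type 1", "Type 2", "Type 3")


def type_count_features(events, types=TYPES):
    """One feature per type: how many times each type occurs."""
    counts = Counter(events)
    return {f"{t} count": counts.get(t, 0) for t in types}


def most_active_type(events):
    """Single feature: the type that occurs most often.

    Ties are broken alphabetically (an assumption, not from the talk).
    Returns None when there are no events.
    """
    counts = Counter(events)
    if not counts:
        return None
    return min(counts, key=lambda t: (-counts[t], t))


# Hypothetical values of "Attribute X" observed for one merchant.
events = ["Type 1", "Type 3", "Type 3", "Type 2", "Type 3"]
three_features = type_count_features(events)
one_feature = most_active_type(events)
```

The first form keeps more information at the cost of more columns; the second collapses the same counts into one categorical value.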
Data Preparation
Indicator variables
• Attribute X: calculate buckets and assign an indicator
• Result: attribute bucket indicator
• E.g. Age, Income
19
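Bucketing a continuous attribute and assigning an indicator can be sketched like this. The age boundaries are hypothetical, and this is plain Python rather than a Spark ML transformer.

```python
from bisect import bisect_right

# Hypothetical age boundaries: <25, 25-34, 35-49, 50-64, 65+
AGE_SPLITS = [25, 35, 50, 65]


def bucket_index(value, splits):
    """Index of the bucket that contains value; a boundary value goes up."""
    return bisect_right(splits, value)


def bucket_indicator(value, splits):
    """One-hot indicator vector with a 1 in the position of value's bucket."""
    idx = bucket_index(value, splits)
    return [1 if i == idx else 0 for i in range(len(splits) + 1)]


indicator_for_30 = bucket_indicator(30, AGE_SPLITS)
```

Four split points produce five buckets, so each value maps to a five-element indicator with exactly one position set.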
Data Preparation
Hypothesis testing
All features -> Chi-square Selector (pValue) -> Top 30 features
20
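To make the chi-square selection step concrete, here is a small plain-Python sketch that scores binary features against a binary label and keeps the k features with the smallest p-values. It is an illustration of the statistical test only, not the Spark ML selector used in the talk; the feature names and data are hypothetical, and the p-value formula is valid only for the 2x2 case (one degree of freedom).

```python
import math


def chi_square_2x2(feature, label):
    """Pearson chi-square statistic and p-value for two binary (0/1) columns."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def select_top_k(features, label, k):
    """Names of the k features with the smallest p-values."""
    p_values = {name: chi_square_2x2(col, label)[1]
                for name, col in features.items()}
    return sorted(p_values, key=lambda name: (p_values[name], name))[:k]


# Hypothetical data: one informative feature, one unrelated feature.
label = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "cross_border": [1, 1, 1, 1, 0, 0, 0, 0],
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],
}
selected = select_top_k(features, label, k=1)
```

A small p-value suggests the feature and the label are not independent, which is the basis for ranking features before keeping the top ones.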
Data Preparation
Outliers
• Dormant merchants
• Restriction placed to not receive funds
• Account locked
21
We’re here
Learning to do Machine learning
• Explore data: Let’s analyze what kind of data we have
• Data prep: Let’s prepare the data for machine learning
• Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model
• Stop churn: We’re done & merchants are happy!
22
Model selection
Choosing the right label
Choosing the right ‘y’
• Candidate labels: Week, Quarter, Year, No. of days
• Problem types: Classification, Regression
23
Model selection
Choosing the right model
Classification
• Logistic Regression: low accuracy
• Decision tree: accuracy improved; overfitting; add more categorical features
• Naïve Bayes: accuracy improved; overfitting persisted
• Gradient boosting tree: accuracy improved; overfitting reduced; high time to train
• Random forests: accuracy improved; overfitting reduced; low time to train
24
We’re here
Learning to do Machine learning
• Explore data: Let’s analyze what kind of data we have
• Data prep: Let’s prepare the data for machine learning
• Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model
• Cross validation & hyperparameter tuning: Fine-tune and reverify the model
• Stop churn: We’re done & merchants are happy!
25
Hyper-parameter tuning and Cross validation
Hyper-parameter values for Random Forest model
• Number of trees: 5, 10, 15, 20, 25
• Max Bins: 5, 10, 20, 30
• Impurity: Gini, Entropy
• Max Depth: 5, 10, 20, 30
• Feature Subset Strategy: auto
• Folds: 3
26
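The size of this search is worth spelling out, since it bears on the "time to train" learning. The plain-Python sketch below expands the grid from the slide and counts model fits; the dictionary keys follow Spark's Random Forest parameter names, but nothing here calls Spark, and the helper `expand_grid` is hypothetical.

```python
from itertools import product

# Values taken from the slide.
PARAM_GRID = {
    "numTrees": [5, 10, 15, 20, 25],
    "maxBins": [5, 10, 20, 30],
    "impurity": ["gini", "entropy"],
    "maxDepth": [5, 10, 20, 30],
    "featureSubsetStrategy": ["auto"],
}
NUM_FOLDS = 3


def expand_grid(grid):
    """Every combination of parameter values, as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*grid.values())]


candidates = expand_grid(PARAM_GRID)
# Each candidate is trained once per fold during cross-validation.
total_fits = len(candidates) * NUM_FOLDS
```

With 5 x 4 x 2 x 4 x 1 = 160 combinations and 3 folds, an exhaustive search trains 480 models before the final refit.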
Hyper-parameter tuning and Cross validation
How do we measure whether we have the right model?
27
Metrics: Accuracy, AUC ROC, Precision, Recall, AUC PR, F1
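Four of these metrics follow directly from the confusion-matrix counts. The sketch below computes them in plain Python for binary labels; the sample labels are hypothetical, and the two AUC metrics are left out because they need predicted scores rather than hard predictions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions for eight merchants.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Accuracy alone can mislead when one class is rare, which is one reason to track precision, recall, and F1 alongside it.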
Hyper-parameter tuning and Cross validation
Model comparison
[Chart: Accuracy, auROC, and auPR on a 0 to 1 scale for Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree, and Random Forests]
28
Hyper-parameter tuning and Cross validation
Model comparison
[Chart: Accuracy, auROC, auPR, and Best F1 on a 0 to 1 scale for Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree, and Random Forests]
29
We’re here
Learning to do Machine learning
• Explore data: Let’s analyze what kind of data we have
• Data prep: Let’s prepare the data for machine learning
• Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model
• Cross validation & hyperparameter tuning: Fine-tune and reverify the model
• Pipeline and learnings: View the final pipeline and learnings
• Stop churn: We’re done & merchants are happy!
30
Pipeline and Learnings
Final pipeline view
• Inputs: Transaction Data, Merchant Profile, Customer Behavior, Demo-social
• Merchants: Existing Merchants, New Merchants
• Algorithms: Revenue Prediction Algo, Timeline Prediction Algo
• ML Models: output Revenue Prediction and Timeline Prediction
• Time-based merchant selection
• Channel: Salesforce Alerts, Email Notification
31
Pipeline and Learnings
Learnings
• Hypothesis testing
• Outlier removal
• Hyperparameter tuning
• Categorical features vs continuous features
• Time to train
• Accuracy
32
33
Thank you, SparkML!
You’re awesome ..
• Spark ETL -> Spark ML
• Supports many models out of the box
• Scalable for large data
• Easy cross-validation
• Extensive feature transformation suite
...many more
34
QUESTIONS?
35

Speakers: Chetan Nadgire and Aniket Kulkarni

Not Tooling Around: How The Home Depot Uses Machine Learning for Vendor Accou...
Not Tooling Around: How The Home Depot Uses Machine Learning for Vendor Accou...Not Tooling Around: How The Home Depot Uses Machine Learning for Vendor Accou...
Not Tooling Around: How The Home Depot Uses Machine Learning for Vendor Accou...National Retail Federation
 
Driving Digital Transformation with Machine Learning in Oracle Analytics
Driving Digital Transformation with Machine Learning in Oracle AnalyticsDriving Digital Transformation with Machine Learning in Oracle Analytics
Driving Digital Transformation with Machine Learning in Oracle AnalyticsPerficient, Inc.
 
Using Analytics to Build Solid Supply and Supplier Management Relationships
Using Analytics to Build Solid Supply and Supplier Management RelationshipsUsing Analytics to Build Solid Supply and Supplier Management Relationships
Using Analytics to Build Solid Supply and Supplier Management RelationshipsHalo BI
 
Big Data LDN 2017: Advanced Analytics Applied to Marketing Attribution
Big Data LDN 2017: Advanced Analytics Applied to Marketing AttributionBig Data LDN 2017: Advanced Analytics Applied to Marketing Attribution
Big Data LDN 2017: Advanced Analytics Applied to Marketing AttributionMatt Stubbs
 
OpsStars Boston Workshop | Connecting Data to People Across Any Go-to-Market ...
OpsStars Boston Workshop | Connecting Data to People Across Any Go-to-Market ...OpsStars Boston Workshop | Connecting Data to People Across Any Go-to-Market ...
OpsStars Boston Workshop | Connecting Data to People Across Any Go-to-Market ...LeanData
 
Data Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisionsData Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisionsVivastream
 
Improving Data Modeling Workflow
Improving Data Modeling WorkflowImproving Data Modeling Workflow
Improving Data Modeling WorkflowLooker
 
Small Business Analytics and Metrics: How and What Do you Measure Up?
Small Business Analytics and Metrics: How and What Do you Measure Up? Small Business Analytics and Metrics: How and What Do you Measure Up?
Small Business Analytics and Metrics: How and What Do you Measure Up? Vivastream
 
How VP of CS preps for the Board
How VP of CS preps for the Board How VP of CS preps for the Board
How VP of CS preps for the Board Gainsight
 
Rplus Retail analytics solution
Rplus Retail analytics solutionRplus Retail analytics solution
Rplus Retail analytics solutionKGS Saravanan
 
Oro Meetup London - Allies: How can we really turn data into profit?
Oro Meetup London - Allies: How can we really turn data into profit?Oro Meetup London - Allies: How can we really turn data into profit?
Oro Meetup London - Allies: How can we really turn data into profit?Oro Inc.
 
Predictable results for high growth sales organizations
Predictable results for high growth sales organizationsPredictable results for high growth sales organizations
Predictable results for high growth sales organizationsConnectLeader_Marketing
 
Predictable Results for High Growth Sales Organizations
Predictable Results for High Growth Sales OrganizationsPredictable Results for High Growth Sales Organizations
Predictable Results for High Growth Sales OrganizationsKen Smith
 
Intacct webinar tech_savvy_cfo_visibility
Intacct webinar tech_savvy_cfo_visibilityIntacct webinar tech_savvy_cfo_visibility
Intacct webinar tech_savvy_cfo_visibilityIntacct Corporation
 
Modern Billing for Modern SaaS companies-original-slides
Modern Billing for Modern SaaS companies-original-slidesModern Billing for Modern SaaS companies-original-slides
Modern Billing for Modern SaaS companies-original-slidesMassimo Talia
 
Making Data Actionable; PDF
Making Data Actionable; PDFMaking Data Actionable; PDF
Making Data Actionable; PDFRich Jones
 
Five Steps to a Martech Power Stack (2021)
Five Steps to a Martech Power Stack (2021)Five Steps to a Martech Power Stack (2021)
Five Steps to a Martech Power Stack (2021)Josh Hill
 

Semelhante a Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Aniket Kulkarni (20)

Not Tooling Around: How The Home Depot Uses Machine Learning for Vendor Accou...
Not Tooling Around: How The Home Depot Uses Machine Learning for Vendor Accou...Not Tooling Around: How The Home Depot Uses Machine Learning for Vendor Accou...
Not Tooling Around: How The Home Depot Uses Machine Learning for Vendor Accou...
 
Driving Digital Transformation with Machine Learning in Oracle Analytics
Driving Digital Transformation with Machine Learning in Oracle AnalyticsDriving Digital Transformation with Machine Learning in Oracle Analytics
Driving Digital Transformation with Machine Learning in Oracle Analytics
 
Using Analytics to Build Solid Supply and Supplier Management Relationships
Using Analytics to Build Solid Supply and Supplier Management RelationshipsUsing Analytics to Build Solid Supply and Supplier Management Relationships
Using Analytics to Build Solid Supply and Supplier Management Relationships
 
Big Data LDN 2017: Advanced Analytics Applied to Marketing Attribution
Big Data LDN 2017: Advanced Analytics Applied to Marketing AttributionBig Data LDN 2017: Advanced Analytics Applied to Marketing Attribution
Big Data LDN 2017: Advanced Analytics Applied to Marketing Attribution
 
OpsStars Boston Workshop | Connecting Data to People Across Any Go-to-Market ...
OpsStars Boston Workshop | Connecting Data to People Across Any Go-to-Market ...OpsStars Boston Workshop | Connecting Data to People Across Any Go-to-Market ...
OpsStars Boston Workshop | Connecting Data to People Across Any Go-to-Market ...
 
Data Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisionsData Refinement: The missing link between data collection and decisions
Data Refinement: The missing link between data collection and decisions
 
Improving Data Modeling Workflow
Improving Data Modeling WorkflowImproving Data Modeling Workflow
Improving Data Modeling Workflow
 
Small Business Analytics and Metrics: How and What Do you Measure Up?
Small Business Analytics and Metrics: How and What Do you Measure Up? Small Business Analytics and Metrics: How and What Do you Measure Up?
Small Business Analytics and Metrics: How and What Do you Measure Up?
 
How VP of CS preps for the Board
How VP of CS preps for the Board How VP of CS preps for the Board
How VP of CS preps for the Board
 
IOD 2013 3683 Post_Russell v2 5
IOD 2013 3683 Post_Russell v2 5IOD 2013 3683 Post_Russell v2 5
IOD 2013 3683 Post_Russell v2 5
 
Day 1 (Lecture 2): Business Analytics
Day 1 (Lecture 2): Business AnalyticsDay 1 (Lecture 2): Business Analytics
Day 1 (Lecture 2): Business Analytics
 
Rplus Retail analytics solution
Rplus Retail analytics solutionRplus Retail analytics solution
Rplus Retail analytics solution
 
Oro Meetup London - Allies: How can we really turn data into profit?
Oro Meetup London - Allies: How can we really turn data into profit?Oro Meetup London - Allies: How can we really turn data into profit?
Oro Meetup London - Allies: How can we really turn data into profit?
 
Predictable results for high growth sales organizations
Predictable results for high growth sales organizationsPredictable results for high growth sales organizations
Predictable results for high growth sales organizations
 
Predictable Results for High Growth Sales Organizations
Predictable Results for High Growth Sales OrganizationsPredictable Results for High Growth Sales Organizations
Predictable Results for High Growth Sales Organizations
 
Operations benchmarking survey TCS 8th feb
Operations benchmarking survey TCS 8th febOperations benchmarking survey TCS 8th feb
Operations benchmarking survey TCS 8th feb
 
Intacct webinar tech_savvy_cfo_visibility
Intacct webinar tech_savvy_cfo_visibilityIntacct webinar tech_savvy_cfo_visibility
Intacct webinar tech_savvy_cfo_visibility
 
Modern Billing for Modern SaaS companies-original-slides
Modern Billing for Modern SaaS companies-original-slidesModern Billing for Modern SaaS companies-original-slides
Modern Billing for Modern SaaS companies-original-slides
 
Making Data Actionable; PDF
Making Data Actionable; PDFMaking Data Actionable; PDF
Making Data Actionable; PDF
 
Five Steps to a Martech Power Stack (2021)
Five Steps to a Martech Power Stack (2021)Five Steps to a Martech Power Stack (2021)
Five Steps to a Martech Power Stack (2021)
 

Mais de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 



Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Aniket Kulkarni

  • 1. Merchant Churn Prediction using SparkML at PayPal
  • 2. Who are we? • Data Engineers – ETL pipelines using Spark • Like all great projects, we started from a hack! • Data Engineering to Machine Learning
  • 3. Agenda: 1. Scale at PayPal 2. Understanding Merchant Churn 3. Machine Learning Workflow 4. Learnings 5. Spark ML
  • 4. Scale at PayPal: 200 countries; 25 currencies; 19 million merchants; 237 million active users; 8 billion transactions per year; 6 billion events per day
  • 5. Understanding Merchant Churn: Compliance Use Case for CLAC. Columns: Story, Triggers, Impact, Insights. Increase in compliance limitations for CLAC in 2017. Regulations mandate that merchants complete compliance verification; applicable to merchants exceeding $$ in a 12-month period; might lead to the merchant’s account being suspended. Merchant not aware of the limitation; merchant did not understand how to resolve the limitation. High impact for small merchants; biggest churn driver for CLAC in 2017; $M in payments.
  • 6. Churn Recovery Efforts: Existing pipeline • Limited success • Reactive process • Account managers reach out to merchants who have already churned • Reversing the limitation and relaunching merchants takes time • Large set of merchants for reach-outs. Diagram labels: New Merchants; Merchants Get Limitation; Account Manager; Relaunched Merchants; Merchants Churn
  • 7. Churn Recovery Efforts: Enhanced pipeline • Proactive process • Use a machine learning pipeline to predict the time to reach $$ • Reach out to merchants before the limitation is reached • Mitigate restriction and churn. Diagram labels: New Merchants; Merchants Likely to Get Limitation; Account Manager; Merchants complete regulation; ML Model; Predict Revenue and Timelines
  • 8. ML Platform: Data, Models, Integration, Channel. Metadata (Segment, Geo, Capacity, Priority, Channel, etc.). Data; Channel; Integration. Model 1, Model 2, Model N. Salesforce Alerts; Salesforce SSO; E-mail; … Feedback Data (Optimization & Learnings). Performance Tracking.
  • 9. So where do we start from?
  • 10. We’re here. Learning to do machine learning. Explore data: Let’s analyze what kind of data we have. Stop churn: We’re done & merchants are happy!
  • 11. Select Training Data: Ask questions. What datasets do we use for training the model? Should we focus only on initial transactions? What data is relevant for new merchants? Should we consider inflation and currency conversion? Which merchants should we use to train the model?
  • 12. Data: Analyze our datasets. PAYMENTS ACTIVITY. Consumers; Merchants. Demographics; Consumer Spending; Low/mid/high shopper; Country; Visits data; Payment attempts; Successful transactions; Account Identity; Currency; Country; Industry; Cross border; PayPal products; Transaction Amount; Frequency; New users; Repeat users; Transaction Type
  • 13. Data Transformation Strategies: Transforming data into features. Raw features: Merchant Profile. Binning: Transaction & Revenue Data. Trendlines: Weekly trends in transactions. Binarization: Payment Methods / Cross Border. Seasonality: Tune weights for Transaction data.
  • 14. We’re here. Learning to do machine learning. Explore data: Let’s analyze what kind of data we have. Stop churn: We’re done & merchants are happy! Data prep: Let’s prepare the data for machine learning.
  • 15. Feature Engineering: Transforming data into features. Multiple Source Stitching; Indicator Variables; Normalization; Feature Selection; Outlier Removal
  • 16. Data Preparation: Multiple source stitching. Industry & sub-industry enrichment. Source 1: coverage 20%; Source 2: coverage 30%; Source 3: coverage 30%. Stitch attribute values based on accuracy. Enriched feature: coverage 70%.
  • 17. Data Preparation: Indicator variables. Attribute X has Type 1, Type 2 and Type 3; a count per type gives 3 features: Type 1 count, Type 2 count, Type 3 count.
  • 18. Data Preparation: Indicator variables. Attribute X has Type 1, Type 2 and Type 3, each with a count; calculate the most active type to get 1 feature: Most Active Type. E.g. Gender, Monthly transaction count.
  • 19. Data Preparation: Indicator variables. Attribute X: calculate buckets and assign an indicator, giving an attribute bucket indicator. E.g. Age, Income.
  • 20. Data Preparation: Hypothesis testing. All features -> Chi-square Selector (pValue) -> Top 30 features
  • 21. Data Preparation: Outliers. Dormant merchants; restriction placed to not receive funds; account locked.
  • 22. We’re here. Learning to do machine learning. Explore data: Let’s analyze what kind of data we have. Stop churn: We’re done & merchants are happy! Data prep: Let’s prepare the data for machine learning. Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model.
  • 23. Model selection: Choosing the right label (the right ‘y’). Week; Quarter; Year; No. of days. Classification; Regression.
  • 24. Model selection: Choosing the right model (classification). Models: Logistic Regression; Decision tree; Naïve Bayes; Gradient boosting tree; Random forests. Observations, in slide order: (1) Low accuracy. (2) Accuracy improved; overfitting; add more categorical features. (3) Accuracy improved; overfitting persisted. (4) Accuracy improved; overfitting reduced; high time to train. (5) Accuracy improved; overfitting reduced; low time to train.
  • 25. We’re here. Learning to do machine learning. Explore data: Let’s analyze what kind of data we have. Stop churn: We’re done & merchants are happy! Data prep: Let’s prepare the data for machine learning. Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model. Cross validation & hyperparameter tuning: Fine-tune and reverify the model.
  • 26. Hyper-parameter tuning and Cross validation: Hyper-parameter values for Random Forest model. Number of trees: 5, 10, 15, 20, 25. Max Bins: 5, 10, 20, 30. Impurity: Gini, Entropy. Max Depth: 5, 10, 20, 30. Feature Subset Strategy: auto. Folds: 3.
  • 27. Hyper-parameter tuning and Cross validation: How do we measure whether we have the right model? Accuracy; AUC ROC; Precision; Recall; AUC PR; F1
  • 28. Hyper-parameter tuning and Cross validation: Model comparison (chart, scale 0 to 1). Series: Accuracy, auROC, auPR. Models: Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree, Random Forests.
  • 29. Hyper-parameter tuning and Cross validation: Model comparison (chart, scale 0 to 1). Series: Accuracy, auROC, auPR, Best F1. Models: Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree, Random Forests.
  • 30. We’re here. Learning to do machine learning. Explore data: Let’s analyze what kind of data we have. Stop churn: We’re done & merchants are happy! Data prep: Let’s prepare the data for machine learning. Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model. Cross validation & hyperparameter tuning: Fine-tune and reverify the model. Pipeline and learnings: View the final pipeline and learnings.
  • 31. Pipeline and Learnings: Final pipeline view. Diagram labels: Transaction Data; Merchant Profile; Customer Behavior; Existing Merchants; New Merchants; Revenue Prediction Algo; Timeline Prediction Algo; ML Model; ML Model; Timeline Prediction; Revenue Prediction; Time-based merchant selection; Channel: Salesforce Alerts, Email Notification; Demo-social
  • 32. Pipeline and Learnings: Learnings • Hypothesis testing • Outlier removal • Hyperparameter tuning • Categorical features vs continuous features • Time to train • Accuracy
  • 33. We’re here. Learning to do machine learning. Explore data: Let’s analyze what kind of data we have. Stop churn: We’re done & merchants are happy! Data prep: Let’s prepare the data for machine learning. Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model. Cross validation & hyperparameter tuning: Fine-tune and reverify the model. Pipeline and learnings: View the final pipeline and learnings.
  • 34. Thank you, SparkML! You’re awesome... • Spark ETL -> Spark ML • Supports many models out of the box • Scalable for large data • Easy cross-validation • Extensive feature transformation suite ...many more
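
The Random Forest grid on slide 26 and the "Time to train" learning on slide 32 are linked by simple arithmetic: an exhaustive grid search fits one model per parameter combination per fold. The sketch below is plain Python rather than Spark ML, and only counts that work. The dictionary keys (`num_trees`, `max_bins`, and so on) and the helper names are illustrative assumptions, not identifiers from the talk; the values are the ones listed on the slide.

```python
from itertools import product

# Hyper-parameter values as listed on the Random Forest tuning slide.
# The key names are hypothetical labels chosen for this sketch.
param_grid = {
    "num_trees": [5, 10, 15, 20, 25],
    "max_bins": [5, 10, 20, 30],
    "impurity": ["gini", "entropy"],
    "max_depth": [5, 10, 20, 30],
    "feature_subset_strategy": ["auto"],
}
NUM_FOLDS = 3  # "Folds: 3" on the slide


def expand_grid(grid):
    """Return every combination of the grid values as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def count_cv_fits(grid, num_folds):
    """Number of model fits in an exhaustive k-fold search over the grid."""
    return len(expand_grid(grid)) * num_folds


combinations = expand_grid(param_grid)
total_fits = count_cv_fits(param_grid, NUM_FOLDS)
print(len(combinations), total_fits)
```

Under these assumptions the grid expands to 5 × 4 × 2 × 4 × 1 = 160 combinations, or 480 fits with 3 folds. A grid-search tool may also refit the winning combination once on the full training set. Either way, the count shows why training time per model mattered when comparing Gradient-boosting trees with Random Forests.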