This tech talk covers how we leveraged Spark Streaming and Spark ML models to build and operationalize real-time credit card approvals for a major bank. We cover the ML capabilities in Spark and what a typical ML pipeline looks like.
We discuss the domain and the use case: how a major credit card provider is using Spark to calculate card eligibility in real time. We also share the challenges faced by the existing system and why Spark is a good fit for this class of problem.
We then take a deep dive into the tools used to design the solution and the architecture of the system, including how a Spark-based workflow was created to handle reading from Kafka, parsing, data enrichment, model selection, model scoring, and rule execution to produce the recommended output.
Finally, we cover the key challenges, learnings, and recommendations from building such a system and taking it to production.
Leveraging Spark ML for Real-Time Credit Card Approvals with Anand Venugopal and Saurabh Dutta
1. Leveraging Spark ML for Real-Time Credit Card Approvals
Case study from a large financial Institution
Anand Venugopal
Saurabh Dutta
Impetus – StreamAnalytix
2. Agenda
• Use case background
• Existing system challenges and new goals
• Solution details and lessons learnt
• Q&A
4. Background – Use Case
• Acquire legitimate, responsible customers
• Decision: Approve? Credit Limit? APR?
• Sub-second response time to make a decision
10. Decision tree – Approve? Y/N
[Decision tree diagram: the root splits on Salary >= 50,000 vs. Salary < 50,000; inner nodes split on Other Loans = Y/N and Debt Ratio < 0.7 vs. > 0.7, leading to approve/decline leaf nodes.]
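A tree like this can be trained with Spark ML. Below is a minimal Scala sketch, assuming a historical-applications DataFrame with salary, otherLoans, and debtRatio columns and a binary approved label (the column names and file path are illustrative, not from the talk):

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.DecisionTreeClassifier
  import org.apache.spark.ml.feature.VectorAssembler
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("ApprovalTree").getOrCreate()

  // Historical applications with a 0/1 "approved" label (path is an assumption)
  val applications = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///data/historical_applications.csv")

  // Assemble the slide's features (salary, other loans, debt ratio) into a vector
  val assembler = new VectorAssembler()
    .setInputCols(Array("salary", "otherLoans", "debtRatio"))
    .setOutputCol("features")

  // A shallow tree, mirroring the four-level example on the slide
  val tree = new DecisionTreeClassifier()
    .setLabelCol("approved")
    .setFeaturesCol("features")
    .setMaxDepth(4)

  val model = new Pipeline().setStages(Array(assembler, tree)).fit(applications)

Wrapping the assembler and tree in a Pipeline keeps feature assembly and the model together, so the same persisted object can later score records in the streaming job.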
13. Existing system
• Built using traditional technologies
• Microsoft .NET stack
– C#
– MS SQL Server
14. Top challenges with existing system
• Everything on a single box: not scalable, not flexible
• Model training on limited data: limits accuracy
• Data scientists work in isolation: siloed tools
• Model management: manual and cumbersome
15. Primary goals for the new system
• Ease of use for stakeholders (self-service)
• Scale: Build models on huge datasets
• Fast decision response for the end-customer
• Unified, collaborative platform
• Data Lineage / Audit capability
17. Spark Streaming
• Write streaming jobs as an extension of the core Spark API
  – Scalable
  – High throughput
  – Fault-tolerant
• Receives live input streams and divides them into micro-batches (sketch below)
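As a rough illustration of this micro-batch model feeding the approval workflow, a hedged Scala sketch using the Kafka direct stream (broker address, topic name, consumer group, and the one-second batch interval are assumptions):

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010._

  val conf = new SparkConf().setAppName("CardApprovalStream")
  val ssc = new StreamingContext(conf, Seconds(1)) // divide input into 1s micro-batches

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "kafka:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "card-approvals")

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](Seq("applications"), kafkaParams))

  // Each micro-batch: parse, enrich, score with the model, apply business rules
  stream.map(record => record.value).foreachRDD { rdd =>
    // ... workflow stages go here ...
  }

  ssc.start()
  ssc.awaitTermination()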
35. Deployment
• Transport: Kafka
• Compute: Spark + StreamAnalytix
• Storage: HDFS + Hive
• Exploration: BI Tools
– 2 nodes with sticky sessions
– Load balancer
– Zookeeper
– Tomcat
– MySQL
– RabbitMQ
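To connect the compute and storage boxes above, a sketch of how scored decisions might be persisted: the trained model scores enriched records, and results land in HDFS as Parquet so Hive and the BI tools can query them (the model path, column names, and output path are all assumptions):

  import org.apache.spark.ml.PipelineModel
  import org.apache.spark.sql.DataFrame

  // Load the persisted approval model once and reuse it across micro-batches
  val model = PipelineModel.load("hdfs:///models/approval-tree")

  def scoreAndStore(enriched: DataFrame): Unit = {
    val scored = model.transform(enriched) // adds prediction/probability columns
      .select("applicationId", "prediction", "probability")
    // Append to HDFS as Parquet; a Hive table over this path feeds the BI tools
    scored.write.mode("append").parquet("hdfs:///warehouse/approvals/decisions")
  }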
36. Project Details
• Q4 2017
• 3 months from start to finish
• 3x faster than originally planned
• Team size: 4
• Apache Spark 2.1
• On-premises Hadoop cluster with YARN
37. Learnings
• Consistent data format
• Add timeouts to third-party API calls
• Optimize stragglers
• Avoid excessive logging
• Checkpointing
• Outlier analysis
  – Using models
• Hyperparameter tuning + metric evaluation (sketch below)
• Caching
  – useNodeIdCache
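To make the tuning and caching learnings concrete, a hedged Scala sketch (grid values and column names are illustrative): CrossValidator searches tree depths and bin counts, scoring each candidate with a binary-classification metric, while setCacheNodeIds(true) enables the node-ID cache behind the useNodeIdCache setting to speed up training of deeper trees.

  import org.apache.spark.ml.classification.DecisionTreeClassifier
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  val tree = new DecisionTreeClassifier()
    .setLabelCol("approved")
    .setFeaturesCol("features")
    .setCacheNodeIds(true) // useNodeIdCache: cache per-instance node IDs during training

  val grid = new ParamGridBuilder()
    .addGrid(tree.maxDepth, Array(4, 6, 8))
    .addGrid(tree.maxBins, Array(32, 64))
    .build()

  val cv = new CrossValidator()
    .setEstimator(tree)
    .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("approved")) // areaUnderROC by default
    .setEstimatorParamMaps(grid)
    .setNumFolds(3)

  // val bestModel = cv.fit(training).bestModel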
38. Goals: Recap
• Ease of use for stakeholders (self-service)
• Scale: Build models on huge datasets
• Fast decision response for the end-customer
• Unified, collaborative platform
• Data Lineage / Audit capability