This tech talk covers how we leveraged Spark Streaming and Spark ML models to build and operationalize real-time credit card approvals for a major bank. We cover the ML capabilities in Spark and what a typical ML pipeline looks like.
We discuss the domain and the use case: how a major credit card provider is using Spark to calculate card eligibility in real time. We also share the challenges faced by the existing system and why Spark is a good fit for this class of problem.
We then take a deep dive into the tools used to design the solution and the architecture of the system, including how a Spark-based workflow was created to handle reading from Kafka, parsing, data enrichment, model selection, model scoring, and rule execution to produce the recommended output.
Finally, we cover the key challenges, learnings, and recommendations from building such a system and taking it to production.
Leveraging Spark ML for Real-Time Credit Card Approvals with Anand Venugopal and Saurabh Dutta
1. Leveraging Spark ML for Real-Time Credit Card Approvals
Case study from a large financial Institution
Anand Venugopal
Saurabh Dutta
Impetus – StreamAnalytix
2. Agenda
• Use case background
• Existing system challenges and new goals
• Solution details and lessons learnt
• Q&A
4. Background – Use Case
• Acquire legitimate, responsible customers
• Decision: Approve? Credit Limit? APR?
• Sub-second response time to make a decision
10. Decision tree – Approve? Y/N
[Decision tree diagram: the root splits on Salary >= 50,000 vs. Salary < 50,000; inner nodes split on Other Loans = Y/N and Debt Ratio < 0.7 vs. > 0.7, leading to approve/decline leaf nodes.]
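A tree like this can be trained with Spark ML. Below is a minimal Scala sketch, assuming a historical-applications DataFrame with salary, otherLoans, and debtRatio columns and a binary approved label (the column names and file path are illustrative, not from the talk):

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.DecisionTreeClassifier
  import org.apache.spark.ml.feature.VectorAssembler
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("ApprovalTree").getOrCreate()

  // Historical applications with a 0/1 "approved" label (path is an assumption)
  val applications = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///data/historical_applications.csv")

  // Assemble the slide's features (salary, other loans, debt ratio) into a vector
  val assembler = new VectorAssembler()
    .setInputCols(Array("salary", "otherLoans", "debtRatio"))
    .setOutputCol("features")

  // A shallow tree, mirroring the four-level example on the slide
  val tree = new DecisionTreeClassifier()
    .setLabelCol("approved")
    .setFeaturesCol("features")
    .setMaxDepth(4)

  val model = new Pipeline().setStages(Array(assembler, tree)).fit(applications)

Wrapping the assembler and tree in a Pipeline keeps feature assembly and the model together, so the same persisted object can later score records in the streaming job.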
13. Existing system
• Built using traditional technologies
• Microsoft .NET stack
– C#
– MS SQL Server
14. Top challenges with existing system
• Everything on a single box: not scalable, not flexible
• Model training on limited data: limits accuracy
• Data scientists work in isolation: siloed tools
• Model management: manual and cumbersome
15. Primary goals for the new system
• Ease of use for stakeholders (self-service)
• Scale: Build models on huge datasets
• Fast decision response for the end-customer
• Unified, collaborative platform
• Data Lineage / Audit capability
17. Spark Streaming
• Write streaming jobs as an extension of the core Spark API
  – Scalable
  – High throughput
  – Fault-tolerant
• Receives live input streams and divides them into micro-batches (sketch below)
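As a rough illustration of this micro-batch model feeding the approval workflow, a hedged Scala sketch using the Kafka direct stream (broker address, topic name, consumer group, and the one-second batch interval are assumptions):

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010._

  val conf = new SparkConf().setAppName("CardApprovalStream")
  val ssc = new StreamingContext(conf, Seconds(1)) // divide input into 1s micro-batches

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "kafka:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "card-approvals")

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](Seq("applications"), kafkaParams))

  // Each micro-batch: parse, enrich, score with the model, apply business rules
  stream.map(record => record.value).foreachRDD { rdd =>
    // ... workflow stages go here ...
  }

  ssc.start()
  ssc.awaitTermination()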
35. Deployment
• Transport: Kafka
• Compute: Spark + StreamAnalytix
• Storage: HDFS + Hive
• Exploration: BI Tools
– 2 nodes with sticky sessions
– Load balancer
– Zookeeper
– Tomcat
– MySQL
– RabbitMQ
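To connect the compute and storage boxes above, a sketch of how scored decisions might be persisted: the trained model scores enriched records, and results land in HDFS as Parquet so Hive and the BI tools can query them (the model path, column names, and output path are all assumptions):

  import org.apache.spark.ml.PipelineModel
  import org.apache.spark.sql.DataFrame

  // Load the persisted approval model once and reuse it across micro-batches
  val model = PipelineModel.load("hdfs:///models/approval-tree")

  def scoreAndStore(enriched: DataFrame): Unit = {
    val scored = model.transform(enriched) // adds prediction/probability columns
      .select("applicationId", "prediction", "probability")
    // Append to HDFS as Parquet; a Hive table over this path feeds the BI tools
    scored.write.mode("append").parquet("hdfs:///warehouse/approvals/decisions")
  }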
36. Project Details
• Q4 2017
• 3 months from start to finish
• 3x faster than originally planned
• Team size: 4
• Apache Spark 2.1
• On-premises Hadoop cluster with YARN
37. Learnings
• Consistent data format
• Add timeouts to third-party API calls
• Optimize stragglers
• Avoid excessive logging
• Checkpointing
• Outlier analysis
  – Using models
• Hyperparameter tuning + metric evaluation (sketch below)
• Caching
  – useNodeIdCache
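To make the tuning and caching learnings concrete, a hedged Scala sketch (grid values and column names are illustrative): CrossValidator searches tree depths and bin counts, scoring each candidate with a binary-classification metric, while setCacheNodeIds(true) enables the node-ID cache behind the useNodeIdCache setting to speed up training of deeper trees.

  import org.apache.spark.ml.classification.DecisionTreeClassifier
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  val tree = new DecisionTreeClassifier()
    .setLabelCol("approved")
    .setFeaturesCol("features")
    .setCacheNodeIds(true) // useNodeIdCache: cache per-instance node IDs during training

  val grid = new ParamGridBuilder()
    .addGrid(tree.maxDepth, Array(4, 6, 8))
    .addGrid(tree.maxBins, Array(32, 64))
    .build()

  val cv = new CrossValidator()
    .setEstimator(tree)
    .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("approved")) // areaUnderROC by default
    .setEstimatorParamMaps(grid)
    .setNumFolds(3)

  // val bestModel = cv.fit(training).bestModel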
38. Goals: Recap
• Ease of use for stakeholders (self-service)
• Scale: Build models on huge datasets
• Fast decision response for the end-customer
• Unified, collaborative platform
• Data Lineage / Audit capability