Fraud is a persistent pain point in the telecom sector, and detecting it is like finding a needle in a haystack because of the volume and velocity of the data. Two factors are key to detecting fraud:
(1) Speed: If you can’t detect fraud in time, you are bound to lose, because the fraudsters already have what they came for. Simbox detection is one use case here: fraudsters use simboxes to bypass interconnection fees. For this use case we describe our real-time architecture, which uses Spark SQL to detect simboxes within 5 minutes.
(2) Accuracy: Fraudsters change their methods all the time, so our job is to identify their behaviour accurately using machine learning algorithms. Anomaly detection is one use case here. For this use case we describe our data mining architecture, which builds fraud models with Spark ML within 1 hour. We also discuss how some algorithms, such as K-means, the three-sigma rule, and t-digest, perform on Spark.
To address both factors, we process 8-10 billion records, 4-5 TB in total, every day. Our solution combines end-to-end data ingestion, processing, and mining of this high-volume data to detect several fraud use cases in near real time using CDR and IPTDR data, saving millions and improving the user experience.
7. Wangiri
• Wangiri (One ring-cut)
• Premium services
• Both local and international services
8. Wangiri
Problem: Someone calls a subscriber and hangs up after one ring. The curious subscriber calls back without knowing that the number is a premium service, and the next bill is unexpectedly expensive.
9. Wangiri
Solution: Design a system that detects these numbers so that we can react
• There’s no historical data, so it’s an analytical solution
• Python + Spark = PySpark
• Send an email as an alert
10. Wangiri
• Process CDRs in near real time (NRT) using Spark
• Generate features from this data in NRT
• 1 TB of data per day
• Wangiri detection rate > 50%
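The feature-generation step could be sketched roughly as follows. This is a plain-Python illustration, not the PySpark job from the talk: the CDR field names (`caller`, `callee`, `duration_s`), the thresholds, and the function name `wangiri_candidates` are all assumptions made for the example. The idea is simply that a Wangiri source tends to place very short calls to many different numbers.

```python
from collections import defaultdict

def wangiri_candidates(cdrs, max_ring_s=3, min_targets=3, min_short_ratio=0.8):
    """Flag callers whose calls are mostly very short and hit many distinct numbers.

    cdrs: iterable of dicts with hypothetical keys 'caller', 'callee', 'duration_s'.
    All thresholds are illustrative assumptions, not values from the talk.
    """
    total = defaultdict(int)      # number of calls per caller
    short = defaultdict(int)      # calls lasting at most max_ring_s per caller
    targets = defaultdict(set)    # distinct callees per caller

    for rec in cdrs:
        caller = rec["caller"]
        total[caller] += 1
        targets[caller].add(rec["callee"])
        if rec["duration_s"] <= max_ring_s:
            short[caller] += 1

    flagged = []
    for caller, n in total.items():
        short_ratio = short[caller] / n
        if len(targets[caller]) >= min_targets and short_ratio >= min_short_ratio:
            flagged.append(caller)
    return sorted(flagged)


sample = [
    {"caller": "A", "callee": "1", "duration_s": 1},
    {"caller": "A", "callee": "2", "duration_s": 0},
    {"caller": "A", "callee": "3", "duration_s": 2},
    {"caller": "B", "callee": "1", "duration_s": 120},
    {"caller": "B", "callee": "4", "duration_s": 45},
]
print(wangiri_candidates(sample))
```

In a Spark job the same per-caller aggregates would typically come from a grouped aggregation over each micro-batch or window of CDRs; the flagged callers could then feed the email alert.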
12. Anomaly Detection
Problem: Identify anomalies in international calls and determine whether action should be taken
13. Anomaly Detection
Solution: Design a system that can detect unexpected changes. Whenever there is a change, raise an alarm so that the business can quickly examine it and take action.
18. 68-95-99.7 Rule
• Implemented in Spark and SQL
• Too many false-positive alarms
• Current data should be weighted more heavily than historical data
• Not enough data to explain an anomaly
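For reference, the 68-95-99.7 (three-sigma) rule flags a value that lies more than three standard deviations from the historical mean. The sketch below shows that check in plain Python rather than Spark and SQL; the function name `three_sigma_alarm` and the sample figures are assumptions for illustration only.

```python
from statistics import mean, pstdev

def three_sigma_alarm(history, current, k=3.0):
    """Return True when `current` is more than k standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current != mu  # flat history: any deviation counts
    return abs(current - mu) > k * sigma


# Hypothetical daily call minutes to one destination.
daily_minutes = [100, 104, 98, 101, 97, 103, 99]
print(three_sigma_alarm(daily_minutes, 102))
print(three_sigma_alarm(daily_minutes, 150))
```

Every historical point carries the same weight here, which is the limitation the slide points to when it says current data should count for more than older data.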
22. K-Means Clustering
• The number of clusters must be defined in advance
• It’s impractical to define the number of clusters for each country
23. S-H-ESD
• Global (seasonal) and local (trend) analysis
• R library open-sourced by Twitter
• Prototyped in R
• Ported to Python for a production-ready version
• Easier to explain, successful results
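As the R package describes it, S-H-ESD (Seasonal Hybrid ESD) removes a seasonal component and the median from the series, then applies a generalized ESD test built on the median and the median absolute deviation (MAD). The sketch below is a deliberately reduced stand-in, not the Twitter library or the ported production code: it estimates seasonality with per-phase medians and swaps the ESD hypothesis test for a fixed robust z-score threshold. The function name, the `threshold` default, and the sample series are assumptions.

```python
from statistics import median

def robust_seasonal_anomalies(series, period, threshold=3.5):
    """Return indices whose seasonally adjusted value has a robust z-score above `threshold`.

    Assumes len(series) >= period so every phase has at least one observation.
    """
    # Seasonal component: median of all points sharing the same phase of the cycle.
    seasonal = [median(series[p::period]) for p in range(period)]
    residual = [x - seasonal[i % period] for i, x in enumerate(series)]

    center = median(residual)
    mad = median([abs(r - center) for r in residual])
    if mad == 0:
        return []  # no spread to measure against in this simplified sketch
    scale = 1.4826 * mad  # scales MAD to match the standard deviation for normal data
    return [i for i, r in enumerate(residual) if abs(r - center) / scale > threshold]


# Hypothetical call counts with a cycle of length 4 and one spike at index 9.
calls = [10, 20, 30, 20, 11, 21, 29, 20, 10, 60, 31, 19]
print(robust_seasonal_anomalies(calls, period=4))
```

Using the median and MAD instead of the mean and standard deviation keeps a single large spike from inflating the spread estimate and hiding itself.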
37. Simbox - Machine Learning
Results:
• Oversampling or undersampling doesn’t change the accuracy much
• Random forests and decision trees handle the imbalanced data set
• The accuracy can be improved
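For readers unfamiliar with the resampling step mentioned above, random undersampling can be sketched as below. This is a generic plain-Python illustration assuming binary labels (1 = simbox, 0 = normal); the function name `undersample` and the toy data are hypothetical, and the talk does not say how its resampling was implemented.

```python
import random

def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal counts (labels are 0/1)."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


rows = list(range(10))        # stand-ins for feature vectors
labels = [0] * 8 + [1] * 2    # 8 normal subscribers, 2 simbox cases
balanced_rows, balanced_labels = undersample(rows, labels)
print(len(balanced_rows), balanced_labels.count(0), balanced_labels.count(1))
```

Oversampling is the mirror image: minority-class rows are duplicated (or synthesized) until the classes balance.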