Fraud detection is a popular application of Machine Learning, but it is neither as obvious nor as common as it seems. I'll describe how QuantUp implemented it for the WARTA insurance company (a subsidiary of Talanx International AG).
The models developed reduced losses by between 10% and 30%. The project was not a simple one because of the complex claim handling process and the very rich dataset. The tools applied were R (modeling) and DataWalk (data preparation). You will learn what is important when developing such solutions in general, what was difficult in this particular project, and how to overcome possible difficulties in similar projects.
3. What do you think about…
…when you think about fraud detection with ML?
• Deep Learning, xgboost, Autoencoder, severe class imbalance?
• How to apply the model to real decision making, sample representativeness, having historical frauds identified and marked, potential features, goal function?
5. About Warta
• 2nd biggest insurer in Poland
• Full offering: Life and non-life insurances
• Member of Talanx Group
• Award winning innovator, e.g.
• InsurTech Congress award for implementing an anti-fraud solution (http://media.warta.pl/pr/356330/warta-doceniona-za-wdrozenie-platformy-datawalk)
• First comprehensive mobile app for claim handling process.
6. Project scope
Warta’s reasons to start:
• Looking for a comprehensive anti-fraud solution for non-life insurance
• Determined to improve anti-fraud KPIs
• Readiness to replace existing technologies: IBM, Statistica.
Chosen solution:
• DataWalk – data gathering, data linking, expert scoring and fraud investigations.
• QuantUp – Machine Learning algorithms to improve suspicious claim selection.
7. Integration with DataWalk
DataWalk is a Big Data software platform for connecting numerous
large data sets, both external and internal, into a single repository
for fast visual analysis.
DataWalk can be used for:
• Fast data modelling
• Fraud hypothesis prototyping
• Fraud scoring
• Fraud investigations
• Analyze your data 10x faster
• Increasing the effectiveness of anti-fraud rules up to 80%
• Pre-configured rules and scores
• 30-90 days return on investment
Demo movie: https://youtu.be/h45mheDH4uU
8. Integration with DataWalk
• DataWalk provides an easy-to-use graphical interface (Universe Viewer) to interpret and link data, as well as to create and maintain ABTs (analytical base tables).
• A well-prepared, easy-to-update data model is fundamental to ABT creation and to the credibility of predictive models.
10. Goal & results
• Goal:
• Improving detection of probable frauds for further investigation
• Probable / doubtful / suspicious claim: suspected to be a fraud but not proven to be one
• Finding and proving are two different things
• Result: improvement of the order of 30% (compared to the earlier simple models)
11. Important business questions
• How to choose claims for investigation to:
• detect highest number of fraud attempts?
• detect highest amount of fraudulent claims?
• detect highest amount with limited resources and time for detection?
• be able to prove highest number / amount of fraud attempts?
• This requirement is translated into a suitable goal function for the model and should affect the optimization criterion.
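The questions above lead to different investigation lists even when the model scores are the same. A minimal Python sketch (the project itself used R) with made-up claims, hypothetical field names, and an assumed capacity of two investigations; it only illustrates the idea and is not the ranking used in the project:

```python
# Hypothetical scored claims: (claim_id, model fraud probability, claimed amount).
claims = [
    ("A", 0.90, 1_000),
    ("B", 0.60, 20_000),
    ("C", 0.30, 50_000),
    ("D", 0.80, 2_000),
]
CAPACITY = 2  # assumed number of claims the investigators can handle


def select(claims, key, capacity=CAPACITY):
    """Return the ids of the top `capacity` claims under the given ranking key."""
    ranked = sorted(claims, key=key, reverse=True)
    return [claim_id for claim_id, _, _ in ranked[:capacity]]


# Goal "highest number of fraud attempts": rank by probability.
by_count = select(claims, key=lambda c: c[1])

# Goal "highest fraudulent amount": rank by probability * amount.
by_amount = select(claims, key=lambda c: c[1] * c[2])

print(by_count)   # ['A', 'D']
print(by_amount)  # ['C', 'B']
```

With the same scores, the two criteria pick completely different claims, which is why the business goal has to be fixed before the model is optimized.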
12. Claim case 1: Rules / human
• Description: A driver hit the rear side of a victim's car. The car was pushed into the crossroads
area and collided with a third car (Mercedes). The police were called.
• Rules:
• airbags inflated
• similar age of both drivers
• difference of cars' age >=11 years
• historical loss coefficient >=5
• Result: payment was not refused as a fraud attempt, because the description was consistent with
the damages
13. Claim case 2: Model
• Description: I (the victim) was driving in the left lane. The second driver (the culprit) was driving in the right
lane (same direction). He wanted to change lanes, did not see my car and hit it.
His car's rear left side damaged my car's right front side.
• Analysis: no clear evidence
• only one year of cars' age difference
• no age information for the second driver
• insurance policy was not new
• no claim history for drivers and cars
• Result: payment was refused because of a fraud attempt. The description did not match the
damages, so this could not be a real claim (verified)
14. How to build a model?
• Preparation of the predictors (can be complex because data from many sources must be
aggregated) in the form of an ABT
• Having the target variable in the historical data
• Build a predictive model
15. Important
• Checking if modeling is possible (the claim handling process influences the historical
data): 0% vs. 100% of claims checked
• Definition of new predictors
• Detection of false predictors
• Data enhancement: historical aggregates, textual, external
16. Inside
• Historical information about all collision parties
• Extraction of information from text notes
• Avoiding false predictors
• Boosted trees
• with a non-standard goal function
• and careful hyperparameter optimization
• Reducing the number of predictors to make the model simpler and more robust
• Handling new values, e.g. car model
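The slides do not state which goal function was used. As an illustration only: boosted-tree libraries that accept a custom objective typically ask for the first and second derivatives of the loss with respect to the raw score. The Python sketch below (the project used R) computes them for a logistic loss in which each claim carries a hypothetical weight, for example one derived from the claim amount. The function names and the weighting scheme are assumptions.

```python
import math


def sigmoid(z):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_logloss_grad_hess(raw_scores, labels, weights):
    """Per-claim gradient and hessian of weight * logloss w.r.t. the raw score."""
    grads, hess = [], []
    for z, y, w in zip(raw_scores, labels, weights):
        p = sigmoid(z)
        grads.append(w * (p - y))        # d/dz of w * logloss
        hess.append(w * p * (1.0 - p))   # d2/dz2 of w * logloss
    return grads, hess


# Two claims at raw score 0 (probability 0.5): one fraud, one legitimate.
raw = [0.0, 0.0]
labels = [1, 0]
weights = [3.0, 1.0]  # hypothetical: the fraud claim counts three times as much

grads, hess = weighted_logloss_grad_hess(raw, labels, weights)
print(grads)  # [-1.5, 0.5]
print(hess)   # [0.75, 0.25]
```

The larger gradient on the heavily weighted claim means the next tree is pushed harder to correct it, which is how a business-driven weighting enters the optimization.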
17. Pure analytics vs. business
ROC for less and more complex models
These results don’t reflect the real values and are used for illustrative purposes
22. Non-standard goal functions
• Claim amount turned out to be a strong predictor
• The amount alone could decide which claims to verify: high claims first
• Even independently of predictors / model!
Amount according to the model ordering
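A curve of this kind can be sketched as the cumulative fraudulent amount found when claims are checked in a given order. The Python example below (the project used R) uses made-up claims and compares the model ordering with a plain "highest amount first" ordering; the data and names are hypothetical.

```python
def captured_amount_curve(claims, key):
    """Cumulative fraudulent amount found when claims are checked in `key` order."""
    ordered = sorted(claims, key=key, reverse=True)
    total, curve = 0, []
    for _score, amount, is_fraud in ordered:
        if is_fraud:
            total += amount
        curve.append(total)
    return curve


# Hypothetical claims: (model score, claimed amount, confirmed fraud flag).
claims = [
    (0.9, 500, True),
    (0.7, 8_000, False),
    (0.6, 6_000, True),
    (0.2, 9_000, True),
    (0.1, 300, False),
]

model_curve = captured_amount_curve(claims, key=lambda c: c[0])
amount_curve = captured_amount_curve(claims, key=lambda c: c[1])

print(model_curve)   # [500, 500, 6500, 15500, 15500]
print(amount_curve)  # [9000, 9000, 15000, 15500, 15500]
```

In this toy data the amount-only ordering recovers more money early than the score ordering, which is the effect a goal function based on counts alone would miss.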
23. False predictors
Ranking (VIP-like): iteratively removing the best feature and rebuilding the model:
1. Active features: all
2. Build a model using active features
3. Calculate AUC and a feature ranking
4. Deactivate the best feature according to the ranking
5. Go to 2 until all features are inactive.
6. Plot and conclude
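The steps above can be sketched in Python (the project used R). To stay self-contained, the "model" here is a toy stand-in, not boosted trees: each feature is weighted by its own AUC minus 0.5, and AUC is computed in-sample. The data, the feature names and the `fit_and_rank` helper are hypothetical; a real run would use the actual model and a hold-out sample.

```python
def auc(scores, labels):
    """Probability that a random fraud scores higher than a random non-fraud."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_and_rank(data, labels, features):
    """Toy stand-in for steps 2-3: build a score, return its AUC and a ranking."""
    weights = {f: auc(data[f], labels) - 0.5 for f in features}
    scores = [
        sum(weights[f] * data[f][i] for f in features)
        for i in range(len(labels))
    ]
    ranking = sorted(features, key=lambda f: abs(weights[f]), reverse=True)
    return auc(scores, labels), ranking


def removal_curve(data, labels):
    """Repeatedly drop the top-ranked feature and record the AUC before the drop."""
    active = list(data)                      # step 1: all features active
    history = []
    while active:                            # step 5: until nothing is left
        model_auc, ranking = fit_and_rank(data, labels, active)
        best = ranking[0]
        history.append((best, model_auc))
        active.remove(best)                  # step 4: deactivate the best feature
    return history                           # step 6: plot and conclude


# Hypothetical toy data: "leak" simply copies the fraud flag.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
data = {
    "leak":        [1, 1, 1, 0, 0, 0, 0, 0],
    "amount_band": [2, 1, 2, 1, 2, 0, 0, 1],
    "night":       [1, 0, 1, 0, 1, 0, 1, 0],
}

history = removal_curve(data, labels)
for name, model_auc in history:
    print(name, round(model_auc, 3))
# leak 1.0
# amount_band 0.8
# night 0.633
```

A near-perfect AUC that drops sharply once a single feature is removed is a hint that the feature deserves a closer look as a possible false predictor.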
26. Project results
First quarter of using the full-scope solution
• Detection Rate Improvement in 1st quarter: +60%
• True Positives > 80%
• ROI = less than 2 months (!)
• Predictive models responsible for 30-40% of the
final fiscal results.
https://m.bankier.pl/wiadomosc/Polowanie-na-dawcow-polis-czyli-na-nas-7599390.html
BENEFICIARIES OF DATAWALK & R IMPLEMENTATION
Vice President Claims
• Extremely positive project ROI.
• Reduction of technology providers
• Results accomplished 6x faster and ~20x cheaper
than a similar project at a key competitor.
• Warta strengthens position of market innovator
in claim handling area.
Head of Anti-Fraud Department
• Impressively improved business results.
• Higher satisfaction and trust in analytics among
team members.
• Knowledge accumulation and knowledge
sharing within the team.
Head Analysts
• Full control over analytical environment.
• Access to all data without engaging IT.
• Expert scoring, machine learning and
investigations in one place.
• Possibility to test new fraud schemes.
27. Summary
• Predictive models alone gave a fraction of the total ROI
• The business goal is not always just directly minimizing losses, maximizing income, etc.
• It’s pretty common for DS/ML projects to get additional profit as a side effect
• ROI for such projects should be measurable and high (but not necessarily fast) for carefully
chosen business cases
• Predictive models can be significantly improved without spending much (hyperparameter
tuning, goal function, methods, etc.)
• There are pitfalls to avoid!
• Usually you don’t need fancy hardware / software (PCs + R!)
29. Business & Analytics
• Find a good business case (volume big enough)
• State the business goal and carefully translate it into analytics: use the right goal function
• Correct process of model building
• Controlled implementation
• Measuring model effectiveness compared to no model / the previous situation, using the right
KPIs (not always simple, not always possible)
30. Process & Data
• Check if modeling is possible with supervised models (fraud flags stored; correct and
representative sample; good data coverage)
• Data preparation is the most important factor
• Use many data sources
• Data enhancement: aggregates from historical data, textual, external
• Cost of data preparation!
• Detection of false predictors: if they go undetected, the model degrades in production
(detection is arduous for wide data)
31. Data sources
• ”Plain” data: basic
• Complete data related to the loss, claim, and parties involved
• Flags of historical frauds
• ”Plain” data: enhanced
• Using ZIP codes and additional statistics, e.g. fraction of forest area, unemployment rate
• Weather data
• Analysis of connections (SNA)
• Text (words from a list, n-grams, others)
• Analysis of neighbourhood using maps
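As an illustration of the text item above, word-list flags and n-grams can be derived from a claim note with a few lines of Python (the project used R). The word list, the feature names and the English note are hypothetical; real notes would need language-specific tokenization.

```python
import re

SUSPICIOUS_WORDS = {"cash", "witness", "night"}  # hypothetical word list


def tokenize(note):
    """Lowercase the note and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", note.lower())


def text_features(note):
    """Return word-list flags, a token count and the list of bigrams."""
    tokens = tokenize(note)
    features = {f"word_{w}": int(w in tokens) for w in sorted(SUSPICIOUS_WORDS)}
    features["n_tokens"] = len(tokens)
    features["bigrams"] = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return features


note = "No witness. Driver asked for cash settlement."
feats = text_features(note)

print(feats["word_cash"], feats["word_witness"], feats["word_night"])  # 1 1 0
print(feats["bigrams"][:2])  # ['no witness', 'witness driver']
```

The flags and counts can go straight into an ABT column; bigrams would usually be filtered to a fixed vocabulary first.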
32. What influences model quality?
• Solving the right business problem
• Sample representativeness
• Goal function in line with the business goal
• Right model complexity and the correct model building process
• Costs of misclassifications, e.g. false alarm rate
• Explanations of black-box predictions: help in proving fraud attempts and improve actionability
• It’s pretty hard to get everything in a single model
• Validation of the model and carefully testing its implementation
33. Methods
• Committees / ensembles of trees / boosted trees: good results, possible to use different
goal functions, variable importance, handling of NAs. Use this!
• Deep Neural Networks – for data complex enough but still having the same structure
• Manual feature extraction not necessary
• Any (almost) goal function
• Recurrent Neural Networks – working directly on events not on aggregates from ABT
• Using black box model’s prediction explanations (LIME and its friends) – to improve
actionability
34. How to improve a model?
• Average model
• vs. human / rules: +10-30%
• Good model
• vs. average model: +10-50% (depending on measurement)
• predictive power driven by data
• Incorrect model
• vs. human / rules: +0% (or losses)
• works in a computer only
Assuming that the goal function and actionability remain unchanged
35. About me
• Commercial experience in DS / ML: > 20 years, ~ 100 projects, ~ 3,000 hours of workshops
• Translating a business problem into an analytics problem + choosing adequate means to
solve the latter
• Founder & owner of QuantUp DS / ML firm
• Contact me if you need:
• During the conference
• After the conference: artur@quantup.pl