Deep Learning for Fraud Detection
- 2. © 2014 MapR Technologies 2
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC member for Apache Drill, ZooKeeper & others
VP of the Incubator at the Apache Software Foundation
Email tdunning@apache.org tdunning@maprtech.com
Twitter @ted_dunning
- 4. © 2014 MapR Technologies 4
Goals for Today
• Explore the state of the art for deep-learning and fraud detection
• Separate at least some of the wheat from the chaff
• Provide some realistic guidance for getting results
• Play with cool stuff !
- 5. © 2014 MapR Technologies 5
Agenda
• Motivation
• What are neural networks and deep learning?
• It can be simpler than you think
• But, no free lunch / you get what you pay for / other clever aphorism
• Some experiments
• Where to go from here
- 6. © 2014 MapR Technologies 6
Motivation For Advanced Modeling in Fraud
• Neural networks have completely dominated credit card fraud
detection since the late ’80s
– Random forest, tree ensembles often used in other kinds of fraud and
churn models
• The reason is that rule-based systems simply don’t work
– Well, they do work at first
– Fraudsters change tactics, you add rules, interaction mayhem ensues
• And learning algorithms really do work
– Fraudsters change tactics, you add features and retrain
- 9. © 2014 MapR Technologies 9
So learning is good
But good learning is hard
And finding good features is
really hard
- 13. © 2014 MapR Technologies 13
Some Sample Features
• Charge size relative to previous averages for card
• Charge size relative to previous average for merchant
• Known merchant or not
• Doubled transaction
• Address Verification System or Card Verification Value mismatch
• Unusual region for card
• Unusual time-of-day relative to history
• Magstripe use if chip available
• (hundreds more)
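As a concrete illustration of the first feature above, here is a minimal sketch (not the talk's production code; the record layout and amounts are made up):

```python
# Hypothetical sketch: charge size relative to the card's previous average.
import math

def charge_size_feature(amount, history):
    """Log-ratio of this charge to the card's historical mean charge."""
    if not history:
        return 0.0                      # no history: treat as neutral
    mean = sum(history) / len(history)
    return math.log(amount / mean)      # > 0 means larger than usual

# Example: a $950 charge on a card that usually sees ~$40 charges
print(charge_size_feature(950.0, [25.0, 40.0, 55.0]))   # strongly positive
```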
- 14. © 2014 MapR Technologies 14
Sequence Based Features
• Plausible pattern matching (rent a car, pay for gas at airport)
• Probe transactions (gas in wrong place, pizza, big charge)
• Previous transaction at compromised merchant
• Card velocity
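As an illustration of the last item, a minimal card-velocity sketch (coordinates, timestamps, and units are assumptions, not from the talk):

```python
# Hedged sketch: implied speed between two consecutive swipes of one card.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def card_velocity(prev, curr):
    """km/h implied by two transactions given as (lat, lon, epoch_seconds)."""
    dist = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = max((curr[2] - prev[2]) / 3600.0, 1e-6)
    return dist / hours

# NYC swipe followed 30 minutes later by a London swipe: implausibly fast
print(card_velocity((40.71, -74.01, 0), (51.51, -0.13, 1800)))
```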
- 15. © 2014 MapR Technologies 15
Key Problems
• Good guys need data … that means that fraudsters get first
chance at bat
• Good guys are careful and test systems before releasing
• Bad guys have many low-risk transactions and can change
methods quickly
• In some areas, fraudsters adapt their techniques within hours
- 16. © 2014 MapR Technologies 16
Making up features is easy
Finding features that add
real lift is very hard
- 17. © 2014 MapR Technologies 17
What are neural networks and deep learning?
• Start simple … imagine we have 20 features, 0 or 1
– Let’s yell “Fraud” if any of the features is a 1
– Houston, we have a model
• But this model isn’t any better than a rule
• Also doesn’t have any interesting Greek letters
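For concreteness, that “model” really is one line of Python (a sketch, nothing more):

```python
# The "yell Fraud if any feature fires" model from this slide.
# It is exactly a rule, which is the point being made.
def naive_model(features):          # features: list of 0/1 indicators
    return any(f == 1 for f in features)

print(naive_model([0] * 19 + [1]))  # True: one indicator is enough
```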
- 18. © 2014 MapR Technologies 18
Real-world Intrudes
• We assumed all features are equally good
– What if some are kind of poor or weak?
• Can we weight different features more or less?
– Can we learn these weights from data?
- 20. © 2014 MapR Technologies 20
Learning Works
• Yes. We can learn these models
• How we measure error is important
• We must have good features
• Even good features may need transformation
– Take logs of times and monetary values
– Subtract means, scale, bin values
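For example, those transformations in numpy (the column values here are made up):

```python
# Sketch of the listed transforms: log, center, scale, and bin.
import numpy as np

amounts = np.array([12.0, 45.0, 9.0, 980.0, 33.0])

logged = np.log(amounts)                             # take logs of monetary values
centered = logged - logged.mean()                    # subtract means
scaled = centered / logged.std()                     # scale to unit variance
binned = np.digitize(scaled, bins=[-1.0, 0.0, 1.0])  # coarse bins 0..3

print(scaled.round(2), binned)
```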
- 21. © 2014 MapR Technologies 21
Not Good Enough
• We need combinations of models
• Simple linear combinations aren’t subtle enough
• Enter multi-level models
– Can we learn a model that uses combinations of inputs
– Where each of those combinations is a model that we learn?
- 22. © 2014 MapR Technologies 22
Yes, Virginia, There IS a Santa Claus
Each circle is a sum
and a (soft) threshold
Arrows are multiplication
by a learned weight
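A literal sketch of that picture in numpy (layer sizes and weights here are arbitrary, not from the talk):

```python
# Each circle sums its weighted inputs and applies a soft threshold
# (here, a logistic sigmoid). Arrows are the learned weight matrices.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(20)                # 20 input features
W1 = rng.normal(size=(8, 20))     # arrows into 8 hidden circles
W2 = rng.normal(size=(1, 8))      # arrows into the output circle

hidden = sigmoid(W1 @ x)          # hidden circles: weighted sum + soft threshold
output = sigmoid(W2 @ hidden)     # output circle gives a fraud score in (0, 1)
print(output[0])
```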
- 23. © 2014 MapR Technologies 23
Errors on Output Can Propagate
Each circle sends error back along each arrow
Arrows weight the back-propagating errors
[Diagram: inputs feeding a hidden layer]
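And a matching sketch of that backward pass (illustrative numpy, with a logistic loss assumed):

```python
# The output error is pushed back through the same weights that carried
# the forward signal; each layer then takes a small gradient step.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(20)
W1, W2 = rng.normal(size=(8, 20)), rng.normal(size=(1, 8))
y = 1.0                                   # true label: fraud

h = sigmoid(W1 @ x)
out = sigmoid(W2 @ h)

d_out = out - y                           # error at the output circle
d_h = (W2.T @ d_out) * h * (1 - h)        # arrows weight the error flowing back
W2 -= 0.1 * np.outer(d_out, h)            # gradient step on each layer's weights
W1 -= 0.1 * np.outer(d_h, x)
```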
- 24. © 2014 MapR Technologies 24
Success!
Triumph!
World domination!
- 25. © 2014 MapR Technologies 25
World domination!
With some reservations
because features are hard
- 26. © 2014 MapR Technologies 26
Turtles All the Way Down – We Wish
• This learning works well for just a few layers
• This is still a big deal …
– with cool features, we can build real systems
• With many layers, the learning no longer converges
• Well … until recently
- 27. © 2014 MapR Technologies 27
Model Learning in an Ideal World
• If we could just learn the features
– Maybe unsupervised, maybe supervised
– And at the same time learn the model
• Presumably we could build models quicker
• And more easily
• And we wouldn’t have to dirty our minds with
pedestrian domain knowledge
- 28. © 2014 MapR Technologies 28
Example 1 – (not very) Deep Auto-encoder
• Let’s take an example where we can learn features
• Data is EKG traces
• We want to find anomalies
– No supervised training
- 29. © 2014 MapR Technologies 29
Spot the Anomaly
Anomaly?
- 31. © 2014 MapR Technologies 31
Where’s Waldo?
This is the real
anomaly
- 32. © 2014 MapR Technologies 32
Normal Isn’t Just Normal
• What we want is a model of what is normal
• What doesn’t fit the model is the anomaly
• For simple signals, the model can be simple …
• The real world is rarely so accommodating
x ~ m(t) + N(0, ε)
- 42. © 2014 MapR Technologies 42
Windows on the World
• The set of windowed signals is a nice model of our original signal
• Clustering can find the prototypes
– Fancier techniques available using sparse coding
• The result is a dictionary of shapes
• New signals can be encoded by shifting, scaling and adding
shapes from the dictionary
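A hedged sketch of that pipeline using scikit-learn's KMeans (window width, k, and the stand-in signal are arbitrary choices; the presenter's own demos are in the GitHub repo cited a few slides on):

```python
# Windowed clustering as an auto-encoder: cluster the windows to get a
# dictionary of shapes, then score new windows by reconstruction error.
import numpy as np
from sklearn.cluster import KMeans

def windows(signal, width=32, step=16):
    return np.array([signal[i:i + width] for i in range(0, len(signal) - width, step)])

# Stand-in for an EKG: a noisy periodic signal
t = np.arange(20000)
signal = np.sin(t / 20.0) + 0.1 * np.random.default_rng(0).normal(size=t.size)

W = windows(signal)
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(W)  # dictionary of shapes

# Reconstruction error: distance from each window to its nearest prototype
recon = kmeans.cluster_centers_[kmeans.predict(W)]
error = np.linalg.norm(W - recon, axis=1)
print(error.mean(), error.max())   # spikes in `error` flag anomalous windows
```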
- 43. © 2014 MapR Technologies 43
Most Common Shapes (for EKG)
- 44. © 2014 MapR Technologies 44
Reconstructed signal
Original
signal
Reconstructed
signal
Reconstruction
error
< 1 bit / sample
- 45. © 2014 MapR Technologies 45
An Anomaly
Original technique for finding
1-d anomaly works against
reconstruction error
- 46. © 2014 MapR Technologies 46
Close-up of anomaly
Not what you want your
heart to do.
And not what the model
expects it to do.
- 47. © 2014 MapR Technologies 47
A Different Kind of Anomaly
- 48. © 2014 MapR Technologies 48
Some k-means Caveats
• But Eamonn Keogh says that k-means can’t work on time series
• That is silly … and also kind of correct; k-means does have limits
– Other kinds of auto-encoders are much more powerful
• More fun and code demos at
– https://github.com/tdunning/k-means-auto-encoder
http://www.cs.ucr.edu/~eamonn/meaningless.pdf
- 49. © 2014 MapR Technologies 49
The Limits of Clustering as Auto-encoder
• Clustering is like trying to tile your sample distribution
• Can be used to approximate a signal
• Filling a d-dimensional region with k clusters should give an error of about
ε ≈ k^(−1/d)
• If d is large, this is no good
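A quick back-of-envelope consequence of the ε ≈ k^(−1/d) scaling above:

```python
# Rearranging eps = k**(-1/d) gives k = eps**(-d): the number of clusters
# needed for a fixed target error explodes with dimension.
for d in (2, 8, 32):
    k = 0.1 ** (-d)          # clusters needed for eps = 0.1
    print(f"d={d:2d}: need about {k:.3g} clusters")
# d= 2: need about 100 clusters
# d= 8: need about 1e+08 clusters
# d=32: need about 1e+32 clusters
```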
- 50. © 2014 MapR Technologies 50
[Figure: time-series training data (first 2000 samples); test data shown with its reconstruction and the reconstruction error]
- 51. © 2014 MapR Technologies 51
[Figure: reconstruction error for time-series data vs. number of centroids; moving-average error for training and held-out data]
- 52. © 2014 MapR Technologies 52
Moral For Auto-encoders
• The simplest auto-encoders can be good models
• For more complex spaces/signals, more elaborate models may
be required
– Winner take (absolutely) all may be problematic
– In particular, models that allow sparse linear combination may be better
• Consider deep learning, recurrent networks, denoising
- 53. © 2014 MapR Technologies 53
How Does Clustering Do Reconstruction?
[Diagram: inputs x1 … xn feeding a layer of cluster centroids]
For normalized cluster centroids, dot-product and distance are equivalent
- 54. © 2014 MapR Technologies 54
How Does Clustering Do Reconstruction?
[Diagram: inputs x1 … xn feeding the cluster layer]
Winner takes all with k-means
- 55. © 2014 MapR Technologies 55
How Does Clustering Do Reconstruction?
[Diagram: inputs x1 … xn, hidden layer (clusters), reconstruction x'1 … x'n]
The dot-product scales the centroid to reconstruct the input
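In code, those three steps look like this (an illustrative numpy sketch with random, unit-normalized centroids):

```python
# Score every centroid by dot-product, let the winner take all, then scale
# that centroid back out as the reconstruction.
import numpy as np

rng = np.random.default_rng(0)
centroids = rng.normal(size=(50, 32))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

x = rng.normal(size=32)                     # input window
scores = centroids @ x                      # dot-products (hidden layer)
winner = int(np.argmax(scores))             # winner takes all
x_hat = scores[winner] * centroids[winner]  # scaled centroid = reconstruction
print(np.linalg.norm(x - x_hat))            # reconstruction error
```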
- 56. © 2014 MapR Technologies 56
AKA - Neural Network
[Diagram: the same picture drawn as a neural network: inputs x1 … xn, hidden layer (clusters), reconstruction x'1 … x'n]
- 57. © 2014 MapR Technologies 57
What If … We Had More Layers?
[Diagram: a deeper stack of layers mapping A → B → A' through several hidden layers]
- 61. © 2014 MapR Technologies 61
Other Thoughts
• What if we allow more than one cluster to be active?
– k-sparse learning!
• Well, almost
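A sketch of that idea: keep the top-k activations instead of only the winner (illustrative numpy, not a full k-sparse training loop):

```python
# k-sparse encoding: keep the k best dot-products, zero the rest, and
# reconstruct as a sparse linear combination of centroids.
import numpy as np

def k_sparse_reconstruct(x, centroids, k=3):
    scores = centroids @ x
    keep = np.argsort(scores)[-k:]             # indices of the k best matches
    code = np.zeros_like(scores)
    code[keep] = scores[keep]                  # sparse code
    return code @ centroids                    # sum of k scaled centroids

rng = np.random.default_rng(0)
centroids = rng.normal(size=(50, 32))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
x = rng.normal(size=32)
print(np.linalg.norm(x - k_sparse_reconstruct(x, centroids)))
```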
- 62. © 2014 MapR Technologies 62
The Point of Deep Learning
• It isn’t just many hidden layers in a neural network
• The goal is to eliminate feature engineering by learning features
as well as the classifier
- 63. © 2014 MapR Technologies 63
Experiment 3 – Card Velocity
• Most features so far are inherent in the data
• Few are true sequence features
• Card velocity is a pure combination
– Starting point can be anywhere
– The issue is where the next point is relative to starting point
- 64. © 2014 MapR Technologies 64
Card Velocity
Non-fraud steps are
reasonable in terms
of distance and time
Fraudulent use of card
by multiple attackers
results in big, fast jumps
- 65. © 2014 MapR Technologies 65
Synthetic Data Example
• Generate random point
• Take four small steps
• If fraud, second step can be large
• Result is five positions, each in 3-d on surface of a sphere
– Data shape is N x (5 x 3)
• Add secondary features containing step size … N x 4
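A hedged reconstruction of that recipe in numpy (the step sizes and fraud rate are guesses):

```python
# Points live on a unit sphere; each trace takes four steps, and in the
# fraud case the second step is large.
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

def make_trace(fraud):
    p = unit(rng.normal(size=3))          # random starting point on the sphere
    trace = [p]
    for step in range(4):                 # four steps ...
        scale = 1.0 if (fraud and step == 1) else 0.05  # ... second one big if fraud
        p = unit(p + scale * rng.normal(size=3))
        trace.append(p)
    return np.array(trace)                # five positions, each in 3-d

y = rng.random(1000) < 0.1                # hypothetical 10% fraud rate
X = np.stack([make_trace(f) for f in y])  # data shape N x (5 x 3)
steps = np.linalg.norm(np.diff(X, axis=1), axis=2)  # step-size features, N x 4
print(X.shape, steps.shape)               # (1000, 5, 3) (1000, 4)
```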
- 66. © 2014 MapR Technologies 66
The Truth is Out There
• With the right feature (step-size),
it is trivial to spot the fraud
• Here we show the step size
between positions
• Fraud cases take a big jump that
others don’t
• But they can be anywhere
- 67. © 2014 MapR Technologies 67
But Dimensionality Bites Hard
• With the step-size feature, learning succeeds instantly with the
simplest models and gets nearly perfect accuracy
• Without the step-size feature, learning with TensorFlow gets
modest accuracy after substantial learning cost (work in
progress, could do better with lots more tuning)
• The problem is that there are too many combinations of 15
variables; we need a very specific combination of three pair-wise
differences combined non-linearly into a distance
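A sketch of that comparison using the Keras API in TensorFlow (the tool named in the talk); the hyperparameters are guesses and the data generator is a compressed version of the earlier synthetic-data sketch:

```python
# Train the same small classifier on raw coordinates vs. step-size features.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

# Compressed version of the earlier generator: 5 sphere points per trace
y = (rng.random(2000) < 0.1).astype("float32")
X = []
for fraud in y:
    p = unit(rng.normal(size=3))
    trace = [p]
    for step in range(4):
        scale = 1.0 if (fraud and step == 1) else 0.05
        p = unit(p + scale * rng.normal(size=3))
        trace.append(p)
    X.append(np.concatenate(trace))
X = np.array(X, dtype="float32")                                      # N x 15
steps = np.linalg.norm(np.diff(X.reshape(-1, 5, 3), axis=1), axis=2)  # N x 4

def fit_and_score(features):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    model.fit(features, y, epochs=20, verbose=0)
    return model.evaluate(features, y, verbose=0)[1]   # AUC

print("raw coordinates:", fit_and_score(X))        # learns slowly, modest AUC
print("step-size feature:", fit_and_score(steps))  # nearly perfect almost at once
```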
- 68. © 2014 MapR Technologies 68
[Figure: AUC and precision vs. training data size, from 10^4 to 10^6 samples]
- 69. © 2014 MapR Technologies 69
We have a
bona fide revolution
But old tricks still pay
- 70. © 2014 MapR Technologies 70
Greenfield Problem Landscape
- 71. © 2014 MapR Technologies 71
Mature Problem Landscape
- 72. © 2014 MapR Technologies 72
Summary
• There is too much to say in 40 minutes; let’s talk some more at
the MapR booth
• Deep learning, especially with systems like TensorFlow, has
huge promise
• Deep learning trades feature engineering for learning-architecture
engineering
• There are powerful middle grounds
- 74. © 2014 MapR Technologies 74
Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014 - 2016
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR
http://bit.ly/ebook-real-world-hadoop
http://bit.ly/mapr-tsdb-ebook
http://bit.ly/ebook-anomaly
http://bit.ly/recommendation-ebook
- 75. © 2014 MapR Technologies 75
Streaming Architecture
by Ted Dunning and Ellen Friedman © 2016 (published by O’Reilly)
Free copies at book
signing today
http://bit.ly/mapr-ebook-streams
- 77. © 2014 MapR Technologies 77
Q&A
@mapr maprtech
tdunning@maprtech.com
Engage with us!
MapR
maprtech
mapr-technologies