Finding Patterns in Data Breaches
Luther Martin
October 21, 2010
Overview
Attempt at humor
 Getting in the right frame of mind to think
about statistics
 A reminder of some concepts from statistics
 What we can learn from data breaches
 What this tells us
 Some generalizations that might or might not
be accurate
System development lifecycle (SDLC)
Security development lifecycle (SDLC)
Estimating some numbers
 What’s the probability of an exploitable
vulnerability existing in your web server right
now?
 What’s the probability of your web server
being hacked in the next 12 months?
 If you don’t encrypt email, what’s the
probability of it being intercepted and read on
the Internet?
 Too hard?
Some easier questions
 What’s the current mortgage foreclosure
rate?
 What’s the current fraud loss rate in the US
for payment (credit and debit) cards?
 What’s the current charge-off rate in the US
for credit card loans?
The foreclosure rate
Currently about 1 in 381 per month, or about 3 percent per year
http://www.realtytrac.com/trendcenter/
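As a rough check on the arithmetic above, the monthly figure can be converted to an annual one. This is only a sketch: the variable names are illustrative, and the compounded version assumes that months are independent, which the slide does not state.

```python
# Convert a monthly rate of "1 in 381" into an approximate annual rate.
monthly_rate = 1 / 381

# Simple approximation: twelve times the monthly rate.
annual_simple = 12 * monthly_rate

# Compounded version: chance of at least one event in 12 months,
# under the (assumed) independence of months.
annual_compound = 1 - (1 - monthly_rate) ** 12

print(f"simple:   {annual_simple:.2%}")
print(f"compound: {annual_compound:.2%}")
```

Both versions land close to the "about 3 percent per year" quoted on the slide.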
Payment card fraud loss rate
http://www.kansascityfed.org/Publicat/Econrev/pdf/10q2Sullivan.pdf
The charge-off rate for credit cards
http://www.federalreserve.gov/releases/chargeoff/
More about statistics
 We described each of these using only one
number
 An average
 That’s not the whole story
 The average person has less than 2 legs!
 1.99…< 2
 Most people have an above-average number
of legs!
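The legs example can be made concrete with a small made-up population. The numbers below (999 people with two legs, one person with one) are purely hypothetical and chosen only to show how a mean can sit below what almost everyone has.

```python
import statistics

# Hypothetical population: 999 people with two legs, one person with one leg.
legs = [2] * 999 + [1]

mean_legs = statistics.mean(legs)      # slightly below 2
median_legs = statistics.median(legs)  # the typical person still has 2
above_average = sum(1 for n in legs if n > mean_legs)

print(f"mean={mean_legs}, median={median_legs}, above average={above_average}")
```

A single number such as the mean hides this kind of skew, which is why the following slides bring in variance.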
Even more about statistics
 It’s often useful to have a second number that
tells how much variation we have in our data
 Sets of data can both have the same
average, but be very different
 Same mean, different variance
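A short illustration of "same mean, different variance", using two invented data sets. The names `steady` and `volatile` and their values are assumptions made for this sketch, not data from the talk.

```python
import statistics

# Two hypothetical data sets with the same average but very different spread.
steady = [48, 49, 50, 51, 52]
volatile = [0, 25, 50, 75, 100]

for name, data in (("steady", steady), ("volatile", volatile)):
    mean = statistics.mean(data)
    variance = statistics.pvariance(data)  # population variance
    print(f"{name:8s} mean={mean} variance={variance}")
```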
The normal distribution
 The so-called normal distribution (“bell
curve”) appears again and again in statistics
 Many things end up with a normal distribution
when you might not expect it
The Central Limit Theorem
 If you add random values together you
tend to get a normal distribution
 Proof by picture:
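Since the picture itself did not survive in this transcript, a small simulation can stand in for it. The sketch below adds twelve uniform random values many times over and checks one signature of a normal distribution (about 68 percent of values within one standard deviation of the mean). The choice of twelve terms, the sample size, and the seed are all arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def sum_of_uniforms(count=12):
    """Add `count` independent uniform(0, 1) values together."""
    return sum(rng.random() for _ in range(count))

samples = [sum_of_uniforms() for _ in range(20_000)]
mean = statistics.fmean(samples)
stdev = statistics.pstdev(samples)

# For a normal distribution, about 68% of values fall within one
# standard deviation of the mean.
within_one_sd = sum(abs(x - mean) <= stdev for x in samples) / len(samples)
print(f"mean={mean:.2f} stdev={stdev:.2f} within 1 sd={within_one_sd:.1%}")
```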
Why a known distribution is useful
 If we know that we have data that follows a particular
probability distribution we can predict what we’ll see
in the future with fairly good accuracy
 If you flip a fair coin 100 times then
 You’ll get about 50 Heads
 There’s about a 73 percent chance of getting 45 to 55 Heads
 There’s about a 2 percent chance of getting more than 60
Heads
 What this doesn’t do is predict how any particular flip of the
coin will turn out
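The coin-flip figures above can be reproduced exactly from the binomial distribution. This is a minimal sketch; the helper name `prob_heads_between` is made up for the example.

```python
from math import comb

def prob_heads_between(lo, hi, flips=100):
    """Probability of getting between lo and hi Heads (inclusive) with a fair coin."""
    favourable = sum(comb(flips, k) for k in range(lo, hi + 1))
    return favourable / 2 ** flips

p_45_to_55 = prob_heads_between(45, 55)
p_more_than_60 = prob_heads_between(61, 100)

print(f"45 to 55 Heads:     {p_45_to_55:.1%}")
print(f"more than 60 Heads: {p_more_than_60:.1%}")
```

The results agree with the slide's "about 73 percent" and "about 2 percent", and say nothing about any individual flip.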
One more review of math: logarithms
 Logarithms are exponents
 So if we have these numbers:
10, 100, 1,000, 10,000, …
or 10^1, 10^2, 10^3, 10^4, …
 Then their logarithms are
1, 2, 3, 4, …
 Note that multiplying corresponds to adding
exponents (logs): 10^2 x 10^3 = 10^(2+3) = 10^5
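The rule that multiplying corresponds to adding logarithms is easy to check numerically. A minimal sketch using base-10 logarithms from Python's standard library:

```python
import math

a, b = 10 ** 2, 10 ** 3

log_a = math.log10(a)            # the exponent of a
log_b = math.log10(b)            # the exponent of b
log_product = math.log10(a * b)  # the exponent of the product

print(log_a, log_b, log_product)
print(math.isclose(log_a + log_b, log_product))
```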
Logarithms naturally occur in lots of ways
 Human perception of sound (or light) is
roughly proportional to logarithm of the sound
level rather than the sound level
 If you double the sound pressure level it doesn’t double how
loud it sounds to us
 Instead, you have to double the logarithm of the sound pressure level
 That’s why decibels are used to measure
sound levels, etc.
 So logarithms may be annoying but they’re
also useful in some cases
Another use for logarithms
 Logarithms are also a good way to handle big ranges
in numbers
 Radio: transmit kilowatts (1,000 Watts), receive
milliwatts (0.001 Watts)
 Hard to plot big ranges on one graph
 Very small numbers look just like zero
 Taking logarithms makes a big range easier to
handle
 3 to -3 instead of 1,000 to 0.001
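The radio example can be worked through in a few lines. The sketch below takes the two power levels from the slide and also expresses their ratio in decibels, using the standard definition of 10 times the base-10 log of a power ratio. The variable names are illustrative.

```python
import math

transmit_watts = 1_000.0  # kilowatt-scale transmitter
receive_watts = 0.001     # milliwatt-scale receiver

log_tx = math.log10(transmit_watts)
log_rx = math.log10(receive_watts)

# Power ratio in decibels: 10 * log10(P1 / P2).
ratio_db = 10 * math.log10(transmit_watts / receive_watts)

print(f"log10 transmit={log_tx:.1f}, log10 receive={log_rx:.1f}, ratio={ratio_db:.0f} dB")
```

A range of six orders of magnitude becomes a span of six units on a log scale, which fits comfortably on one graph.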
What about data breaches?
 The most comprehensive data is that
maintained by the Open Security Foundation
 www.datalossdb.org
 Currently has information on close to 3,000
data breaches
 Probably the most useful source of
information on data breaches
 What patterns can we find in the OSF’s data?
Data breaches since 2006
[Figure: TotalAffected per breach, in millions of records (vertical axis 0 to 140), plotted by date from 1/1/2006 to 5/1/2010. Labeled breaches: VA, TJX, HMRC, HPS, NARA.]
Making the range of values smaller
[Figure: Log(TotalAffected) per breach (vertical axis 0 to 9), plotted by date from 1/1/2006 to 4/1/2010.]
Sort these values to get…
[Figure: Log(TotalAffected) (vertical axis 0 to 9) with the breaches sorted by size; horizontal axis ticks run from 1 to 1613.]
The log of breach size matches a normal distribution very well
Fitted normal distribution: mean 3.2, standard deviation 1.2 (base-10 log of records affected)
What does this tell us?
 We may be able to understand the process
that leads to data breaches
 We may be able to predict some things about
future data breaches
 We may be able to find a good metric for
industry-wide efforts to reduce data breaches
 We really need comprehensive data to find
patterns that might be there
 Very small breaches are as important as very big ones
Understanding the process
 Just like we get a normal distribution from
adding several random values together, we
get a lognormal distribution when we multiply
several random values together
 Multiplying corresponds to adding exponents
(logs)
 This suggests that what we see for data
breaches may be explained by a layered
model of security
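The claim that multiplying random values gives a lognormal distribution can be illustrated with a simulation. In this sketch each "layer" contributes a random factor drawn uniformly between 0.5 and 2.0; that range, the ten layers, and the seed are arbitrary assumptions and are not derived from the breach data.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

def product_of_factors(count=10):
    """Multiply `count` independent random factors together."""
    value = 1.0
    for _ in range(count):
        value *= rng.uniform(0.5, 2.0)
    return value

products = [product_of_factors() for _ in range(20_000)]
logs = [math.log10(p) for p in products]

# The raw products are strongly skewed: the mean sits well above the median.
print(f"products: mean={statistics.fmean(products):.2f} "
      f"median={statistics.median(products):.2f}")

# Their logarithms are roughly symmetric, as a normal distribution would be.
print(f"logs:     mean={statistics.fmean(logs):.3f} "
      f"median={statistics.median(logs):.3f}")
```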
Abstract layered model of security
The general case: if we have
1. The security provided by two technologies when
they’re both used is greater than or equal to the security
of each of the components when they’re used by
themselves
2. If two technologies are independent then the
security provided by the two technologies when
they’re used together is equal to the sum of the
security provided by each of the technologies
3. The security provided by any technology is non-negative
It’s more than just data breaches
 Note that this model, in which bypassing each
layer of security multiplies the hacker’s
success, doesn’t just apply to data
breaches
 It also applies to any other aspect of
information security
 When we learn how to quantify other types of
security incidents we’ll probably find that the
damage from them also follows a lognormal
distribution
Then we have to have…
 A measure of security that works that way
has to essentially be a logarithm
 Measuring security breaches in terms of
logarithms may end up making more sense
than measuring security breaches directly
 We see it with data breaches
 We’ll probably see it for other types of losses
once we learn how to quantify those losses
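One way to see why such a measure "has to essentially be a logarithm" is sketched below. It assumes, purely for illustration, that an attacker gets past layer i with probability p_i and that the layers are independent; the symbols p_i and S_i are not from the slides.

```latex
\begin{align*}
P(\text{breach}) &= \prod_{i=1}^{n} p_i, \qquad 0 < p_i \le 1 \\
S_i &:= -\log p_i \;\ge\; 0 \\
S_{\text{total}} &= -\log \prod_{i=1}^{n} p_i \;=\; \sum_{i=1}^{n} S_i \;\ge\; \max_i S_i
\end{align*}
```

Under these assumptions the measure S is non-negative, adds across independent layers, and is never smaller than the security of any single layer, which matches the three properties of the general case.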
Does this interpretation make sense?
Other places where the lognormal distribution appears:
 The concentration of gold or uranium in ore deposits
 The latency period of bacterial food poisoning
 The age of the onset of Alzheimer's disease
 The amount of air pollution in Los Angeles
 The abundance of fish species
 The size of ice crystals in ice cream
 The number of words spoken in a telephone conversation
 The length of sentences written by George Bernard Shaw or
Gilbert K. Chesterton
http://stat.ethz.ch/~stahel/lognormal/bioscience.pdf
What can we predict?
 There’s about a 1 percent chance of any breach
exposing 1 million or more records
 There’s about a 0.1 percent chance of any breach
exposing 10 million or more records
 We can expect about 68 percent of breaches to
expose between 100 and 25,000 records
 We can expect about 95 percent of breaches to
expose between 6 and 400,000 records
 Etc.
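These predictions follow from the fitted distribution quoted earlier in the deck (mean 3.2, standard deviation 1.2). The sketch below assumes those parameters apply to the base-10 log of the number of records, and uses `statistics.NormalDist` from the Python standard library (3.8 or later). The helper name `band` is made up for the example.

```python
from statistics import NormalDist

# Assumed model: log10(records exposed) ~ Normal(mean 3.2, sd 1.2).
log_size = NormalDist(mu=3.2, sigma=1.2)

# Chance that a breach exposes at least 10^6 or 10^7 records.
p_million = 1 - log_size.cdf(6)
p_ten_million = 1 - log_size.cdf(7)

def band(k):
    """Range of breach sizes within k standard deviations of the mean."""
    low = 10 ** (log_size.mean - k * log_size.stdev)
    high = 10 ** (log_size.mean + k * log_size.stdev)
    return low, high

print(f"P(>= 1 million records):  {p_million:.2%}")
print(f"P(>= 10 million records): {p_ten_million:.2%}")
print("about 68% of breaches:", [round(x) for x in band(1)])
print("about 95% of breaches:", [round(x) for x in band(2)])
```

The model describes the size of a breach once it happens; it does not say how often breaches occur.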
What can we NOT predict?
 How many data breaches we should expect
to see in the next 12 months
 Whether or not any particular business will
suffer a data breach in the next 12 months
 Whether or not your business will suffer a
data breach in the next 12 months
 Etc.
Other patterns: Benford’s law
 Benford’s law tells us that the leading digits in
data tend to not be evenly distributed
 Probability of leading digit being n is
P(n) = log10(1 + 1/n)
n     1     2     3     4     5     6     7     8     9
P(n)  0.30  0.18  0.12  0.10  0.08  0.07  0.06  0.05  0.05
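The table of Benford probabilities can be regenerated directly from the formula, taking the logarithm to base 10. A minimal sketch; the function name `benford` is illustrative.

```python
import math

def benford(digit):
    """Benford's law probability that the leading digit equals `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)

table = {d: round(benford(d), 2) for d in range(1, 10)}
total = sum(benford(d) for d in range(1, 10))

for digit, prob in table.items():
    print(digit, f"{prob:.2f}")
print("sum of probabilities:", round(total, 6))
```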
Why Benford’s law might make sense
 Consider what happens with exponential
growth
 Start with 1 and multiply by 1.1 at each step:
1, 1.10, 1.21, 1.33, 1.46, 1.61, 1.77, 1.95,
2.14, 2.36, 2.59, 2.85, 3.14, 3.45, 3.80,
4.18, 4.59, 5.05, 5.56, 6.12, 6.73, 7.40,
8.14, 8.95, 9.85, 10.83, …
 Note that 1 is the most common leading digit, etc.
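Extending the same sequence much further makes the pattern clearer. The sketch below generates 1,000 terms of the "multiply by 1.1" sequence and compares the observed leading-digit frequencies with Benford's law. The number of terms is an arbitrary choice, and `leading_digit` is a helper written for this example.

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit of a positive number."""
    return int(f"{x:.15e}"[0])

# Start with 1 and multiply by 1.1 at each step.
values = [1.1 ** k for k in range(1000)]
counts = Counter(leading_digit(v) for v in values)

for d in range(1, 10):
    observed = counts[d] / len(values)
    expected = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed:.3f}, Benford {expected:.3f}")
```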
Benford’s law for breaches
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
1 2 3 4 5 6 7 8 9
OSF Data
Benford's Law
Other patterns
 There are other patterns that we can find
 But they’re really just ways to repackage the
exponential growth idea
 No real new ideas
 Zipf’s law
 Pareto’s principle
Zipf’s law
 Zipf’s law
 Order the data from biggest to smallest
 Then the total contribution from any entry is inversely
proportional to its position in the table
 Second entry is about 1/2 of the first one
 Third entry is about 1/3 of the first one
 The nth entry is about 1/n of the first one
 R^2 = 0.873
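A sketch of how a Zipf-style prediction can be compared with observed values. The `sizes` list is invented for illustration and is not OSF data, and the slides do not say how their R^2 figure was computed, so the plain coefficient of determination used here is an assumption.

```python
def zipf_prediction(first, count):
    """Predicted values under Zipf's law: the nth entry is first / n."""
    return [first / rank for rank in range(1, count + 1)]

def r_squared(actual, predicted):
    """Coefficient of determination between observed and predicted values."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical values, already ordered from biggest to smallest.
sizes = [1200, 610, 390, 310, 235, 205]
predicted = zipf_prediction(sizes[0], len(sizes))
fit = r_squared(sizes, predicted)

print(predicted)
print(f"R^2 = {fit:.4f}")
```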
Pareto’s principle
 Sometimes known as the “80-20 rule”
 Very similar to the others that we’ve mentioned
 In general, k% of the population accounts for
(100 - k)% of something, for some k between 50 and
100
 For k = 80 we get the 80-20 rule
 Empirically, most data seem to cluster around k being
in the middle of this range
 It’s yet another power law
Bottom line
 It certainly looks like it’s possible to find some
interesting structure in the data that’s available for
data breaches
 The size of data breaches seems to follow a very well
defined pattern
 We may see this same pattern in other parts of
information security when we learn how to quantify
other types of losses due to security breaches
 We need lots of data to see the patterns in it
 Data on small breaches is as important as the big ones
Practical implications (so what?)
 Developing metrics
 Developing ROI models
 Pricing insurance
 Are we winning or are hackers winning?
 Any time when quantifying a loss is useful
 Etc.
Some useful references
The OSF’s data breach database
http://datalossdb.org/
E. Limpert, W. Stahel and M. Abbt, “Lognormal Distributions across
the Sciences: Keys and Clues”
http://stat.ethz.ch/~stahel/lognormal/bioscience.pdf
The Voltage corporate blog
http://superconductor.voltage.com
CSO Magazine article on finding patterns in data breaches
http://www.csoonline.com/article/501584/data-breaches-patterns-and-their-implications

Editor's Notes

  1. ISSA Journal, March 2008, https://www.issa.org/Library/Journals/2008/March/Martin%20-%20The%20Information%20Security%20Life%20Cycle.pdf
  2. Offer free VSN to the person with the best answer
  3. NY 1 in 1,660 (0.7%); CA 1 in 194 (6%); NV 1 in 84 (13%); WY 1 in 2,621 (0.4%)
  4. About $0.09 per $100 in the US; much less in other countries
  5. Historically about 4%; now about 10%
  6. So Lake Wobegon is almost possible – all but one child can be above average
  7. Bell Atlantic and Dean’s troposcatter system
  8. HM Revenue & Customs, National Archives and Records Admin
  9. Frank Benford, “The law of anomalous numbers,” Proceedings of the American Philosophical Society, Vol. 78, pp. 551-72, 1938.