1. Bob Rudis • Managing Principal & Senior Data Scientist
bob@rudis.net
From Data to Decision Makers
A Behind the Scenes Look at Building The
Most Respected Report In Cybersecurity
2.
ABOUT ME
(Briefly)
3.
• DBIR team manager/author (more on this in a bit)
• Former cyber risk director for a Fortune 100
insurance company
• Serial #rstats Tweeter (@hrbrmstr), blogger
(rud.is/b & @ddsecblog) & regular helper on
StackOverflow
• Author of and contributor to 14 CRAN packages
• Co-author of Data-Driven Security (@ddsecbook)
• Co-host of the Data-Driven Security Podcast
(@ddsecpodcast)
• Die-hard ggplot2 advocate, widgeteer, heavily
addicted cartographer & shameless user of the
forward assignment operator ←4EVA→
4.
WHAT IS THE DBIR?
5.
The Verizon Data Breach
Investigations Report (DBIR)
“The Verizon Data Breach Investigations Report
(DBIR) is an annual publication that provides
analysis of information security incidents, with a
specific focus on data breaches.”
http://searchsecurity.techtarget.com/definition/Verizon-Data-Breach-Investigations-Report-DBIR
verizonenterprise.com/DBIR
6.
WHO IS THE DBIR?
7.
Wade Baker Dave Hylender Marc Spitler Jay Jacobs
Kevin Thompson Suzanne Widup Bhaskar Karambelkar Gabriel Bassett
8.
The DBIR
• Started in 2008
• Cited by virtually every other cybersecurity report by the third paragraph
• Read by individual contributors up through senior
leadership at virtually every global enterprise
• A lot of fun to work on
9.
[Chart: DBIR contributing organizations by year — 2008: 1, 2009: 1, 2010: 2, 2011: 3, 2012: 6, 2013: 18, 2014: 50, 2015: 70]
10.
WHAT DOES THIS HAVE TO
DO WITH R?
11.
200,000
12.
Vocabulary for
Event
Recording and
Incident
Sharing
veriscommunity.net
vcdb.org
13.
14.
verisr
github.com/vz-risk/verisr
15.
library(verisr)
vcdb <- json2veris(jsondir)
summary(vcdb) # too big to show
getenum(vcdb, "actor")
## enum x
## 1 external 955
## 2 internal 535
## 3 partner 100
## 4 unknown 85
getenum(vcdb, "actor", add.n=TRUE, add.freq=TRUE)
## enum x n freq
## 1 external 955 1643 0.581
## 2 internal 535 1643 0.326
## 3 partner 100 1643 0.061
## 4 unknown 85 1643 0.052
16.
17.
vz-risk.github.io/dbir/2015/19/
18.
19.
20.
• 200m successful vulnerability exploits across 20,000 enterprises
• 170m malware events across over 10,000 enterprises
• 6 months of malware traffic data from 30+m mobile devices
• Live botnet traffic from compromised organizations
• Millions of Indicators of Compromise
• Details of all Denial of Service activity for 2014
21.
22.
PUTTING IT ALL TOGETHER
Getting the data
23.
24.
PUTTING IT ALL TOGETHER
Creating, organizing and sharing analyses
25.
.R .Rmd .json .Rdata
26.
1. Assign areas to each researcher
2. For “standard VERIS” analyses, generate reports from core Rmd
3. Have “Findings Review” collaborative meetings where we peer-review the work
4. (Repeat step 3 after refinement of findings)
5. Decide on final sections for the report and assign authors
6. Add rough draft visualizations to the findings
7. Lock in content
8. Refine visualizations
9. Finalize text content
10. Work with Marketing & Graphics
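Step 2 above (generating reports from a core Rmd) can be sketched with rmarkdown's parameterized reports. The template name `dbir-core.Rmd` and the `section` parameter are illustrative, not the team's actual files:

```r
library(rmarkdown)

# Hypothetical core template: dbir-core.Rmd declares a parameter in
# its YAML header, e.g.
#   params:
#     section: "actors"
sections <- c("actors", "actions", "assets", "attributes")

for (s in sections) {
  render(
    "dbir-core.Rmd",              # shared analysis template
    params = list(section = s),   # which VERIS area to report on
    output_file = sprintf("findings-%s.html", s)
  )
}
```

Each researcher then gets a rendered findings document for their assigned area from the same source template.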
27.
FIGURATIVELY SPEAKING
28.
• Create one “Master Rmd” for all
visualization figures using canned data from
outputs of analyses, having one master
(giant) HTML document version and multiple
individual PDF versions to give to the
creative staff to work with
Why PDF? Complex ggplot2 SVGs crash
Illustrator and the fonts are horrible (they
get converted to polygons).
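One way to hand Illustrator a PDF with real, embedded fonts (rather than text converted to polygons) is the Cairo PDF device; the figure and file name here are illustrative:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(factor(cyl))) +
  geom_bar() +
  theme_minimal(base_family = "Helvetica")

# cairo_pdf embeds fonts instead of outlining them, which keeps the
# file editable for the creative staff
ggsave("figure-01.pdf", p, device = cairo_pdf, width = 6, height = 4)
```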
29.
• When you decide you want to use a figure
from the analysis spend the time to make it
look as amazing (and final) as possible to
save $$, save time down the road and to
avoid seeing your creations on @wtfviz
30.
LESSONS LEARNED
31.
R Markdown (Rmd) makes it super
amazingly awesomely easy to
document, iterate, modify & share
analyses.
spinning is cool too.
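"Spinning" refers to knitr::spin(), which turns a plain R script with roxygen-style comments into a report; a minimal sketch (the file name and chunk are made up):

```r
# analysis.R -- a "spinnable" plain R script:
#   #' lines become markdown prose
#   #+ lines become knitr chunk options

#' ## Actor overview
#' A quick look at the actor enumeration.

#+ actor-plot, fig.width=6
plot(pressure)  # stand-in for a real analysis chunk

# From the console, one call turns the script into a report:
knitr::spin("analysis.R")
```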
32.
ggplot2 makes it super amazingly
awesomely straightforward to make
“camera ready” visualizations
(PDF vs SVG)
33.
Do not upgrade your analysis stack or
experiment with RStudio during the
core analysis phase
34.
Packages (even for analyses) > loosely
connected documents and scripts
35.
Source code control & data version
control are extremely important
36.
A fellow researcher must be able to
reproduce your analyses with the same
data & Rmd and understand your
reasoning in the annotation
37.
Freezing or at least recording versions
of packages you use may be vitally
important to your ability to reproduce
at a later date (store them in version
control with analyses or perhaps
embed in a container like Docker)
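A lightweight version of "recording versions" (packrat/checkpoint-style tools or a Docker image are heavier alternatives) is simply to snapshot the attached package versions into a file kept under version control; the file name is illustrative:

```r
# Write the version of every attached (non-base) package to a file
# that lives in version control next to the Rmd analyses
si <- sessionInfo()
pkgs <- vapply(si$otherPkgs, function(p) p$Version, character(1))
writeLines(sprintf("%s %s", names(pkgs), pkgs), "package-versions.txt")
```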
38.
ABOUT THE COVER
39.
40.
41.
What is the DBIR? It may cause you some concern that before the DBIR (and, if I’m being honest with everyone, even since) most decisions in cybersecurity are not made through what most of you here would call “data science”. Most things in cyber are based on expert opinions, often unencumbered by facts.
But I can’t talk about how we use R for the DBIR without first introducing you to the “we”.
I may be the only one standing up here talking about the DBIR, but it’s a team effort. The spiffy-looking gentleman in the upper left is Wade Baker. He started this whole thing and we affectionately refer to him as the godfather of the DBIR. We do not perform author attribution in the report proper, since publishing it involves so many individuals that we would need scrolling movie credits and would inevitably leave someone off. However, if we exclude internal marketing and production management, these folks are the ones who put their hearts and souls into the analyses, visualizations and main report production.
The DBIR’s history goes back to 2008, when the first one was published. Back then, it was made up of transcribed incidents from the Verizon Incident Response team. The data was crunched in Excel, and I’m embarrassed to admit just how riddled with pie charts that inaugural issue was. However, it was the first report in cybersecurity that provided real data about the actors who commit cybercrimes, the actions they take when committing cybercrime, the assets those actions were taken against and the attributes of the impacted data elements. It’s valued since it’s not a survey conducted by a vendor with a vested interest in the outcome but actual, real data provided by many (now 70) contributors.
So, speaking of contributors…as the DBIR evolved, more contributors came on board, which made using Excel a bit difficult. This is where R comes into play. The covers you see here aren’t just pretty pictures (though they are pretty pictures). We’ll cover more about the covers (heh) in a bit.
When the number of incidents was small (in the low hundreds), working with the data in Excel was fairly straightforward. But, thanks to regular contributions from the Secret Service, the Department of Health and Human Services and over 60 other global organizations, the number of incidents in the corpus has now hit 200K. Excel can handle 200K rows, but there are some things that make the analysis a bit trickier.
The VERIS (Vocabulary for Event Recording and Incident Sharing) framework is a taxonomy that standardizes how security incidents are described and categorized. The schema and record format is JSON, organized into the Actor/Action/Asset/Attribute categories I mentioned earlier. You can see real incidents encoded in this format over at vcdb.org, where we have a corpus of public breaches encoded.
If we limit this to just the top-level categories, there are 315 top-level combinations, but it’s possible to record multiple actors, actions, assets and attributes per incident. Think of a phishing incident where a phishing email leads to social engineering, malware eventually gets deployed on a system, and an actor then looks for other systems to break into and eventually steals or corrupts data. It’s possible to have over 2,000 enumeration details associated with a single incident. Given the nested structure of JSON and the limitations of Excel, the decision to move to R was not a tough one to make.
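The nesting that trips up Excel is easy to see with jsonlite; this record is a made-up illustration in the spirit of VERIS, not an actual schema excerpt:

```r
library(jsonlite)

# Illustrative (not schema-exact) incident: note that multiple
# actions hang off a single incident record
incident <- fromJSON('{
  "actor":  { "external": { "variety": ["Organized crime"] } },
  "action": {
    "social":  { "variety": ["Phishing"] },
    "malware": { "variety": ["Backdoor"] }
  },
  "asset":  { "assets": [ { "variety": "S - Database" } ] }
}')

names(incident$action)  # more than one action per incident
```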
To help standardize the analysis of the incident records, the verisr package was created (by Jay). While not on CRAN, the package is available in the VZ RISK team’s GitHub repository (with a forked copy in Jay’s GitHub account) and can be used to analyze VCDB incidents or incidents that organizations encode in VERIS format. It makes heavy use of the data.table package. One interesting fact: the incident and breach corpus fits on a cheesy thumb drive that you might get at a vendor booth at a tech conference. We have an entire chapter on VERIS and verisr in our book Data-Driven Security, with examples of how to use the package to analyze different aspects of incidents.
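Since verisr isn't on CRAN, it installs straight from the GitHub repository mentioned above (a standard devtools pattern, not a step the talk spells out):

```r
# install.packages("devtools")  # if not already installed
devtools::install_github("vz-risk/verisr")

library(verisr)
# json2veris() then builds an analysis object from a directory of
# VERIS-format JSON incident files (see the earlier slide example)
```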
Here’s a small example of what the verisr package can do. There are many helper functions that make it easy to slice & dice the data for any given analysis.
With verisr and a corpus of breach data to work with, you can do things like compare the time it takes an attacker to compromise an organization vs the time it takes an org to discover a breach. Unlike the vast majority of security reports, which would be glad to declare victory at 2014 having the closest compromise-versus-discovery gap yet, the trend lines paint a slightly different picture.
Whenever ggplot2 was used to make a chart I’ve included the sticker on the page. Every visualization you see in this presentation and in the report is 99% ggplot2. Font issues (more on that in a bit) and some required style guide restrictions (how legends appear, for example) make up the 1%. We probably saved $12-15K (based on the hourly rate) in post-production costs by providing high quality & pre-styled charts to the production team.
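A "camera ready" chart like those can be sketched from the getenum() output shown on slide 15; the theme and color choices here are illustrative, not the report's actual style guide:

```r
library(ggplot2)

# Frequencies from the earlier getenum(vcdb, "actor", ...) example
actors <- data.frame(
  enum = c("external", "internal", "partner", "unknown"),
  freq = c(0.581, 0.326, 0.061, 0.052)
)

ggplot(actors, aes(x = reorder(enum, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL, y = NULL, title = "Threat actors (% of incidents)") +
  theme_minimal()
```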
And, because VERIS uses the North American Industry Classification System (NAICS) for granular recording of what industry an org is in, you can do really cool things like cluster incidents by selected enumeration details at a broad or discrete level. This particular chart encodes the number of incidents in a given industry as circle size and clusters incidents with similar attack profiles closer together. We usually look at industries at the 2-digit NAICS level (since the report has a broad audience), but for this particular analysis we wanted to see whether industries further down the NAICS tree clustered within their higher-level category or across categories. The exercise was pretty illuminating; you can look for yourself in the report or hit the URL on this page. We exported the data from R and made an interactive D3 visualization that you can explore.
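The broad shape of that analysis (distance between industry attack profiles, projected to 2-D, handed to D3) can be sketched as follows; the counts and NAICS codes are toy data, not report figures:

```r
library(jsonlite)

# Toy industry-by-enumeration incident counts (rows: 2-digit NAICS)
profiles <- matrix(
  c(40,  5, 10,
    35,  8, 12,
     2, 60,  9,
     4, 55, 11),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("52", "51", "62", "61"),
                  c("hacking", "error", "misuse"))
)

# Normalize to proportions so industry size doesn't dominate distance
prop <- profiles / rowSums(profiles)
xy <- cmdscale(dist(prop), k = 2)   # classical MDS down to 2-D

# Hand the coordinates (plus incident counts for circle size) to D3
out <- data.frame(naics = rownames(xy), x = xy[, 1], y = xy[, 2],
                  n = rowSums(profiles))
writeLines(toJSON(out), "industry-clusters.json")
```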
Last year, we used a number of clustering techniques to classify the breaches into categories. It’s virtually impossible to do this by hand anymore, and the analysis ended up putting each incident into one of nine buckets. When we looked at the core attributes of each bucket we were able to give them easy-to-remember names, since the VERIS enumerations that made up each category made sense to the domain experts performing the analysis. We dubbed them the Nefarious Nine (+1, which lumps the incidents that had no classification into a catch-all category). By doing this we are also able to provide both a current snapshot and a multi-year view. This heatmap shows the most prevalent pattern in a given industry (with a 3-year history) and lets you compare across industries to spot similarities and differences. We’re working on a way to do this for more discrete NAICS codes without making the chart gibberish (it may have to be an interactive version to accomplish that, though).
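A pattern-by-industry heatmap of that sort is a geom_tile() in ggplot2; the industries, patterns and frequencies below are made up for illustration:

```r
library(ggplot2)

# Made-up pattern-by-industry frequencies (not report data)
d <- expand.grid(
  industry = c("Finance", "Healthcare", "Retail"),
  pattern  = c("POS Intrusion", "Web App", "Insider Misuse")
)
d$freq <- c(5, 2, 60, 40, 10, 15, 8, 55, 6) / 100

ggplot(d, aes(pattern, industry, fill = freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "#fee8c8", high = "#e34a33",
                      labels = scales::percent) +
  labs(x = NULL, y = NULL, fill = "Freq") +
  theme_minimal()
```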
This year we had an opportunity to do more than work on breaches. A handful of new partners (vendors and service providers) provided over 12TB of incident and vulnerability data for us to analyze and make part of the report.
Note the mistake about using density plots
We encode our own incidents, and others code their incidents into a Survey Gizmo form. Yes, Survey Gizmo. It’s cheaper than building and maintaining an app (we tried!), is more creator-friendly than Google Forms, has built-in user management, has an API (more on that in a bit) and is fine from a security standpoint since we have codenames for all participants and uniquely identifying components of an incident are forbidden from being entered. We sometimes have to fly to an org to help them encode incidents and transport them in locked briefcases (no handcuffs, I’m afraid). We use node.js for JSON schema validation of each record and to do some minor cleanup of each incident (if necessary). We can use V8 now to keep all of that activity in R.
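The V8 package embeds a JavaScript engine in R, so schema checks that used to run in node.js can stay inside the analysis pipeline. The `validate` function below is a stand-in for a real JSON-schema validator library you would load with `ct$source()`:

```r
library(V8)

ct <- v8()  # spin up an embedded JavaScript context

# In practice you'd ct$source() a JSON-schema validator plus the
# VERIS schema; this stand-in just checks a required section exists
ct$eval("
  function validate(incident) {
    return incident.hasOwnProperty('action');
  }
")

ct$call("validate", list(action = list(malware = list())))  # should be TRUE
```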
For some of the new, large data we received, we ended up using PostgreSQL and Elasticsearch, and most of it was downloaded across secure internet connections.
We used an internal GitLab instance on an annoyingly secure private network enclave as the source of authority for the JSON incident records and to hold the R scripts and Rmd files for analysis. The VCDB incidents are on GitHub (go play!), so we used that as well. We were keeping a leaderboard for GitHub incident encoding at one point, too.
We used Slack for virtually all team collaboration (which is one reason I wrote the slackr package) and used GPG tools to share anything remotely sensitive. We also received alerts about SLA issues, both for outages and survey completion times, from SurveyGizmo (their API is pretty decent).
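Posting analysis output to the team channel with slackr looks roughly like this; the config file path is slackr's documented default, and `vcdb` refers to the object from the earlier verisr example:

```r
library(slackr)

# One-time setup reads the API token/channel from a config file
# (kept out of version control, of course)
slackr_setup(config_file = "~/.slackr")

# Post the printed output of an R expression straight to the channel
slackr(summary(vcdb))
```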
We used Room.co for video chats since it has more secure point-to-point websockets and Google Hangouts records everything even if you didn’t ask it to.
All the analyses were done in RStudio cuz RStudio & Kevin Ushey (et al.) rock.
Notice I did not say easy
There is a cover contest each year where we usually add hidden text to one of the covers (with pictorial clues on the cover and some in the text of the report) that sends folks on a cryptographic, puzzle-infused scavenger hunt to eventually figure out where to send a coded message to. The first 3 folks or teams to do so win prizes (like iPads) or can have a donation made in their name to a charity of their choice. The 2014 cover is the first time there was an actual data-driven cover completely generated in R. There’s an explanation on the back of the 2014 report that talks about the clustering used there. The base was done in ggplot2 and igraph was used to generate the graphs on top (layered by hand in Illustrator). It’s 100% driven by data from that year’s report and shows the universe of breaches quite nicely, IMO.
We based this year’s cover on Joy Division’s “Unknown Pleasures” album cover. It was entirely generated with ggplot2, with only minor editing by the graphics team.
Rather than use hidden text, we used R to encode bits onto the back as “waveforms”. The winning teams ended up transcribing the bits by hand. I have R code that can read the PDF encoded lines and determine 1/0 from it (like 4 lines of R). The bits make what look like gibberish unless you recognize what bitly short urls look like after the slash.
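The bit-encoding trick can be sketched in a few lines of ggplot2: a long segment for 1, a short one for 0. The actual cover's encoding details aren't given here, so this is purely illustrative:

```r
library(ggplot2)

bits <- c(1, 0, 1, 1, 0, 0, 1, 0)   # illustrative message fragment

d <- data.frame(
  x   = seq_along(bits),
  len = ifelse(bits == 1, 1.0, 0.4)  # 1 -> long tick, 0 -> short tick
)

# Draw the "waveform": vertical segments with no axes or gridlines
ggplot(d) +
  geom_segment(aes(x = x, xend = x, y = 0, yend = len)) +
  theme_void()
```

Decoding goes the other way: measure each extracted line's length and threshold it back to a 1 or a 0.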