I FOR ONE WELCOME OUR NEW CYBER OVERLORDS! AN INTRODUCTION TO THE USE OF MACHINE LEARNING IN CYBERSECURITY
1. I, for one, welcome our new Cyber Overlords!
An introduction to the use of data science in cybersecurity
By Tiago Henriques, Filipa Rodrigues, Florentino Bexiga, Ana Barbosa
2. Agenda
WHO ARE WE?
MACHINE LEARNING AND CYBERSECURITY
IMAGE WORKFLOW
IMAGE ANALYSIS IN DETAIL
DATA VISUALISATION
3. Tiago is the CEO and Data Necromancer at BinaryEdge. However, he still gets to meddle in the intersection of data science and cybersecurity by providing his team with lovely problems that they solve on a daily basis.
Tiago Henriques
Presenter
4. Florentino is the Data MacGyver at BinaryEdge. On a daily basis he needs to deploy infrastructure used to analyse big and real-time data. When not doing that, he can be found creating models to analyse data. Give him an orange, he’ll give you a Skynet. Why an orange, you ask? He’s hungry and likes oranges, there!
Florentino Bexiga
Presenter
5. Filipa is the Data Diva at BinaryEdge. She dances the macarena with numbers to get them to tell her all their dirty secrets.
Filipa Rodrigues
Presenter
6. Ana is the Data Ferret at BinaryEdge. She is small and hides between the 110th and 111th characters of the ASCII code to see and show data from the unique perspective of someone who can’t reach the box of cookies stored on top of the capital 'I'.
Ana Barbosa
Presenter
9. How we got here...
200 port scans of the entire internet per month
1,400,000,000 scanning events per month *
746,000 torrents monitored and increasing
1,362,225,600 torrent events per month
* at a minimum
10. Worldwide distribution of IPs running services
[World map. Legend: number of IPs found]
<= 100
100 < #found <= 1,000
1,000 < #found <= 10,000
10,000 < #found <= 100,000
100,000 < #found < 1,000,000
>= 1,000,000
12. Data Science & Machine Learning
How many IP addresses did job X have vs. job Y?
What is the average duration of the scans?
Can we extract more from all the screenshots we get?
Can we have a more optimized job distribution?
We can only identify X% of services because we’re
using static signatures, can we do better?
Can we find similar images?
MULTIPLE WILD QUESTIONS APPEAR... ONE COMMON ANSWER:
DATA SCIENCE & MACHINE LEARNING
13. Data Science & Machine Learning
DATA SCIENCE:
INITIAL ANALYSIS AND CLEAN UP
EXPLORATORY DATA ANALYSIS
DATA VISUALISATION
KNOWLEDGE DISCOVERY
MACHINE LEARNING:
CLASSIFICATION
CLUSTERING
SIMILARITY MATCHING
REGRESSION
IDENTIFICATION
14. Problems and Limitations of Machine Learning in Cybersecurity
Lots of adversarial scenarios: attacks on the classifiers go against the foundations of machine learning
Prediction: scenarios and data are too volatile, and there are not enough proper sources of data
Lack of data, in both quantity and quality, to train models
15. Good use cases
PATTERN DETECTION/OUTLIER DETECTION (IDS/IPS): if a computer is hacked, certain behaviors will change; if data is constantly monitored and fed into a system, the hack could be detected
ANTIVIRUS: further work needs to be done, but this will allow antivirus to move from a static, signature-based system to a much improved dynamic, learning-based system
ANTI-SPAM
SMARTER FUZZERS
SOURCE CODE ANALYSIS: detection of vulnerable patterns during development
INTERNAL ATTACKERS: sentiment analysis applied to emails, tweets and social networks of employees
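The outlier-detection use case above can be illustrated with a very small sketch. The slides do not say which technique is used, so this example assumes a modified z-score based on the median absolute deviation (MAD), a common robust choice; the function name, the 3.5 threshold and the hourly counts are all hypothetical.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds the threshold.

    Uses the median absolute deviation (MAD) instead of the mean and standard
    deviation, so a single extreme value does not hide itself by inflating
    the spread. Expects a non-empty sequence of numbers.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [
        (i, v)
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical hourly outbound-connection counts for one host (made-up numbers).
hourly_connections = [52, 48, 50, 47, 53, 49, 51, 50, 46, 54, 49, 900]
print(find_outliers(hourly_connections))  # [(11, 900)]
```

A real IDS/IPS would look at many features at once; the point here is only that a monitored baseline makes a sudden change in behavior measurable.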
16. Data points
[Mind map. Node labels as they appear on the slide:]
metadata, files, people, photos, family & friends, behaviour, social, search, company, registration, IP address, URL address, news, forums, sub-reddits, internal, external, phone, email, linked URLs, likes, topics
BGP, AS, whois, AS membership, AS peer, list of IPs, shared infrastructure, co-hosted sites, contact, geolocation, office locations, social networks, phone
portscan, DNS, torrents, domains, AXFR, MX records, screenshots, web, services, HTTP, HTTPS, webserver, framework, headers, cookies, certificate, configuration, authorities, entities, SMB, VNC, RDP, users, apps, files, peers, torrent name, OCR, SW, banners, image classifier, vulnerabilities
binaryedge.io 2016
29. Image Workflow
INITIALIZER, FILTER, LOGO DETECTION / FACE DETECTION / OPTICAL CHARACTER RECOGNITION (OCR)
PULL MESSAGE FROM QUEUE
IS THERE A NEW IMAGE?
YES: DECRYPT AND STORE IMAGE METADATA ON A DATABASE, GENERATE IMAGE SIGNATURE FOR SIMILARITY COMPARISON, MESSAGE QUEUE
NO: FINISH
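The slides do not say how the image signature is generated. As one possible illustration, the sketch below uses a simple average hash: each pixel of a small grayscale grid becomes one bit depending on whether it is brighter than the mean, and two signatures are compared by Hamming distance. The function names and the tiny 4x4 grids are hypothetical; a real pipeline would first downscale the screenshot with an imaging library.

```python
def average_hash(pixels):
    """Return a list of bits: 1 where a pixel is brighter than the grid mean.

    `pixels` is a 2D list of grayscale values (0-255), assumed to be already
    downscaled to a small fixed size.
    """
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    return [1 if p > avg else 0 for p in flat]


def hamming_distance(sig_a, sig_b):
    """Count the positions where two equal-length signatures differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a != b for a, b in zip(sig_a, sig_b))


# Hypothetical 4x4 grids: a left/right split, a brighter copy, a top/bottom split.
left_right = [[10, 10, 200, 200] for _ in range(4)]
brighter = [[p + 20 for p in row] for row in left_right]
top_bottom = [[10] * 4, [10] * 4, [200] * 4, [200] * 4]

print(hamming_distance(average_hash(left_right), average_hash(brighter)))    # 0
print(hamming_distance(average_hash(left_right), average_hash(top_bottom)))  # 8
```

A small distance suggests similar images; the cut-off used to call two screenshots "similar" would have to be tuned on real data.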
30. Image Workflow
PULL MESSAGE FROM QUEUE
DOES THE IMAGE HAVE ANY INFORMATION? (PERFORM SIMPLE ENTROPY FILTERING)
YES: MESSAGE QUEUE
NO: FINISH
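A minimal sketch of what "simple entropy filtering" could look like, assuming it means the Shannon entropy of the pixel values: a blank or single-colour screenshot has entropy close to zero and can be dropped before the expensive detection steps. The function names and the 1.0 bits-per-pixel threshold are hypothetical, not taken from the slides.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy, in bits per pixel, of a flat sequence of pixel values."""
    if not pixels:
        return 0.0
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def has_information(pixels, min_entropy=1.0):
    """Decide whether an image is worth sending to the next stage.

    The threshold is an illustrative guess and would need tuning on real data.
    """
    return shannon_entropy(pixels) >= min_entropy


blank_screen = [0] * 64           # one colour only
busy_screen = list(range(64))     # every pixel different

print(has_information(blank_screen))  # False
print(has_information(busy_screen))   # True
```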
31. Image Workflow
PULL MESSAGE FROM QUEUE
ENHANCE IMAGE WITH APPLICATION OF SOME FILTERS
RUN FACE AND LOGO DETECTION AND OCR ALGORITHMS
STORE RESULTS IN DATABASE
PERFORM ADDITIONAL ACTIONS WITH THE RESULTS
32. Image Workflow
[{"BreachDate": "2013-10-04", "DataClasses": ["Email addresses", "Password hints", "Passwords", "Usernames"], "Title": "Adobe", "IsActive": true,
"Description": "In October 2013, 153 million Adobe accounts were breached with each containing an internal ID, username, email, <em>encrypted</em> password and a password hint in plain text. The password cryptography was poorly done and <a href=\"http://stricture-group.com/files/adobe-top100.txt\" target=\"_blank\">many were quickly resolved back to plain text</a>. The unencrypted hints also <a href=\"http://www.troyhunt.com/2013/11/adobe-credentials-and-serious.html\" target=\"_blank\">disclosed much about the passwords</a> adding further to the risk that hundreds of millions of Adobe customers already faced.",
"Domain": "adobe.com", "AddedDate": "2013-12-04T00:00:00Z", "PwnCount": 152445165, "IsRetired": false, "IsVerified": true, "LogoType": "svg", "IsSensitive": false, "Name": "Adobe"}]
Email
DataLeak API
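One of the "additional actions" suggested by this slide is to take an email address found in a screenshot and look it up in a data-leak service. The sketch below covers only the two local steps: pulling email-like strings out of OCR text and summarising a response shaped like the record above. The function names, the regular expression and the OCR text are hypothetical, and no request is made because the slides do not show how the DataLeak API is called.

```python
import json
import re

# Deliberately simple pattern; real-world email matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(ocr_text):
    """Return unique email-like strings in order of first appearance."""
    found = []
    for match in EMAIL_RE.findall(ocr_text):
        if match not in found:
            found.append(match)
    return found


def summarise_breaches(response_body):
    """Reduce a JSON list of breach records to (title, date, data classes)."""
    return [
        (b["Title"], b["BreachDate"], b["DataClasses"])
        for b in json.loads(response_body)
    ]


# Made-up OCR output and a trimmed copy of the record shown on the slide.
ocr_text = "Logged in as admin@example.com - contact admin@example.com for help"
sample_response = json.dumps([{
    "Title": "Adobe",
    "BreachDate": "2013-10-04",
    "DataClasses": ["Email addresses", "Password hints", "Passwords", "Usernames"],
}])

print(extract_emails(ocr_text))
print(summarise_breaches(sample_response))
```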
35. Data Visualization
EXPLORATION, REPRESENTATION, DETAILS, TOOLS, FINISHING UP
“a multidisciplinary recipe of art, science, math, technology, and many other interesting ingredients.”
Andy Kirk, “Data Visualization: a successful design process”
36. Exploration
DATA TYPE
Hierarchical
Relational
Temporal
Spatial
Categorical
RELEVANCE
What is the most interesting?
What is most important?
Audience’s profile
What is the most relevant information in the context?
FILTER
Show all values or just a few?
Define periods?
Define a threshold?
37. Representation
Experimentation is important
Conceive ideas
Storyboarding
Do multiple iterations
Prototype
Test
Design can be used in the future
How many open ports does an IP have?
Number of IPs with X open ports
[Bar chart. Axes: port, Number of IPs. Values shown: 69,543,915; 25,436,974; 7,008,108; 3,475,472; 1,287,446; 1,043,331; 951,629; 854,817; 789,515; 759,115; 490,290; 288,885; 266,827; 257,105; 219,025; 198,898; 186,286; 141,474]
38. Representation
MARKS
Points
Areas
Lines
ATTRIBUTES
Position
Connections/ Patterns
Size/ Color
REPRESENT RECORDS
EMPHASIZE THE MOST IMPORTANT ASPECTS OF THE DATA
Distribution of IP addresses running encrypted and unencrypted services
51,467,779 IPs running HTTP services (on port 80)
28,671,263 IPs running HTTPS services (on port 443)
16,519,503 IPs running both HTTP and HTTPS services
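The three counts on this slide can be combined with inclusion-exclusion to get the sizes of each region of the diagram. This sketch assumes that the HTTP and HTTPS totals each include the IPs that run both; the slide does not state this explicitly, and the function name is hypothetical.

```python
def split_overlap(total_a, total_b, both):
    """Return (only_a, only_b, either) for two overlapping sets.

    Assumes total_a and total_b each already include the `both` members.
    """
    only_a = total_a - both
    only_b = total_b - both
    either = total_a + total_b - both
    return only_a, only_b, either


http_ips = 51_467_779
https_ips = 28_671_263
both_ips = 16_519_503

http_only, https_only, either = split_overlap(http_ips, https_ips, both_ips)
print(f"HTTP only:     {http_only:,}")    # 34,948,276
print(f"HTTPS only:    {https_only:,}")   # 12,151,760
print(f"HTTP or HTTPS: {either:,}")       # 63,619,539

# Share of HTTP hosts that also answer on HTTPS, under the same assumption.
share = both_ips / http_ips
print(f"{share:.1%} of HTTP IPs also run HTTPS")
```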
39. Representation
PRECISION IN DESIGN
Geometric Calculations
Truncated axis
Scales
MAKE IT UNDERSTANDABLE
Reference lines
Markers
MAKE IT APPEALING
Minimise the clutter
Priority: preserve function
Top 10 Web Servers for the Web
Most common web servers found on port 80
[Bar chart, axis from 0 to 12 millions]
Apache httpd: 11,493,552
AkamaiGHost: 8,361,080
Microsoft IIS httpd: 4,843,769
nginx: 3,860,883
lighttpd: 2,031,741
Huawei HG532e ADSL modem http admin: 1,539,629
Microsoft HTTPAPI httpd: 952,300
Technicolor DSL modem http admin: 699,202
Mbedthis-Appweb: 694,393
micro_httpd: 678,657
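As a small illustration of the "precision in design" points, the sketch below draws a text bar chart whose bars always start at zero, so bar lengths stay proportional to the values (no truncated axis). It is a toy, not the tooling used for the slides; the function name is hypothetical, and pairing the first four server names with the first four counts assumes both lists on the slide are in the same order.

```python
def text_bar_chart(rows, width=40):
    """Render (label, value) rows as horizontal bars, largest first.

    Bars are scaled from zero, so their lengths are proportional to the values.
    """
    top = max(value for _, value in rows)
    label_width = max(len(label) for label, _ in rows)
    lines = []
    for label, value in sorted(rows, key=lambda r: r[1], reverse=True):
        bar = "#" * max(1, round(value / top * width))
        lines.append(f"{label.ljust(label_width)} | {bar} {value:,}")
    return "\n".join(lines)


# Assumed pairing of names and counts from the slide (see the note above).
web_servers = [
    ("nginx", 3_860_883),
    ("Apache httpd", 11_493_552),
    ("Microsoft IIS httpd", 4_843_769),
    ("AkamaiGHost", 8_361_080),
]

print(text_bar_chart(web_servers))
```

Keeping the value label next to each bar follows the "make it understandable" advice: the reader does not need to estimate numbers from the axis.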
40. Representation
Consider different design solutions
DATA TYPE
Hierarchical
Relational
Temporal
Spatial
Categorical
CONDITION
CVSS SCORES
CVSS: Common Vulnerability Scoring System
SEVERITY: LOW (0.0 to 4.0), MEDIUM (4.0 to 7.0), HIGH (7.0 to 10.0)
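The severity bands on this slide match the CVSS v2 qualitative ranges (Low 0.0 to 3.9, Medium 4.0 to 6.9, High 7.0 to 10.0). A minimal sketch of that mapping, with a hypothetical function name, is useful when colouring or grouping vulnerabilities for a chart:

```python
def cvss_v2_severity(score):
    """Map a CVSS v2 base score (0.0-10.0) to LOW, MEDIUM or HIGH."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    return "HIGH"


for example_score in (0.0, 3.9, 4.0, 6.9, 7.0, 10.0):
    print(example_score, cvss_v2_severity(example_score))
```

CVSS v3 adds None and Critical bands, so this mapping should not be reused for v3 scores without changes.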
42. Representation: Email Protocols
Overview of protocols used for email, according to encryption used
SERVICE: COUNT
UNENCRYPTED (12,234,969 in total)
POP3: 4,572,161
SMTP: 3,531,071
IMAP: 4,131,737
ENCRYPTED (10,416,812 in total)
POP3S: 3,742,289
SMTPS: 2,971,159
IMAPS: 3,703,364
43. Representation: Big Data Technologies
Changes in amount of data exposed without security
[Chart. Technologies: MongoDB, Memcached, Redis. Periods: Aug 2015, Jan 2016, July 2016. Scale marker: 2 TB. Values as labelled on the slide: 644.3 TB, 724.7 TB, 627.7 TB, 13.2 TB, 11.3 TB, 710.9 TB, 12.0 TB, 598.7 TB, 27.5 TB, 1.5 TB, 1.8 TB, 619.8 TB]
44. Representation: Heartbleed
Countries with higher number of IPs vulnerable to Heartbleed
United States: 23,649
China: 6,790
Germany: 6,382
France: 5,622
Russia: 5,264
Republic of Korea: 4,564
United Kingdom: 3,459
Netherlands: 2,779
Italy: 2,508
Japan: 2,484
45. VNC wordcloud
[Word cloud. Words shown: login, windows, edition, 2016, delete, ctrl, server, press, microsoft, system, welcome, your, help, file, linux, google, kernel, from, ubuntu]
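A word cloud like this one is driven by word frequencies. The sketch below counts words across OCR text extracted from screenshots, which is one plausible way to produce the input for such a cloud; the function name, the minimum word length and the three sample strings are hypothetical.

```python
import re
from collections import Counter


def word_frequencies(texts, min_length=3):
    """Count lower-cased alphanumeric words of at least min_length characters."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if len(word) >= min_length:
                counts[word] += 1
    return counts


# Made-up OCR output from three VNC screenshots.
ocr_texts = [
    "Press Ctrl+Alt+Delete to login",
    "Welcome to Ubuntu",
    "Press Ctrl Alt Delete",
]

frequencies = word_frequencies(ocr_texts)
print(frequencies["press"], frequencies["ubuntu"])  # 2 1
```

In a word cloud, each word's font size would then be scaled by its count.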
46. Details
ANNOTATION
Titles and subtitles
Labels
Legends
TYPOGRAPHY
Use fonts that are easy to read
Don’t use fonts that are considered sloppy
SSH Banners
Most common SSH Banners found
[Bar chart. Axes: banner, count]
Banners: SSH-2.0-OpenSSH_5.3, SSH-2.0-OpenSSH_6.6.1p1, SSH-2.0-OpenSSH_6.6.1, SSH-2.0-OpenSSH_4.3, SSH-2.0-OpenSSH_6.0p1, SSH-2.0-OpenSSH_6.7p1, SSH-2.0-dropbear_2014.63, SSH-2.0-OpenSSH_5.5p1, SSH-2.0-ROSSSH, SSH-2.0-OpenSSH_5.9p1
Counts shown: 202,361; 352,978; 436,700; 449,570; 462,616; 537,667; 555,779; 604,579; 1,501,749; 2,632,270
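Before banners like these can be charted, it often helps to split them into fields. The SSH identification string has the form `SSH-protoversion-softwareversion`, optionally followed by a space and comments (RFC 4253, section 4.2). The sketch below parses that form; splitting the software part on the first underscore into product and version is only a convention that happens to fit OpenSSH and dropbear, and the function name is hypothetical.

```python
from collections import Counter


def parse_ssh_banner(banner):
    """Split 'SSH-<proto>-<software>[ <comments>]' into its parts."""
    ident, _, comments = banner.strip().partition(" ")
    parts = ident.split("-", 2)
    if len(parts) != 3 or parts[0] != "SSH":
        raise ValueError(f"not an SSH identification string: {banner!r}")
    _, protocol, software = parts
    # Convention, not part of the RFC: '<product>_<version>'.
    product, _, version = software.partition("_")
    return {
        "protocol": protocol,
        "product": product,
        "version": version or None,
        "comments": comments or None,
    }


# Banners listed on the slide.
banners = [
    "SSH-2.0-OpenSSH_5.3", "SSH-2.0-OpenSSH_6.6.1p1", "SSH-2.0-OpenSSH_6.6.1",
    "SSH-2.0-OpenSSH_4.3", "SSH-2.0-OpenSSH_6.0p1", "SSH-2.0-OpenSSH_6.7p1",
    "SSH-2.0-dropbear_2014.63", "SSH-2.0-OpenSSH_5.5p1", "SSH-2.0-ROSSSH",
    "SSH-2.0-OpenSSH_5.9p1",
]

products = Counter(parse_ssh_banner(b)["product"] for b in banners)
print(products)
```

Grouping by product or major version gives shorter labels, which also helps with the readability points made on this slide.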
47. Details
[The ANNOTATION and TYPOGRAPHY points and the "Most common SSH Banners found" chart from slide 46 are shown again, with the same banners and counts.]
49. Tools
BALANCE
Automation
Programming Language to create plots
Fine tuning in Illustrator (make it better for the audience)
Hand-editing process
Human error
Originality
Automated Analysis
Illustrator (or other tool) to create visualization solution
Human error
50. Finishing up
DOCUMENT EVERY STEP OF THE PROCESS
Calculations
Choices of visualisations
Choices of data points
REVIEW EVERYTHING
What could have been done differently?
What could be better?
TAKE CONSTRUCTIVE FEEDBACK
Even if it means starting over
A visualization can be used in the future