Using Data Science for Cybersecurity

Anirudh Kondaveeti, Principal Data Scientist, Pivotal
Jeff Kelly, Principal Product Marketing Manager, Pivotal

Today’s Speakers
Anirudh Kondaveeti
Principal Data Scientist, Pivotal
Jeff Kelly
Product Marketing, Pivotal
Moderator Presenter

●  Cybercrime costs average US
enterprise $17m per year*
●  Cost grew at 15% CAGR over last three
years
●  Any given cybercrime can cost
significantly more
●  Target’s 2014 hack cost company
approximately $162m
●  Costs not just financial, also reputational
Cost of Cybercrime on
the Rise
*Source: 2016 Cost of Cyber Crime Study & the Risk of Business Innovation,
Ponemon Institute

●  Amateur hackers giving way to
professionals
●  Developing new, more sophisticated,
methods
●  Professional hackers make their
services available for a fee
●  Costs to commit cybercrime dropping
●  Average subscription fee for a one hour/month DDoS package is roughly $38*
Hackers Growing More
Sophisticated
*Source: Q2 2015 Global DDoS Threat Landscape, Incapsula

●  Defending the perimeter no longer
enough
●  No 100%, fool-proof way to keep bad
actors out
●  Some threats come from within
●  The idea of a perimeter becoming
obsolete with mobile, cloud, IoT
●  Need better methods for threat
detection inside the network
Perimeter Defense
Inadequate

Data Science for Cybersecurity

Security must move beyond signature-based matching
•  Necessary defense direction: Find the Unknown
•  Need an advanced platform: Security is a Big Data problem
•  Multiple decentralized sources of traditional or unconventional data
•  Need a platform for better BI, reporting, and cross-source correlation
•  Develop intelligence: Security is an Advanced Analytics problem
[Figure: security analytics approaches, labeled BI and compliance-driven, investigation-driven, behavior-metrics, and data-science driven]
Background

Advanced Persistent Threat (APT)
1. Phishing and Zero Day Attack: A handful of users are targeted by two phishing attacks; one user opens a zero-day payload (CVE-2011-0609).
2. Back Door: The user's machine is accessed remotely with the Poison Ivy tool.
3. Lateral Movement: The attacker elevates access to important user, service, and admin accounts, and to specific systems.
4. Data Gathering: Data is acquired from target servers and staged for exfiltration.
5. Exfiltrate: Data is exfiltrated via encrypted files over FTP to an external, compromised machine at a hosting provider.
APT Kill Chain

What: Identify anomalous user-level access to hosts
How: Look at People & Machines
•  Users (User Behavior Models)
•  Network, Servers (User Peer Models)
Scenarios:
Network reconnaissance from remote adversary on hijacked device
Ill-intentioned activities by legitimate employee
Access policy abuse
Business values:
Immediate security alert generation
Enhanced SIEM alert queue prioritization
Focused monitoring
Future integration with other analytic models for 360° attack view
Lateral Movement Detection

[Figure: flow diagram. Data sources: logs, Active Directory activity, Active Directory metadata, LDAP activity, server information (structured and semi-structured, via external tables). Platform labels: Data Computing Appliance, Greenplum, DIA. Models: regression-based model, cluster-based model, recommendation-system-based model, user behavioral model. Output: anomalous users.]
Lateral Movement Detection (LMD) – Flow Diagram

Model to identify users with unusual
variation in the number of servers
accessed over time
Build a regression model for each user
(Y = aX + b)
No. of servers accessed each week (Y)
~ Week Index (X)
Find the slope of the regression line for
each user (a)
Identify users who have a high positive
or negative slope to find users with
unusual activity
[Figure: regression plot of the number of servers for a user. X-axis: week of the year. Y-axis: number of servers.]
Regression-Based Model
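
The per-user regression described on this slide can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the function names, the sample counts, and the slope threshold of 1.0 are all assumptions chosen for the example.

```python
def weekly_slope(server_counts):
    """Least-squares slope a of Y = aX + b, with X = week index (1, 2, ...)."""
    n = len(server_counts)
    weeks = range(1, n + 1)
    mean_x = sum(weeks) / n
    mean_y = sum(server_counts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, server_counts))
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    return sxy / sxx


def flag_unusual_users(counts_by_user, threshold):
    """Return all slopes plus the users whose |slope| exceeds the threshold."""
    slopes = {user: weekly_slope(c) for user, c in counts_by_user.items()}
    flagged = sorted(u for u, a in slopes.items() if abs(a) > threshold)
    return slopes, flagged


# Hypothetical weekly counts of distinct servers accessed per user.
counts_by_user = {
    "alice": [3, 3, 4, 3, 3, 4],     # roughly flat
    "bob": [2, 4, 7, 9, 12, 15],     # steadily rising
    "carol": [14, 12, 9, 7, 4, 2],   # steadily falling
}
slopes, flagged = flag_unusual_users(counts_by_user, threshold=1.0)
print(flagged)
```

In practice the threshold would likely be set relative to the distribution of slopes across all users rather than fixed by hand.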

Build historical behavioral profile for each user
based on following features:
•  Servers accessed
•  IP addresses logged in from
•  Geographical information of login
Models stress individual user/job log-in
frequency
Multiple Feature Generations reduce false
alarms:
•  Aggregate servers to respective server group
•  Incorporate server criticality
•  Assign more weight to less popular servers and IP
addresses
•  E.g. print servers are low-weighted
•  Use recommendation engine to suggest servers to users
based on job roles and peers
[Figure: usage across servers s1 to s10. A user who typically uses only a few servers begins logging into a lot of new servers.]
User Behavior Models (UBM)
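
One way to give less popular servers more weight is an inverse-popularity weight, sketched below. The slide does not say which weighting formula was used; the logarithmic form here (similar to inverse document frequency) is an assumption, and the server and user names are hypothetical.

```python
import math


def server_weights(access_by_user):
    """Weight each server by log(total users / users who access it)."""
    n_users = len(access_by_user)
    popularity = {}
    for servers in access_by_user.values():
        for server in set(servers):
            popularity[server] = popularity.get(server, 0) + 1
    return {s: math.log(n_users / count) for s, count in popularity.items()}


def weighted_profile(login_counts, weights):
    """Scale a user's per-server login counts by the server weights."""
    return {server: n * weights[server] for server, n in login_counts.items()}


# Hypothetical access history: everyone uses the print server.
access_by_user = {
    "alice": ["print01", "web01"],
    "bob": ["print01", "db01"],
    "carol": ["print01", "web01"],
    "dave": ["print01"],
}
weights = server_weights(access_by_user)
profile = weighted_profile({"print01": 5, "db01": 1}, weights)
print(profile)
```

With this choice, a server used by every user (such as the print server) gets weight zero, so frequent logins to it contribute nothing to the profile, while a single login to a rarely used server stands out.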

            Week1  Week2  ...  Week10  Week11  ...  Week15
server1       2      3           1       0      .      0
server2       4      7           1       3      .      7
server3       0      2           0       0      .      0
.             .      .           .       .      .      .
server25      1      3           5       8      .      1
Column groups: PCA model built per user (training data); testing data
A user behavior matrix is created using ‘x’ weeks of history for a user. The current week is used as test data.
PCA is a dimensionality reduction technique that captures the components of a set of multidimensional vectors that account for most of the variance.
Principal dimensions are calculated from the training data.
Principal Component Analysis (PCA) Scoring

Reconstruction Error
1. Run PCA on the training data (user behavior matrix) to obtain the principal dimensions.
2. Project the test vector (user data for the new week) onto the principal dimensions.
3. Reconstruct the test vector from that projection.
4. Anomaly score: the difference between the two vectors (test vector and reconstructed test vector).
Ref: A Lakhina, M Crovella, C Diot, Diagnosing network-wide traffic anomalies
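
The reconstruction-error scoring on this slide can be sketched with NumPy as below. This is an illustration under assumptions: rows are weeks and columns are servers (the transpose of the matrix shown on the previous slide), the number of retained components is picked by hand, and the score is the Euclidean norm of the difference. The data is made up.

```python
import numpy as np


def pca_reconstruction_error(train, test_vector, n_components):
    """Anomaly score: distance between a test vector and its PCA reconstruction."""
    train = np.asarray(train, dtype=float)
    test_vector = np.asarray(test_vector, dtype=float)
    mean = train.mean(axis=0)
    # Rows of vt are the principal dimensions of the centered training data.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    components = vt[:n_components]
    centered = test_vector - mean
    projected = components @ centered          # coordinates in PCA space
    reconstructed = components.T @ projected   # back in server space
    return float(np.linalg.norm(centered - reconstructed))


# Hypothetical history: the user only ever touches the first two servers.
train = [
    [2, 4, 0, 0],
    [3, 7, 0, 0],
    [1, 1, 0, 0],
    [0, 3, 0, 0],
    [2, 5, 0, 0],
    [4, 6, 0, 0],
]
normal_score = pca_reconstruction_error(train, [3, 5, 0, 0], n_components=2)
anomalous_score = pca_reconstruction_error(train, [3, 5, 4, 3], n_components=2)
print(normal_score, anomalous_score)
```

A week that stays within the user's usual servers reconstructs almost perfectly, while logins to servers outside the learned subspace show up as residual and raise the score.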

Oversampling PCA
Reference and Image Source: YR Yeh, ZY Lee, YJ Lee, Anomaly Detection via Over-sampling Principal Component Analysis
•  Run PCA on the training data (user behavior matrix) to obtain the first principal vector.
•  Run PCA on the training data plus the oversampled test data to obtain the first principal vector after oversampling.
•  Anomaly score: the difference in angle between the two vectors.
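
A minimal sketch of the oversampling idea follows, assuming the test vector is simply duplicated a fixed number of times and the score is one minus the absolute cosine between the two first principal vectors. The duplication count and the sample data are assumptions for illustration; the cited paper describes the actual method.

```python
import numpy as np


def first_principal_vector(data):
    """Unit vector along the direction of greatest variance."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def oversampling_pca_score(train, test_vector, copies):
    """Score how far duplicated copies of the test vector rotate the first PC."""
    train = np.asarray(train, dtype=float)
    test_vector = np.asarray(test_vector, dtype=float)
    before = first_principal_vector(train)
    oversampled = np.vstack([train, np.tile(test_vector, (copies, 1))])
    after = first_principal_vector(oversampled)
    # abs() because the sign of a principal vector is arbitrary.
    cosine = min(abs(float(before @ after)), 1.0)
    return 1.0 - cosine


# Hypothetical training data lying along the direction (1, 1).
train = [[t, t] for t in range(10)]
normal_score = oversampling_pca_score(train, [5, 5], copies=5)
anomalous_score = oversampling_pca_score(train, [0, 20], copies=5)
print(normal_score, anomalous_score)
```

A test vector consistent with the training data leaves the first principal vector unchanged, so its score is near zero; an outlying vector pulls the direction toward itself and scores higher.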

R Code to find the Principal Components (using SVD)
SQL & R
[Figure: data for User1 through User5, each used to fit a separate model (User1 model through User5 model)]
PL/R wrapper over the R code to run in parallel
Parallelized PCA using PL/R
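
The "one model per user, fitted in parallel" pattern can be illustrated outside the database as below. This is a Python stand-in, not the PL/R code from the talk: the thread pool only mimics the per-user fan-out that the slide attributes to the PL/R wrapper, and the user names and matrices are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def principal_components(matrix):
    """Principal components of one user's matrix, computed via SVD."""
    data = np.asarray(matrix, dtype=float)
    centered = data - data.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    return vt, singular_values


def fit_per_user(data_by_user, max_workers=4):
    """Fit an independent PCA model for every user concurrently."""
    users = list(data_by_user)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(principal_components,
                                (data_by_user[u] for u in users)))
    return dict(zip(users, results))


# Hypothetical per-user matrices (rows = weeks, columns = servers).
data_by_user = {
    "user1": [[2, 4, 0], [3, 7, 0], [1, 1, 0], [0, 3, 0]],
    "user2": [[0, 0, 1], [0, 0, 5], [0, 0, 3], [0, 0, 7]],
}
models = fit_per_user(data_by_user)
print(sorted(models))
```

Because each user's model depends only on that user's data, the work partitions cleanly, which is what makes the per-user approach easy to parallelize.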

Users rate items
To recommend items to a particular user A
•  Find other users U similar to A
•  Identify the set of items I accessed by U
•  Recommend these items I to A
Users = Employees
Items = Servers accessed
Image Source: http://dataconomy.com/2015/03/an-introduction-to-recommendation-engines/
Recommendation System-Based Model
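
The three recommendation steps above, with employees as users and servers as items, can be sketched as follows. Jaccard similarity over server sets and a fixed number of peers are assumptions made for the example; the slide does not specify the similarity measure, and the names are hypothetical.

```python
def jaccard(a, b):
    """Similarity of two server sets: size of overlap over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_servers(target, servers_by_user, k):
    """Recommend servers used by the k most similar peers but not by target."""
    own = servers_by_user[target]
    peers = sorted(
        (u for u in servers_by_user if u != target),
        key=lambda u: jaccard(own, servers_by_user[u]),
        reverse=True,
    )[:k]
    recommended = set()
    for peer in peers:
        recommended |= servers_by_user[peer]
    return recommended - own


servers_by_user = {
    "alice": {"web01", "web02", "db01"},
    "bob": {"web01", "web02", "db01", "db02"},
    "carol": {"web01", "web02", "cache01"},
    "dave": {"hr01", "hr02"},
}
recommended = recommend_servers("alice", servers_by_user, k=2)
print(sorted(recommended))
```

A first-time login to a recommended server is then less surprising than a login to a server that no similar peer uses.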

•  Historical profile for each user based on number of days per week for a particular server, weighted by recommendations
•  AD logs, LDAP data (job title, dept, etc.)
•  Heat map (top figure)
–  X-axis: week index
–  Y-axis: server
–  Value: number of days per week, weighted by recommendations
•  Outlier plot (bottom figure)
–  X-axis: week index
–  Y-axis: outlier score
[Figure: heat maps over server groups g1 to g5, before and after recommendations. Servers g3 and g4 are recommended, hence their weight is decreased. The outlier score in the test week decreases because the new servers that the user accesses are recommended for the user's job profile.]
Recommendation System-Based Model

Using historical Windows event data to
build graphs* of typical user behavior
•  Which machines does the user log into?
•  Which machines does the user log in from?
•  How often?
•  In which order?
Ask if this behavior is typical
•  Is it typical for this user?
•  Is it typical for someone in a particular department?
•  Is this typical for someone in the user’s job role?
Graph models are sensitive to direction,
order, and frequency
[Figure: two login graphs over hosts 34.23.123.4, 34.23.123.51, 34.23.1.1, 34.23.0.1, and 34.23.2.8, one labeled Typical Behavior and one labeled Anomalous Behavior; one host is a DB with financial information]
*Reference: Alexander D. Kent, Lorie M. Liebrock, Joshua C. Neil. Authentication graphs: Analyzing user behavior within an enterprise network.
Graph Model
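
A minimal sketch of a per-user login graph follows, assuming events arrive as (user, source host, destination host) tuples. It covers direction and frequency only; login order would need sequence information that this sketch does not model. Host names and events are hypothetical.

```python
from collections import Counter


def build_login_graph(events):
    """Count directed (source, destination) logins for each user."""
    graph = {}
    for user, src, dst in events:
        graph.setdefault(user, Counter())[(src, dst)] += 1
    return graph


def unseen_edges(graph, user, session):
    """Edges in a new session that never appear in the user's history."""
    known = graph.get(user, Counter())
    return [edge for edge in session if known[edge] == 0]


history = [
    ("alice", "laptop", "jump01"),
    ("alice", "laptop", "jump01"),
    ("alice", "laptop", "jump01"),
    ("alice", "jump01", "web01"),
    ("alice", "jump01", "web01"),
]
graph = build_login_graph(history)

new_session = [("laptop", "jump01"), ("jump01", "findb01")]
suspicious = unseen_edges(graph, "alice", new_session)
print(suspicious)
```

Because edges are directed, a login from `jump01` to `laptop` counts as unseen even though the reverse direction is routine. The same structure could be built per department or job role to answer the peer-level questions on the slide.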

Challenge:
•  Cybersecurity threats, data privacy and data protection issues, and
fraudulent behavior go undetected, leaving the customer vulnerable to
security risks and financial loss
•  Need to gain timely insight into unusual/suspicious internal behavior
to allow for proper action
•  Tools in place cannot be customized to leverage historical security
data and allow for predictive analytics
Solution:
•  Leveraged Data Science to show use cases analyzing their active
directory data, identifying fraud, unapproved file sharing, etc.
•  Utilized Big Data Suite, specifically Greenplum + MADlib + R to
store and analyze data with potential to build out Hadoop data lake
with HDB (aka HAWQ)
Pivotal Solution includes: Pivotal
Greenplum, Pivotal HDB, Apache MADlib
Fortune 100 Companies Leverage Pivotal to Tackle
Enterprise-wide Security Risks with Analytics

•  Pivotal Data Science expertise and partnership with customers to
identify high-value use cases to solve and build data science center
of excellence for security analytics
•  Tight integration to Analytical Tools that run in-database and
across all of the data, to cover the most possible use cases
•  Scalable Solution that can grow as data needs grow, leveraging
commodity hardware to keep costs low as data volume increases
•  Join key Pivotal customers in the Security Advisory Council for
collaboration and knowledge sharing
Why Pivotal for Security Analytics

Additional Resources
& Next Steps
Read: Pivotal Data Science Blog
https://blog.pivotal.io/channels/data-science-pivotal
Strategic: Pivotal Data Science Analytics Roadmapping
Engagement https://pivotal.io/contact
Tune in: Next data science webinar: “Using Data
Science to Detect Healthcare Fraud, Waste, and
Abuse,” March 14, 2017
https://pivotal.io/resources/1/webinars
Hands on:
Pivotal Greenplum Sandbox
https://network.pivotal.io/products/pivotal-gpdb
Apache MADlib (incubating)
http://madlib.incubator.apache.org/

Questions?
