Using Hadoop to Detect Security Risks and Predict Breaches
1. Caspida – Karthik Kannan
Caspida Inc.
Threat Detection Using Hadoop
KARTHIK KANNAN
FOUNDER, CMO
2. Caspida – Karthik Kannan
Title
Using Hadoop and Machine Learning to
Detect Security Risks and Vulnerabilities,
and Predict Breaches in your Enterprise
Environment
3. Caspida – Karthik Kannan
Topics
Security challenges
Today’s approaches – limitations
Why is security a Big Data problem?
Hadoop and ML in other industries
Security – with Hadoop and ML
Some examples
Where do you go from here?
4. Caspida – Karthik Kannan
Security
Unique use case that applies horizontally
Incident analysis
Anomaly detection
Queries at scale
Predetermined metrics
Needs to be dynamically self-learning
7. Caspida – Karthik Kannan
Verizon Data Breach Report
2014 DBIR data shows attackers are getting better and faster at what they
do, more quickly than organizations can address the threats.
http://www.verizonenterprise.com/DBIR/2014/?utm_source=ContextualAds&utm_medium=ResultLinks&utm_campaign=DBIR2014
8. Caspida – Karthik Kannan
Market Data
Courtesy: Mary Meeker, KPCB, Internet Trends Report, 2014
9. Caspida – Karthik Kannan
Current Methods Fail
Limited scale, manual, not dynamic
Signatures, rules, malware detection
10. Caspida – Karthik Kannan
Why is Security a Big Data Problem?
Variety of security events
− New sources, new relationships, new entities
Analysis sophistication
− Dynamic correlations, sequences, non-contiguous patterns
Context – time
− Months & years, not days
Good reference/reading:
11. Caspida – Karthik Kannan
The Right Tools for the Right Purpose
Firewall, IPS, malware, AV: protecting the perimeter and defending against known attacks (signatures)
Cloud security: discovering unauthorized use of SaaS/cloud apps and policy enablement for Shadow IT
SIEM: SIEMs collect data, use extensive human-generated rules, rely on manual analysis and provide static alerts
Lacking: dynamic, self-learning methods that are needed to detect sophisticated attacks
12. Caspida – Karthik Kannan
Security in the Technology Evolution
[Figure: axes are Time (1990, 2000, 2010, 2013) and Attack Sophistication]
Mobile ("There is an App for Everything!"): SMS, Phone, MMS, IM; Mobile App Stores, Mobile Device Mgmt (MDM), Mobile App Mgmt (MAM)
Cloud: Web Mail, CRM/ERP, SaaS Apps (Salesforce, …), Custom Apps/TestDev Clouds; SaaS Monitoring, SaaS Encryption
Desktop: Productivity Apps/Development/Test; Password Hashing, Antivirus, Anti-Malware SW, OS security layering, OS-level Sandboxing, Disk Encryption
Enterprise: Corporate Email, Finance Apps, Corporate Storage/Filers, Collaboration Tools/ECM, Cloud Apps; Firewalls, Multi-Factor Authentication, IDS, Antivirus, Malware Sandboxing, Threat Feeds, SIEM, VPN
Attack Types: Application-specific Attacks (Facebook wall, Browser), DDoS (Zombies etc.), Password Guessing, Filesystems / DBs Misconfigurations, Viruses, Malware/Spyware, Keyloggers, Sniffing, Polymorphic, APT, Botnets, Web App Attacks (XSS, etc.), Phishing
Attackers: Governments, Special Interest Groups
13. Caspida – Karthik Kannan
Stages of an attack
1 Research  2 Infiltrate  3 Capture  4 Exfiltrate  5 Marketplace
86% of enterprises focus on step 2 only
Studies show that companies save up to $4M/year when they have security intelligence systems that focus on all stages
14. Caspida – Karthik Kannan
ML + Statistical Models
Visualization
Models
Data Lake
Standard models: K-means, Random Forest, Nearest-neighbor, Gaussian, Bayesian etc.
Custom models: user patterns/behavior, time-oriented, data attribute-specific, SaaS, mobile
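As a rough illustration of one of the standard models named above, here is a minimal pure-Python K-means (Lloyd's algorithm) that groups users by behavior. The feature pairs, the `users` data and the choice to seed centroids with the first k points are assumptions made for this sketch, not something the slides specify.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; seeds centroids with the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignment

# Hypothetical per-user features: (logins per day, MB uploaded per day).
users = [(5, 20), (90, 4000), (6, 25), (4, 18), (95, 4200), (7, 22)]
centroids, labels = kmeans(users, k=2)
print(labels)
```

The features here are left unscaled, so the upload volume dominates the distance; in practice the features would likely be normalized first.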
15. Caspida – Karthik Kannan
Algorithms
Time Series Analysis
− Good when dealing with time series
− Examples:
Linear Regression
Parametric (ARIMA/FARIMA)
Forecasting: Holt-Winters
Classification Models
− Good for determining which category an item falls under
− Examples:
Logistic Regression
Decision Trees
Decision Tables
Neural Networks
K-Nearest Neighbors
Ensemble Models (Random Forests)
Grouping Models
− Used for finding global patterns at scale
− Examples:
K-Means Clustering
Random graph walks
Inference Models
− Important when trying to infer the value of a feature from its context
− Examples:
Association Rules
Bayesian Networks
Simplification Models
− Important when we need to decrease the number of features analyzed
− Examples:
Principal Component Analysis (PCA)
Low-Rank Approximation
Singular Value Decomposition (SVD)
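To make the time series family above concrete, the sketch below flags anomalies from one-step-ahead forecast errors. It uses simple exponential smoothing, which is the level-only building block of Holt-Winters and has no trend or seasonal terms, so it is a deliberately simpler stand-in for the methods listed. The `alpha` and `threshold` values and the `hourly_logins` series are assumptions chosen for illustration.

```python
def smoothing_anomalies(series, alpha=0.3, threshold=3.0):
    """Return indices whose forecast error exceeds threshold x the mean
    absolute error seen so far (simple exponential smoothing)."""
    level = series[0]          # current smoothed level = forecast for next point
    abs_errors = []            # history of absolute errors on normal points
    anomalies = []
    for t in range(1, len(series)):
        error = series[t] - level
        if abs_errors:
            mean_abs = sum(abs_errors) / len(abs_errors)
            if mean_abs > 0 and abs(error) > threshold * mean_abs:
                anomalies.append(t)
                continue       # do not let the outlier shift the baseline
        abs_errors.append(abs(error))
        level = alpha * series[t] + (1 - alpha) * level
    return anomalies

# Hypothetical hourly login counts with one spike.
hourly_logins = [100, 102, 98, 101, 99, 103, 100, 400, 101, 97]
print(smoothing_anomalies(hourly_logins))
```

With this series only index 7 (the jump to 400) is flagged. Data with daily or weekly cycles would need the seasonal terms that full Holt-Winters adds.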
16. Caspida – Karthik Kannan
Data Sources: Information Value Pyramid
Application Logs: lower volume; concentration of information; no need to decipher semantics of information; top-down view with correlation on important signals
Generic System Logs: OS logs on system events, processes’ health; need additional deciphering of information
Network Packets (L7): high volume of source data; can capture malware code for analysis; problems with encrypted traffic
Network Packets (L2-L4): high volume of source data; analysis only based on signatures and packet statistics
17. Caspida – Karthik Kannan
Advanced Persistent Threat (APT) Kill Chain
Phishing and Zero Day Attack: a handful of users are targeted by phishing attacks
Back Door: the user downloads the malware, which finds a back door to access the system
Lateral Movement: the attacker attempts to move to other systems and accounts by elevating privileges accordingly
Data Gathering: data is gathered from different systems and staged for exfiltration
Exfiltrate: data is sent out via multiple channels (encrypted over FTP, DNS back channels etc.)
18. Caspida – Karthik Kannan
Ideal Hadoop-based solution
Data Sources Data Lake Data Science
19. Caspida – Karthik Kannan
Machine Learning in Industries
eCommerce: identify shopper behavior and predict buying patterns, inventory planning, recommendations
− AggregateKnowledge
− RichRelevance
− Amazon
AdTech: identify mobile/online users, model their preferences, and render appropriate advertisements to the right audience
− AdMob (Google)
− MoPub (Twitter)
− Efficient Frontier (Adobe)
20. Caspida – Karthik Kannan
Types of Security Analytics
Breach
− Phishing attack
− DDoS attack
− Watering hole attack
Exploitation
− Lateral movement
− Domain account misuse
Exfiltration
− Privileged data leakage
− Anomalous login activity
Debilitation
− App or DB server load/activity patterns
− Web server patterns
Monitoring
− Metrics management
21. Caspida – Karthik Kannan
Data Sources & Analysis
# | Source | Information obtained
1 | Web server | Incoming, outgoing traffic, IP addresses, times, session durations
2 | Domain controllers | User IDs accessing specific IP addresses, times, durations
3 | IAM servers | Apps, servers, other protected services users are accessing, times, durations
4 | Content servers | Detailed transactional histories, customer account data, ACLs
5 | Messaging server events | Email stats, attachment info, external communications (IPs, frequencies)
+ correlations – across time and events to produce a network of related users, apps, servers and other critical services that may be affected by threats
+ machine learning algorithms – dynamic models driving automatic insights into malicious, external, APT, SaaS, mobile or network threats in repeatable fashion
+ search/queries – to sharpen insights and threat intelligence by drilling down into a desired dimension such as time window, geography, criticality etc.
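The correlation step above can be sketched as a time-window join: web server events carry only an IP and a time, so each one is attributed to whichever user held that IP at that moment according to the domain controller, and the result is a set of weighted user-to-server edges. The record layouts, field order and sample values below are assumptions for illustration, not a real log format.

```python
from collections import defaultdict

# Hypothetical domain controller sessions: (user_id, client_ip, start_ts, end_ts)
sessions = [
    ("UID1", "10.0.0.5", 1000, 2000),
    ("UID8", "10.0.0.9", 1500, 2500),
    ("UID1", "10.0.0.9", 3000, 4000),
]
# Hypothetical web server events: (client_ip, ts, server)
web_events = [
    ("10.0.0.5", 1200, "Svr1"),
    ("10.0.0.9", 1600, "Svr2"),
    ("10.0.0.9", 3500, "Svr2"),
    ("10.0.0.7", 1700, "Svr1"),  # no matching session
]

def correlate(sessions, web_events):
    """Attribute each web event to the user(s) holding that IP at that time."""
    by_ip = defaultdict(list)
    for user, ip, start, end in sessions:
        by_ip[ip].append((start, end, user))
    edges = defaultdict(int)   # (user, server) -> number of events
    unattributed = []          # events no session explains
    for ip, ts, server in web_events:
        users = [u for start, end, u in by_ip.get(ip, []) if start <= ts <= end]
        if not users:
            unattributed.append((ip, ts, server))
        for user in users:
            edges[(user, server)] += 1
    return dict(edges), unattributed

edges, unattributed = correlate(sessions, web_events)
print(edges)
print(unattributed)
```

Events that no session explains are kept rather than dropped, since traffic from an IP with no known user may itself be worth investigating.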
22. Caspida – Karthik Kannan
Anatomy of an attack
[Figure: attack analysis flow, steps 1 to 6, linking IPs (IP1, IP2), users (UID1 to UID9), servers (Svr1, Svr2), apps (App1, App2), databases (DB1, DB2) and FTP1]
IP Location
200.55.12.68 Brazil
58.202.85.1 China
220.12.98.41 US/SC
119.56.128.25 China
… …
Figure captions:
− Identification of suspicious IP originations, destinations
− IP addresses, geo-spatial information collection
− Network of correlations for suspected IPs; which users are accessing them the most?
− Identification of suspicious users
− Correlations of suspected users with apps, databases and other sensitive services
− Timeline of malicious behavior, e.g., sending emails or communicating with CnC
− Actions
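The pivot described on this slide, from suspected IPs to the users who contact them most and then to the services those users touch, might look like the following sketch. Which IPs count as suspected, the connection and access records, and the `top_n` cutoff are all hypothetical; the IP strings are reused from the slide's table only as sample values.

```python
from collections import Counter

suspected_ips = {"58.202.85.1", "119.56.128.25"}  # hypothetical selection

# Hypothetical connection records: (user_id, remote_ip)
connections = [
    ("UID1", "58.202.85.1"), ("UID8", "58.202.85.1"),
    ("UID8", "119.56.128.25"), ("UID8", "58.202.85.1"),
    ("UID3", "220.12.98.41"), ("UID1", "200.55.12.68"),
]
# Hypothetical internal access records: (user_id, service)
accesses = [
    ("UID8", "DB2"), ("UID8", "FTP1"), ("UID1", "App1"),
    ("UID3", "Svr1"), ("UID8", "DB2"),
]

def suspicious_users(connections, suspected_ips, top_n=2):
    """Rank users by how often they contact a suspected IP."""
    hits = Counter(user for user, ip in connections if ip in suspected_ips)
    return hits.most_common(top_n)

def services_touched(accesses, users):
    """Collect the services each listed user has accessed."""
    touched = {}
    for user, service in accesses:
        if user in users:
            touched.setdefault(user, set()).add(service)
    return touched

ranked = suspicious_users(connections, suspected_ips)
exposure = services_touched(accesses, {user for user, _ in ranked})
print(ranked)
print(exposure)
```

A timeline, as in the last step of the slide, would need timestamps on both record types; they are omitted here to keep the sketch short.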
23. Caspida – Karthik Kannan
Network traffic
Sources
Network Traffic: PCAP, Netflow
• Switches
• Routers
• Firewalls*
• IPS*
• Web gateways*
• Proxy server*
• Any other network device*
* optional
Behavioral threat models
• Traffic monitoring & analysis:
• Which IP is communicating with which external or internal destination
• Traffic volume, frequency
• Correlate with IAM (for user ID – IP mapping)
• Max traffic contributors – users, apps, IP addresses
• Correlate with Web server (for URL traffic analysis per user)
• Correlate with Messaging server (for email source/recipient analysis)
• Correlate with Firewall (for external traffic analysis per IP, user)
• Correlate with App, DB servers (for internal app transactional analysis)
Threat Intelligence
• External threat (e.g., bad IP address list) feeds
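Two of the traffic analyses listed above, finding the largest traffic contributors and matching destinations against an external bad-IP feed, can be sketched over flow records as follows. The flow tuple layout, the `ip_to_user` mapping standing in for an IAM lookup, and the feed contents are assumptions; the external addresses come from the reserved documentation ranges.

```python
from collections import Counter

# Hypothetical flow records: (src_ip, dst_ip, bytes)
flows = [
    ("10.0.0.5", "203.0.113.7", 5000),
    ("10.0.0.9", "198.51.100.2", 1200),
    ("10.0.0.5", "198.51.100.2", 300),
    ("10.0.0.7", "203.0.113.7", 50),
]
ip_to_user = {"10.0.0.5": "UID1", "10.0.0.9": "UID8"}  # stand-in for an IAM lookup
bad_ips = {"203.0.113.7"}                              # stand-in for a threat feed

def top_contributors(flows, ip_to_user, n=2):
    """Sum bytes per user, falling back to the raw IP when no user is known."""
    volume = Counter()
    for src, _dst, nbytes in flows:
        volume[ip_to_user.get(src, src)] += nbytes
    return volume.most_common(n)

def feed_matches(flows, bad_ips):
    """List (source, destination) pairs whose destination is on the feed."""
    return [(src, dst) for src, dst, _ in flows if dst in bad_ips]

print(top_contributors(flows, ip_to_user))
print(feed_matches(flows, bad_ips))
```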
24. Caspida – Karthik Kannan
Examples
Ground-speed violations
− detect user logins that are geographically far apart but fall within seconds/minutes of each other
Lateral movement
− accounts moving from one server/device to another to explore and list content on each location before deciding which to exfiltrate
Domain admin creations
− automatic creation of admin accounts by a spurious account; e.g., r00t, adm1n etc.
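The ground-speed check from the first example can be sketched by computing the great-circle distance between consecutive logins of the same user and dividing by the elapsed time. The login tuple layout, the sample coordinates and the 1000 km/h limit (roughly airliner speed) are assumptions for illustration; real deployments would also have to allow for VPNs and imprecise IP geolocation.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def ground_speed_violations(logins, max_kmh=1000.0):
    """logins: (user, ts_seconds, lat, lon). Returns (user, prev_ts, ts)
    for consecutive logins that imply travel faster than max_kmh."""
    violations = []
    last = {}  # user -> (ts, lat, lon) of the previous login
    for user, ts, lat, lon in sorted(logins, key=lambda e: (e[0], e[1])):
        if user in last:
            prev_ts, prev_lat, prev_lon = last[user]
            hours = (ts - prev_ts) / 3600
            km = haversine_km(prev_lat, prev_lon, lat, lon)
            if km > 0 and (hours == 0 or km / hours > max_kmh):
                violations.append((user, prev_ts, ts))
        last[user] = (ts, lat, lon)
    return violations

# Hypothetical logins: (user, seconds since start, latitude, longitude)
logins = [
    ("UID1", 0, 37.77, -122.42),      # San Francisco
    ("UID1", 600, -23.55, -46.63),    # Sao Paulo, ten minutes later
    ("UID2", 0, 37.77, -122.42),
    ("UID2", 300, 37.77, -122.42),    # same place
    ("UID3", 0, 37.77, -122.42),
    ("UID3", 21600, 40.71, -74.01),   # New York, six hours later
]
print(ground_speed_violations(logins))
```

Only UID1 is reported: the six-hour San Francisco to New York gap for UID3 stays under the assumed speed limit.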
25. Caspida – Karthik Kannan
Where do you start?
Need a data lake
Analytics:
− ML
− Statistical
Actions