Using Hadoop to Detect Security Risks and Predict Breaches

Caspida Inc. – Karthik Kannan
Threat Detection Using
Hadoop
KARTHIK KANNAN
FOUNDER, CMO

Title
 Using Hadoop and Machine Learning to
Detect Security Risks and Vulnerabilities,
and Predict Breaches in your Enterprise
Environment

Topics
 Security challenges
 Today’s approaches – limitations
 Why is security a Big Data problem?
 Hadoop and ML in other industries
 Security – with Hadoop and ML
 Some examples
 Where do you go from here?

Security
Unique use case that applies horizontally
 Incident analysis
 Anomaly detection
 Queries at scale
 Predetermined metrics
Needs to be dynamically self-learning

Today's Security Challenges
 Target credit card breaches
 Snowden insider attack
 RSA security breach
 Twitter hacking

CIO Survey: Top Concerns

Verizon Data Breach Report
2014 DBIR data shows attackers are getting better and faster at what they
do, more quickly than organizations can address the threats.
http://www.verizonenterprise.com/DBIR/2014/?utm_source=ContextualAds&utm_medium=ResultLinks&utm_campaign=DBIR2014

Market Data
Courtesy: Mary Meeker, KPCB, Internet Trends Report, 2014

Current Methods Fail
Limited scale, manual, not dynamic
 Signatures
 Rules
 Malware detection

Why is Security a Big Data Problem?
 Variety of security events
− New sources, new relationships, new entities
 Analysis sophistication
− Dynamic correlations, sequences, non-contiguous patterns
 Context – time
− Months & years, not days
Good reference/reading:

The Right Tools for the Right Purpose
 Firewall, IPS, malware, AV
− Protecting the perimeter and defending against known attacks (signatures)
 Cloud security
− Discovering unauthorized use of SaaS/cloud apps and policy enablement for Shadow IT
 SIEM
− SIEMs collect data, use extensive human-generated rules, rely on manual analysis and provide static alerts
Still lacking: the dynamic, self-learning methods needed to detect sophisticated attacks

Security in the Technology Evolution
[Figure: technology layers with their applications and security controls, plotted against Time (1990, 2000, 2010, 2013) and Attack Sophistication]
 Mobile (There is an App for Everything!)
− SMS, Phone, MMS, IM, Mobile App Stores
− Mobile Device Mgmt (MDM), Mobile App Mgmt (MAM)
 Cloud
− SaaS Monitoring, SaaS Encryption
− Web Mail, CRM/ERP, SaaS Apps (Salesforce, …), Custom Apps/TestDev Clouds
 Desktop
− Password Hashing, Antivirus, Anti-Malware SW, OS security layering, OS-level Sandboxing, Disk Encryption
− Productivity Apps/Development/Test
 Enterprise
− Firewalls, Multi-Factor Authentication, IDS, Antivirus, Malware Sandboxing, Threat Feeds, SIEM, VPN
− Corporate Email, Finance Apps, Corporate Storage/Filers, Collaboration Tools/ECM, Cloud Apps
 Attackers
− Governments
− Special Interest Groups
 Attack Types
− Application-specific Attacks (Facebook wall, Browser)
− DDoS (Zombies etc.)
− Password Guessing
− Filesystems / DBs
− Misconfigurations
− Viruses
− Malware/Spyware
− Keyloggers
− Sniffing
− Polymorphic
− APT
− Botnets
− Web App Attacks (XSS, etc.)
− Phishing

Stages of an attack
1 Research
2 Infiltrate
3 Capture
4 Exfiltrate
5 Marketplace
86% of enterprises focus on step 2 only
Studies show that companies save up to $4M/year when they have security intelligence systems that focus on all stages

ML + Statistical Models
Visualization
Models
Data Lake
Standard models: K-means, Random Forest, Nearest-neighbor, Gaussian, Bayesian etc.
Custom models: user patterns/behavior, time-oriented, data attribute-specific, SaaS, mobile
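As one illustration of a "Gaussian" standard model applied to user behavior, the following is a minimal sketch and not the presenter's implementation. It assumes a per-user history of some daily count (the name `gaussian_anomalies`, the threshold of 3 standard deviations and the sample numbers are all hypothetical) and flags new values that fall far outside that user's own baseline.

```python
from statistics import mean, stdev

def gaussian_anomalies(history, recent, threshold=3.0):
    """Return the values in `recent` that lie more than `threshold`
    standard deviations from the mean of `history`.

    `history` needs at least two observations so that a sample
    standard deviation can be computed.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: anything different is unusual.
        return [x for x in recent if x != mu]
    return [x for x in recent if abs(x - mu) / sigma > threshold]

# Hypothetical data: files downloaded per day by a single user.
baseline = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15]
today = [14, 95]
print(gaussian_anomalies(baseline, today))  # only the value 95 is flagged
```

A per-user baseline like this is the simplest form of "self-learning": the threshold adapts to each user's history instead of being a fixed, hand-written rule.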

Algorithms
 Time Series Analysis
− Good when dealing with time series
− Examples:
 Linear Regression
 Parametric (ARIMA/FARIMA)
 Forecasting: Holt-Winters
 Classification Models
− Good for determining which category an item falls under
− Examples:
 Logistic Regression
 Decision Trees
 Decision Tables
 Neural Networks
 K-Nearest Neighbors
 Ensemble Models (Random Forests)
 Grouping Models
− Used for finding global patterns at scale
− Examples:
 K-Means Clustering
 Random graph walks
 Inference Models
− Important when trying to infer the value of a feature from its context
− Examples:
 Association Rules
 Bayesian Networks
 Simplification Models
− Important when we need to decrease the number of features analyzed
− Examples:
 Principal Component Analysis (PCA)
 Low-Rank Approximation
 Singular Value Decomposition (SVD)
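To make the "Grouping Models" entry concrete, here is a small K-Means sketch in plain Python. It is illustrative only: the feature choice (logins per day, distinct hosts touched), the function names and the sample points are assumptions, and a production system on Hadoop would use a distributed library rather than this in-memory loop.

```python
import math
import random

def assign(points, centroids):
    """Group each point under the index of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, iterations=20, seed=0):
    """Plain K-Means: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    for _ in range(iterations):
        clusters = assign(points, centroids)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append(tuple(sum(c) / len(members)
                                     for c in zip(*members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical features per account: (logins per day, distinct hosts touched).
accounts = [(1, 1), (2, 1), (1, 2), (2, 2), (20, 30), (21, 31), (22, 30)]
centroids, clusters = kmeans(accounts, k=2)
for c, members in zip(centroids, clusters):
    print(c, members)
```

On this toy data the accounts separate into a low-activity group and a small high-activity group; the small group is the one an analyst would look at first.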

Data Sources: Information Value Pyramid
[Figure: pyramid of data sources, listed here from top to base]
 Application Logs
− Lower volume; concentration of information
− No need to decipher semantics of information
− Top-down view with correlation on important signals
 Generic System Logs
− OS logs on system events, processes’ health
− Need additional deciphering of information
 Network Packets: L7
− High volume of source data
− Can capture malware code for analysis
− Problems with encrypted traffic
 Network Packets: L2-L4
− High volume of source data
− Analysis only based on signatures and packet statistics

Advanced Persistent Threat (APT) Kill Chain
 Phishing and Zero Day Attack
− A handful of users are targeted by phishing attacks
 Back Door
− The user downloads the malware, which finds a back door to access the system
 Lateral Movement
− The attacker attempts to move to other systems and accounts by elevating privileges accordingly
 Data Gathering
− Data is gathered from different systems and staged for exfiltration
 Exfiltrate
− Data is sent out via multiple channels (encrypted over FTP, DNS back channels etc.)
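The lateral movement stage lends itself to a simple behavioral check. The sketch below is an assumption-laden illustration, not the deck's method: it takes authentication events as `(timestamp_seconds, account, host)` tuples (a hypothetical schema) and flags accounts that touch more than a chosen number of distinct hosts inside a sliding time window.

```python
from collections import defaultdict, deque

def lateral_movement_suspects(events, window_seconds=3600, max_hosts=3):
    """Return accounts that accessed more than `max_hosts` distinct hosts
    within any `window_seconds` span.

    `events` must be sorted by timestamp; each item is
    (timestamp_seconds, account, host).
    """
    recent = defaultdict(deque)  # account -> recent (timestamp, host) pairs
    suspects = set()
    for ts, account, host in events:
        window = recent[account]
        window.append((ts, host))
        # Drop events that have fallen out of the time window.
        while window and ts - window[0][0] > window_seconds:
            window.popleft()
        if len({h for _, h in window}) > max_hosts:
            suspects.add(account)
    return suspects

# Hypothetical events: one service account hops across five hosts in minutes.
events = [
    (0,   "alice",      "Svr1"),
    (60,  "svc_backup", "Svr1"),
    (120, "svc_backup", "Svr2"),
    (180, "svc_backup", "App1"),
    (240, "svc_backup", "DB1"),
    (300, "alice",      "App1"),
    (360, "svc_backup", "FTP1"),
]
print(lateral_movement_suspects(events))
```

The fixed `max_hosts` threshold is the weakest part of this sketch; in the spirit of the talk it would be replaced by a learned per-account baseline.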

Ideal Hadoop-based solution
Data Sources -> Data Lake -> Data Science

Machine Learning in Industries
 eCommerce: identify shopper behavior and predict buying patterns, inventory planning, recommendations
− AggregateKnowledge
− RichRelevance
− Amazon
 AdTech: identify mobile/online users, model their preferences, and render appropriate advertisements to the right audience
− AdMob (Google)
− MoPub (Twitter)
− Efficient Frontier (Adobe)

Types of Security Analytics
 Breach
− Phishing attack
− DDoS attack
− Watering hole attack
 Exploitation
− Lateral movement
− Domain account misuse
 Exfiltration
− Privileged data leakage
− Anomalous login activity
 Debilitation
− App or DB server load/activity patterns
− Web server patterns
 Monitoring
− Metrics management

Data Sources & Analysis
#  Source                   Information obtained
1  Web server               Incoming, outgoing traffic, IP addresses, times, session durations
2  Domain controllers       User IDs accessing specific IP addresses, times, durations
3  IAM servers              Apps, servers, other protected services users are accessing, times, durations
4  Content servers          Detailed transactional histories, customer account data, ACLs
5  Messaging server events  Email stats, attachment info, external communications (IPs, frequencies)
+ correlations – across time and events to produce network of related users, apps, servers and other critical services that may be affected by threats
+ machine learning algorithms – dynamic models driving automatic insights into malicious, external, APT, SaaS, mobile or network threats in repeatable fashion
+ search/queries – to sharpen insights and threat intelligence by drilling down into desired dimension such as time window, geography, criticality etc.
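The "network of related users, apps, servers" produced by correlation can be pictured as a graph walk. The following is a minimal sketch under stated assumptions: each source has already been reduced to pairs of related entities (the `kind:name` labels, the sample pairs and both function names are hypothetical), and entities within a few hops of a suspicious starting point are treated as potentially affected.

```python
from collections import defaultdict, deque

def build_graph(pairs):
    """Build an undirected graph from (entity, entity) pairs drawn from
    any of the sources (web server, domain controller, IAM, ...)."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def related_entities(graph, start, max_hops=2):
    """Breadth-first search: map each entity reachable from `start`
    within `max_hops` to its hop distance."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    del hops[start]
    return hops

# Hypothetical correlated events from several sources.
pairs = [
    ("user:UID1", "ip:203.0.113.7"),  # domain controller: user reached this IP
    ("ip:203.0.113.7", "app:App1"),   # web server: IP talked to this app
    ("user:UID1", "app:crm"),         # IAM: user accessed this service
    ("user:UID8", "app:App1"),        # IAM
    ("user:UID8", "db:DB2"),          # content server
]
graph = build_graph(pairs)
print(related_entities(graph, "ip:203.0.113.7", max_hops=2))
```

Raising `max_hops` widens the blast radius considered, at the cost of pulling in more unrelated entities.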

Anatomy of an attack
[Figure: six-step analysis linking IP addresses (IP1, IP2), users (UID1–UID9), servers (Svr1, Svr2), apps (App1, App2), databases (DB1, DB2) and FTP1]
IP              Location
200.55.12.68    Brazil
58.202.85.1     China
220.12.98.41    US/SC
119.56.128.25   China
…               …
1 IP addresses, geo-spatial information collection
2 Identification of suspicious IP originations, destinations
3 Network of correlations for suspected IPs; which users are accessing them the most?
4 Identification of suspicious users
5 Correlations of suspected users with apps, databases and other sensitive services
6 Timeline of malicious behavior, e.g., sending emails or communicating with CnC
Actions

Network traffic
Sources
Network traffic: PCAP, Netflow
• Switches
• Routers
• Firewalls*
• IPS*
• Web gateways*
• Proxy server*
• Any other network device*
* optional
Traffic monitoring & analysis
• Which IP is communicating with which external or internal destination
• Traffic volume, frequency
• Correlate with IAM (for user ID – IP mapping)
• Max traffic contributors – users, apps, IP addresses
• Correlate with Web server (for URL traffic analysis per user)
• Correlate with Messaging server (for email source/recipient analysis)
• Correlate with Firewall (for external traffic analysis per IP, user)
• Correlate with App, DB servers (for internal app transactional analysis)
Threat Intelligence
• External threat feeds (e.g., bad IP address list)
Behavioral threat models
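Two of the steps above, finding the maximum traffic contributors and matching traffic against an external threat feed, can be sketched in a few lines. Everything here is an assumption for illustration: flow records are reduced to `(src_ip, dst_ip, bytes)` tuples, the IAM correlation is a plain dictionary from IP to user ID, and the threat feed is a set of bad IP addresses.

```python
from collections import Counter

def top_contributors(flows, ip_to_user, n=3):
    """Sum bytes per user (falling back to the raw source IP when IAM
    has no mapping) and return the n largest contributors."""
    totals = Counter()
    for src, _dst, nbytes in flows:
        totals[ip_to_user.get(src, src)] += nbytes
    return totals.most_common(n)

def flows_to_bad_ips(flows, bad_ips, ip_to_user):
    """Return (user_or_ip, destination, bytes) for every flow whose
    destination appears in the threat feed."""
    return [(ip_to_user.get(src, src), dst, nbytes)
            for src, dst, nbytes in flows
            if dst in bad_ips]

# Hypothetical flow summaries, IAM mapping and threat feed.
flows = [
    ("10.0.0.5", "198.51.100.9", 5000),
    ("10.0.0.6", "192.0.2.10", 700),
    ("10.0.0.5", "192.0.2.10", 300),
    ("10.0.0.7", "198.51.100.9", 50),
]
ip_to_user = {"10.0.0.5": "UID1", "10.0.0.6": "UID2"}
bad_ips = {"198.51.100.9"}

print(top_contributors(flows, ip_to_user, n=2))
print(flows_to_bad_ips(flows, bad_ips, ip_to_user))
```

In practice the IP-to-user mapping changes over time (DHCP leases, VPN sessions), so the IAM join would need to be time-aware; the static dictionary here ignores that.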

Examples
 Ground-speed violations
− detect user logins that are geographically far apart but fall within seconds/minutes of each other
 Lateral movement
− accounts moving from one server/device to another to explore and list content on each location before deciding which to exfiltrate
 Domain admin creations
− automatic creation of admin accounts by a spurious account; e.g., r00t, adm1n etc.
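The ground-speed check is straightforward to express in code. This is a minimal sketch, assuming logins have already been geolocated to latitude/longitude (the record layout, the function names and the 1000 km/h limit are all hypothetical choices): consecutive logins by the same user are compared, and a pair is flagged when the implied travel speed is not physically plausible.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def ground_speed_violations(logins, max_kmh=1000.0):
    """logins: iterable of (user, timestamp_seconds, lat, lon).
    Returns (user, earlier_ts, later_ts) for consecutive logins whose
    implied speed exceeds `max_kmh`."""
    last_seen = {}  # user -> (timestamp, lat, lon) of the previous login
    violations = []
    for user, ts, lat, lon in sorted(logins, key=lambda r: r[1]):
        if user in last_seen:
            prev_ts, prev_lat, prev_lon = last_seen[user]
            dist = haversine_km(prev_lat, prev_lon, lat, lon)
            hours = (ts - prev_ts) / 3600
            # Same instant at different places, or too fast to be real.
            if dist > 0 and (hours == 0 or dist / hours > max_kmh):
                violations.append((user, prev_ts, ts))
        last_seen[user] = (ts, lat, lon)
    return violations

# Hypothetical logins: UID1 appears in San Francisco and São Paulo
# five minutes apart; UID2 moves a short distance over an hour.
logins = [
    ("UID1", 0,    37.77, -122.42),
    ("UID2", 0,    37.77, -122.42),
    ("UID1", 300, -23.55,  -46.63),
    ("UID2", 3600, 37.34, -121.89),
]
print(ground_speed_violations(logins))
```

IP geolocation is imprecise and VPN exits can make a legitimate user appear to jump, so a flagged pair is a lead for correlation with other signals rather than proof of compromise.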

Where do you start?
 Need a data lake
 Analytics:
− ML
− Statistical
 Actions

Thank you!
