Video (at YouTube) - http://bit.ly/19TNSTF
Big Data Security Analytics, Data Science and Machine Learning are a few of the new buzzwords that have invaded our industry of late. Most of what we hear are promises of a unicorn-laden, silver-bullet panacea from heavy-handed marketing folks, provoking an expected pushback from the more enlightened members of our community.
This talk will help parse what we as a community need to know and understand about these concepts: the technical details and actual capabilities behind them, and also where they fail and how they can be exploited and fooled by an attacker.
The talk will also share results of the author's ongoing research (at MLSec Project) on applying machine learning techniques to information security monitoring.
Applying Machine Learning to Network Security Monitoring - BayThreat 2013
1. Applying Machine Learning to Network Security Monitoring
Alexandre Pinto
Chief Data Scientist | MLSec Project
@alexcpsec
@MLSecProject
2. WARNING!
• This is a talk about BUILDING, not breaking
  – NO systems were harmed in the development of this talk.
  – This is NOT about 1337 Android Malware
• Only thing we are likely to break here is the time limit on the talk
• This talk includes more MATH than the daily recommended intake by the FDA.
• All stunts described in this talk were performed by trained professionals.
3. Who's Alex?
• 13 years in Information Security, done a little bit of everything.
• Past 7 or so years leading security consultancy and monitoring teams in Brazil, London and the US.
  – If there is any way a SIEM can hurt you, it did to me.
• Researching machine learning and data science in general for the past year or so, and presenting about the intersection of it and InfoSec throughout the year.
• Created MLSec Project in July 2013 to give structure to the research being done.
4. Agenda
• Definitions
• Big Data
• Data Science
• Machine Learning
• Y U DO DIS?
• Network Security Monitoring
• PoC || GTFO
• Feature Intuition
• How to get started?
8. (Security) Data Scientist
• "Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician." -- Josh Wills, Cloudera
• Data Science Venn Diagram by Drew Conway
9. Enter Machine Learning
• "Machine learning systems automatically learn programs from data" (*)
• You don't really code the program; it is inferred from data.
• Intuition of trying to mimic the way the brain learns: that's where terms like artificial intelligence come from.
(*) CACM 55(10) - A Few Useful Things to Know about Machine Learning (Domingos 2012)
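A toy illustration of "learning a program from data": instead of hand-coding a detection rule, a rule (here, just a threshold) is derived from labeled examples. All numbers below are invented for illustration.

```python
# Hypothetical labeled data: connection attempts/min from known-benign
# and known-malicious hosts. These numbers are made up.
benign = [5, 7, 9]
malicious = [900, 1200]

# "Training": place the decision threshold at the midpoint between
# the two class means -- the rule is inferred from the data.
threshold = (sum(benign) / len(benign) + sum(malicious) / len(malicious)) / 2

def is_malicious(rate):
    # This rule was learned, not written by hand.
    return rate > threshold

print(is_malicious(1000))  # True
print(is_malicious(10))    # False
```

Real ML systems learn far richer rules (trees, linear models, etc.), but the principle is the same: the program comes out of the data.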
13. Considerations on Data Gathering
• Models will (generally) get better with more data
  – But we always have to consider bias and variance as we select our data points
  – Also adversaries: we may be force-fed "bad data", find signal in weird noise, or design bad (or exploitable) features
• "I've got 99 problems, but data ain't one"
Domingos, 2012
Abu-Mostafa, Caltech, 2012
15. Y U DO DIS?
• Common reactions from Security Professionals:
  – "Eh, cool…" *blank stare* *walks away*
  – "Are you high, bro?"
  – "Why aren't you doing some cool research like Android Malware?"
17. Security Applications of ML
• Fraud detection systems:
  – Is what he just did consistent with past behavior?
• Network anomaly detection (?):
  – More like bad statistical analysis
  – Did not advance a lot, IMO
• Predicting likelihood of attack actors
  – Create different predictive models and chain them to gain more confidence in each step.
• SPAM filters
18. Considerations on Data Gathering
• Adversaries - Exploiting the learning process
• Understand the model, understand the machine, and you can circumvent it
• Something the InfoSec community knows very well
• Any predictive model in InfoSec will be pushed to the limit
• Again, think back on the way SPAM engines evolved.
20. Correlation Rules: A Primer
• Rules in a SIEM solution invariably are:
  – "Something" has happened "x" times;
  – "Something" has happened and other "something2" has happened, with some relationship (time, same fields, etc.) between them.
• Configuring SIEM = iterate on combinations until:
  – Customer or management is foole… I mean satisfied;
  – Consulting money runs out
• Behavioral rules (anomaly detection) help a bit with the "x"s, but still, very laborious and time consuming.
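The two rule shapes above can be sketched in a few lines. This is a hypothetical toy, not any vendor's rule language; the event records are invented.

```python
from collections import Counter

# Invented event stream for illustration.
events = [
    {"type": "failed_login", "src": "10.0.0.5"},
    {"type": "failed_login", "src": "10.0.0.5"},
    {"type": "failed_login", "src": "10.0.0.5"},
    {"type": "success_login", "src": "10.0.0.5"},
]

# Shape 1: "something" has happened "x" times
# (here: >= 3 failed logins from the same source).
fails = Counter(e["src"] for e in events if e["type"] == "failed_login")
alerts = [src for src, n in fails.items() if n >= 3]

# Shape 2: "something2" happened with a relationship to "something"
# (here: a successful login sharing the same src field).
for e in events:
    if e["type"] == "success_login" and e["src"] in alerts:
        print(f"ALERT: brute force then success from {e['src']}")
```

Every threshold and field pairing here is a knob someone has to hand-tune, which is exactly the iteration loop the slide complains about.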
21. Kinds of Network Security Monitoring
• Alert-based:
  – "Traditional" log management
  – SIEM
  – Using "Threat Intelligence" (i.e. blacklists) for about a year or so
  – Lack of context
  – Low effectiveness
  – You get the results handed over to you
• Exploration-based:
  – Network Forensics tools (2/3 years ago)
  – ElasticSearch-based LM systems
  – High effectiveness
  – Lots of people necessary
  – Lots of HIGHLY trained people
• Big Data Security Analytics (BDSA):
  – Run exploration-based monitoring on Hadoop
  – More like Big Data Security Monitoring (BDSM)
25. PoC || GTFO
• We developed a set of algorithms to detect malicious behavior from log entries of firewall blocks
• Over 6 months of data from SANS DShield (thanks, guys!)
• After a lot of statistics-based math (true positive ratio, true negative ratio, odds likelihood), it could pinpoint actors that would be 13x-18x more likely to attack you.
• Today more like 30x on the SANS data, and finding around 80% of "badness" in participant deployments.
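A rough sketch of the likelihood-ratio arithmetic behind a "13x-18x more likely" style claim: how much does observing an indicator shift the odds that an actor will attack? The counts below are invented for illustration, not the project's data.

```python
# Invented confusion-matrix counts for a single indicator.
tp, fn = 90, 10   # attackers observed with / without the indicator
fp, tn = 50, 950  # non-attackers observed with / without the indicator

tpr = tp / (tp + fn)      # true positive rate (sensitivity)
fpr = fp / (fp + tn)      # false positive rate
lr_positive = tpr / fpr   # positive likelihood ratio

print(f"An actor showing this indicator is ~{lr_positive:.0f}x "
      "more likely to attack")
```

Multiplying the prior odds of attack by this ratio gives the posterior odds, which is the sense in which an actor becomes "N times more likely" after the observation.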
26. Feature Intuition: IP Proximity
• Assumptions to aggregate the data
• Correlation / proximity / similarity BY BEHAVIOR
• "Bad Neighborhoods" concept:
  – Spamhaus x CyberBunker
  – Google Report (June 2013)
  – Moura 2013
• Group by Geolocation
• Group by Netblock (/16, /24)
• Group by ASN
  – (thanks, Team Cymru)
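The netblock grouping above can be sketched with the standard library: collapse observed attacker IPs into their /24s so that nearby addresses share a "neighborhood" count. The addresses are documentation-range examples, not real data.

```python
from collections import Counter
import ipaddress

# Invented observations (documentation IP ranges).
observed = ["198.51.100.7", "198.51.100.93", "198.51.100.200", "203.0.113.4"]

# Collapse each source IP into its covering /24 netblock.
blocks = Counter(
    str(ipaddress.ip_network(ip + "/24", strict=False)) for ip in observed
)

print(blocks.most_common(1))  # the /24 with the most attacking hosts
```

The same one-liner with `/16` or an IP-to-ASN lookup gives the other two groupings on the slide.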
27. Map of the Internet (Hilbert Curve)
• [Figure: Hilbert-curve map of the Internet showing blocks on port 22, 2013-07-20; hotspots labeled CN, BR, TH, RU, and "MULTICAST AND FRIENDS"; "You are here!"]
28. Feature Intuition: Temporal Decay
• Even bad neighborhoods renovate:
  – Attackers may change ISPs/proxies
  – Botnets may be shut down / relocate
  – A little paranoia is OK, but not EVERYONE is out to get you (at least not all at once)
• As days pass, let's forget, bit by bit, who attacked
• Last time I saw this actor, and how often did I see them
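"Forgetting bit by bit" is naturally modeled as exponential decay of an actor's score with days since last sighting. The half-life below is an arbitrary assumption for illustration.

```python
HALF_LIFE_DAYS = 7.0  # assumed: an actor's score halves every week

def decayed_score(initial_score, days_since_seen):
    # Exponential decay: the score halves every HALF_LIFE_DAYS.
    return initial_score * 0.5 ** (days_since_seen / HALF_LIFE_DAYS)

print(decayed_score(1.0, 0))   # 1.0   (seen today)
print(decayed_score(1.0, 7))   # 0.5   (one half-life ago)
print(decayed_score(1.0, 21))  # 0.125 (three half-lives ago)
```

Tuning the half-life trades paranoia (long memory) against freshness (fast forgetting).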
29. MLSec Project
• Behavior: block on port 22
• Trial inference on 100k IP addresses per Class A subnet
• Logarithmic scale: brightest tiles are 10 to 1000 times more likely to attack.
30. Feature Intuition: DNS Features
• Who resolves to this IP address?
• Number of domains that resolve to the IP address
• Distribution of their lifetime
• Entropy, size, ccTLDs
• Registrar information
• Reverse DNS information…
• History of DNS registration…
• (Thanks, DNSDB!)
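The "entropy" feature above is typically the Shannon entropy of the domain name's characters, which tends to run higher for machine-generated (DGA-style) names than for human-chosen ones. A minimal sketch:

```python
import math
from collections import Counter

def shannon_entropy(s):
    # Shannon entropy (bits/char) over the character distribution of s.
    counts = Counter(s)
    return -sum(
        (n / len(s)) * math.log2(n / len(s)) for n in counts.values()
    )

print(shannon_entropy("google"))          # human-chosen: lower entropy
print(shannon_entropy("xk7qz2mw9fja8b"))  # DGA-like: higher entropy
```

On its own it is a weak signal, which is why the slide pairs it with lifetime, registrar, and rDNS features.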
31. Training the Model
• YAY! We have a bunch of numbers per IP address/domain!
• How do you define what is malicious or not?
• "Advanced expertise in both information security and data science will be a necessary ingredient in enabling accurate discrimination between malicious and benign activity." - Anton Chuvakin, Gartner
• Kinda easy for security tools (if you trust them)
• Web application logs need deeper statistical analysis
  – Not a normal / standard deviation thing
32. How do I get started on this?
• Programming is a must (Python / R)
• Statistical knowledge keeps you from making dumb mistakes
• Specific machine learning courses and books:
  – Coursera (ML / Data Analysis / Data Science)
• Practice, Practice, Practice:
  – Explore your data! (Security Onion)
  – Kaggle
  – KDD, VAST, VizSec
33. MLSec Project
• Sign up, send logs, receive reports generated by machine learning models!
• Working with several companies on trying out these models on their environment with their data
• We are hiring (KINDA)
• Visit https://www.mlsecproject.org, message @MLSecProject or just e-mail me.
34. MLSec Project - Current Research
• Inbound attacks on exposed services (DEFCON/BH 2013):
  – Information from inbound connections on firewalls, IPS, WAFs
  – Feature extraction and supervised learning
• Malware Distribution and Botnets:
  – Information from outbound connections on firewalls, DNS and Web Proxy
  – Initial labeling provided by intelligence feeds and AV/anti-malware
  – Semi-supervised learning involved
• Kill-chain Ensemble Models:
  – Increased precision by composing different behaviors
  – Web server path -> go through Firewall, then IPS, then WAF
  – Early confirmation of attack failure or success
35. Thanks!
• Q&A?
• Feedback?
Alexandre Pinto
@alexcpsec
@MLSecProject
https://www.mlsecproject.org/
"Essentially, all models are wrong, but some are useful." - George E. P. Box