2. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching 'Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
3. Exploding Data Volumes
Complex, Unstructured:
• Online
• Web-ready devices
• Social media
• Digital content
• Smart grids
Relational:
• Enterprise
• Transactions
• R&D data
• Operational (control) data
The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year, with the Internet as the primary driver; 2,500 exabytes of new information is projected for 2012.
Source: An IDC White Paper, sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
4. Origin of Hadoop
How does an elephant sneak up on you?
• 2002: Open source web crawler project created by Doug Cutting
• 2003: Google publishes GFS paper
• 2004: Google publishes MapReduce paper
• 2006: Open source MapReduce & HDFS project created by Doug Cutting (Hadoop)
• 2008: Hadoop wins Terabyte sort benchmark; runs a 4,000-node cluster
• 2009: Cloudera launches Enterprise support for Hadoop
• 2010: Cloudera releases CDH3 and SQL support for Hadoop
5. What is Apache Hadoop?
Open Source Storage and Processing Engine: MapReduce + Hadoop Distributed File System (HDFS)
• Consolidates Everything
  • Move complex and relational data into a single repository
• Stores Inexpensively
  • Keep raw data always available
  • Use commodity hardware
• Processes at the Source
  • Eliminate ETL bottlenecks
  • Mine data first, govern later
6. What is Apache Hadoop?
The Standard Way Big Data Gets Done
• Hadoop is Flexible:
  • Structured, unstructured
  • Schema, no schema
  • High volume, merely terabytes
  • All kinds of analytic applications
• Hadoop is Open: 100% Apache-licensed open source
• Hadoop is Scalable: proven at petabyte scale
• Benefits:
  • Controls costs by storing data more affordably per terabyte than any other platform
  • Drives revenue by extracting value from data that was previously out of reach
7. What is Apache Hadoop?
The Importance of Being Open
• No Lock-In: investments in skills, services & hardware are preserved regardless of vendor choice
• Community Development: Hadoop & related projects are expanding at a rapid pace
• Rich Ecosystem: dozens of complementary software, hardware and services firms
8. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching 'Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
9. Log Processing: A Perfect Fit
• Common uses of logs:
• Find or count events (grep):
    grep "ERROR" file
    grep -c "ERROR" file
• Calculate metrics (performance or user behavior analysis):
    awk '{sums[$1]+=$2; counts[$1]+=1} END {for (k in counts) {print sums[k]/counts[k]}}'
• Investigate user sessions:
    grep "USER" files … | sort | less
10. Log Processing: A Perfect Fit
• Shoot… too much data
• Homegrown parallel processing is often done on a per-file basis, because it's easy
• No parallelism on a single large file
[Diagram: Task 0, Task 1 and Task 2 each process a separate access_log file]
11. Log Processing: A Perfect Fit
• MapReduce to the rescue!
• Processing is done per unit of data (see the sketch below)
[Diagram: Task 0 through Task 3 each read one slice of a single access_log: 0-64MB, 64-128MB, 128-192MB, 192-256MB. Each task is responsible for a unit of data.]
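A minimal sketch of that idea using Hadoop Streaming (the deck shows no job code; the HDFS paths and jar location below are assumptions): the grep from slide 9 becomes a map-only job, and the framework hands each task one input split.

    # Hedged sketch: run grep as a map-only MapReduce job via Hadoop Streaming.
    # /logs/access_log and the jar path are assumed locations, not from the deck.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -D mapred.reduce.tasks=0 \
      -D stream.non.zero.exit.is.failure=false \
      -input /logs/access_log \
      -output /logs/error_lines \
      -mapper 'grep ERROR'
    # Each map task runs grep over one ~64MB input split, so a 256MB file gets
    # four tasks working in parallel, exactly as in the diagram above. The
    # second -D keeps a split with zero matches (grep exit code 1) from being
    # treated as a failed task.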
12. Log Processing: A Perfect Fit
• Network and disk are bottlenecks
• Reading 100GB of data takes:
  • 14 minutes over a 1GbE network connection
  • 22 minutes on a standard disk drive
[Diagram: grep pulling access_log over a bandwidth-limited link]
13. Log Processing: A Perfect Fit
• Hadoop to the rescue!
• Eliminates the network bottleneck; data is on local disk
• Data is read from many, many disks in parallel
[Diagram: physical machines NodeA, NodeX, NodeY and NodeZ host Task 0 through Task 3, each reading its own slice (0-64MB, 64-128MB, 128-192MB, 192-256MB) from local disk]
14. Log Processing: A Perfect Fit
• Hadoop currently scales to 4,000 nodes
• Goal for the next release is 10,000 nodes
• Nodes typically have 12 hard drives
• A single hard drive has a throughput of about 75MB/second
• 12 hard drives * 75 MB/second * 4,000 nodes = 3.4 TB/second
• That's bytes, not bits
• That's enough bandwidth to read 1PB (1,000 TB) in 5 minutes
15. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching 'Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
16. Catching 'Osama': Embarrassingly Parallel
• You have a few billion images of faces with geo-tags
• Tremendous storage problem
• Tremendous processing problem
  • Bandwidth
  • Coordination
17. Catching 'Osama': Embarrassingly Parallel
• Store the images in Hadoop
• When processing, Hadoop will read the images from local disk: thousands of local disks spread throughout the cluster
• Use a map-only job to compare input images against the 'needle' image (sketch below)
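A hedged sketch of such a map-only job, again via Hadoop Streaming. The script compare_faces.py, needle.jpg and the HDFS paths are hypothetical, invented for illustration; the deck itself shows no code for this.

    # Hypothetical map-only face-matching job; compare_faces.py, needle.jpg
    # and all paths are illustrative assumptions, not from the deck.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -D mapred.reduce.tasks=0 \
      -files compare_faces.py,needle.jpg \
      -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
      -input /images/faces.seq \
      -output /images/matches \
      -mapper 'python compare_faces.py needle.jpg'
    # -files ships a copy of the 'needle' to every map task; with zero
    # reducers, each task's matches are written straight to HDFS. A real
    # pipeline would use a binary-aware input format for the image bytes
    # rather than the text rendering used here.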
18. Catching 'Osama': Embarrassingly Parallel
[Diagram: images stored in SequenceFiles feed Map Task 0 and Map Task 1; each task has a copy of the 'needle' and outputs the faces 'matching' it]
19. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching 'Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
20. Extract Transform Load (ETL): Everyone is doing it
• One of the most common use cases I see is replacing ETL processes
• Hadoop is a huge sink of cheap storage and processing
• Aggregates are built in Hadoop and then exported
• Apache Hive provides SQL-like querying of the raw data (example below)
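As a hedged illustration of that last bullet, a nightly aggregate might be built with a query like the one below. The hive CLI and its -e flag are real; the table and column names (raw_logs, url, hits) are assumptions for illustration.

    # Hypothetical Hive aggregate over raw logs; table and column names
    # are illustrative assumptions.
    hive -e "
      SELECT url, COUNT(*) AS hits
      FROM raw_logs
      GROUP BY url
      ORDER BY hits DESC
      LIMIT 100"
    # Hive compiles the SQL-like query into MapReduce jobs that run over
    # the raw data in HDFS; the result can then be exported downstream.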
21. Extract Transform Load (ETL): Everyone is doing it
[Diagram: 'Real' Time System (Website, Online DB) -> ETL -> Data Warehouse (Analytical DB) -> Business Intelligence Applications. Caption at the ETL step: "Much blood shed here"]
22. Extract Transform Load (ETL): Everyone is doing it
[Diagram: 'Real' Time System (Website, Online DB) -> Import -> Hadoop -> Export -> Data Warehouse (Analytical DB) -> Business Intelligence Applications]
23. Extract Transform Load (ETL): Everyone is doing it
[Diagram: the same pipeline, with Apache Sqoop handling both the import from the online DB into Apache Hadoop and the export to the analytical DB]
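As a hedged sketch of that picture, Sqoop's import and export commands move tables between a relational database and HDFS. The commands and flags are real Sqoop usage; the JDBC URLs, table names and directories are assumptions for illustration.

    # Pull a table from the online DB into HDFS (connection details are
    # illustrative assumptions).
    sqoop import \
      --connect jdbc:mysql://db.example.com/website \
      --table orders \
      --target-dir /warehouse/orders

    # After aggregates are built in Hadoop, push them to the analytical DB.
    sqoop export \
      --connect jdbc:mysql://dw.example.com/analytics \
      --table daily_order_totals \
      --export-dir /warehouse/daily_order_totals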
24. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching 'Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
25. Analytics in HBase: Scaling writes
• Analytics is often simply counting things
• Facebook chose HBase to store its massive counter infrastructure (more later)
• How might one implement a counter infrastructure in HBase?
26. Analytics in HBase: Scaling writes
A 'Like' button IMG request sends an HTTP request to Facebook servers, which increment several counters.

User & Content Type Counters:
  User          Content   Counter
  brock@me.com  NEWS      5431
  brock@me.com  TECH      79310
  brock@me.com  SHOPPING  59
  tom@him.com   SPORTS    94214

Individual Page Counters:
  URL                       Counter
  com.cloudera/blog/…       154
  com.cloudera/downloads/…  923621
  com.cloudera/resources/…  2138
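A minimal sketch of those increments in the HBase shell; the table name 'counters' and column family 'f' are assumptions for illustration, not from the deck.

    # Hypothetical counter updates in the HBase shell; table 'counters' and
    # column family 'f' are illustrative assumptions.
    incr 'counters', 'brock@me.com', 'f:NEWS', 1
    incr 'counters', 'com.cloudera/blog/some-post', 'f:hits', 1
    # incr is atomic per cell, so many web servers can bump the same counter
    # concurrently without read-modify-write races.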
27. Analytics in HBase: Scaling writes
Individual Page Counters: the host is reversed in the URL as part of the row key.
  URL                       Counter
  com.cloudera/blog/…       154
  com.cloudera/downloads/…  923621
  com.cloudera/resources/…  2138
• Data is physically stored in sorted order
• Scanning all 'com.cloudera' counters results in sequential I/O
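That row-key design pays off on reads. A hedged HBase shell sketch, over the same assumed 'counters' table as above: one sequential range scan covers every com.cloudera page counter.

    # Hypothetical prefix scan: because rows are sorted and the host is
    # reversed, all com.cloudera pages are adjacent on disk.
    scan 'counters', {STARTROW => 'com.cloudera/', STOPROW => 'com.cloudera0'}
    # '0' is the ASCII character after '/', so the stop row bounds the prefix.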
28. Facebook Analytics: Scaling writes
• Real-time counters of URLs shared, links "liked", impressions generated
• 20 billion events/day (200K events/sec)
• ~30 second latency from click to count
• Heavy use of the incrementColumnValue API for consistent counters
• Tried MySQL and Cassandra, settled on HBase
http://tiny.cloudera.com/hbase-fb-analytics
29. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching 'Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
33. Machine Learning: Apache Mahout
• Apache Mahout implements:
  • Collaborative Filtering
  • Classification
  • Clustering
  • Frequent itemset mining
• More coming with the integration of MapReduce.Next
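As a hedged sketch, Mahout's algorithms run as MapReduce jobs from the command line. The kmeans subcommand and flags reflect Mahout's CLI of that era; the HDFS paths and parameter values are illustrative assumptions.

    # Hypothetical k-means clustering run with Mahout; paths and the number
    # of clusters are illustrative assumptions.
    mahout kmeans \
      -i /data/vectors \
      -c /data/initial-centroids \
      -o /data/clusters \
      -k 10 \
      -x 20
    # -k samples 10 random points as initial centroids (written to the -c
    # path); -x caps the job at 20 MapReduce iterations over the vectors.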
34. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching 'Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
35. Final Thoughts: Use the right tool
• Other use cases:
  • OpenTSDB, an open, distributed, scalable Time Series Database (TSDB)
  • Building search indexes (the canonical use case)
  • Facebook Messaging
  • Cheap and deep storage, e.g. archiving emails for SOX compliance
  • Audit logging
• Non-use cases:
  • Data processing that one beefy server can handle
  • Data that requires transactions
36. About the Presenter
• Brock Noland
• brock@cloudera.com
• http://twitter.com/brocknoland
• TC-HUG: http://tch.ug