An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.
4. 100%
Open
Source
–
Democra/zed
Access
to
Data
The
leaders
of
Hadoop’s
development
We
do
Hadoop
Drive
Innova/on
in
the
plaForm
–
We
lead
the
roadmap
Community
driven,
Enterprise
Focused
5. We do Hadoop successfully. Support. Training. Professional Services.
12. So we save the data because we think we need it, but often we really don't know what to do with it.
13. We put away data, delete it, tweet it, compress it, shred it, wikileak it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
14. You need value from your data. You need to make decisions from your data.
23. The solution? EDW, Another EDW, Yet Another EDW, Analytical DB, OLTP. (diagram: "Data" boxes piling up inside each system)
24. Ummm…you dropped something. (same diagram: EDW, Another EDW, Yet Another EDW, Analytical DB, OLTP, with even more "Data" boxes spilling out of the silos)
27. Wait, you've seen this before. (diagram: streams of "Data" boxes flowing into an "Analytics Sausage Factory" and out the other side)
30. "Prices, Stupid passwords, and Boring Statistics." – Hans Rosling http://www.youtube.com/watch?v=hVimVzgtD6w
31. Your data silos are lonely places. (diagram: EDW, Accounts, Customers, and Web Properties silos, each holding its own isolated "Data" boxes)
32. …Data likes to be together. (diagram: the EDW, Accounts, Customers, and Web Properties data pooled in one place)
33. Data likes to socialize too. (diagram: CDR, Machine Data, Facebook, Weather Data, and Twitter data joining the pooled EDW, Accounts, Web Properties, and Customers data)
34. New types of data don't quite fit into your pristine view of the world. (diagram: Logs and Machine Data arriving as "?" boxes at the edge of "My Little Data Empire")
35. To resolve this, some people take hints from Lord Of The Rings...
37. …but that has its problems too. (diagram: multiple ETL pipelines forcing "Data" into an EDW with a single Schema)
38. What if the data was processed and stored centrally? What if you didn't need to force it into a single schema? We call it a Data Lake. (diagram: Data Sources landing in a Data Lake, processed under multiple Schemas, feeding BI & Analytics and the EDW)
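The schema-on-read idea behind the Data Lake can be sketched in a few lines. This is a toy illustration, not any real lake technology: the in-memory store, the event shapes, and the field names (`user`, `page`, `clicked`) are all invented. The point is that raw records land with no schema enforced at write time, and each consumer projects its own schema at read time.

```python
import json
import io

# A toy "data lake": raw events land as JSON lines, untouched,
# with no schema enforced at write time.
raw_store = io.StringIO()
for event in [
    {"user": "a", "page": "/home", "ms": 120},
    {"user": "b", "clicked": True},          # different shape, still accepted
    {"user": "a", "page": "/cart"},
]:
    raw_store.write(json.dumps(event) + "\n")

def read_with_schema(store, fields):
    """Schema-on-read: project only the fields a consumer cares about,
    filling missing ones with None instead of rejecting the record."""
    store.seek(0)
    return [
        {f: rec.get(f) for f in fields}
        for rec in (json.loads(line) for line in store)
    ]

# Two consumers apply two different schemas to the same raw data.
page_views = read_with_schema(raw_store, ["user", "page"])
clicks = read_with_schema(raw_store, ["user", "clicked"])
```

Contrast this with the ETL-into-one-schema picture from slide 37: there, the second event would have been rejected or mangled at load time; here, both consumers get what they need from the same raw store.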
39. A Data Lake Architecture enables:
- Landing data without forcing a single schema
- Landing a variety and large volume of data efficiently
- Retaining data for a long period of time with a very low $/TB
- A platform to feed other Analytical DBs
- A platform to execute next-gen data analytics and processing applications (SAS, Informatica, Graph Analytics, Machine Learning, SAP, etc…)
40. In most cases, more data is better. Work with the population, not just a sample.
41. Your view of a client today: Town/City, Middle Income Band, Female/Male, Age: 25-30, Product Category Preferences.
42. Your view with more data: GPS coordinates, looking to start a business, walking into Starbucks right now…, spent 25 minutes looking at tea cozies, unhappy with his cell phone plan, $65-68k per year, pregnant, Tea Party Hippie, a depressed Toronto Maple Leafs fan, gene expression for risk taker, Male/Female, Age: 27 but feels old, product recommendations, thinking about a new house, products left in basket indicate drunk Amazon shopper.
43. Pick up all of that data that was prohibitively expensive to store and use.
56. If you could design a system that would handle this, what would it look like?
57. It would probably need a highly resilient, self-healing, cost-efficient, distributed file system… (diagram: a grid of Storage nodes)
58. It would probably need a completely parallel processing framework that took tasks to the data… (diagram: Processing paired with Storage on every node)
59. It would probably run on commodity hardware, virtualized machines, and common OS platforms. (diagram: the same Processing/Storage grid)
60. It would probably be open source so innovation could happen as quickly as possible.
63. HDFS stores data in blocks and replicates those blocks. (diagram: block1, block2, and block3 each replicated across three Storage nodes)
64. If a block fails then HDFS always has the other copies and heals itself. (diagram: one replica marked with an X; the surviving copies remain on other nodes and are re-replicated)
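The replicate-and-heal behaviour can be simulated in miniature. This is a toy model, not real HDFS: the node names, round-robin placement, and replication factor of 3 are invented for illustration (real HDFS placement is rack-aware and driven by the NameNode). It shows the invariant that matters: after a node is lost, every block is re-replicated back up to the target count.

```python
REPLICATION = 3  # default HDFS replication factor

class MiniHDFS:
    """Toy simulation of HDFS block placement and self-healing."""

    def __init__(self, nodes):
        self.nodes = {n: set() for n in nodes}

    def put(self, block):
        # Place REPLICATION copies on the least-loaded distinct nodes
        # (real HDFS placement is rack-aware; this is a stand-in).
        targets = sorted(self.nodes, key=lambda n: len(self.nodes[n]))
        for n in targets[:REPLICATION]:
            self.nodes[n].add(block)

    def replicas(self, block):
        return sum(block in held for held in self.nodes.values())

    def fail_node(self, node):
        lost = self.nodes.pop(node)
        # Self-heal: re-replicate any block that fell below REPLICATION.
        for block in lost:
            if self.replicas(block) < REPLICATION:
                spare = next(n for n in self.nodes
                             if block not in self.nodes[n])
                self.nodes[spare].add(block)

cluster = MiniHDFS(["n1", "n2", "n3", "n4"])
for b in ["block1", "block2", "block3"]:
    cluster.put(b)

cluster.fail_node("n1")  # every block it held still has copies elsewhere
```

After the failure, each block is back to three replicas on the surviving nodes, which is the "always has the other copies and heals itself" claim from the slide.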
65. MapReduce is a programming paradigm that is completely parallel. (diagram: "Data" boxes fanning out to Mappers, then into Reducers)
66. MapReduce has three phases: Map, Sort/Shuffle, Reduce. (diagram: Key,Value pairs emitted by Mappers, grouped in the shuffle, and consumed by Reducers)
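The three phases can be sketched with the canonical word-count example. This is a single-process sketch of the paradigm, not Hadoop's API: `mapper`, `shuffle`, and `reducer` are our own stand-ins for what the framework distributes across nodes.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit (key, value) pairs -- here (word, 1).
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Sort/Shuffle phase: group every value under its key, sorted,
    # as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce phase: collapse each key's values to one result.
    return (key, sum(values))

lines = ["we do hadoop", "hadoop stores data", "data data data"]
mapped = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, vs) for k, vs in shuffle(mapped))
```

Because every `mapper` call is independent and every key's `reducer` call is independent, both phases parallelize trivially; only the shuffle requires moving data between them.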
67. MapReduce applies to a lot of data processing problems. (same Mapper/Reducer diagram)
68. MapReduce goes a long way, but not all data processing and analytics are solved the same way.
69. Sometimes your data application needs parallel processing and inter-process communication. (diagram: Process nodes exchanging messages across shared "Data")
71. Sometimes your machine learning data application needs to process in memory and iterate. (diagram: Process nodes looping repeatedly over in-memory "Data")
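A tiny example of that iterate-in-memory access pattern, using a toy 1-D k-means with invented data and starting centers. The specific numbers are illustrative only; the point is that the same dataset is revisited on every pass, which MapReduce handles poorly because each pass would re-read the data from disk as a separate job.

```python
# Toy 1-D k-means (k=2): the dataset stays in memory across iterations.
data = [1.0, 1.2, 0.8, 8.0, 8.5, 7.5]
centers = [0.0, 10.0]            # arbitrary starting guesses

for _ in range(20):              # iterate; converges in a few passes here
    clusters = [[], []]
    for x in data:               # full pass over the in-memory data
        nearest = min(range(2), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(x)
    centers = [sum(c) / len(c) for c in clusters]
```

Each loop body is itself a map (assign points) followed by a reduce (average each cluster), so the algorithm fits the paradigm per pass; it is the many cheap passes over cached data that motivate in-memory frameworks.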
76. Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.
77. Tez can chain tasks together into one job to get Map – Reduce – Reduce jobs, suitable for things like Hive SQL projections, group by, and order by. (diagram: TezMap tasks feeding chained TezReduce tasks)
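Why a Map – Reduce – Reduce chain maps onto SQL can be sketched in plain Python. This is an illustration of the data flow, not Tez's actual API: the stage functions and the sample rows are invented. A query like `SELECT cat, SUM(amount) ... GROUP BY cat ORDER BY 2 DESC` needs a map (projection), a reduce (group by), and a second reduce (order by); classic MapReduce would split that into two jobs with an HDFS write in between, while Tez chains all three stages in one job.

```python
from collections import defaultdict

# Hypothetical input rows: (category, amount)
rows = [("books", 10), ("music", 5), ("books", 7), ("toys", 12), ("music", 1)]

def map_stage(rows):
    # Projection: emit (key, value) pairs.
    for cat, amount in rows:
        yield (cat, amount)

def reduce_group_by(pairs):
    # First reduce: GROUP BY category, SUM(amount).
    totals = defaultdict(int)
    for cat, amount in pairs:
        totals[cat] += amount
    yield from totals.items()

def reduce_order_by(pairs):
    # Second reduce: ORDER BY total, descending.
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)

# Chained as one "job": no intermediate write between the two reduces.
result = reduce_order_by(reduce_group_by(map_stage(rows)))
```

The generators pipe each stage's output straight into the next, which is the single-DAG shape the slide's TezMap → TezReduce → TezReduce diagram depicts.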
78. Tez can provide long-running containers for applications like Hive to side-step the batch-process startup costs you would have with MapReduce.
83. YARN abstracts resource management so you can run more than just MapReduce. (diagram: MapReduce V2, MapReduce V?, STORM, Giraph, Tez, Spark, MPI, HBase … and more, all running on YARN over HDFS2)
89. Falcon: data set management across Hadoop clusters – late data arrival handling, data set archival, lineage, audit, retention policy, replication, monitoring, and data set process management. (diagram: data sets flowing between Hadoop clusters under Falcon policies)