An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.
4. 100%
Open
Source
–
Democra/zed
Access
to
Data
The
leaders
of
Hadoop’s
development
We
do
Hadoop
Drive
Innova/on
in
the
plaForm
–
We
lead
the
roadmap
Community
driven,
Enterprise
Focused
5. We do Hadoop successfully. Support. Training. Professional Services.
12. So we save the data because we think we need it, but often we really don't know what to do with it.
13. We put away data, delete it, tweet it, compress it, shred it, wikileak it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
14. You need value from your data. You need to make decisions from your data.
23. The solution? EDW, Another EDW, Yet Another EDW, Analytical DB, OLTP. (diagram: "Data" boxes piling up inside each system)
24. Ummm…you dropped something. (same diagram: EDW, Another EDW, Yet Another EDW, Analytical DB, OLTP, with even more "Data" boxes spilling out of the silos)
27. Wait, you've seen this before. (diagram: streams of "Data" boxes flowing into an "Analytics Sausage Factory" and out the other side)
30. "Prices, Stupid passwords, and Boring Statistics." – Hans Rosling http://www.youtube.com/watch?v=hVimVzgtD6w
31. Your data silos are lonely places. (diagram: EDW, Accounts, Customers, and Web Properties silos, each holding its own isolated "Data" boxes)
32. …Data likes to be together. (diagram: the EDW, Accounts, Customers, and Web Properties data pooled in one place)
33. Data likes to socialize too. (diagram: CDR, Machine Data, Facebook, Weather Data, and Twitter data joining the pooled EDW, Accounts, Web Properties, and Customers data)
34. New types of data don't quite fit into your pristine view of the world. (diagram: Logs and Machine Data arriving as "?" boxes at the edge of "My Little Data Empire")
35. To resolve this, some people take hints from Lord Of The Rings...
37. …but that has its problems too. (diagram: multiple ETL pipelines forcing "Data" into an EDW with a single Schema)
38. What if the data was processed and stored centrally? What if you didn't need to force it into a single schema? We call it a Data Lake. (diagram: Data Sources landing in a Data Lake, processed under multiple Schemas, feeding BI & Analytics and the EDW)
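The schema-on-read idea behind the Data Lake can be sketched in a few lines. This is a toy illustration, not any real lake technology: the in-memory store, the event shapes, and the field names (`user`, `page`, `clicked`) are all invented. The point is that raw records land with no schema enforced at write time, and each consumer projects its own schema at read time.

```python
import json
import io

# A toy "data lake": raw events land as JSON lines, untouched,
# with no schema enforced at write time.
raw_store = io.StringIO()
for event in [
    {"user": "a", "page": "/home", "ms": 120},
    {"user": "b", "clicked": True},          # different shape, still accepted
    {"user": "a", "page": "/cart"},
]:
    raw_store.write(json.dumps(event) + "\n")

def read_with_schema(store, fields):
    """Schema-on-read: project only the fields a consumer cares about,
    filling missing ones with None instead of rejecting the record."""
    store.seek(0)
    return [
        {f: rec.get(f) for f in fields}
        for rec in (json.loads(line) for line in store)
    ]

# Two consumers apply two different schemas to the same raw data.
page_views = read_with_schema(raw_store, ["user", "page"])
clicks = read_with_schema(raw_store, ["user", "clicked"])
```

Contrast this with the ETL-into-one-schema picture from slide 37: there, the second event would have been rejected or mangled at load time; here, both consumers get what they need from the same raw store.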
39. A Data Lake Architecture enables:
- Landing data without forcing a single schema
- Landing a variety and large volume of data efficiently
- Retaining data for a long period of time with a very low $/TB
- A platform to feed other Analytical DBs
- A platform to execute next-gen data analytics and processing applications (SAS, Informatica, Graph Analytics, Machine Learning, SAP, etc…)
40. In most cases, more data is better. Work with the population, not just a sample.
41. Your view of a client today: Town/City, Middle Income Band, Female/Male, Age: 25-30, Product Category Preferences.
42. Your view with more data: GPS coordinates, looking to start a business, walking into Starbucks right now…, spent 25 minutes looking at tea cozies, unhappy with his cell phone plan, $65-68k per year, pregnant, Tea Party Hippie, a depressed Toronto Maple Leafs fan, gene expression for risk taker, Male/Female, Age: 27 but feels old, product recommendations, thinking about a new house, products left in basket indicate drunk Amazon shopper.
43. Pick up all of that data that was prohibitively expensive to store and use.
56. If you could design a system that would handle this, what would it look like?
57. It would probably need a highly resilient, self-healing, cost-efficient, distributed file system… (diagram: a grid of Storage nodes)
58. It would probably need a completely parallel processing framework that took tasks to the data… (diagram: Processing paired with Storage on every node)
59. It would probably run on commodity hardware, virtualized machines, and common OS platforms. (diagram: the same Processing/Storage grid)
60. It would probably be open source so innovation could happen as quickly as possible.
63. HDFS stores data in blocks and replicates those blocks. (diagram: block1, block2, and block3 each replicated across three Storage nodes)
64. If a block fails then HDFS always has the other copies and heals itself. (diagram: one replica marked with an X; the surviving copies remain on other nodes and are re-replicated)
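The replicate-and-heal behaviour can be simulated in miniature. This is a toy model, not real HDFS: the node names, round-robin placement, and replication factor of 3 are invented for illustration (real HDFS placement is rack-aware and driven by the NameNode). It shows the invariant that matters: after a node is lost, every block is re-replicated back up to the target count.

```python
REPLICATION = 3  # default HDFS replication factor

class MiniHDFS:
    """Toy simulation of HDFS block placement and self-healing."""

    def __init__(self, nodes):
        self.nodes = {n: set() for n in nodes}

    def put(self, block):
        # Place REPLICATION copies on the least-loaded distinct nodes
        # (real HDFS placement is rack-aware; this is a stand-in).
        targets = sorted(self.nodes, key=lambda n: len(self.nodes[n]))
        for n in targets[:REPLICATION]:
            self.nodes[n].add(block)

    def replicas(self, block):
        return sum(block in held for held in self.nodes.values())

    def fail_node(self, node):
        lost = self.nodes.pop(node)
        # Self-heal: re-replicate any block that fell below REPLICATION.
        for block in lost:
            if self.replicas(block) < REPLICATION:
                spare = next(n for n in self.nodes
                             if block not in self.nodes[n])
                self.nodes[spare].add(block)

cluster = MiniHDFS(["n1", "n2", "n3", "n4"])
for b in ["block1", "block2", "block3"]:
    cluster.put(b)

cluster.fail_node("n1")  # every block it held still has copies elsewhere
```

After the failure, each block is back to three replicas on the surviving nodes, which is the "always has the other copies and heals itself" claim from the slide.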
65. MapReduce is a programming paradigm that is completely parallel. (diagram: "Data" boxes fanning out to Mappers, then into Reducers)
66. MapReduce has three phases: Map, Sort/Shuffle, Reduce. (diagram: Key,Value pairs emitted by Mappers, grouped in the shuffle, and consumed by Reducers)
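The three phases can be sketched with the canonical word-count example. This is a single-process sketch of the paradigm, not Hadoop's API: `mapper`, `shuffle`, and `reducer` are our own stand-ins for what the framework distributes across nodes.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit (key, value) pairs -- here (word, 1).
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Sort/Shuffle phase: group every value under its key, sorted,
    # as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce phase: collapse each key's values to one result.
    return (key, sum(values))

lines = ["we do hadoop", "hadoop stores data", "data data data"]
mapped = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, vs) for k, vs in shuffle(mapped))
```

Because every `mapper` call is independent and every key's `reducer` call is independent, both phases parallelize trivially; only the shuffle requires moving data between them.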
67. MapReduce applies to a lot of data processing problems. (same Mapper/Reducer diagram)
68. MapReduce goes a long way, but not all data processing and analytics are solved the same way.
69. Sometimes your data application needs parallel processing and inter-process communication. (diagram: Process nodes exchanging messages across shared "Data")
71. Sometimes your machine learning data application needs to process in memory and iterate. (diagram: Process nodes looping repeatedly over in-memory "Data")
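A tiny example of that iterate-in-memory access pattern, using a toy 1-D k-means with invented data and starting centers. The specific numbers are illustrative only; the point is that the same dataset is revisited on every pass, which MapReduce handles poorly because each pass would re-read the data from disk as a separate job.

```python
# Toy 1-D k-means (k=2): the dataset stays in memory across iterations.
data = [1.0, 1.2, 0.8, 8.0, 8.5, 7.5]
centers = [0.0, 10.0]            # arbitrary starting guesses

for _ in range(20):              # iterate; converges in a few passes here
    clusters = [[], []]
    for x in data:               # full pass over the in-memory data
        nearest = min(range(2), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(x)
    centers = [sum(c) / len(c) for c in clusters]
```

Each loop body is itself a map (assign points) followed by a reduce (average each cluster), so the algorithm fits the paradigm per pass; it is the many cheap passes over cached data that motivate in-memory frameworks.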
76. Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.
77. Tez can chain tasks together into one job to get Map – Reduce – Reduce jobs, suitable for things like Hive SQL projections, group by, and order by. (diagram: TezMap tasks feeding chained TezReduce tasks)
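Why a Map – Reduce – Reduce chain maps onto SQL can be sketched in plain Python. This is an illustration of the data flow, not Tez's actual API: the stage functions and the sample rows are invented. A query like `SELECT cat, SUM(amount) ... GROUP BY cat ORDER BY 2 DESC` needs a map (projection), a reduce (group by), and a second reduce (order by); classic MapReduce would split that into two jobs with an HDFS write in between, while Tez chains all three stages in one job.

```python
from collections import defaultdict

# Hypothetical input rows: (category, amount)
rows = [("books", 10), ("music", 5), ("books", 7), ("toys", 12), ("music", 1)]

def map_stage(rows):
    # Projection: emit (key, value) pairs.
    for cat, amount in rows:
        yield (cat, amount)

def reduce_group_by(pairs):
    # First reduce: GROUP BY category, SUM(amount).
    totals = defaultdict(int)
    for cat, amount in pairs:
        totals[cat] += amount
    yield from totals.items()

def reduce_order_by(pairs):
    # Second reduce: ORDER BY total, descending.
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)

# Chained as one "job": no intermediate write between the two reduces.
result = reduce_order_by(reduce_group_by(map_stage(rows)))
```

The generators pipe each stage's output straight into the next, which is the single-DAG shape the slide's TezMap → TezReduce → TezReduce diagram depicts.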
78. Tez can provide long-running containers for applications like Hive to side-step the batch-process startup costs you would have with MapReduce.
83. YARN abstracts resource management so you can run more than just MapReduce. (diagram: MapReduce V2, MapReduce V?, STORM, Giraph, Tez, Spark, MPI, HBase … and more, all running on YARN over HDFS2)
89. Falcon: data set management across Hadoop clusters – late data arrival handling, data set archival, lineage, audit, retention policy, replication, monitoring, and data set process management. (diagram: data sets flowing between Hadoop clusters under Falcon policies)