Philly DB MapR M7 - March 2013
- 1. HBase and M7 Technical Overview
  Keys Botzum, Senior Principal Technologist, MapR Technologies
  March 2013
- 2. Agenda
  § HBase
  § MapR
  § M7
  § Containers
- 3. HBase
  A sparse, distributed, persistent, indexed, and sorted map
  OR
  A NoSQL database
  OR
  A columnar data store
- 4. Key-Value Store
  § Row key
    – A binary, sortable value
  § Row content key (analogous to a column)
    – Column family (string)
    – Column qualifier (binary)
    – Version/timestamp (number)
  § A row key, column family, column qualifier, and version uniquely identify a particular cell
    – A cell contains a single binary value
  (A short client-API sketch of these coordinates follows.)
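To make the cell addressing concrete, here is a minimal sketch against the HBase Java client API of that era (HTable, Put, Bytes); the table name "mytable", family "cf", qualifier "q1", and the values are illustrative assumptions, not part of the original deck.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CellExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");      // hypothetical table name
      // One cell = (row key, column family, column qualifier, version) -> value
      Put put = new Put(Bytes.toBytes("row-0001"));    // row key: binary, sortable
      put.add(Bytes.toBytes("cf"),                     // column family (string-like)
              Bytes.toBytes("q1"),                     // column qualifier (binary)
              1363000000000L,                          // version (defaults to a timestamp)
              Bytes.toBytes("some binary value"));     // the cell's single binary value
      table.put(put);
      table.close();
    }
  }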
- 5. A Row
  [Diagram: a row holds columns C0…CN with values Value1…ValueN; each value is addressed by row key, column family, column qualifier, and version.]
- 6. Not a Traditional RDBMS
  § Weakly typed and schema-less (unstructured, or perhaps semi-structured)
    – Almost everything is binary
  § No constraints
    – You can put any binary value in any cell
    – You can even put incompatible types in two different instances of the same column family:column qualifier
  § Column qualifiers are created implicitly
  § Different rows can have different columns
  § No transactions, no ACID
    – The only unit of atomic operation is a single row
- 7. API
  § APIs for querying (get), scanning, and updating (put)
    – Operate on row key, column family, qualifier, version, and values
    – Requests can be partially specified and return the union of matching results
      • Specifying just a row key returns all values for that row (with their column family and qualifier)
      • Specifying a row key and column family in a get retrieves all values for that row and column family
    – By default only the largest version (the most recent, if versions are timestamps) is returned
    – Scanning is just a get over a range of row keys
  § Version
    – Defaults to a timestamp, but any integer is acceptable
  (A sketch of these calls follows.)
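As an illustration of the get/scan calls described above, here is a hedged sketch against the classic HBase Java client API; the table name, row keys, and column family are assumptions made for the example.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ApiExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");                 // hypothetical table

      // Get with only a row key: every column family/qualifier for that row comes back.
      Result whole = table.get(new Get(Bytes.toBytes("row-0001")));
      System.out.println("cells in row: " + whole.size());

      // Get with row key + column family, asking for all versions rather than just the latest.
      Get g = new Get(Bytes.toBytes("row-0001"));
      g.addFamily(Bytes.toBytes("cf"));
      g.setMaxVersions();                                         // default would return only the largest version
      Result family = table.get(g);
      System.out.println("cells in cf: " + family.size());

      // Scan: just a get over a range of row keys (start inclusive, stop exclusive).
      Scan scan = new Scan(Bytes.toBytes("row-0001"), Bytes.toBytes("row-0100"));
      ResultScanner scanner = table.getScanner(scan);
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
      scanner.close();
      table.close();
    }
  }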
- 8. Columnar
  § Rather than storing each table row linearly on disk as a single byte range with fixed-size fields, store the columns of a row separately
    – Very efficient storage for sparse data sets (NULL is free)
    – Compression works better on similar data
    – Fetching only a subset of a row is very efficient (less disk I/O)
    – No fixed size on column values
    – No requirement to even define columns
  § Columns are grouped together into column families
    – Basically a file on disk
    – A unit of optimization
    – In HBase, adding a column is implicit; adding a column family is explicit (see the sketch below)
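The implicit-column / explicit-column-family split looks like this with the old HBaseAdmin client API; a sketch only, and the table, family, and qualifier names are assumptions for the example.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ColumnFamilyExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();

      // Column families are declared explicitly, up front, through the admin API.
      HBaseAdmin admin = new HBaseAdmin(conf);
      HTableDescriptor desc = new HTableDescriptor("mytable");    // hypothetical table
      desc.addFamily(new HColumnDescriptor("cf1"));
      desc.addFamily(new HColumnDescriptor("cf2"));
      admin.createTable(desc);

      // Column qualifiers need no declaration: the first Put that names one creates it implicitly.
      HTable table = new HTable(conf, "mytable");
      Put put = new Put(Bytes.toBytes("row-0001"));
      put.add(Bytes.toBytes("cf1"), Bytes.toBytes("brand-new-qualifier"), Bytes.toBytes("value"));
      table.put(put);
      table.close();
    }
  }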
- 9. HBase Table Architecture
  § Tables are divided into key ranges (regions)
  § Regions are served by nodes (RegionServers)
  § Columns are divided into access groups (column families)
  [Diagram: a table partitioned into regions R1–R4 (row ranges) and column families CF1–CF5.]
- 10. Storage Model Highlights
  § Data is stored in sorted order
    – A table contains rows
    – A sequence of rows is grouped together into a region
      • A region consists of various files related to those rows and is loaded into a RegionServer
      • Regions are stored in HDFS for high availability
    – A single RegionServer manages multiple regions
      • Region assignment can change – load balancing, failures, etc.
  § Clients connect to tables
    – The HBase runtime transparently determines the region (based on key ranges) and contacts the appropriate RegionServer
  § At any given time exactly one RegionServer provides access to a given region
    – The Master and RegionServers (with ZooKeeper) manage that
- 11. What's Great About This?
  § Very scalable
  § Easy to add RegionServers
  § Easy to move regions around
  § Scans are efficient
    – Unlike hashing-based models
  § Access via row key is very efficient
    – Note: there are no secondary indexes
  § No schema; you can store whatever you want, when you want
  § Strong consistency
  § Integrated with Hadoop
    – MapReduce on HBase is straightforward
    – HDFS/MapR-FS provides data replication
- 12. Data Storage Architecture
  § Data from a region column family is stored in an HFile
    – An HFile contains row key:column qualifier:version:value entries
    – An index at the end points into the data – 64KB "blocks" by default
  § Update
    – The new value is written persistently to the Write Ahead Log (WAL)
    – Cached in memory
    – When memory fills, write out a new HFile
  § Read
    – Checks memory, then all of the HFiles
    – Read data is cached in memory
  § Delete
    – Creates a tombstone record (purged at major compaction); see the sketch below
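For completeness, a delete through the client API just records a marker; a brief sketch (the table and column names are illustrative assumptions):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class DeleteExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");                   // hypothetical table
      Delete d = new Delete(Bytes.toBytes("row-0001"));
      d.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("q1"));    // all versions of one cell
      // No HFile is rewritten here: a tombstone is recorded that masks the older values
      // until a major compaction eventually purges both the tombstone and the dead data.
      table.delete(d);
      table.close();
    }
  }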
- 13. Apache HBase HFile Structure
  § Each cell is an individual key + value – a row repeats the key for each column
  § 64KB blocks are compressed
  § Key-value pairs are laid out in increasing key order
  § An index into the compressed blocks is created as a btree
- 14. HBase Region Operation
  § Typical region size is a few GB, sometimes even 10G or 20G
  § The RegionServer holds data in memory until full, then writes a new HFile
    – The logical view of the database is constructed by layering these files, with the latest on top
  [Diagram: HFiles layered from newest to oldest, covering the key range represented by this region.]
- 15. HBase Read Amplification
  § When a get/scan comes in, all the files have to be examined
    – Schema-less, so where is the column?
    – Done in memory and does not change what's on disk
      • Bloom filters do not help in scans
  [Diagram: HFiles layered from newest to oldest.]
  With 7 files, a 1K-record get() potentially takes about 30 seeks, plus 7 block fetches and decompressions, from HDFS. Even with the index in memory, 7 seeks and 7 block fetches are required.
- 16. HBase Write Amplification
  § To reduce the read amplification, HBase merges the HFiles periodically
    – A process called compaction
    – Runs automatically when there are too many files
    – Usually turned off due to I/O storms which interfere with client access
    – And kicked off manually on weekends
  Major compaction reads all files and merges them into a single HFile.
- 17. HBase Server Architecture
  [Diagram: the client performs lookups through ZooKeeper and the HBase Master, which coordinate the servers; data reads and writes go to the HBase RegionServer, whose HFiles and WAL live in HDFS on the Linux filesystem.]
- 18. WAL File
  § A persistent record of every update/insert in sequence order
    – Shared by all regions on one RegionServer
    – WAL files are periodically rolled to limit their size, but older WALs are still needed
    – A WAL file is no longer needed once every region with updates in that file has flushed those updates from memory to an HFile
      • Remember that more HFiles slow the read path!
  § Must be replayed as part of the recovery process, since in-memory updates are "lost"
    – This is very expensive and delays bringing a region back online
- 19. What's Not So Good
  Reliability
  • Complex coordination between ZK, HDFS, the HBase Master, and the RegionServer during region movement
  • Compactions disrupt operations
  • Very slow crash recovery because of
    • Coordination complexity
    • WAL log reading (one log/server)
  Business continuity
  • Many administrative actions require downtime
  • Not well integrated with MapR-FS mirroring and snapshot functionality
- 20. What's Not So Good
  Performance
  • Very long read/write path
  • Significant read and write amplification
  • Multiple JVMs in the read/write path – GC delays!
  Manageability
  • Compactions, splits, and merges must be done manually (in reality)
  • Lots of "well known" problems maintaining a reliable cluster – splitting, compactions, region assignment, etc.
  • Practical limits on the number of regions per RegionServer and the size of regions – can make it hard to fully utilize hardware
- 22. Apache HBase on MapR
  Limited data management, data protection, and disaster recovery for tables.
- 23. Agenda
  § HBase
  § MapR
  § M7
  § Containers
- 24. MapR
  A provider of enterprise-grade Hadoop with uniquely differentiated features
- 26. One Platform for Big Data
  § Broad range of applications: recommendation engines, fraud detection, billing, logistics, risk modeling, market segmentation, inventory forecasting, …
  § Batch, interactive, and real-time workloads: MapReduce, file-based applications, SQL, stream processing, database, search, …
  § Platform services: 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, multi-tenancy
- 27. Dependable: Lights Out Data Center Ready
  Reliable Compute
  § Automated stateful failover
  § Automated re-replication
  § Self-healing from HW and SW failures
  § Load balancing
  § No lost jobs or data
  § 99999's of uptime
  Dependable Storage
  § Business continuity with snapshots and mirrors
  § Recover to a point in time
  § End-to-end checksumming
  § Strong consistency
  § Data safe
  § Mirror across sites to meet Recovery Time Objectives
- 28. Fast: World Record Performance
  Benchmark | MapR 2.1.1 | CDH 4.1.1 | MapR Speed Increase
  Terasort (1x replication, compression disabled)
    Total | 13m 35s | 26m 6s | 2X
    Map | 7m 58s | 21m 8s | 3X
    Reduce | 13m 32s | 23m 37s | 1.8X
  DFSIO throughput/node
    Read | 1003 MB/s | 656 MB/s | 1.5X
    Write | 924 MB/s | 654 MB/s | 1.4X
  YCSB (50% read, 50% update)
    Throughput | 36,584.4 op/s | 12,500.5 op/s | 2.9X
    Runtime | 3.80 hr | 11.11 hr | 2.9X
  YCSB (95% read, 5% update)
    Throughput | 24,704.3 op/s | 10,776.4 op/s | 2.3X
    Runtime | 0.56 hr | 1.29 hr | 2.3X
  MinuteSort record: 1.5 TB in 60 seconds on 2103 nodes
  Benchmark hardware configuration: 10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB disks, 48 GB RAM, 1 x 10GbE
- 29. The Cloud Leaders Pick MapR
  § Amazon EMR is the largest Hadoop provider in revenue and number of clusters
  § Google chose MapR to provide Hadoop on Google Compute Engine
- 30. MapR Supports a Broad Set of Customers
  Use cases across customers (including a global credit card issuer and a leading retailer):
  § Recommendation engines
  § Customer behavior analysis and customer targeting
  § Fraud detection and prevention
  § Brand monitoring and viewer behavioral analytics
  § Global threat analytics, intrusion detection & prevention
  § Virus analysis and forensic analysis
  § Family tree connections
  § Clickstream analysis
  § Patient care
  § Log analysis
  § Quality profiling/field monitoring
  § HBase failure analysis
  § Advertising exchange analysis and optimization
  § Monitoring and measuring online behavior
  § Channel analytics
  § Customer revenue analytics
  § Enterprise-grade platform
  § ETL offload
  § Social media analysis
  § COOP features
- 31. MapR Editions
  M3 (free)
  § Control System
  § NFS Access
  § Performance
  § Unlimited Nodes
  § Free
  M5 (annual subscription)
  § Control System
  § NFS Access
  § Performance
  § High Availability
  § Snapshots & Mirroring
  § 24 x 7 Support
  M7
  § All the features of M5
  § Simplified administration for HBase
  § Increased performance
  § Consistent low latency
  § Unified snapshots and mirroring
  Also available through: Google Compute Engine
- 32. Agenda
  § HBase
  § MapR
  § M7
  § Containers
- 33. M7
  An integrated system for unstructured and structured data
- 34. Introducing MapR M7
  § An integrated system
    – Unified namespace for files and tables
    – Built-in data management & protection
    – No extra administration
  § Architected for reliability and performance
    – Fewer layers
    – Single hop to data
    – No compactions, low I/O amplification
    – Seamless splits, automatic merges
    – Instant recovery
- 35. Binary Compatible with HBase APIs
  § HBase applications work "as is" with M7
    – No need to recompile (binary compatible)
  § Can run M7 and HBase side by side on the same cluster
    – e.g., during a migration
    – Can access both an M7 table and an HBase table in the same program
  § Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa:
    % hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable
- 36. M7: Remove Layers, Simplify
  [Diagram: the MapR M7 stack with its layers removed, compared to HBase. Take note! No JVM!]
- 37. M7: No Master and No RegionServers
  § No JVM problems
  § One hop to data
  § Unified cache
  § No extra daemons to manage
- 38. Region Assignment in Apache HBase
  [Diagram: the region assignment state machine in Apache HBase.]
  None of this complexity is present in MapR M7.
- 39. Unified Namespace for Files and Tables
  $ pwd
  /mapr/default/user/dave
  $ ls
  file1 file2 table1 table2
  $ hbase shell
  hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
  0 row(s) in 0.1570 seconds
  $ ls
  file1 file2 table1 table2 table3
  $ hadoop fs -ls /user/dave
  Found 5 items
  -rw-r--r--   3 mapr mapr 16 2012-09-28 08:34 /user/dave/file1
  -rw-r--r--   3 mapr mapr 22 2012-09-28 08:34 /user/dave/file2
  trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:32 /user/dave/table1
  trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:33 /user/dave/table2
  trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:38 /user/dave/table3
- 40. Tables for End Users
  § Users can create and manage their own tables
    – Unlimited # of tables
  § Tables can be created in any directory
    – Tables count towards volume and user quotas
  § No admin intervention needed
    – I can create a file or a directory without opening a ticket with the admin team – why not a table?
    – Do stuff on the fly; no stopping and restarting servers
  § Automatic data protection and disaster recovery
    – Users can recover from snapshots/mirrors on their own
- 41. M7 – An Integrated System
- 42. M7: Comparative Analysis with Apache HBase, LevelDB, and a BTree
- 43. HBase Write Amplification Analysis
  § Assume 10G per region, write 10% per day, grow 10% per week
    – 1G of writes per day
    – After 7 days: 7 files of 1G and 1 file of 10G (only 1G of that is growth)
  § I/O cost
    – Wrote 7G to the WAL + 7G to HFiles
    – Compaction adds still more
      • read: 17G (= 7 x 1G + 1 x 10G)
      • write: 11G to the new HFile
    – Write amplification: we wrote 7G "for real", but the actual disk I/O after compaction is read 17G + write 25G – and that's assuming no application reads!
  § The I/O cost of 1000 regions is similar to the above
    – read 17T, write 25T => major impact on the node
  § Best practice is to limit the # of regions per node -> can't fully utilize storage
  (The arithmetic is restated as code below.)
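The 17G-read / 25G-write claim can be sanity-checked in a few lines; this is just the slide's assumptions (a 10G region, 1G of new writes per day, a major compaction after 7 days) restated as a hedged back-of-the-envelope calculation, not a model of the actual HBase code.

  public class WriteAmplification {
    public static void main(String[] args) {
      double regionGB = 10.0;        // starting region size
      double dailyWritesGB = 1.0;    // 10% of the region written per day
      int days = 7;                  // days between major compactions

      double logicalWrites = dailyWritesGB * days;               // 7G actually written by the application
      double walWrites = logicalWrites;                          // every update also goes to the WAL: 7G
      double flushWrites = logicalWrites;                        // memstore flushes produce 7 x 1G HFiles
      double compactionReads = regionGB + logicalWrites;         // 10G old HFile + 7 x 1G = 17G read
      double compactionWrites = regionGB + dailyWritesGB;        // merged HFile: 10G + ~1G growth = 11G
      double totalDiskWrites = walWrites + flushWrites + compactionWrites;  // 7 + 7 + 11 = 25G

      System.out.printf("logical writes:      %.0fG%n", logicalWrites);     // 7G
      System.out.printf("disk reads:          %.0fG%n", compactionReads);   // 17G
      System.out.printf("disk writes:         %.0fG%n", totalDiskWrites);   // 25G
      System.out.printf("write amplification: %.1fx%n", totalDiskWrites / logicalWrites);  // ~3.6x
    }
  }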
- 44. Alternative: LevelDB
  § Tiered, logarithmic increase
    – L1: 2 x 1M files
    – L2: 10 x 1M
    – L3: 100 x 1M
    – L4: 1,000 x 1M, etc.
  § Compaction overhead
    – Avoids I/O storms (I/O done in smaller increments of ~10M)
    – But significantly more bandwidth compared to HBase
  § Read overhead is still high
    – 10-15 seeks, perhaps more if the lowest level is very large
    – 40K-60K read from disk to retrieve a 1K record
- 45. BTree Analysis
  § A read finds data directly; proven to be fastest
    – Interior nodes only hold keys
    – Very large branching factor
    – Values only at leaves – thus index caches work
    – R = logN seeks, if no caching
    – A 1K record read will transfer about logN blocks from disk
  § Writes are slow on inserts
    – Inserted into the correct place right away – otherwise a read will not find it
    – Requires the btree to be continuously rebalanced
    – Causes extreme random I/O in the insert path
    – W = 2.5x + logN seeks, if no caching
- 46. Log-Structured Merge Trees
  § LSM trees reduce insert cost by deferring and batching index changes
    – If you don't compact often, read performance is impacted
    – If you compact too often, write performance is impacted
  § B-trees are great for reads
    – But expensive to update in real time
  Can we combine both ideas?
  Writes cannot be done better than W = 2.5x: write to the log + write the data somewhere + update metadata.
  [Diagram: writes go to an in-memory index and an on-disk log; reads consult the in-memory index and the on-disk index.]
  (A toy LSM sketch follows.)
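A toy LSM store makes the trade-off concrete: writes append to a log and to a sorted in-memory map, a full memtable is flushed to an immutable sorted "file", and a read has to consult the memtable plus every flushed file from newest to oldest – exactly the read amplification HBase pays. This is illustrative Java pseudocode under those assumptions, not the HBase or M7 implementation.

  import java.util.ArrayDeque;
  import java.util.Deque;
  import java.util.TreeMap;

  // Toy LSM store: strings only, no real disk I/O, no compaction.
  public class ToyLsm {
    private final StringBuilder wal = new StringBuilder();        // stand-in for the write-ahead log
    private TreeMap<String, String> memtable = new TreeMap<>();   // sorted in-memory buffer
    private final Deque<TreeMap<String, String>> flushed = new ArrayDeque<>(); // newest first
    private final int flushThreshold;

    public ToyLsm(int flushThreshold) { this.flushThreshold = flushThreshold; }

    public void put(String key, String value) {
      wal.append(key).append('=').append(value).append('\n');     // 1. persist to the log
      memtable.put(key, value);                                    // 2. buffer, sorted by key
      if (memtable.size() >= flushThreshold) {                     // 3. flush when full
        flushed.addFirst(memtable);                                //    becomes an immutable "HFile"
        memtable = new TreeMap<>();
      }
    }

    public String get(String key) {
      String v = memtable.get(key);                                // check memory first
      if (v != null) return v;
      for (TreeMap<String, String> file : flushed) {               // then every file, newest to oldest
        v = file.get(key);
        if (v != null) return v;                                   // more files => more lookups (read amplification)
      }
      return null;
    }

    public static void main(String[] args) {
      ToyLsm store = new ToyLsm(2);
      store.put("a", "1");
      store.put("b", "2");                 // triggers a flush
      store.put("a", "3");                 // newer value shadows the flushed one
      System.out.println(store.get("a"));  // 3
      System.out.println(store.get("b"));  // 2 (found in a flushed file)
    }
  }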
- 47. M7 from MapR
  § Twisting BTrees
    – Leaves are variable size (8K - 8M or larger)
    – Can stay unbalanced for long periods of time
      • More inserts will balance it eventually
      • Automatically throttles updates to interior btree nodes
    – M7 inserts "close to" where the data is supposed to go
  § Reads
    – Use the BTree structure to get "close" very fast
      • Very high branching with key-prefix compression
    – Utilize a separate lower-level index to find it exactly
      • Updated "in-place" bloom filters for gets, range maps for scans
  § Overhead
    – A 1K record read will transfer about 32K from disk in logN seeks
- 48. M7 Provides Instant Recovery
  § Instead of one WAL per RegionServer, or even one per region, we have many micro-WALs per region
  § 0-40 micro-WALs per region
    – Idle WALs are "compacted", so most are empty
    – The region is up before all micro-WALs are recovered
    – The region is recovered in the background, in parallel
    – When a key is accessed, that micro-WAL is recovered inline
    – 1000-10000x faster recovery
  § Never performs the equivalent of an HBase major or minor compaction
  § Why doesn't HBase do this? M7 uses MapR-FS, not HDFS
    – No limit to the # of files on disk
    – No limit to the # of open files
    – The I/O path translates random writes to sequential writes on disk
- 49. Summary
  System | 1K-record read amplification | Compaction | Recovery
  HBase with 7 HFiles | 30 seeks, 130K xfer | I/O storms, good bandwidth | Huge WAL to recover
  HBase with 3 HFiles | 15 seeks, 70K xfer | I/O storms, high bandwidth | Huge WAL to recover
  LevelDB with 5 levels | 13 seeks, 48K xfer | No I/O storms, very high bandwidth | WAL is tiny
  BTree | logN seeks, logN xfer | No I/O storms, but 100% random | WAL is proportional to concurrency + cache
  MapR M7 | logN seeks, 32K xfer | No I/O storms, low bandwidth | Micro-WALs allow recovery < 100ms
- 50. M7: Fileservers Serve Regions
  § A region lives entirely inside a container
    – Does not coordinate through ZooKeeper
  § Containers support distributed transactions
    – With replication built in
  § The only coordination in the system is for splits
    – Between the region-map and the data-container
    – We already solved this problem for files and their chunks
- 51. Agenda
  § HBase
  § MapR
  § M7
  § Containers
- 52. What's a MapR container?
- 53. MapR's Containers
  Files/directories are sharded into blocks and placed in containers on disks
  § Each container contains
    – Directories & files
    – Data blocks
    – BTrees
    – 100% random writes
  § Containers are ~32 GB segments of disk, placed on nodes
  Patent pending
- 54. M7 Containers
  § A container holds many files
    – Regular, dir, symlink, btree, chunk-map, region-map, …
    – All random-write capable
  § A container is replicated to servers
    – The unit of resynchronization
  § A region lives entirely inside one container
    – All files + WALs + btrees + bloom filters + range maps
- 55. Read-Write Replication
  § Writes are synchronous
    – All copies have the same data
  § Data is replicated in a "chain" fashion
    – Better bandwidth; utilizes full-duplex network links well
  § Metadata is replicated in a "star" manner
    – Better response time; bandwidth is not a concern
    – Data can also be done this way
  [Diagram: client1, client2, … clientN writing to a replication chain.]
- 56. Random Writing in MapR
  [Diagram: a client writing data asks the CLDB for a 64M block; the CLDB creates the container and picks a master and 2 replica slaves among the servers (S1–S5, e.g., S1/S2/S4); the client attaches and writes, then asks again for the next chunk.]
- 57. Container Balancing
  • Servers keep a bunch of containers "ready to go"
  • Writes get distributed around the cluster
    – As data size increases, writes spread more, like dropping a pebble in a pond
    – Larger pebbles spread the ripples farther
    – Space is balanced by moving idle containers
- 58. Failure Handling
  Containers are managed at the CLDB – the Container Location DataBase – via heartbeats and container reports.
  – Heartbeat loss + an upstream entity reporting failure => the server is dead
  – Increment the epoch at the CLDB
  – Rearrange the replication path
  – Exact same code for files and M7 tables
  – No ZK
- 59. Architectural Params
  [Diagram: a scale from 10^3 to 10^9 bytes marking the units used for i/o, map-reduce, resync, and administration, and where the HDFS 'block' falls.]
  § Unit of I/O
    – 4K/8K (8K in MapR)
  § Unit of chunking (a map-reduce split)
    – 10-100's of megabytes
  § Unit of resync (a replica)
    – 10-100's of gigabytes
    – What data is affected by my missing blocks?
    – A container in MapR
  § Unit of administration (snapshot, replication, mirror, quota, backup)
    – 1 gigabyte - 1000's of terabytes
    – A volume in MapR
- 60. Other M7 Features
  § Smaller disk footprint
    – M7 never repeats the key or column name
  § Columnar layout
    – M7 supports 64 column families
    – In-memory column families
  § Online admin
    – M7 schema changes on the fly
    – Delete/rename/redistribute tables
- 62. Examples: Reliability Issues
  § Compactions disrupt HBase operations: I/O bursts overwhelm nodes. (http://hbase.apache.org/book.html#compaction)
  § Very slow crash recovery: A RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for the impacted regions. (HBASE-1111)
  § Unreliable splitting: Region splitting may cause data to be inconsistent and unavailable. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)
  § No client throttling: An HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)
- 63. Examples: Business Continuity Issues
  § No snapshots: MapR provides all-or-nothing snapshots for HBase. The WALs are shared among tables, so single-table and selective multi-table snapshots are not possible. (HDFS-2802, HDFS-3370, HBASE-50, HBASE-6055)
  § Complex backup process: complex, unreliable, and inefficient. (http://bruteforcedata.blogspot.com/2012/08/hbase-disaster-recovery-and-whisky.html)
  § Administration requires downtime: The entire cluster must be taken down in order to merge regions. Tables must be disabled to change schema, replication, and other properties. (HBASE-420, HBASE-1621, HBASE-5504, HBASE-5335, HBASE-3909)
- 64. Examples: Performance Issues
  § Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149)
  § Limited data locality: HBase does not take block locations into account when assigning regions. After a reboot, RegionServers are often reading data over the network rather than from the local drives. (HBASE-4755, HBASE-4491)
  § Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer, so a commodity server can only handle about 1TB of HBase data, wasting disk space. (http://hbase.apache.org/book/important_configurations.html, http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/)
  § Limited # of tables: A single cluster can only handle several tens of tables effectively. (http://hbase.apache.org/book/important_configurations.html)
- 65. Examples: Manageability Issues
  § Manual major compactions: HBase major compactions are disruptive, so production clusters keep them disabled and rely on the administrator to manually trigger compactions. (http://hbase.apache.org/book.html#compaction)
  § Manual splitting: HBase auto-splitting does not work properly in a busy cluster, so users must pre-split a table based on their estimate of data size/growth. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)
  § Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually.
  § Basic administration is complex: Renaming a table requires copying all the data. Backing up a cluster is a complex process. (HBASE-643)