Philly DB MapR M7 - March 2013
- 1. HBase and M7 Technical Overview
  Keys Botzum, Senior Principal Technologist, MapR Technologies
  March 2013
- 2. Agenda
  § HBase
  § MapR
  § M7
  § Containers
- 3. HBase
  A sparse, distributed, persistent, indexed, and sorted map
  OR
  A NoSQL database
  OR
  A columnar data store
- 4. Key-Value Store
  § Row key
    – A binary, sortable value
  § Row content key (analogous to a column)
    – Column family (string)
    – Column qualifier (binary)
    – Version/timestamp (number)
  § A row key, column family, column qualifier, and version uniquely identify a particular cell
    – A cell contains a single binary value
  (A short client-API sketch of these coordinates follows.)
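To make the cell addressing concrete, here is a minimal sketch against the HBase Java client API of that era (HTable, Put, Bytes); the table name "mytable", family "cf", qualifier "q1", and the values are illustrative assumptions, not part of the original deck.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CellExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");      // hypothetical table name
      // One cell = (row key, column family, column qualifier, version) -> value
      Put put = new Put(Bytes.toBytes("row-0001"));    // row key: binary, sortable
      put.add(Bytes.toBytes("cf"),                     // column family (string-like)
              Bytes.toBytes("q1"),                     // column qualifier (binary)
              1363000000000L,                          // version (defaults to a timestamp)
              Bytes.toBytes("some binary value"));     // the cell's single binary value
      table.put(put);
      table.close();
    }
  }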
- 5. A Row
  [Diagram: a row holds columns C0…CN with values Value1…ValueN; each value is addressed by row key, column family, column qualifier, and version.]
- 6. Not a Traditional RDBMS
  § Weakly typed and schema-less (unstructured, or perhaps semi-structured)
    – Almost everything is binary
  § No constraints
    – You can put any binary value in any cell
    – You can even put incompatible types in two different instances of the same column family:column qualifier
  § Column qualifiers are created implicitly
  § Different rows can have different columns
  § No transactions, no ACID
    – The only unit of atomic operation is a single row
- 7. API
  § APIs for querying (get), scanning, and updating (put)
    – Operate on row key, column family, qualifier, version, and values
    – Requests can be partially specified and return the union of matching results
      • Specifying just a row key returns all values for that row (with their column family and qualifier)
      • Specifying a row key and column family in a get retrieves all values for that row and column family
    – By default only the largest version (the most recent, if versions are timestamps) is returned
    – Scanning is just a get over a range of row keys
  § Version
    – Defaults to a timestamp, but any integer is acceptable
  (A sketch of these calls follows.)
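As an illustration of the get/scan calls described above, here is a hedged sketch against the classic HBase Java client API; the table name, row keys, and column family are assumptions made for the example.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ApiExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");                 // hypothetical table

      // Get with only a row key: every column family/qualifier for that row comes back.
      Result whole = table.get(new Get(Bytes.toBytes("row-0001")));
      System.out.println("cells in row: " + whole.size());

      // Get with row key + column family, asking for all versions rather than just the latest.
      Get g = new Get(Bytes.toBytes("row-0001"));
      g.addFamily(Bytes.toBytes("cf"));
      g.setMaxVersions();                                         // default would return only the largest version
      Result family = table.get(g);
      System.out.println("cells in cf: " + family.size());

      // Scan: just a get over a range of row keys (start inclusive, stop exclusive).
      Scan scan = new Scan(Bytes.toBytes("row-0001"), Bytes.toBytes("row-0100"));
      ResultScanner scanner = table.getScanner(scan);
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
      scanner.close();
      table.close();
    }
  }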
- 8. Columnar
  § Rather than storing each table row linearly on disk as a single byte range with fixed-size fields, store the columns of a row separately
    – Very efficient storage for sparse data sets (NULL is free)
    – Compression works better on similar data
    – Fetching only a subset of a row is very efficient (less disk I/O)
    – No fixed size on column values
    – No requirement to even define columns
  § Columns are grouped together into column families
    – Basically a file on disk
    – A unit of optimization
    – In HBase, adding a column is implicit; adding a column family is explicit (see the sketch below)
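The implicit-column / explicit-column-family split looks like this with the old HBaseAdmin client API; a sketch only, and the table, family, and qualifier names are assumptions for the example.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ColumnFamilyExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();

      // Column families are declared explicitly, up front, through the admin API.
      HBaseAdmin admin = new HBaseAdmin(conf);
      HTableDescriptor desc = new HTableDescriptor("mytable");    // hypothetical table
      desc.addFamily(new HColumnDescriptor("cf1"));
      desc.addFamily(new HColumnDescriptor("cf2"));
      admin.createTable(desc);

      // Column qualifiers need no declaration: the first Put that names one creates it implicitly.
      HTable table = new HTable(conf, "mytable");
      Put put = new Put(Bytes.toBytes("row-0001"));
      put.add(Bytes.toBytes("cf1"), Bytes.toBytes("brand-new-qualifier"), Bytes.toBytes("value"));
      table.put(put);
      table.close();
    }
  }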
- 9. HBase Table Architecture
  § Tables are divided into key ranges (regions)
  § Regions are served by nodes (RegionServers)
  § Columns are divided into access groups (column families)
  [Diagram: a table partitioned into regions R1–R4 (row ranges) and column families CF1–CF5.]
- 10. Storage Model Highlights
  § Data is stored in sorted order
    – A table contains rows
    – A sequence of rows is grouped together into a region
      • A region consists of various files related to those rows and is loaded into a RegionServer
      • Regions are stored in HDFS for high availability
    – A single RegionServer manages multiple regions
      • Region assignment can change – load balancing, failures, etc.
  § Clients connect to tables
    – The HBase runtime transparently determines the region (based on key ranges) and contacts the appropriate RegionServer
  § At any given time exactly one RegionServer provides access to a given region
    – The Master and RegionServers (with ZooKeeper) manage that
- 11. What's Great About This?
  § Very scalable
  § Easy to add RegionServers
  § Easy to move regions around
  § Scans are efficient
    – Unlike hashing-based models
  § Access via row key is very efficient
    – Note: there are no secondary indexes
  § No schema; you can store whatever you want, when you want
  § Strong consistency
  § Integrated with Hadoop
    – MapReduce on HBase is straightforward
    – HDFS/MapR-FS provides data replication
- 12. Data Storage Architecture
  § Data from a region column family is stored in an HFile
    – An HFile contains row key:column qualifier:version:value entries
    – An index at the end points into the data – 64KB "blocks" by default
  § Update
    – The new value is written persistently to the Write Ahead Log (WAL)
    – Cached in memory
    – When memory fills, write out a new HFile
  § Read
    – Checks memory, then all of the HFiles
    – Read data is cached in memory
  § Delete
    – Creates a tombstone record (purged at major compaction); see the sketch below
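For completeness, a delete through the client API just records a marker; a brief sketch (the table and column names are illustrative assumptions):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class DeleteExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");                   // hypothetical table
      Delete d = new Delete(Bytes.toBytes("row-0001"));
      d.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("q1"));    // all versions of one cell
      // No HFile is rewritten here: a tombstone is recorded that masks the older values
      // until a major compaction eventually purges both the tombstone and the dead data.
      table.delete(d);
      table.close();
    }
  }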
- 13. Apache HBase HFile Structure
  § Each cell is an individual key + value – a row repeats the key for each column
  § 64KB blocks are compressed
  § Key-value pairs are laid out in increasing key order
  § An index into the compressed blocks is created as a btree
- 14. HBase Region Operation
  § Typical region size is a few GB, sometimes even 10G or 20G
  § The RegionServer holds data in memory until full, then writes a new HFile
    – The logical view of the database is constructed by layering these files, with the latest on top
  [Diagram: HFiles layered from newest to oldest, covering the key range represented by this region.]
- 15. HBase Read Amplification
  § When a get/scan comes in, all the files have to be examined
    – Schema-less, so where is the column?
    – Done in memory and does not change what's on disk
      • Bloom filters do not help in scans
  [Diagram: HFiles layered from newest to oldest.]
  With 7 files, a 1K-record get() potentially takes about 30 seeks, plus 7 block fetches and decompressions, from HDFS. Even with the index in memory, 7 seeks and 7 block fetches are required.
- 16. HBase Write Amplification
  § To reduce the read amplification, HBase merges the HFiles periodically
    – A process called compaction
    – Runs automatically when there are too many files
    – Usually turned off due to I/O storms which interfere with client access
    – And kicked off manually on weekends
  Major compaction reads all files and merges them into a single HFile.
- 17. HBase Server Architecture
  [Diagram: the client performs lookups through ZooKeeper and the HBase Master, which coordinate the servers; data reads and writes go to the HBase RegionServer, whose HFiles and WAL live in HDFS on the Linux filesystem.]
- 18. WAL File
  § A persistent record of every update/insert in sequence order
    – Shared by all regions on one RegionServer
    – WAL files are periodically rolled to limit their size, but older WALs are still needed
    – A WAL file is no longer needed once every region with updates in that file has flushed those updates from memory to an HFile
      • Remember that more HFiles slow the read path!
  § Must be replayed as part of the recovery process, since in-memory updates are "lost"
    – This is very expensive and delays bringing a region back online
- 19. What's Not So Good
  Reliability
  • Complex coordination between ZK, HDFS, the HBase Master, and the RegionServer during region movement
  • Compactions disrupt operations
  • Very slow crash recovery because of
    • Coordination complexity
    • WAL log reading (one log/server)
  Business continuity
  • Many administrative actions require downtime
  • Not well integrated with MapR-FS mirroring and snapshot functionality
- 20. What's Not So Good
  Performance
  • Very long read/write path
  • Significant read and write amplification
  • Multiple JVMs in the read/write path – GC delays!
  Manageability
  • Compactions, splits, and merges must be done manually (in reality)
  • Lots of "well known" problems maintaining a reliable cluster – splitting, compactions, region assignment, etc.
  • Practical limits on the number of regions per RegionServer and the size of regions – can make it hard to fully utilize hardware
- 22. Apache HBase on MapR
  Limited data management, data protection, and disaster recovery for tables.
- 23. Agenda
  § HBase
  § MapR
  § M7
  § Containers
- 24. MapR
  A provider of enterprise-grade Hadoop with uniquely differentiated features
- 26. One Platform for Big Data
  § Broad range of applications: recommendation engines, fraud detection, billing, logistics, risk modeling, market segmentation, inventory forecasting, …
  § Batch, interactive, and real-time workloads: MapReduce, file-based applications, SQL, stream processing, database, search, …
  § Platform services: 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, multi-tenancy
- 27. Dependable: Lights Out Data Center Ready
  Reliable Compute
  § Automated stateful failover
  § Automated re-replication
  § Self-healing from HW and SW failures
  § Load balancing
  § No lost jobs or data
  § 99999's of uptime
  Dependable Storage
  § Business continuity with snapshots and mirrors
  § Recover to a point in time
  § End-to-end checksumming
  § Strong consistency
  § Data safe
  § Mirror across sites to meet Recovery Time Objectives
- 28. Fast: World Record Performance
  Benchmark | MapR 2.1.1 | CDH 4.1.1 | MapR Speed Increase
  Terasort (1x replication, compression disabled)
    Total | 13m 35s | 26m 6s | 2X
    Map | 7m 58s | 21m 8s | 3X
    Reduce | 13m 32s | 23m 37s | 1.8X
  DFSIO throughput/node
    Read | 1003 MB/s | 656 MB/s | 1.5X
    Write | 924 MB/s | 654 MB/s | 1.4X
  YCSB (50% read, 50% update)
    Throughput | 36,584.4 op/s | 12,500.5 op/s | 2.9X
    Runtime | 3.80 hr | 11.11 hr | 2.9X
  YCSB (95% read, 5% update)
    Throughput | 24,704.3 op/s | 10,776.4 op/s | 2.3X
    Runtime | 0.56 hr | 1.29 hr | 2.3X
  MinuteSort record: 1.5 TB in 60 seconds on 2103 nodes
  Benchmark hardware configuration: 10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB disks, 48 GB RAM, 1 x 10GbE
- 29. The Cloud Leaders Pick MapR
  § Amazon EMR is the largest Hadoop provider in revenue and number of clusters
  § Google chose MapR to provide Hadoop on Google Compute Engine
- 30. MapR Supports a Broad Set of Customers
  Use cases across customers (including a global credit card issuer and a leading retailer):
  § Recommendation engines
  § Customer behavior analysis and customer targeting
  § Fraud detection and prevention
  § Brand monitoring and viewer behavioral analytics
  § Global threat analytics, intrusion detection & prevention
  § Virus analysis and forensic analysis
  § Family tree connections
  § Clickstream analysis
  § Patient care
  § Log analysis
  § Quality profiling/field monitoring
  § HBase failure analysis
  § Advertising exchange analysis and optimization
  § Monitoring and measuring online behavior
  § Channel analytics
  § Customer revenue analytics
  § Enterprise-grade platform
  § ETL offload
  § Social media analysis
  § COOP features
- 31. MapR Editions
  M3 (free)
  § Control System
  § NFS Access
  § Performance
  § Unlimited Nodes
  § Free
  M5 (annual subscription)
  § Control System
  § NFS Access
  § Performance
  § High Availability
  § Snapshots & Mirroring
  § 24 x 7 Support
  M7
  § All the features of M5
  § Simplified administration for HBase
  § Increased performance
  § Consistent low latency
  § Unified snapshots and mirroring
  Also available through: Google Compute Engine
- 32. Agenda
  § HBase
  § MapR
  § M7
  § Containers
- 33. M7
  An integrated system for unstructured and structured data
- 34. Introducing MapR M7
  § An integrated system
    – Unified namespace for files and tables
    – Built-in data management & protection
    – No extra administration
  § Architected for reliability and performance
    – Fewer layers
    – Single hop to data
    – No compactions, low I/O amplification
    – Seamless splits, automatic merges
    – Instant recovery
- 35. Binary Compatible with HBase APIs
  § HBase applications work "as is" with M7
    – No need to recompile (binary compatible)
  § Can run M7 and HBase side by side on the same cluster
    – e.g., during a migration
    – Can access both an M7 table and an HBase table in the same program
  § Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa:
    % hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable
- 36. M7: Remove Layers, Simplify
  [Diagram: the MapR M7 stack with its layers removed, compared to HBase. Take note! No JVM!]
- 37. M7: No Master and No RegionServers
  § No JVM problems
  § One hop to data
  § Unified cache
  § No extra daemons to manage
- 38. Region Assignment in Apache HBase
  [Diagram: the region assignment state machine in Apache HBase.]
  None of this complexity is present in MapR M7.
- 39. Unified Namespace for Files and Tables
  $ pwd
  /mapr/default/user/dave
  $ ls
  file1 file2 table1 table2
  $ hbase shell
  hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
  0 row(s) in 0.1570 seconds
  $ ls
  file1 file2 table1 table2 table3
  $ hadoop fs -ls /user/dave
  Found 5 items
  -rw-r--r--   3 mapr mapr 16 2012-09-28 08:34 /user/dave/file1
  -rw-r--r--   3 mapr mapr 22 2012-09-28 08:34 /user/dave/file2
  trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:32 /user/dave/table1
  trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:33 /user/dave/table2
  trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:38 /user/dave/table3
- 40. Tables for End Users
  § Users can create and manage their own tables
    – Unlimited # of tables
  § Tables can be created in any directory
    – Tables count towards volume and user quotas
  § No admin intervention needed
    – I can create a file or a directory without opening a ticket with the admin team – why not a table?
    – Do stuff on the fly; no stopping and restarting servers
  § Automatic data protection and disaster recovery
    – Users can recover from snapshots/mirrors on their own
- 41. M7 – An Integrated System
- 42. M7: Comparative Analysis with Apache HBase, LevelDB, and a BTree
- 43. HBase Write Amplification Analysis
  § Assume 10G per region, write 10% per day, grow 10% per week
    – 1G of writes per day
    – After 7 days: 7 files of 1G and 1 file of 10G (only 1G of that is growth)
  § I/O cost
    – Wrote 7G to the WAL + 7G to HFiles
    – Compaction adds still more
      • read: 17G (= 7 x 1G + 1 x 10G)
      • write: 11G to the new HFile
    – Write amplification: we wrote 7G "for real", but the actual disk I/O after compaction is read 17G + write 25G – and that's assuming no application reads!
  § The I/O cost of 1000 regions is similar to the above
    – read 17T, write 25T => major impact on the node
  § Best practice is to limit the # of regions per node -> can't fully utilize storage
  (The arithmetic is restated as code below.)
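The 17G-read / 25G-write claim can be sanity-checked in a few lines; this is just the slide's assumptions (a 10G region, 1G of new writes per day, a major compaction after 7 days) restated as a hedged back-of-the-envelope calculation, not a model of the actual HBase code.

  public class WriteAmplification {
    public static void main(String[] args) {
      double regionGB = 10.0;        // starting region size
      double dailyWritesGB = 1.0;    // 10% of the region written per day
      int days = 7;                  // days between major compactions

      double logicalWrites = dailyWritesGB * days;               // 7G actually written by the application
      double walWrites = logicalWrites;                          // every update also goes to the WAL: 7G
      double flushWrites = logicalWrites;                        // memstore flushes produce 7 x 1G HFiles
      double compactionReads = regionGB + logicalWrites;         // 10G old HFile + 7 x 1G = 17G read
      double compactionWrites = regionGB + dailyWritesGB;        // merged HFile: 10G + ~1G growth = 11G
      double totalDiskWrites = walWrites + flushWrites + compactionWrites;  // 7 + 7 + 11 = 25G

      System.out.printf("logical writes:      %.0fG%n", logicalWrites);     // 7G
      System.out.printf("disk reads:          %.0fG%n", compactionReads);   // 17G
      System.out.printf("disk writes:         %.0fG%n", totalDiskWrites);   // 25G
      System.out.printf("write amplification: %.1fx%n", totalDiskWrites / logicalWrites);  // ~3.6x
    }
  }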
- 44. Alternative: LevelDB
  § Tiered, logarithmic increase
    – L1: 2 x 1M files
    – L2: 10 x 1M
    – L3: 100 x 1M
    – L4: 1,000 x 1M, etc.
  § Compaction overhead
    – Avoids I/O storms (I/O done in smaller increments of ~10M)
    – But significantly more bandwidth compared to HBase
  § Read overhead is still high
    – 10-15 seeks, perhaps more if the lowest level is very large
    – 40K-60K read from disk to retrieve a 1K record
- 45. BTree Analysis
  § A read finds data directly; proven to be fastest
    – Interior nodes only hold keys
    – Very large branching factor
    – Values only at leaves – thus index caches work
    – R = logN seeks, if no caching
    – A 1K record read will transfer about logN blocks from disk
  § Writes are slow on inserts
    – Inserted into the correct place right away – otherwise a read will not find it
    – Requires the btree to be continuously rebalanced
    – Causes extreme random I/O in the insert path
    – W = 2.5x + logN seeks, if no caching
- 46. Log-Structured Merge Trees
  § LSM trees reduce insert cost by deferring and batching index changes
    – If you don't compact often, read performance is impacted
    – If you compact too often, write performance is impacted
  § B-trees are great for reads
    – But expensive to update in real time
  Can we combine both ideas?
  Writes cannot be done better than W = 2.5x: write to the log + write the data somewhere + update metadata.
  [Diagram: writes go to an in-memory index and an on-disk log; reads consult the in-memory index and the on-disk index.]
  (A toy LSM sketch follows.)
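A toy LSM store makes the trade-off concrete: writes append to a log and to a sorted in-memory map, a full memtable is flushed to an immutable sorted "file", and a read has to consult the memtable plus every flushed file from newest to oldest – exactly the read amplification HBase pays. This is illustrative Java pseudocode under those assumptions, not the HBase or M7 implementation.

  import java.util.ArrayDeque;
  import java.util.Deque;
  import java.util.TreeMap;

  // Toy LSM store: strings only, no real disk I/O, no compaction.
  public class ToyLsm {
    private final StringBuilder wal = new StringBuilder();        // stand-in for the write-ahead log
    private TreeMap<String, String> memtable = new TreeMap<>();   // sorted in-memory buffer
    private final Deque<TreeMap<String, String>> flushed = new ArrayDeque<>(); // newest first
    private final int flushThreshold;

    public ToyLsm(int flushThreshold) { this.flushThreshold = flushThreshold; }

    public void put(String key, String value) {
      wal.append(key).append('=').append(value).append('\n');     // 1. persist to the log
      memtable.put(key, value);                                    // 2. buffer, sorted by key
      if (memtable.size() >= flushThreshold) {                     // 3. flush when full
        flushed.addFirst(memtable);                                //    becomes an immutable "HFile"
        memtable = new TreeMap<>();
      }
    }

    public String get(String key) {
      String v = memtable.get(key);                                // check memory first
      if (v != null) return v;
      for (TreeMap<String, String> file : flushed) {               // then every file, newest to oldest
        v = file.get(key);
        if (v != null) return v;                                   // more files => more lookups (read amplification)
      }
      return null;
    }

    public static void main(String[] args) {
      ToyLsm store = new ToyLsm(2);
      store.put("a", "1");
      store.put("b", "2");                 // triggers a flush
      store.put("a", "3");                 // newer value shadows the flushed one
      System.out.println(store.get("a"));  // 3
      System.out.println(store.get("b"));  // 2 (found in a flushed file)
    }
  }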
- 47. M7 from MapR
  § Twisting BTrees
    – Leaves are variable size (8K - 8M or larger)
    – Can stay unbalanced for long periods of time
      • More inserts will balance it eventually
      • Automatically throttles updates to interior btree nodes
    – M7 inserts "close to" where the data is supposed to go
  § Reads
    – Use the BTree structure to get "close" very fast
      • Very high branching with key-prefix compression
    – Utilize a separate lower-level index to find it exactly
      • Updated "in-place" bloom filters for gets, range maps for scans
  § Overhead
    – A 1K record read will transfer about 32K from disk in logN seeks
- 48. M7 Provides Instant Recovery
  § Instead of one WAL per RegionServer, or even one per region, we have many micro-WALs per region
  § 0-40 micro-WALs per region
    – Idle WALs are "compacted", so most are empty
    – The region is up before all micro-WALs are recovered
    – The region is recovered in the background, in parallel
    – When a key is accessed, that micro-WAL is recovered inline
    – 1000-10000x faster recovery
  § Never performs the equivalent of an HBase major or minor compaction
  § Why doesn't HBase do this? M7 uses MapR-FS, not HDFS
    – No limit to the # of files on disk
    – No limit to the # of open files
    – The I/O path translates random writes to sequential writes on disk
- 49. Summary
  System | 1K-record read amplification | Compaction | Recovery
  HBase with 7 HFiles | 30 seeks, 130K xfer | I/O storms, good bandwidth | Huge WAL to recover
  HBase with 3 HFiles | 15 seeks, 70K xfer | I/O storms, high bandwidth | Huge WAL to recover
  LevelDB with 5 levels | 13 seeks, 48K xfer | No I/O storms, very high bandwidth | WAL is tiny
  BTree | logN seeks, logN xfer | No I/O storms, but 100% random | WAL is proportional to concurrency + cache
  MapR M7 | logN seeks, 32K xfer | No I/O storms, low bandwidth | Micro-WALs allow recovery < 100ms
- 50. M7: Fileservers Serve Regions
  § A region lives entirely inside a container
    – Does not coordinate through ZooKeeper
  § Containers support distributed transactions
    – With replication built in
  § The only coordination in the system is for splits
    – Between the region-map and the data-container
    – We already solved this problem for files and their chunks
- 51. Agenda
  § HBase
  § MapR
  § M7
  § Containers
- 52. What's a MapR container?
- 53. MapR's Containers
  Files/directories are sharded into blocks and placed in containers on disks
  § Each container contains
    – Directories & files
    – Data blocks
    – BTrees
    – 100% random writes
  § Containers are ~32 GB segments of disk, placed on nodes
  Patent pending
- 54. M7 Containers
  § A container holds many files
    – Regular, dir, symlink, btree, chunk-map, region-map, …
    – All random-write capable
  § A container is replicated to servers
    – The unit of resynchronization
  § A region lives entirely inside one container
    – All files + WALs + btrees + bloom filters + range maps
- 55. Read-Write Replication
  § Writes are synchronous
    – All copies have the same data
  § Data is replicated in a "chain" fashion
    – Better bandwidth; utilizes full-duplex network links well
  § Metadata is replicated in a "star" manner
    – Better response time; bandwidth is not a concern
    – Data can also be done this way
  [Diagram: client1, client2, … clientN writing to a replication chain.]
- 56. Random Writing in MapR
  [Diagram: a client writing data asks the CLDB for a 64M block; the CLDB creates the container and picks a master and 2 replica slaves among the servers (S1–S5, e.g., S1/S2/S4); the client attaches and writes, then asks again for the next chunk.]
- 57. Container Balancing
  • Servers keep a bunch of containers "ready to go"
  • Writes get distributed around the cluster
    – As data size increases, writes spread more, like dropping a pebble in a pond
    – Larger pebbles spread the ripples farther
    – Space is balanced by moving idle containers
- 58. Failure Handling
  Containers are managed at the CLDB – the Container Location DataBase – via heartbeats and container reports.
  – Heartbeat loss + an upstream entity reporting failure => the server is dead
  – Increment the epoch at the CLDB
  – Rearrange the replication path
  – Exact same code for files and M7 tables
  – No ZK
- 59. Architectural Params
  [Diagram: a scale from 10^3 to 10^9 bytes marking the units used for i/o, map-reduce, resync, and administration, and where the HDFS 'block' falls.]
  § Unit of I/O
    – 4K/8K (8K in MapR)
  § Unit of chunking (a map-reduce split)
    – 10-100's of megabytes
  § Unit of resync (a replica)
    – 10-100's of gigabytes
    – What data is affected by my missing blocks?
    – A container in MapR
  § Unit of administration (snapshot, replication, mirror, quota, backup)
    – 1 gigabyte - 1000's of terabytes
    – A volume in MapR
- 60. Other M7 Features
  § Smaller disk footprint
    – M7 never repeats the key or column name
  § Columnar layout
    – M7 supports 64 column families
    – In-memory column families
  § Online admin
    – M7 schema changes on the fly
    – Delete/rename/redistribute tables
- 62. Examples: Reliability Issues
  § Compactions disrupt HBase operations: I/O bursts overwhelm nodes. (http://hbase.apache.org/book.html#compaction)
  § Very slow crash recovery: A RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for the impacted regions. (HBASE-1111)
  § Unreliable splitting: Region splitting may cause data to be inconsistent and unavailable. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)
  § No client throttling: An HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)
- 63. Examples: Business Continuity Issues
  § No snapshots: MapR provides all-or-nothing snapshots for HBase. The WALs are shared among tables, so single-table and selective multi-table snapshots are not possible. (HDFS-2802, HDFS-3370, HBASE-50, HBASE-6055)
  § Complex backup process: complex, unreliable, and inefficient. (http://bruteforcedata.blogspot.com/2012/08/hbase-disaster-recovery-and-whisky.html)
  § Administration requires downtime: The entire cluster must be taken down in order to merge regions. Tables must be disabled to change schema, replication, and other properties. (HBASE-420, HBASE-1621, HBASE-5504, HBASE-5335, HBASE-3909)
- 64. Examples: Performance Issues
  § Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149)
  § Limited data locality: HBase does not take block locations into account when assigning regions. After a reboot, RegionServers are often reading data over the network rather than from the local drives. (HBASE-4755, HBASE-4491)
  § Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer, so a commodity server can only handle about 1TB of HBase data, wasting disk space. (http://hbase.apache.org/book/important_configurations.html, http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/)
  § Limited # of tables: A single cluster can only handle several tens of tables effectively. (http://hbase.apache.org/book/important_configurations.html)
- 65. Examples: Manageability Issues
  § Manual major compactions: HBase major compactions are disruptive, so production clusters keep them disabled and rely on the administrator to manually trigger compactions. (http://hbase.apache.org/book.html#compaction)
  § Manual splitting: HBase auto-splitting does not work properly in a busy cluster, so users must pre-split a table based on their estimate of data size/growth. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)
  § Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually.
  § Basic administration is complex: Renaming a table requires copying all the data. Backing up a cluster is a complex process. (HBASE-643)