1. HBase and M7 Technical Overview
Jim Fiori, Senior Solutions Architect, MapR Technologies, April 2013

2. Who am I?
§ Background
§ "a 3-hour tour"
§ Early Hadoop fire-fight
§ Big Data

4. HBase
Google BigTable paper, 2006:
A sparse, distributed, persistent, indexed, and sorted map
OR a NoSQL database
OR a columnar data store

5. Key-Value Store
§ Row key
– Binary, sortable value
§ Row content key (analogous to a column)
– Column family (string)
– Column qualifier (binary)
– Version/timestamp (number)
§ A row key, column family, column qualifier, and version together uniquely identify a particular cell (see the sketch below)
– A cell contains a single binary value
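
To make the cell coordinates concrete, here is a minimal sketch using the HBase Java client of this deck's era (0.94-style API). The table, family, qualifier, row key, and timestamp are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// A cell is addressed by (row key, column family, column qualifier, version).
// Everything except the family name and the version is plain bytes.
public class CellCoordinates {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // hypothetical table name

        Put put = new Put(Bytes.toBytes("row-42"));   // row key (binary)
        put.add(Bytes.toBytes("cf1"),                 // column family (string)
                Bytes.toBytes("price"),               // column qualifier (binary)
                1365638400000L,                       // version (defaults to a timestamp)
                Bytes.toBytes(19.99));                // the single binary cell value
        table.put(put);
        table.close();
    }
}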

6. A Row
[Diagram: a row key pointing to values Value1 … ValueN stored under columns C0 … CN; each value is fully addressed by the tuple (column family, row key, column qualifier, version).]

7. Not A Traditional RDBMS
§ Weakly typed and schema-less (unstructured, or perhaps semi-structured)
– Almost everything is binary
§ No constraints (illustrated below)
– You can put any binary value in any cell
– You can even put incompatible types in two different instances of the same column family:column qualifier
§ Columns (qualifiers) are created implicitly
§ Different rows can have different columns
§ No transactions/no ACID
– The only unit of atomic operation is a single row
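
A minimal sketch of the "no constraints" point, with hypothetical table and column names: the same family:qualifier holds a long in one row and a string in another, and HBase neither stops nor warns you.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NoConstraints {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");  // hypothetical table

        Put a = new Put(Bytes.toBytes("row-1"));
        a.add(Bytes.toBytes("cf1"), Bytes.toBytes("amount"),
              Bytes.toBytes(100L));                  // a long in row-1

        Put b = new Put(Bytes.toBytes("row-2"));
        b.add(Bytes.toBytes("cf1"), Bytes.toBytes("amount"),
              Bytes.toBytes("one hundred"));         // an incompatible type, same column

        table.put(a);
        table.put(b);
        table.close();
    }
}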

8. API
§ APIs for querying (get), scanning, and updating (put)
– Operate on row key, column family, qualifier, version, and values
– You can partially specify the coordinates and will retrieve the union of results
• e.g., if you specify just the row key, you get all values for it (with their column family and qualifier)
– By default only the largest version (the most recent, if versions are timestamps) is returned
• A get specifying a row key and column family retrieves all values for that row and column family
– Scanning is just a get over a range of row keys
§ Version
– While it defaults to a timestamp, any integer is acceptable
(See the example below.)
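
A minimal sketch of these calls against the era's Java client; the table, family, qualifier, and row keys are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class GetScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");  // hypothetical table

        // Get, partially specified: row key only, so the union of all
        // column families/qualifiers for that row comes back.
        Get get = new Get(Bytes.toBytes("row-42"));
        Result row = table.get(get);
        System.out.println("cells in row: " + row.size());

        // Narrowed: row key + family + qualifier; by default only the
        // largest (most recent) version of the cell is returned.
        get = new Get(Bytes.toBytes("row-42"));
        get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("price"));
        byte[] latest = table.get(get).getValue(Bytes.toBytes("cf1"),
                                                Bytes.toBytes("price"));

        // Scan: a get over a range of row keys (start inclusive, stop exclusive).
        Scan scan = new Scan(Bytes.toBytes("row-40"), Bytes.toBytes("row-50"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
    }
}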

9. Columnar
§ Rather than storing table rows linearly on disk, with each row a single byte range of fixed-size fields, store the columns of a row separately
– Very efficient storage for sparse data sets (NULL is free)
– Compression works better on similar data
– Fetching only a subset of a row is very efficient (less disk IO)
– No fixed size on column values
– No requirement to even define columns
§ Columns are grouped together into column families
– Basically a file on disk
– A unit of optimization
– In HBase, adding a column is implicit; adding a column family is explicit (see the sketch below)
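
A sketch of that last contrast with the era's admin API (table and column names are hypothetical): families are declared up front, while qualifiers simply appear on first write.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class FamiliesVsQualifiers {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Column families must be declared explicitly, up front.
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical
        desc.addFamily(new HColumnDescriptor("cf1"));
        admin.createTable(desc);
        admin.close();

        // Column qualifiers are created implicitly on first write.
        HTable table = new HTable(conf, "mytable");
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("brand-new-column"),
                Bytes.toBytes("no schema change needed"));
        table.put(put);
        table.close();
    }
}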

10. HBase Table Architecture
§ Tables are divided into key ranges (regions)
§ Regions are served by nodes (RegionServers)
§ Columns are divided into access groups (column families)
[Diagram: a table partitioned into row-key ranges R1 … R4, crossed with column families CF1 … CF5.]

12. Storage Model Highlights
§ Data is stored in sorted order
– A table contains rows
– A sequence of rows is grouped together into a region
• A region consists of various files related to those rows and is loaded into a region server
• Regions are stored in HDFS for high availability
– A single region server manages multiple regions
• Region assignment can change: load balancing, failures, etc.
§ Clients connect to tables
– The HBase runtime transparently determines the region (based on key ranges) and contacts the appropriate region server
§ At any given time, exactly one region server provides access to a region
– Master region servers (with ZooKeeper) manage that

13. What's Great About This?
§ Very scalable
§ Easy to add region servers
§ Easy to move regions around
§ Scans are efficient
– Unlike hashing-based models
§ Access via row key is very efficient
– Note: there are no secondary indexes
§ No schema; you can store whatever you want, when you want
§ Strong consistency
§ Integrated with Hadoop
– MapReduce on HBase is straightforward
– HDFS/MapR-FS provides data replication

14. Data Storage Architecture
§ Data from a region column family is stored in an HFile
– An HFile contains row key:column qualifier:version:value entries
– An index at the end points into the data; 64KB "blocks" by default
§ Update (sketched below)
– The new value is written persistently to the Write-Ahead Log (WAL)
– Cached in memory (MemStore)
– When memory fills, write out a new HFile
§ Read
– Checks in memory, then all of the HFiles
– Read data is cached in memory
§ Delete
– Creates a tombstone record (purged at major compaction)
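
A conceptual sketch of the update and read paths just described. This is illustrative pseudocode of the general LSM pattern (string keys, in-memory "HFiles"), not HBase's actual internals:

import java.util.*;

// Write path: WAL first, then sorted MemStore, then flush to a new
// immutable "HFile" when memory fills. Reads check memory, then every file.
class RegionStoreSketch {
    private final List<String> wal = new ArrayList<>();              // append-only log
    private final NavigableMap<String, byte[]> memStore = new TreeMap<>();
    private final List<NavigableMap<String, byte[]>> hfiles = new ArrayList<>();
    private static final int FLUSH_THRESHOLD = 100_000;

    void put(String key, byte[] value) {
        wal.add(key);                            // 1. persist intent to the WAL
        memStore.put(key, value);                // 2. cache in sorted memory
        if (memStore.size() >= FLUSH_THRESHOLD) {
            hfiles.add(new TreeMap<>(memStore)); // 3. flush a new immutable file
            memStore.clear();                    //    (its WAL entries are now reclaimable)
        }
    }

    byte[] get(String key) {
        byte[] v = memStore.get(key);            // check memory first ...
        if (v != null) return v;
        for (int i = hfiles.size() - 1; i >= 0; i--) { // ... then every file,
            v = hfiles.get(i).get(key);                // newest to oldest
            if (v != null) return v;
        }
        return null;
    }
}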

15. Apache HBase HFile Structure
§ 64KB blocks are compressed
§ An index into the compressed blocks is created as a btree
§ Key-value pairs are laid out in increasing order
§ Each cell is an individual key + value; a row repeats the key for each column

16. HBase Region Operation
§ Typical region size is a few GB, sometimes even 10G or 20G
§ A RegionServer holds data in memory in a MemStore until full, then writes a new HFile
– The logical view of the database is constructed by layering these files, with the latest on top
[Diagram: a stack of HFiles, newest on top, oldest at the bottom, together covering the key range represented by this region.]

17. HBase Read Amplification
§ When a get/scan comes in, all the files have to be examined
– Schema-less, so where is the column?
– Done in memory; it does not change what's on disk
• Bloom filters do not help in scans
[Diagram: the same stack of HFiles, newest to oldest, all consulted on every read.]
With 7 files, a 1K-record get() potentially takes about 30 seeks plus 7 block fetches and decompressions from HDFS. Even with the index in memory, 7 seeks and 7 block fetches are required. (The arithmetic is spelled out below.)
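
One way to read those numbers; the per-file index seek count here is an assumption for illustration, not a figure from the deck:

// Sketch of the read-amplification arithmetic: with the block index on disk,
// each of the 7 HFiles costs a few seeks to walk its index plus one seek and
// fetch (and decompression) for the 64KB data block.
public class ReadAmplification {
    public static void main(String[] args) {
        int hfiles = 7;
        int indexSeeksPerFile = 3;   // assumed: on-disk index levels per file
        int dataSeeksPerFile  = 1;   // the data block itself

        int coldSeeks = hfiles * (indexSeeksPerFile + dataSeeksPerFile); // ~28-30
        int warmSeeks = hfiles * dataSeeksPerFile;  // index cached in memory: 7
        System.out.println("cold: ~" + coldSeeks + " seeks, "
                           + hfiles + " block fetches");
        System.out.println("warm index: " + warmSeeks + " seeks, "
                           + hfiles + " block fetches");
    }
}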

18. HBase Write Amplification
§ To reduce read amplification, HBase merges the HFiles periodically
– A process called compaction
– Runs automatically when there are too many files
– Usually turned off due to I/O storms, which interfere with client access
– And instead kicked off manually on weekends
Major compaction reads all files and merges them into a single HFile.

19. WAL File
§ A persistent record of every update/insert, in sequence order
– Shared by all regions on one region server
– WAL files are periodically rolled to limit their size, but older WALs are still needed
– A WAL file is no longer needed once every region with updates in that file has flushed those updates from memory to an HFile
• Remember that more HFiles slow the read path!
§ Must be replayed as part of the recovery process, since in-memory updates are "lost"
– This is very expensive and delays bringing a region back online

20. What's Not So Good
Reliability
• Complex coordination between ZK, HDFS, HBase Master, and Region Server during region movement
• Compactions disrupt operations
• Very slow crash recovery, because of
• Coordination complexity
• WAL log reading (one log/server)
Business continuity
• Many administrative actions require downtime
• Not well integrated into MapR-FS mirroring and snapshot functionality

21. What's Not So Good
Performance
• Very long read/write path
• Significant read and write amplification
• Multiple JVMs in the read/write path: GC delays!
Manageability
• Compactions, splits, and merges must (in reality) be done manually
• Lots of "well known" problems in maintaining a reliable cluster: splitting, compactions, region assignment, etc.
• Practical limits on the number of regions per region server and on region size can make it hard to fully utilize the hardware

23. Apache HBase on MapR
Limited data management, data protection, and disaster recovery for tables.

25. MapR Distribution for Apache Hadoop
§ Complete Hadoop distribution
§ Comprehensive management suite
§ Industry-standard interfaces
§ Enterprise-grade dependability
§ Higher performance

27. One Platform for Big Data
[Diagram: one platform supporting a broad range of applications.]
§ Platform services: 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, multi-tenancy
§ Workloads: batch MapReduce, file-based applications, SQL, database, search, stream processing, interactive, real-time
§ Applications: recommendation engines, fraud detection, billing, logistics, risk modeling, market segmentation, inventory forecasting

28. The Cloud Leaders Pick MapR
§ Google chose MapR to provide Hadoop on Google Compute Engine
§ Amazon EMR is the largest Hadoop provider in revenue and # of clusters
§ MinuteSort record: 1.5 TB in 60 seconds on 2103 nodes

29. MapR Editions
M5
§ Control System
§ NFS Access
§ Performance
§ High Availability
§ Snapshots & Mirroring
§ 24 x 7 Support
§ Annual Subscription
M3
§ Control System
§ NFS Access
§ Performance
§ Unlimited Nodes
§ Free
M7
§ All the features of M5
§ Simplified administration for HBase
§ Increased performance
§ Consistent low latency
§ Unified snapshots, mirroring
Also available through: Google Compute Engine

31. Introducing MapR M7
§ An integrated system
– Unified namespace for files and tables
– Built-in data management & protection
– No extra administration
§ Architected for reliability and performance
– Fewer layers
– Single hop to data
– No compactions, low I/O amplification
– Seamless splits, automatic merges
– Instant recovery

33. Binary Compatible with HBase APIs
§ HBase applications work "as is" with M7
– No need to recompile (binary compatible)
§ Can run M7 and HBase side-by-side on the same cluster
– e.g., during a migration
– Can access both an M7 table and an HBase table in the same program
§ Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7, or vice versa:
% hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable

34. M7: No Master and No RegionServers
§ No extra daemons to manage
§ One hop to data
§ Unified cache
§ No JVM problems

35. Region Assignment in Apache HBase
[Diagram: the multi-party region assignment protocol in Apache HBase.]
None of this complexity is present in MapR M7.

36. Unified Namespace for Files and Tables
$ pwd
/mapr/default/user/dave
$ ls
file1  file2  table1  table2
$ hbase shell
hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
0 row(s) in 0.1570 seconds
$ ls
file1  file2  table1  table2  table3
$ hadoop fs -ls /user/dave
Found 5 items
-rw-r--r--   3 mapr mapr   16 2012-09-28 08:34 /user/dave/file1
-rw-r--r--   3 mapr mapr   22 2012-09-28 08:34 /user/dave/file2
trwxr-xr-x   3 mapr mapr    2 2012-09-28 08:32 /user/dave/table1
trwxr-xr-x   3 mapr mapr    2 2012-09-28 08:33 /user/dave/table2
trwxr-xr-x   3 mapr mapr    2 2012-09-28 08:38 /user/dave/table3

37. Tables for End Users
§ Users can create and manage their own tables
– Unlimited # of tables
§ Tables can be created in any directory
– Tables count towards volume and user quotas
§ No admin intervention needed
– I can create a file or a directory without opening a ticket with the admin team, so why not a table?
– Do stuff on the fly; no stopping/restarting servers
§ Automatic data protection and disaster recovery
– Users can recover from snapshots/mirrors on their own

40. HBase Write Amplification Analysis
§ Assume 10G per region; write 10% per day, grow 10% per week
– 1G of writes per day
– After 7 days: 7 files of 1G and 1 file of 10G (only 1G of which is growth)
§ IO cost (worked out below)
– Wrote 7G to the WAL + 7G to HFiles
– Compaction adds still more
• Read: 17G (= 7 x 1G + 1 x 10G)
• Write: 11G to the new HFile
– Write amplification: we wrote 7G "for real", but the actual disk IO after compaction is read 17G + write 25G, and that's assuming no application reads!
§ The IO cost of 1000 such regions scales similarly
– Read 17T, write 25T: a major impact on the node
§ Best practice is therefore to limit the # of regions per node, which means you can't fully utilize the storage
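
The slide's arithmetic, spelled out (a sketch reproducing its stated assumptions):

// Worked version of the write-amplification numbers above.
public class WriteAmplification {
    public static void main(String[] args) {
        double regionGB = 10.0;
        double dailyWriteGB = 0.10 * regionGB;   // write 10% per day -> 1G/day
        int days = 7;

        double walGB   = days * dailyWriteGB;    // 7G appended to the WAL
        double flushGB = days * dailyWriteGB;    // 7G flushed as 7 x 1G HFiles
        double compactReadGB  = days * dailyWriteGB + regionGB; // 7G + 10G = 17G
        double compactWriteGB = 11.0;            // merged HFile: 10G + 1G net growth

        double totalWriteGB = walGB + flushGB + compactWriteGB;  // 7 + 7 + 11 = 25G
        System.out.printf("logical writes: %.0fG%n", walGB);
        System.out.printf("disk reads: %.0fG, disk writes: %.0fG%n",
                          compactReadGB, totalWriteGB);
        System.out.printf("write amplification: %.1fx%n", totalWriteGB / walGB);
    }
}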

41. Alternative: LevelDB
§ Tiered, with a logarithmic increase per level
– L1: 2 x 1M files
– L2: 10 x 1M
– L3: 100 x 1M
– L4: 1,000 x 1M, etc.
§ Compaction overhead
– Avoids IO storms (I/O is done in smaller increments of ~10M)
– But significantly more bandwidth compared to HBase
§ Read overhead is still high
– 10-15 seeks, perhaps more if the lowest level is very large
– 40K-60K read from disk to retrieve a 1K record

42. BTree Analysis
§ A read finds data directly; proven to be the fastest
– Interior nodes hold only keys
– Very large branching factor
– Values live only at the leaves
– Thus index caches work
– R = logN seeks, if no caching (see the worked instance below)
– A 1K record read will transfer about logN blocks from disk
§ Writes are slow on inserts
– Data is inserted into the correct place right away
– Otherwise a read would not find it
– Requires the btree to be continuously rebalanced
– Causes extreme random I/O in the insert path
– W = 2.5x + logN seeks, if no caching
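
A worked instance of that cost model; the key count and branching factor here are assumptions for illustration:

// With branching factor b, a B-tree over N keys has about log_b(N) levels,
// so an uncached read costs about that many seeks and block transfers.
public class BTreeCostModel {
    public static void main(String[] args) {
        long n = 1_000_000_000L;  // 10^9 keys (assumed)
        int b = 1000;             // large branching factor: interior nodes hold only keys

        double levels = Math.log((double) n) / Math.log((double) b); // log_b(10^9) = 3
        System.out.printf("uncached read: ~%.0f seeks, ~%.0f blocks transferred%n",
                          Math.ceil(levels), Math.ceil(levels));
        // The slide's write cost W = 2.5x + logN: the same logN seeks to locate
        // the leaf, plus roughly 2.5x the data volume actually written
        // (log + the data itself + metadata update), as slide 43 breaks down.
    }
}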

43. Log-Structured Merge Trees
§ LSM trees reduce insert cost by deferring and batching index changes
– If you don't compact often, read performance suffers
– If you compact too often, write performance suffers
§ B-trees are great for reads
– But expensive to update in real time
[Diagram: LSM writes go to an in-memory index backed by an on-disk log; reads consult the on-disk index.]
Can we combine both ideas? Writes cannot be done better than W = 2.5x: write to the log + write the data somewhere + update the metadata.

44. M7 from MapR
§ Twisting BTrees
– Leaves are variable size (8K-8M or larger)
– Can stay unbalanced for long periods of time
• More inserts will balance it eventually
• Automatically throttles updates to interior btree nodes
– M7 inserts "close to" where the data is supposed to go
§ Reads
– Use the BTree structure to get "close" very fast
• Very high branching with key-prefix compression
– Utilize a separate lower-level index to find the data exactly
• Updated "in place": bloom filters for gets, range maps for scans
§ Overhead
– A 1K record read will transfer about 32K from disk in logN seeks

45. M7 Provides Instant Recovery
§ Instead of one WAL per region server, or even one per region, we have many micro-WALs per region
§ 0-40 microWALs per region
– Idle WALs are "compacted", so most are empty
– The region is up before all microWALs are recovered
– The region is recovered in the background, in parallel
– When a key is accessed, that microWAL is recovered inline
– 1000-10000x faster recovery
§ Never performs the equivalent of an HBase major or minor compaction
§ Why doesn't HBase do this? M7 uses MapR-FS, not HDFS
– No limit to the # of files on disk
– No limit to the # of open files
– The I/O path translates random writes into sequential writes on disk

46. M7: Fileservers Serve Regions
§ A region lives entirely inside a container
– Does not coordinate through ZooKeeper
§ Containers support distributed transactions
– With replication built in
§ The only coordination in the system is for splits
– Between the region-map and the data-container
– This problem was already solved for files and their chunks

47. M7 Containers
§ A container holds many files
– Regular, dir, symlink, btree, chunk-map, region-map, …
– All random-write capable
§ A container is replicated to servers
– The unit of resynchronization
§ A region lives entirely inside 1 container
– All files + WALs + btrees + bloom filters + range maps

48. Other M7 Features
§ Smaller disk footprint
– M7 never repeats the key or column name
§ Columnar layout
– M7 supports 64 column families
– In-memory column families
§ Online admin
– M7 schema changes on the fly
– Delete/rename/redistribute tables
§ Run MapReduce and tables on the same cluster
§ UI: hbase shell, MCS GUI, maprcli