HBase is good at various workloads, ranging from sequential range scans to purely random access. These access patterns can be translated into application types, usually falling into two major groups: entities and events. This presentation discusses the underlying implications and how to approach those use-cases. Examples taken from Facebook show how this has been tackled in real life.
HBase Applications - Atlanta HUG - May 2014
1. HBase Applications – Selected Use-Cases around a Common Theme
Atlanta HUG – May 2014
Lars George, Cloudera EMEA Chief Architect
2. About Me
• EMEA Chief Architect @ Cloudera
• Consulting on Hadoop projects (everywhere)
• Apache Committer
  • HBase and Whirr
• O'Reilly Author
  • HBase – The Definitive Guide
  • Now in Japanese!
• Contact
  • lars@cloudera.com
  • @larsgeorge
The Japanese edition is out!
3. The Content…
• HBase – strengths and weaknesses
• Common use-cases and patterns
• Focus on a specific type of application
• Summary
4. HBase Strengths and Weaknesses
5. IOPS vs. Throughput Mythbusters
It is all physics in the end: you cannot solve an I/O problem without reducing I/O in general. Parallelize access and read/write sequentially.
6. HBase: Strengths & Weaknesses
Strengths:
• Random access to small(ish) key-value pairs
• Rows and columns stored sorted lexicographically
• Adds table and region concepts to group related KVs
• Stores and reads data sequentially
• Parallelizes across all clients
• Non-blocking I/O throughout
7. HBase: Strengths & Weaknesses
Weaknesses:
• Not optimized (yet) for 100% of the possible throughput of the underlying storage layer
  • And HDFS is not fully optimized either
• Single-writer issue with WALs
• Single-server hot-spotting with non-distributed keys
8. Patterns
• There are common patterns in many common use-cases, just like programming patterns.
• We need to extract these common patterns and make them repeatable.
• Similar to the "Gang of Four" (Gamma, Helm, Johnson, Vlissides) or the "Three Amigos" (Booch, Jacobson, Rumbaugh)
10. HBase Dilemma
Although HBase can host many applications, they may require completely opposite features.
Events vs. Entities: Time Series vs. Message Store
11. This talk (at this event)
• Message Store
  • Information exchange between entities
  • Sending/receiving information is an event
• Time-Series
  • Sequence of data points measured at successive points in time, spaced at uniform intervals
  • Measuring a data point is an event
13. HBase "Indexes" (cont.)
• Use the primary keys, aka the row keys, as a sorted index
  • One sort direction only
  • Use a "secondary index" to get reverse sorting
    • Lookup table or same table
• Use the secondary keys, aka the column qualifiers, as a sorted index within the main record
  • Use prefixes within a column family or separate column families
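Not from the deck, but a minimal sketch of the usual trick behind "reverse sorting" with row keys: since HBase only sorts ascending, storing an inverted timestamp (Long.MAX_VALUE - ts) in a per-user index key makes the newest entries sort first. The userId/timestamp layout is illustrative.

    import org.apache.hadoop.hbase.util.Bytes;

    // Illustrative helper: builds a secondary-index row key that sorts newest-first.
    public class InboxIndexKey {
      public static byte[] forUser(String userId, long messageTimestamp) {
        long inverted = Long.MAX_VALUE - messageTimestamp;  // newest message => smallest key
        return Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(inverted));
      }
    }

A scan starting at the bare userId prefix then returns that user's index entries in newest-first order.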
16. HBase Message Store
Use-case:
• Store incoming messages in HBase, such as emails, SMS, MMS, IM
• Constant updates of existing entities
  • e.g. email read, flagged, starred, moved, deleted
• Reading of top-N entries, sorted by time
  • Newest 20 messages, last 20 conversations
• Examples:
  • Facebook Messages
17. Problem Description
• Records are of varying size
  • Large ones hinder smaller ones
• Massive index issue
  • User can sort and filter by everything
  • At the same time, reading the top-N should be fast
  • But what to do for automated accounts? 80/20 rule?
  • Only doable with heuristics
• Only create minimal indexes
  • Create additional ones when the user asks for them
• Cross-mailbox issues with Conversations
  • Similar to the timeline in Facebook
• Overall requirements for I/O
18. Interlude I: Compaction Details
Write Amplification in HBase
19. Compactions in HBase
• Must happen to keep data in check
  • Combine small flush files into larger ones
  • Remove old data (during major compactions)
• Two types: Minor and Major Compactions
  • Minor are triggered with API mutation calls
  • Major are time scheduled (or auto-promoted)
  • Both can be triggered manually if needed
• Add extra background I/O that grows over time
  • Write amplification!
• Have to be tuned for heavy write systems
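The "both can be triggered manually" point maps onto the admin API; a minimal sketch using the 0.9x client (the table name "messages" is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CompactNow {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          admin.compact("messages");       // queue a minor compaction for table "messages"
          admin.majorCompact("messages");  // or force a major compaction (rewrites all files)
        } finally {
          admin.close();
        }
      }
    }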
38. Additional Notes #1
There are a few more settings for compactions:
• hbase.hstore.compaction.max = 10
  Limit on the maximum number of files per compaction
• hbase.hstore.compaction.max.size = Long.MAX_VALUE
  Exclude files larger than that setting (0.92+)
• hbase.hregion.majorcompaction = 1d
  Scheduled major compactions
39. Additional Notes #2
• hbase.hstore.compaction.kv.max = 10
  Limits internal scanner caching during reads of files to be compacted
• hbase.hstore.blockingStoreFiles = 7
  Enforces an upper limit of files for compactions to catch up - blocks user operations!
• hbase.hstore.blockingWaitTime = 90s
  Upper limit on blocking user operations
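A minimal sketch pulling the settings from both slides together with explicit units (the time values are in milliseconds). These normally live in hbase-site.xml on the region servers; the programmatic form is shown only to make the values concrete:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class CompactionSettings {
      public static Configuration tuned() {
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("hbase.hstore.compaction.max", 10);                     // max files per compaction
        conf.setLong("hbase.hstore.compaction.max.size", Long.MAX_VALUE);   // exclude larger files (0.92+)
        conf.setLong("hbase.hregion.majorcompaction", 24L * 60 * 60 * 1000); // "1d" in milliseconds
        conf.setInt("hbase.hstore.compaction.kv.max", 10);                  // KVs per internal scan batch
        conf.setInt("hbase.hstore.blockingStoreFiles", 7);                  // block flushes above this count
        conf.setLong("hbase.hstore.blockingWaitTime", 90 * 1000L);          // "90s" in milliseconds
        return conf;
      }
    }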
41. Writes: Flushes and Compactions
[Chart: store file sizes (0-1000 MB) over time, older to newer, for Existing Row Mutations vs. Unique Row Inserts]
We are looking at two specific rows: one is never changed, the other frequently.
50. Compaction Summary
• Compaction tuning is important
• Do not be too aggressive, or write amplification is noticeable under load
• Use timestamps/time ranges in Get/Scan to limit files

Ratio and its effect:
• 1.0 - Dampened, causes more store files, needs to be combined with effective Bloom filter usage (non-random)
• 1.2 - Default value, moderate setting
• 1.4 - More aggressive, keeps the number of files low, causes more auto-promoted major compactions to occur
53. Background on Bloom Filters
• Bit array of m bits and k hash functions
• HBase uses hash folding
• Returns "No" or "Maybe" only
• Error rate tunable, usually about 1%
• At a 1% error rate and optimal k: ~9.6 bits per key
[Diagram: example Bloom filter with m=18, k=3]
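Not in the deck, but for reference the ~9.6 bits/key figure follows from the standard Bloom filter sizing formulas (bits per key = -ln p / (ln 2)^2, k = bits per key * ln 2); a small sketch of the arithmetic:

    public class BloomSizing {
      public static void main(String[] args) {
        double p = 0.01;                                                  // target false-positive rate
        double bitsPerKey = -Math.log(p) / (Math.log(2) * Math.log(2));   // ~9.59 bits per key
        long k = Math.round(bitsPerKey * Math.log(2));                    // ~7 hash functions
        System.out.printf("p=%.2f -> %.2f bits/key, k=%d%n", p, bitsPerKey, k);
      }
    }

HBase's actual implementation details (hash folding, per-block filters) differ, but the sizing intuition is the same.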
55. Read Time Series Entry
• Event record is written once and never deleted or updated
  • Keeps the entire record in a specific location in the storage files
• Use a time range to indicate what is needed
  • {Get|Scan}.setTimeRange()
  • Helps the system skip unnecessary (older) files
• Bloom Filter helps for given row key(s) and column qualifiers
  • Can skip files not containing the requested details
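A minimal sketch of the setTimeRange() hint mentioned above, using the 0.9x client API; the table/family names ("events", "d") and the 24-hour window are illustrative:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeRangeScan {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "events");
        long end = System.currentTimeMillis();
        long start = end - 24L * 60 * 60 * 1000;   // last 24 hours
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("d"));
        scan.setTimeRange(start, end);             // lets older HFiles be skipped entirely
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            System.out.println(r);
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }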
56. Writes: Flushes and Compactions
[Chart: same store file chart as before, Existing Row Mutations vs. Unique Row Inserts]
Single block read (64K). The Bloom filter and/or time range eliminates all other store files.
57. Read Updateable Entity
• Data is updated regularly, aging out at intervals
• Reading an entity needs to read all details to reconstitute the current state
  • Deletes mask out attributes
  • Updates override (or complement) attributes
• Bloom filters will have a hard time saying "no", since most files might contain entity attributes
• A time filter on scans or gets also has few options to skip files, since older attributes might still be important
58. Writes: Flushes and Compactions
[Chart: same store file chart; the per-file Bloom filter check returns "yes" for all but two files]
Bloom Filter returns "yes" for all but two files: 7+ block loads (64KB) needed
59. Bloom Filter Options
There are three choices:
• NONE
  Duh! Use this when the Bloom Filter is not useful based on the use-case (default setting)
• ROW
  Index only the row key, needs an entry per row key in the Bloom Filter
• ROWCOL
  Index row and column key, requires an entry in the Filter for every column cell (KeyValue)
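A minimal sketch of choosing the Bloom filter type per column family at table-creation time (0.96+ API); the table/family names ("messages", "d") are illustrative:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.BloomType;

    public class CreateWithBloom {
      public static void main(String[] args) throws Exception {
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("messages"));
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setBloomFilterType(BloomType.ROW);   // NONE, ROW, or ROWCOL as discussed above
        desc.addFamily(cf);
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        try {
          admin.createTable(desc);
        } finally {
          admin.close();
        }
      }
    }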
61. Bloom Filter Summary
• They help a lot - but not always
  • Highly depends on write patterns
• Keep an eye on size, since they are cached
  • HFile v2 helps here as it only loads root index info
"Bloom filters can get as large as 100 MB per HFile, which adds up to 2 GB when aggregated over 20 regions. Block indexes can grow as large as 6 GB in aggregate size over the same set of regions."
Source: http://hbase.apache.org/book/hfilev2.html
64. Write-ahead Log - Overview
• One file per Region Server
  • All regions have a reference to this file
  • Actually a wrapper around the physical file
• The file is in the end a Hadoop SequenceFile
  • Stored in HDFS so it can be recovered after a server failure
• There is a synchronization barrier that impacts all parallel writers, aka clients
• Overall performance is BAD, maybe 10MB/s
65. Write-ahead Log - Workarounds
• Enable log compression
  hbase.regionserver.wal.enablecompression
• Disable the WAL for secondary records
  • Restore indexes or derived records from the main one
  • But be careful to use the coprocessor hook, as it cannot access the currently replaying region
• Work on upstream JIRAs
  • Multiple logs per server
  • Fix the single-writer issue in HDFS
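A minimal sketch of the first two workarounds: WAL compression is a cluster-wide setting (normally placed in hbase-site.xml), and the WAL can be skipped per mutation (0.96+ API) for derived records that can be rebuilt from the primary copy. Table/family/qualifier names and key layout are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalWorkarounds {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.setBoolean("hbase.regionserver.wal.enablecompression", true); // server-side setting

        HTable index = new HTable(conf, "messages_index");
        Put put = new Put(Bytes.toBytes("user1-0000000123"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("msgref"), Bytes.toBytes("msg-42"));
        put.setDurability(Durability.SKIP_WAL); // derived record: accept possible loss, rebuild if needed
        index.put(put);                          // pre-0.96 the equivalent was put.setWriteToWAL(false)
        index.close();
      }
    }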
66. Back to the main theme…
Yes, message stores.
67. Schema
• Every line is an inbox
• Indexes as CFs or separate tables
• Random updates and inserts cause storage file churn
• Facebook used more than 4 or 5 schema iterations
  • Not really representative: pure blob storage
  • Evolved over time to be more HBase-like
• Another customer iterated over various schemas at about the same time
  • Difficult to keep indexes up to date
73. Notes on Facebook Schema 1
This is basically the same as the NameNode, i.e. the application only writes edits and those are merged with a snapshot of the data. The application does not use HBase as an operational store, but all data is cached in memory. It occasionally writes large chunks, and reads only a few times to merge or recover.
74. Notes on Facebook Schema 1
Three column families:
• Snapshot, Actions, Keywords
Settings changes:
• DFS Block Size: 256MB
  • Since large KVs are written
  • Efficiency of the HFile block index is a concern
• Compaction ratio: 1.4
  • Be more aggressive to clean up files
• Split Size: 2TB
  • Manage splitting manually
• Major Compactions: 3 days
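The deck only names the settings; a sketch of what they look like in configuration terms, assuming the usual property names (normally set in hdfs-site.xml / hbase-site.xml rather than client code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class Schema1Settings {
      // Property mapping is the conventional one and is an assumption, not from the original deck.
      public static Configuration tuned() {
        Configuration conf = HBaseConfiguration.create();
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);                          // DFS block size: 256 MB
        conf.setFloat("hbase.hstore.compaction.ratio", 1.4f);                       // compaction ratio: 1.4
        conf.setLong("hbase.hregion.max.filesize", 2L * 1024 * 1024 * 1024 * 1024); // split size: 2 TB
        conf.setLong("hbase.hregion.majorcompaction", 3L * 24 * 60 * 60 * 1000);    // major compactions: 3 days
        return conf;
      }
    }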
76. Notes on Facebook Schema 2
• Eight column families
• Snapshots per thread (user to user)
Settings changes:
• Block Cache Size: 55%
  • Cache more data on the HBase side
• Blocking Store Files: 25
  • Allow more files to be around
• Compaction Min Size: 4MB
  • Reduce the number of unconditionally selected files
• Major Compactions: 14 days
78. Notes on Facebook Schema 3
• Eleven column families
• Twenty regions per server
• One hundred servers per cluster
Settings changes:
• Block Cache Size: 60%
  • Cache more data on the HBase side
• Region Slop: 5% (from 20%)
  • Keep strict boundaries on regions per server
80. Note the imbalance! Recall that flushes are interconnected and cause compaction storms.
81. FB Messages Summary
• Triggered many changes in HBase:
  • Changed the compaction selection algorithm
  • Upper bounds on file sizes
  • Pools for small and large compactions
  • Online schema changes
  • Finer-grained metrics
  • Lazy seeking in files
  • Point-seek optimizations
  • …
82. FB Messages Summary
• Went from "Snapshot" to a more proper schema
  • Needed to wait for the schema to settle
  • Could sustain warped load for a while
  • Eventually uses HBase more as a KV store
• Tweaked settings depending on the schema
  • Tuned compactions from aggressive to relaxed
  • Changed block sizes to fit KV sizes
• Strict limit on I/O
  • 100 servers
  • 20 regions per server
  • 50 million users per cluster
84. Events make big data big
• The majority of use cases deal with event-based data
  • Especially on the HDFS and MapReduce level
  • Machine Scale vs. Human Scale
• An event has attributes
  • Type
  • Identifier
  • Actor
  • Other attributes
85. Events contd.
• Accessing event data
  • Give me everything about event e_id1
  • Give me everything in [t1,t2]
  • Give me everything for event type e_t1 in [t1,t2]
  • Give me everything for actor a1 in [t1,t2]
  • Give me everything for event type e_t1 by actor a1 in [t1,t2]
• Aggregate based on some parameters (like above) and report
• Find events that match some other given criteria
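Not from the deck, but a minimal sketch of how such queries can map onto a composite row key of the form <type><actor><timestamp> (assuming fixed-width type/actor identifiers); the "event type e_t1 by actor a1 in [t1,t2]" query then becomes a bounded scan:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EventKeys {
      // Illustrative key layout: event type, then actor, then timestamp.
      static byte[] key(String type, String actor, long ts) {
        return Bytes.add(Bytes.toBytes(type), Bytes.toBytes(actor), Bytes.toBytes(ts));
      }

      static Scan byTypeActorAndTime(String type, String actor, long t1, long t2) {
        Scan scan = new Scan();
        scan.setStartRow(key(type, actor, t1));
        scan.setStopRow(key(type, actor, t2));   // stop row is exclusive
        return scan;
      }
    }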
86. HBase and Time Series
• Access patterns suited for HBase
  • Random access to event data or aggregate data
  • Serving… not real-time computing (that's Impala)
• Schema design is the tricky thing
  • OpenTSDB does this well (but limited)
• Key principles:
  • Collocate data you want to read together
  • Spread out as much as possible at write time
  • The above two conflict in a lot of cases, so you decide on the trade-off
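One common way to buy the "spread out at write time" half of that trade-off is key salting; a minimal sketch (the bucket count and key layout are illustrative), with the cost that reads now need one scan per bucket:

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKeys {
      static final int BUCKETS = 16;

      // Prefix the key with a one-byte salt derived from the key itself, so
      // sequential (time-based) writes are spread across BUCKETS regions.
      static byte[] saltedKey(byte[] key) {
        byte salt = (byte) ((Bytes.hashCode(key) & 0x7fffffff) % BUCKETS);
        return Bytes.add(new byte[] { salt }, key);
      }
    }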
87. Time Series design patterns
• Ingest
  • Flume or direct writing via app
• HDFS
  • Batch queries in Hive
  • Faster queries in Impala
  • No user-time serving
• HBase
  • Serve individual events (OpenTSDB)
  • Serve pre-computed aggregates (OpenTSDB, FB Insights)
• Solr
  • To make individual events searchable
Time
Series
design
paNerns
• Land
data
in
HDFS
and
HBase
• Aggregate
in
HDFS
and
write
to
HBase
• HBase
can
do
some
aggregates
too
(counters)
• Keep
serve-‐able
data
in
HBase.
Then
discard
(TTL
tw)
• Keep
all
data
in
HDFS
for
future
use
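A minimal sketch of the discard step via a column-family TTL (the family name and 30-day window are illustrative); expired cells are dropped during major compactions:

    import org.apache.hadoop.hbase.HColumnDescriptor;

    public class TtlFamily {
      // Column family definition for the "serveable" data in HBase.
      public static HColumnDescriptor servingFamily() {
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setTimeToLive(30 * 24 * 60 * 60);   // TTL is in seconds (here: 30 days)
        return cf;
      }
    }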
89. The story with only HBase
• Landing destination
• Aggregates via counters
• Serving end users
• Event -> Flume/App -> HBase
  • Raw entry in HBase for exact value
  • Multiple counter increments for aggregates
• OSS implementation - OpenTSDB
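A minimal sketch of that write path, storing the raw event and bumping pre-computed aggregate counters in one pass; all table/family/qualifier names and the key layout are illustrative:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EventIngest {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "events");
        long ts = System.currentTimeMillis();

        // 1. Raw event row for exact-value lookups
        Put raw = new Put(Bytes.toBytes("e_t1-a1-" + ts));
        raw.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("{...}"));
        table.put(raw);

        // 2. Counter increments for pre-computed aggregates (per hour and per day)
        long hour = ts / 3600000, day = ts / 86400000;
        table.incrementColumnValue(Bytes.toBytes("agg-e_t1-hour-" + hour),
            Bytes.toBytes("d"), Bytes.toBytes("count"), 1);
        table.incrementColumnValue(Bytes.toBytes("agg-e_t1-day-" + day),
            Bytes.toBytes("d"), Bytes.toBytes("count"), 1);
        table.close();
      }
    }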
91. Applications in HBase
Requires working with schema peculiarities and implementation idiosyncrasies. It is important to compute the write rate and un-optimize the schema to fit the given hardware. If hardware is no issue, then the optimum is achievable.
Trifecta of good performance: Compactions, Bloom Filters, and key design. (But also look out for Memstore and BlockCache settings.)