2. Agenda
Course Credit
Hands-on
Install tm-puppet
Client API: The basics
Hands-on
Write your own CRUD code
Client API: Advanced Features
All material refers to
HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1
3. General Notes (1/2)
All data-mutating operations are atomic
on a per-row basis
Create HTable instances only once for
each thread
HTable is not thread-safe
Can use HTablePool
Get familiar with the API docs
4. General Notes (2/2)
Configuration
Load hbase-default.xml and hbase-site.xml from the
CLASSPATH
Set properties in hadoop CLI
Set properties in Java code
Precedence: Java code > hadoop CLI > hbase-site.xml > hbase-default.xml
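The precedence order above can be sketched as a lookup through layers, highest priority first. This is only a plain-Java model under stated assumptions: it does not use the HBase Configuration class, the class name ConfigPrecedence is hypothetical, and every property value is made up for illustration.

```java
import java.util.List;
import java.util.Map;

// Plain-Java model of layered property lookup. It does not use the HBase
// Configuration class; all property values below are hypothetical.
class ConfigPrecedence {

    /** Returns the value from the first layer that defines the key, or null if none does. */
    static String resolve(String key, List<Map<String, String>> layersHighestFirst) {
        for (Map<String, String> layer : layersHighestFirst) {
            String value = layer.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> javaCode = Map.of();
        Map<String, String> hadoopCli = Map.of("hbase.client.write.buffer", "4194304");
        Map<String, String> hbaseSite = Map.of("hbase.client.write.buffer", "8388608");
        Map<String, String> hbaseDefault = Map.of(
                "hbase.client.write.buffer", "2097152",
                "hbase.client.scanner.caching", "1");

        // Highest priority first: Java code > hadoop CLI > hbase-site.xml > hbase-default.xml
        List<Map<String, String>> layers = List.of(javaCode, hadoopCli, hbaseSite, hbaseDefault);

        // The CLI layer wins over both XML layers.
        System.out.println(resolve("hbase.client.write.buffer", layers));
        // Only the default layer defines this key, so its value is used.
        System.out.println(resolve("hbase.client.scanner.caching", layers));
    }
}
```

A property set nowhere but in hbase-default.xml still resolves, which is why the defaults file sits at the bottom of the chain.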
5. Client API: The basics
Put
Get
Delete
Batch operations
Row Locks
Scan
Source code: https://github.com/larsgeorge/hbase-book
6. Put method - Single Put
ch03/client.PutExample
Notice the timestamp
The timestamp (ts) is set by HBase if the user does not provide one
ts determines the version of a cell in HBase (3 versions are kept by default)
A client-supplied ts may confuse HBase versioning if
the client's time settings are not identical to the servers'
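To see why a wrong client timestamp is confusing, here is a small plain-Java model of one cell that keeps at most three versions, ordered by timestamp. It is a sketch, not HBase code: the class VersionedCell is hypothetical, and it drops excess versions eagerly on every put, which simplifies what a real server does.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Simplified model of a single cell with timestamp-ordered versions.
// Hypothetical class; real HBase handles version cleanup on the server side.
class VersionedCell {
    private final int maxVersions;
    // Newest timestamp first.
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // drop the version with the oldest timestamp
        }
    }

    /** Value with the highest timestamp, or null if the cell is empty. */
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    int versionCount() {
        return versions.size();
    }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell(3);
        cell.put(100L, "a");
        cell.put(200L, "b");
        cell.put(300L, "c");
        cell.put(400L, "d");
        // A client whose clock lags behind writes with an old timestamp:
        cell.put(50L, "written last, but oldest by timestamp");
        System.out.println(cell.latest());       // still the value written at ts 400
        System.out.println(cell.versionCount()); // capped at 3
    }
}
```

The last write in wall-clock order never becomes the visible value, because ordering follows the timestamp and not the arrival order.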
7. Put method - KeyValue
The low-level representation of the data bytes in the client API
<row-key>/<family>:<qualifier>/<version>/<type>/<value-length>
Put.add(KeyValue kv);
Map<byte[], List<KeyValue>> Put.getFamilyMap();
8. Put method – Client-side write buffer (1/3)
Collects put operations so that they are
sent in one RPC call to the server(s)
ch03/client.PutWriteBufferExample
sample code
long getWriteBufferSize()
void setWriteBufferSize(long writeBufferSize)
Default is 2 MB
Configuration property
○ hbase.client.write.buffer
9. Put method – Client-side write buffer (2/3)
Round-trip time
The time it takes for a client to send a
request and the server to send a response
over the network
Does not include the data-transfer time
○ Data size is a factor
On average, 1ms on a LAN
○ 1000 round-trips per second
Use case
Small data size but many requests to send
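A rough back-of-the-envelope sketch of this use case, assuming 1 ms per round trip as stated above. The class RoundTripEstimate and the workload numbers are hypothetical, and the buffered case is approximated as "total bytes divided by buffer size, rounded up", which is simpler than the real flush behavior.

```java
// Rough estimate of round trips with and without a client-side write buffer.
// Hypothetical helper; the flush rule is an approximation, not HBase logic.
class RoundTripEstimate {

    /** bufferBytes <= 0 means "no buffer": every put is its own round trip. */
    static long roundTrips(long puts, long putSizeBytes, long bufferBytes) {
        if (bufferBytes <= 0) {
            return puts;
        }
        long totalBytes = puts * putSizeBytes;
        return (totalBytes + bufferBytes - 1) / bufferBytes; // ceiling division
    }

    static long latencyMillis(long roundTrips, long roundTripMillis) {
        return roundTrips * roundTripMillis;
    }

    public static void main(String[] args) {
        long puts = 10_000;          // assumed workload
        long putSize = 100;          // assumed bytes per put
        long buffer = 2L * 1024 * 1024; // 2 MB, the default write buffer size

        long unbuffered = roundTrips(puts, putSize, 0);
        long buffered = roundTrips(puts, putSize, buffer);

        System.out.println(unbuffered + " round trips, ~" + latencyMillis(unbuffered, 1) + " ms");
        System.out.println(buffered + " round trips, ~" + latencyMillis(buffered, 1) + " ms");
    }
}
```

With many small requests the round-trip count dominates, so batching them into a few RPCs removes most of the waiting time.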
11. Put method – List of Puts
ch03/client.PutListExample
ch03/client.PutListErrorExample2
12. Put method – Atomic compare-and-set
A check before put
ch03/client.CheckAndPutExample
Cannot span rows: the check and the put must target the same row
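The compare-and-set idea can be modeled in plain Java as follows. This is a sketch of the semantics only, not the HBase client call: the class CheckAndPutModel is hypothetical, a single synchronized map stands in for the server, and passing null as the expected value means "the cell must not exist yet".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Model of "check, then put" executed as one atomic step.
// Hypothetical class; a synchronized map stands in for the region server.
class CheckAndPutModel {
    private final Map<String, String> cells = new HashMap<>();

    /**
     * Writes newValue only if the current value equals expected.
     * expected == null means the cell must not exist yet.
     */
    synchronized boolean checkAndPut(String cellKey, String expected, String newValue) {
        if (!Objects.equals(cells.get(cellKey), expected)) {
            return false; // check failed, nothing is written
        }
        cells.put(cellKey, newValue);
        return true;
    }

    synchronized String get(String cellKey) {
        return cells.get(cellKey);
    }

    public static void main(String[] args) {
        CheckAndPutModel table = new CheckAndPutModel();
        System.out.println(table.checkAndPut("row1/colfam1:qual1", null, "val1"));   // true: cell was absent
        System.out.println(table.checkAndPut("row1/colfam1:qual1", null, "val1"));   // false: cell now exists
        System.out.println(table.checkAndPut("row1/colfam1:qual1", "val1", "val2")); // true: value matched
    }
}
```

Because the check and the write happen under one lock, no other writer can slip in between them, which is the guarantee the slide describes for a single row.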
13. Get method – Single Gets
client.GetExample
Result Class
Contains all the matching cells
14. Get method – List of Gets
client.GetListExample
client.GetListErrorExample
client.GetRowOrBeforeExample
Returns the row with the specified rowKey
The previous row if it is not found
Null if no row is found at all
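The "row or the one before" lookup relies on rows being stored in sorted order. A sorted map in plain Java gives the same behavior and may help to picture it. This is a model only: RowOrBeforeModel is a hypothetical class, and String ordering matches HBase's byte ordering only for simple ASCII keys like the ones used here.

```java
import java.util.Map;
import java.util.TreeMap;

// Model of a "row or before" lookup on a sorted key space.
// Hypothetical class; a TreeMap stands in for the sorted table.
class RowOrBeforeModel {

    /** Returns the given row key if present, else the closest smaller key, else null. */
    static String rowOrBefore(TreeMap<String, String> table, String rowKey) {
        Map.Entry<String, String> entry = table.floorEntry(rowKey);
        return entry == null ? null : entry.getKey();
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("row1", "val1");
        table.put("row3", "val3");
        table.put("row5", "val5");

        System.out.println(rowOrBefore(table, "row3")); // exact match
        System.out.println(rowOrBefore(table, "row4")); // falls back to the previous row
        System.out.println(rowOrBefore(table, "row0")); // nothing at or before this key
    }
}
```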
18. Batch Operations
client.BatchExample
The client-side write buffer is not used: batch() calls are sent to the servers directly, unlike buffered Put operations
19. Row Locks (1/3)
Two types of locks
Server side lock
○ Servers will create a lock implicitly on your
behalf, just for the duration of the call
Client side lock
○ Clients can also acquire explicit locks and use
them across multiple operations on the same
row
○ RowLock Class
client.RowLockExample
20. Row Locks (2/3)
Avoid using row locks whenever
possible
Do Gets require a lock?
No. While a mutation is in progress, all
reading clients see the previous
state of all columns
21. Row Locks (3/3)
When is a RowLock released?
When the client explicitly releases the lock
When the lease on the lock has expired
○ Configuration Property on the server side
hbase.regionserver.lease.period
Default is 1 min.
○ org.apache.hadoop.hbase.regionserver.LeaseException
22. Scans (1/3)
A technique akin to cursors in database
systems
It makes use of the underlying
sequential, sorted storage layout that HBase
provides
Narrowing the scan's scope plays
to the strengths of HBase
Since data is stored per column family, the
storage files of unrelated families are not
read at all
24. Scans (3/3)
ch03/client.ScanExample
Scans do not ship all the matching rows in
one RPC to the client
One call would use up too many resources and
take a long time
ResultScanner wraps the Result instance for
each row in an iterator
Just like a JDBC ResultSet
25. Scans – Caching (1/2)
For huge data sets made up of small rows
Table level
void HTable.setScannerCaching(int scannerCaching)
Scan level
void Scan.setCaching(int caching)
In configuration file (hbase-site.xml)
hbase.client.scanner.caching
Whether it takes effect depends on whether you set it on the
client or the server side
26. Scans – Caching (2/2)
Need to find a sweet spot between
A low number of RPCs
The memory used on client and server side
Scanners use the same lease-based
mechanism as RowLock
org.apache.hadoop.hbase.client.ScannerTimeoutException: 65094ms passed since the last invocation, timeout is currently set to 60000
ch03/client.ScanTimeoutExample
27. Scans – Batching
For very large rows
that do not fit into the memory of the client
process
Batching works on the column level
void Scan.setBatch(int batch)
For example, if a row has 17 columns
and you set the batch to 5,
you get four Result instances with 5, 5, 5,
and 2 columns
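The arithmetic of the example can be checked with a few lines of plain Java. This is only a sketch of how a row's columns are cut into batches; BatchSplit is a hypothetical helper and not part of the HBase client.

```java
import java.util.ArrayList;
import java.util.List;

// Computes how many columns each Result would carry for one row.
// Hypothetical helper, for illustration only.
class BatchSplit {

    static List<Integer> resultSizes(int columns, int batch) {
        if (batch <= 0) {
            throw new IllegalArgumentException("batch must be positive");
        }
        List<Integer> sizes = new ArrayList<>();
        for (int remaining = columns; remaining > 0; remaining -= batch) {
            sizes.add(Math.min(batch, remaining));
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(resultSizes(17, 5)); // prints [5, 5, 5, 2]
    }
}
```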
30. Scans – Caching & Batching (3/3)
1 Table, 9 Rows, with some columns
Caching set to 6, batch set to 3
31. Hands-On – Write your own CRUD code (1/3)
In hbase shell
Create table
○ A table 'MY_SECOND_TABLE'
○ With two column families 'FAM_1', 'FAM_2'
In Java code
Put
○ Two values
Scan Table
Delete
○ One value
Get the one remaining value
32. Hands-On – Write your own CRUD code (2/3)
Environment
Let project_home = ${git_home}/hbase-training/002/hands-on/${your_name}
mkdir ${project_home}
cp -rf ${git_home}/hbase-training/002/projects/training-002 ${project_home}
Write your Java code in
${project_home}/src/main/java/client/CrudTest.java
33. Hands-On – Write your own CRUD code (3/3)
Requirements
After you have completed your code
Run the following commands in ${project_home}
○ Build the jar file
mvn clean package
○ Run the jar file you built
sh bin/run.sh > output.txt
○ output.txt
Must show a successful run and the HBase data output
I will verify this file in git
Commit and push your changes to git
35. Filters
Get
Direct access to data
Scan
Use start/end key
Filters
Add more limiting selectors to the query
Applied on the server side
Including
○ Column families, column
qualifiers, timestamps or ranges, version
number
37. Filters – Hierarchy
Various Filter implementations for your needs
You can also write your own implementation
38. Filters – Comparison Filters
They take a comparison operator and a
comparator instance
Class Name             Description
RowFilter              Filters based on the row key
FamilyFilter           Filters based on the column family
QualifierFilter        Filters based on the column qualifier
ValueFilter            Filters based on the column value
DependentColumnFilter  Uses the timestamp of the reference column and includes all other columns that have the same timestamp
42. Filters – Dedicated Filters
Mainly used in Scans; they basically
filter out entire rows
Class Name                      Description
SingleColumnValueFilter         Filters rows based on the value of a single column
SingleColumnValueExcludeFilter  Like SingleColumnValueFilter, but omits the tested column from the result
PrefixFilter                    All rows whose key matches this prefix are returned to the client
PageFilter                      Controls how many rows are returned per page
KeyOnlyFilter                   Accesses just the keys of each KeyValue, omitting the actual data
FirstKeyOnlyFilter              Accesses the key of the first column in each row and bypasses the rest
43. Filters – Dedicated Filters
Class Name              Description
InclusiveStopFilter     Changes the Scan range from [startRow, stopRow) to [startRow, stopRow]
TimestampFilter         Returns only cells whose timestamp (version) is in the specified list of timestamps (versions)
ColumnCountGetFilter    Returns only the first N columns of a row; for HBase test purposes
ColumnPaginationFilter  Similar to the PageFilter, but pages through the columns in a row
ColumnPrefixFilter      Analogous to the PrefixFilter, which filters on row key prefixes; this filter does the same for columns
RandomRowFilter         Includes random rows in the result
44. Filters – Decorating Filters
Class Name        Description
SkipFilter        Wraps a given filter and extends it to exclude an entire row when the wrapped filter hints for a KeyValue to be skipped
WhileMatchFilter  Aborts the entire scan once a piece of information is filtered
45. Filters - FilterList
In practice, you may want to have more
than one filter being applied to reduce
the data returned to your client
application
Operators
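FilterList combines its filters with one of two operators, MUST_PASS_ALL (every filter must accept) or MUST_PASS_ONE (at least one must accept). The plain-Java sketch below models that combination with predicates over row keys. It is not HBase code: FilterListModel, its enum, and the sample predicates are hypothetical stand-ins for real Filter instances.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Models how a list of filters is combined with AND / OR semantics.
// Hypothetical class; predicates stand in for HBase Filter instances.
class FilterListModel {

    enum Operator { MUST_PASS_ALL, MUST_PASS_ONE }

    static <T> Predicate<T> combine(Operator operator, List<Predicate<T>> filters) {
        return item -> operator == Operator.MUST_PASS_ALL
                ? filters.stream().allMatch(f -> f.test(item))
                : filters.stream().anyMatch(f -> f.test(item));
    }

    static List<String> scan(List<String> rowKeys, Predicate<String> filter) {
        return rowKeys.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rows = List.of("row-01", "row-02", "row-03", "row-04");
        Predicate<String> fromRow2 = r -> r.compareTo("row-02") >= 0;
        Predicate<String> endsWith4 = r -> r.endsWith("4");
        List<Predicate<String>> filters = List.of(fromRow2, endsWith4);

        System.out.println(scan(rows, combine(Operator.MUST_PASS_ALL, filters))); // [row-04]
        System.out.println(scan(rows, combine(Operator.MUST_PASS_ONE, filters))); // [row-02, row-03, row-04]
    }
}
```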
46. Filters - FilterList
ch04/filters.FilterListExample
Filter setup of the first scan (figure)
Filter setup of the second scan (figure)
47. Filters – Custom Filters
If none of the existing filters
meets your needs,
you can write one
yourself
Make a Filter implementation
based on
Filter
FilterBase
48. Filters – Custom Filters
ch04/filters.CustomFilter
ch04/filters.CustomFilterExample
Custom Filters Deployment (costly)
1. Build the jar file
2. Put the jar file on every region server
3. Append the jar file path to $CLASSPATH in
hbase-env.sh
4. Restart all HBase daemons
51. Counters
Many applications that collect statistics,
such as clicks or views in online advertising,
used to collect the data in log files that
were subsequently analyzed
A counter is all you need!
52. Counters - shell
Create a table
create 'counters', 'daily', 'weekly', 'monthly'
Initialize a counter
incr 'counters', '20110101', 'daily:hits', 1
Let's do it twice
Get your counter
get_counter 'counters', '20110101', 'daily:hits'
53. Counters - shell
You can also fine-tune your counter
incr 'counters', '20110101', 'daily:hits', 0
incr 'counters', '20110101', 'daily:hits', -1
Do not use put in place of incr, even though a counter
is also just a value
Data type issue: long vs. String
Use get_counter, not get
Its output is more human-readable
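The long vs. String issue comes down to bytes: a counter is stored as an 8-byte long, while a value written as text is stored as the bytes of its characters. The sketch below shows the difference in plain Java. CounterEncoding is a hypothetical class, and its increment method only models the idea that a stored value must be exactly 8 bytes wide to be treated as a counter.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Shows why a value written as text cannot be used as a counter.
// Hypothetical class; it models the encoding, not the HBase server.
class CounterEncoding {

    /** 8-byte big-endian encoding of a long. */
    static byte[] longBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    /** Bytes of the text form, as written by a put of a string value. */
    static byte[] stringBytes(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    /** Increments a stored counter; rejects anything that is not 8 bytes wide. */
    static long increment(byte[] stored, long amount) {
        if (stored.length != Long.BYTES) {
            throw new IllegalArgumentException(
                    "not an 8-byte long: " + stored.length + " byte(s)");
        }
        return ByteBuffer.wrap(stored).getLong() + amount;
    }

    public static void main(String[] args) {
        System.out.println(longBytes(1L).length);     // 8
        System.out.println(stringBytes("1").length);  // 1
        System.out.println(increment(longBytes(1L), 1L)); // 2
        try {
            increment(stringBytes("1"), 1L);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The same reasoning explains get_counter: it decodes the 8 bytes into a number, whereas a plain get shows the raw bytes.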
54. Counters - API
Single Counters
ch04/client.IncrementSingleExample
Multiple Counters
ch04/client.IncrementMultipleExample
55. Coprocessors
With the coprocessor feature in HBase,
you can even move part of the
computation to where the data lives
It acts like a small MapReduce framework
that distributes the work across the
entire cluster
56. Coprocessors
Two types
Observer
○ Trigger-like
Endpoint
○ Stored procedure-like
Use cases
Aggregate functions, sum(), avg()
Integrity checks: when some data is put, other data
must exist
Authentication, authorization and auditing
○ Based on coprocessors since HBase 0.92
61. Coprocessor – Loading from Configuration
Add the following description to hbase-site.xml
Region, master, and WAL are different observers
The order of the fully qualified class names in the value
determines the execution order
Deploy the jar the same way as a custom filter
Loaded for every table and region
62. Coprocessor – Loading from table descriptor
Use HTableDescriptor.setValue(String key, String value)
Key spec.
COPROCESSOR[$<number>]
Ex.
○ COPROCESSOR$1
Value spec.
<jarFilePath>|<classFullyQualifiedName>|<priority>
Ex.
○ “hdfs://localhost:8020/users/leon/test.jar|coprocessor.Test|SYSTEM”
jarFilePath can use any protocol supported by the Hadoop FileSystem class
ch04/coprocessor.LoadWithTableDescriptorExample
Loaded only for regions of the specified table
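The value format `<jarFilePath>|<classFullyQualifiedName>|<priority>` is easy to get wrong by hand, so a small parser can serve as a sanity check before calling setValue. This is a sketch under assumptions: CoprocessorSpec is a hypothetical helper, it only splits the string on the pipe character, and it does not reproduce how HBase itself validates the value.

```java
// Splits a coprocessor table-descriptor value into its three parts.
// Hypothetical helper; HBase performs its own parsing on the server.
class CoprocessorSpec {
    final String jarFilePath;
    final String className;
    final String priority;

    private CoprocessorSpec(String jarFilePath, String className, String priority) {
        this.jarFilePath = jarFilePath;
        this.className = className;
        this.priority = priority;
    }

    static CoprocessorSpec parse(String value) {
        String[] parts = value.split("\\|", -1); // keep empty trailing parts
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "expected <jarFilePath>|<class>|<priority>, got: " + value);
        }
        return new CoprocessorSpec(parts[0].trim(), parts[1].trim(), parts[2].trim());
    }

    public static void main(String[] args) {
        CoprocessorSpec spec = parse(
                "hdfs://localhost:8020/users/leon/test.jar|coprocessor.Test|SYSTEM");
        System.out.println(spec.jarFilePath);
        System.out.println(spec.className);
        System.out.println(spec.priority);
    }
}
```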
63. Coprocessor - Observer
Callback functions (hooks) are executed
when certain events occur
Known as triggers in a DBMS
Observer Type   Description
RegionObserver  Observes events bound to the regions of a table
MasterObserver  Observes events bound to administrative or DDL-type operations (cluster-wide events)
WALObserver     Observes events bound to WAL (write-ahead log) processing
68. Coprocessor - Endpoint
User code can be deployed to the
servers hosting the data to, for example,
perform server-local computations
Known as Stored procedures in DBMS
Can be combined with observer
implementations to directly interact with
the server-side state
71. HTablePool
Creating an HTable instance takes a few
seconds to complete
This does not work in a highly contended
environment with thousands of requests
per second
You could keep one HTable instance for multiple
uses, but it is not thread-safe
75. Connection Handling – Features
Share ZooKeeper connections
Used for the initial lookup of where user table regions are
located
Cache common resources
Location is cached on the client side after first
round-trips with ZooKeeper and other servers
When a lookup fails
○ Ex. A region was split
○ A built-in retry mechanism to refresh the stale
cache information
Do not forget to release your shared Connection
HTable.close()
HTablePool.closeTablePool(…)