002 HBase Client API

Scott Miao 2012/7/5


Agenda
 Course Credit
 Hands-on
   Install tm-puppet
 Client API: The basics
 Hands-on
   Write your own CRUD code
 Client API: Advanced Features
 All material refers to
HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1

General Notes (1/2)
 All operations that mutate data are atomic on a per-row basis

 Create HTable instances only once for each thread
   HTable is not thread-safe
   Can use HTablePool
 Get familiar with the API docs


General Notes (2/2)
 Configuration
   Load hbase-default.xml & hbase-site.xml from the CLASSPATH
   Set properties in the hadoop CLI
   Set properties in Java code
   Precedence: Java code > hadoop CLI > hbase-site.xml > hbase-default.xml
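
The precedence chain above can be pictured as layered maps in which each later source overrides the earlier ones. The following is only a plain-Java sketch, not HBase's own Configuration class; the class name, the `resolve` helper and the non-default values are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of configuration precedence; not HBase code.
final class ConfigPrecedenceSketch {

    // Layers are passed from lowest to highest precedence; later layers win.
    static Map<String, String> resolve(List<Map<String, String>> layersLowToHigh) {
        Map<String, String> effective = new LinkedHashMap<>();
        for (Map<String, String> layer : layersLowToHigh) {
            effective.putAll(layer);
        }
        return effective;
    }

    public static void main(String[] args) {
        String key = "hbase.client.write.buffer";
        Map<String, String> hbaseDefault = Collections.singletonMap(key, "2097152"); // 2 MB default
        Map<String, String> hbaseSite = Collections.singletonMap(key, "4194304");    // assumed site override
        Map<String, String> cli = Collections.emptyMap();                             // nothing set on the CLI
        Map<String, String> javaCode = Collections.singletonMap(key, "8388608");     // assumed value set in code

        Map<String, String> effective =
                resolve(Arrays.asList(hbaseDefault, hbaseSite, cli, javaCode));
        System.out.println(key + " = " + effective.get(key));
    }
}
```

Because the Java-code layer is applied last, its value is the one that survives.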


Client API: The basics
 Put
 Get
 Delete
 Batch operations
 Row Locks
 Scan

Source code: https://github.com/larsgeorge/hbase-book


Put method - Single Put
 ch03/client.PutExample

 Notice the timestamp
   ts is set by HBase if the user does not provide one
   ts determines the version of a cell; HBase keeps 3 versions by default
   A client-supplied ts may confuse HBase versioning if the client's clock or timezone settings differ from the servers'
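
To make the versioning notes concrete, here is a minimal plain-Java sketch of a single cell that keeps only the newest N timestamps. It is a simplification (real HBase prunes excess versions on the server, typically during compactions), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of one cell's versions, newest timestamp first.
final class VersionedCellSketch {
    private final int maxVersions;
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    VersionedCellSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long ts, String value) {
        versions.put(ts, value);
        // Drop the oldest timestamps once the limit is exceeded.
        while (versions.size() > maxVersions) {
            versions.pollLastEntry();
        }
    }

    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    List<Long> timestamps() {
        return new ArrayList<>(versions.keySet());
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch(3);
        cell.put(100L, "a");
        cell.put(200L, "b");
        cell.put(300L, "c");
        cell.put(400L, "d");
        cell.put(50L, "written last, but with an old timestamp");
        System.out.println(cell.timestamps() + " latest=" + cell.latest());
    }
}
```

The last put carries an older timestamp than everything already stored, so it never becomes the latest version. This is the kind of confusion a client with a skewed clock can cause.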


Put method - KeyValue
 The low-level byte representation of a cell in the client API
   <row-key>/<family>:<qualifier>/<version>/<type>/<value-length>

 Put.add(KeyValue kv);
 Map<byte[], List<KeyValue>> Put.getFamilyMap();


Put method – Client-side write buffer (1/3)
 Collects put operations so that they are sent in one RPC call to the server(s)
 ch03/client.PutWriteBufferExample sample code
   long getWriteBufferSize()
   void setWriteBufferSize(long writeBufferSize)
   Default is 2 MB
   Configuration property
    ○ hbase.client.write.buffer
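
The effect of the buffer can be sketched without a cluster. The simulation below assumes a flush is triggered once the buffered bytes exceed the configured size, and it counts raw payload bytes only (real HBase measures the heap size of each Put). All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of a client-side write buffer; not HBase code.
final class WriteBufferSketch {
    private final long bufferSizeBytes;
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int rpcCount = 0;

    WriteBufferSketch(long bufferSizeBytes) {
        this.bufferSizeBytes = bufferSizeBytes;
    }

    // Buffers the payload and flushes once the buffer limit is exceeded.
    void put(byte[] payload) {
        pending.add(payload);
        pendingBytes += payload.length;
        if (pendingBytes > bufferSizeBytes) {
            flush();
        }
    }

    // Sends everything buffered so far in one simulated RPC.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        rpcCount++;
        pending.clear();
        pendingBytes = 0;
    }

    int rpcCount() {
        return rpcCount;
    }

    public static void main(String[] args) {
        WriteBufferSketch buffer = new WriteBufferSketch(2L * 1024 * 1024); // 2 MB, the default size
        for (int i = 0; i < 5000; i++) {
            buffer.put(new byte[1024]); // 5000 puts of 1 KB each
        }
        buffer.flush(); // send the remainder
        System.out.println("RPCs with buffering: " + buffer.rpcCount() + ", without: 5000");
    }
}
```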


Put method – Client-side write buffer (2/3)
 Round-trip time
   The time it takes for a client to send a request and for the server to send a response over the network
   Does not include the data-transfer time
    ○ Data size is a factor
   On average, 1 ms on a LAN
    ○ 1000 round-trips per second
 Use case
   Small data size but many requests to send


Put method – Client-side write buffer (3/3)


Put method – List of Puts

 ch03/client.PutListExample
 ch03/client.PutListErrorExample2


Put method – Atomic compare-and-set
 Checks a condition before applying the put

 ch03/client.CheckAndPutExample

 The check and the put must target the same row
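
The compare-and-set idea can be illustrated with an analogy in plain Java. This sketch uses ConcurrentHashMap instead of HBase, flattens row and column into one string key, and treats a null expected value as "the cell must not exist yet"; all names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical analogy for an atomic check-and-put; not the HBase API.
final class CheckAndPutSketch {
    private final ConcurrentHashMap<String, String> cells = new ConcurrentHashMap<>();

    // Stores newValue only if the current value equals expected.
    // A null expected value means the cell must not exist yet.
    boolean checkAndPut(String cell, String expected, String newValue) {
        if (expected == null) {
            return cells.putIfAbsent(cell, newValue) == null;
        }
        return cells.replace(cell, expected, newValue);
    }

    String get(String cell) {
        return cells.get(cell);
    }

    public static void main(String[] args) {
        CheckAndPutSketch table = new CheckAndPutSketch();
        String cell = "row1/colfam1:qual1";
        System.out.println(table.checkAndPut(cell, null, "val1"));   // cell is absent, put applied
        System.out.println(table.checkAndPut(cell, null, "val1"));   // cell exists now, put rejected
        System.out.println(table.checkAndPut(cell, "val1", "val2")); // value matches, put applied
        System.out.println(table.checkAndPut(cell, "val1", "val3")); // value changed, put rejected
    }
}
```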


Get method – Single Gets

 client.GetExample
 Result Class
 Contains all the matching cells


Get method – List of Gets
 client.GetListExample
 client.GetListErrorExample
 client.GetRowOrBeforeExample
   Finds the specified row key
   Returns the previous row if it is not found
   Returns null if no row is found
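
The "row or before" lookup behaves much like a floor lookup on a sorted map. The sketch below uses java.util.TreeMap as a stand-in for the sorted row keys of a table; it is an analogy, not the HBase call, and the names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical analogy: a sorted map stands in for the table's sorted row keys.
final class RowOrBeforeSketch {

    // Returns the given row key if present, else the closest previous key, else null.
    static String rowOrBefore(TreeMap<String, String> table, String rowKey) {
        Map.Entry<String, String> entry = table.floorEntry(rowKey);
        return entry == null ? null : entry.getKey();
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("row1", "a");
        table.put("row2", "b");
        table.put("row4", "d");
        System.out.println(rowOrBefore(table, "row2")); // exact match
        System.out.println(rowOrBefore(table, "row3")); // falls back to the previous row
        System.out.println(rowOrBefore(table, "row0")); // nothing at or before this key
    }
}
```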


Delete method – Single Deletes
 client.DeleteExample


Delete method – List of Deletes
 client.DeleteListExample
 client.DeleteListErrorExample


Delete method – Atomic compare-and-delete

 client.CheckAndDeleteExample


Batch Operations
 client.BatchExample
 Does not use the client-side write buffer, unlike Put operations


Row Locks (1/3)
 Two types of lock
   Server-side lock
    ○ Servers create a lock implicitly on your behalf, just for the duration of the call
   Client-side lock
    ○ Clients can also acquire explicit locks and use them across multiple operations on the same row
    ○ RowLock class

 client.RowLockExample


Row Locks (2/3)
 Avoid using row locks whenever possible

 Do Gets require a lock?
   No. While a mutation is in progress, reading clients see the previous state of all columns


Row Locks (3/3)
 When is a RowLock released?
   When the current lock is released explicitly
   When the lease on the lock has expired
    ○ Configuration property on the server side
       hbase.regionserver.lease.period
       Default is 1 min.
    ○ org.apache.hadoop.hbase.regionserver.LeaseException:


Scans (1/3)
 A technique akin to cursors in database systems
   It makes use of the underlying sequential, sorted storage layout that HBase provides

 Narrowing the scan's scope plays to the strengths of HBase
   Since data is stored per column family, the storage files of unrelated families are not read at all


Scans (2/3)
 Scan(byte[] startRow, byte[] stopRow)
   [startRow, stopRow)
 Scan addFamily(byte[] family)
 Scan addColumn(byte[] family, byte[] qualifier)
 Scan setTimeRange(long minStamp, long maxStamp) throws IOException
 Scan setTimeStamp(long timestamp)
 Scan setMaxVersions()
 Scan setMaxVersions(int maxVersions)
 Scan setFilter(Filter filter)
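
The half-open range [startRow, stopRow) is worth seeing once: the start row is included and the stop row is excluded. The sketch below mimics this with a sorted map in plain Java; it is an analogy rather than a real scan, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical analogy for a scan over [startRow, stopRow) on sorted row keys.
final class HalfOpenScanSketch {

    // Returns row keys >= startRow and < stopRow, in sorted order.
    static List<String> scan(NavigableMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 1; i <= 5; i++) {
            table.put("row-" + i, "value-" + i);
        }
        // row-2 is included, row-4 is not.
        System.out.println(scan(table, "row-2", "row-4"));
    }
}
```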

Scans (3/3)
 ch03/client.ScanExample

 Scans do not ship all the matching rows to the client in one RPC
   One call would use up too many resources and take a long time
 ResultScanner wraps the Result instance of each row behind an iterator
   Just like a JDBC ResultSet


Scans – Caching (1/2)
 Helps with huge data sets made of small rows
 Table level
   void HTable.setScannerCaching(int scannerCaching)
 Scan level
   void Scan.setCaching(int caching)
 In the configuration file (hbase-site.xml)
   hbase.client.scanner.caching
   Where it takes effect depends on whether you set it on the client or the server side

Scans – Caching (2/2)
 Need to find a sweet spot between
   A low number of RPCs
   The memory used on the client and server side

 Scanners use the same lease-based mechanism as RowLock
   org.apache.hadoop.hbase.client.ScannerTimeoutException: 65094ms passed since the last invocation, timeout is currently set to 60000
 ch03/client.ScanTimeoutExample


Scans – Batching
 Helps with very large rows
   Rows that do not fit into the memory of the client process
 Batching works on the column level
   void Scan.setBatch(int batch)
 For example, your row has 17 columns and you set the batch to 5…
   You'll get four Result instances, with 5, 5, 5, and 2 columns
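
The 17-column example can be checked with a few lines of plain Java. This sketch only reproduces the arithmetic of splitting one row into batches; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: how many columns each Result carries for one row.
final class ScanBatchSketch {

    static List<Integer> resultSizes(int columnsPerRow, int batch) {
        if (batch <= 0) {
            throw new IllegalArgumentException("batch must be positive");
        }
        List<Integer> sizes = new ArrayList<>();
        for (int remaining = columnsPerRow; remaining > 0; remaining -= batch) {
            sizes.add(Math.min(batch, remaining));
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(resultSizes(17, 5)); // prints [5, 5, 5, 2]
    }
}
```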


Scans – Caching & Batching (1/3)
 ch03/client.ScanCacheBatchExample
 10 rows * 20 columns per row = 200 columns


Scans – Caching & Batching (2/3)
 RPCs = (Rows * Cols per Row) / Min(Cols per Row, Batch Size) / Scanner Caching
  = (10 * 20) / Min(20, 20) / 5
  = 200 / 20 / 5
  = 2
 Add 1 or 2 requests for opening/closing the scanner: 2 + 1 = 3 or 2 + 2 = 4
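
The formula can be turned into a small calculator. The sketch below assumes that partial batches and partial cache fills round up to a whole Result or RPC, which the slide's example does not need because everything divides evenly. It leaves out the one or two extra open/close requests, and the names are hypothetical.

```java
// Hypothetical calculator for the data RPCs of a scan; open/close requests are not counted.
final class ScanRpcEstimate {

    private static int ceilDiv(int a, int b) {
        return (a + b - 1) / b;
    }

    // rows: rows scanned; colsPerRow: columns in each row;
    // batch: Scan.setBatch value; caching: scanner caching value.
    static int dataRpcs(int rows, int colsPerRow, int batch, int caching) {
        int resultsPerRow = ceilDiv(colsPerRow, Math.min(colsPerRow, batch));
        int results = rows * resultsPerRow;
        return ceilDiv(results, caching);
    }

    public static void main(String[] args) {
        // The example above: 10 rows, 20 columns, batch 20, caching 5.
        System.out.println(dataRpcs(10, 20, 20, 5)); // prints 2
    }
}
```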


Scans – Caching & Batching (3/3)
 1 Table, 9 Rows, with some columns
 Caching set to 6, batch set to 3


Hands-On – Write your own CRUD code (1/3)
 In hbase shell
   Create table
    ○ A table 'MY_SECOND_TABLE'
    ○ With two column families 'FAM_1', 'FAM_2'
 In Java code
   Put
    ○ Two values
   Scan table
   Delete
    ○ One value
   Get the one remaining value


Hands-On – Write your own CRUD code (2/3)
 Environment
   Let project_home = ${git_home}/hbase-training/002/hands-on/${your_name}
   mkdir ${project_home}
   cp -rf ${git_home}/hbase-training/002/projects/training-002 ${project_home}

 Write Java code in
   ${project_home}/src/main/java/client/CrudTest.java


Hands-On – Write your own CRUD code (3/3)
 Requirements
   After you complete your code
   Run these commands in ${project_home}
    ○ Build the jar file
       mvn clean package
    ○ Run the jar file you built
       sh bin/run.sh > output.txt
    ○ output.txt
       Must show a successful run and the HBase data output
       I will verify this file in git

 Commit and push to git

Client API: Advanced Features
 Filters
 Counters
 Coprocessors
 HTable Pool
 Connection Handling


Filters
 Get
   Direct access to data
 Scan
   Use start/end key
 Filters
   More limiting selectors for the query
   Applied on the server side
   Including
    ○ Column families, column qualifiers, timestamps or ranges, version number

Filters – How Filters work


Filters – Hierarchy
 Various Filter implementations for your needs

 You can also write your own implementation


Filters – Comparison Filters
They take a comparison operator and a comparator instance

Class Name             Description
RowFilter              Filters based on the row key
FamilyFilter           Filters based on the column family
QualifierFilter        Filters based on the column qualifier
ValueFilter            Filters based on the column value
DependentColumnFilter  Uses the timestamp of the reference column and includes all other columns that have the same timestamp

Filters – CompareFilter Operators


Filters – CompareFilter Comparators


Filters – CompareFilter example

 ch04/filters.RowFilterExample


Filters – Dedicated Filters
Mainly used in scans; they basically filter out entire rows

SingleColumnValueFilter         Filters rows based on the value of a single column
SingleColumnValueExcludeFilter  Like SingleColumnValueFilter, but omits the reference column from the result
PrefixFilter                    Returns all rows whose key matches the given prefix
PageFilter                      Controls how many rows are returned per page
KeyOnlyFilter                   Accesses just the key of each KeyValue, omitting the actual data
FirstKeyOnlyFilter              Accesses the key of the first column in each row and bypasses the rest

Filters – Dedicated Filters


InclusiveStopFilter     Changes the scan range from [startRow, stopRow) to [startRow, stopRow]
TimestampFilter         Returns only cells whose timestamp (version) is in the specified list of timestamps (versions)
ColumnCountGetFilter    Returns only the first N columns of a row; meant for HBase testing
ColumnPaginationFilter  Similar to PageFilter, but pages through the columns of a row
ColumnPrefixFilter      Analogous to PrefixFilter, which filters on row key prefixes; this filter does the same for columns
RandomRowFilter         Includes random rows in the result

Filters – Decorating Filters


SkipFilter        Wraps a given filter and extends it to exclude an entire row when the wrapped filter hints for a KeyValue to be skipped
WhileMatchFilter  Aborts the entire scan once a piece of information is filtered


Filters - FilterList
 In practice, you may want to apply more than one filter to reduce the data returned to your client application

 Operators
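
HBase's FilterList.Operator offers MUST_PASS_ALL and MUST_PASS_ONE, which behave like logical AND and OR over the contained filters. The following sketch shows that combination logic with plain Java predicates; it is an analogy and does not use the HBase classes, and the names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical analogy for combining filters with AND / OR semantics.
final class FilterListSketch {

    enum Operator { MUST_PASS_ALL, MUST_PASS_ONE }

    // Builds one predicate that applies all given filters with the chosen operator.
    static <T> Predicate<T> combine(Operator op, List<Predicate<T>> filters) {
        return item -> op == Operator.MUST_PASS_ALL
                ? filters.stream().allMatch(f -> f.test(item))
                : filters.stream().anyMatch(f -> f.test(item));
    }

    public static void main(String[] args) {
        Predicate<String> rowPrefix = row -> row.startsWith("row-1");
        Predicate<String> endsWithZero = row -> row.endsWith("0");
        List<Predicate<String>> filters = Arrays.asList(rowPrefix, endsWithZero);

        Predicate<String> all = combine(Operator.MUST_PASS_ALL, filters);
        Predicate<String> one = combine(Operator.MUST_PASS_ONE, filters);
        for (String row : Arrays.asList("row-10", "row-11", "row-20", "row-22")) {
            System.out.println(row + " all=" + all.test(row) + " one=" + one.test(row));
        }
    }
}
```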


Filters - FilterList
 ch04/filters.FilterListExample
 Filters of the first scan

 Filters of the second scan


Filters – Custom Filters
 If no existing filter meets your needs
   You can write one yourself!

 Write a filter implementation based on
   Filter
   FilterBase


Filters – Custom Filters
 ch04/filters.CustomFilter
 ch04/filters.CustomFilterExample

 Custom filter deployment (costly)
1. Build the jar file
2. Put the jar file on every region server
3. Append the jar file path to $CLASSPATH in hbase-env.sh
4. Restart all HBase daemons


Filters - Summary



Counters
 Many applications collect statistics
   Such as clicks or views in online advertising
   They used to collect the data in log files that would subsequently be analyzed

 A counter is all you need!


Counters - shell
 Create a table
   create 'counters', 'daily', 'weekly', 'monthly'

 Initialize a counter
   incr 'counters', '20110101', 'daily:hits', 1
   Let's do it twice

 Get your counter
   get_counter 'counters', '20110101', 'daily:hits'

Counters - shell
 You can also fine-tune your counter
   incr 'counters', '20110101', 'daily:hits', 0
   incr 'counters', '20110101', 'daily:hits', -1

 Do not use put in place of incr, even though a counter is also just a value
   Data type issue: long vs. String
 Use get_counter, not get
   It is more human-readable
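
The data type issue comes down to bytes: a counter is stored as an 8-byte long, while a value written as text is stored as its characters. The sketch below shows the mismatch in plain Java with java.nio, assuming big-endian long encoding; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of long-encoded versus string-encoded counter values.
final class CounterEncodingSketch {

    // Encodes the value as 8 big-endian bytes.
    static byte[] asLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Encodes the value as its decimal text, the way a text put would.
    static byte[] asString(long value) {
        return Long.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    // Increments a stored counter; rejects anything that is not an 8-byte long.
    static long increment(byte[] stored, long amount) {
        if (stored.length != Long.BYTES) {
            throw new IllegalArgumentException("stored value is not an 8-byte long");
        }
        return ByteBuffer.wrap(stored).getLong() + amount;
    }

    public static void main(String[] args) {
        System.out.println("long bytes: " + asLong(1).length + ", string bytes: " + asString(1).length);
        System.out.println("incremented: " + increment(asLong(1), 1));
        try {
            increment(asString(1), 1);
        } catch (IllegalArgumentException e) {
            System.out.println("cannot increment a string value: " + e.getMessage());
        }
    }
}
```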


Counters - API
 Single Counters
 ch04/client.IncrementSingleExample

 Multiple Counters
 ch04/client.IncrementMultipleExample


Coprocessors
 With the coprocessor feature in HBase, you can even move part of the computation to where the data lives

 It acts like a small MapReduce framework that distributes the work across the entire cluster


Coprocessors
 Two types
 Observer
○ Trigger-like
 Endpoint
○ Stored procedure-like
 Use cases
   Aggregate functions: sum(), avg()
   Integrity checks: when some data is put, other data must exist
   Authentication, authorization and auditing
    ○ Built on coprocessors since HBase 0.92


Coprocessors – Coprocessor Class
 Priorities are defined in the Coprocessor.Priority enumeration


Coprocessors – Coprocessor Class
 States are defined in the Coprocessor.State enumeration


Coprocessor – Main Classes


Coprocessor – Flow


Coprocessor – Loading from Configuration
 Add the coprocessor description to hbase-site.xml

 Region, master and WAL are different observer types
 The order of the fully qualified class names in the value determines the execution order
 Deploy the jar file the same way as a custom filter
 Applies to every table and region

Coprocessor – Loading from table descriptor
 Use HTableDescriptor.setValue(String key, String value)
 Key spec.
   COPROCESSOR[$<number>]
   Ex.
    ○ COPROCESSOR$1
 Value spec.
   <jarFilePath>|<classFullyQualifiedName>|<priority>
   Ex.
    ○ "hdfs://localhost:8020/users/leon/test.jar|coprocessor.Test|SYSTEM"
   jarFilePath may use any scheme supported by the Hadoop FileSystem class

 ch04/coprocessor.LoadWithTableDescriptorExample
 Only for regions of the specified table
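
The key and value strings are easy to get wrong by hand, so a tiny helper can assemble and check them. This sketch only builds and splits the strings described above; it does not touch HTableDescriptor, and the class and method names are hypothetical.

```java
// Hypothetical helper for the COPROCESSOR$<number> key and its pipe-separated value.
final class CoprocessorSpecSketch {

    static String key(int number) {
        return "COPROCESSOR$" + number;
    }

    // Joins the three parts as <jarFilePath>|<classFullyQualifiedName>|<priority>.
    static String value(String jarFilePath, String className, String priority) {
        return String.join("|", jarFilePath, className, priority);
    }

    // Splits a value back into its three parts and rejects malformed input.
    static String[] parse(String value) {
        String[] parts = value.split("\\|", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected <jar>|<class>|<priority>: " + value);
        }
        return parts;
    }

    public static void main(String[] args) {
        String v = value("hdfs://localhost:8020/users/leon/test.jar", "coprocessor.Test", "SYSTEM");
        System.out.println(key(1) + " => " + v);
        System.out.println("class = " + parse(v)[1]);
    }
}
```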


Coprocessor - Observer
 Callback functions (hooks) are executed when certain events occur
 Similar to triggers in a DBMS

Observer Type   Description
RegionObserver  Observes events bound to the regions of a table
MasterObserver  Observes events bound to administrative or DDL-type operations (cluster-wide events)
WALObserver     Observes events bound to WAL (write-ahead log) processing

Coprocessor – Observer main classes


Coprocessor – RegionObserver and Region Life Cycle


Coprocessor – RegionObserver Classes
• Handling region life cycle events
• Handling client API events
• ch04/coprocessor.RegionObserverExample


Coprocessor – MasterObserver Classes
• ch04/coprocessor.MasterObserverExample


Coprocessor - Endpoint
 User code can be deployed to the servers hosting the data, for example to perform server-local computations

 Similar to stored procedures in a DBMS

 Can be combined with observer implementations to interact directly with the server-side state


Coprocessor – Endpoint main Classes
• ch04/coprocessor.RowCountProtocol
• ch04/coprocessor.RowCountEndpoint
• ch04/coprocessor.EndpointExample
• ch04/coprocessor.EndpointProxyExample


Coprocessor – Single region vs. range of regions


HTablePool
 Creating an HTable instance takes a few seconds to complete
 That is not acceptable in a highly contended environment with thousands of requests per second
 You could keep one HTable instance for multiple uses, but HTable is not thread-safe
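
The pooling idea itself is generic: hand out an idle instance if one exists, create a new one otherwise, and keep returned instances for reuse up to a maximum. The sketch below shows that pattern in plain Java with a placeholder object type; it is not HTablePool, and all names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical generic pool illustrating the reuse pattern; not HTablePool.
final class TablePoolSketch<T> {
    private final BlockingQueue<T> idle;
    private final Supplier<T> factory;
    private final AtomicInteger created = new AtomicInteger();

    TablePoolSketch(int maxSize, Supplier<T> factory) {
        this.idle = new ArrayBlockingQueue<>(maxSize);
        this.factory = factory;
    }

    // Reuses an idle instance if available, otherwise creates a new one.
    T get() {
        T instance = idle.poll();
        if (instance == null) {
            created.incrementAndGet();
            instance = factory.get();
        }
        return instance;
    }

    // Returns an instance to the pool; it is simply dropped if the pool is full.
    void put(T instance) {
        idle.offer(instance);
    }

    int createdCount() {
        return created.get();
    }

    public static void main(String[] args) {
        TablePoolSketch<Object> pool = new TablePoolSketch<>(2, Object::new);
        Object first = pool.get();
        pool.put(first);
        Object second = pool.get();
        System.out.println("reused=" + (first == second) + " created=" + pool.createdCount());
    }
}
```

Each thread takes its own instance from the pool and returns it when done, so no instance is shared between threads at the same time.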


HTablePool – Sample code


Connection Handling
 Use a shared connection whenever you can


Connection Handling – Main Classes


Connection Handling – Features
 Share ZooKeeper connections
   Used for the initial lookup of where user table regions are located
 Cache common resources
   Region locations are cached on the client side after the first round-trips with ZooKeeper and other servers
   When a lookup fails
    ○ E.g. a region was split
    ○ A built-in retry mechanism refreshes the stale cache information
 Do not forget to release your shared connection
   HTable.close()
   HTablePool.closeTablePool(…)
