Supporting Apache HBase
Troubleshooting and Supportability Improvements
  1. Supporting Apache HBase: Troubleshooting and Supportability Improvements
  2. Who we are
     • Daisuke Kobayashi (d1ce_)
       • Customer support at Cloudera since 2012, focusing on HDFS and HBase
       • Apache HBase contributor
     • Toshihiro Suzuki (brfrn169)
       • Apache HBase committer since 2018
       • Sr. Software Engineer, Breakfix (HBase/Phoenix, HDFS) at Cloudera
       • Wrote and published a Japanese-language book on HBase for beginners
     © Cloudera, Inc. All rights reserved.
  3. Supporting HBase
     • Typical troubleshooting scenarios with HBase
       • Fix performance degradation (slowness)
       • Identify why a process crashed
       • Fix inconsistencies
  4. Agenda
     • General approach to HBase performance issues with existing tools
     • htop - Real-time monitoring tool for HBase
  5. General approach to HBase performance issues with existing tools
     (The logs and metrics shown are specific to HBase 2.1 (CDH 6.2).)
  6. Troubleshooting Performance Issues
     • Performance issues are tough!
     • Typical reasons
       • “Hot spot” region
       • Region with non-local data
       • Excessive I/O wait due to swapping or an overworked or failing hard disk
       • Stop-the-world long GC pauses in RegionServers
       • Slowness due to high processor usage
       • Network saturation, etc.
     • Source of truth
       • Logs (a lot!)
       • Metrics (a lot!)
  7. Approach to Performance Troubleshooting
     • Understanding the issue
     • Top-down
     • USE Method (specifically, this talk focuses on U and S: Utilization and Saturation)
     Source: https://www.slideshare.net/brendangregg/velocity-2015-linux-perf-tools
  8. Resources and Observability in RegionServer (diagram): RPC System (Handlers / Queues), MemStore, BlockCache, HDFS Client
  9. Resources and Observability in RegionServer (diagram): RPC System (Handlers / Queues), MemStore, BlockCache, HDFS Client
  10. Resources and Observability in RegionServer (diagram annotated with what can be observed per resource)
     • RPC System (Handlers / Queues): Frequency of requests; RPC Processed Time, Queue Length & Time
     • MemStore: Memstore Size, Flush Size, Frequency of flush, Frequency of blocking updates
     • BlockCache: Cache Size, Cache Eviction Ratio
     • HDFS Client: Flush Queue
  11. (Diagram) RPC System (Handlers / Queues), HDFS Client, MemStore, BlockCache
  12. RPC System Utilization & Saturation
     • Number of RPC requests
       • Incremented by one for each of the following actions at the RPC server level:
         doReplayBatchOp, closeRegion, compactRegion, flushRegion, getOnlineRegion, getRegionInfo, getServerInfo, openRegion, rollWALWriter, bulkLoadHFile, prepareBulkLoad, get, multi, mutate, scan
     • Raw metrics:
         "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
         "totalRequestCount" : 167130,
     • Master webui: HBASE-21207 made the columns sortable!
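The raw metric above can be read programmatically. The following is a minimal sketch, assuming the RegionServer's JMX JSON output wraps its entries in a top-level "beans" array (an assumption about the endpoint's shape, not something shown on the slide). The helper names `find_bean` and `request_rate` are hypothetical, and the sample value is the one from the slide.

```python
import json

# Sample shaped like JMX JSON output. The "beans" wrapper is an assumption;
# the bean name and counter value are copied from the slide.
SAMPLE_JMX = """
{"beans": [
  {"name": "Hadoop:service=HBase,name=RegionServer,sub=Server",
   "totalRequestCount": 167130}
]}
"""

SERVER_BEAN = "Hadoop:service=HBase,name=RegionServer,sub=Server"


def find_bean(jmx_text, bean_name):
    """Return the bean dict whose "name" equals bean_name, or None."""
    for bean in json.loads(jmx_text).get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    return None


def request_rate(count_before, count_after, interval_seconds):
    """Requests per second between two samples of the cumulative counter."""
    return (count_after - count_before) / interval_seconds


server = find_bean(SAMPLE_JMX, SERVER_BEAN)
print(server["totalRequestCount"])
```

Because totalRequestCount is cumulative, a single reading says little about load; sampling it twice and passing both readings to `request_rate` gives a utilization figure that can be compared across RegionServers.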
  13. RPC System Utilization & Saturation
     • RPC queue length & request size (RegionServer webui and raw metrics):
         "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
         "queueSize" : 619211,
         "numCallsInGeneralQueue" : 5,
         "numCallsInPriorityQueue" : 0,
     • queueSize: running count of the size in bytes of all outstanding calls, whether currently executing or queued waiting to be run.
     • numCallsInGeneralQueue: queue for normal handlers. The number of handlers is controlled by hbase.regionserver.handler.count.
     • numCallsInPriorityQueue: queue for high-priority handlers, which deal with admin requests and system table operation requests. The number of handlers is controlled by hbase.regionserver.metahandler.count.
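As a rough illustration of how these three IPC values might be read together, here is a small sketch. The function name `summarize_ipc` and the "backlog" flag are hypothetical conveniences, not HBase concepts; the input values are the ones from the slide.

```python
# IPC bean values copied from the slide.
IPC_BEAN = {
    "name": "Hadoop:service=HBase,name=RegionServer,sub=IPC",
    "queueSize": 619211,
    "numCallsInGeneralQueue": 5,
    "numCallsInPriorityQueue": 0,
}


def summarize_ipc(bean):
    """Condense the IPC bean into a few saturation hints."""
    general = bean["numCallsInGeneralQueue"]
    priority = bean["numCallsInPriorityQueue"]
    return {
        # Calls waiting for a handler, across both queues.
        "queued_calls": general + priority,
        # Bytes held by all outstanding calls (executing or queued), in MiB.
        "outstanding_mib": bean["queueSize"] / (1024 * 1024),
        # Hypothetical flag: any waiting call means handlers were all busy.
        "backlog": (general + priority) > 0,
    }


print(summarize_ipc(IPC_BEAN))
```

A persistent backlog in the general queue points at hbase.regionserver.handler.count or slow handlers, while one in the priority queue points at the meta handlers.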
  14. RPC System Utilization & Saturation
     • Number of processed/queued requests
     • If queued > processed, it is time to check a thread dump
     • Raw metrics:
         "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
         "ProcessCallTime_num_ops" : 10961,
         "QueueCallTime_num_ops" : 10961,
     • Cloudera Manager chart: select ipc_process_rate, ipc_queue_rate where roleType = REGIONSERVER
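The "queued > processed" rule of thumb can be sketched as a comparison of two counter samples. This is an illustrative reading of the slide, assuming both counters are cumulative and sampled a known number of seconds apart; `op_rates` and `should_check_thread_dump` are hypothetical names.

```python
KEYS = ("ProcessCallTime_num_ops", "QueueCallTime_num_ops")


def op_rates(before, after, interval_seconds):
    """Per-second rates for the two cumulative IPC counters."""
    return {k: (after[k] - before[k]) / interval_seconds for k in KEYS}


def should_check_thread_dump(before, after, interval_seconds):
    """True when calls were queued faster than they were processed."""
    r = op_rates(before, after, interval_seconds)
    return r["QueueCallTime_num_ops"] > r["ProcessCallTime_num_ops"]


# The "after" sample uses the values from the slide; "before" is illustrative.
before = {"ProcessCallTime_num_ops": 10000, "QueueCallTime_num_ops": 10000}
after = {"ProcessCallTime_num_ops": 10961, "QueueCallTime_num_ops": 10961}
print(should_check_thread_dump(before, after, 60))
```

This mirrors what the Cloudera Manager chart shows with ipc_process_rate and ipc_queue_rate: the two lines normally overlap, and a gap between them is the signal to look at what the handler threads are doing.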
  15. RPC System Utilization & Saturation
     • Observability improvements
       • In the past, when a scan.next() call was slow, the target region name was unknown.
       • HBASE-16972 improved the logging by adding ‘scandetails’.
     • Before:
         2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer"}
     • After:
         2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer","scandetails":"table: cluster_test region: cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}
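With ‘scandetails’ in place, slow scans can be attributed to a region by parsing the log line. The sketch below assumes the text after "(responseTooSlow): " is a single JSON object, as in the slide; the sample line is abridged from the slide (fewer fields), and `parse_slow_response` is a hypothetical helper.

```python
import json
import re

MARKER = "(responseTooSlow): "

# Abridged from the slide: same format, fewer JSON fields.
LINE = (
    "2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: "
    "(responseTooSlow): "
    '{"method":"Scan","processingtimems":30000,"queuetimems":0,'
    '"client":"10.1.1.6:34690",'
    '"scandetails":"table: cluster_test region: '
    'cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}'
)


def parse_slow_response(line):
    """Extract method, timings and (if present) table/region from one line."""
    idx = line.find(MARKER)
    if idx < 0:
        return None
    payload = json.loads(line[idx + len(MARKER):])
    # scandetails is absent in logs written before HBASE-16972.
    m = re.match(r"table: (\S+) region: (\S+)", payload.get("scandetails", ""))
    return {
        "method": payload.get("method"),
        "processing_ms": payload.get("processingtimems"),
        "queue_ms": payload.get("queuetimems"),
        "table": m.group(1) if m else None,
        "region": m.group(2) if m else None,
    }


print(parse_slow_response(LINE))
```

Grouping the parsed records by region is one way to check whether slow scans concentrate on a single hot region.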
  16. (Diagram) RPC System (Handlers / Queues), HDFS Client, MemStore, BlockCache
  17. Memstore Utilization & Saturation
     • Memstore size (RegionServer webui and raw metrics)
     • Raw metrics, per RegionServer:
         "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
         "memStoreSize" : 5372418924,
     • Raw metrics, per region:
         "name" : "Hadoop:service=HBase,name=RegionServer,sub=Regions",
         "Namespace_default_table_cluster_test_region_7cdc92fd59a4f1a96b431552d952560c_metric_memStoreSize" : 18295903,
         "Namespace_default_table_dice2_region_155bf45f338288ae19cc0e3841a5d013_metric_memStoreSize" : 0,
         "Namespace_default_table_cluster_test_region_d5349e089ff8129faa1e35dee2957e27_metric_memStoreSize" : 4642160,
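The per-region keys encode namespace, table and encoded region name, so they can be split apart and ranked. This is a sketch under the assumption that keys follow the pattern visible on the slide; the regular expression is a guess based on those three examples and may mis-split names that themselves contain "_table_" or "_region_". `regions_by_memstore` is a hypothetical helper.

```python
import re

# Pattern inferred from the keys on the slide (an assumption, not a spec).
KEY_RE = re.compile(
    r"^Namespace_(.+?)_table_(.+)_region_([0-9a-f]{32})_metric_memStoreSize$"
)

# Values copied from the slide.
REGIONS_BEAN = {
    "name": "Hadoop:service=HBase,name=RegionServer,sub=Regions",
    "Namespace_default_table_cluster_test_region_7cdc92fd59a4f1a96b431552d952560c_metric_memStoreSize": 18295903,
    "Namespace_default_table_dice2_region_155bf45f338288ae19cc0e3841a5d013_metric_memStoreSize": 0,
    "Namespace_default_table_cluster_test_region_d5349e089ff8129faa1e35dee2957e27_metric_memStoreSize": 4642160,
}


def regions_by_memstore(bean):
    """Return (namespace, table, region, bytes) tuples, largest first."""
    rows = []
    for key, value in bean.items():
        m = KEY_RE.match(key)
        if m:
            rows.append((m.group(1), m.group(2), m.group(3), value))
    return sorted(rows, key=lambda row: row[3], reverse=True)


for row in regions_by_memstore(REGIONS_BEAN):
    print(row)
```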
  18. Memstore Utilization & Saturation
     • Compare the total memstore size across RegionServers
       Cloudera Manager chart: select total_memstore_size_across_hregions where roleType = REGIONSERVER
     • Compare memstore size across the regions in a RegionServer
       Cloudera Manager chart: select memstore_size where category = HREGION
  19. Memstore Utilization & Saturation
     • Log snippet where a flush finishes:
         2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, sequenceid=4255, compaction requested=true
       • dataSize: the cell data alone (key bytes and value bytes) that is going to be flushed. This can be allocated off-heap too.
       • heapSize: the cell data on-heap along with its metadata and index (overhead of Java objects)
       • currentSize: the cell data alone on-heap after the flush
       • 3db6134cedc326474801068c3cb4f2a9: encoded region name
       • in 1625ms: how long the flush took to complete
     • Frequency of flush (per hour):
         # grep "Finished flush of" <rs_log> | grep -o "^2019-..-.. .." | uniq -c
              81 2019-05-13 17
               6 2019-05-13 18
             113 2019-05-15 02
              18 2019-05-15 04
              27 2019-05-15 12
             133 2019-05-15 19
               5 2019-05-15 20
             198 2019-05-15 22
              91 2019-05-15 23
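The grep pipeline above can also be expressed in Python, which makes it easy to pull out the flush duration at the same time. This is a sketch assuming the log line format shown on the slide; `flushes_per_hour` and `slowest_flush` are hypothetical helpers, and both sample lines are copied from the slides.

```python
import re
from collections import Counter

# Captures: hour bucket ("YYYY-MM-DD HH"), encoded region name, duration in ms.
FLUSH_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}):.*Finished flush of .* "
    r"for ([0-9a-f]{32}) in (\d+)ms"
)

SAMPLE_LINES = [
    "2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: "
    "Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, "
    "currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, "
    "sequenceid=4255, compaction requested=true",
    # A non-flush line, to show that it is ignored.
    "2019-05-13 17:12:08,001 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: "
    "Blocking updates: global memstore heapsize 403.0 M is >= blocking 403.0 M",
]


def flushes_per_hour(lines):
    """Count finished flushes per hour bucket, like the grep | uniq -c pipeline."""
    counts = Counter()
    for line in lines:
        m = FLUSH_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts


def slowest_flush(lines):
    """Return (duration_ms, encoded_region) of the slowest flush, or None."""
    found = [
        (int(m.group(3)), m.group(2))
        for m in (FLUSH_RE.match(line) for line in lines)
        if m
    ]
    return max(found) if found else None


print(flushes_per_hour(SAMPLE_LINES))
print(slowest_flush(SAMPLE_LINES))
```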
  20. Memstore Utilization & Saturation
     • Indication of blocked updates due to high memstore utilization
       • Global memstore > hbase.regionserver.global.memstore.size
       • A memstore > hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size
     • RegionServer log:
       • Why were updates blocked?
           2019-05-13 17:12:08,001 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Blocking updates: global memstore heapsize 403.0 M is >= blocking 403.0 M
       • How long were they blocked?
           2019-05-13 17:12:10,809 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Memstore is above high water mark and block 2808ms
       • Blocking updates finished:
           2019-05-13 17:12:10,809 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Unblocking updates for server host-10-17-101-197.coe.cloudera.com,22101,1557773899580
     • Client-side exception when a single memstore is over its limit:
         19/05/20 07:39:22 INFO client.RpcRetryingCallerImpl: Call exception, tries=7, retries=11, started=8164 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.RegionTooBusyException: Over memstore limit=128.0M, regionName=d5860b5e1a35025b6aab68dff4d944aa, server=host-10-17-101-198.coe.cloudera.com,22101,1558363100074
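The two blocking conditions are simple arithmetic over configuration values, and the blocked time can be summed from the WARN lines. The sketch below takes the configuration values as plain arguments rather than reading them from HBase, and it assumes the global limit is a fraction of the RegionServer heap (the on-heap case). All function names are hypothetical.

```python
import re

BLOCKED_RE = re.compile(r"Memstore is above high water mark and block (\d+)ms")


def region_blocking_limit(flush_size_bytes, block_multiplier):
    """Per-region limit: hbase.hregion.memstore.block.multiplier * flush.size."""
    return flush_size_bytes * block_multiplier


def global_blocking_limit(heap_bytes, global_memstore_fraction):
    """Global limit, assuming hbase.regionserver.global.memstore.size is a heap fraction."""
    return int(heap_bytes * global_memstore_fraction)


def blocked_millis(lines):
    """Total time updates were blocked, summed over matching WARN lines."""
    total = 0
    for line in lines:
        m = BLOCKED_RE.search(line)
        if m:
            total += int(m.group(1))
    return total


# Log lines copied from the slide.
LOG_LINES = [
    "2019-05-13 17:12:08,001 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: "
    "Blocking updates: global memstore heapsize 403.0 M is >= blocking 403.0 M",
    "2019-05-13 17:12:10,809 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: "
    "Memstore is above high water mark and block 2808ms",
]

print(blocked_millis(LOG_LINES))
```

Summing the blocked time per hour, next to the flush counts from the previous slide, shows whether write stalls line up with bursts of flushes.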
  21. (Diagram) RPC System (Handlers / Queues), HDFS Client, MemStore, BlockCache
  22. Blockcache Utilization & Saturation
     • Current block cache usage (RegionServer webui and raw metrics):
         "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
         "blockCacheSize" : 406847872,
         "blockCacheFreeSize" : 6291459,
     • Cache eviction:
         "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
         "blockCacheEvictionCount" : 38257,
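A utilization figure can be derived from the two size metrics, and an eviction rate from two samples of the eviction counter. This is a sketch assuming blockCacheSize is the used portion and blockCacheFreeSize the remainder, so that their sum is the cache capacity; the function names are hypothetical and the input values come from the slide.

```python
# Values copied from the slide.
SERVER_BEAN = {
    "blockCacheSize": 406847872,
    "blockCacheFreeSize": 6291459,
    "blockCacheEvictionCount": 38257,
}


def block_cache_utilization(bean):
    """Used fraction of the block cache, assuming capacity = used + free."""
    used = bean["blockCacheSize"]
    free = bean["blockCacheFreeSize"]
    return used / (used + free)


def eviction_rate(count_before, count_after, interval_seconds):
    """Evictions per second between two samples of blockCacheEvictionCount."""
    return (count_after - count_before) / interval_seconds


print(f"{block_cache_utilization(SERVER_BEAN):.1%}")
```

A cache that is nearly full is expected on a busy RegionServer; it is the combination of high utilization and a rising eviction rate that suggests the working set no longer fits.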
  23. Blockcache Utilization & Saturation
     • Compare the free size across RegionServers
       Cloudera Manager chart: select block_cache_free_size where roleType = REGIONSERVER
     • Compare the evicted blocks ratio across RegionServers
       Cloudera Manager chart: select block_cache_evicted_rate where roleType = REGIONSERVER
  24. (Diagram) RPC System (Handlers / Queues), HDFS Client, MemStore, BlockCache
  25. HDFS Client Utilization & Saturation
     • Flush queue size (RegionServer webui and raw metrics):
         "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
         "flushQueueLength" : 0,
     • Cloudera Manager chart: select flush_queue_size where roleType = REGIONSERVER
  26. htop - Real-Time Monitoring Tool for HBase
  27. htop overview
     • HBASE-11062 htop
     • Work in progress!
     • Unix top-like tool
     • Real-time monitoring for HBase metrics
  28. htop motivation
     • HBase UIs
       • Show the metrics of the moment
       • Can't show the metrics as a time series
     • Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics (via Grafana)
       • Show the metrics as a time series
       • Collecting the latest metrics takes a little time
     • htop
       • Real-time monitoring
       • A lot of features for real-time monitoring
  29. htop motivation
                              | HBase UI | Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics | htop
     Metrics of the Moment    | ○        | △                                                | ○
     Metrics in Time Series   | ☓        | ○                                                | ☓
     Real-Time Monitoring     | △        | △                                                | ○
  30. htop features: htop screen
     • Command to start htop:
       • $ hbase top
     • Similar to the Unix top command
     • The metrics are refreshed periodically, every 3 seconds by default
     • Vertical and horizontal scrolling
  31. htop features: htop screen
     • Demo (https://asciinema.org/a/247434)
  32. htop features: Change refresh delay
     • Press the d key and enter a new refresh delay
     • We can also change the default refresh delay with a command line argument:
       • ex) $ hbase top -delay 2 # the default refresh delay is 2 seconds
  33. htop features: Change refresh delay
     • Demo (https://asciinema.org/a/247447)
  34. htop features: Metrics per Namespace/Table/RegionServer/Region
     • Press the m key and choose a mode
       • Namespace mode: metrics per Namespace
       • Table mode: metrics per Table
       • RegionServer mode: metrics per RegionServer
       • Region mode (default): metrics per Region
     • We can also change the default mode with a command line argument:
       • ex) $ hbase top -mode n # the default mode is Namespace mode
  35. htop features: Metrics per Namespace/Table/RegionServer/Region
     • Demo (https://asciinema.org/a/247177)
  36. htop features: Choose displayed fields and change the order of fields
     • Press the f key and choose the displayed fields (by pressing the space key)
     • We can also change the order of the fields in the same screen
       • The Right key selects a field to move, then the Left key or Enter key commits
  37. htop features: Choose displayed fields and change the order of fields
     • Demo (https://asciinema.org/a/247306)
  38. htop features: Sort the metrics by the field values
     • Press the f key and choose a sort field (by pressing the s key)
     • Switch between descending and ascending order by pressing the R key
     • Demo (https://asciinema.org/a/247180)
  39. htop features: Filter with the field values
     • ex) NAMESPACE==default, REQ/S>1000
     • Operators: = (only needs a partial match), == (needs an exact match), >, >=, <, <=, !
     • o key: add a case-insensitive filter
     • O key: add a case-sensitive filter
     • ctrl + o key: show current filters
     • = key: clear current filters
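To make the operator semantics concrete, here is a small sketch of how such filter expressions could be evaluated against one record. It is not htop's implementation: the parsing regex, the function name `matches`, and the reading of `!` as a leading negation (as in Unix top's filters) are all assumptions made for illustration.

```python
import re

# [!]FIELD OP VALUE. Longer operators are listed first so "==" wins over "=".
FILTER_RE = re.compile(r"^(!?)\s*([^=<>!]+?)\s*(==|>=|<=|=|>|<)\s*(.+)$")


def matches(record, expr, ignore_case=True):
    """Evaluate one filter expression against a dict of field values."""
    m = FILTER_RE.match(expr.strip())
    if not m:
        raise ValueError(f"cannot parse filter: {expr!r}")
    negate = m.group(1) == "!"
    field, op, value = m.group(2), m.group(3), m.group(4).strip()
    actual = record[field]
    if op in ("=", "=="):
        a, v = str(actual), value
        if ignore_case:
            a, v = a.lower(), v.lower()
        # "=" needs only a partial match, "==" needs an exact match.
        result = (v in a) if op == "=" else (a == v)
    else:
        a, v = float(actual), float(value)
        result = {">": a > v, ">=": a >= v, "<": a < v, "<=": a <= v}[op]
    # Assumed semantics: a leading "!" inverts the result.
    return result != negate


record = {"NAMESPACE": "default", "REQ/S": 1500}
print(all(matches(record, f) for f in ("NAMESPACE==default", "REQ/S>1000")))
```

The `ignore_case` flag stands in for the difference between the o and O keys; combining several filters, as in the slide's example, is modelled here as requiring all of them to match.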
  40. htop features: Filter with the field values
     • Demo (https://asciinema.org/a/247181)
  41. htop features: Drill down
     • Namespace -> Tables
     • Table -> Regions
     • RegionServer -> Regions
     • Select the record (Namespace, Table or RegionServer) you want to drill down into and press the i key
  42. htop features: Drill down
     • Demo (https://asciinema.org/a/247182)
  43. htop internals
     • htop gets its metrics from the ClusterMetrics object returned by Admin.getClusterMetrics()
       • It needs to access only the HBase Master
       • To add more metrics, we first need to add them to ClusterMetrics
     • The JMX endpoints would provide more metrics, but reading them requires access to every RegionServer, which might cause scalability issues
  44. Current status of htop
     • Not committed yet; a work in progress
     • Building htop for HBase 2.x
       • The basic features have been implemented
     • The remaining tasks for htop
       • Some code refactoring
       • Adding some tests
       • Documentation
  45. htop in the future
     • Support branch-1
     • Add more metrics so that we can see more information from htop
       • Response time metrics ASAP
       • Metrics per Column Family/User/Operation (GET, PUT, SCAN, etc.)
       • System information like CPU usage and memory usage might be useful
     • Useful features from the Unix top command
       • Color mapping
       • Batch mode, etc.
  46. THANK YOU
  47. Q & A
