First of all, let us introduce ourselves.
My name is Daisuke Kobayashi. My teammates just call me Dice, or DiceK, as a nickname. I have been working at Cloudera, based in Japan, since 2012. I currently work in backline support, helping customers and also internal support folks resolve complicated issues. I’m also an HBase contributor.
Hello, my name is Toshihiro Suzuki.
I have been an HBase committer since last year.
And I’m a Sr. Software Engineer, Breakfix in the Support team at Cloudera. I mainly handle HBase/Phoenix and HDFS cases.
I have written and published a book on HBase for beginners, in Japanese.
So what does supporting HBase mean at Cloudera? At Cloudera, we have a big HBase user base, and cluster sizes vary widely, from 10 nodes to 100 or even 1000 nodes. Customers report various types of issues to our support team every single day, and our job is simple: just fix the issue and answer their questions. If I could summarize the problems reported by customers, these are the typical scenarios we usually see: fixing performance degradation, identifying why a process crashed, and fixing inconsistencies, which is a well-known issue in both HBase 1 and 2. But in this talk, we will focus specifically on the first one.
From my side, I’m gonna introduce the general approach to performance issues and show the existing tools we usually use in the context of HBase troubleshooting. Later on, my colleague Toshi will be talking about a new tool he’s now developing. It’s more intuitive and efficient for troubleshooting in real time.
So, fixing performance issues is tough. This is because the number of nodes differs across customers, and they run different versions with different configurations, different types of datasets, and different use cases. They are all different.
Various types of factors can lead to performance issues: misconfigurations in HBase, or unbalanced load on regionservers, also known as hot spotting, caused by bad schema design. Also, all regionservers should be collocated with datanodes, and if a particular region’s block doesn’t exist on the local datanode, it has to read the data remotely from other datanodes. Apart from that, there might be bad OS configuration, GC issues, hardware failures, or network-related issues.
Another thing that makes these issues difficult to troubleshoot is that a lot of information about how the HBase cluster performs is exposed through logs and metrics. Whenever we analyze problems, we have to pick the right log snippets and metrics and correlate them to the root cause. To take advantage of the logs and metrics, it is obvious that we need to understand what they actually mean, why they are logged, and when a particular metric is incremented. It's also important to understand what they are not.
For core HBase developers, these questions may be easy to answer, but HBase is widespread and used by many users across various industries. Over the last couple of years, I have been asked about the meaning of given metrics and log snippets over and over. So the aim of my talk is to share this basic information with others, to help them narrow down problems and dig in further.
So, to start performance troubleshooting, I think this is the typical and important approach. First off, we need to listen to customers in order to understand what they are complaining about, what they are hitting, and what they want to resolve. This is the very first and important step to get on the same page with them. To narrow down performance issues, in general we should look at the system with a top-down approach. Specifically in HBase, we first look at the cluster itself and see how resource usage is distributed across nodes. If something looks wrong on a particular node, we dig into that node. Throughout the troubleshooting process, I like using the USE method, originally defined by Brendan Gregg, who is at Netflix and formerly at Sun.
The USE method is designed like an emergency checklist in a flight manual, so it’s intended to be simple, straightforward, complete, and fast. USE stands for Utilization, Saturation, and Errors. Utilization answers the question: how busy is the particular resource? Saturation can be measured as the length of a wait queue, or the time spent waiting in the queue. Errors are explicit indications of something going wrong. Obviously, the USE method is not perfect, but it can be used as a first checklist to identify the bottleneck as quickly as possible.
So, the next question is: what are the resources in HBase? As you know, the RegionServer is the worker role, responsible for processing read and write requests.
These are the typical resources in a single regionserver.
All user requests come into the RPC system first, where they are queued and processed by handlers concurrently. For caching, a request goes to the memstore for a write or the block cache for a read. The data is persisted to HDFS under certain conditions. As you know, the requests always flow in the direction of the orange arrow, which means we should follow the same path when checking resources.
So what types of information are exposed by each resource? For example, the RPC system exposes the number of requests, and how many requests are getting queued and processed. The memstore exposes the memstore size, the size of flushed memstores, and the frequency of flushes. Using these observability items, we can check how each resource is utilized and saturated. From the next slides, let’s walk through each resource one by one.
First, the RPC system
From this slide on, I’m gonna show you the metrics, webui, and logs that are used for troubleshooting. Please note that all of these are aligned to the HBase 2.1 code base, more specifically CDH 6.2. As I mentioned, the RPC system is the place where all client requests arrive. So, we should be able to check how many requests are received by every single regionserver. Here in the gray area, I’m showing the raw metric that is exposed via the JMX endpoint on a particular regionserver. The total request count is also exposed through the Master and regionserver webuis. We can simply compare the requests across regionservers. If there’s an outstanding value, it’s a chance to narrow down to that particular regionserver.
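To make that check concrete, here is a minimal sketch of pulling totalRequestCount out of a regionserver's /jmx response. The endpoint path in the comment matches HBase 2.x defaults, but the sample payload below is trimmed and the numbers are made up; on a live cluster you would fetch the JSON over HTTP instead.

```python
import json
# from urllib.request import urlopen  # for a live cluster

# A trimmed, made-up sample of what the RegionServer JMX endpoint returns.
# On a real cluster you would fetch something like:
#   http://<regionserver-host>:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server
SAMPLE = '''
{"beans": [{"name": "Hadoop:service=HBase,name=RegionServer,sub=Server",
            "totalRequestCount": 1284093,
            "readRequestCount": 903211,
            "writeRequestCount": 380882}]}
'''

def extract_metric(jmx_json, metric):
    """Pick one metric value out of a /jmx response body."""
    for bean in json.loads(jmx_json)["beans"]:
        if metric in bean:
            return bean[metric]
    return None  # metric not found in any bean

print(extract_metric(SAMPLE, "totalRequestCount"))  # prints 1284093
```

Running this per regionserver and comparing the values is the scripted equivalent of eyeballing the webui table.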
If you have been managing HBase and are familiar with these webuis, you may be aware that the columns in the table are now sortable. This is a simple but powerful change. We often have a screen sharing session with a customer to see the issue in a real-time fashion. Every time we looked at these webuis, it was difficult to figure out the highest or lowest servers without doing something tricky. This sorting functionality should make our life easier.
This number is incremented by various types of request calls at the RPC server level, as described in the slide.
Next, to understand saturation, the number of requests queued at a particular point in time is exposed. That is what I’m showing in the gray area as raw metrics, with the corresponding values in the webui below. As the meta table is usually accessed more frequently than others, it is isolated from the queue for normal regions. If the queue size is constantly growing, it may indicate something is going wrong in processing the requests.
We can also check how many requests have been processed and queued so far by the RPC system. I’m showing the raw metric value in the gray area. Since it’s just a cumulative value, Cloudera Manager converts it into a rate, which makes it easy to understand how things are going over time. Ideally, processed and queued should be the same. The processed is the blue graph and the queued is the green one in this example. We can see both match exactly while things are going well. If queued becomes bigger than processed, it’s a sign the RPC handlers are getting slow for some reason. We should check the thread dump to dig in further.
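Since the raw metric is a cumulative counter, the rate conversion is just a delta over elapsed time. This little sketch is not Cloudera Manager's actual code, just the idea, with made-up sample values:

```python
def to_rate(samples):
    """Convert cumulative counter samples [(ts_seconds, value), ...]
    into (ts, per-second rate) pairs, the way a monitoring system would."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((t1, (v1 - v0) / (t1 - t0)))
    return rates

# e.g. totalRequestCount sampled every 60 seconds (illustrative numbers)
samples = [(0, 1000), (60, 7000), (120, 7600)]
print(to_rate(samples))  # prints [(60, 100.0), (120, 10.0)]
```

Plotting the processed rate against the queued rate this way is exactly the blue-vs-green comparison in the chart.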
If the RPC system takes longer than 10 seconds to respond to a given request, it reports the table and the region name in the process logs. However, when a scan next call was slow, neither the target region name nor the row key was logged, so we were really frustrated while troubleshooting. Fortunately, recent versions have improved this by logging the scan details, as I'm showing with the green marker in the second example. With this hint, we should be able to narrow down to the particular region and see why it’s slow.
Alright, next let’s take a look at memstore.
Memstore utilization is exposed at several levels: server, table, and region. Here I'm showing the server-level and region-level raw metrics along with the corresponding webui. I think it’s fairly easy to understand the memstore utilization.
When using Cloudera Manager, we typically use this sort of query to compare the total memstore utilization across regionservers, as the graph above indicates. We can also check whether there is any outstanding region which utilizes more memstore than the other regions in a single regionserver, which is in the graph below.
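For illustration, a Cloudera Manager tsquery along these lines would chart memstore usage per regionserver; note the metric name here is an assumption and may differ across CM versions:

```
-- sketch only: the metric name is an assumption and varies by CM version
SELECT memstore_size WHERE roleType = REGIONSERVER
```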
A flush persists the data in the memstore into the underlying HDFS, which means the memstore is fully utilized, or most likely saturated. This is an example log snippet where a flush finishes. In HBase 2, data can be allocated off-heap for both read and write. Given this, the log reports the pure key-value data size and the on-heap occupation separately. It also shows how long the flush took. These numbers should be informative for seeing how a particular flush goes. If it takes longer, it may be time to look at HDFS performance too.
Using this granular flush logging, we can see the frequency of flush activity on a regionserver. In this example, I'm grouping the output on an hourly basis.
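As an illustration of that hourly grouping, here is a small sketch that counts flush-completion lines per hour. The log lines are abbreviated stand-ins, and matching on the "Finished flush" substring is an assumption about the exact message wording in your HBase version:

```python
from collections import Counter

# Abbreviated stand-ins for regionserver log lines; the timestamp layout
# follows the usual log4j pattern, the message text is trimmed.
LOG = """\
2019-05-20 10:12:01,123 INFO  [MemStoreFlusher.0] regionserver.HRegion: Finished flush of dataSize ...
2019-05-20 10:48:33,456 INFO  [MemStoreFlusher.1] regionserver.HRegion: Finished flush of dataSize ...
2019-05-20 11:02:10,789 INFO  [MemStoreFlusher.0] regionserver.HRegion: Finished flush of dataSize ...
"""

def flushes_per_hour(log_text):
    """Count flush-completion lines, grouped by the hour they occurred in."""
    counts = Counter()
    for line in log_text.splitlines():
        if "Finished flush" in line:
            hour = line[:13]  # "YYYY-MM-DD HH" prefix of the timestamp
            counts[hour] += 1
    return counts

print(flushes_per_hour(LOG))
```

A spike of flushes in one hour is a quick hint that the memstore was under write pressure at that time.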
If the total memstore size across regions in a single regionserver goes beyond the global memstore size limit, all updates are blocked by the regionserver until the utilization drops below the threshold. This is a typical log message in HBase 2.1.
There are three correlated lines. The first line indicates that blocking updates started because the global memstore size became greater than the blocking threshold. The second line shows how long it took, and the third line indicates that blocking completed.
In the second example, the client gets a RegionTooBusyException for a particular region. This is because the region's memstore is too big and has not been flushed yet. This is also a typical indication of saturation for a specific memstore.
In the context of the block cache, utilization is simply the cache usage, which is available via raw metrics and also via the webui. If cached blocks are being evicted, in general, it means the cache is saturated. I’m showing the raw metrics on the left-hand side and the corresponding webui information on the right-hand side. From the top, it indicates how much of the block cache is used, the remaining memory for the cache, and the number of evicted blocks.
Using Cloudera Manager, we can check the eviction rate, which is converted from the raw metric value. I’m showing an example in the graph below. If the utilization is high but the eviction rate is also high, it’s a sign that the block cache is too small to handle the current workload appropriately. So it's time to think about increasing the cache size.
Alright, I’m gonna quickly cover the last resource in the picture. HDFS resource utilization and saturation are basically tracked through HDFS-level metrics and logs, so I can't talk about them much in this session, but I am gonna show one related metric exposed at the HBase level.
That’s the flush queue size. When flushing a memstore, it’s queued first and persisted to HDFS later. The queue is maintained at the regionserver level and exposed as a metric through the webui. It’s visible through a Cloudera Manager chart as well. Typically, the queue shouldn’t grow, so if it is constantly growing, it indicates that flushes are failing or getting slow for some reason. Then it's time to look at the HDFS side.
That’s pretty much all I have prepared for this presentation. I have been talking about how to look at the resources in HBase and their utilization and saturation, mainly from metrics and sometimes from logs. I’m pretty sure I couldn’t cover everything, and we have to look further with a different approach if we can’t find anything bad with this one, but I hope you could find an idea in my talk. Next, Toshi is gonna give a presentation about a new tool which should make our life better.
From my side, I’m going to talk about htop, a real-time monitoring tool for HBase.
So, overview of htop.
htop is a tool I’m developing now, which is tracked in the JIRA ticket HBASE-11062.
It is a Unix top-like tool, and we can do real-time monitoring of the HBase metrics with it.
And now, the motivation for htop.
As Dice mentioned, the first approach when facing performance issues is to check the current status of the cluster.
At this point, we can look at the HBase UIs to check the metrics. They show the metrics at that moment, but we can't see them in time series.
If you want to see the metrics in time series, we have Ganglia, OpenTSDB, Cloudera Manager, and Ambari Metrics. With Ambari Metrics, we can see the metrics via Grafana. They are useful when we want to see the metrics in time series, but if you're going to do real-time monitoring, they are not very useful, because collecting the latest metrics takes a little bit of time in those tools.
For real-time monitoring, I have started to develop htop.
I’ll explain the features of htop later in this talk.
To clarify the position of htop, I made this matrix of the features of those tools.
If you just want to see the metrics of the moment, you can use any tool of them.
However, in Ganglia, OpenTSDB, Cloudera Manager, and Ambari Metrics, collecting the latest metrics takes a little bit of time.
If you want to see the metrics in time series, you need to use Ganglia, OpenTSDB, Cloudera Manager or Ambari Metrics.
And if you want to do real-time monitoring, htop is the most useful of them, as it has a lot of features for that.
From here, I will talk about the features of htop with demonstrations.
Firstly, about htop screen.
We can start htop by running the hbase top command.
The UI is similar to Unix top command.
The metrics are refreshed at a certain interval, 3 seconds by default.
And you can do vertical and horizontal scrolling.
I’ll show you demo of htop screen.
Actually, this is not a live demo, but a terminal recording.
And you can watch this demo anytime at this URL.
To start htop, run the hbase top command.
This is the screen of htop.
The metrics in this screen are refreshed every 3 seconds.
It consists of two parts: the Summary part and the Metrics part.
In Summary part, you can see the HBase version, cluster ID, the number of region servers, the region count, Average Cluster Load and aggregated Request count per second.
In the Metrics part, you can see the metrics. In this case, you can see the metrics per region, and it shows the namespace name, table name, encoded region name, RegionServer name, request count per second, read request count per second, and so on.
You can scroll down to see all the metrics, like this. You can also do horizontal scrolling, like this.
As mentioned, the refresh delay is 3 seconds by default.
But you can change it by pressing the ‘d’ key and entering a new refresh delay.
And we can also change the default refresh delay by specifying the command line argument “-delay”.
I’ll show you the demo of it.
If you press ‘d’ key in htop screen, you can put a new refresh delay.
In this demo, I try changing it to 1 second.
Yeah, it has been changed.
Currently, htop can show the metrics per Namespace, Table, RegionServer and Region.
And they are called Namespace mode, Table mode, RegionServer mode, and Region mode, respectively.
The default is region mode.
We can change this mode by pressing ‘m’ key in htop screen.
And we can also change the default mode by specifying a command line argument “-mode”
So, I’ll show you demo of it.
Now, you see the metrics per region, and we can change it to Namespace or Table or RegionServer by pressing ‘m’ key.
For example, we can see the metrics per Namespace like this or you can also see the metrics per Table like this.
In addition to that, we can choose which fields are displayed in the screen.
By pressing the ‘f’ key, you can choose the displayed fields.
We can also change the order of fields in the same screen.
I’ll show you the demo of it.
By pressing the ‘f’ key, you move to this screen, where you can choose the displayed fields.
For now, in region mode, these fields here can be displayed.
For example, if you don’t need the Namespace and Table fields, and you need the Region name field, you can remove and add these fields like this.
And as you can see, the fields are removed and added.
Also, we can change the order of fields in the same screen.
Go back to the screen by pressing the ‘f’ key,
select the field you want to move, and press the Right key.
Then move the field to wherever you want it and press the Left key.
So you can see the order of the fields is changed.
It’s also possible to sort the metrics by the field values.
And we can switch between descending and ascending order by pressing the ‘R’ key.
I’ll show you demo of it.
Press the ‘f’ key to move to the previous screen. You can also choose a sort field on the same screen.
If you want to sort the metrics by “Request count per second,”
choose the field and press ‘s’ key.
So the current sort field is changed to “Request count per second”
And then you can see the metrics sorted by the field.
So next is Filter feature that’s very important.
For example, if you want to see the metrics of “default” Namespace only, you can specify this filter NAMESPACE==default.
Or if you want to see the metrics that have more than 1000 requests per second, you can specify a filter like this: REQ/S>1000.
In the Filter feature, we can use the general comparison operators like these:
When we press the ‘o’ key in the htop screen, we can add a case-insensitive filter.
When we press the ‘O’ key, we can add a case-sensitive filter.
Also, when we press ctrl + ‘o’, we can see the current filters.
And when we press the ‘=’ key, we can clear the current filters.
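To show how such filter strings behave, here is a sketch of evaluating a FIELD, operator, VALUE expression against some records. This is not htop's actual implementation, just an illustration; the field names mirror the ones in the demo:

```python
import operator

# Comparison operators, multi-character symbols first so ">=" is
# tried before ">" when scanning the expression.
OPS = {"==": operator.eq, "!=": operator.ne,
       ">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt}

def parse_filter(expr):
    """Turn 'FIELD<op>VALUE' into (field_name, predicate)."""
    for sym, fn in OPS.items():
        if sym in expr:
            field, raw = expr.split(sym, 1)
            try:
                value = float(raw)      # numeric comparison if possible
            except ValueError:
                value = raw             # otherwise compare as a string
            return field, (lambda v, fn=fn, value=value: fn(v, value))
    raise ValueError("no operator in filter: " + expr)

records = [
    {"NAMESPACE": "default", "REQ/S": 1500.0},
    {"NAMESPACE": "hbase", "REQ/S": 30.0},
]
field, pred = parse_filter("REQ/S>1000")
print([r for r in records if pred(r[field])])  # keeps only the hot record
```

Stacking several such predicates is effectively what happens when you add filters one after another with the ‘o’ key.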
Let me show you demo of it.
If you want to see the metrics in the “default” namespace only, press the ’o’ key and specify a filter like this.
As you can see, only the metrics in “default” Namespace are shown now.
And if you want to see the metrics of the ”test” table only, press the ’o’ key again and add a filter like this.
So now only the metrics in “default” Namespace and “test” table are shown.
Furthermore, if you want the metrics that have more than 1000 requests per second, you can add a filter like this.
So, we can see only the metrics with more than 1000 requests.
We can see the specified filters by pressing ctrl + ‘o‘ key like this.
These are the current filters.
We can clear the current filters by pressing ‘=’ key like this.
The filters are cleared.
The last feature I’d like to introduce here is the drill-down feature.
We can drill down from Namespace to Tables, from Table to Regions, or from RegionServer to Regions.
With this feature, we can find the “Hot Spot” region easily.
We can drill down by selecting the record you want to drill down into and pressing the ‘i’ key.
I’ll show you demo of it.
If you want to drill down the “default” namespace to the tables,
you can move to the namespace mode
and select the “default” namespace and then press ‘i’ key.
So you can see the metrics for the tables in the “default” namespace.
Furthermore, if you want to drill down from the “test” table to the regions,
select “test” table and press ‘i’ key,
so you can see the metrics for the regions of the “test” table.
Similarly, you can drill down from a RegionServer to regions.
Move to the RegionServer mode and select one of the RegionServers and press ‘i’ key.
So you can see the metrics for the regions on the selected RegionServer.
That’s it for the demonstrations of the features of htop.
Next, let me talk about the internals of htop.
Currently, htop gets the metrics from the ClusterMetrics class via the Admin.getClusterMetrics method, because it only needs to access the HBase Master to do that.
So if we want to add more metrics to htop, we first need to add them to the ClusterMetrics class.
Actually, the JMX endpoints would give us more metrics, but using them requires accessing all the RegionServers, which might cause scalability issues.
So I decided not to use JMX endpoints for htop.
In this slide, I’ll talk about the current status of htop.
As mentioned, htop hasn’t been committed yet; it’s actually a work in progress.
However, the basic features have been implemented as I showed you in the demonstrations.
The remaining tasks are some code refactoring and adding some tests. I also need to write documentation for it.
Maybe it will be ready for review next month, and once the review passes, it will be committed.
And, htop in the future.
Currently, I’m developing this tool against the master branch and branch-2, so as a next step, we need to support branch-1.
And we should add more metrics so that we can see more information from htop.
Especially, adding response time metrics is required because they are very important for performance troubleshooting.
And we can add metrics per column family, user, and operation, like GET, PUT, and SCAN.
And I’m thinking about adding system information like CPU usage and memory usage, which might be useful.
In addition to that, we can add useful features from the Unix top command, like color mapping or batch mode.
That’s all from my side. We hope this presentation was informative for you. Thank you very much.