                         HBase‐0.20.0 Performance Evaluation 
                               Anty Rao and Schubert Zhang, August 21, 2009 
                    {ant.rao, schubert.zhang}@gmail.com, http://cloudepr.blogspot.com/
 
Background 
 
We have been using HBase for around a year in our development and projects, from 0.17.x to
0.19.x. We, like everyone in the community, are aware of the critical performance and reliability
issues of these releases.
 
Now, the great news is that HBase-0.20.0 will be released soon. Jonathan Gray from Streamy,
Ryan Rawson from StumbleUpon, Michael Stack from Powerset/Microsoft, Jean-Daniel Cryans
from OpenPlaces, and other contributors have done a great job redesigning and rewriting much
of the code to improve HBase. The two presentations [1] [2] provide more details of this release.
 
The primary themes of HBase‐0.20.0: 
     − Performance 
                Real‐time and Unjavafy software implementations. 
                HFile, based on BigTable’s SSTable. New file format limits index size. 
                New API 
                New Scanners 
                New Block Cache 
                Compression (LZO, GZ) 
                Almost a RegionServer rewrite 
     − ZooKeeper integration, multiple masters (partly, 0.21 will rewrite Master with better ZK 
           integration) 
Then we will get a brand-new, high-performance (Random Access, Scan, Insert, …), and more
robust HBase. HBase-0.20.0 will be a great milestone, and we thank all the developers.
 
The following items are very important for us:
     − Insert performance: We have very big datasets, which are generated fast. 
     − Scan performance: For data analysis in MapReduce framework. 
     − Random Access performance: Provide real‐time services. 
     − Less memory and I/O overhead: We are neither Google nor Yahoo!; we cannot operate
           big clusters with hundreds or thousands of machines, but we really do have big data.
     − The HFile: Same as SSTable. It should be a common and effective storage element. 
 
Testbed Setup 
 
Cluster:   
     − 4 slaves + 1 master
     − Machine: 4 CPU cores (2.0GHz), 2x500GB 7200RPM SATA disks, 8GB RAM.
     − Linux: RedHat 5.1 (2.6.18-53.el5), ext3, no RAID
     − 1Gbps network, all nodes under the same switch.
     − Hadoop-0.20.0 (1GB heap), HBase-0.20.0 RC2 (4GB heap), ZooKeeper-3.2.0
 
At the time of writing, HBase-0.20.0 RC2 is available for evaluation at
http://people.apache.org/~stack/hbase-0.20.0-candidate-2/. Refer to [2] and [4] for installation.
 
Important Hadoop-0.20.0 configuration parameters:
core‐site.xml: 
    parameter                           value           notes 
    io.file.buffer.size                 65536           Irrelevant to this evaluation. We like to use this size to
                                                        improve file I/O. [5]
    io.seqfile.compress.blocksize       327680          Irrelevant to this evaluation. We like to use this size to
                                                        compress data blocks.
 
hdfs‐site.xml 
    parameter                           value           notes 
    dfs.namenode.handler.count          20              [5] 
    dfs.datanode.handler.count          20              [5] [6] 
    dfs.datanode.max.xcievers           3072            Under heavy read load, you may see many DFSClient
                                                        complaints that no live nodes hold a particular block. [6]
    dfs.replication                     2               Our cluster is very small. 2 replicas are enough. 
 
mapred‐site.xml 
    parameter                                     value                              notes 
    io.sort.factor                                25                                 [5] 
    io.sort.mb                                    250                                10 * io.sort.factor [5] 
    mapred.tasktracker.map.tasks.maximum  3                                          Since our cluster is very small, we
                                                                                     set 3 to avoid a client-side bottleneck.
    mapred.child.java.opts                        ‐Xmx512m                           JVM GC option 
                                                  ‐XX:+UseConcMarkSweepGC
    mapred.reduce.parallel.copies                 20                                  
    mapred.job.tracker.handler.count              10                                 [5] 
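
For reference, the job-level settings above (io.sort.*, the child JVM options, and the parallel-copy
count) can also be applied programmatically when a job is configured. The sketch below is our own
illustration, not part of the evaluation code; the daemon-side parameters (handler counts, dfs.*,
mapred.tasktracker.map.tasks.maximum) still have to be set in the XML files on each node.

    import org.apache.hadoop.mapred.JobConf;

    public class JobConfSketch {
        // Apply the job-level mapred-site.xml settings listed above.
        // Daemon-side parameters cannot be overridden from a client job.
        public static JobConf configure() {
            // JobConf loads core-site.xml and mapred-site.xml from the classpath.
            JobConf conf = new JobConf(JobConfSketch.class);
            conf.setInt("io.sort.factor", 25);
            conf.setInt("io.sort.mb", 250);
            conf.setInt("mapred.reduce.parallel.copies", 20);
            conf.set("mapred.child.java.opts", "-Xmx512m -XX:+UseConcMarkSweepGC");
            return conf;
        }
    }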
 
hadoop‐env.sh 
    parameter                 value                                                           notes 
    HADOOP_CLASSPATH          ${HADOOP_HOME}/../hbase-0.20.0/hbase-0.20.0.jar:                          Use HBase jar and
                              ${HADOOP_HOME}/../hbase-0.20.0/conf:                                      configurations. Use
                              ${HADOOP_HOME}/../hbase-0.20.0/lib/zookeeper-r785019-hbase-1329.jar       ZooKeeper jar.
    HADOOP_OPTS               "‐server ‐XX:+UseConcMarkSweepGC"                               JVM GC option 
 
Important HBase-0.20.0 configuration parameters:
hbase-site.xml
    parameter                                    value    notes
    hbase.cluster.distributed                    true     Fully‐distributed with unmanaged ZooKeeper Quorum 
    hbase.regionserver.handler.count             20        
 
hbase‐env.sh 
    parameter                    value                                           notes 
    HBASE_CLASSPATH              ${HBASE_HOME}/../hadoop‐0.20.0/conf             Use hadoop configurations. 
    HBASE_HEAPSIZE               4000                                            Give HBase enough heap size 
    HBASE_OPTS                   "‐server ‐XX:+UseConcMarkSweepGC"               JVM GC option 
    HBASE_MANAGES_ZK             false                                           Refers to hbase.cluster.distributed 
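
As a quick check of the client-side setup, the following sketch (our own illustration; the table and
column-family names are hypothetical, not taken from the report) shows how a client picks up
hbase-site.xml from the classpath configured above and creates an evaluation table through
HBaseAdmin.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTestTable {
        public static void main(String[] args) throws Exception {
            // HBaseConfiguration loads hbase-default.xml and hbase-site.xml from the
            // classpath, which is why the conf directories are on the classpath above.
            HBaseConfiguration conf = new HBaseConfiguration();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // Hypothetical layout: one column family holding the 1000-byte values.
            HTableDescriptor desc = new HTableDescriptor("TestTable");
            desc.addFamily(new HColumnDescriptor("info"));
            if (!admin.tableExists("TestTable")) {
                admin.createTable(desc);
            }
        }
    }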
 
Performance Evaluation Programs 
 
We modified the class org.apache.hadoop.hbase.PerformanceEvaluation, since the code has the
following problems (the patch is available at http://issues.apache.org/jira/browse/HBASE-1778):
     − It has not been updated for hadoop-0.20.0.
     − The approach to splitting maps is not strict; correct InputSplit and InputFormat classes
          need to be provided. The current code uses TextInputFormat and FileSplit, which is not
          reasonable.
 
The evaluation programs use MapReduce to do parallel operations against an HBase table.   
     − Total rows: 4,194,280. 
     − Row size: 1000 bytes for value, and 10 bytes for rowkey. Then we have ~4GB of data. 
     − Sequential  row  ranges:  40.  (It’s  also  used  to  define  the  total  number  of  MapTasks  in 
          each evaluation.)   
     − Rows of each sequential range: 104,857 
The principle is the same as that of the evaluation programs described in Section 7, Performance
Evaluation, of the Google BigTable paper [3], pages 8-10. Since we have only 4 nodes to work as
clients, we set mapred.tasktracker.map.tasks.maximum=3 to avoid a client-side bottleneck.
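
To make the workload concrete, the sketch below shows how one map task could write its
sequential range of rows: a 10-byte zero-padded row key and a 1000-byte value per row, batched
through the client write buffer. This is a simplified illustration of the principle under our own
assumptions (the table and column names are hypothetical), not the actual PerformanceEvaluation
code.

    import java.util.Random;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SequentialWriteSketch {
        private static final int ROWS_PER_RANGE = 104857;  // rows per sequential range
        private static final int VALUE_SIZE = 1000;        // bytes per value

        // Writes one sequential range; rangeIndex identifies the map task (0..39).
        public static void writeRange(int rangeIndex) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "TestTable");
            table.setAutoFlush(false);                      // batch puts in the client write buffer
            Random rand = new Random(rangeIndex);
            long startRow = (long) rangeIndex * ROWS_PER_RANGE;
            for (long row = startRow; row < startRow + ROWS_PER_RANGE; row++) {
                byte[] key = Bytes.toBytes(String.format("%010d", row));  // 10-byte row key
                byte[] value = new byte[VALUE_SIZE];
                rand.nextBytes(value);                      // 1000-byte value
                Put put = new Put(key);
                put.add(Bytes.toBytes("info"), Bytes.toBytes("data"), value);
                table.put(put);
            }
            table.flushCommits();                           // ship any remaining buffered puts
        }
    }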
 
Performance Evaluation 
 
Report‐1: Normal, autoFlush=false, writeToWAL=true 
    Experiment              Eventual Elapsed   row/s    rows/s      ms/row      Total throughput   Google paper (rows/s in
                            Time(s)                     per node    per node    (MB/s)             single node cluster)
    random reads                        948      4424       1106        0.90          4.47                 1212
    random writes(i)                    390     10755       2689        0.37         10.86                 8850
    random writes                       370     11366       2834        0.35         11.45                 8850
    sequential reads                    193     21732       5433        0.18         21.95                 4425
    sequential writes(i)                359     11683       2921        0.34         11.80                 8547
    sequential writes                   204     20560       5140        0.19         20.77                 8547
    scans                                68     61681      15420        0.06         62.30                15385
 

Report‐2: Normal, autoFlush=true, writeToWAL=true 
    Experiment              Eventual Elapsed   row/s    rows/s      ms/row      Total throughput   Google paper (rows/s in
                            Time(s)                     per node    per node    (MB/s)             single node cluster)
    sequential writes(i)                461      9098       2275        0.44          9.19                 8547
    sequential writes                   376     11155       2789        0.36         11.27                 8547
    random writes                       600      6990       1748        0.57          7.06                 8850
 
Random writes (i) and sequential writes (i) are operations against a new table, i.e. initial writes.
Random writes and sequential writes are operations against an existing table that is already
distributed across all region servers, i.e. distributed writes. Since only one region server is
accessed at the beginning of inserting, performance is poor until the regions have been split
across all region servers. The difference between initial and distributed writes is not obvious for
random writes, but it is distinct for sequential writes.
 
In Google's BigTable paper [3], the performance of random writes and sequential writes is very
close. But in our results, sequential writes are faster than random writes; it seems the client-side
write buffer (12MB, autoFlush=false) is taking effect, since it reduces the number of RPC packets.
This may also imply that further RPC optimizations can yield better performance. If we set
autoFlush to true, the two should become close as well, and lower than the random-write
numbers in Report-1. Report-2 shows that our inference is correct.
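
For illustration, here is a minimal sketch of how the client-side write buffer is configured with the
0.20 client API (our assumption of the setup; the table name is a placeholder):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class WriteBufferSetup {
        public static HTable openForBatchedWrites(String tableName) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), tableName);
            // autoFlush=false: puts accumulate in the client buffer and are shipped
            // to the region servers in fewer, larger RPCs.
            table.setAutoFlush(false);
            table.setWriteBufferSize(12 * 1024 * 1024);  // 12MB buffer, as used in this evaluation
            return table;
        }
        // With autoFlush=true (the Report-2 setting), every put() becomes its own RPC,
        // which is why sequential and random writes converge and both slow down.
    }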
 
Random reads are far slower than all other operations. Each random read involves the transfer of
a 64KB HFile block from HDFS to a region server, out of which only a single 1000-byte value is
used. So the real throughput is approximately 1106 * 64KB ≈ 70MB/s of data read from HDFS per
node, which is not low. [3] [8]
 
The average time to random-read a row is sub-millisecond here (0.90ms per node), which seems
about as good as we can get on our hardware. We are already showing 10X better performance
than a disk seek (~10ms). This should be attributed mainly to the new BlockCache and the new
HFile implementation. Any other improvements will have to come from HDFS optimizations and
RPC optimizations, and of course we can always get better performance by loading up with more
RAM for the file-system cache; with 16GB or more RAM we might get even better numbers. But
remember, we are serving out of memory, not doing disk seeks. Adding more memory (and region
server heap) should help the numbers across the board. The BigTable paper [3] shows 1212
random reads per second on a single node; that is sub-millisecond for random access, so it is
clearly not doing disk seeks for most gets either. (Thanks to Jonathan Gray and Michael Stack for
their good comments.) [9]
 
The performance of sequential reads and scans is very good here, even better than in the Google
paper [3]. We believe this will strongly support using HBase for data analysis (MapReduce). In
Google's BigTable paper [3], writes perform better than reads, including sequential reads. But in
our test results, sequential reads are better than writes. Maybe there is room to improve write
performance in the future.

 
And remember, the dataset in our tests is not big (only about 1GB per node on average), so the
hit ratio of the BlockCache is very high (>40%). If a region server serves a large dataset (e.g. 1TB),
the benefit of the BlockCache would be reduced.
 
Bloom filters can reduce unnecessary searches in HFiles, which can speed up reads, especially
for large datasets where there are multiple files per HBase region.
 
We can also consider RAID0 across multiple disks, and mounting the local file system with the
noatime and nodiratime options, for further performance improvements.
 
Performance Evaluation (no WAL)
 
In some use cases, such as bulk loading a large dataset into an HBase table, the overhead of the
write-ahead logs (commit logs) is considerable, since bulk inserting causes the logs to be rotated
often and produces a lot of disk I/O. Here we consider disabling the WAL in such use cases, but
the risk is data loss when a region server crashes. We cite a post by Jean-Daniel Cryans on the
HBase mailing list [7]:
             “As you may know, HDFS still does not support appends. That means that the write ahead logs or WAL 
       that  HBase  uses  are  only  helpful  if  synced  on  disk.  That  means  that  you  lose  some  data  during  a  region 
       server crash or a kill ‐9. In 0.19 the logs could be opened forever if they had under 100000 edits. Now in 0.20 
       we fixed that by capping the WAL to ~62MB and we also rotate the logs after 1 hour. This is all good because 
       it means far less data loss until we are able to append to files in HDFS. 
             Now to why this may slow down your import, the job I was talking about had huge rows so the logs 
       got rotated much more often whereas in 0.19 only the number of rows triggered a log rotation. Not writing 
       to the WAL has the advantage of using far less disk IO but, as you can guess, it means huge data loss in the 
       case  of  a  region  server  crash.  But,  in  many  cases,  a  RS  crash  still  means  that  you  must  restart  your  job 
       because log splitting can take more than 10 minutes so many tasks times out (I am currently working on that 
       for 0.21 to make it really faster btw).” 
 
So here we call put.setWriteToWAL(false) to disable the WAL, expecting better write
performance. The table below shows the evaluation results.
 
Report‐3: NonWAL (autoFlush=false, writeToWAL=false) 
    Experiment              Eventual Elapsed   row/s    rows/s      ms/row      Total throughput   Google paper (rows/s in
                            Time(s)                     per node    per node    (MB/s)             single node cluster)
    random reads                       1001      4190       1048        0.95          4.23                 1212
    random writes(i)                    260     16132       4033        0.25         16.29                 8850
    random writes                       194     21620       5405        0.19         21.84                 8850
    sequential reads                    187     22429       5607        0.18         22.65                 4425
    sequential writes(i)                241     17404       4351        0.23         17.58                 8547
    sequential writes                   122     34379       8595        0.12         34.72                 8547
    scans                                62     67650      16912        0.06         68.33                15385

 
We can see that the performance of sequential writes and random writes is far better (roughly
double) than in the normal case (with WAL). So we can consider using this method for bulk-loading
problems that need high insert performance. But we suggest calling admin.flush() immediately
after each bulk-loading job to flush the data in the memstores to HDFS, to persist the data and
avoid loss as much as possible. The above evaluation does not include the time of this flush.
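
For clarity, here is a minimal sketch of this pattern (our own illustration; the table, family, and
qualifier names are hypothetical), combining put.setWriteToWAL(false) during the load with an
explicit flush of the memstores afterwards:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NoWalBulkLoad {
        public static void load(String tableName, byte[][] keys, byte[][] values) throws Exception {
            HBaseConfiguration conf = new HBaseConfiguration();
            HTable table = new HTable(conf, tableName);
            table.setAutoFlush(false);
            for (int i = 0; i < keys.length; i++) {
                Put put = new Put(keys[i]);
                put.setWriteToWAL(false);  // skip the WAL: faster, but edits are lost if the region server crashes
                put.add(Bytes.toBytes("info"), Bytes.toBytes("data"), values[i]);
                table.put(put);
            }
            table.flushCommits();

            // After the bulk-load job, flush the memstores to HDFS so the un-logged
            // edits are persisted as HFiles.
            HBaseAdmin admin = new HBaseAdmin(conf);
            admin.flush(tableName);
        }
    }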
 
Conclusions 
 
Compared to the metrics in Google's BigTable paper [3], the write performance is still not very
good, but the results are much better than any previous HBase release, especially for random
reads. We even got better results than the paper [3] for sequential reads and scans. These results
give us more confidence.
 
HBase-0.20.0 should be good to use:
     − as a data store into which large datasets can be inserted/loaded quickly
     − to store large datasets to be analyzed by MapReduce jobs
     − to provide real-time query services
 
We have compared the performance of MapReduce jobs that sequentially retrieve or scan data
from HBase tables against jobs that read HDFS files directly. The latter is about three times faster,
and the gap is even bigger for sequential writes.
 
HBase-0.20.0 is not the final word on performance and features. We are looking forward to and
researching the following features, and need to study the code in more detail:
      − Other RPC‐related improvements 
      − Other Java‐related improvements 
      − New master implementation 
      − Bloom Filter 
      − Bulk‐load [10] 
 
In the world of big data, the best performance usually comes from two primary schemes:
(1) Reducing disk I/O and disk seeks.
     If we go back and review Section 6, Refinements, of the BigTable paper [3], we find that the
     goals of almost all those refinements are related to reducing disk I/O and disk seeks, such as
     locality groups, compression, caching (Scan Cache and Block Cache), Bloom filters, group
     commit, etc.
(2) Sequential data access.
     When implementing our applications and organizing/storing our big data, we should always
     try our best to write and read data sequentially. Sometimes we may not find a generalized
     access scheme, like that of the traditional database world, for accessing data in various
     views, but this may be a trait of the big data world.
 
 


References: 
     [1]   Ryan Rawson’s Presentation on NOSQL. 
           http://blog.oskarsson.nu/2009/06/nosql‐debrief.html 
     [2]   HBase goes Realtime, The HBase presentation at HadoopSummit2009 by Jonathan Gray 
           and Jean‐Daniel Cryans 
           http://wiki.apache.org/hadoop‐data/attachments/HBase(2f)HBasePresentations/attach
           ments/HBase_Goes_Realtime.pdf 
     [3]   Google paper, Bigtable: A Distributed Storage System for Structured Data 
           http://labs.google.com/papers/bigtable.html 
     [4]   HBase‐0.20.0‐RC2 Documentation, 
           http://people.apache.org/~stack/hbase‐0.20.0‐candidate‐2/docs/ 
     [5]   Cloudera, Hadoop Configuration Parameters. 
           http://www.cloudera.com/blog/category/mapreduce/page/2/ 
     [6]   HBase Troubleshooting, http://wiki.apache.org/hadoop/Hbase/Troubleshooting 
     [7]   Jean‐Daniel Cryans’s post on HBase mailing list, on Aug 12, 2009: Tip when migrating 
           your data loading MR jobs from 0.19 to 0.20. 
     [8]   ACM Queue, Adam Jacobs, 1010data Inc., The Pathologies of Big Data, 
           http://queue.acm.org/detail.cfm?id=1563874 
     [9]   Another evaluation report of ours:
           http://docloud.blogspot.com/2009/08/hbase‐0200‐performance‐evaluation.html 
    [10]   Yahoo! Research, Efficient Bulk Insertion into a Distributed Ordered Table,
           http://research.yahoo.com/files/bulkload.pdf 




                                                                                              7

Mais conteúdo relacionado

Mais procurados

Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementDataWorks Summit
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path HBaseCon
 
Hug Hbase Presentation.
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.Jack Levin
 
HBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at XiaomiHBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at XiaomiHBaseCon
 
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceHBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceCloudera, Inc.
 
Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction HBaseCon
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0enissoz
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guidelarsgeorge
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersCloudera, Inc.
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaCloudera, Inc.
 
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and  High-Demand EnvironmentHBaseCon 2015: HBase at Scale in an Online and  High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand EnvironmentHBaseCon
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon
 
HBaseCon 2015: Multitenancy in HBase
HBaseCon 2015: Multitenancy in HBaseHBaseCon 2015: Multitenancy in HBase
HBaseCon 2015: Multitenancy in HBaseHBaseCon
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars GeorgeJAX London
 
HBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond PanelHBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond PanelHBaseCon
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low LatencyNick Dimiduk
 

Mais procurados (20)

Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
 
Hug Hbase Presentation.
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.
 
HBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at XiaomiHBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at Xiaomi
 
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceHBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
 
Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guide
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
NoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBaseNoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBase
 
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and  High-Demand EnvironmentHBaseCon 2015: HBase at Scale in an Online and  High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on Mesos
 
HBaseCon 2015: Multitenancy in HBase
HBaseCon 2015: Multitenancy in HBaseHBaseCon 2015: Multitenancy in HBase
HBaseCon 2015: Multitenancy in HBase
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
 
HBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond PanelHBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond Panel
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low Latency
 

Semelhante a HBase 0.20.0 Performance Evaluation

User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network ProcessingRyousei Takano
 
Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1상욱 송
 
Riak add presentation
Riak add presentationRiak add presentation
Riak add presentationIlya Bogunov
 
Hadoop - Lessons Learned
Hadoop - Lessons LearnedHadoop - Lessons Learned
Hadoop - Lessons Learnedtcurdt
 
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...Amazon Web Services
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
Performance tuning jvm
Performance tuning jvmPerformance tuning jvm
Performance tuning jvmPrem Kuppumani
 
Revisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerRevisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerYongseok Oh
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Michael Renner
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter
 
HBase tales from the trenches
HBase tales from the trenchesHBase tales from the trenches
HBase tales from the trencheswchevreuil
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudEdureka!
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BIDenny Lee
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataDataWorks Summit
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...IndicThreads
 

Semelhante a HBase 0.20.0 Performance Evaluation (20)

User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network Processing
 
Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1
 
Riak add presentation
Riak add presentationRiak add presentation
Riak add presentation
 
Hadoop - Lessons Learned
Hadoop - Lessons LearnedHadoop - Lessons Learned
Hadoop - Lessons Learned
 
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
Performance tuning jvm
Performance tuning jvmPerformance tuning jvm
Performance tuning jvm
 
Revisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerRevisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS Scheduler
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
HBase tales from the trenches
HBase tales from the trenchesHBase tales from the trenches
HBase tales from the trenches
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BI
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 

Mais de Schubert Zhang

Engineering Culture and Infrastructure
Engineering Culture and InfrastructureEngineering Culture and Infrastructure
Engineering Culture and InfrastructureSchubert Zhang
 
Simple practices in performance monitoring and evaluation
Simple practices in performance monitoring and evaluationSimple practices in performance monitoring and evaluation
Simple practices in performance monitoring and evaluationSchubert Zhang
 
Scrum Agile Development
Scrum Agile DevelopmentScrum Agile Development
Scrum Agile DevelopmentSchubert Zhang
 
Engineering practices in big data storage and processing
Engineering practices in big data storage and processingEngineering practices in big data storage and processing
Engineering practices in big data storage and processingSchubert Zhang
 
Bigtable数据模型解决CDR清单存储问题的资源估算
Bigtable数据模型解决CDR清单存储问题的资源估算Bigtable数据模型解决CDR清单存储问题的资源估算
Bigtable数据模型解决CDR清单存储问题的资源估算Schubert Zhang
 
Big Data Engineering Team Meeting 20120223a
Big Data Engineering Team Meeting 20120223aBig Data Engineering Team Meeting 20120223a
Big Data Engineering Team Meeting 20120223aSchubert Zhang
 
HBase Coprocessor Introduction
HBase Coprocessor IntroductionHBase Coprocessor Introduction
HBase Coprocessor IntroductionSchubert Zhang
 
Hadoop大数据实践经验
Hadoop大数据实践经验Hadoop大数据实践经验
Hadoop大数据实践经验Schubert Zhang
 
Wild Thinking of BigdataBase
Wild Thinking of BigdataBaseWild Thinking of BigdataBase
Wild Thinking of BigdataBaseSchubert Zhang
 
RockStor - A Cloud Object System based on Hadoop
RockStor -  A Cloud Object System based on HadoopRockStor -  A Cloud Object System based on Hadoop
RockStor - A Cloud Object System based on HadoopSchubert Zhang
 
Hadoop compress-stream
Hadoop compress-streamHadoop compress-stream
Hadoop compress-streamSchubert Zhang
 
Ganglia轻度使用指南
Ganglia轻度使用指南Ganglia轻度使用指南
Ganglia轻度使用指南Schubert Zhang
 
DaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionDaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionSchubert Zhang
 

Mais de Schubert Zhang (20)

Blockchain in Action
Blockchain in ActionBlockchain in Action
Blockchain in Action
 
科普区块链
科普区块链科普区块链
科普区块链
 
Engineering Culture and Infrastructure
Engineering Culture and InfrastructureEngineering Culture and Infrastructure
Engineering Culture and Infrastructure
 
Simple practices in performance monitoring and evaluation
Simple practices in performance monitoring and evaluationSimple practices in performance monitoring and evaluation
Simple practices in performance monitoring and evaluation
 
Scrum Agile Development
Scrum Agile DevelopmentScrum Agile Development
Scrum Agile Development
 
Career Advice
Career AdviceCareer Advice
Career Advice
 
Engineering practices in big data storage and processing
Engineering practices in big data storage and processingEngineering practices in big data storage and processing
Engineering practices in big data storage and processing
 
HiveServer2
HiveServer2HiveServer2
HiveServer2
 
Horizon for Big Data
Horizon for Big DataHorizon for Big Data
Horizon for Big Data
 
Bigtable数据模型解决CDR清单存储问题的资源估算
Bigtable数据模型解决CDR清单存储问题的资源估算Bigtable数据模型解决CDR清单存储问题的资源估算
Bigtable数据模型解决CDR清单存储问题的资源估算
 
Big Data Engineering Team Meeting 20120223a
Big Data Engineering Team Meeting 20120223aBig Data Engineering Team Meeting 20120223a
Big Data Engineering Team Meeting 20120223a
 
HBase Coprocessor Introduction
HBase Coprocessor IntroductionHBase Coprocessor Introduction
HBase Coprocessor Introduction
 
Hadoop大数据实践经验
Hadoop大数据实践经验Hadoop大数据实践经验
Hadoop大数据实践经验
 
Wild Thinking of BigdataBase
Wild Thinking of BigdataBaseWild Thinking of BigdataBase
Wild Thinking of BigdataBase
 
RockStor - A Cloud Object System based on Hadoop
RockStor -  A Cloud Object System based on HadoopRockStor -  A Cloud Object System based on Hadoop
RockStor - A Cloud Object System based on Hadoop
 
Fans of running gump
Fans of running gumpFans of running gump
Fans of running gump
 
Hadoop compress-stream
Hadoop compress-streamHadoop compress-stream
Hadoop compress-stream
 
Ganglia轻度使用指南
Ganglia轻度使用指南Ganglia轻度使用指南
Ganglia轻度使用指南
 
DaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionDaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solution
 
Big data and cloud
Big data and cloudBig data and cloud
Big data and cloud
 

Último

UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 

Último (20)

UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 

HBase 0.20.0 Performance Evaluation

  • 1.   HBase‐0.20.0 Performance Evaluation  Anty Rao and Schubert Zhang, August 21, 2009  {ant.rao, schubert.zhang}@gmail.com,http://cloudepr.blogspot.com/    Background    We  have  been  using  HBase  for  around  a  year  in  our  development  and  projects,  from  0.17.x  to  0.19.x. We and all in the community know the critical performance and reliability issues of these  releases.    Now,  the  great  news  is  that  HBase‐0.20.0  will  be  released  soon.  Jonathan  Gray  from  Streamy,  Ryan  Rawson  from  StumbleUpon,  Michael  Stack  from  Powerset/Microsoft,  Jean‐Daniel  Cryans  from  OpenPlaces,  and  other  contributors  had  done  a  great  job  to  redesign  and  rewrite  many  codes to promote HBase. The two presentations [1] [2] provide more details of this release.    The primary themes of HBase‐0.20.0:  − Performance   Real‐time and Unjavafy software implementations.   HFile, based on BigTable’s SSTable. New file format limits index size.   New API   New Scanners   New Block Cache   Compression (LZO, GZ)   Almost a RegionServer rewrite  − ZooKeeper integration, multiple masters (partly, 0.21 will rewrite Master with better ZK  integration)  Then, we will get a bran‐new, high performance (Random Access, Scan, Insert, …), and stronger  HBase. HBase‐0.20.0 shall be a great milestone, and we should say thanks to all developers.    Following items are very important for us:    − Insert performance: We have very big datasets, which are generated fast.  − Scan performance: For data analysis in MapReduce framework.  − Random Access performance: Provide real‐time services.  − Less memory and I/O overheads: We are neither Google nor Yahoo!, we cannot operate  big cluster with hundreds or thousands of machines, but we really have big data.  − The HFile: Same as SSTable. It should be a common and effective storage element.    Testbed Setup    Cluster:    − 4 slaves + 1 master  − Machine: 4 CPU cores (2.0G), 2x500GB 7200RPM SATA disks, 8GB RAM.  − Linux: RedHat 5.1 (2.6.18‐53.el5), ext3, no RAID    1
  • 2. 1Gbps network, all nodes under the same switch.  − Hadoop‐0.20.0 (1GB heap), HBase‐0.20.0 RC2 (4GB heap), Zookeeper‐3.2.0    By now, HBase‐0.20.0 RC2 is available for evaluation:  http://people.apache.org/~stack/hbase‐0.20.0‐candidate‐2/. Refer to [2] and [4] for installation.    Hadoop‐0.20.0 configuration important parameters:  core‐site.xml:  parameter  value  notes  io.file.buffer.size  65536  Irrelevant to this evaluation. We like use this size to improve  file I/O. [5]  io.seqfile.compress.blocksize  327680  Irrelevant to this evaluation. We like use this size to compress  data block.    hdfs‐site.xml  parameter  value  notes  dfs.namenode.handler.count  20  [5]  dfs.datanode.handler.count  20  [5] [6]  dfs.datanode.max.xcievers  3072  Under heavy read load, you may see lots of DFSClient  complains about no live nodes hold a particular block. [6]  dfs.replication  2  Our cluster is very small. 2 replicas are enough.    mapred‐site.xml  parameter  value  notes  io.sort.factor  25  [5]  io.sort.mb  250  10 * io.sort.factor [5]  mapred.tasktracker.map.tasks.maximum  3  Since our cluster is very small, we  set 3 to avoid client side bottleneck.  mapred.child.java.opts  ‐Xmx512m  JVM GC option  ‐XX:+UseConcMarkSweepGC mapred.reduce.parallel.copies  20    mapred.job.tracker.handler.count  10  [5]    hadoop‐env.sh  parameter  value  notes  HADOOP_CLASSPATH  ${HADOOP_HOME}/../hbase‐0.20.0/hbase‐0.20.0.jar: Use HBase jar and  ${HADOOP_HOME}/../hbase‐0.20.0/conf:${HADOOP_ configurations. Use  HOME}/../hbase‐0.20.0/lib/zookeeper‐r785019‐hbas Zookeeper jar.  e‐1329.jar  HADOOP_OPTS  "‐server ‐XX:+UseConcMarkSweepGC"  JVM GC option    HBase‐0.20.0 configuration important parameters:  hbase‐site.xml    2
  • 3. parameter  value  notes  hbase.cluster.distributed  true  Fully‐distributed with unmanaged ZooKeeper Quorum  hbase.regionserver.handler.count  20      hbase‐env.sh  parameter  value  notes  HBASE_CLASSPATH  ${HBASE_HOME}/../hadoop‐0.20.0/conf  Use hadoop configurations.  HBASE_HEAPSIZE  4000  Give HBase enough heap size  HBASE_OPTS  "‐server ‐XX:+UseConcMarkSweepGC"  JVM GC option  HBASE_MANAGES_ZK  false  Refers to hbase.cluster.distributed    Performance Evaluation Programs    We  modified  the  class  org.apache.hadoop.hbase.PerformanceEvaluation,  since  the  code  has  following problems: The patch is available at http://issues.apache.org/jira/browse/HBASE‐1778.  − It is not updated according to hadoop‐0.20.0.  − The  approach  to  split  maps  is  not  strict.  Need  to  provide  correct  InputSplit  and  InputFormat  classes.  Current  code  uses  TextInputFormat  and  FileSplit,  it  is  not  reasonable.      The evaluation programs use MapReduce to do parallel operations against an HBase table.    − Total rows: 4,194,280.  − Row size: 1000 bytes for value, and 10 bytes for rowkey. Then we have ~4GB of data.  − Sequential  row  ranges:  40.  (It’s  also  used  to  define  the  total  number  of  MapTasks  in  each evaluation.)    − Rows of each sequential range: 104,857  The principle is same as the evaluation programs described in Section 7, Performance Evaluation,  of the Google BigTable paper [3], pages 8‐10. Since we have only 4 nodes to work as clients, we  set mapred.tasktracker.map.tasks.maximum=3 to avoid client side bottleneck.    Performance Evaluation    Report‐1: Normal, autoFlush=false, writeToWAL=true  Google paper Eventual Total rows/s ms/row (rows/s in Experiment Elapsed row/s throughput per node per node single node Time(s) (MB/s) cluster) random reads 948 4424 1106 0.90 4.47 1212 random writes(i) 390 10755 2689 0.37 10.86 8850 random writes 370 11366 2834 0.35 11.45 8850 sequential reads 193 21732 5433 0.18 21.95 4425 sequential writes(i) 359 11683 2921 0.34 11.80 8547 sequential writes 204 20560 5140 0.19 20.77 8547 scans 68 61681 15420 0.06 62.30 15385     3
  • 4. Report‐2: Normal, autoFlush=true, writeToWAL=true  Google paper Eventual Total rows/s ms/row (rows/s in Experiment Elapsed row/s throughput per node per node single node Time(s) (MB/s) cluster) sequential writes(i) 461 9098 2275 0.44 9.19 8547 sequential writes 376 11155 2789 0.36 11.27 8547 random writes 600 6990 1748 0.57 7.06 8850   Random writes  (i)  and  sequential  write  (i)  are  operations  against  a  new  table,  i.e. initial  writes.  Random  writes  and  sequential  writes  are  operations  against  an  existing  table  that  is  already  distributed  on  all  region  servers,  i.e.  distributed  writes.  Since  there  is  only  one  region  server  is  accessed at the beginning of inserting, the performance is not so good until the regions are split  across all region servers. The difference between initial writes and distributed writes for random  writes is not so obvious, but that is distinct for sequential writes.    In Google’s BigTable paper [3], the performance of random writes and sequential writes is very  close. But in our result, sequential writes are faster than random writes, it seems the client side  write‐buffer  (12MB,  autoFlush=false)  taking  effect,  since  it  can  reduce  the  number  of  RPC  packages. And it may imply that further RPC optimizations can gain better performance. If we set  autoFlush true, they will be close too, and less than the number of random writes in report‐1. The  report‐2 shows our inference is correct.    Random reads are far slower than all other operations. Each random read involves the transfer of  a  64KB  HFlie  block  from  HDFS  to  a  region  server,  out  of  which  only  a  single  1000‐byte  value  is  used. So the real throughput is approximately 1106*64KB=70MB/s of data read from HDFS, it is  not low. [3] [8]    The average time to random read a row is sub‐ms here (0.90ms) on average per node, that seems  about as  good  as  we  can  get  on  our  hardware.  We're already  showing  10X better  performance  than a disk seeking (10ms). It should be major contributed by the new BlockCache and new HFile  implementations.  Any  other  improvements  will  have  to  come  from  HDFS  optimizations,  RPC  optimizations, and of course we can always get better performance by loading up with more RAM  for  the  file‐system  cache.  Try  16GB  or  more  RAM,  we  might  get  greater  performance.  But  remember, we're serving out of memory and not disk seeking. Adding more memory (and region  server  heap)  should  help  the  numbers  across  the  board.  The  BigTable  paper  [3]  shows  1212  random  reads  per  second  on  a  single  node.  That's  sub‐ms  for  random  access,  it’s  clearly  not  actually  doing  disk  seeks  for  most  gets.  (Thanks  for  good  comments  from  Jonathan  Gray  and  Michael Stack.) [9]    The performance of sequential reads and scans is very good here, even better than the Google  paper’s  [3].  We  believe  this  performance  will  greatly  support  HBase  for  data  analysis  (MapReduce). In Google’s BigTable paper [3], the performance of writes will be better than reads,  includes  sequential  reads.  But  in  our  test  result,  the  sequential  reads  are  better  than  writes.  Maybe there are rooms to improve the performance of writes in the future.    4
And remember, the dataset in our tests is not big (only ~1GB per node on average), so the hit ratio of the BlockCache is very high (>40%). If a region server serves a large dataset (e.g. 1TB), the benefit of the BlockCache would be reduced. 
 
Bloom filters can reduce unnecessary lookups in HFiles, which speeds up reads, especially for large datasets where an HBase region contains multiple files. 
 
We can also consider RAID0 across multiple disks, and mounting the local file system with the noatime and nodiratime options, for further performance improvements. 
 
Performance Evaluation (no WAL) 
 
In some use cases, such as bulk loading a large dataset into an HBase table, the overhead of the write-ahead logs (commit logs) is considerable, since bulk inserting causes the logs to be rotated often and produces a lot of disk I/O. We therefore consider disabling the WAL in such use cases, at the risk of losing data if a region server crashes. Here we cite a post by Jean-Daniel Cryans on the HBase mailing list [7]: 
 
"As you may know, HDFS still does not support appends. That means that the write ahead logs or WAL that HBase uses are only helpful if synced on disk. That means that you lose some data during a region server crash or a kill -9. In 0.19 the logs could be opened forever if they had under 100000 edits. Now in 0.20 we fixed that by capping the WAL to ~62MB and we also rotate the logs after 1 hour. This is all good because it means far less data loss until we are able to append to files in HDFS. 
Now to why this may slow down your import, the job I was talking about had huge rows so the logs got rotated much more often whereas in 0.19 only the number of rows triggered a log rotation. Not writing to the WAL has the advantage of using far less disk IO but, as you can guess, it means huge data loss in the case of a region server crash. But, in many cases, a RS crash still means that you must restart your job because log splitting can take more than 10 minutes so many tasks times out (I am currently working on that for 0.21 to make it really faster btw)." 
 
So here we call put.setWriteToWAL(false) to disable the WAL, and expect better write performance. The table below shows the evaluation result; a minimal client-side sketch of this mode follows the table. 
 
Report-3: No WAL (autoFlush=false, writeToWAL=false) 
    Experiment             Elapsed    Total     rows/s      ms/row      Eventual       Google paper 
                           Time(s)    rows/s    per node    per node    throughput     (rows/s, single- 
                                                                        (MB/s)         node cluster) 
    random reads           1001       4190      1048        0.95        4.23           1212 
    random writes(i)       260        16132     4033        0.25        16.29          8850 
    random writes          194        21620     5405        0.19        21.84          8850 
    sequential reads       187        22429     5607        0.18        22.65          4425 
    sequential writes(i)   241        17404     4351        0.23        17.58          8547 
    sequential writes      122        34379     8595        0.12        34.72          8547 
    scans                  62         67650     16912       0.06        68.33          15385 
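A minimal sketch of this no-WAL write mode with the 0.20 client API, using the same illustrative table/family/qualifier names as the earlier sketches; the explicit flush at the end anticipates the recommendation in the next paragraph. 
 
    import org.apache.hadoop.hbase.HBaseConfiguration; 
    import org.apache.hadoop.hbase.client.HBaseAdmin; 
    import org.apache.hadoop.hbase.client.HTable; 
    import org.apache.hadoop.hbase.client.Put; 
    import org.apache.hadoop.hbase.util.Bytes; 
 
    public class NoWalLoadSketch { 
      public static void main(String[] args) throws Exception { 
        HBaseConfiguration conf = new HBaseConfiguration(); 
        HTable table = new HTable(conf, "TestTable"); 
        table.setAutoFlush(false); 
 
        for (int i = 0; i < 100000; i++) { 
          Put put = new Put(Bytes.toBytes(String.format("%010d", i))); 
          put.setWriteToWAL(false);   // skip the commit log: faster, but rows still in the 
                                      // memstore are lost if the region server crashes 
          put.add(Bytes.toBytes("info"), Bytes.toBytes("data"), new byte[1000]); 
          table.put(put); 
        } 
        table.flushCommits(); 
 
        // Flush the memstores to HDFS right after the bulk load to limit the 
        // window of possible data loss, as suggested below. 
        HBaseAdmin admin = new HBaseAdmin(conf); 
        admin.flush("TestTable"); 
      } 
    } 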
We can see that the performance of sequential writes and random writes is far better (roughly double) than in the normal case (with WAL). So we can consider using this method for bulk-loading problems that need high insert performance. But we suggest calling admin.flush() immediately after each bulk-loading job to flush the data in the memstores to HDFS, in order to persist the data and avoid loss as much as possible. The above evaluation does not include the time of this flush. 
 
Conclusions 
 
Compared to the metrics in Google's BigTable paper [3], write performance is still not as good, but these results are much better than any previous HBase release, especially for random reads. We even got better results than the paper [3] for sequential reads and scans. This gives us more confidence. 
 
HBase-0.20.0 should be good to use: 
    − as a data store into which large datasets can be inserted/loaded quickly; 
    − to store large datasets to be analyzed by MapReduce jobs; 
    − to provide real-time query services. 
 
We have also compared the performance of MapReduce jobs that sequentially retrieve or scan data from HBase tables versus from plain HDFS files. The latter is about three times faster, and the gap is even bigger for sequential writes. (A rough sketch of such a scan job over an HBase table is given at the end of this section.) 
 
HBase-0.20.0 is not the final word on performance and features. We are looking forward to, and researching, the following features, and need to read the code in more detail: 
    − Other RPC-related improvements 
    − Other Java-related improvements 
    − New master implementation 
    − Bloom filters 
    − Bulk load [10] 
 
In the world of big data, the best performance usually comes from two primary schemes: 
(1) Reducing disk I/O and disk seeks. If we go back and review Section 6, Refinements, of the BigTable paper [3], we find that almost all of those refinements aim at reducing disk I/O and disk seeks: locality groups, compression, caching (scan cache and block cache), Bloom filters, group commit, etc. 
(2) Sequential data access. When implementing our applications and organizing/storing our big data, we should always try to write and read data sequentially. We may not always find a generalized access scheme like that of the traditional database world, which supports accessing data in various views, but this may simply be a trait of the big-data world. 
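As a rough illustration of the kind of scan job referred to above (not our actual evaluation code), the sketch below counts the rows of a table with the org.apache.hadoop.hbase.mapreduce utilities shipped with 0.20; the table name and scanner-caching value are assumptions. 
 
    import org.apache.hadoop.conf.Configuration; 
    import org.apache.hadoop.hbase.HBaseConfiguration; 
    import org.apache.hadoop.hbase.client.Result; 
    import org.apache.hadoop.hbase.client.Scan; 
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable; 
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil; 
    import org.apache.hadoop.hbase.mapreduce.TableMapper; 
    import org.apache.hadoop.mapreduce.Job; 
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat; 
 
    public class ScanTableSketch { 
      // Each map() call receives one row (row key + Result); here we only count rows. 
      static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Result> { 
        @Override 
        protected void map(ImmutableBytesWritable row, Result value, Context context) { 
          context.getCounter("hbase", "rows").increment(1); 
        } 
      } 
 
      public static void main(String[] args) throws Exception { 
        Configuration conf = new HBaseConfiguration();   // reads hbase-site.xml 
        Job job = new Job(conf, "scan TestTable"); 
        job.setJarByClass(ScanTableSketch.class); 
 
        Scan scan = new Scan(); 
        scan.setCaching(30);   // batch rows per scanner RPC for sequential throughput 
 
        TableMapReduceUtil.initTableMapperJob("TestTable", scan, RowCountMapper.class, 
            ImmutableBytesWritable.class, Result.class, job); 
        job.setNumReduceTasks(0); 
        job.setOutputFormatClass(NullOutputFormat.class); 
        System.exit(job.waitForCompletion(true) ? 0 : 1); 
      } 
    } 
 
This mirrors the structure of the evaluation programs, which drive one map task per sequential row range. 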
References: 
[1] Ryan Rawson's presentation on NoSQL. http://blog.oskarsson.nu/2009/06/nosql-debrief.html 
[2] "HBase goes Realtime", HBase presentation at Hadoop Summit 2009, by Jonathan Gray and Jean-Daniel Cryans. http://wiki.apache.org/hadoop-data/attachments/HBase(2f)HBasePresentations/attachments/HBase_Goes_Realtime.pdf 
[3] "Bigtable: A Distributed Storage System for Structured Data", Google. http://labs.google.com/papers/bigtable.html 
[4] HBase-0.20.0-RC2 documentation. http://people.apache.org/~stack/hbase-0.20.0-candidate-2/docs/ 
[5] Cloudera, "Hadoop Configuration Parameters". http://www.cloudera.com/blog/category/mapreduce/page/2/ 
[6] HBase Troubleshooting. http://wiki.apache.org/hadoop/Hbase/Troubleshooting 
[7] Jean-Daniel Cryans's post on the HBase mailing list, Aug 12, 2009: "Tip when migrating your data loading MR jobs from 0.19 to 0.20". 
[8] Adam Jacobs (1010data Inc.), "The Pathologies of Big Data", ACM Queue. http://queue.acm.org/detail.cfm?id=1563874 
[9] Our other evaluation report: http://docloud.blogspot.com/2009/08/hbase-0200-performance-evaluation.html 
[10] Yahoo! Research, "Efficient Bulk Insertion into a Distributed Ordered Table". http://research.yahoo.com/files/bulkload.pdf 