1
Rigorous and Multi-tenant HBase Performance
Govind Kamat, Yanpei Chen
Performance Engineering
2
Bio
Govind Kamat
• Member of the Performance Engineering Team at Cloudera
• Focuses on Hadoop and HBase performance and scalability
• Experience includes the development of large-scale software systems,
microprocessor architecture, compilers and electronic design
Yanpei Chen
• Member of the Performance Engineering Team at Cloudera
• Works on cross-component performance - Hadoop, HBase, Search and Impala
• Ph.D. from UC Berkeley, focus on performance measurement method and theory
3
Outline
• Apache HBase overview
• Measuring performance + YCSB basics
• Cluster setup best practices
• Techniques for rigorous measurement
• HBase in a multi-tenant environment
4
HBase Overview
• Distributed, "NoSQL" key-value store
• Column-oriented, sorted map
• Keys are lexicographically sorted
• Multiple regions across “regionservers”
• Built on HDFS, MapReduce not required
5
Measuring HBase Performance is Hard!
• Numbers not reproducible
• Large run-to-run variation
• Testbeds not clearly defined or properly set up
• Various workloads have been used
• Configuration parameters not specified
• State of regionservers not taken into account
• Reported numbers not comparable
6
Cluster is
down … sigh!
7
Workloads for Performance Measurement
• Set of transactions to be imposed on the database
• read, update, insert, scan and mixes thereof
• Initial data to be loaded into the DB
• Insert
• Transaction load intensity variation over time
• Possible HBase workloads:
• Actual customer/production workloads (best)
• PerformanceEvaluation (not really a workload)
• YCSB (Yahoo! Cloud Serving Benchmark, commonly used)
8
Yahoo! Cloud Serving Benchmark (YCSB) Basics
• Performance evaluation framework for key-value
databases, such as:
• HBase, Cassandra, Sherpa, Accumulo, Voldemort
• Abstracts out the client from the DB
• Flexible and configurable
• Comes with a standard “core” workload
• Reports throughput and latency metrics
9
YCSB Basics - Running YCSB
• Create a table called "usertable" in HBase
$ ycsb [load | run] hbase \
    -p workload=com.yahoo.ycsb.workloads.CoreWorkload \
    -p columnfamily=cf \
    -p operationcount=1000000 \
    -P workloads/randomWrite \
    -threads 10 \
    -s
10
YCSB Basics – YCSB Parameters
• Specified like so: '-p property=value'
• columnfamily, fieldcount, fieldlength
• recordcount, operationcount
• readproportion, updateproportion, scanproportion, ..
• readallfields, writeallfields
• requestdistribution
• maxscanlength, scanlengthdistribution
• maxexecutiontime
11
YCSB Basics - YCSB Output 1/2
2014-05-28 17:08:34:025 1310 sec: 2951422 operations; 2737.33 current ops/sec; [READ AverageLatency(us)=8098.29]
2014-05-28 17:08:44:026 1320 sec: 2972315 operations; 2089.09 current ops/sec; [READ AverageLatency(us)=8671.15]
[OVERALL], RunTime(ms), 1334884.0
[OVERALL], Throughput(ops/sec), 2247.3862897450267
[READ], Operations, 3000000
[READ], AverageLatency(us), 8876.560442666667
[READ], MinLatency(us), 205
[READ], MaxLatency(us), 2530720
[READ], 95thPercentileLatency(ms), 9
[READ], 99thPercentileLatency(ms), 15
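The summary lines above follow a `[SECTION], Metric, value` pattern, which makes them easy to collect when comparing runs. A minimal Python sketch, assuming every summary line keeps that three-field, comma-separated layout; `parse_ycsb_summary` is a hypothetical helper and not part of YCSB:

```python
def parse_ycsb_summary(text):
    """Collect '[SECTION], Metric, value' lines into {(section, metric): float}."""
    results = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Keep only lines with three fields whose first field is a bracketed section.
        if len(parts) != 3:
            continue
        if not (parts[0].startswith("[") and parts[0].endswith("]")):
            continue
        try:
            results[(parts[0][1:-1], parts[1])] = float(parts[2])
        except ValueError:
            continue  # third field is not numeric, so skip the line
    return results


sample = """[OVERALL], RunTime(ms), 1334884.0
[OVERALL], Throughput(ops/sec), 2247.3862897450267
[READ], Operations, 3000000
[READ], AverageLatency(us), 8876.560442666667"""

summary = parse_ycsb_summary(sample)
```

The periodic status lines contain no commas, so the sketch skips them. Per-bucket histogram lines share the same three-field shape and would be picked up with the bucket number as the metric name.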
12
YCSB Basics - YCSB Output 2/2
[READ], 0, 2168499
[READ], 1, 445777
[READ], 2, 29748
[READ], 3, 32264
[READ], 4, 28154
[READ], 5, 26195
[READ], 6, 32222
[READ], 7, 39343
[READ], 8, 44038
[READ], 9, 41481
[...]
[READ], >1000, 11925
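The histogram can be turned into cumulative fractions to cross-check the reported percentiles. A minimal Python sketch, assuming each bucket `n` counts operations that took between `n` and `n+1` ms; `fraction_within` is a hypothetical helper, and only buckets 0 to 9 from the output above are used:

```python
def fraction_within(histogram, total_ops, max_bucket_ms):
    """Fraction of operations whose latency bucket is <= max_bucket_ms."""
    covered = sum(count for bucket, count in histogram.items()
                  if bucket <= max_bucket_ms)
    return covered / total_ops


# Bucket counts copied from the output above (buckets 0-9 only).
read_hist = {0: 2168499, 1: 445777, 2: 29748, 3: 32264, 4: 28154,
             5: 26195, 6: 32222, 7: 39343, 8: 44038, 9: 41481}
total_reads = 3000000  # from the "[READ], Operations" summary line

in_bucket_0 = fraction_within(read_hist, total_reads, 0)
through_bucket_8 = fraction_within(read_hist, total_reads, 8)
through_bucket_9 = fraction_within(read_hist, total_reads, 9)
```

Under that bucket assumption, the cumulative fraction crosses 95% in bucket 9, which agrees with the reported 95th percentile latency of 9 ms.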
13
Cluster Setup Best Practices
• Setting up the cluster
• Configuring HBase
• Creating tables
• Pre-splitting tables
• Loading data
14
HBase Cluster Configuration Best Practices
• Use the appropriate hardware, correctly sized: memory, disk
• Dedicate separate nodes for master services and worker roles
• No Task Trackers and Node Managers on regionserver nodes
• Segregate clients from the regionservers
• Configure HBase properly:
• Block cache (read), memstore (write)
• Bloom filters, compression, compaction, short-circuit reads, etc.
• Use the appropriate data set size, number of regions, etc.
• Monitor the cluster constantly
15
16
Data Loading – Several Options
• Real, actual, production (hot) data 
• Custom loader
• PerformanceEvaluation
• Loading using YCSB
• HFileGenerator followed by bulk-load
17
Data Loading - Pre-split the Table
• Auto-splitting has significant overhead
• RegionSplitter utility
• UniformSplit
• HexStringSplit
• YCSB: user100000 .. user999999
hbase(main):1:0> create 'usertable', 'cf',
  { SPLITS => (1..(50-1)).map { |i| "user#{1000 + i*9000/50}" } }  # 49 split points, 50 regions
• Set maximum region file size to a large value
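The split-point arithmetic in the shell command can be checked outside HBase. The Python sketch below mirrors the same expression, assuming YCSB keys of the form `user<digits>`; `ycsb_split_points` is a hypothetical helper name:

```python
def ycsb_split_points(num_regions=50, low=1000, high=10000):
    """Return num_regions - 1 split keys of the form 'user<NNNN>'."""
    span = high - low
    # Integer division matches the integer arithmetic of the shell expression.
    return [f"user{low + i * span // num_regions}" for i in range(1, num_regions)]


splits = ycsb_split_points()
first_split, last_split = splits[0], splits[-1]
```

Because row keys are compared lexicographically, four-digit prefixes are enough to divide keys with more digits: every key starting with `user117` sorts before the split point `user1180`.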
18
Techniques for Rigorous Measurement
• Keep the input data set fixed
• Warm up the cache
• Set the target throughput
• Use the correct workload distribution
19
Keep the Input Data Set Fixed!
20
Keep the Input Data Set Fixed!
A beginning is the time for taking the most
delicate care that the balances are correct.
The manual of Muad’Dib
From “Dune” by Frank Herbert
21
Cluster is
down … sigh!
22
Warm Up the Cache
• Performance depends significantly on memory
• HBase block cache and OS page cache for reads
• Memstore and WAL for writes
• Load all the rows in the table
• Write until data starts getting flushed
• Compaction can affect performance significantly
• Carry out long-running tests
• Repeat till steady-state
• Otherwise, performance can vary a lot
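One way to decide when a run has reached steady state is to watch the variation of recent throughput samples. A minimal Python sketch, assuming per-interval throughput has already been extracted from the status output; the window size and the 5% threshold are illustrative choices, not values from the talk:

```python
from statistics import mean, pstdev


def is_steady(samples, window=6, max_cv=0.05):
    """True if the last `window` samples have a coefficient of variation
    (standard deviation / mean) below max_cv."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    avg = mean(recent)
    return avg > 0 and pstdev(recent) / avg < max_cv


# Hypothetical per-interval throughput (ops/sec) from a long-running test.
warming = [900, 1400, 1900, 2300, 2600, 2700]
settled = warming + [2740, 2760, 2750, 2755, 2745, 2750]

still_warming = is_steady(warming)
reached_steady_state = is_steady(settled)
```

Measurements taken before `is_steady` returns true reflect cache warm-up rather than sustained performance and are best discarded.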
23
Warm Up the Cache
24
Set the Target Throughput
• Two parameters to set desired throughput
• -threads
• -target
• Actual throughput will match target throughput ...
• ... until the DB hits its limit
• Performance may then begin to degrade
• This throughput defines maximum cluster performance
• Can be used to evaluate different HBase releases
• Otherwise, HBase is never stressed beyond saturation
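The saturation point can be located by running YCSB once per `-target` value and comparing achieved throughput with the target. A minimal Python sketch with made-up measurements; `find_saturation` and the 5% tolerance are assumptions for illustration:

```python
def find_saturation(points, tolerance=0.05):
    """points: (target_ops_per_sec, achieved_ops_per_sec) pairs sorted by target.
    Return the first target missed by more than `tolerance`, or None."""
    for target, achieved in points:
        if achieved < target * (1.0 - tolerance):
            return target
    return None


# Hypothetical results, one YCSB run per -target value.
runs = [(500, 499.2), (1000, 998.7), (2000, 1991.0),
        (3000, 2310.5), (4000, 2250.1)]

saturation_target = find_saturation(runs)
peak_throughput = max(achieved for _, achieved in runs)
```

In this made-up data the cluster keeps up through a target of 2000 ops/sec, falls short at 3000, and degrades slightly at 4000, which is the pattern the bullets above describe.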
25
Set the Target Throughput
26
Use the Appropriate Workload Distribution
• Various types possible
• Uniform (default, but unrealistic)
• Latest
• Hotspot
• Zipfian
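To see why the choice of distribution matters, compare how much traffic the most popular keys receive under each. The Python sketch below computes that share analytically for a rank-based power law; the exponent 0.99 and the key counts are assumed values for illustration:

```python
def top_share(num_keys, top_k, exponent):
    """Share of requests hitting the top_k most popular of num_keys keys
    when the key of rank i (1-based) has weight 1 / i**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    return sum(weights[:top_k]) / sum(weights)


uniform_share = top_share(100_000, 1_000, 0.0)   # exponent 0 is the uniform case
zipfian_share = top_share(100_000, 1_000, 0.99)  # assumed skew exponent
```

With a uniform distribution the top 1% of keys receive 1% of requests. With the skewed distribution they receive a far larger share, so the block cache hit rate, and with it the measured latency, can differ substantially between the two.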
27
Rigorous Measurement Techniques
• Set the cluster up properly
• Keep the input data set fixed
• Pre-split the key space
• Warm up the cache properly
• Set the target throughput
• Use the correct workload distribution
• Monitor cluster statistics continually
28 ©2014 Cloudera, Inc. All rights reserved.
Multi-tenant HBase Performance
• Multi-tenant as in different compute frameworks
29
HBase in a Multi-tenant Environment
[Diagram: platform stack. Layers: Integration, Storage, Resource Management, Metadata. Processing engines: Batch (MR, …), Interactive SQL (Impala), Interactive Search (Solr), Interactive Serving (HBase), Machine Learning. Alongside: System Management, Data Management, Support, Security.]
30
Real Multi-tenant Use Case
• Customer wants to do free-text search on data in HBase
• Explore relevant data beyond just key look-up
• This is “multi-tenant” as in multiple frameworks
• HBase + MapReduce + Cloudera Search (Apache Solr)
• Data indexed into Solr via MapReduce (or Lily HBase Indexer)
• Challenge is to not impact HBase and Solr performance
31
Multi-tenant Performance is Hard!
• Inevitable constraints
• More processing, different processing on the same hardware
• Multi-tenant performance of each framework < stand-alone perf.
• Good multi-tenant performance means
• Efficient - good aggregate performance across HBase/MR/Search
• Fair - performance of each reflects assigned share of resources
• Elastic - transient spare resources get quickly and fully used
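The efficiency and fairness criteria can be made concrete by normalizing each framework's multi-tenant throughput against its stand-alone throughput. The Python sketch below is one possible way to quantify them, not a definition from the talk; all names and numbers are hypothetical:

```python
def multi_tenant_report(standalone, shared, shares):
    """Compare each framework's shared-cluster throughput with its stand-alone
    throughput, scaled by the resource share it was assigned."""
    report = {}
    for name in standalone:
        relative = shared[name] / standalone[name]  # fraction of stand-alone perf.
        report[name] = {
            "relative": relative,
            "fairness": relative / shares[name],    # near 1.0: got its share
        }
    # Sum of relative performance; near 1.0 means little was lost to sharing.
    efficiency = sum(entry["relative"] for entry in report.values())
    return report, efficiency


# Hypothetical throughput in each framework's own unit.
standalone = {"hbase": 2000.0, "mr_indexing": 500.0, "search": 100.0}
shared = {"hbase": 1000.0, "mr_indexing": 125.0, "search": 25.0}
shares = {"hbase": 0.5, "mr_indexing": 0.25, "search": 0.25}

report, efficiency = multi_tenant_report(standalone, shared, shares)
```

Elasticity is not captured here: it would need time-series data showing how quickly one framework absorbs resources that another leaves idle.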
32
Practically doing HBase → Solr via MapReduce
• Configure HBase, Search, and MapReduce
• Large set of performance-relevant parameters for each
• Configure each to achieve a desired resource share
• Many implicit resource controls
• Set up the datasets for high performance
• How many regions for the HBase table
• How many shards for the Solr collection
33
Start with stand-alone performance
• Stand-alone MR indexing rate of HBase → Search
• Should be no lower than that for HDFS → Search
34
Start with stand-alone performance
• Stand-alone MR indexing rate of HBase → Search
• Should be no lower than that for HDFS → Search
[Figure: resource usage (up to capacity) vs. time. HBase, MR, and Solr are all idle before and after a period of MapReduce indexing HBase → Solr.]
35
Multi-tenant Performance
• MR indexing HBase → Solr while both are active
• Test efficiency, fairness, elasticity
[Figure: resource usage (up to capacity) vs. time, with regions labeled HBase transactions, MR indexing HBase → Solr, and Search queries.]
36
Recap
• HBase essential to an enterprise data hub
• Need for multiple frameworks to analyze HBase data
• Challenging to define/measure multi-tenant performance
• Not tractable without rigorous techniques
• Look for discipline and rigor in performance numbers!
37
Thanks!
• gkamat@cloudera.com
• yanpei@cloudera.com
38
Backup slides
39
Building YCSB
$ git clone http://github.com/brianfrankcooper/YCSB
$ mvn package -DskipTests
diff --git a/pom.xml b/pom.xml
- <maven.assembly.version>2.2.1</maven.assembly.version>
- <hbase.version>0.92.1</hbase.version>
+ <maven.assembly.version>2.4</maven.assembly.version>
+ <hbase.version>0.98.1-hadoop2</hbase.version>
40
Building YCSB (contd.)
diff --git a/hbase/pom.xml b/hbase/pom.xml
- <artifactId>hbase</artifactId>
+ <artifactId>hbase-client</artifactId>
- <artifactId>hadoop-core</artifactId>
- <version>1.0.0</version>
+ <artifactId>hadoop-common</artifactId>
+ <version>2.3.0</version>

Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Último (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

Rigorous and Multi-tenant HBase Performance Measurement

  • 1. 1 Rigorous and Multi-tenant HBase Performance Govind Kamat, Yanpei Chen Performance Engineering
  • 2. 2 Bio Govind Kamat • Member of the Performance Engineering Team at Cloudera • Focuses on Hadoop and HBase performance and scalability • Experience includes the development of large-scale software systems, microprocessor architecture, compilers and electronic design Yanpei Chen • Member of the Performance Engineering Team at Cloudera • Works on cross-component performance - Hadoop, HBase, Search and Impala • Ph.D. from UC Berkeley, focus on performance measurement method and theory
  • 3. 3 Outline • Apache HBase overview • Measuring performance + YCSB basics • Cluster setup best practices • Techniques for rigorous measurement • HBase in a multi-tenant environment
  • 4. 4 HBase Overview • Distributed, "NoSQL" key-value store • Column-oriented, sorted map • Keys are lexicographically sorted • Multiple regions across “regionservers” • Built on HDFS, MapReduce not required
  • 5. 5 Measuring HBase Performance is Hard! • Numbers not reproducible • Large run-to-run variation • Testbeds not clearly defined/properly set up • Various workloads have been used • Configuration parameters not specified • State of regionservers not taken into account • Reported numbers not comparable
  • 7. 7 Workloads for Performance Measurement • Set of transactions to be imposed against it • read, update, insert, scan and mixes thereof • Initial data to be loaded into the DB • Insert • Transaction load intensity variation over time • Possible HBase workloads: • Actual customer/production workloads (best) • PerformanceEvaluation (not really a workload) • YCSB (Yahoo! Cloud Serving Benchmark, commonly used)
  • 8. 8 Yahoo! Cloud Serving Benchmark (YCSB) Basics • Performance evaluation framework for key-value databases, such as: • HBase, Cassandra, Sherpa, Accumulo, Voldemort • Abstracts out the client from the DB • Flexible and configurable • Comes with a standard “core” workload • Reports throughput and latency metrics
  • 9. 9 YCSB Basics - Running YCSB • Create a table called "usertable" in HBase $ ycsb [load | run] hbase -p workload=com.yahoo.ycsb.workloads.CoreWorkload -p columnfamily=cf -p operationcount=1000000 -P workloads/randomWrite -threads 10 -s
  • 10. 10 YCSB Basics – YCSB Parameters • Specified like so: '-p property=value' • columnfamily, fieldcount, fieldlength • recordcount, operationcount • readproportion, updateproportion, scanproportion, .. • readallfields, writeallfields • requestdistribution • maxscanlength, scanlengthdistribution • maxexecutiontime
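Scripting repeated runs is easier when the command line is assembled from a table of properties. The following is a minimal sketch, assuming the `ycsb` launcher is on the PATH as in the slide above; the helper name `ycsb_command` and its arguments are illustrative and not part of YCSB. It only builds the argument list and does not start a run.

```python
def ycsb_command(phase, properties, workload_file, threads, target=None):
    """Return the argv list for a `ycsb load` or `ycsb run` against HBase.

    `ycsb_command` is a hypothetical helper; only the flags it emits
    (-P, -p, -threads, -target, -s) come from the slides.
    """
    if phase not in ("load", "run"):
        raise ValueError("phase must be 'load' or 'run'")
    argv = ["ycsb", phase, "hbase", "-P", workload_file]
    # One '-p property=value' pair per entry, sorted for a stable command line.
    for name, value in sorted(properties.items()):
        argv += ["-p", f"{name}={value}"]
    argv += ["-threads", str(threads)]
    if target is not None:
        argv += ["-target", str(target)]
    argv.append("-s")  # periodic status lines, as in the slide's example
    return argv


argv = ycsb_command(
    "run",
    {"columnfamily": "cf", "operationcount": 1000000},
    "workloads/randomWrite",
    threads=10,
    target=2000,
)
print(" ".join(argv))
```

Keeping the properties in a dictionary makes it straightforward to record the exact configuration next to each result, which addresses the "configuration parameters not specified" problem listed earlier.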
  • 11. 11 YCSB Basics - YCSB Output 1/2
      2014-05-28 17:08:34:025 1310 sec: 2951422 operations; 2737.33 current ops/sec; [READ AverageLatency(us)=8098.29]
      2014-05-28 17:08:44:026 1320 sec: 2972315 operations; 2089.09 current ops/sec; [READ AverageLatency(us)=8671.15]
      [OVERALL], RunTime(ms), 1334884.0
      [OVERALL], Throughput(ops/sec), 2247.3862897450267
      [READ], Operations, 3000000
      [READ], AverageLatency(us), 8876.560442666667
      [READ], MinLatency(us), 205
      [READ], MaxLatency(us), 2530720
      [READ], 95thPercentileLatency(ms), 9
      [READ], 99thPercentileLatency(ms), 15
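The summary lines in this output follow a regular `[SECTION], Metric, value` pattern, so they are easy to collect for comparison across runs. The sketch below is one way to do that, assuming the line format shown on the slide; the function name `parse_summary` is illustrative and not part of YCSB.

```python
import re

# Matches lines such as '[OVERALL], RunTime(ms), 1334884.0'.
SUMMARY_LINE = re.compile(
    r"^\[(?P<section>[A-Z-]+)\],\s*(?P<metric>[^,]+),\s*(?P<value>[-+0-9.eE]+)$"
)


def parse_summary(lines):
    """Collect '[SECTION], Metric, value' lines into {section: {metric: float}}."""
    summary = {}
    for line in lines:
        match = SUMMARY_LINE.match(line.strip())
        if match is None:
            continue  # timestamped status lines and anything unrecognised
        metric = match["metric"].strip()
        if metric.lstrip(">").isdigit():
            continue  # histogram bucket lines such as '[READ], 0, 2168499'
        summary.setdefault(match["section"], {})[metric] = float(match["value"])
    return summary


sample = [
    "2014-05-28 17:08:44:026 1320 sec: 2972315 operations; "
    "2089.09 current ops/sec; [READ AverageLatency(us)=8671.15]",
    "[OVERALL], RunTime(ms), 1334884.0",
    "[OVERALL], Throughput(ops/sec), 2247.3862897450267",
    "[READ], Operations, 3000000",
    "[READ], AverageLatency(us), 8876.560442666667",
    "[READ], 95thPercentileLatency(ms), 9",
    "[READ], 99thPercentileLatency(ms), 15",
]
summary = parse_summary(sample)
print(summary["OVERALL"]["Throughput(ops/sec)"])
```

As a consistency check on the slide's numbers, the reported overall throughput equals the operation count divided by the run time in seconds (3000000 operations over 1334.884 s).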
  • 12. 12 YCSB Basics - YCSB Output 2/2 [READ], 0, 2168499 [READ], 1, 445777 [READ], 2, 29748 [READ], 3, 32264 [READ], 4, 28154 [READ], 5, 26195 [READ], 6, 32222 [READ], 7, 39343 [READ], 8, 44038 [READ], 9, 41481 [...] [READ], >1000, 11925
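Each histogram line gives the number of reads whose latency fell into a one-millisecond bucket, with `>1000` as the overflow bucket. A percentile can be read off by accumulating bucket counts until the requested fraction of operations is covered. The sketch below illustrates that idea using the ten bucket counts visible on the slide and the total of 3000000 reads from the previous slide; the buckets elided by `[...]` are lumped into a single remainder, and the function name is illustrative rather than YCSB's own code.

```python
def percentile_bucket(buckets, remainder, fraction):
    """Return the first bucket (in ms) at which the cumulative count reaches
    `fraction` of all operations, or None if that point lies in `remainder`
    (operations counted in buckets that are not listed)."""
    total = sum(buckets) + remainder
    threshold = fraction * total
    running = 0
    for millis, count in enumerate(buckets):
        running += count
        if running >= threshold:
            return millis
    return None


# Bucket counts for 0..9 ms as shown on the slide.
visible = [2168499, 445777, 29748, 32264, 28154,
           26195, 32222, 39343, 44038, 41481]
total_reads = 3000000                    # '[READ], Operations' on the previous slide
remainder = total_reads - sum(visible)   # everything hidden behind '[...]' and '>1000'

p95 = percentile_bucket(visible, remainder, 0.95)
p99 = percentile_bucket(visible, remainder, 0.99)
print(p95, p99)
```

The 95th percentile lands in the 9 ms bucket, which agrees with the `95thPercentileLatency(ms), 9` line in the summary. The 99th percentile falls in the elided part of the histogram, so this sketch cannot reproduce it and returns `None`.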
  • 13. 13 Cluster Setup Best Practices • Setting up the cluster • Configuring HBase • Creating tables • Pre-splitting tables • Loading data
  • 14. 14 HBase Cluster Configuration Best Practices • Use the appropriate hardware, correctly sized: memory, disk • Dedicate separate nodes for master services and worker roles • No Task Trackers and Node Managers on regionserver nodes • Segregate clients from the regionservers • Configure HBase properly: • Block cache (read), memstore (write) • Bloom filters, compression, compaction, short-circuit reads, etc. • Use the appropriate data set size, number of regions, etc. • Monitor the cluster constantly
  • 15. 15
  • 16. 16 Data Loading – Several Options • Real, actual, production (hot) data • Custom loader • PerformanceEvaluation • Loading using YCSB • HFileGenerator followed by bulk-load
  • 17. 17 Data Loading - Pre-split the Table • Auto-splitting has significant overhead • RegionSplitter utility • UniformSplit • HexStringSplit • YCSB: user100000 .. user999999 hbase(main):1:0> create 'usertable', 'cf', { SPLITS => (1..(50-1)).map {|i| "user#{1000 + i*9000/50}" } } # 50 splits • Set maximum region file size to a large value
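The split expression in the HBase shell command above can be checked outside the shell. The following sketch mirrors the slide's Ruby arithmetic in Python, purely to show which split keys it produces; the function name and its parameters are illustrative.

```python
def ycsb_split_keys(regions=50, low=1000, span=9000):
    """Mirror the slide's expression "user#{1000 + i*9000/50}" for i in 1..49.

    Integer division matches Ruby's behaviour for integer operands.
    """
    return [f"user{low + i * span // regions}" for i in range(1, regions)]


split_keys = ycsb_split_keys()
print(len(split_keys), split_keys[0], split_keys[-1])
```

This yields 49 split points, from `user1180` to `user9820`, which define 50 regions. The keys are only four digits long, yet they divide the six-digit range `user100000 .. user999999` evenly because HBase sorts row keys lexicographically: `user1180` sorts just before `user118000`.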
  • 18. 18 Techniques for Rigorous Measurement • Keep the input data set fixed • Warm up the cache • Set the target throughput • Use the correct workload distribution
  • 19. 19 Keep the Input Data Set Fixed!
  • 20. 20 Keep the Input Data Set Fixed! A beginning is the time for taking the most delicate care that the balances are correct. The manual of Muad’Dib From “Dune” by Frank Herbert
  • 22. 22 Warm Up the Cache • Performance depends significantly on memory • HBase block cache and OS page cache for reads • Memstore and WAL for writes • Load all the rows in the table • Write until data starts getting flushed • Compaction can affect performance significantly • Carry out long-running tests • Repeat till steady-state • Otherwise, performance can vary a lot
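"Repeat till steady-state" needs a concrete stopping rule. One simple possibility, sketched below, is to slide a window over the per-interval throughput readings and declare steady state once the coefficient of variation within the window drops under a threshold. The window size, the 2% threshold and the sample readings are all assumptions chosen for illustration, not values from the talk.

```python
from statistics import mean, pstdev


def steady_state_start(throughputs, window=5, max_cv=0.02):
    """Return the index of the first window whose coefficient of variation
    (stddev / mean) is at most max_cv, or None if no window qualifies."""
    for start in range(len(throughputs) - window + 1):
        chunk = throughputs[start:start + window]
        average = mean(chunk)
        if average > 0 and pstdev(chunk) / average <= max_cv:
            return start
    return None


# Hypothetical ops/sec readings: a cold cache followed by a plateau.
readings = [900, 1400, 1900, 2300, 2500, 2520, 2480, 2510, 2490, 2505]
warm_from = steady_state_start(readings)
print(warm_from)
```

Measurements taken before the returned index would be discarded as warm-up. A longer window is more conservative, which matters when compactions cause periodic dips.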
  • 23. 23 Warm Up the Cache
  • 24. 24 Set the Target Throughput • Two parameters to set desired throughput • -threads • -target • Actual throughput will match target throughput ... • ... until the DB hits its limit • Performance may then begin to degrade • This throughput defines maximum cluster performance • Can be used to evaluate different HBase releases • Otherwise, HBase is never stressed beyond saturation
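The saturation point described here can be located by running the same workload at a series of `-target` values and comparing the achieved throughput with each target. The sketch below assumes such a sweep has already been run; the 5% tolerance and the sample numbers are hypothetical.

```python
def saturation_point(sweep, tolerance=0.05):
    """Given (target, achieved) ops/sec pairs, return the highest target the
    cluster still met within `tolerance`, or None if it met none of them."""
    met = [target for target, achieved in sweep
           if achieved >= target * (1 - tolerance)]
    return max(met) if met else None


# Hypothetical sweep results, one YCSB run per -target value.
sweep = [(500, 499), (1000, 998), (2000, 1985), (3000, 2240), (4000, 2190)]
limit = saturation_point(sweep)
print(limit)
```

In this made-up sweep the cluster keeps up through a target of 2000 ops/sec and falls short at 3000, so further runs would narrow the search between those two values.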
  • 25. 25 Set the Target Throughput
  • 26. 26 Use the Appropriate Workload Distribution • Various types possible • Uniform (default, but unrealistic) • Latest • Hotspot • Zipfian
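The practical difference between these distributions is how much of the traffic lands on a small set of keys, which in turn decides how well the block cache works. The sketch below compares uniform and Zipfian key selection using Python's standard library. It is not YCSB's generator; the exponent, key count and request count are assumptions chosen for illustration.

```python
import random


def zipfian_weights(n_keys, exponent=0.99):
    """Unnormalised Zipfian weights: the key at rank r gets weight 1 / r**exponent."""
    return [1.0 / rank ** exponent for rank in range(1, n_keys + 1)]


def hot_share(samples, n_keys, hot_fraction=0.1):
    """Fraction of requests that hit the first hot_fraction of the key space."""
    hot_limit = int(n_keys * hot_fraction)
    return sum(1 for key in samples if key < hot_limit) / len(samples)


rng = random.Random(42)  # fixed seed so the comparison is repeatable
n_keys, n_requests = 1000, 20000
keys = range(n_keys)
uniform = rng.choices(keys, k=n_requests)
zipfian = rng.choices(keys, weights=zipfian_weights(n_keys), k=n_requests)

print(f"uniform: {hot_share(uniform, n_keys):.2f} of requests hit the top 10% of keys")
print(f"zipfian: {hot_share(zipfian, n_keys):.2f} of requests hit the top 10% of keys")
```

Under the uniform distribution the top 10% of keys receive about 10% of requests, while under the Zipfian distribution they receive well over half, which is why results from the two are not comparable.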
  • 27. 27 Rigorous Measurement Techniques • Set the cluster up properly • Keep the input data set fixed • Pre-split the key space • Warm up the cache properly • Set the target throughput • Use the correct workload distribution • Monitor cluster statistics continually
  • 28. 28 Multi-tenant HBase Performance • Multi-tenant as in different compute frameworks ©2014 Cloudera, Inc. All rights reserved.
  • 29. 29 HBase in a Multi-tenant Environment [Diagram labels: Integration, Storage, Resource Management, Metadata; Processing: Batch (MR, …), Interactive SQL (Impala), Interactive Search (Solr), Interactive Serving (HBase), Machine Learning; System Management, Data Management, Support, Security]
  • 30. 30 Real Multi-tenant Use Case • Customer wants to do free-text search on data in HBase • Explore relevant data beyond just key look-up • This is “multi-tenant” as in multiple frameworks • HBase + MapReduce + Cloudera Search (Apache Solr) • Data indexed into Solr via MapReduce (or Lily HBase Indexer) • Challenge is to not impact HBase and Solr performance
  • 31. 31 Multi-tenant Performance is Hard! • Inevitable constraints • More processing, different processing on the same hardware • Multi-tenant performance of each framework < stand-alone perf. • Good multi-tenant performance means • Efficient - good aggregate performance across HBase/MR/Search • Fair - performance of each reflects assigned share of resources • Elastic - transient spare resources get quickly and fully used
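The slide states these goals qualitatively and does not define metrics for them. One possible way to put numbers on "efficient" and "fair", offered here only as an assumption, is to normalise each framework's multi-tenant performance by its stand-alone performance and compare the result with its assigned resource share. All function names and figures below are hypothetical.

```python
def retained_fraction(standalone, multitenant):
    """Per framework: multi-tenant performance as a fraction of stand-alone."""
    return {name: multitenant[name] / standalone[name] for name in standalone}


def fairness_gap(standalone, multitenant, shares):
    """Per framework: retained fraction minus assigned share.
    Positive means it got more than its share, negative means less."""
    retained = retained_fraction(standalone, multitenant)
    return {name: retained[name] - shares[name] for name in standalone}


def efficiency(standalone, multitenant):
    """Sum of retained fractions; 1.0 means the shared cluster does as much
    total normalised work as a single framework running alone."""
    return sum(retained_fraction(standalone, multitenant).values())


# Hypothetical measurements in each framework's own unit
# (ops/sec for HBase, documents/sec for MR indexing, queries/sec for Search).
standalone = {"hbase": 2000.0, "mr_indexing": 400.0, "search": 200.0}
multitenant = {"hbase": 1100.0, "mr_indexing": 80.0, "search": 50.0}
shares = {"hbase": 0.50, "mr_indexing": 0.25, "search": 0.25}

gaps = fairness_gap(standalone, multitenant, shares)
total = efficiency(standalone, multitenant)
print(gaps, total)
```

Normalising by stand-alone performance lets frameworks with different units be compared on one scale. Elasticity is not covered by this sketch, since it depends on how quickly shares shift over time rather than on steady-state numbers.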
  • 32. 32 Practically doing HBase → Solr via MapReduce • Configure HBase, Search, and MapReduce • Large set of performance-relevant parameters for each • Configure each to achieve a desired resource share • Many implicit resource controls • Set up the datasets for high performance • How many regions for the HBase table • How many shards for the Solr collection
  • 33. 33 Start with stand-alone performance • Stand-alone MR indexing rate of HBase → Search • Should be no lower than that for HDFS → Search
  • 34. 34 Start with stand-alone performance • Stand-alone MR indexing rate of HBase → Search • Should be no lower than that for HDFS → Search [Figure: resource usage against time, bounded by capacity; MapReduce indexing HBase → Solr runs between periods in which HBase, MR and Solr are all idle]
  • 35. 35 Multi-tenant Performance • MR indexing HBase → Solr while both are active • Test efficiency, fairness, elasticity [Figure: resource usage against time, bounded by capacity; HBase transactions, MR indexing HBase → Solr, and Search queries share the cluster]
  • 36. 36 Recap • HBase essential to an enterprise data hub • Need for multiple frameworks to analyze HBase data • Challenging to define/measure multi-tenant performance • Not tractable without rigorous techniques • Look for discipline and rigor in performance numbers!
  • 37. 37 Thanks! • gkamat@cloudera.com • yanpei@cloudera.com
  • 38. 38 Backup slides
  • 39. 39 Building YCSB
      $ git clone http://github.com/brianfrankcooper/YCSB
      $ mvn package -DskipTests
      diff --git a/pom.xml b/pom.xml
      - <maven.assembly.version>2.2.1</maven.assembly.version>
      - <hbase.version>0.92.1</hbase.version>
      + <maven.assembly.version>2.4</maven.assembly.version>
      + <hbase.version>0.98.1-hadoop2</hbase.version>
  • 40. 40 Building YCSB (contd.)
      diff --git a/hbase/pom.xml b/hbase/pom.xml
      - <artifactId>hbase</artifactId>
      + <artifactId>hbase-client</artifactId>
      - <artifactId>hadoop-core</artifactId>
      - <version>1.0.0</version>
      + <artifactId>hadoop-common</artifactId>
      + <version>2.3.0</version>

Editor's Notes

  1. 2014-06-04 12:30pm