Hadoop/HBase POC v1 Review

 A framework for Hadoop/HBase POC
POC
• Proof of concept, usually run in competition with
  another product.
• Given use case:
  – Performance: critical path (speed). Most
    benchmarks cover read performance; shard for write
    performance
  – Cost: H/W + administrative cost
  – Look at HBase+Hadoop vs. MongoDB
HBase
• Transactional store; 70k messages/sec at
  1.5 KB/message. Needs >1 Gb Ethernet speeds
• What is HBase? Sources
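As a rough check on the target above, the payload bandwidth is messages/sec times message size. This is a minimal sketch, assuming 1.5 KB means 1,500 bytes (the packet size used in the later write tests) and ignoring protocol overhead and replication traffic. The class and method names are illustrative only.

```java
/** Back-of-envelope payload bandwidth for the POC write target. */
final class WriteTarget {
    /** Payload rate in megabits per second (1 Mbit = 1,000,000 bits). */
    static double payloadMbitPerSec(long messagesPerSec, long bytesPerMessage) {
        long bitsPerSec = messagesPerSec * bytesPerMessage * 8L;
        return bitsPerSec / 1_000_000.0;
    }

    public static void main(String[] args) {
        // 70k messages/sec at an assumed 1,500 bytes each
        System.out.println(payloadMbitPerSec(70_000, 1_500) + " Mbit/s");
    }
}
```

Under these assumptions the payload alone comes to 840 Mbit/s, so a single 1 Gb link has little headroom once framing and replication are added.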
Cloudera HBase Training Materials
• Exercises:
  http://async.pbworks.com/w/file/55596671/HBase_exercise_instructions.pdf
• Training Slides:
  http://async.pbworks.com/w/file/54915308/Cloudera_hbase_training.pdf
• Training VM: 2 GB; put somewhere else.
System Design on working components
HDFS vs. HBase
• HDFS is a replicated, distributed FS. Think NFS, not just replicas.
  Metadata lives at a central NameNode, a single point of failure.
  The Secondary NN only checkpoints the metadata; it is not a hot
  backup. Failure and recovery protocol testing is not part of the POC.
• Blocks: larger is better. HDFS replicates blocks, not cells.
• HDFS is write-once; it was modified to support file append for HBase.
• MapR is HDFS compatible:
   –   Fast adoption w/HBase; snapshots
   –   Cross data center mirroring, consistent mirroring
   –   Star replication vs. chain replication
   –   FileServer vs. TaskTracker, Warden vs. NN. No single point of failure
RegionServer (RS) + DataNode (DN) on same machine
HBase Memory (book)
HBase Disks (book)
• No RAID on slaves; RAID on the master is OK. Use IOPS
HBase Networking (book)
Transactional Write Perf.
• Factor out the network, multiple clients, and any disk
  seeks from the test program
• Create test packets in memory only.
• Write perf is a function of instance
  memory and packet size
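The test setup above (packets built in memory, no network or disk in the test program itself) can be sketched as a small timing harness. This is only an illustration: `Sink` is a hypothetical stand-in for the store under test, not an HBase interface, and the no-op sink in `main` measures harness overhead rather than a real write path.

```java
import java.util.Random;

/** Times writes of in-memory synthetic packets against a pluggable sink. */
final class WriteBench {
    /** Hypothetical stand-in for the store under test (e.g. a wrapper around a put call). */
    interface Sink {
        void write(long key, byte[] payload);
    }

    /** Builds one random packet up front so generation cost stays out of the timed loop. */
    static byte[] syntheticPacket(int sizeBytes, long seed) {
        byte[] packet = new byte[sizeBytes];
        new Random(seed).nextBytes(packet);
        return packet;
    }

    /** Writes count packets of packetSize bytes and returns measured packets per second. */
    static double run(Sink sink, int count, int packetSize) {
        byte[] packet = syntheticPacket(packetSize, 42L);
        long start = System.nanoTime();
        for (long key = 0; key < count; key++) {
            sink.write(key, packet);
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return count * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) {
        long[] bytesWritten = {0};
        // No-op sink: counts bytes only, so this is not a real store measurement.
        double pps = run((key, payload) -> bytesWritten[0] += payload.length, 300_000, 1_500);
        System.out.printf("%.0f packets/sec, %d bytes%n", pps, bytesWritten[0]);
    }
}
```

Swapping the lambda for a sink that calls the real client keeps the packet generation and timing identical across stores, which is the point of factoring those pieces out.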
HBase Write Path
Run on Amazon AWS first
• INSTANCES:
  – SMALL INSTANCE: 1.7GB
  – LARGE INSTANCE: 7.5GB
  – HIMEM XLARGE: 17GB, 34GB, 68GB
  – SSD DRIVES!!
Write performance, 300k m/s 1500 bytes synthetic data.
[Chart: Series 2; y-axis 0 to 3500; x-axis ticks 1.7, 7.5, 17, 34, 68 (the instance memory sizes in GB)]
Dell Notes:
•   MapR says 16 GB, Cloudera says 24 GB.
•   Plot heap size instead.
•   Dell, is this slowing down performance?
•   Take out a DIMM?
•   Reproduce results first?
HBase write perf, 1 MB/s
• http://www.slideshare.net/jdcryans/performance-bof12:
  40k-100k/second with 10-byte packets
Write test code
• No network, no disk accesses. Run on local
  node
HBase AWS Packet Size 16-1500 bytes
• http://async.pbworks.com/w/file/55320973/AWSHBasePerf16_1500bytepacket.xlsx
HBase Write Perf, 1500 byte packets
• Single thread, single node. Should be much higher
  w/more threads or an async client
• 16 bytes: 11235 p/s
• 40 bytes: 8064 p/s
• 80 bytes: 5263 p/s
• 1500 bytes: 3311 p/s
• 8 GB heap, big regions (optimizations in file
  names), etc. 12-20 settings tried; 4 make a difference
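To compare the packet sizes above on equal terms, the packets/sec figures can be converted to payload bytes/sec. This is a minimal sketch using only the four measurements listed on the slide; the class name is illustrative.

```java
/** Converts the measured packets/sec figures into payload bytes/sec. */
final class PacketSizeThroughput {
    static long bytesPerSec(int packetBytes, int packetsPerSec) {
        return (long) packetBytes * packetsPerSec;
    }

    public static void main(String[] args) {
        // {packet size in bytes, measured packets/sec} from the slide
        int[][] measured = { {16, 11235}, {40, 8064}, {80, 5263}, {1500, 3311} };
        for (int[] m : measured) {
            System.out.printf("%4d B x %5d p/s = %d B/s%n", m[0], m[1], bytesPerSec(m[0], m[1]));
        }
    }
}
```

Going from 16-byte to 1500-byte packets, packets/sec drops by about 3.4x while payload bytes/sec rises by roughly 28x (179,760 B/s to 4,966,500 B/s), which suggests per-operation overhead rather than payload size dominates for small packets.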
AWS Reduce #RPC
• Batch mode: 1000 inserts = 1000 RPCs; reduce
  to 1 RPC w/batch. 3610 p/s (5.4 MB/s, passes
  error check, m2.2xlarge instance). Note: mongo
[Chart: Series1; y-axis 0 to 2.5; x-axis 1 to 286]
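The RPC reduction from batching can be illustrated with a counting stub. This is a sketch under stated assumptions: `CountingClient` is hypothetical and models one RPC per `send` call against a single node. In the HBase Java client the comparable move is passing a list of `Put` objects in one call instead of issuing one put per row.

```java
import java.util.ArrayList;
import java.util.List;

/** Counts round trips for per-row writes versus batched writes. */
final class BatchingSketch {
    /** Hypothetical client: each call to send() models one RPC. */
    static final class CountingClient {
        int rpcs = 0;
        int rows = 0;

        void send(List<byte[]> batch) {
            rpcs++;
            rows += batch.size();
        }
    }

    /** Sends rows in chunks of at most batchSize; returns the number of RPCs issued. */
    static int writeAll(CountingClient client, List<byte[]> rows, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        int before = client.rpcs;
        for (int from = 0; from < rows.size(); from += batchSize) {
            int to = Math.min(from + batchSize, rows.size());
            client.send(rows.subList(from, to));
        }
        return client.rpcs - before;
    }

    public static void main(String[] args) {
        List<byte[]> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(new byte[1500]);
        }
        System.out.println(writeAll(new CountingClient(), rows, 1));    // one RPC per insert
        System.out.println(writeAll(new CountingClient(), rows, 1000)); // one RPC for the whole batch
    }
}
```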
Dell H/W Perf. (default config) worse:
       2262 p/s vs. 3311 p/s (AWS)
       http://async.pbworks.com/w/file/55225682/graphdell1500bytepacket8gb.txt
Dell, WAL off: 2262 -> 2688 p/s (+18.5%)
[Chart: Series1; y-axis 0 to 1; x-axis 1 to 295]
Dell, WAL disabled, big heap, big regions: 2262 -> 3557 p/s, a 57% increase. Need more time.
[Chart: Series1; y-axis 0 to 6; x-axis 1 to 295]
AWS SSD (3267 p/s) vs. EBS (4042 p/s), no compaction. Red:
m2large. Maybe AWS using SSD?
[Chart: Series1 and Series2; y-axis 0 to 5; x-axis 1 to 295]
AWS (3500-4k packets/sec) vs. Dell
• AWS: 3-4k p/s in default configuration w/o optimization.
• Dell (3557 p/s) is slower than AWS (3610 p/s optimized
  m2.2xlarge, 4240 p/s m2large)
• Faster h/w instances in AWS make a difference.
  Lesson (4210 p/s): controlling the regions and
  compactions has an impact on performance, as does fast IO.
  Spend time later on this.
• User error w/Dell h/w somewhere. Can't be that slow!
• Could run a benchmark on m2.2xlarge over a 24h period
  to see variability in perf. Not worth the time investment.
Dell Tuning
• Ext3 vs. ext4: 5% diff in benchmarks. No diff in p/s
  performance.
• RAID levels? JBOD not available.
• Maybe the high-perf AWS drives on m2.2xlarge
  are SSD? Seems funny w/pricing structure.
• noatime, 256k block sizes
• Goal: 4k p/s?
Bulk Load (worth time investment?)
• Quick error check
• Take existing table, export it, bulk load.
  Command line; very rough.
• Should redo w/a Java program. WAL off is an
  approximation
Write Clients for NoSQL
• HBase, Mongo, and Cassandra have threads
  behind them; need a threaded or async client
  to get full performance.
• Need more time; higher priority than dist
  mode, and needed in dist mode
• Lock timeout behavior; insert 1 row
• Need a threaded or async client. Most get
  threaded design wrong?
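A threaded write client along the lines described above might look like the following sketch. It is an assumption-laden illustration, not the tool used in the POC: `Sink` is a hypothetical interface that must be safe to call from several threads, and each worker gets a disjoint key range and its own buffer so nothing mutable is shared. The classic HBase `HTable` client is documented as not thread-safe, so a real sink would typically hold one client instance per worker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/** Splits a key range across worker threads and writes each slice in parallel. */
final class ThreadedWriter {
    /** Hypothetical store wrapper; implementations must tolerate concurrent calls. */
    interface Sink {
        void write(long key, byte[] payload);
    }

    /** Writes keys [0, totalRows) using the given number of threads; returns rows written. */
    static long writeInParallel(Sink sink, long totalRows, int threads, int packetSize) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong written = new AtomicLong();
        List<Future<?>> pending = new ArrayList<>();
        try {
            long perThread = (totalRows + threads - 1) / threads; // ceiling division
            for (int t = 0; t < threads; t++) {
                final long from = t * perThread;
                final long to = Math.min(totalRows, from + perThread);
                pending.add(pool.submit(() -> {
                    byte[] packet = new byte[packetSize]; // per-thread buffer, nothing shared
                    for (long key = from; key < to; key++) {
                        sink.write(key, packet);
                        written.incrementAndGet();
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface worker failures
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while writing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a writer thread failed", e);
        } finally {
            pool.shutdown();
        }
        return written.get();
    }
}
```

Waiting on every `Future` matters: without it a failed worker goes unnoticed and the reported row count is silently short.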
Write Load Tool (multiple clients)
• 300k rows, single thread, single client: 14430
  ms, 2079 p/s; about right.
• 300k rows, 3 threads: 22804 ms
• M/R, 30 mappers: 24289 ms
• M/R is better when input data needs combining or
  processing. M/R & threads
  comparison about right. Threads should
  increase performance... OK, writing my own.
Application Level Perf
• Not transactional.
• Simulate a reporting store; writes concurrent
  w/web page reads.
• Compare w/SQL Server and MongoDB, which have
  column indexes.
• You may not need column indexes if the schema is designed
  correctly. ESN not key; will need consecutive
  keys to split into balanced regions.
Web GUI
• Demo: webpage & writes into DB. Test MS SQL
  Server packets/sec using the same setup.
• Run a LIKE '%asdf%' query that matches no data to see if there
  is a timeout
Read Performance
• Index search through webpage w/writer is fast: 50-100 ms,
  <10-20 ms if in cache
• Don't do full table scans, like count 'table name' in the hbase shell
    – i.e. SELECT COUNT(*) FROM table
• PIG/HIVE are faster on top of HBase b/c they store metadata
• Full table scan:
    10 rows: 18 ms
    100 rows: 11-166 ms
    1000 rows: 638 ms
    10k rows: 4.3 s
    100k rows: 38 s (not printed)
• Use filters for search: exact match, regex, substring, more
Read Path/SCAN/Filters
SingleColumnValueFilter
• Search for a specific
  value: constant, regex, prefix. Did not try
  others
• Same queries as before; search for specific
  values, testing 100k-1M rows.
• W/o filters, use an iterator to hold the result set,
  iterate through each result, and test each result
  value, like DB drivers. A filter reduces the result set
  size from all rows to only the rows which meet the
  condition
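The difference between iterating over every result on the client and letting a filter trim the result set can be shown with an in-memory stand-in. This is a sketch, not HBase code: `FakeTable` and its `rowsShipped` counter are hypothetical and only model how many rows cross from the store to the client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Contrasts client-side matching with a predicate applied at the data source. */
final class FilterPushdownSketch {
    /** Hypothetical in-memory table; each row is a single column value. */
    static final class FakeTable {
        final List<String> rows;
        long rowsShipped = 0; // rows handed to the client, standing in for transfer cost

        FakeTable(List<String> rows) {
            this.rows = rows;
        }

        /** Unfiltered scan: every row is shipped to the caller. */
        List<String> scan() {
            rowsShipped += rows.size();
            return new ArrayList<>(rows);
        }

        /** Filtered scan: the predicate runs at the source and only matches are shipped. */
        List<String> scan(Predicate<String> filter) {
            List<String> out = new ArrayList<>();
            for (String value : rows) {
                if (filter.test(value)) {
                    out.add(value);
                }
            }
            rowsShipped += out.size();
            return out;
        }
    }

    /** Client-side approach: pull everything, then test each value locally. */
    static List<String> clientSideMatch(FakeTable table, String wanted) {
        List<String> hits = new ArrayList<>();
        for (String value : table.scan()) {
            if (value.equals(wanted)) {
                hits.add(value);
            }
        }
        return hits;
    }
}
```

Note that the filtered scan still tests every row at the source; it only reduces what is returned, so scan time keeps growing with table size even when the result set is small.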
Column Value
• Filter filter = new
  SingleColumnValueFilter(Bytes.toBytes("CF"), Bytes.toBytes("Key5"),
  CompareOp.EQUAL, Bytes.toBytes("bob"));
• Filter f = new
  SingleColumnValueFilter(Bytes.toBytes("CF"), Bytes.toBytes("COLUMN"),
  CompareOp.EQUAL, new RegexStringComparator("z*"));
• 565 ms for 200k rows with a result set of 115 rows returned
  (printed); small result sets are faster.
Column Value Searches
• 100k row table
  – Returning .1% of results (10): 5 s
  – Returning 1% of results (100): 11.29 s
• 1M row table
  – Returning 1% of results (10k): 212 s
  – Returning .1% of results (1k): 204.057 s
Compose row key w/values or index tables
• Add a second table where the row keys are
  composed partially of the values
• Secondary table consistency: not needed for a
  reporting system? Consistent on inserts or bulk
  import.
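The index-table idea above can be sketched with a sorted map standing in for HBase's sorted row keys. Everything here is an illustrative assumption: the `value + separator + primaryKey` key layout, the NUL separator (which must never occur in a value), and the class itself. The point is that once the value is the leading part of the row key, a value lookup becomes a short key-range scan instead of a filtered full scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Index table whose row keys embed the column value, so a value lookup is a key-range scan. */
final class IndexTableSketch {
    private static final char SEP = '\u0000'; // assumed separator; must not occur in values

    // TreeMap keeps keys sorted, standing in for sorted row keys in the index table.
    private final TreeMap<String, String> index = new TreeMap<>();

    /** Composite index key: the column value first, then the primary row key. */
    static String indexKey(String value, String primaryKey) {
        return value + SEP + primaryKey;
    }

    /** Records that the row with this primary key holds the given value. */
    void put(String primaryKey, String value) {
        index.put(indexKey(value, primaryKey), primaryKey);
    }

    /** Returns primary keys of all rows with exactly this value, via a range scan. */
    List<String> lookup(String value) {
        String start = value + SEP;              // first possible key for this value
        String stop = value + (char) (SEP + 1);  // just past the last key for this value
        return new ArrayList<>(index.subMap(start, true, stop, false).values());
    }
}
```

In a real deployment the main table and the index table are written separately, so the consistency question raised on the slide applies to whichever write happens second.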
Build Environment
• Ready for CI (Jenkins)
• Ubuntu-specific process for changing
  code: make all, make deb, make apt, then
  install using apt-get install hadoop* hbase*.
• Need to start over with yum for CentOS.
• Demo
• Also ready for command line w/o GUI
Hbase org.apache.hadoop.hbase.PerfEval xx xx
Distributed mode
• Setup build environment
• Distributed mode setup. Zookeeper error
  message:
• Disable ipv6? Debugging
Docs:
• Bigtop/updated version of CDH
• Installation:
• Build Docs: Ubuntu/deb; big change to rpms;
  takes time to document and debug. Can do
  both, takes time.
• Distributed Mode:
• NXServer/NXClient:
• Screen:

Hadoop/HBase POC framework

  • 1. Hadoop/HBase POC v1 Review A framework for Hadoop/HBase POC
  • 2. POC • Proof Of Concept, usually in competition with another product. • Given use case: – Performance: critical path (speed), most benchmark read performance,shard for write performance – Cost: H/W + administrative cost – Look at Hbase+Hadoop vs. MongoDB
  • 3. HBase • Transactional store; 70k messages/sec 1.5kb/message. >1GB ethernet speeds • What is Hbase?, Sources
  • 4. Cloudera HBase Training Materials • Exercises: http://async.pbworks.com/w/file/55596671/H Base_exercise_instructions.pdf • Training Slides: http://async.pbworks.com/w/file/54915308/Cl oudera_hbase_training.pdf • Training VM; 2GB put somewhere else.
  • 5.
  • 6. System Design on working components
  • 7. HDFS vs. Hbase • Replication and distributed FS. Think NFS not just replicas. Metadata at central NameNode, single point of failure. Secondary NN as hot backup. Failure and recovery protocol testing not part of POC • Blocks, larger is better. Blocks are replicated. Not cells. • HDFS write once, was modified to append to file for HBase. • MapR HDFS compatible: – fast adoption w/Hbase; snapshots – Cross data center mirroring, consistent mirroring – Star replication vs. chain replication – FileServer vs. TaskTracker, Warden vs. NN. No Single point failure
  • 8. RS + DN on same machine
  • 9.
  • 10.
  • 11.
  • 13. Hbase Disks(book) • No RAID on slaves, master ok. Use IOPS
  • 14.
  • 16. Transactional Write Perf. • Factor out network, multiple clients, any disk seeks from test program • Create test packets in memory only. • Write perf function of Instance memory, packet size,
  • 18. Run on Amazon AWS first • INSTANCES: – SMALL INSTANCE: 1.7GB – LARGE INSTANCE: 7.5GB – HIMEM XLARGE: 17GB, 34GB, 68GB – SSD DRIVES!!
  • 19. Write performance, 300k m/s 1500 bytes synthetic data. Series 2 3500 3000 2500 2000 1500 Series 2 1000 500 0 1.7 7.5 17 34 68
  • 20. Dell Notes: • MapR says 16GB/Cloudera 24GB, • plot heap size instead. • Dell, is this slowing down performance? • Take out a dimm? • Reproduce results first?
  • 21. HBase write perf, 1M byte/s • http://www.slideshare.net/jdcryans/performa nce-bof12, 100k-40k/second 10 byte packets
  • 22. Write test code • No network, no disk accesses. Run on local node
  • 23. Hbase AWS Packet Size 16-1500 bytes • http://async.pbworks.com/w/file/55320973/A WSHBasePerf16_1500bytepacket.xlsx
  • 24. Hbase Write Perf, 1500 byte packets • Single thread, single node. Should be >> w/more threads or async client • 16 Byte: 11235p/s • 40 Byte: 8064p/s • 80 Byte: 5263p/s • 1500 Byte:3311p/s • 8GB Heap, big regions(optimizations in file names), etc…12-20 tried, 4 make diff
  • 25. AWS Reduce #RPC • Batch Mode, 1000 inserts = 1000 RPCs, reduce to 1 RPC w/Batch, 3610 p/s (5.4Mb/s, pass error check, m22xlarge instance). Note:mongo 2.5 2 1.5 Series1 1 0.5 0 1 16 31 46 61 76 91 106 121 136 151 166 181 196 211 226 241 256 271 286
  • 26. Dell H/W Perf. (default config) is worse: 2262 p/s vs. 3311 p/s (AWS) http://async.pbworks.com/w/file/55225682/graphdell1500bytepacket8gb.txt
  • 27. Dell, WAL off: 2262 -> 2688 p/s (+18.8%) [Chart: Series1; x-axis 1 to 295, y-axis 0 to 1]
  • 28. Dell, WAL disabled, big heap, big regions, need more time: 2262 -> 3557 p/s, 57% increase [Chart: Series1; x-axis 1 to 295, y-axis 0 to 6]
  • 29. AWS SSD (3267 p/s) vs. EBS (4042 p/s), no compaction. Red: m2large. Maybe AWS is using SSD? [Chart: Series1, Series2; x-axis 1 to 295, y-axis 0 to 5]
  • 30. AWS (3500-4k packets/sec) vs. Dell • AWS: 3-4k p/s in default configuration w/o optimization. • Dell (3557 p/s) is slower than AWS (3610 p/s optimized m2.2xlarge, 4240 p/s m2large) • Faster h/w instances in AWS make a difference. Lesson (4210 p/s): controlling regions and compactions has an impact on performance, as does fast IO. Spend time on this later. • User error w/Dell h/w somewhere. Can’t be that slow! • Could run a benchmark on m2.2xlarge over a 24h period to see variability in perf. Not worth the time investment
  • 31. Dell Tuning • Ext3 vs. ext4: 5% diff in benchmarks, no diff in p/s performance. • RAID levels? JBOD not available. • Maybe m2.2xlarge is high perf because its AWS drives are SSD? Seems funny w/pricing structure. • noatime, 256k block sizes • Goal: 4k p/s?
  • 32. Bulk Load (worth time investment?) • Quick error check • Take an existing table, export it, bulk load it. Command line; very rough. • Should redo w/a Java program. WAL off is an approximation
  • 33. Write Clients for NoSQL • HBase, Mongo, and Cassandra have threads behind them; a threaded or async client is needed to get full performance. • Need more time; higher priority than dist mode, and needed in dist mode • Lock timeout behavior; insert 1 row • Need a threaded or async client. Most get the threaded design wrong?
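A threaded write client of the kind slide 33 asks for could be structured roughly as below. This is a hedged, stdlib-only sketch: `ThreadedLoadClient` and `RowSink` are hypothetical names, and `RowSink.insert` stands in for whatever put call the real store client exposes. Each worker gets a disjoint row range and its own in-memory packet, so no disk or network is involved in generating load.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a multi-threaded load client. */
public class ThreadedLoadClient {
    /** Stand-in for one insert; a real client would issue a put here. */
    public interface RowSink {
        void insert(long rowId, byte[] payload);
    }

    /** Splits totalRows across threads, runs them, and returns rows inserted. */
    public static long run(int threads, long totalRows, int packetBytes, RowSink sink)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            long perThread = totalRows / threads;
            for (int t = 0; t < threads; t++) {
                final long start = t * perThread;
                // The last thread also takes the remainder.
                final long end = (t == threads - 1) ? totalRows : start + perThread;
                Callable<Long> task = () -> {
                    byte[] payload = new byte[packetBytes]; // in-memory test packet
                    for (long row = start; row < end; row++) sink.insert(row, payload);
                    return end - start;
                };
                results.add(pool.submit(task));
            }
            long inserted = 0;
            for (Future<Long> f : results) inserted += f.get(); // waits; rethrows worker failures
            return inserted;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicLong seen = new AtomicLong();
        long t0 = System.nanoTime();
        long rows = run(3, 300_000, 1500, (id, payload) -> seen.incrementAndGet());
        double secs = (System.nanoTime() - t0) / 1e9;
        System.out.printf("%d rows in %.3f s (%.0f rows/s)%n", rows, secs, rows / secs);
    }
}
```

With a real store behind `RowSink`, the sink must be safe to call from several threads at once, or each worker needs its own connection; sharing a non-thread-safe client is one common way to get the threaded design wrong.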
  • 34. Write Load Tool (multiple clients) • 300k rows, single thread, single client: 14430 ms, 2079p/s; about right. • 300k rows, 3 threads: 22804 ms • M/R, 30 mappers: 24289 ms • M/R is better when the input data needs combining or processing. M/R & threads comparison about right. Threads should increase performance… OK, writing my own…
  • 35. Application Level Perf • Not transactional… • Simulate a reporting store; writes concurrent w/web page reads. • Compare w/SQL Server and MongoDB, which have column indexes. • You may not need column indexes if the schema is designed correctly. ESN is not the key; will need consecutive keys to split into balanced regions.
  • 36. Web GUI • Demo: webpage & writes into DB. Test MS SQL Server packets/sec using the same setup. • Do a LIKE '%asdf%' query that matches no data to see if there is a timeout
  • 37. Read Performance • Index search through the webpage w/writer running is fast: 50-100 ms, <10-20 ms if in cache • Don’t do full table scans, such as count 'table name' in the hbase shell (like SELECT COUNT(*) FROM table) • PIG/HIVE are faster on top of HBase b/c they store metadata • Full table scan: 10 rows: 18 ms; 100 rows: 11-166 ms; 1000 rows: 638 ms; 10k rows: 4.3 s; 100k rows: 38 s (not printed) • Use filters for search: exact match, regex, substring, more
  • 39. SingleColumnValueFilter • Search for a specific value: constant, regex, prefix. Did not try others • Same queries as before; search for specific values, testing 100k-1M rows. • W/O filters, use an iterator to hold the result set, iterate through each result, and test each result value, like DB drivers. A filter reduces the result set from all rows to only the rows which meet the condition
  • 40. Column Value • Filter filter = new SingleColumnValueFilter(Bytes.toBytes("CF"), Bytes.toBytes("Key5"), CompareOp.EQUAL, Bytes.toBytes("bob")); • Filter f = new SingleColumnValueFilter(Bytes.toBytes("CF"), Bytes.toBytes("COLUMN"), CompareOp.EQUAL, new RegexStringComparator("z*")); • 565 ms for 200k rows, 115 results returned (printed); small result sets are faster.
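The difference between testing every value on the client and letting a filter shrink the result set (slides 39-40) can be illustrated with a toy model. This sketch is stdlib-only and does not use the HBase API: `FilterSketch`, `scan`, and `syntheticColumn` are hypothetical, and a `Predicate` plays the role of the server-side filter. It shows only how many values get shipped to the client, not real timings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical toy model of scanning with and without a server-side filter. */
public class FilterSketch {
    /** Models a scan: returns the values that would be shipped to the client. */
    public static List<String> scan(List<String> column, Predicate<String> serverSideFilter) {
        List<String> shipped = new ArrayList<>();
        for (String value : column) {
            if (serverSideFilter.test(value)) shipped.add(value);
        }
        return shipped;
    }

    /** Synthetic column: every 100th row holds "bob", the rest hold "row<i>". */
    public static List<String> syntheticColumn(int rows) {
        List<String> column = new ArrayList<>(rows);
        for (int i = 0; i < rows; i++) column.add(i % 100 == 0 ? "bob" : "row" + i);
        return column;
    }

    public static void main(String[] args) {
        List<String> column = syntheticColumn(100_000);

        // No filter: every value is shipped and the client tests each one itself.
        List<String> all = scan(column, v -> true);
        long clientMatches = all.stream().filter("bob"::equals).count();

        // With a filter: only matching values are shipped.
        List<String> filtered = scan(column, "bob"::equals);

        System.out.println("shipped without filter: " + all.size()
                + " (client-side matches: " + clientMatches + ")");
        System.out.println("shipped with filter:    " + filtered.size());
    }
}
```

Both paths still read every row on the server side, which is consistent with the slide's observation that small result sets are faster while the scan itself stays expensive.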
  • 41. Column Value Searches • 100k row table – Returning .1% of results (10): 5 s – Returning 1% of results (100): 11.29 s • 1M row table – 1% of results (10k): 212 s – .1% of results (1k): 204.057 s
  • 42. Compose row key w/values or index tables • Add a second table whose row keys are composed partially of the values • Secondary table consistency: not needed for a reporting system? Consistent on inserts or bulk import.
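The index-table idea on slide 42 can be sketched as follows. This is a minimal stdlib-only model under stated assumptions: a `TreeMap` stands in for a table whose rows are sorted by key, `ValueIndex` and its methods are hypothetical names, index keys have the form value + separator + primary key, and values are assumed never to contain the separator character. A lookup by value then becomes a key-range scan instead of a full table scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch of a secondary index table keyed by value + primary key. */
public class ValueIndex {
    // Assumption: column values never contain this separator.
    private static final char SEP = '\u0000';

    // Sorted map models a table whose rows are ordered by row key.
    private final TreeMap<String, String> index = new TreeMap<>();

    /** Composite index key: the value first, so equal values sort together. */
    static String indexKey(String value, String primaryKey) {
        return value + SEP + primaryKey;
    }

    /** Called on every insert into the main table to keep the index consistent. */
    public void put(String primaryKey, String value) {
        index.put(indexKey(value, primaryKey), primaryKey);
    }

    /** Range scan over [value+SEP, value+(SEP+1)): primary keys with exactly this value. */
    public List<String> lookup(String value) {
        String start = value + SEP;
        String stop = value + (char) (SEP + 1);
        return new ArrayList<>(index.subMap(start, true, stop, false).values());
    }

    public static void main(String[] args) {
        ValueIndex idx = new ValueIndex();
        idx.put("row1", "bob");
        idx.put("row2", "bobby");
        idx.put("row3", "bob");
        System.out.println(idx.lookup("bob"));
    }
}
```

A real table sorts row keys as bytes rather than as Java strings; for plain ASCII keys the two orders agree, which is all this sketch relies on.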
  • 43. Build Environment • Ready for CI (Jenkins) • Ubuntu-specific process for changing code: make all, make deb, make apt, then install using apt-get install hadoop* hbase*. • Need to start over with yum for CentOS. • Demo • Also ready for command line w/o GUI: hbase org.apache.hadoop.hbase.PerfEval xx xx
  • 44. Distributed mode • Set up build environment • Distributed mode setup. ZooKeeper error message: • Disable IPv6? Debugging
  • 45. Docs: • Bigtop/updated version of CDH • Installation: • Build docs: Ubuntu/deb; moving to RPMs is a big change and takes time to document and debug. Can do both, but it takes time. • Distributed Mode: • NXServer/NXClient: • Screen: