This document discusses benchmarking Hadoop and big data systems. It provides an overview of common Hadoop benchmarks including microbenchmarks like TestDFSIO, TeraSort, and NNBench which test individual Hadoop components. It also describes BigBench, a benchmark modeled after TPC-DS that aims to test a more complete big data analytics workload using techniques like MapReduce, Hive, and Mahout across structured, semi-structured, and unstructured data. The document emphasizes using Hadoop distributions for administration and both microbenchmarks and full benchmarks like BigBench for evaluation.
Hadoop & Big Data benchmarking
1. Benchmarking
Hadoop & Big Data benchmarking
Dr. ir. ing. Bart Vandewoestyne
Sizing Servers Lab, Howest, Kortrijk
IWT TETRA User Group Meeting - November 28, 2014
5. Benchmarking
Intro: Hadoop essentials
Hadoop 1.0
(Source: Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014)
MapReduce and HDFS are the core components, while other components are built around the core.
6. Benchmarking
Intro: Hadoop essentials
Hadoop 2.0
(Source: Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014)
YARN adds a more general interface to run non-MapReduce jobs within the Hadoop framework.
21. Benchmarking
Benchmarks
Why benchmark?
My three reasons for using benchmarks:
1. Evaluating the effect of a hardware/software upgrade:
OS, Java VM, ...
Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
2. Debugging:
Compare with other clusters or published results.
3. Performance tuning:
E.g. Cloudera CDH default configuration ...
25. Benchmarking
Benchmarks
Micro Benchmarks
TestDFSIO
Read and write test for HDFS.
Helpful for
getting an idea of how fast your cluster is in terms of I/O,
stress testing HDFS,
discover network performance bottlenecks,
shake out the hardware, OS and Hadoop setup of your cluster
machines (particularly the NameNode and the DataNodes).
TestDFSIO write test: 10 files of size 1 GB for a total of 10 GB:
$ hadoop jar hadoop-*test*.jar \
    TestDFSIO -write -nrFiles 10 -fileSize 1000
TestDFSIO is designed to use 1 map task per file.
30. Benchmarking
Benchmarks
Micro Benchmarks
TestDFSIO: write test output
Typical output of write test
----- TestDFSIO ----- : write
Date time: Mon Oct 06 10:21:28 CEST 2014
Number of files: 10
Total MBytes processed: 10000.0
Throughput mb/sec: 12.874702111579893
Average IO rate mb/sec: 13.013071060180664
IO rate std deviation: 1.4416050051562712
Test exec time sec: 114.346
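The result block above is plain "key: value" text, so it is easy to post-process when comparing several runs. A minimal sketch in Python, assuming the field names exactly as shown above; `parse_testdfsio` is a hypothetical helper of my own, not part of Hadoop:

```python
SAMPLE = """\
----- TestDFSIO ----- : write
Date time: Mon Oct 06 10:21:28 CEST 2014
Number of files: 10
Total MBytes processed: 10000.0
Throughput mb/sec: 12.874702111579893
Average IO rate mb/sec: 13.013071060180664
IO rate std deviation: 1.4416050051562712
Test exec time sec: 114.346
"""


def parse_testdfsio(text):
    """Parse one TestDFSIO result block into a dict (hypothetical helper).

    Numeric values become floats; everything else stays a string.
    """
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that do not look like "key: value"
        key, value = key.strip(), value.strip()
        if key.startswith("-----"):
            # banner line, e.g. "----- TestDFSIO ----- : write"
            result["operation"] = value
            continue
        try:
            result[key] = float(value)
        except ValueError:
            result[key] = value  # e.g. the date line
    return result


stats = parse_testdfsio(SAMPLE)
print(stats["operation"], stats["Throughput mb/sec"], stats["Test exec time sec"])
```

Collecting such dicts for several runs makes it straightforward to compare throughput before and after an upgrade or a configuration change.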
TestDFSIO read test: read 10 files, each of size 1 GB:
$ hadoop jar hadoop-*test*.jar \
    TestDFSIO -read -nrFiles 10 -fileSize 1000
38. Benchmarking
Benchmarks
Micro Benchmarks
TestDFSIO: read test output
Typical output of read test
----- TestDFSIO ----- : read
Date time: Mon Oct 06 10:56:15 CEST 2014
Number of files: 10
Total MBytes processed: 10000.0
Throughput mb/sec: 402.4306813151435
Average IO rate mb/sec: 492.8257751464844
IO rate std deviation: 196.51233829270575
Test exec time sec: 33.206
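Throughput and average IO rate differ noticeably in the read test (402 vs. 493 MB/s) because they are aggregated differently. As far as I understand the TestDFSIO source, throughput divides the total data volume by the summed task times, whereas the average IO rate is the mean of the per-file rates, so a few slow tasks pull the throughput down more strongly. The sketch below illustrates this with made-up per-task numbers; the function name and the sample values are illustrative assumptions, not measurements from this talk:

```python
import math


def aggregate(tasks):
    """Aggregate per-task results the way TestDFSIO is assumed to do it.

    tasks: list of (megabytes, seconds) tuples, one per map task / file.
    """
    total_mb = sum(mb for mb, _ in tasks)
    total_s = sum(s for _, s in tasks)
    rates = [mb / s for mb, s in tasks]
    mean_rate = sum(rates) / len(rates)
    # population variance of the per-task rates
    variance = sum(r * r for r in rates) / len(rates) - mean_rate ** 2
    return {
        "throughput": total_mb / total_s,   # total volume / total task time
        "avg_io_rate": mean_rate,           # mean of the per-task rates
        "io_rate_std": math.sqrt(abs(variance)),
    }


# Hypothetical example: two 1000 MB files, one fast task and one slow task.
stats = aggregate([(1000, 2.0), (1000, 10.0)])
print(stats)
```

In this toy case the average IO rate is 300 MB/s while the throughput is only about 167 MB/s, which is the same kind of gap, together with a large standard deviation, that the read test shows.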
39. Benchmarking
Benchmarks
Micro Benchmarks
Influence of HDFS replication factor
When interpreting TestDFSIO results, keep in mind:
The HDFS replication factor plays an important role!
A higher replication factor leads to slower writes.
For three identical TestDFSIO write runs (units are MB/s):
                   HDFS replication factor
                   1           2          3
Throughput         190         25         13
Average IO rate    190 ± 10    25 ± 3     13 ± 1
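To make the effect concrete, the slowdown relative to replication factor 1 can be computed directly from the throughput numbers above. A small sketch (the variable names are my own):

```python
# Throughput in MB/s per HDFS replication factor, taken from the numbers above.
throughput_mb_s = {1: 190, 2: 25, 3: 13}

baseline = throughput_mb_s[1]
slowdown = {rf: baseline / t for rf, t in throughput_mb_s.items()}

for rf, factor in slowdown.items():
    print(f"replication factor {rf}: {throughput_mb_s[rf]} MB/s, slowdown x{factor:.1f}")
```

On this cluster, going from replication factor 1 to 3 costs roughly a factor of 15 in write throughput, so always state the replication factor alongside TestDFSIO write results.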
40. Benchmarking
Benchmarks
Micro Benchmarks
TeraSort
Goal
Sort 1 TB of data (or any other amount of data) as fast as possible.
Probably the best-known Hadoop benchmark.
Combines testing the HDFS and MapReduce layers of a
Hadoop cluster.
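For "any other amount of data", the input is typically generated with TeraGen, which, to my knowledge, is given the number of 100-byte rows rather than a size in bytes. A small sketch for converting a target size into a row count; the helper name is hypothetical and the 100-byte row size is stated here as an assumption about TeraGen:

```python
ROW_SIZE_BYTES = 100  # assumed size of one TeraGen record


def teragen_rows(target_bytes):
    """Number of rows needed to generate roughly target_bytes of input."""
    return target_bytes // ROW_SIZE_BYTES


rows_1tb = teragen_rows(10**12)  # 1 TB, decimal
print(rows_1tb)
```

Under that assumption, 1 TB corresponds to 10 billion rows, and a quick 10 GB smoke test to 100 million rows.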
Typical areas where TeraSort is helpful
Iron out your Hadoop configuration ...
49. Benchmarking
Benchmarks
Micro Benchmarks
NNBench
Goal
Load test the NameNode hardware and software.
Generates a lot of HDFS-related requests with normally very
small payloads.
Purpose: put a high HDFS management stress on the
NameNode.
Can simulate requests for creating, reading, renaming and
deleting files.
Create 1000 files using 12 maps and 6 reducers:
$ hadoop jar hadoop-*test*.jar nnbench \
    -operation create_write \
    -maps 12 \
    -reduces 6 \
    -blockSize 1 \
    -bytesToWrite 0 \
    -numberOfFiles 1000 \
    -replicationFactorPerFile 3 \
    -readFileAfterOpen true \
    -baseDir /user/bart/NNBench-`hostname -s`
53. Benchmarking
Benchmarks
Micro Benchmarks
MRBench
Goal
Loop a small job a number of times.
checks whether small job runs are responsive and running
efficiently on the cluster
complementary to TeraSort
puts its focus on the MapReduce layer
impact on the HDFS layer is very limited
54. Benchmarking
Benchmarks
Micro Benchmarks
MRBench: example
Run a loop of 50 small test jobs:
$ hadoop jar hadoop-*test*.jar \
    mrbench -baseDir /user/bart/MRBench \
    -numRuns 50
Example output:
DataLines  Maps  Reduces  AvgTime (milliseconds)
1          2     1        28822
→ The average finish time of the executed jobs was about 28.8 seconds.
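MRBench reports AvgTime in milliseconds. A small sketch for turning the result row into seconds, assuming the column layout matches the output shown above (the helper name is my own):

```python
MRBENCH_OUTPUT = """\
DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 28822
"""


def parse_mrbench(output):
    """Parse the MRBench header and result row into a dict of ints."""
    header_line, row_line = output.strip().splitlines()[:2]
    names = header_line.split()[:4]  # drop the "(milliseconds)" unit token
    values = [int(v) for v in row_line.split()]
    return dict(zip(names, values))


result = parse_mrbench(MRBENCH_OUTPUT)
avg_seconds = result["AvgTime"] / 1000
print(f"average job time: {avg_seconds:.1f} s")
```

Tracking this value over time is a cheap way to notice when small-job latency on the cluster degrades.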
59. Benchmarking
Benchmarks
BigBench
BigBench
Big Data benchmark based on TPC-DS.
Focus is mostly on MapReduce engines.
Collaboration between industry and academia.
https://github.com/intel-hadoop/Big-Bench/
History
Launched at First Workshop on Big Data Benchmarking
(May 8-9, 2012).
Full kit at Fifth Workshop on Big Data Benchmarking
(August 5-6, 2014).
60. Benchmarking
Benchmarks
BigBench
BigBench data model
Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013.
61. Benchmarking
Benchmarks
BigBench
BigBench: Data Model - 3 V's
Variety
BigBench data is
structured,
semi-structured,
unstructured.
Velocity
Periodic refreshes for all data.
Different velocity for different areas:
V_structured < V_unstructured < V_semistructured
Volume
TPC-DS: discrete scale factors
(100, 300, 1000, 3000, 10000, 30000 and 100000).
BigBench: continuous scale factor.
80. Benchmarking
Conclusions
Conclusions
Use Hadoop distributions!
Hadoop cluster administration → Cloudera Manager.
Micro-benchmarks ↔ BigBench.
Your best benchmark is your own application!