Agile, Continuous Integration, DevOps, and Big Data are no longer buzzwords but part of the day-to-day work of everyone in software development and delivery. To cope with applications that need to be deployed to production almost the moment they are created, software development has changed, affecting the way of working for everyone on the team. In this talk, Roland will discuss the challenges performance testers face with Big Data applications and how architecture, Agile, Continuous Integration, and DevOps come together to create solutions.
Slide 6
Big Data refers to data that, because of its size, speed, or format (that is, its volume, velocity, or variety) cannot easily be stored, manipulated, or analyzed with traditional methods like spreadsheets, relational databases, or common statistical software.
Slide 10
Corporate Data Architecture
Data is fast before it's big. Data often streams into data systems, with events happening hundreds to tens of thousands of times a second.
http://www.internetlivestats.com/
The things we do with Fast Data:
• Ingest – get millions of events per second into the system
• Decide – make a data-driven decision on each event
• Analyze in real time – provide visibility into operational trends of the events
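The three steps above can be sketched as a toy pipeline. Everything here (class and method names, the alert threshold, the window size) is illustrative, not a real streaming framework:

```python
from collections import deque
import time

class FastDataPipeline:
    """Toy ingest/decide/analyze loop over a sliding time window."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, value) pairs inside the window
        self.alerts = []

    def ingest(self, value, now=None):
        """Ingest: accept one event into the system."""
        now = now if now is not None else time.time()
        self.events.append((now, value))
        self.decide(value)
        self._expire(now)

    def decide(self, value):
        """Decide: a data-driven decision on each event (flag outliers)."""
        if value > 100:
            self.alerts.append(value)

    def _expire(self, now):
        # Drop events that have fallen out of the sliding window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()

    def analyze(self):
        """Analyze in real time: event rate and average over the window."""
        if not self.events:
            return {"rate": 0.0, "avg": 0.0}
        span = max(self.events[-1][0] - self.events[0][0], 1e-9)
        values = [v for _, v in self.events]
        return {"rate": len(values) / span, "avg": sum(values) / len(values)}
```

A real fast-data system does the same three things, only at millions of events per second and distributed over a cluster.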
Slide 13
Component Performance Testing: These systems are made up of multiple components, and
it is essential to test each of these components in isolation.
Slide 15
Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size.
The logic for a real-time application is packaged into a Storm topology; a Storm topology is analogous to a MapReduce job. A spout is a source of streams in a topology. Streams are composed of tuples, and the tuple is the main data structure in Storm: a named list of values, where each value can be any type. Bolts can do anything from filtering and applying functions to aggregations, joins, talking to databases, and more.
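As a mental model, the spout/bolt/tuple concepts can be sketched in plain Python. Storm's real API is JVM-based, so every name below is hypothetical:

```python
# Illustrative model of Storm's core concepts; not the actual Storm API.

def sentence_spout():
    """A spout is a source of streams: it emits tuples."""
    for line in ["the quick brown fox", "the lazy dog"]:
        yield {"sentence": line}   # a tuple: a named list of values

def split_bolt(stream):
    """A bolt transforms a stream: here, splitting sentences into words."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}

def count_bolt(stream):
    """A bolt can also aggregate: rolling word counts."""
    counts = {}
    for tup in stream:
        counts[tup["word"]] = counts.get(tup["word"], 0) + 1
    return counts

# A topology wires spouts and bolts together into a processing graph.
word_counts = count_bolt(split_bolt(sentence_spout()))
```

The per-tuple work inside a bolt (here, `split` and the dictionary update) is exactly where component-level performance testing pays off, since it runs once for every event in the stream.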
Slide 17
Due to a lack of real-world streaming benchmarks, we
developed one to compare Apache Flink, Apache Storm
and Apache Spark Streaming. It is released as open
source: https://github.com/yahoo/streaming-benchmarks
Storm Benchmark tools authored by Taylor Goetz -
https://github.com/ptgoetz/storm-benchmark
Storm Benchmark authored by Manu Zhang -
https://github.com/manuzhang/storm-benchmark
Slide 18 (13-04-2016)
Apache distribution:
• TestDFSIO: read and write test for HDFS.
• TeraSort: sorts 1 TB of data (or any other amount) as fast as possible; it combines testing of the HDFS and MapReduce layers of a Hadoop cluster.
• NNBench: used for load testing the NameNode hardware and configuration.
• MRBench: checks whether small jobs are responsive and run efficiently on your cluster.
HiBench: a Hadoop benchmark suite consisting of both micro-benchmarks and real-world applications:
https://software.intel.com/en-us/blogs/2012/10/15/use-hibench-as-a-representative-proxy-for-benchmarking-hadoop-
applications
Slide 19
Monitoring
Chukwa is an open-source data collection system for monitoring and analyzing large distributed systems. It is built on top of Hadoop and includes a powerful and flexible toolkit for monitoring, analyzing, and viewing results. Many components of Chukwa are pluggable, allowing easy customization and enhancement.
Slide 20
Dr. Elephant is a performance monitoring and tuning tool for Hadoop and Spark. It automatically gathers all the metrics, runs analysis on them, and presents them in a simple way for easy consumption.
Open-sourced by LinkedIn on 08-04-2016.
Slide 21
Thinking Scalability
Scalability is the ability of software to maintain performance under increasing load by adding resources linearly. But achieving scalability requires more than just adding resources and tuning performance; it requires thinking holistically about software design, quality, maintainability, and performance.
Necessary conditions for scalability:
• The software has a sound architecture and high quality.
• The software is easy to release, monitor, and tweak.
• Software performance can keep up with additional load by adding resources linearly.
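One way to make the last condition measurable is to compare observed throughput against ideal linear scaling. The function and the example numbers below are illustrative, not measurements from any real cluster:

```python
def scaling_efficiency(baseline_nodes, baseline_tps, nodes, tps):
    """Fraction of ideal linear scaling achieved when growing a cluster.

    1.0 means perfectly linear scaling; values well below 1.0 signal
    a bottleneck (shared state, network, skewed partitioning, ...).
    """
    ideal_tps = baseline_tps * (nodes / baseline_nodes)
    return tps / ideal_tps

# Hypothetical example: 4 nodes handle 40K events/sec, but 16 nodes
# only reach 120K events/sec instead of the ideal 160K.
eff = scaling_efficiency(4, 40_000, 16, 120_000)   # 0.75
```

Tracking this ratio across cluster sizes during performance testing tells you whether "just add resources" will actually keep up with the load.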
Slide 26
Docker lets you limit a container's CPU resources with the --cpu-shares flag.
Two containers (total shares 1536):
• DataBase @1024: ~66%
• WebServer @512: ~33%
Three containers (total shares 3584):
• DataBase @1024: ~28%
• ApplicationServer @2048: ~57%
• WebServer @512: ~14%
CPU shares differ from memory limits in that they're enforced only when there is contention for time on the CPU. If other processes and containers are idle, then a container may burst well beyond its limits.
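The percentages above are simply each container's shares divided by the total shares of all running containers. A small sketch (the container names are illustrative; 1024 is Docker's default share weight):

```python
def cpu_share_percentages(shares):
    """Relative CPU each container can claim under full contention,
    given its --cpu-shares value (Docker's default weight is 1024)."""
    total = sum(shares.values())
    return {name: 100.0 * s / total for name, s in shares.items()}

# Two containers, total shares 1536:
two = cpu_share_percentages({"database": 1024, "webserver": 512})
# database ~66%, webserver ~33%

# Add an application server at 2048, total shares 3584:
three = cpu_share_percentages(
    {"database": 1024, "webserver": 512, "appserver": 2048})
# database drops to ~28%, webserver to ~14%, appserver gets ~57%
```

Note that a container's effective share changes whenever any other container starts or stops, which is exactly why results on a shared Docker cloud are hard to reproduce.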
Editor's Notes
In 2014 I spoke about the importance of mobile performance testing. Recent research revealed that performance is number 2 on the list of problems users encounter with apps, so there is still a lot to do in performance testing for mobile. Today I will not talk about mobile performance, but about performance testing for Big Data.
My experience with Big Data started in 2000, when I was working for Global Crossing as part of the global engineering team building a pan-European network. A lot of telcos were doing the same: KPN, KPNQwest, Deutsche Telekom, BT, and Worldcom, to name a few. At that time, however, companies didn't need all this capacity; when Global Crossing went bankrupt (dot-com bubble), only 10% of the network's capacity was used by customers. Today, with the explosive use of the internet and Big Data, all this capacity finally gets utilised. After the bankruptcy of Global Crossing I started to work as a software tester, quickly moving to test automation and performance testing due to my technical background.
Testing in those days was done in waterfall fashion, on 2- and 3-tier applications running on real hardware. The way bridges were built in the past is comparable with this way of software development and implementation: after completion, the bridge was tested with fully loaded trucks, hoping it wouldn't collapse.
No need to talk about the explosion of mobile usage and social media apps. We moved from waterfall to Agile, from physical hardware to virtualisation. The Large Hadron Collider produces 15 petabytes of data a year. A less-known usage of Big Data is offshore wind turbines: offshore maintenance is costly, so you want to do maintenance just in time. A Dutch company has developed software that uses Big Data to compare sensor results from every turbine with the others, enabling them to do maintenance just in time.
What is Big Data? The three Vs: volume, velocity, and variety.
The latest version of the Big Data landscape, an overview of tools and applications. I'm not going to cover each and every tool in this presentation; I will focus on some commonly used solutions.
Let's do performance testing on Big Data! We need a production-like cluster as a test environment, and a second cluster to generate the load. Of course we need test data, lots of it: terabytes or even petabytes.
Oops, this is going to be expensive. And which performance test tools support end-to-end Big Data testing?
How do I get all this test data? When it's data from the wind turbines, it's fairly easy. Social media data, web shop data, basically any personal data, would need to be anonymized. For my project at Staples it took 3 months to get all data for 500 customers and 100 articles set up and synchronized in all systems. This performance testing approach is clearly not an option.
Let's step back and look at developments in engineering: nowadays bridges are no longer built and then tested after construction in the hope that they can take the load. Sophisticated tooling helps you determine the load on each element of the bridge and calculate how strong it needs to be; some tooling can even calculate the impact of temperature and strong winds. Translated to Big Data, this means we need to engineer and test the individual elements that form a Big Data solution.
Let's have a look at the corporate data architecture. Big Data starts with fast data: lots of streams with relatively small amounts of data, over time becoming Big Data. This data is ingested by our system, evaluated to make a data-driven decision, and analyzed in near real time to provide insight into developing trends.
The Lambda Architecture is composed of three layers: batch, speed, and serving.
The batch layer has two major tasks: (a) managing historical data; and (b) re-computing results such as machine learning models. Specifically, the batch layer receives arriving data, combines it with historical data and re-computes results by iterating over the entire combined data set. The batch layer operates on the full data and thus allows the system to produce the most accurate results. However, the results come at the cost of high latency due to high computation time.
The speed layer is used in order to provide results in a low-latency, near real-time fashion. The speed layer receives the arriving data and performs incremental updates to the batch layer results. Thanks to the incremental algorithms implemented at the speed layer, computation cost is significantly reduced.
Finally, the serving layer enables various queries of the results sent from the batch and speed layers.
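The division of labour between the three layers can be caricatured in a few lines of Python. This is only a toy word-count illustration; all names are made up:

```python
# Toy Lambda architecture: the batch layer recomputes over all data
# (accurate, slow), the speed layer updates incrementally (fast),
# and the serving layer merges both views.

class LambdaArchitecture:
    def __init__(self):
        self.master_data = []   # immutable, append-only history
        self.batch_view = {}    # recomputed from scratch by the batch layer
        self.speed_view = {}    # incremental updates since the last batch run

    def ingest(self, key):
        self.master_data.append(key)
        # Speed layer: incremental update, low latency.
        self.speed_view[key] = self.speed_view.get(key, 0) + 1

    def run_batch(self):
        # Batch layer: iterate over the entire combined data set.
        self.batch_view = {}
        for key in self.master_data:
            self.batch_view[key] = self.batch_view.get(key, 0) + 1
        self.speed_view = {}    # results now absorbed into the batch view

    def query(self, key):
        # Serving layer: merge the batch and speed views.
        return self.batch_view.get(key, 0) + self.speed_view.get(key, 0)
```

The cost of this design is visible even in the toy: the same counting logic exists twice, once incrementally and once as a full recomputation, which is exactly the duplication the Kappa architecture tries to remove.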
One of the important motivations for inventing the Kappa architecture was to avoid maintaining two separate code bases for the batch and speed layers. The key idea is to handle both real-time data processing and continuous data reprocessing using a single stream processing engine. Data reprocessing is an important requirement for making visible the effects of code changes on the results. As a consequence, the Kappa architecture is composed of only two layers: stream processing and serving. The stream processing layer runs the stream processing jobs; normally, a single stream processing job is run to enable real-time data processing. Data reprocessing is only done when some code of the stream processing job needs to be modified. This is achieved by running another, modified stream processing job and replaying all previous data.
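By contrast, a Kappa-style sketch needs only one processing function; replaying the log through a modified job is how code changes take effect. Again, a toy illustration with made-up names:

```python
# Toy Kappa architecture: a single stream-processing layer handles both
# live events and full reprocessing of the replayed event log.

def process(log, transform):
    """The one stream-processing job: fold a transform over the log."""
    view = {}
    for event in log:
        transform(view, event)
    return view

def count_v1(view, event):
    view[event] = view.get(event, 0) + 1

def count_v2(view, event):
    # Modified job logic: this version filters out "noise" events.
    if event != "noise":
        view[event] = view.get(event, 0) + 1

log = ["click", "noise", "click", "view"]
serving_v1 = process(log, count_v1)   # current serving view
serving_v2 = process(log, count_v2)   # reprocessing with the new code
```

Once `serving_v2` has caught up, it replaces `serving_v1` and the old job is retired; there is never a second, batch-flavoured code base to keep in sync.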
Example from Travel Bird on a MeetUp presentation. TravelBird brings you the best 6 holiday deals every day, both for domestic breaks and foreign getaways. We select the best offers to bring you the ultimate travel experience. Our aim is to surprise and inspire you through a customized and diversified holiday offering — our streamlined offer choice will help you find the best deal. They use all kinds of data, browser, time, weather, location to profile each visitor combined with trends from their Big Data system (Lambda Architecture).
Question: Who has been attending Meetups ?
Let's say I want to become a carpenter. I could buy some books about woodworking and tools, and I could probably make a shelf or a cupboard, but it would be average. To learn the real tricks of the trade I would contact a carpenter to teach me all the little details you won't find in the books. So go to meetups; talk to and interview real people in the tech field to collect knowledge and information. There are several Big Data meetups in Amsterdam, and in Istanbul too, as I checked for this presentation.
For processing of fast data I chose to talk about Apache Storm. Topologies are built by developers to extract the data needed by the business for real-time dashboards. A topology consists of several elements, and the way bolts are developed can have a big impact on performance. Performance engineering and testing of bolts is similar to unit performance testing in application development.
Spark and Samza are other well-known fast data processing engines. Based on project requirements you need to determine which one suits you best. The three frameworks use different vocabularies for similar concepts.
Running pre-established benchmarks can be a very helpful way to test the performance and scaling of your cluster without having to develop a Storm topology from scratch. Benchmarking with artificial data eliminates the need for test data. If your fast data cluster can handle only 7K messages/sec under benchmarking where 10K is required, there is work to do.
Back in 2008, Yahoo! set a record by sorting 1 TB of data in 209 seconds on a Hadoop cluster of 910 nodes, as Owen O'Malley of the Yahoo! Grid Computing Team reports. Benchmarks for Hadoop are part of the installation package. Intel open-sourced HiBench, a benchmark suite for Hadoop.
Most of the monitoring tools out there, whether open source or proprietary, are designed to collect system resource metrics and monitor cluster resources. They are focused on simplifying the deployment and management of Big Data clusters. Be aware that these tools only tell you that you are running out of resources, like the fuel gauge in your car; they do not refill the tank. Starting the race with too little fuel will lead to failure. Similarly, as in racing, benchmarking and measuring consumption lets you determine the amount of resources needed for your Big Data solution.
While we can always optimize the underlying hardware resources, network infrastructure, OS, and other components of the Big Data solution, only users have control over optimizing the jobs that run on the cluster.
Dr. Elephant's goal is to improve developer productivity and increase cluster efficiency by making it easier to tune jobs. It analyzes Hadoop and Spark jobs using a set of pluggable, configurable, rule-based heuristics that provide insights into how a job performed, and then uses the results to make suggestions about how to tune the job to make it perform more efficiently. This way, Dr. Elephant prevents non-optimized jobs from causing performance issues in production.
Most important for Big Data is scalability. It starts with architecture, engineering, quality in development, and an environment where additional resources can be added easily.
This is how architecture, engineering, Agile, and DevOps need to come together for Big Data solutions to provide scalability.
The performance engineer is the spider in the web: part of engineering and architecture, part of development and operations. He advises on capacity planning and, of course, does modelling and performance testing of individual components.
My current project uses Docker for test environments. Every time we test with the same test data, the same scripts, and the same Docker container setup; only the release is different. The results, however, vary a lot without explanation. So I did a test, running the same performance test against the same release for 14 days; again, the results show a lot of variation. Let's dive deeper into Docker resource management.
With two containers, the database can get 66% of the available CPU and the webserver can get 33%. When we add a container, the two already running containers will get less: the database drops to 28% of the available CPU without any change in settings. At my customer, all departments use the same Docker cloud, so it is not clear how many containers are active at any given moment in time. A possible solution could be to create a separate Docker cloud for performance testing.