Agile, Continuous Integration, DevOps, and Big Data are no longer buzzwords but part of the day-to-day work of everyone in software development and delivery. To cope with applications that need to be deployed to production almost the moment they are created, software development has changed, affecting the way of working for everyone on the team. In this talk, Roland will discuss the challenges performance testers face with Big Data applications and how architecture, Agile, Continuous Integration, and DevOps come together to create solutions.
Slide 6
Big Data refers to data that, because of its size, speed, or format (that is, its volume, velocity, or variety) cannot easily be stored, manipulated, or analyzed with traditional methods like spreadsheets, relational databases, or common statistical software.
Slide 10
Corporate Data Architecture
Data is fast before it's big. Data often streams into data systems, with events happening hundreds to tens of thousands of times a second.
http://www.internetlivestats.com/
The things we do with Fast Data:
• Ingest – get millions of events per second into the system
• Decide – make a data-driven decision on each event
• Analyze in real time – provide visibility into operational trends of the events
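The three steps above can be sketched as a toy pipeline. Everything here (class and method names, the alert threshold, the window size) is illustrative, not a real streaming framework:

```python
from collections import deque
import time

class FastDataPipeline:
    """Toy ingest/decide/analyze loop over a sliding time window."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, value) pairs inside the window
        self.alerts = []

    def ingest(self, value, now=None):
        """Ingest: accept one event into the system."""
        now = now if now is not None else time.time()
        self.events.append((now, value))
        self.decide(value)
        self._expire(now)

    def decide(self, value):
        """Decide: a data-driven decision on each event (flag outliers)."""
        if value > 100:
            self.alerts.append(value)

    def _expire(self, now):
        # Drop events that have fallen out of the sliding window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()

    def analyze(self):
        """Analyze in real time: event rate and average over the window."""
        if not self.events:
            return {"rate": 0.0, "avg": 0.0}
        span = max(self.events[-1][0] - self.events[0][0], 1e-9)
        values = [v for _, v in self.events]
        return {"rate": len(values) / span, "avg": sum(values) / len(values)}
```

A real fast-data system does the same three things, only at millions of events per second and distributed over a cluster.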
Slide 13
Component Performance Testing: These systems are made up of multiple components, and
it is essential to test each of these components in isolation.
Slide 15
Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size.
The logic for a real-time application is packaged into a Storm topology; a Storm topology is analogous to a MapReduce job. A spout is a source of streams in a topology. Streams are composed of tuples, and the tuple is the main data structure in Storm: a named list of values, where each value can be any type. Bolts can do anything from filtering and applying functions to aggregations, joins, talking to databases, and more.
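As a mental model, the spout/bolt/tuple concepts can be sketched in plain Python. Storm's real API is JVM-based, so every name below is hypothetical:

```python
# Illustrative model of Storm's core concepts; not the actual Storm API.

def sentence_spout():
    """A spout is a source of streams: it emits tuples."""
    for line in ["the quick brown fox", "the lazy dog"]:
        yield {"sentence": line}   # a tuple: a named list of values

def split_bolt(stream):
    """A bolt transforms a stream: here, splitting sentences into words."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}

def count_bolt(stream):
    """A bolt can also aggregate: rolling word counts."""
    counts = {}
    for tup in stream:
        counts[tup["word"]] = counts.get(tup["word"], 0) + 1
    return counts

# A topology wires spouts and bolts together into a processing graph.
word_counts = count_bolt(split_bolt(sentence_spout()))
```

The per-tuple work inside a bolt (here, `split` and the dictionary update) is exactly where component-level performance testing pays off, since it runs once for every event in the stream.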
Slide 17
Due to a lack of real-world streaming benchmarks, we
developed one to compare Apache Flink, Apache Storm
and Apache Spark Streaming. It is released as open
source: https://github.com/yahoo/streaming-benchmarks
Storm Benchmark tools authored by Taylor Goetz -
https://github.com/ptgoetz/storm-benchmark
Storm Benchmark authored by Manu Zhang -
https://github.com/manuzhang/storm-benchmark
Slide 18 (13-04-2016)
Apache distribution:
• TestDFSIO: read and write test for HDFS.
• TeraSort: sorts 1 TB of data (or any other amount) as fast as possible; it combines testing of the HDFS and MapReduce layers of a Hadoop cluster.
• NNBench: used for load testing the NameNode hardware and configuration.
• MRBench: checks whether small jobs are responsive and run efficiently on your cluster.
HiBench: a Hadoop benchmark suite consisting of both micro-benchmarks and real-world applications:
https://software.intel.com/en-us/blogs/2012/10/15/use-hibench-as-a-representative-proxy-for-benchmarking-hadoop-
applications
Slide 19
Monitoring
Chukwa is an open-source data collection system for monitoring and analyzing large distributed systems. It is built on top of Hadoop and includes a powerful and flexible toolkit for monitoring, analyzing, and viewing results. Many components of Chukwa are pluggable, allowing easy customization and enhancement.
Slide 20
Dr. Elephant is a performance monitoring and tuning tool for Hadoop and Spark. It automatically gathers all the metrics, runs analysis on them, and presents them in a simple way for easy consumption.
Open-sourced by LinkedIn on 08-04-2016.
Slide 21
Thinking Scalability
Scalability is the ability of software to maintain performance under increasing load by adding resources linearly. But achieving scalability requires more than just adding resources and tuning performance; it requires thinking holistically about software design, quality, maintainability, and performance.
Necessary conditions for scalability:
• The software has a sound architecture and high quality.
• The software is easy to release, monitor, and tweak.
• Software performance can keep up with additional load by adding resources linearly.
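One way to make the last condition measurable is to compare observed throughput against ideal linear scaling. The function and the example numbers below are illustrative, not measurements from any real cluster:

```python
def scaling_efficiency(baseline_nodes, baseline_tps, nodes, tps):
    """Fraction of ideal linear scaling achieved when growing a cluster.

    1.0 means perfectly linear scaling; values well below 1.0 signal
    a bottleneck (shared state, network, skewed partitioning, ...).
    """
    ideal_tps = baseline_tps * (nodes / baseline_nodes)
    return tps / ideal_tps

# Hypothetical example: 4 nodes handle 40K events/sec, but 16 nodes
# only reach 120K events/sec instead of the ideal 160K.
eff = scaling_efficiency(4, 40_000, 16, 120_000)   # 0.75
```

Tracking this ratio across cluster sizes during performance testing tells you whether "just add resources" will actually keep up with the load.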
Slide 26
Docker lets you limit a container's CPU resources with the --cpu-shares flag.
Two containers (total shares 1536):
• DataBase @1024: ~66%
• WebServer @512: ~33%
Three containers (total shares 3584):
• DataBase @1024: ~28%
• ApplicationServer @2048: ~57%
• WebServer @512: ~14%
CPU shares differ from memory limits in that they're enforced only when there is contention for time on the CPU. If other processes and containers are idle, then a container may burst well beyond its limits.
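The percentages above are simply each container's shares divided by the total shares of all running containers. A small sketch (the container names are illustrative; 1024 is Docker's default share weight):

```python
def cpu_share_percentages(shares):
    """Relative CPU each container can claim under full contention,
    given its --cpu-shares value (Docker's default weight is 1024)."""
    total = sum(shares.values())
    return {name: 100.0 * s / total for name, s in shares.items()}

# Two containers, total shares 1536:
two = cpu_share_percentages({"database": 1024, "webserver": 512})
# database ~66%, webserver ~33%

# Add an application server at 2048, total shares 3584:
three = cpu_share_percentages(
    {"database": 1024, "webserver": 512, "appserver": 2048})
# database drops to ~28%, webserver to ~14%, appserver gets ~57%
```

Note that a container's effective share changes whenever any other container starts or stops, which is exactly why results on a shared Docker cloud are hard to reproduce.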
Editor's Notes
In 2014 I spoke about the importance of mobile performance testing. Recent research revealed that performance is number 2 on the list of problems users encounter with apps, so there is still a lot to do in performance testing for mobile. Today I will not talk about mobile performance, but about performance testing for Big Data.
My experience with Big Data started in 2000, when I was working for Global Crossing as part of the global engineering team building a pan-European network. A lot of telcos were doing the same: KPN, KPNQwest, Deutsche Telekom, BT, and Worldcom, to name a few. At that time, however, companies didn't need all this capacity; when Global Crossing went bankrupt (dot-com bubble), only 10% of the network's capacity was used by customers. Today, with the explosive use of the internet and Big Data, all this capacity finally gets utilised. After the bankruptcy of Global Crossing I started to work as a software tester, quickly moving to test automation and performance testing due to my technical background.
Testing in those days was done in waterfall fashion, on 2- and 3-tier applications running on real hardware. The way bridges were built in the past is comparable with this way of software development and implementation: after completion, the bridge was tested with fully loaded trucks, hoping it wouldn't collapse.
No need to talk about the explosion of mobile usage and social media apps. We moved from waterfall to Agile, from physical hardware to virtualisation. The Large Hadron Collider produces 15 petabytes of data a year. A less-known usage of Big Data is offshore wind turbines: offshore maintenance is costly, so you want to do maintenance just in time. A Dutch company has developed software that uses Big Data to compare sensor results from every turbine with the others, enabling them to do maintenance just in time.
What is Big Data? The three Vs: volume, velocity, and variety.
The latest version of the Big Data landscape, an overview of tools and applications. I'm not going to cover each and every tool in this presentation; I will focus on some commonly used solutions.
Let's do performance testing on Big Data! We need a production-like cluster as a test environment, and a second cluster to generate the load. Of course we need test data, lots of it: terabytes or even petabytes.
Oops, this is going to be expensive. And which performance test tools support end-to-end Big Data testing?
How do I get all this test data? When it's data from the wind turbines, it's fairly easy. Social media data, web shop data, basically any personal data, would need to be anonymized. For my project at Staples it took 3 months to get all data for 500 customers and 100 articles set up and synchronized in all systems. This performance testing approach is clearly not an option.
Let's step back and look at developments in engineering: nowadays bridges are no longer built and then tested after construction in the hope that they can take the load. Sophisticated tooling helps you determine the load on each element of the bridge and calculate how strong it needs to be; some tooling can even calculate the impact of temperature and strong winds. Translated to Big Data, this means we need to engineer and test the individual elements that form a Big Data solution.
Let's have a look at the corporate data architecture. Big Data starts with fast data: lots of streams with relatively small amounts of data, over time becoming Big Data. This data is ingested by our system, evaluated to make a data-driven decision, and analyzed in near real time to provide insight into developing trends.
The Lambda Architecture is composed of three layers: batch, speed, and serving.
The batch layer has two major tasks: (a) managing historical data; and (b) re-computing results such as machine learning models. Specifically, the batch layer receives arriving data, combines it with historical data and re-computes results by iterating over the entire combined data set. The batch layer operates on the full data and thus allows the system to produce the most accurate results. However, the results come at the cost of high latency due to high computation time.
The speed layer is used in order to provide results in a low-latency, near real-time fashion. The speed layer receives the arriving data and performs incremental updates to the batch layer results. Thanks to the incremental algorithms implemented at the speed layer, computation cost is significantly reduced.
Finally, the serving layer enables various queries of the results sent from the batch and speed layers.
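The division of labour between the three layers can be caricatured in a few lines of Python. This is only a toy word-count illustration; all names are made up:

```python
# Toy Lambda architecture: the batch layer recomputes over all data
# (accurate, slow), the speed layer updates incrementally (fast),
# and the serving layer merges both views.

class LambdaArchitecture:
    def __init__(self):
        self.master_data = []   # immutable, append-only history
        self.batch_view = {}    # recomputed from scratch by the batch layer
        self.speed_view = {}    # incremental updates since the last batch run

    def ingest(self, key):
        self.master_data.append(key)
        # Speed layer: incremental update, low latency.
        self.speed_view[key] = self.speed_view.get(key, 0) + 1

    def run_batch(self):
        # Batch layer: iterate over the entire combined data set.
        self.batch_view = {}
        for key in self.master_data:
            self.batch_view[key] = self.batch_view.get(key, 0) + 1
        self.speed_view = {}    # results now absorbed into the batch view

    def query(self, key):
        # Serving layer: merge the batch and speed views.
        return self.batch_view.get(key, 0) + self.speed_view.get(key, 0)
```

The cost of this design is visible even in the toy: the same counting logic exists twice, once incrementally and once as a full recomputation, which is exactly the duplication the Kappa architecture tries to remove.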
One of the important motivations for inventing the Kappa architecture was to avoid maintaining two separate code bases for the batch and speed layers. The key idea is to handle both real-time data processing and continuous data reprocessing using a single stream processing engine. Data reprocessing is an important requirement for making visible the effects of code changes on the results. As a consequence, the Kappa architecture is composed of only two layers: stream processing and serving. The stream processing layer runs the stream processing jobs; normally, a single stream processing job is run to enable real-time data processing. Data reprocessing is only done when some code of the stream processing job needs to be modified. This is achieved by running another, modified stream processing job and replaying all previous data.
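By contrast, a Kappa-style sketch needs only one processing function; replaying the log through a modified job is how code changes take effect. Again, a toy illustration with made-up names:

```python
# Toy Kappa architecture: a single stream-processing layer handles both
# live events and full reprocessing of the replayed event log.

def process(log, transform):
    """The one stream-processing job: fold a transform over the log."""
    view = {}
    for event in log:
        transform(view, event)
    return view

def count_v1(view, event):
    view[event] = view.get(event, 0) + 1

def count_v2(view, event):
    # Modified job logic: this version filters out "noise" events.
    if event != "noise":
        view[event] = view.get(event, 0) + 1

log = ["click", "noise", "click", "view"]
serving_v1 = process(log, count_v1)   # current serving view
serving_v2 = process(log, count_v2)   # reprocessing with the new code
```

Once `serving_v2` has caught up, it replaces `serving_v1` and the old job is retired; there is never a second, batch-flavoured code base to keep in sync.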
Example from Travel Bird on a MeetUp presentation. TravelBird brings you the best 6 holiday deals every day, both for domestic breaks and foreign getaways. We select the best offers to bring you the ultimate travel experience. Our aim is to surprise and inspire you through a customized and diversified holiday offering — our streamlined offer choice will help you find the best deal. They use all kinds of data, browser, time, weather, location to profile each visitor combined with trends from their Big Data system (Lambda Architecture).
Question: Who has been attending Meetups ?
Let's say I want to become a carpenter. I could buy some books about woodworking and tools, and I could probably make a shelf or a cupboard, but it would be average. To learn the real tricks of the trade I would contact a carpenter to teach me all the little details you won't find in the books. So go to meetups; talk to and interview real people in the tech field to collect knowledge and information. There are several Big Data meetups in Amsterdam, and in Istanbul too, as I checked for this presentation.
For processing of fast data I chose to talk about Apache Storm. Topologies are built by developers to extract the data needed by the business for real-time dashboards. A topology consists of several elements, and the way bolts are developed can have a big impact on performance. Performance engineering and testing of bolts is similar to unit performance testing in application development.
Spark and Samza are other well-known fast data processing engines. Based on project requirements you need to determine which one suits you best. The three frameworks use different vocabularies for similar concepts.
Running pre-established benchmarks can be a very helpful way to test the performance and scaling of your cluster without having to develop a Storm topology from scratch. Benchmarking with artificial data eliminates the need for test data. If your fast data cluster can handle only 7K messages/sec under benchmarking where 10K is required, there is work to do.
Back in 2008, Yahoo! set a record by sorting 1 TB of data in 209 seconds on a Hadoop cluster of 910 nodes, as Owen O'Malley of the Yahoo! Grid Computing Team reports. Benchmarks for Hadoop are part of the installation package. Intel open-sourced HiBench, a benchmark suite for Hadoop.
Most of the monitoring tools out there, whether open source or proprietary, are designed to collect system resource metrics and monitor cluster resources. They are focused on simplifying the deployment and management of Big Data clusters. Be aware that these tools only tell you that you are running out of resources, like the fuel gauge in your car; they do not refill the tank. Starting the race with too little fuel will lead to failure. Similarly, as in racing, benchmarking and measuring consumption lets you determine the amount of resources needed for your Big Data solution.
While we can always optimize the underlying hardware resources, network infrastructure, OS, and other components of the Big Data solution, only users have control over optimizing the jobs that run on the cluster.
Dr. Elephant's goal is to improve developer productivity and increase cluster efficiency by making it easier to tune jobs. It analyzes Hadoop and Spark jobs using a set of pluggable, configurable, rule-based heuristics that provide insights into how a job performed, and then uses the results to make suggestions about how to tune the job to make it perform more efficiently. This way, Dr. Elephant prevents non-optimized jobs from causing performance issues in production.
Most important for Big Data is scalability. It starts with architecture, engineering, quality in development, and an environment where additional resources can be added easily.
This is how architecture, engineering, Agile, and DevOps need to come together for Big Data solutions to provide scalability.
The performance engineer is the spider in the web: part of engineering and architecture, part of development and operations. He advises on capacity planning and, of course, does modelling and performance testing of individual components.
My current project uses Docker for test environments. Every time we test with the same test data, the same scripts, and the same Docker container setup; only the release is different. The results, however, vary a lot without explanation. So I did a test, running the same performance test against the same release for 14 days; again, the results show a lot of variation. Let's dive deeper into Docker resource management.
With two containers, the database can get 66% of the available CPU and the webserver can get 33%. When we add a container, the two already running containers will get less: the database drops to 28% of the available CPU without any change in settings. At my customer, all departments use the same Docker cloud, so it is not clear how many containers are active at any given moment in time. A possible solution could be to create a separate Docker cloud for performance testing.