Performance Testing of Big Data
26 April 2016
Big Data refers to data that, because of its size, speed, or format (that is,
its volume, velocity, or variety), cannot be easily stored, manipulated, or
analyzed with traditional methods such as spreadsheets, relational databases,
or common statistical software.
[Diagram: a Load Generating Cluster drives a production-like Big Data Cluster.
Test data runs to terabytes or petabytes, raising the question of test data management.]
Corporate Data Architecture
Data is fast before it’s big. Data often arrives in data systems as streams,
with events happening hundreds to tens of thousands of times a second.
http://www.internetlivestats.com/
What we do with Fast Data:
• Ingest – get millions of events per second into the system
• Decide – make a data-driven decision on each event
• Analyze in real time – provide visibility into operational trends of the events
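The three activities above can be sketched as a single-process loop. This is a minimal illustration, not a production pipeline: the event shape, the threshold rule, and the names `FastDataPipeline`, `ingest`, `decide`, and `events_per_second` are all assumptions made for the example.

```python
from collections import deque


class FastDataPipeline:
    """Toy single-process pipeline: ingest, decide per event, analyze a window."""

    def __init__(self, threshold, window_seconds=1.0):
        self.threshold = threshold          # hypothetical decision rule
        self.window_seconds = window_seconds
        self.recent = deque()               # timestamps of recently ingested events

    def ingest(self, timestamp, value):
        """Ingest one event and return the decision made for it."""
        self.recent.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while self.recent and timestamp - self.recent[0] >= self.window_seconds:
            self.recent.popleft()
        return self.decide(value)

    def decide(self, value):
        """Data-driven decision on a single event (assumed threshold rule)."""
        return "alert" if value > self.threshold else "ok"

    def events_per_second(self):
        """Operational trend: ingest rate over the sliding window."""
        return len(self.recent) / self.window_seconds


pipeline = FastDataPipeline(threshold=100)
decisions = [pipeline.ingest(t / 1000.0, v) for t, v in enumerate([5, 250, 40, 101])]
print(decisions, pipeline.events_per_second())
```

A real system would spread these steps over a message broker and a stream processor; the sketch only shows how the three responsibilities relate to each other.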
Lambda
http://www.ericsson.com/research-blog/data-knowledge/data-processing-architectures-lambda-and-kappa/
Kappa
http://www.ericsson.com/research-blog/data-knowledge/data-processing-architectures-lambda-and-kappa/
Component Performance Testing: These systems are made up of multiple components, and
it is essential to test each of these components in isolation.
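As a rough sketch of testing one component in isolation, the harness below feeds prepared input straight into a single callable and reports its throughput. The `measure_throughput` helper and the `parse_event` component are hypothetical stand-ins, not part of any tool named in this talk.

```python
import time


def measure_throughput(component, inputs):
    """Run one component in isolation and report records per second."""
    start = time.perf_counter()
    processed = 0
    for record in inputs:
        component(record)
        processed += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    rate = processed / elapsed if elapsed > 0 else float("inf")
    return processed, elapsed, rate


def parse_event(line):
    """Hypothetical stand-in for a single pipeline component."""
    user, amount = line.split(",")
    return user, int(amount)


sample = [f"user{i},{i}" for i in range(10_000)]
count, seconds, per_second = measure_throughput(parse_event, sample)
print(f"{count} records in {seconds:.4f}s ({per_second:.0f} records/s)")
```

Because the input is generated up front, the measurement excludes upstream components, which is the point of isolating the one under test.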
Storm is a distributed real-time computation system for processing large volumes of
high-velocity data. Storm is extremely fast, with the ability to process over a million
records per second per node on a cluster of modest size.
The logic for a realtime application is packaged into a Storm topology. A Storm
topology is analogous to a MapReduce job.
A spout is a source of streams in a topology.
Streams are composed of tuples.
The tuple is the main data structure in Storm. A tuple is a named list of values,
where each value can be any type.
Bolts can do anything: filtering, functions, aggregations, joins, talking to
databases, and more.
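The relationship between spouts, bolts, tuples and a topology can be illustrated with a toy model in plain Python. This is not Storm's API: generators stand in for streams, dictionaries stand in for tuples, and every function name here is made up for the example.

```python
def sentence_spout():
    """Spout: a source of a stream, emitting one tuple at a time."""
    for sentence in ["big data is fast", "fast data is big"]:
        yield {"sentence": sentence}      # a tuple: named values


def split_bolt(stream):
    """Bolt: a function step that turns each sentence tuple into word tuples."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def count_bolt(stream):
    """Bolt: an aggregation step that keeps a running count per word."""
    counts = {}
    for tup in stream:
        counts[tup["word"]] = counts.get(tup["word"], 0) + 1
    return counts


def run_topology():
    """Topology: the wiring of the spout and bolts into one graph."""
    return count_bolt(split_bolt(sentence_spout()))


word_counts = run_topology()
print(word_counts)
```

In Storm itself the spout and bolts run in parallel across the cluster and the stream is unbounded; the toy version runs sequentially over a finite input.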
Due to a lack of real-world streaming benchmarks, Yahoo developed one to compare
Apache Flink, Apache Storm and Apache Spark Streaming. It is released as open
source: https://github.com/yahoo/streaming-benchmarks
Storm benchmark tools authored by Taylor Goetz:
https://github.com/ptgoetz/storm-benchmark
Storm benchmark authored by Manu Zhang:
https://github.com/manuzhang/storm-benchmark
Apache distribution
• TestDFSIO: read and write test for HDFS.
• TeraSort: the goal of TeraSort is to sort 1 TB of data (or any other amount)
as fast as possible. It is a benchmark that combines testing the HDFS
and MapReduce layers of a Hadoop cluster.
• NNBench: used for load testing the NameNode hardware and configuration.
• MRBench: checks whether small jobs are responsive and running efficiently on
your cluster.
HiBench: a Hadoop benchmark suite consisting of both micro-benchmarks and real-world
applications.
https://software.intel.com/en-us/blogs/2012/10/15/use-hibench-as-a-representative-proxy-for-benchmarking-hadoop-applications
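For the TeraSort entry above, the input is normally produced by the companion TeraGen job, which is sized by row count rather than by bytes. The small calculation below assumes TeraGen's fixed 100-byte rows; the function name `teragen_rows` is made up for the example.

```python
TERAGEN_ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_bytes):
    """Number of TeraGen rows needed to produce roughly target_bytes of input."""
    return target_bytes // TERAGEN_ROW_BYTES


one_terabyte = 10 ** 12
print(teragen_rows(one_terabyte))
```

The same function sizes smaller trial runs, for example `teragen_rows(10 ** 9)` for about 1 GB of input.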
Monitoring
Chukwa is an open source data collection system for monitoring and
analyzing large distributed systems. It is built on top of Hadoop and
includes a powerful and flexible toolkit for monitoring, analyzing, and
viewing results. Many components of Chukwa are pluggable, allowing
easy customization and enhancement.
Dr. Elephant is a performance monitoring and tuning tool for
Hadoop and Spark. It automatically gathers all the metrics,
runs analysis on them, and presents them in a simple way
for easy consumption.
Open sourced on 08-04-2016.
Thinking Scalability
Scalability is the ability of software to maintain its performance under
increasing load when resources are added linearly. But achieving scalability requires more
than just adding resources and tuning performance. To achieve scalability, one
needs to think holistically about software design, quality, maintainability and
performance.
Necessary conditions for scalability
• The software has a sound architecture and high quality.
• The software is easy to release, monitor and tweak.
• Software performance can keep up with additional load
when resources are added linearly.
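One way to check the last condition is to compare measured throughput at each cluster size with what perfectly linear scaling would give. The sketch below is an illustration only: the function name `scaling_efficiency` and the throughput figures are hypothetical, not measurements from this talk.

```python
def scaling_efficiency(measurements):
    """Compare measured throughput with ideal linear scaling.

    measurements: list of (nodes, throughput) pairs; the first pair is the baseline.
    Returns a list of (nodes, efficiency), where 1.0 means perfectly linear.
    """
    base_nodes, base_throughput = measurements[0]
    per_node = base_throughput / base_nodes
    return [(nodes, throughput / (per_node * nodes))
            for nodes, throughput in measurements]


# Hypothetical numbers, for illustration only (records per second).
runs = [(2, 100_000), (4, 190_000), (8, 320_000)]
for nodes, efficiency in scaling_efficiency(runs):
    print(f"{nodes} nodes: {efficiency:.2f} of linear")
```

An efficiency that falls as nodes are added suggests a shared bottleneck that extra resources alone will not remove.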
Q & A
Praegus B.V. - Experts in Testing & Test Automation
www.praegus.nl
Docker lets you limit a container’s CPU resources with the --cpu-shares flag.
Two containers (total shares 1536):
DataBase @1024 ~66%
WebServer @512 ~33%
Three containers (total shares 3584):
DataBase @1024 ~28%
WebServer @512 ~14%
ApplicationServer @2048 ~57%
CPU shares differ from memory limits in that they’re enforced only when
there is contention for time on the CPU. If other processes and containers are
idle, then the container may burst well beyond its limits.
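Under full contention, each container's portion of CPU time is its share value divided by the sum of all share values. The sketch below works through that arithmetic for the container names and share values on this slide; the helper name `cpu_share_percentages` is made up for the example.

```python
def cpu_share_percentages(shares):
    """Map each container to its percentage of CPU time under full contention."""
    total = sum(shares.values())
    return {name: 100.0 * value / total for name, value in shares.items()}


two = cpu_share_percentages({"DataBase": 1024, "WebServer": 512})
three = cpu_share_percentages(
    {"DataBase": 1024, "WebServer": 512, "ApplicationServer": 2048}
)
for name, pct in three.items():
    print(f"{name}: {pct:.1f}%")
```

Adding a third container changes every percentage even though the existing share values stay the same, because shares are relative weights.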

Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden

Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 

Último (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden

  • 1. Performance Testing of Big Data, 26 April 2016
  • 6. Big Data refers to data that, because of its size, speed, or format (that is, its volume, velocity, or variety), cannot be easily stored, manipulated or analyzed with traditional methods, like spreadsheets, relational databases, or common statistical software.
  • 8. Production-like Big Data cluster; test data (terabytes / petabytes); test data management?; load-generating cluster
  • 10. Corporate Data Architecture. Data is fast before it's big: data often comes into data systems in streams, with events happening hundreds to tens of thousands of times a second (http://www.internetlivestats.com/). The things we do with fast data: (1) Ingest: get millions of events per second into the system; (2) Decide: make a data-driven decision on each event; (3) Analyze in real time: provide visibility into operational trends of the events.
  • 13. Component Performance Testing: These systems are made up of multiple components, and it is essential to test each of these components in isolation.
  • 15. Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size. The logic for a real-time application is packaged into a Storm topology, which is analogous to a MapReduce job. A spout is a source of streams in a topology. Streams are composed of tuples. The tuple is the main data structure in Storm: a named list of values, where each value can be any type. Bolts can do anything from filtering, functions, aggregations and joins to talking to databases, and more.
  • 17. Due to a lack of real-world streaming benchmarks, we developed one to compare Apache Flink, Apache Storm and Apache Spark Streaming. It is released as open source (https://github.com/yahoo/streaming-benchmarks). Storm benchmark tools authored by Taylor Goetz (https://github.com/ptgoetz/storm-benchmark). Storm Benchmark authored by Manu Zhang (https://github.com/manuzhang/storm-benchmark).
  • 18. Benchmarks in the Apache distribution: TestDFSIO, a read and write test for HDFS; TeraSort, whose goal is to sort 1 TB of data (or any other amount) as fast as possible, a benchmark that combines testing the HDFS and MapReduce layers of a Hadoop cluster; NNBench, used for load testing the NameNode hardware and configuration; MRBench, which checks whether small jobs are responsive and running efficiently on your cluster. HiBench is a Hadoop benchmark suite consisting of both micro-benchmarks and real-world applications (https://software.intel.com/en-us/blogs/2012/10/15/use-hibench-as-a-representative-proxy-for-benchmarking-hadoop-applications).
  • 19. Monitoring: Chukwa is an open source data collection system for monitoring and analyzing large distributed systems. It is built on top of Hadoop and includes a powerful and flexible toolkit for monitoring, analyzing, and viewing results. Many components of Chukwa are pluggable, allowing easy customization and enhancement.
  • 20. Dr. Elephant is a performance monitoring and tuning tool for Hadoop and Spark. It automatically gathers all the metrics, runs analysis on them, and presents them in a simple way for easy consumption. Open sourced on 08-04-2016.
  • 21. Thinking Scalability. Scalability is the ability of the software to keep up its performance even under increasing load by adding resources linearly. But achieving scalability requires more than just adding resources and tuning performance: one needs to think holistically about software design, quality, maintainability and performance. Necessary conditions for scalability: the software has a sound architecture and high quality; the software is easy to release, monitor and tweak; software performance can keep up with additional load by adding resources linearly.
  • 24. Q & A. Praegus B.V. - Experts in Testing & Test Automation, www.praegus.nl
  • 26. Docker lets you limit a container's CPU resources with the --cpu-shares flag. With two containers (1536 total shares): DataBase @1024 ~66%, WebServer @512 ~33%. With three containers (3584 total shares): DataBase @1024 ~28%, WebServer @512 ~14%, ApplicationServer @2048 ~57%. CPU shares differ from memory limits in that they're enforced only when there is contention for time on the CPU. If other processes and containers are idle, then the container may burst well beyond its limits.
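
The percentages on the Docker slide follow from dividing each container's share value by the sum of all share values. A minimal sketch of that arithmetic, assuming every container is competing for CPU at the same time; the function name is illustrative and nothing here calls Docker itself:

```python
def cpu_share_percentages(shares):
    """Return each container's percentage of CPU time under full contention.

    `shares` maps a container name to its --cpu-shares value.
    """
    total = sum(shares.values())
    return {name: 100.0 * value / total for name, value in shares.items()}


# Share values taken from the slide.
two_containers = cpu_share_percentages({"DataBase": 1024, "WebServer": 512})
three_containers = cpu_share_percentages(
    {"DataBase": 1024, "WebServer": 512, "ApplicationServer": 2048}
)

for label, result in (("2 containers", two_containers),
                      ("3 containers", three_containers)):
    rounded = {name: round(pct, 1) for name, pct in result.items()}
    print(label, rounded)
```

Adding the third container changes no setting on the first two, yet the database's guaranteed portion falls from about two thirds to under 30%, because only the total in the denominator grew.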

Editor's Notes

  1. In 2014 I spoke about the importance of mobile performance testing. Recent research revealed that performance is number 2 on the list of problems users encounter with apps, so there is still a lot to do in performance testing for mobile. Today, however, I will talk not about mobile performance but about performance testing for Big Data.
  2. My experience with Big Data started in 2000, when I was working for Global Crossing as part of the global engineering team building a pan-European network. A lot of telcos were doing the same: KPN, KPNQwest, Deutsche Telekom, BT and WorldCom, to name a few. At that time, however, companies didn't need all this capacity. When Global Crossing went bankrupt (dot-com bubble), only 10% of the capacity of the network was used by customers. Today, with the explosive use of the internet and Big Data, all this capacity is finally being utilised. After the bankruptcy of Global Crossing I started to work as a software tester, quickly moving to test automation and performance testing because of my technical background.
  3. In those days testing was done in waterfall fashion on 2- and 3-tier applications running on physical hardware. The way bridges were built in the past is comparable to this way of software development and implementation: after completion the bridge was tested with fully loaded trucks, in the hope that it wouldn't collapse.
  4. There is no need to talk about the explosion of mobile usage and social media apps. We moved from Waterfall to Agile, and from physical hardware to virtualisation. The Large Hadron Collider produces 15 petabytes of data a year. A lesser-known use of Big Data is in offshore wind turbines. Because the turbines are offshore, maintenance is costly, so you want to do maintenance just in time. A Dutch company has developed software that uses Big Data to compare the sensor results from every turbine with those of the others, enabling it to do maintenance just in time.
  5. What is Big Data? The three Vs: volume, velocity and variety.
  6. This is the latest version of the Big Data Landscape overview of tools and applications. I'm not going to cover each and every tool in this presentation; I will focus on some commonly used solutions.
  7. Let's do performance testing on Big Data! We need a production-like cluster as a test environment, and a second cluster to generate the load. Of course we need test data, lots of it: terabytes or even petabytes. Oops, this is going to be expensive. And which performance test tools support end-to-end Big Data testing? How do I get all this test data? When it is data from the wind turbines, it is fairly easy. Social media data, web shop data, basically any personal data, would need to be anonymized. For my project at Staples it took 3 months to get all the data for 500 customers and 100 articles set up and synchronized in all systems. This performance testing approach is clearly not an option.
  8. Let's step back and look at developments in engineering. Nowadays bridges are no longer built and then tested after construction in the hope that they can take the load. Sophisticated tooling helps you determine the load on each element of the bridge and calculate how strong it needs to be. Some tooling can even calculate the impact of temperature and strong winds. Translated to Big Data, this means we need to engineer and test the individual elements that form a Big Data solution.
  9. Let's have a look at the Corporate Data Architecture. Big Data starts with fast data: lots of streams with relatively small amounts of data, which over time become Big Data. This data is ingested by our system, evaluated to make a data-driven decision, and analyzed in near real time to provide insight into developing trends.
  10. The Lambda Architecture is composed of three layers: batch, speed, and serving. The batch layer has two major tasks: (a) managing historical data; and (b) re-computing results such as machine learning models. Specifically, the batch layer receives arriving data, combines it with historical data and re-computes results by iterating over the entire combined data set. The batch layer operates on the full data and thus allows the system to produce the most accurate results. However, the results come at the cost of high latency due to high computation time. The speed layer is used in order to provide results in a low-latency, near real-time fashion. The speed layer receives the arriving data and performs incremental updates to the batch layer results. Thanks to the incremental algorithms implemented at the speed layer, computation cost is significantly reduced. Finally, the serving layer enables various queries of the results sent from the batch and speed layers.
  11. One of the important motivations for inventing the Kappa architecture was to avoid maintaining two separate code bases for the batch and speed layers. The key idea is to handle both real-time data processing and continuous data reprocessing using a single stream processing engine. Data reprocessing is an important requirement for making visible the effects of code changes on the results. As a consequence, the Kappa architecture is composed of only two layers: stream processing and serving. The stream processing layer runs the stream processing jobs. Normally, a single stream processing job is run to enable real-time data processing. Data reprocessing is only done when some code of the stream processing job needs to be modified. This is achieved by running another, modified stream processing job and replaying all previous data.
  12. An example from TravelBird, taken from a Meetup presentation. In TravelBird's own words: "TravelBird brings you the best 6 holiday deals every day, both for domestic breaks and foreign getaways. We select the best offers to bring you the ultimate travel experience. Our aim is to surprise and inspire you through a customized and diversified holiday offering; our streamlined offer choice will help you find the best deal." They use all kinds of data (browser, time, weather, location) to profile each visitor, combined with trends from their Big Data system (Lambda Architecture). Question: who has attended Meetups?
  13. Let's say I want to become a carpenter. I could buy some books about woodworking and tools, and I could probably make a shelf or a cupboard, but it would be average. To learn the real tricks of the trade I would contact a carpenter and ask them to teach me all the little details you won't find in the books. So go to meetups; talk to and interview real people in the tech field to collect knowledge and information. There are several Big Data meetups in Amsterdam, and also in Istanbul, as I checked for this presentation.
  14. For the processing of fast data I chose to talk about Apache Storm. Topologies are built by developers to extract the data needed by the business for real-time dashboards. A topology consists of several elements, and the way bolts are developed can have a big impact on performance. Performance engineering and testing of bolts is similar to unit performance testing in application development.
  15. Spark & Samza are other well known fast data processing engines. Based on project requirements you need to determine which one would suit you best. The three frameworks use different vocabularies for similar concepts.
  16. Running pre-established benchmarks can be a very helpful way to test the performance and scaling of your cluster without having to develop a Storm topology from scratch. Benchmarking with artificial data eliminates the need for test data. If your fast data cluster can handle only 7K messages/sec in a benchmark where 10K is required, there is work to do.
  17. Back in 2008, Yahoo! set a record by sorting 1 TB of data in 209 seconds on a Hadoop cluster of 910 nodes, as Owen O'Malley of the Yahoo! Grid Computing Team reports. Benchmarks for Hadoop are part of the installation package. Intel open sourced HiBench, a benchmark suite for Hadoop.
  18. Most of the monitoring tools out there, whether open source or proprietary, are designed to collect system resource metrics and monitor cluster resources. They are focused on simplifying the deployment and management of Big Data clusters. Be aware that these tools only tell you that you are running out of resources, like the fuel gauge in your car; they do not refill the tank. Starting the race without enough fuel will lead to failure. Just as a racing team measures fuel consumption, benchmarking and measuring let you determine the amount of resources needed for your Big Data solution.
  19. While we can always optimize the underlying hardware resources, network infrastructure, OS, and other components of the Big Data solution, only users have control over optimizing the jobs that run on the cluster. Dr. Elephant's goal is to improve developer productivity and increase cluster efficiency by making it easier to tune the jobs. It analyzes the Hadoop and Spark jobs using a set of pluggable, configurable, rule-based heuristics that provide insights on how a job performed, and then uses the results to make suggestions about how to tune the job to make it perform more efficiently. In this way Dr. Elephant prevents non-optimized jobs from causing performance issues in production.
  20. Most important for Big Data is scalability. It starts with architecture, engineering, quality in development, and an environment where additional resources can be added easily.
  21. This is how Architecture, Engineering, Agile and DevOps need to come together for Big Data solutions to provide scalability.
  22. The performance engineer is the spider in the web: part of engineering and architecture, part of development and operations. The performance engineer advises on capacity planning and of course does modelling and performance testing on individual components.
  23. My current project uses Docker for test environments. Every time, we test with the same test data, the same scripts and the same Docker container setup; only the release is different. The results, however, vary a lot without explanation. So I ran a test: the same performance test against the same release for 14 days. Again the results showed a lot of variation. Let's dive deeper into Docker resource management.
  24. With 2 containers, the database can get 66% of the available CPU and the web server 33%. When we add a container, the 2 containers already running get less: the database drops to 28% of the available CPU without any change in settings. At my customer all departments use the same Docker cloud, so it is not clear how many containers are active at any given moment. A possible solution could be to create a separate Docker cloud for performance testing.
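
One way to put a number on the unexplained variation seen in the 14-day repeat test is the coefficient of variation (standard deviation divided by mean) of one metric across the runs. A minimal sketch; the response times and the 10% stability threshold below are made-up illustration values, not measurements from the project:

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (dimensionless)."""
    return statistics.stdev(samples) / statistics.mean(samples)


def is_stable(samples, max_cv=0.10):
    """True when run-to-run variation stays within the chosen threshold.

    max_cv is an assumed cut-off; pick one that fits your own system.
    """
    return coefficient_of_variation(samples) <= max_cv


# Hypothetical 95th-percentile response times (ms), one per daily run
# of the same test against the same release.
daily_p95_ms = [410, 395, 620, 405, 880, 400, 415,
                710, 398, 402, 650, 407, 399, 930]

cv = coefficient_of_variation(daily_p95_ms)
print(f"mean={statistics.mean(daily_p95_ms):.0f} ms, cv={cv:.2f}, "
      f"stable={is_stable(daily_p95_ms)}")
```

If the same check passes on an isolated environment but fails on the shared one, that points at contention from other containers rather than at the release under test.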