Benchmarking Hadoop and Big Data

The slides from the BDOOP meetup group presentation on Benchmarking Big Data systems with use cases from Hadoop

  1. @BDOOP_BCN Benchmarking Hadoop by Nicolas Poggi @ni_po June 2, 2015
  2. About Nicolas Poggi @ni_po
  3. What is BDOOP about? ● A group to share about data ● scalability ● performance ● configurations ● cluster design ● benchmarking ● …and a couple of beers! ● Having sysadmins in mind ● also POs ● Not a group to learn ● Java ● MapReduce programming ● Hadoop base concepts
  4. BDOOP Group Objectives ● Create a local community to ● learn about Big Data performance and scalability ● share day-to-day problems and solutions ● present your work and findings ● have talks from renowned experts ● > Your objective here <
  5. Benchmarking Motivation and Intro
  6. Hadoop design ● Hadoop is designed to process complex data ● structured and unstructured ● with [close to] linear scalability ● Simplifies the programming model ● compared to MPI, OpenMP, CUDA, … ● Operates as a black box for data analysts. Image source: Hadoop: The Definitive Guide
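
To make the "simplified programming model" point concrete, here is a minimal sketch of the canonical WordCount mapper against the standard Hadoop MapReduce API; this is the textbook example, not code from the talk:

    // Classic WordCount mapper: the framework handles partitioning,
    // shuffling, sorting, and fault tolerance behind the scenes.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }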
  7. Hadoop attributes ● Fault tolerant ● built from commodity hardware ● Built-in redundancy ● via replication ● Automatically scales out / down ● with [almost] linear scalability ● Moves computation to the data ● to minimize communication ● Shared-nothing architecture
  8. Hadoop is highly scalable, but… ● not a high-performance solution! ● Requires ● design: clusters, topology ● setup: OS, Hadoop config ● and tuning ● Iterative approach ● time consuming ● and extensive benchmarking!
  9. Hadoop parameters ● 100+ tunable parameters ● e.g. mapred.map/reduce.tasks.speculative.execution ● obscure and interrelated ● defaults, with tuned values in parentheses: ● io.sort.mb 100 (300) ● io.sort.record.percent 5% (15%) ● io.sort.spill.percent 80% (95–100%) ● Number of mappers and reducers ● rule of thumb: 0.5–2 per CPU core
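
As an illustration (not from the slides), these knobs can be set programmatically through org.apache.hadoop.conf.Configuration; the sketch below uses the Hadoop 1.x property names quoted on the slide and the tuned values shown in parentheses there:

    // Hedged sketch: assumes hadoop-core (1.x) on the classpath;
    // values mirror the slide's tuned figures, not recommendations.
    import org.apache.hadoop.conf.Configuration;

    public class TuningSketch {
        public static Configuration tunedConf() {
            Configuration conf = new Configuration();
            conf.setInt("io.sort.mb", 300);                 // default: 100
            conf.setFloat("io.sort.record.percent", 0.15f); // default: 0.05
            conf.setFloat("io.sort.spill.percent", 0.95f);  // default: 0.80
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
            return conf;
        }
    }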
  10. Hadoop ecosystem ● Large and spread out ● Dominated by big players ● custom patches ● default values not ideal ● product claims ● Cloud vs. on-premise ● IaaS ● PaaS ● EMR, HDInsight ● Needs standardization and auditing!
  11. Product claims ● Need auditing!
  12. Workload (jobs) ● All jobs are different! ● Different requirements ● CPU bound ● memory bound ● I/O bound ● … or a bit of all ● Different tuning for each ● Needs benchmarking! ● Figure: sample mappers and reducers for 3 popular benchmarks: Terasort, K-means, Wordcount
  13. One config for all? ● Is there one software configuration that fits everybody? ● Chart: the vertical line marks the average performance for each workload across configurations; values to the right are above average, values to the left below average ● Some configurations are good for Terasort but bad for Wordcount; others are good for Wordcount but very bad for Terasort
  14. Example of SSD impact on execution time ● Impact of SSDs on the running time of Terasort ● Chart: execution time across configurations, SSD vs. SATA HDD
  15. Too many choices? ● Remote volumes vs. rotational HDDs / JBODs ● large VMs vs. small VMs ● Gb Ethernet vs. InfiniBand ● RAID ● replication ● high availability ● cost vs. performance ● on-premise vs. cloud ● And where is my system configuration positioned on each of these axes?
  16. Benchmarks
  17. Why benchmark? ● Validate assumptions ● Reproduce bad behavior ● debugging ● Measure performance and scale ● Simulate higher load ● find bottlenecks / limits ● plan for growth ● Test different SW and HW. Source: based on High Performance MySQL, benchmarking MySQL chapter
  18. Benchmarking stakeholders and use cases ● End-user / consumer ● compare products ● Developer ● profiling ● CI / QA ● Sysadmin / architect ● cluster sizing ● SW and HW vendors ● product claims ● marketing ● Researcher ● …
  19. Big Data V's ● Volume ● Velocity ● Variety ● structured, semi-structured, unstructured data ● different types of data (genres) ● Veracity ● Value ● Figure: sample scale factors from TPCx-HS
  20. Data generation ● Real vs. synthetic ● Random vs. repeatable data ● Data generation time ● parallel generation ● Data distribution ● flat (uniformly distributed) ● Gaussian (normal distribution, skew)
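
A hypothetical sketch (not from the talk) of the two distribution shapes named above, generating repeatable synthetic keys from a fixed seed:

    import java.util.Random;

    public class SyntheticKeys {
        public static void main(String[] args) {
            Random rnd = new Random(42); // fixed seed => repeatable data
            for (int i = 0; i < 10; i++) {
                // Flat: keys uniformly distributed over [0, 1,000,000).
                long uniformKey = (long) (rnd.nextDouble() * 1_000_000);
                // Skewed: keys normally distributed around 500,000.
                long gaussianKey = Math.round(500_000 + rnd.nextGaussian() * 100_000);
                System.out.println(uniformKey + "\t" + gaussianKey);
            }
        }
    }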
  21. Issues benchmarking Big Data ● Big scale ● single node vs. multiple nodes ● 10 MB vs. 10 TB ● on-metal vs. virtualized vs. cloud ● Non-determinism / randomness ● need to average multiple runs ● How long to benchmark ● system warm-up ● Distributed systems ● failures?
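
The "average multiple runs" and "system warm-up" points could look like the following harness sketch; runBenchmark() is a hypothetical stand-in for whatever submits the job and reports its elapsed time:

    public class RepeatedRuns {
        // Hypothetical hook: submit the benchmark job, return elapsed millis.
        static long runBenchmark() { /* ... */ return 0L; }

        public static void main(String[] args) {
            final int warmups = 2, measured = 5;
            // Discard warm-up runs so caches, JIT, and daemons settle first.
            for (int i = 0; i < warmups; i++) runBenchmark();
            long total = 0;
            for (int i = 0; i < measured; i++) total += runBenchmark();
            System.out.println("mean runtime (ms): " + (total / measured));
        }
    }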
  22. Types of benchmarks and standards ● Micro benchmarks ● HDFS IO ● Functional ● Terasort, ETL ● Genre-specific ● Graph 500 ● Application level ● BigBench ● TPC (specification-based) vs. SPEC (kit-based)
  23. TPC vs. SPEC models ● TPC: specification based ● performance, price, and energy in one benchmark ● end-to-end ● multiple tests (ACID, load) ● independent review ● full disclosure ● TPC Technology Conference ● SPEC: kit based ● performance and energy in separate benchmarks ● server-centric ● single test ● peer review ● summary disclosure ● SPEC Research Group, ICPE. Source: from a presentation by Meikel Poess, 1st WBDB, May 2012
  24. Data Benchmarks (from classical SQL OLAP DBs to Big Data) ● First there was TPC-H ● classical SQL OLAP benchmark ● MRBench for M/R ● on top of Hive or Impala for Hadoop ● Then sorting ● Terasort ● unofficial standard ● now part of TPCx-HS ● Hadoop samples ● Wordcount, grep, terasort, DFSIO ● YCSB ● from Yahoo! ● for NoSQL; HBase implementation ● GridMix ● CALDA ● HiBench ● SWIM ● BigBench ● based on TPC-DS + ML ● 30 queries ● BigDataBench ● 33 workloads ● TPCx-HS
  25. Comparison of popular Hadoop benchmarks — "The Differences of BigDataBench from Other Benchmark Suites." Source: BigDataBench homepage

     | Suite          | Spec[1] | App domains | Workload types | Workloads       | Scalable data sets[2] | Diverse implem.[3] | Multi-tenancy[4] | Subset[5] | Simulator[6] |
     | BigDataBench   | Y       | Five        | Four[7]        | Thirty-three[8] | Eight[9]              | Y                  | Y                | Y         | Y            |
     | BigBench       | Y       | One         | Three          | Ten             | Three                 | N                  | N                | N         | N            |
     | CloudSuite     | N       | N/A         | Two            | Eight           | Three                 | N                  | N                | N         | Y            |
     | HiBench        | N       | N/A         | Two            | Ten             | Three                 | N                  | N                | N         | N            |
     | CALDA          | Y       | N/A         | One            | Five            | N/A                   | Y                  | N                | N         | N            |
     | YCSB           | Y       | N/A         | One            | Six             | N/A                   | Y                  | N                | N         | N            |
     | LinkBench      | Y       | N/A         | One            | Ten             | N/A                   | Y                  | N                | N         | N            |
     | AMP Benchmarks | Y       | N/A         | One            | Four            | N/A                   | Y                  | N                | N         | N            |
  26. What to measure, and metrics ● Job execution time ● Throughput ● units / time ● Framework overhead ● # of spills ● Scalability ● Concurrency ● Abstract metrics ● CPU ● MEM ● DISK ● IOPS, latency, bandwidth ● NET ● latency, bandwidth ● TPCx-HS performance metric (HSph@SF)
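
For reference, the TPCx-HS primary metric divides the scale factor by the elapsed hours of the performance run; to the best of my reading of the spec:

    HSph@SF = \frac{SF}{T / 3600}

where SF is the scale factor and T is the elapsed time of the run in seconds. For example, a run at SF = 1 finishing in 1,800 s yields HSph@SF = 1 / (1800 / 3600) = 2.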
  27. Benchmarking
  28. Project ALOJA online repository ● Entry point to explore the results collected from the executions ● Provides insight into the obtained results through continuously evolving data views ● Online results at: http://hadoop.bsc.es
  29. ALOJA Platform: Evolution and status ● Benchmarking, repository, and analytics tools for Big Data ● Composed of open-source ● benchmarking, provisioning and orchestration tools ● high-level system performance metric collection ● low-level Hadoop instrumentation based on BSC Tools ● web-based data analytics tools ● and recommendations ● Online Big Data benchmark repository of 20,000+ runs (from HiBench) ● sharable, comparable, repeatable, verifiable executions ● Abstracting and leveraging tools for BD benchmarking ● not reinventing the wheel, but: ● most current BD tools are designed for production, not for benchmarking ● leverages current compatible tools and projects ● Dev VM toolset and sandbox via Vagrant
  30. Workflow in ALOJA ● Cluster(s) definition: VM sizes, # nodes, OS, disks, capabilities ● Execution plan: start cluster, exec benchmarks, gather results, cleanup ● Import data: convert perf metrics, parse logs, import into DB ● Evaluate data: data views in the Vagrant VM or at http://hadoop.bsc.es ● PA and KD: predictive analytics, knowledge discovery ● Historic repo
  31. Benchmark execution comparisons ● You can compare, side by side, all execution parameters: CPU, memory, network, disk, Hadoop parameters… ● Sample: http://hadoop.bsc.es/perfcharts?execs[]=91144
  32. HiBench suite ● HiBench: A Benchmark Suite for Hadoop ● a comprehensive & realistic benchmark suite ● Micro benchmarks: Sort, WordCount, TeraSort ● HDFS: Enhanced DFSIO ● Web search: Nutch Indexing, PageRank ● Machine learning: Bayesian Classification, K-Means Clustering ● Code at: https://github.com/intel-hadoop/HiBench
  33. Job resource requirements 1/2. Source: Intel HiBench
  34. Job resource requirements 2/2. Source: Intel HiBench
  35. Impact of SW configurations on speedup ● Charts: speedup (higher is better) by number of mappers (4m, 6m, 8m, 10m) and by compression algorithm (no comp., ZLIB, BZIP2, snappy) ● Results using: http://hadoop.bsc.es/configimprovement ● Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
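
As an aside, the compression-codec dimension in that comparison is a one-line configuration switch; a hedged sketch with Hadoop 2.x property names (values illustrative):

    import org.apache.hadoop.conf.Configuration;

    public class CompressionSketch {
        public static Configuration withSnappy() {
            Configuration conf = new Configuration();
            // Compress intermediate map output with Snappy.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.set("mapreduce.map.output.compress.codec",
                     "org.apache.hadoop.io.compress.SnappyCodec");
            return conf;
        }
    }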
  36. Impact of HW configurations on speedup ● Charts: speedup (higher is better) for disks and network (HDD-ETH, HDD-IB, SSD-ETH, SSD-IB) and for cloud remote volumes (local only; 1, 2, or 3 remotes; 1, 2, or 3 remotes with /tmp local) ● Results using: http://hadoop.bsc.es/configimprovement ● Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
  37. Cost/performance scalability ● Terasort (100 GB) ● Charts: execution time and execution cost ● Sample from: http://hadoop.bsc.es/nodeseval
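
The cost side of such a chart is simple arithmetic over the cluster price and the measured runtime; every number in this sketch is hypothetical:

    public class CostOfRun {
        public static void main(String[] args) {
            int nodes = 8;                  // hypothetical cluster size
            double pricePerNodeHour = 0.50; // hypothetical $/node-hour
            double execSeconds = 1800;      // hypothetical Terasort runtime
            double cost = nodes * pricePerNodeHour * (execSeconds / 3600.0);
            System.out.printf("execution cost: $%.2f%n", cost); // $2.00 here
        }
    }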
  38. Cost-effectiveness: on-premise vs. cloud ● Chart axes: price vs. performance ● Configurations: InfiniBand + SSD (local) ● GbE + SSD (local) ● InfiniBand + SATA disks (local) ● GbE + SATA disks (local) ● cloud (local disk for /tmp and HDFS) ● cloud (/tmp on local disk, HDFS on blob storage, 1–3 devices) ● cloud (/tmp and HDFS on blob storage, 1–3 devices) ● Details at: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
  39. Common benchmarking pitfalls ● Scalability ● assuming near-linear scalability ● Compare apples to apples ● if benchmarking HW, change the HW but leave the SW the same ● Terasort in v1 != Terasort in v2 ● To test for Big Data, use large data ● stress the system ● If results are too good to be true, they probably aren't ● don't believe in miracles ● expect vendor lies. Source: adapted from "Benchmarking Big Data Systems" by Yanpei Chen and Gwen Shapira at Big Data Spain
  40. Resources ● ALOJA benchmarking platform and online repository ● http://hadoop.bsc.es/ ● Big Data Benchmarking Community (BDBC) mailing list ● (~200 members from ~80 organizations) ● http://clds.sdsc.edu/bdbc/community ● Workshop on Big Data Benchmarking (WBDB) ● next: http://clds.sdsc.edu/wbdb2015.ca ● SPEC Research Big Data working group ● http://research.spec.org/working-groups/big-data-working-group.html ● Slides and video: ● Michael Frank on Big Data benchmarking ● http://www.tele-task.de/archive/podcast/20430/ ● Tilmann Rabl, Big Data benchmarking tutorial ● http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
  41. @BDOOP_BCN Benchmarking Hadoop by Nicolas Poggi @ni_po June 2, 2015
