Hw09 Optimizing Hadoop Deployments
- 1. Optimizing Hadoop* Workloads
Nurcan Coskun
Intel Software & Solutions Group
October 2, 2009
Acknowledgements to Jason Dai, Intel SSG, for many
of the test results and optimization techniques
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may
be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2009, Intel Corporation.
- 2. Legal Disclaimers
Disclaimers & Legal Notices
THE INFORMATION IS FURNISHED FOR INFORMATIONAL USE ONLY, IS SUBJECT TO CHANGE WITHOUT NOTICE, AND SHOULD
NOT BE CONSTRUED AS A COMMITMENT BY INTEL CORPORATION. INTEL CORPORATION ASSUMES NO RESPONSIBILITY OR
LIABILITY FOR ANY ERRORS OR INACCURACIES THAT MAY APPEAR IN THIS DOCUMENT OR ANY SOFTWARE THAT MAY BE
PROVIDED IN ASSOCIATION WITH THIS DOCUMENT. THIS INFORMATION IS PROVIDED "AS IS" AND INTEL DISCLAIMS ANY
EXPRESS OR IMPLIED WARRANTY, RELATING TO THE USE OF THIS INFORMATION INCLUDING WARRANTIES RELATING TO
FITNESS FOR A PARTICULAR PURPOSE, COMPLIANCE WITH A SPECIFICATION OR STANDARD, MERCHANTABILITY OR
NONINFRINGEMENT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate
performance of Intel products as measured by those tests. Any difference in system hardware or software design or
configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance
of systems or components they are considering purchasing. For more information on performance tests and on the
performance of Intel products, visit Intel Performance Benchmark Limitations
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR
IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT
AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY
WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL
PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY,
OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED
IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE
FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the
absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future
definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The
information here is subject to change without notice. Do not finalize a design with this information. The products described in
this document may contain design defects or errors known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to
obtain the latest specifications and before placing your product order. Copies of documents which have an order number and
are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's
Web Site http://www.intel.com/.
- 3. Why Optimize Hadoop Deployments?
Handle More Data
At Lower Cost
In Less Time
With Less Power
- 4. Where to Optimize?
Hardware Software
- 5. Hadoop Servers
Masters: JobTracker, NameNode, Secondary NameNode
– Deploy additional RAM and secondary power supplies
– Ensure highest performance and reliability
Slaves: DataNodes, TaskTrackers
– Hadoop Framework handles slave failures well
– Data blocks are replicated and distributed
– Workload may be bound by I/O, memory or processor resources
– The system level hardware should be adjusted on a
case-by-case basis
- 6. Server Platform
•Dual-socket servers are optimal for Hadoop deployments
•Dual-socket servers are more efficient than large-scale
multiprocessor platforms from a per-node, cost-benefit perspective
•Dual-socket servers offset the added per-node hardware cost
relative to entry-level servers through superior efficiencies in
terms of load-balancing and parallelization overheads
•Choosing hardware based on the most current platform
technologies available helps to ensure the optimal intra-server
throughput and efficiency
- 7. Processor Choice Matters
Faster
Handles More Data
More Energy Efficient
- 8. Processor Choice Impacts Speed
Data Source: Intel internal measurements by using Hadoop 0.19.1 as of September 20, 2009.
Hardware configurations are on slide 22.
- 9. Processor Choice Impacts Throughput
• Throughput = # of tasks completed / minute when cluster is at 100% utilization.
• Intel Xeon processor 5500 provides up to 86% more throughput than 5400 series.
Data Source: Intel internal measurements by using Hadoop 0.19.1 as of September 20, 2009.
Hardware configurations are on slide 22.
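As a rough illustration of the throughput metric defined on this slide (tasks completed per minute while the cluster is fully utilized), the Python sketch below computes throughput and the relative gain between two clusters. The function names and the task counts are hypothetical and are not taken from Intel's measurements.

```python
def throughput_per_minute(tasks_completed: int, elapsed_seconds: float) -> float:
    """Tasks completed per minute over a measurement window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tasks_completed / (elapsed_seconds / 60.0)


def relative_gain_percent(baseline: float, candidate: float) -> float:
    """How much higher candidate is than baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate / baseline - 1.0) * 100.0


# Hypothetical numbers, chosen only to exercise the formula.
old_cluster = throughput_per_minute(300, 600)  # 300 tasks in 10 minutes
new_cluster = throughput_per_minute(450, 600)  # 450 tasks in 10 minutes
gain = relative_gain_percent(old_cluster, new_cluster)
```

Both measurements should be taken at full cluster utilization, as the definition requires, for the comparison to be meaningful.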
- 10. Processor Scaling
[Figure: two panels plotting JavaSort completion time (seconds) against number of nodes (1 to 7), one curve per data set size. Left panel: Intel® Xeon® Processor 5400 Series (Harpertown) cluster, y-axis 0 to 30000. Right panel: Intel® Xeon® Processor 5500 Series (Nehalem) cluster, y-axis 0 to 20000. Legend entries: 1 GB through 10 GB, 50 GB, 100 GB, 150 GB and 200 GB; 250 GB appears in one panel's legend only. Lower values are better.]
•Hadoop workloads scale well on Intel processors
•Intel® Xeon® processor 5500 series can handle larger data sizes than the 5400 series.
Data Source: Intel internal measurements by using Hadoop 0.19.0 as of September 20, 2009.
Hardware configurations are on slide 21.
- 11. Turn on Intel® Hyper-threading Technology
Intel® Hyper-threading Technology increases performance for threaded applications, delivering greater throughput and responsiveness.
[Figure: SMT effect in an 8-node Intel® Xeon® Processor 5500 Series (Nehalem) cluster. JavaSort completion time (seconds, y-axis 0 to 250) against data set size (1 GB to 10 GB), SMT ON versus SMT OFF. Lower values are better.]
Up to 25% better performance
Data Source: Intel internal measurements by using Hadoop 0.19.0 as of September 20, 2009.
Hardware configurations are on slide 21.
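One way to confirm that Hyper-threading (SMT) is actually active on a Linux node is to compare the "siblings" and "cpu cores" fields that x86 Linux kernels report in /proc/cpuinfo. The sketch below is an assumption about how one might check this, not something shown in the slides; the function name and sample text are hypothetical.

```python
def smt_enabled(cpuinfo_text: str) -> bool:
    """Return True if any CPU entry reports more siblings than physical cores."""
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "siblings":
            siblings = int(value)
        elif key == "cpu cores":
            cores = int(value)
        # Once both fields of one entry are known, compare and reset.
        if siblings is not None and cores is not None:
            if siblings > cores:
                return True
            siblings = cores = None
    return False


# Hypothetical excerpts for a quad-core processor.
SAMPLE_SMT_ON = "processor : 0\nsiblings : 8\ncpu cores : 4\n"
SAMPLE_SMT_OFF = "processor : 0\nsiblings : 4\ncpu cores : 4\n"
```

On a real node the text would come from reading /proc/cpuinfo, for example `smt_enabled(open("/proc/cpuinfo").read())`.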
- 12. Memory
•Sufficient memory capacity is critical for efficient operation of
servers in a Hadoop cluster, supporting high throughput by
allowing a large number of map/reduce tasks to be carried out
simultaneously
•Typical Hadoop applications require approximately 1-2 GB of
RAM per processor core, which corresponds to 8-16GB for a
dual-socket server using quad-core processors
•Error Correcting Code (ECC) memory is highly recommended
to detect and correct errors introduced during storage and
transmission of data
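The sizing rule above is simple arithmetic, sketched here in Python. The function name and parameters are hypothetical; the 1-2 GB per core figure is the one given on the slide.

```python
def ram_range_gb(sockets: int, cores_per_socket: int,
                 gb_per_core_low: int = 1, gb_per_core_high: int = 2) -> tuple:
    """Lower and upper RAM estimate (GB) for one node, from its core count."""
    cores = sockets * cores_per_socket
    return cores * gb_per_core_low, cores * gb_per_core_high


# Dual-socket server with quad-core processors, as in the slide's example.
low_gb, high_gb = ram_range_gb(sockets=2, cores_per_socket=4)
```

For the dual-socket, quad-core case this reproduces the 8-16 GB range quoted on the slide.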
- 13. Selecting Server Motherboard
•Select server motherboards which are optimized for high
density computing environments.
– They should use high efficiency voltage regulators
– They need to be optimized for airflow
– They should use certified power supplies
•Optimized server motherboards will use less power, need less
cooling, and save money
- 14. Hard Disk and SSD
•Use a large number of hard drives per server (4-6)
•Hadoop orchestrates data provisioning and redundancy across
individual nodes (using RAID 0 is not needed)
•SSDs are faster and require very little power; using SSDs
also eliminates the cooling cost created by hard disk drives
•Use SSDs:
– To store mission-critical smaller data sets
– To store map/reduce intermediate results
– In place of HDDs, to reduce power consumption,
increase throughput and improve performance
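A minimal sketch of how this disk layout might be expressed in Hadoop configuration: each drive is mounted separately (no RAID) and listed individually, with HDFS blocks on the hard drives and map/reduce intermediate results on the SSDs. The property names `dfs.data.dir` and `mapred.local.dir` are the Hadoop 0.19/0.20-era names as I understand them and do not appear in the slides; the mount points and helper function are hypothetical.

```python
def join_dirs(mount_points, subdir):
    """Comma-separated directory list, one entry per mount point."""
    return ",".join(f"{m.rstrip('/')}/{subdir}" for m in mount_points)


# Hypothetical mount points: four HDDs for HDFS blocks,
# two SSDs for map/reduce intermediate results.
hdd_mounts = ["/data/hdd1", "/data/hdd2", "/data/hdd3", "/data/hdd4"]
ssd_mounts = ["/data/ssd1", "/data/ssd2"]

properties = {
    "dfs.data.dir": join_dirs(hdd_mounts, "dfs/data"),
    "mapred.local.dir": join_dirs(ssd_mounts, "mapred/local"),
}
```

Listing each disk as its own directory lets Hadoop spread I/O across the drives itself, which is why striping them with RAID 0 is unnecessary.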
- 15. Use Intel® X25-E SATA SSD’s
[Figure: 10-node Intel® Xeon® L5520 (Nehalem) cluster. JavaSort completion time (seconds, y-axis 0 to 2500) against data set size (1 GB, 10 GB, 50 GB, 80 GB, 100 GB), HDD versus SSD. Lower values are better.]
Data Source: Intel internal measurements by using Hadoop 0.19.0 as of September 20, 2009.
Hardware configurations are on slide 23.
- 16. System Software
•Use a Linux* distribution based on kernel version 2.6.30 or
later because of the optimizations included for energy and
threading efficiency
– For Example: energy consumption can be up to 60 percent
(42 watts) higher at idle for each server using older
versions of Linux
•Optimize Linux* file system configurations
– Noatime attribute
– Open file descriptor limit
•Use latest Java (for example Sun Java* 6u14)
– Use 64-bit optimized JVM builds
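The two file system settings named above can be checked from a script. The sketch below assumes the usual Linux /proc/mounts layout (device, mount point, file system type, options) and uses Python's standard `resource` module, which is available on Unix only; the function names, sample text and mount points are hypothetical.

```python
import resource


def mounts_missing_noatime(mounts_text, mount_points):
    """Return the given mount points whose mount options lack 'noatime'."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, options = fields[1], fields[3].split(",")
        if mount_point in mount_points and "noatime" not in options:
            missing.append(mount_point)
    return missing


def open_file_limit():
    """Soft and hard limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


# Hypothetical /proc/mounts excerpt for two Hadoop data disks.
SAMPLE_MOUNTS = (
    "/dev/sdb1 /data/hdd1 ext3 rw,noatime 0 0\n"
    "/dev/sdc1 /data/hdd2 ext3 rw,relatime 0 0\n"
)
to_fix = mounts_missing_noatime(SAMPLE_MOUNTS, {"/data/hdd1", "/data/hdd2"})
```

Mounting data disks with noatime avoids a metadata write on every file read, and a low descriptor limit can become a bottleneck on nodes that keep many block and intermediate files open at once.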
- 17. Hadoop Configuration Tuning
•The number of NameNode and JobTracker threads (10 -> 64)
•The number of DataNode server threads (3 -> 8)
•The number of worker threads on the HTTP server that runs on each TaskTracker
(40-50)
•HDFS replication factor (3)
•Default HDFS block size (64MB -> 128MB)
•Maximum number of map/reduce tasks per node
– (cores_per_node)/2 -> 2*(cores_per_node)
•The number of input streams (files) to be merged at once in map/reduce
tasks (example: 100)
•JVM settings
•The total size of result and metadata buffers associated with a map task
(100MB -> 200MB)
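The slide lists the tuning knobs without naming the configuration properties. The Python sketch below maps them to the Hadoop 0.19/0.20-era property names as I understand them; that mapping is an assumption, not something the slides state. It renders the result as a `<configuration>` fragment. The per-node task maximum is applied to map and reduce slots separately, which is one possible reading of the slide, and JVM settings are left out because the slide gives no values.

```python
import xml.etree.ElementTree as ET


def tuned_properties(cores_per_node: int) -> dict:
    """Tuned values from the slide, keyed by assumed Hadoop property names."""
    slots = 2 * cores_per_node  # upper value on the slide: 2*(cores_per_node)
    return {
        "dfs.namenode.handler.count": 64,         # NameNode threads, 10 -> 64
        "mapred.job.tracker.handler.count": 64,   # JobTracker threads, 10 -> 64
        "dfs.datanode.handler.count": 8,          # DataNode threads, 3 -> 8
        "tasktracker.http.threads": 50,           # slide gives 40-50
        "dfs.replication": 3,
        "dfs.block.size": 128 * 1024 * 1024,      # 128 MB, in bytes
        "mapred.tasktracker.map.tasks.maximum": slots,
        "mapred.tasktracker.reduce.tasks.maximum": slots,
        "io.sort.factor": 100,                    # streams merged at once
        "io.sort.mb": 200,                        # map-side buffer, 100 -> 200 MB
    }


def to_configuration_xml(props: dict) -> str:
    """Render properties as a Hadoop-style <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Example for a dual-socket, quad-core node (8 cores).
props = tuned_properties(cores_per_node=8)
xml_text = to_configuration_xml(props)
```

These values are starting points from the slide; as the summary notes, the right settings may differ by workload type.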
- 18. System-stack Example
Two-way Intel® Xeon® processor 5500 series
Intel® X25-E SATA SSD’s
Four to six 7200 RPM SATA drives
12-24 GB DDR3 ECC RAM
Intel® Server Board S5500WB
80 PLUS* Gold Certified power supplies
Linux* based on kernel 2.6.30 or later
Sun Java* 6u14 or later
Hadoop* (0.18.3 or 0.20.0)
- 19. Summary
Hardware selection:
• Intel® Xeon® 5500 (“Nehalem”) improves Hadoop Workload
performance
• Choosing an optimized server board such as Intel® SB5500WB
(“WillowBrook”) can reduce power consumption
• Use Intel® X25-E SATA SSD’s to improve performance
Software & configurations:
• Use latest Linux kernel
• Turn on Intel® Hyper-threading
• Optimize Hadoop Configuration
• Tuning may be different for different workload types
- 20. References:
1. http://www.intel.com/p/en_US/products/server/processor
2. http://www.intel.com/it/pdf/server-rightsizing.pdf
3. http://www.80plus.org/
4. https://opencirrus.org/content/agenda-open-cirrus-summit-palo-alto-june-8-9-2009
- 21. Cluster Configurations Information
(Slides: “Processor Scaling” and “Turn on Intel® Hyper-threading”)
Hardware Configuration
Item             | Endeavor                       | Atlantis
Node count       | 1-10 nodes                     | 1-10 nodes
Platform         | Intel SR1600UR                 | Intel SR1560SF system
                 | Intel S5520UR main board       | Intel S5400SF main board
                 | 1U chassis                     | 1U chassis
CPU/Stepping     | Intel® Xeon® X5560 C1 step     | Intel® Xeon® X5482; C0 step
                 | (Nehalem EP)                   | (Harpertown)
                 | 2.8GHz / 6.4 QPI 1333 95 W     | 3.2 GHz / 12 MB L2 cache
                 | 1MB L2 cache, 8M L3 cache      |
RAM              | 24 GB total/node               | 16 GB
                 | 6*4GB 1333MHz Reg ECC DDR3     | (FBDIMM 8x2-GB 667MHz)
Chipset          | Tylersburg                     | Seaburg
BIOS Version     | Rev 26                         | Rev 22.1
                 | 08 Apr 2008                    | 7 Nov 2007
Interconnects    | Gigabit Ethernet               | Gigabit Ethernet
                 | QDR InfiniBand                 | DDR InfiniBand
Hard drive specs | Seagate Cheetah NS             | Seagate Barracuda ES
                 | 400 GB SAS HDD 10kRPM          | 250 GB SATA HDD
                 | Model: ST3400755SS             | Model: ST3250620NS
Using onboard Intel Entry Level RAID controller
- 22. Cluster Configurations Information
(Slides: “Processor Choice Impacts Speed” and
“Processor Choice Impacts Throughput”)
Intel® Xeon® X5460-based server
Processor: Dual-socket quad-core Intel® Xeon® X5460 3.16GHz
Memory: 16GB (DDR2 FBDIMM ECC 667MHz) RAM
Storage: 1 X 300GB 15K RPM SAS disk for system and log files, 4 X 1TB 7200RPM SATA for HDFS and
intermediate results
Network: 1 Gigabit Ethernet NIC
BIOS: BIOS version S5000.86B.10.60.0091.100920081631; EIST (Enhanced Intel SpeedStep Technology)
disabled; both hardware prefetcher and adjacent cache-line prefetch disabled
Intel® Xeon® X5570-based server
Processor: Dual-socket quad-core Intel® Xeon® X5570 2.93GHz
Memory: 16GB (DDR3 ECC 1333MHz) RAM
Storage: 1 X 1TB 7200RPM SATA for system and log files, 4 X 1TB 7200RPM SATA for HDFS and
intermediate results
Network: 1 Gigabit Ethernet NIC
BIOS: BIOS version 4.6.3; both EIST (Enhanced Intel SpeedStep Technology) and Turbo mode disabled;
both hardware prefetcher and adjacent cache-line prefetch enabled; SMT (Simultaneous MultiThreading)
enabled
(Disabling hardware prefetcher and adjacent cache-line prefetch helps improve Hadoop
performance on the Xeon X5460 server according to our benchmarking.)
- 23. Cluster Configurations Information
(Slides: “Use Intel® X25-E SATA SSD’s”)
Slaves:
• Intel® Xeon® L5520 Processor (Nehalem) @ 2.27 GHz CPUs, 5.8 GB/sec QPI, 24 GB RAM
• Server Board: Intel® SB5500WB (Willowbrook)
• 1x 1 TB SATA HDD boot disk, holds ${HOME} dirs: /
• 2x 1 TB SATA HDD scratch/experiment disks:
• 2x 64 GB Intel® X25-E SATA SLC SSD scratch/experiment disks
•OS: Ubuntu* 9.04 == 2.6.28-4 kernel (to enable power saving with preserved performance)
Master:
•Intel® Xeon® Processor 2.93 GHz CPUs, 6.4 GB/sec QPI, 16 GB RAM
•Server Board: Intel® SB5500WB (Willowbrook)
•Hard Disks:
• 1x 500 GB SATA OS boot disk (/dev/sda1), holds installed software
and ${HOME} dirs
• 2x 500 GB SATA scratch disks
• 2x64 GB Intel® X25-E SATA SLC SSDs
•OS: RedHat* Enterprise Linux 5.3 Server x64t