HIGH PERFORMANCE HARDWARE FOR DATA ANALYSIS
Michael Pittaro
Michael_Pittaro@dell.com
Open Data Science Conference, Boston 2015
@opendatasci
WWW.SLIDESHARE.NET/LHRC_MIKEYP
WWW.GITHUB.COM/LHRC-MIKEYP
@pmikeyp
mikeyp@acm.org
About This Talk
• We can’t cover everything about hardware in a 30-minute session.
• We can go deep enough to help you
– Understand tradeoffs and balanced architectures
– Ask the right questions about choices
– Learn from what others are doing
• My Approach Today
1. Why look at high performance hardware?
2. Look at a production cluster design
3. Look at the choices and tradeoffs behind the scenes
Why Consider High Performance Hardware?
• Choice of hardware can have large impacts
– On performance
– On budget
• Understanding the hardware helps with the software
– Scalable and parallel systems deal with both
• Data is heavy
– Local clusters are persistent
– Large data transfer may not be a viable option.
• Cloud hosting may not be an option
– You can’t or won’t delegate critical infrastructure to third parties.
– You need every bit of performance you can get.
Choices, Choices - The Hardware Toolbox
• Servers
• Processors
• Memory
• Lack of Trusted Information
• Jargon
• Disk Drives
• Networking
What the Customer Wants
• Performance
• Reliability
• Predictability
• Cost
• Management
• Proven Solutions
• Tested Configurations
Reference Architectures Fill The Gap
• Tested Server Configurations
• Tested Network Configurations
• Recommended Software Configuration
– Application and Workload Software
– OS Infrastructure
– Operational Infrastructure
• Opinionated Point of View
– Based on real world experience
• Recommended starting point
– Customization is possible
The secret to a good architecture is balance
• Price
• Performance
• Fault Zones
• Application Workload
• Software
Cluster Architecture
• The Dell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance – Summary Specs

Cluster            Starter    Mid-Size   Small Enterprise   Maximum
Data Nodes         4          12         20                 44
Total Memory       1536 GB    4608 GB    7680 GB            16896 GB
Total Storage      176 TB     528 TB     880 TB             2112 TB
Processing Cores   80         240        400                880
Racks (42U)        1          2          2                  4

Data Node Characteristic   Configuration
Server                     Dell R720xd (2 Rack Units)
Processor                  Two Intel Xeon E5-2670 v2, 2.5 GHz, 25 MB cache, 10 cores each
Memory                     384 GB
Memory Speed               1866 MT/s DRAM
Disks                      12 x 4 TB SATA, 3.0 Gbps (48 TB)
Networking                 Dual 10GbE interfaces, with active bonding
Management Network         Two 1GbE interfaces
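
The memory and core totals in the summary table follow from multiplying the per-node figures by the number of data nodes. A minimal sketch of that arithmetic, assuming every data node carries the same 384 GB and two 10-core processors listed above; the name `cluster_totals` is illustrative only. Storage is left out because usable capacity depends on how the disks in each node are allocated, which the slide does not state.

```python
# Per-node figures taken from the data node configuration on the slide.
CORES_PER_NODE = 2 * 10        # two 10-core Xeon E5-2670 v2 processors
MEMORY_GB_PER_NODE = 384


def cluster_totals(data_nodes: int) -> dict:
    """Scale the per-node figures to a cluster with `data_nodes` data nodes."""
    return {
        "data_nodes": data_nodes,
        "cores": data_nodes * CORES_PER_NODE,
        "memory_gb": data_nodes * MEMORY_GB_PER_NODE,
    }


if __name__ == "__main__":
    sizes = [("Starter", 4), ("Mid-Size", 12), ("Small Enterprise", 20), ("Maximum", 44)]
    for name, nodes in sizes:
        print(name, cluster_totals(nodes))
```

The same function can be used to size an intermediate configuration, for example `cluster_totals(30)`.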
Server Examples
• M1000e Blade Chassis (10U)
• 4 Socket R920 (4U)
• 2 Socket R730xd (2U)
Server Choices
• 4 Socket Servers (e.g. Dell R920)
– Optimized for enterprise applications: large RDBMS servers, SAP, SAP HANA, Microsoft Exchange
– Very large memory available (6 TB)
– Often use direct or network attached storage
• ‘Blade’ Servers (e.g. Dell M620, M1000e Chassis)
– Pluggable processor and storage modules
– Backplane and chassis have a lot of shared interconnect logic
– Flexibility for enterprise applications; virtualization is popular
• 2 Socket Servers (e.g. Dell R620, R630, R720, R730)
– Many options available
– 1U and 2U chassis footprints
– Developed for web hosting and large scale-out clusters
– Dell internal storage: 12 x 3.5” drives or 24 x 2.5” drives (in chassis)
Selecting Processors
• Assume 1-1.5 Hadoop tasks per core
– Allows headroom for other processes
• Hyperthreading
– Enable for Hadoop, Spark
– For others: it depends
• Hadoop: aim for 1 core / disk spindle
• Impala: can handle more spindles and cores easily
• Spark: I/O depends on back end storage
• Faster processor is better
– Most Hadoop jobs are I/O bound, not processor bound
– Hadoop compression uses processor cycles
– Fewer cores with a faster clock is often a good tradeoff
– The Map / Reduce balance depends on actual workload
– It’s hard to optimize more without knowing the actual workload
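
The tasks-per-core and core-per-spindle rules of thumb can be turned into quick arithmetic for a candidate node. The sketch below is illustrative only: the function names are hypothetical, and the example node (20 cores, 12 spindles) is the R720xd data node described earlier in the talk.

```python
import math
from typing import Tuple


def hadoop_task_range(cores: int, low: float = 1.0, high: float = 1.5) -> Tuple[int, int]:
    """Range of concurrent Hadoop tasks for a node, at 1 to 1.5 tasks per core."""
    return math.floor(cores * low), math.floor(cores * high)


def cores_per_spindle(cores: int, spindles: int) -> float:
    """Core-to-spindle ratio; the talk suggests aiming for about 1.0 with Hadoop."""
    return cores / spindles


if __name__ == "__main__":
    cores, spindles = 20, 12  # example node: 2 x 10 cores, 12 disks
    print("task range:", hadoop_task_range(cores))
    print("cores per spindle:", round(cores_per_spindle(cores, spindles), 2))
```

A ratio well above 1 suggests the node has more cores than the disks can keep busy for I/O-bound Hadoop jobs; whether that matters depends on the actual workload.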
Intel Xeon Dual Socket Processor Architecture
• Haswell CPU
– Up to 18 cores
– TDP: up to 145 W (SVR); 160 W (WS)
• Socket: Socket-R3
• Scalability: 2S capability
• Memory
– 4 x DDR4 channels
– 1333, 1600, 1866 (2 DPC), 2133 (1 DPC)
– RDIMM, LRDIMM
• QPI
– 2 x QPI 1.1 channels
– 6.4, 8.0, 9.6 GT/s
• PCIe
– PCIe 3.0 (2.5, 5, 8 GT/s)
– PCIe extensions: Dual Cast, Atomics
– 40 x PCIe 3.0 lanes
[Block diagram: two Intel Xeon E5-2600 v3 processors linked by 2 QPI channels, each with four DDR4 channels and 40 PCIe 3.0 lanes; Intel C610 series chipset (WBG); LAN up to 4x10GbE]
Intel Processor Generations

Product             Xeon E5-2600         E5-2600 v2           E5-2600 v3
Microarchitecture   Sandy Bridge         Ivy Bridge           Haswell
Cores / Threads     8 / 16               12 / 24              18 / 36
Last Level Cache    Up to 20 MB          Up to 30 MB          Up to 45 MB
Max Memory Speed    1600 MT/s DDR3       1866 MT/s DDR3       2133 MT/s DDR4
QPI channels        2                    2                    2
QPI (GT/s)          6.4, 7.2, 8.0        6.4, 7.2, 8.0        6.4, 8.0, 9.6
Max DIMMs           12                   12                   12
Max Clock Speed     3.1 GHz / 3.8 GHz    3.7 GHz / 3.8 GHz    3.7 GHz / 3.8 GHz
Process Tech        32 nm                22 nm                22 nm
Year                2012                 2013                 2014
Selecting Memory
• DDR3 versus DDR4, RDIMM versus LRDIMM
– DDR3 is cheaper now, DDR4 is faster (15%)
• DIMM Sizes
– 8GB, 16GB, 32GB, 64GB, 128GB
• Sweet Spot Varies
– DDR4 around 32GB right now
• Balance the memory banks
– 4 memory channels per processor
– 4 x 16GB better than 2 x 32GB
• Server Class Memory
– It’s all ECC checked
– Dell Server BIOS options to optimize checking method
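
The advice to balance the memory banks can be illustrated with peak-bandwidth arithmetic. This is a rough sketch under stated assumptions: each DDR3/DDR4 channel is 64 bits (8 bytes) wide, DIMMs are placed one per channel before any channel gets a second DIMM, and peak bandwidth scales with the number of populated channels. Real throughput is lower and depends on the workload. The function name is hypothetical.

```python
BYTES_PER_TRANSFER = 8     # a DDR3/DDR4 memory channel is 64 bits wide
CHANNELS_PER_SOCKET = 4    # four memory channels per processor, as on the slide


def peak_bandwidth_gb_s(mt_per_s: int, dimms_per_socket: int) -> float:
    """Theoretical peak memory bandwidth per socket in GB/s (decimal).

    Assumes one DIMM per channel until all channels are populated.
    """
    populated = min(dimms_per_socket, CHANNELS_PER_SOCKET)
    return populated * mt_per_s * BYTES_PER_TRANSFER / 1000


if __name__ == "__main__":
    # Same 64 GB per socket, populated two different ways at 1866 MT/s.
    print("4 x 16GB:", peak_bandwidth_gb_s(1866, 4), "GB/s")
    print("2 x 32GB:", peak_bandwidth_gb_s(1866, 2), "GB/s")
```

Under these assumptions, the two-DIMM layout leaves half the channels empty and halves the theoretical peak, which is the reasoning behind preferring 4 x 16GB over 2 x 32GB.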
Selecting Disks
• 3.5” Drives
– 3TB, 4TB, 6TB per drive
– Pricing sweet spot is 3TB
– Use enterprise grade drives, not consumer!
– SATA or SAS. SAS is slightly faster.
– 3.0 Gb/sec is fine; 6.0 Gb/sec is a waste with spinning drives
• 2.5” Drives
– 800GB and 1.2TB
– More expensive than 3.5” drives
– More spindles and performance
• SATA Solid State Drives
– 6.0 Gb/sec
– 2.5” and 1.8” options
– Expensive for now
– Not as deterministic as spindles
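
The point that a 3.0 Gb/sec link is enough for spinning drives can be checked with a short calculation. SATA uses 8b/10b encoding, so each payload byte costs 10 bits on the wire. The sustained drive rate used below (150 MB/s) is an assumed, illustrative figure for a spinning disk, not a number from the talk; the function names are hypothetical.

```python
ASSUMED_HDD_SEQUENTIAL_MB_S = 150.0   # illustrative assumption, not from the slides


def sata_payload_mb_per_s(line_rate_gbps: float) -> float:
    """Usable SATA payload rate in MB/s, given 8b/10b encoding (10 wire bits per byte)."""
    return line_rate_gbps * 1e9 / 10 / 1e6


def link_is_bottleneck(line_rate_gbps: float, drive_mb_s: float) -> bool:
    """True if the drive can stream faster than the link can carry."""
    return drive_mb_s > sata_payload_mb_per_s(line_rate_gbps)


if __name__ == "__main__":
    for rate in (3.0, 6.0):
        print(rate, "Gb/s link:", sata_payload_mb_per_s(rate), "MB/s payload,",
              "bottleneck:", link_is_bottleneck(rate, ASSUMED_HDD_SEQUENTIAL_MB_S))
```

If the drive's sustained rate stays below the payload rate of the slower link, the faster link adds no throughput for that drive. Solid state drives are a different case because they can exceed it.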
Spindle / Core / Storage Depth Optimization
• Hadoop scales processing and storage together
– The cluster grows by adding more data nodes
– The ratio of processor to storage is the main adjustment
• Generally, aim for a 1 spindle / 1 core ratio
– I/O is large blocks (64 MB to 256 MB)
– Primarily sequential read/write, very little random I/O
– 8 tasks will be reading or writing 8 individual spindles
• Drive Sizes and Types
– NL SAS or Enterprise SATA 6 Gb/sec
– Drive size is mainly a price decision
• Depth per node
– Up to 48 TB/node is common
– 112 TB/node is possible
– Consider how much data is ‘active’
– Very deep storage impacts recovery performance
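
The remark that very deep storage impacts recovery performance can be made concrete with a back-of-the-envelope estimate of how long the cluster needs to re-replicate the data of a failed node. Everything here is an assumption for illustration: the node is full, terabytes are decimal, and the rest of the cluster sustains an aggregate re-replication rate of 1 GB/s. The function name is hypothetical and the real rate depends on cluster size, network, and configuration.

```python
def rereplication_hours(node_tb: float, cluster_rate_gb_per_s: float) -> float:
    """Hours needed to re-create `node_tb` terabytes at the given aggregate rate."""
    seconds = node_tb * 1000 / cluster_rate_gb_per_s   # 1 TB = 1000 GB (decimal)
    return seconds / 3600


if __name__ == "__main__":
    assumed_rate = 1.0  # GB/s, an illustrative assumption
    for depth_tb in (48, 112):
        hours = rereplication_hours(depth_tb, assumed_rate)
        print(f"{depth_tb} TB/node: about {hours:.1f} hours at {assumed_rate} GB/s")
```

The estimate scales linearly with depth per node, so at a fixed recovery rate a 112 TB node takes more than twice as long to recover as a 48 TB node.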
PowerEdge C8000 Hadoop Scaling - 16 core Xeon
[Chart: (1) 12-spindle 3 TB versus (3) 6-spindle 3 TB configurations. Axis label: TB Storage. Series: Cores (1), Storage (1), IOPS (1), Storage (3), IOPS (3).]
Network Architecture – Layer 2 Switching
Network and Switches
• Simple Tree Structure
– Top of Rack (TOR) for each rack / group of nodes
– Racks feed up to a Cluster or Aggregation Switch
– All switching is at Layer 2 (Ethernet)
› No fancy routing or layer 3 (IP) packet inspection
– Most switches are 48 ports in this class
• Switch Characteristics
– Line rate switching at 10Gbps
– Deep buffers to handle bursts
– Virtual Link Trunking (VLT): two switches act as one, with failover
– Uplinks are 40GbE
• High Availability and Performance
– Use two 10GbE links to alternate switches
– Bond at the Linux level into a single device
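
One number worth checking in a tree like this is the oversubscription ratio of each top-of-rack switch, that is, node-facing bandwidth divided by uplink bandwidth. The sketch below uses the 10GbE node ports and 40GbE uplinks from the slide. The port and uplink counts in the example (48 node ports, 4 uplinks) are assumptions for illustration, and the function name is hypothetical.

```python
def oversubscription(node_ports: int, uplinks: int,
                     node_gbps: float = 10.0, uplink_gbps: float = 40.0) -> float:
    """Ratio of node-facing bandwidth to uplink bandwidth on a top-of-rack switch."""
    return (node_ports * node_gbps) / (uplinks * uplink_gbps)


if __name__ == "__main__":
    # Assumed example: 48 x 10GbE ports facing nodes, 4 x 40GbE uplinks.
    ratio = oversubscription(node_ports=48, uplinks=4)
    print(f"oversubscription: {ratio:.2f}:1")
```

A ratio of 1.0 means traffic leaving the rack can never be limited by the uplinks. Higher ratios rely on most traffic staying inside the rack.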
In the Wild – Dell Customer Hadoop Configurations

Model           Data Node Configuration                        Comments
R730xd          Dual socket, 12 cores, 24 x 2.5” spindles      Most popular platform for Hadoop
C8000           Dual socket, 16 cores, 16 x 3.5” spindles      Popular for deep/dense Hadoop applications
C6100 / C6105   Dual socket, 8/12 cores, 12 x 3.5” spindles    Two node version. C6100 is hardware EOL
C2100           Dual socket, 12 cores, 12 x 3.5” spindles      Popular, hardware EOL but often repurposed for Hadoop
R620            Dual socket, 8 cores, 10 x 2.5” spindles       1U form factor
C6220           Dual socket, 8 cores, 6 x 2.5” spindles        Core/spindle ratio is not ideal for Hadoop
What I haven’t talked about!
• GPUs
– Possible, not seen too often with Hadoop
• Ingest / Streaming
– Usually a custom configuration for high speed capture/loading (e.g. Kafka, Storm)
• Dell PowerEdge VRTX
– Designed as a ‘mini-blade’ for branch offices
– Could make a killer data science workstation
Download Links / References
• Dell.com/hadoop
– Hadoop Reference Architectures
– Optimizing PowerEdge Configurations for Hadoop
• Slideshare
– http://www.slideshare.net/lhrc-mikeyp
High Performance Hardware for Data Analysis
• Choosing hardware for big data analysis is difficult because of the many options and variables involved. The problem is more complicated when you need a full cluster for big data analytics.
• This session will cover the basic guidelines and architectural choices involved in choosing analytics hardware for Spark and Hadoop. I will cover processor core and memory ratios, disk subsystems, and network architecture. This is a practical, advice-oriented session, and will focus on performance and cost tradeoffs for many different options.

Mais conteúdo relacionado

Mais procurados

In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
DataWorks Summit
 
(Aaron myers) hdfs impala
(Aaron myers)   hdfs impala(Aaron myers)   hdfs impala
(Aaron myers) hdfs impala
NAVER D2
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 

Mais procurados (20)

HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
Global Azure Virtual 2020 What's new on Azure IaaS for SQL VMs
Global Azure Virtual 2020 What's new on Azure IaaS for SQL VMsGlobal Azure Virtual 2020 What's new on Azure IaaS for SQL VMs
Global Azure Virtual 2020 What's new on Azure IaaS for SQL VMs
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 
Apache HBase 1.0 Release
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 Release
 
Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars George
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
Conquering "big data": An introduction to shard query
Conquering "big data": An introduction to shard queryConquering "big data": An introduction to shard query
Conquering "big data": An introduction to shard query
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guide
 
Why new hardware may not make Oracle databases faster
Why new hardware may not make Oracle databases fasterWhy new hardware may not make Oracle databases faster
Why new hardware may not make Oracle databases faster
 
(Aaron myers) hdfs impala
(Aaron myers)   hdfs impala(Aaron myers)   hdfs impala
(Aaron myers) hdfs impala
 
Top 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloudTop 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloud
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Deploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache HadoopDeploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache Hadoop
 

Semelhante a High Performance Hardware for Data Analysis

SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4
UniFabric
 

Semelhante a High Performance Hardware for Data Analysis (20)

Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis
 
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
 
HPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big DataHPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big Data
 
Deploying ssd in the data center 2014
Deploying ssd in the data center 2014Deploying ssd in the data center 2014
Deploying ssd in the data center 2014
 
Webinar NETGEAR - ReadyNAS, le novità hardware e software
Webinar NETGEAR - ReadyNAS, le novità hardware e softwareWebinar NETGEAR - ReadyNAS, le novità hardware e software
Webinar NETGEAR - ReadyNAS, le novità hardware e software
 
Výhody a benefity nasazení Oracle Database Appliance
Výhody a benefity nasazení Oracle Database ApplianceVýhody a benefity nasazení Oracle Database Appliance
Výhody a benefity nasazení Oracle Database Appliance
 
SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4
 
Session 307 ravi pendekanti engineered systems
Session 307  ravi pendekanti engineered systemsSession 307  ravi pendekanti engineered systems
Session 307 ravi pendekanti engineered systems
 
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems SpecialistOWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
 
INCOSE Colorado Front Range Chapter Presentation - Technology Impact on Compu...
INCOSE Colorado Front Range Chapter Presentation - Technology Impact on Compu...INCOSE Colorado Front Range Chapter Presentation - Technology Impact on Compu...
INCOSE Colorado Front Range Chapter Presentation - Technology Impact on Compu...
 
Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016
 
Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...
Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...
Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
Oracle real application_cluster
Oracle real application_clusterOracle real application_cluster
Oracle real application_cluster
 
LCA13: Jason Taylor Keynote - ARM & Disaggregated Rack - LCA13-Hong - 6 March...
LCA13: Jason Taylor Keynote - ARM & Disaggregated Rack - LCA13-Hong - 6 March...LCA13: Jason Taylor Keynote - ARM & Disaggregated Rack - LCA13-Hong - 6 March...
LCA13: Jason Taylor Keynote - ARM & Disaggregated Rack - LCA13-Hong - 6 March...
 
FAQ
FAQFAQ
FAQ
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttu
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Presentation db2 best practices for optimal performance
Presentation   db2 best practices for optimal performancePresentation   db2 best practices for optimal performance
Presentation db2 best practices for optimal performance
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
 

Mais de odsc

Kaggle The Home of Data Science
Kaggle The Home of Data ScienceKaggle The Home of Data Science
Kaggle The Home of Data Science
odsc
 
The Art of Data Science
The Art of Data Science The Art of Data Science
The Art of Data Science
odsc
 
Frontiers of Open Data Science Research
Frontiers of Open Data Science ResearchFrontiers of Open Data Science Research
Frontiers of Open Data Science Research
odsc
 

Mais de odsc (20)

Understanding the Chief Data Officer
Understanding the Chief Data Officer Understanding the Chief Data Officer
Understanding the Chief Data Officer
 
Machine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge DiscoveryMachine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge Discovery
 
API Driven Development
API Driven Development API Driven Development
API Driven Development
 
Mobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata AnalysisMobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata Analysis
 
Productionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground UpProductionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground Up
 
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and HiveBig Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
 
Think Breadth, Not Depth
Think Breadth, Not DepthThink Breadth, Not Depth
Think Breadth, Not Depth
 
Data Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and InformationData Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and Information
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet
 
Building a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure MLBuilding a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure ML
 
Beyond Names
Beyond NamesBeyond Names
Beyond Names
 
How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500
 
Domain Expertise and Unstructured Data
Domain Expertise and Unstructured DataDomain Expertise and Unstructured Data
Domain Expertise and Unstructured Data
 
Kaggle The Home of Data Science
Kaggle The Home of Data ScienceKaggle The Home of Data Science
Kaggle The Home of Data Science
 
Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
 
Bridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source ToolsBridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source Tools
 
Top 10 Signs of the Textpocalypse
Top 10 Signs of the TextpocalypseTop 10 Signs of the Textpocalypse
Top 10 Signs of the Textpocalypse
 
The Art of Data Science
The Art of Data Science The Art of Data Science
The Art of Data Science
 
Frontiers of Open Data Science Research
Frontiers of Open Data Science ResearchFrontiers of Open Data Science Research
Frontiers of Open Data Science Research
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

High Performance Hardware for Data Analysis

  • 1. HIGH PERFORMANCE HARDWARE FOR DATA ANALYSIS Michael Pittaro Michael_Pittaro@dell.com O P E N D A T A S C I E N C E C O N F E R E N C E_ BOSTON 2015 @opendatasci
  • 2. WWW.SLIDESHARE.NET/LHRC_MIKEYP WWW.GITHUB.COM/LHRC-MIKEYP @pmikeyp mikeyp@acm.org O P E N D A T A S C I E N C E C O N F E R E N C E_ BOSTON 2015 @opendatasci
  • 3. 3 About This Talk • We can’t cover everything about hardware in a 30 minute session. • We can go deep enough to help you – Understand tradeoffs and balanced architectures – Ask the right questions about choices – Learn from what others are doing • My Approach Today 1. Why look at high performance hardware ? 2. Look at a production cluster design 3. Look at the choices and tradeoffs behind the scene
  • 4. 4 Why consider High Performance Hardware ? • Choice of hardware can have large impacts – On performance – On budget • Understanding the hardware helps with the software – Scalable and parallel systems deal with both • Data is heavy – Local clusters are persistent – Large data transfer may not be a viable option. • Cloud hosting may not be an option – You can’t or won’t delegate critical infrastructure to third parties. – You need every bit of performance you can get.
  • 5. 5 Servers Processors Memory Lack of Trusted Information Jargon Disk Drives Networking Choices, Choices - The Hardware Toolbox 5
  • 7. 7 Reference Architectures Fill The Gap • Tested Server Configurations • Tested Network Configurations • Recommended Software Configuration – Application and Workload Software – OS Infrastructure – Operational Infrastructure • Opinionated Point of View – Based on real world experience • Recommended starting point – Customization is possible 7
  • 8. 8 The secret to a good architecture is balance Price Performance Fault Zones Application Workload Software
  • 9. 9 Cluster Architecture • The Dell In-Memory Appliance for Cloudera Enterprise 9
  • 10. 10 Dell In-Memory Appliance – Summary Specs Cluster Starter Mid-Size Small Enterprise Maximum Data Nodes 4 12 20 44 Total Memory 1536 GB 4608 GB 7680 GB 26896 GB Total Storage 176TB 528 TB 880 TB 2112 TB Processing Cores 80 280 400 880 Racks (42U) 1 2 2 4 Data Node Characteristic Configuration Server Dell R720xd (2 Rack Units) Processor Two Intel Xeon E5-2670v2 2.5GHz, 25M Cache, 10 Core Memory 384GB Memory Speed 1866 Mt/s DRAM Disks 12 X 4TB SATA, 3.0 Gbps (48 TB) Networking Dual 10GbE interfaces, with active bonding Management Network Two x 1GbE interfaces
  • 11. 11 Server Examples M1000e Blade Chassis (10U) 4 Socket R920 (4U) 2 Socket R730xd (2U)
  • 12. 12 Server Choices • 4 Socket Servers (e.g. Dell R920) – Optimized for enterprise applications - Large RDBMS servers, SAP, SAP HANA, Microsoft Exchange – Very large memory available (6 TB) – Often use direct or network attached storage • ‘Blade’ Servers (e.g. Dell M620, M1000e Chassis) – Pluggable Processor and Storage modules – Backplane and Chassis has a lot of shared interconnect logic – Flexibility for enterprise applications - Virtualization is popular • 2 Socket Servers (e.g. Dell R620, R630, R720, R730) – Many options available – 1U and 2U chassis footprints – Developed for Web Hosting and Large Scale-Out Clusters – Dell Internal Storage – 12 x 3.5” drives, 24 x 2.5” drives (in chassis)
  • 13. 13 • Assume 1-1.5 Hadoop tasks per core – allows headroom for other processes • Hyperthreading – Enable for Hadoop, Spark – for others: it depends • Hadoop: aim for 1 core / disk spindle • Impala: can handle more spindles and cores easily • Spark: I/O depends on back end storage • Faster processor is better – Most Hadoop jobs are I/O bound, not processor bound – Hadoop compression uses processor cycles – Less cores with a faster clock is often a good tradeoff – The Map / Reduce balance depends on actual workload – It’s hard to optimize more without knowing the actual workload Selecting Processors
  • 14. 14 Intel Xeon Dual Socket Processor Architecture Haswell CPU Up to 18 cores TDP: Up to 145 W (SVR); 160 W (WS) Socket Socket-R3 Scalability 2S capability Memory 4xDDR4 channels 1333, 1600, 1866 (2 DPC), 2133 (1 DPC) RDIMM, LRDIMM QPI 2xQPI 1.1 channels 6.4, 8.0, 9.6 GT/s PCIe PCIe 3.0 (2.5, 5, 8 GT/s) PCIe Extensions: Dual Cast, Atomics 40xPCIe*3.0 Intel® Xeon® processor E5-2600 v3 Intel® Xeon® processor E5-2600 v3 QPI 2 Channels DDR4 LAN Up to 4x10GbE PCIe* 3.0, 40 lanes Intel® C610 series chipset WBG DDR4 DDR4 DDR4 DDR4 DDR4 DDR4 DDR4
  • 15. 15 Intel Processor Generations Product Xeon E5-2600 E5-2600 V2 E5-2600 V3 Microarchitecture SandyBridge IvyBridge Haswell Cores / Threads 8 / 16 12/24 18/36 Last Level Cache Up to 20MB Up to 30 MB Up to 45 MB Max Memory Speed 1600 MT/S DDR3 1866 MT/s DDR3 2133 MT/s DDR4 QPI (GT/s) 2 channels 6.4, 7.2, 8.0 2 channels 6.4, 7.2, 8.0 2 channels 6.4, 8.0, 9.6 Max DIMMS 12 12 12 Max Clock Speed 3.1GHz / 3.8GHz 3.7 GHz / 3.8GHz 3.7 Ghz / 3.8Ghz Process Tech 32nm 22nm 22nm Year 2012 2013 2014
  • 16. 16 Selecting Memory • DDR3 versus DDR4, RDIMM versus LRDIMM – DDR3 is cheaper now, DDR4 is faster (15%) • DIMM Sizes – 8GB, 16GB, 32GB, 64GB, 128GB • Sweet Spot Varies – DDR4 around 32GB right now • Balance the memory banks – 4 memory channels per processor – 4 x 16GB better than 2 x 32GB • Server Class Memory – It’s all ECC checked – Dell Server BIOS options to optimize checking method
  • 17. Selecting Disks
    • 3.5” Drives
      – 3TB, 4TB, 6TB per drive
      – Pricing sweet spot is 3TB
      – Use enterprise grade drives, not consumer!
      – SATA or SAS. SAS is slightly faster.
      – 3.0 Gb/sec is fine, 6.0 Gb/sec is a waste with spinning drives
    • 2.5” Drives
      – 800GB and 1.2TB
      – More expensive than 3.5” drives
      – More spindles and performance
    • SATA Solid State Drives
      – 6.0 Gb/sec
      – 2.5” and 1.8” options
      – Expensive for now
      – Not as deterministic as spindles
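A rough calculation shows why a 6.0 Gb/sec link adds little for spinning drives. SATA uses 8b/10b encoding, so each payload byte costs 10 bits on the wire. The 200 MB/s sustained drive rate below is an illustrative assumption, not a figure from the talk, and the function names are hypothetical.

```python
def link_payload_mb_s(link_gbit_s):
    """Usable payload rate of a SATA link in MB/s (8b/10b: 10 line bits per byte)."""
    return link_gbit_s * 1000 / 10


def link_headroom(link_gbit_s, drive_mb_s):
    """How many times faster the link is than the drive behind it."""
    return link_payload_mb_s(link_gbit_s) / drive_mb_s


ASSUMED_DRIVE_MB_S = 200  # assumed sustained sequential rate of one spinning drive

for link in (3.0, 6.0):
    print(f"{link} Gb/sec link: {link_payload_mb_s(link):.0f} MB/s payload, "
          f"{link_headroom(link, ASSUMED_DRIVE_MB_S):.1f}x the assumed drive rate")
```

Under this assumption the slower link already outruns the drive, so the faster link only matters for devices such as SSDs that can exceed it.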
  • 18. Spindle / Core / Storage Depth Optimization
    • Hadoop scales processing and storage together
      – The cluster grows by adding more data nodes
      – The ratio of processor to storage is the main adjustment
    • Generally, aim for a 1 spindle / 1 core ratio
      – I/O is large blocks (64 MB to 256 MB)
      – Primarily sequential read/write, very little random I/O
      – 8 tasks will be reading or writing 8 individual spindles
    • Drive Sizes and Types
      – NL SAS or Enterprise SATA 6 Gb/sec
      – Drive size is mainly a price decision
    • Depth per node
      – Up to 48 TB/node is common
      – 112 TB/node is possible
      – Consider how much data is ‘active’
      – Very deep storage impacts recovery performance
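The depth-per-node tradeoff can be sketched with simple arithmetic. The example below assumes an HDFS replication factor of 3 (the HDFS default, not stated on the slide), a hypothetical node with 12 cores and 12 x 4 TB drives, and an assumed aggregate re-replication rate of 1000 MB/s. All function names are illustrative.

```python
def node_storage(spindles, drive_tb, cores, replication=3):
    """Raw and usable capacity plus the core/spindle ratio for one data node."""
    raw_tb = spindles * drive_tb
    return {
        "raw_tb": raw_tb,
        "usable_tb": raw_tb / replication,  # each block is stored `replication` times
        "cores_per_spindle": cores / spindles,
    }


def rereplication_hours(node_tb, recovery_mb_per_s):
    """Hours needed to re-copy a failed node's data at an assumed aggregate rate."""
    return node_tb * 1_000_000 / recovery_mb_per_s / 3600


node = node_storage(spindles=12, drive_tb=4, cores=12)
print(node)
print(round(rereplication_hours(node["raw_tb"], 1000), 1), "hours to re-replicate a full node")
```

Doubling the depth per node doubles the recovery time at the same rate, which is the cost the last bullet on the slide refers to.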
  • 19. PowerEdge C8000 Hadoop Scaling - 16 core Xeon
    [Chart: (1) 12-spindle 3 TB configuration versus (3) 6-spindle 3 TB configurations. Series: Cores (1), Storage (1), IOPS (1), Storage (3), IOPS (3). Axis label: TB Storage.]
  • 20. Network Architecture – Layer 2 Switching
  • 21. Network and Switches
    • Simple Tree Structure
      – Top of Rack (TOR) for each rack / group of nodes
      – Racks feed up to a Cluster or Aggregation Switch
      – All switching is at Layer 2 (Ethernet)
        › No fancy routing or layer 3 (IP) packet inspection
      – Most switches are 48 ports in this class
    • Switch Characteristics
      – Line rate switching at 10Gbps
      – Deep buffers to handle bursts
      – Virtual Link Trunking (VLT): two switches act as one, with failover
      – Uplinks are 40GbE
    • High Availability and Performance
      – Use two 10GbE links to alternate switches
      – Bond at the Linux level into a single device
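One number worth checking for this tree design is the oversubscription ratio at each top-of-rack switch: downlink bandwidth to the nodes divided by uplink bandwidth to the aggregation layer. The sketch below uses the 10GbE and 40GbE speeds from the slide. The node and uplink counts are assumptions for illustration, and `oversubscription` is a hypothetical helper.

```python
def oversubscription(node_ports, port_gbps, uplinks, uplink_gbps):
    """Ratio of total node-facing bandwidth to total uplink bandwidth on one switch."""
    return (node_ports * port_gbps) / (uplinks * uplink_gbps)


# Assumed rack: 20 nodes, each with one 10GbE link to this switch
# (the second link of each bonded pair goes to the partner switch),
# and 2 x 40GbE uplinks to the aggregation switch.
ratio = oversubscription(node_ports=20, port_gbps=10, uplinks=2, uplink_gbps=40)
print(f"{ratio:.1f}:1")  # 2.5:1
```

A ratio near 1:1 means traffic leaving the rack is not constrained by the uplinks; higher ratios rely on most traffic staying inside the rack.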
  • 22. In the Wild – Dell Customer Hadoop Configurations
    Model            Data Node Configuration                         Comments
    RA R730Xd        Dual socket, 12 cores, 24 x 2.5” spindles       Most popular platform for Hadoop
    C8000            Dual socket, 16 cores, 16 x 3.5” spindles       Popular for deep/dense Hadoop applications
    C6100 / C6105    Dual socket, 8/12 cores, 12 x 3.5” spindles     Two node version. C6100 is hardware EOL
    C2100            Dual socket, 12 cores, 12 x 3.5” spindles       Popular, hardware EOL but often repurposed for Hadoop
    R620             Dual socket, 8 cores, 10 x 2.5” spindles        1U form factor
    C6220            Dual socket, 8 cores, 6 x 2.5” spindles         Core/spindle ratio is not ideal for Hadoop
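The configurations above can be compared against the 1 core per spindle guideline from earlier in the talk. This is a sketch only: the slide does not say whether the core counts are per socket or per node, so they are used exactly as listed, and the "8/12 cores" entry is split into two hypothetical variants.

```python
# (model, cores, spindles) as listed on the slide. Core counts are taken at
# face value; whether they are per socket or per node is not stated.
CONFIGS = [
    ("R730Xd", 12, 24),
    ("C8000", 16, 16),
    ("C6100 / C6105 (8-core)", 8, 12),
    ("C6100 / C6105 (12-core)", 12, 12),
    ("C2100", 12, 12),
    ("R620", 8, 10),
    ("C6220", 8, 6),
]


def cores_per_spindle(cores, spindles):
    """Cores available for each disk spindle in one data node."""
    return cores / spindles


ratios = {model: cores_per_spindle(c, s) for model, c, s in CONFIGS}
for model, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{model:26s} {ratio:.2f} cores per spindle")
```

With the numbers as listed, the C6220 ends up with the most cores per spindle, which matches the slide's comment that its ratio is not ideal for Hadoop.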
  • 23. What I haven’t talked about!
    • GPUs
      – Possible, not seen too often with Hadoop
    • Ingest / Streaming
      – Usually a custom configuration for high speed capture/loading (e.g. Kafka, Storm)
    • Dell PowerEdge VRTX
      – Designed as a ‘mini-blade’ for branch offices
      – Could make a killer data science workstation
  • 24. Download Links / References
    • Dell.com/hadoop
      – Hadoop Reference Architectures
      – Optimizing PowerEdge Configurations for Hadoop
    • Slideshare
      – http://www.slideshare.net/lhrc-mikeyp
  • 25. High Performance Hardware for Data Analysis
    • Choosing hardware for big data analysis is difficult because of the many options and variables involved. The problem is more complicated when you need a full cluster for big data analytics.
    • This session will cover the basic guidelines and architectural choices involved in choosing analytics hardware for Spark and Hadoop. I will cover processor core and memory ratios, disk subsystems, and network architecture. This is a practical, advice-oriented session that focuses on the performance and cost tradeoffs of many different options.