Best Practices for Virtualizing Apache Hadoop
2. © Hortonworks Inc. 2013
George Trujillo
§ Master Principal Big Data Specialist - Hortonworks
§ Tier One Big Data/BCA Specialist – VMware Center of Excellence
§ VMware Certified Instructor (VMware Certified Professional)
§ MySQL Certified DBA
§ Sun Microsystems Ambassador for Java Platforms
§ Author of Linux Administration and Advanced Linux Administration
Video Training
§ Recognized Oracle Double ACE by Oracle Corporation
§ Served on Oracle Fusion Council & Oracle Beta Leadership Council,
Independent Oracle Users Group (IOUG) Board of Directors,
Recognized as one of the “Oracles of Oracle” by IOUG
Page 2
3. © Hortonworks Inc. 2013
Agenda
• Hypervisors today
• Building an enterprise virtual platform
• Virtualizing Master and Slave servers
• Best practices
• Deploying Hadoop in public and private clouds
4. © Hortonworks Inc. 2013
Hypervisors Today: Faster/Less Overhead
• VMware vSphere, Microsoft Hyper-V Server, Citrix
XenServer and RedHat RHEV
Hypervisor performance benchmarks (% overhead):
– VMware: 1M IOPS with 1 microsecond of latency (vSphere 5.1), 2–10% overhead
– KVM: 1M transactions/minute (IBM hardware, RHEL), < 10% overhead
vSphere 5.1 performance limits:
– vCPUs per VM: 64
– RAM per VM / RAM per host: 1 TB / 2 TB
– Network: 36 GB/s
– IOPS per VM: 1,000,000
5. © Hortonworks Inc. 2013
Why Virtualize Hadoop?
• Virtual Servers offer advantages over Physical Servers
• Standardization on a single common software stack
• Higher consistency and reliability due to abstracting the
hardware environment
• Operational flexibility with vMotion, Storage vMotion, Live
Cloning, template deployments, hot memory and CPU add,
Distributed Resource Scheduling, private VLANs, Storage and
Network I/O control, etc.
• Virtualization is a natural step towards the cloud
• Enabling Hadoop as a service in a public or private cloud
• Cloud providers are making it easy to deploy Hadoop for POCs,
dev and test environments
• Cloud and virtualization vendors are offering elastic MapReduce
solutions
6. © Hortonworks Inc. 2013
Virtualization Features
– Faster provisioning
– Live migrations
– Live storage migrations
– High Availability
– Live Cloning
– VM Replication
– Templates
– Distributed Resource Scheduling
– Hot CPU and Memory add
– Network isolation using VXLANs
– Multi-VM trust zones
– VM Backups
– Distributed Power Management
– Elasticity
– Multi-tenancy
– Storage/Network I/O Control
– Private virtual networks
– 16Gb FC Support
– iSCSI Jumbo Frame Support
Note: Features/functionality dependent on the hypervisor
7. © Hortonworks Inc. 2013
Hortonworks Data Platform
Building an Enterprise Virtual Platform
[Diagram: the Hortonworks Data Platform stack running on a hypervisor]
– Hardware, Hypervisor, OS (Linux, Windows)
– Core Hadoop (kernel): Distributed Storage (HDFS), Distributed Processing (MapReduce)
– Hadoop Essentials: Hive (Query), Pig (Scripting), HCatalog (Metadata Mgmt), HBase (Column DB), Zookeeper (Coordination), Mahout (Machine Learning), Oozie (Workflow), WebHCatalog (REST-like APIs), WebHDFS (REST API)
– Management and Monitoring: Ambari (Management), Ganglia (Monitoring), Nagios (Alerts)
– Data Extraction and Load: Sqoop (DB Transfer), FlumeNG (Data Transfer), "Others" (Talend, Informatica, etc.)
8. © Hortonworks Inc. 2013
Virtualizing Hadoop
• The primary goal of virtualizing master and slave servers is the
same, to maximize operational efficiency and leverage existing
hardware.
• However, the strategy for virtualizing Hadoop master servers differs
from that for Hadoop slave servers.
– Hadoop master servers can follow virtualization best practices and
guidelines for tier-1 and business-critical environments.
– Hadoop slave servers need to follow virtualization best practices and
also use Hadoop Virtual Extensions so the Hadoop cluster is "virtual
aware".
9. © Hortonworks Inc. 2013
Virtualizing Master Servers
• Virtualize the master servers (NameNode, JobTracker,
HBase Master, Secondary NameNode)
– Consider any key management servers: Ganglia, Nagios, Ambari,
Active Directory, Metadata databases
• Goals of a virtual enterprise Hadoop platform:
– Less down time (Live migrations, cloning, …)
– A more reliable software stack
– A higher Quality of Service
– Reduced CapEx and OpEx
– Increased operational flexibility with virtualization features
– VMware High Availability (with five clicks)
• Shared storage for the Hadoop master servers is required
to fully leverage virtualization features.
10. © Hortonworks Inc. 2013
Configure Environment Properly
• Do not overcommit SLA or production environments
• Size virtual machines to avoid entering the host's "soft" memory
state, which will likely break host large pages into small
pages. Leaving at least 6% of memory for the hypervisor
and VM memory overhead is a conservative rule of thumb.
– If free memory drops below minFree (“soft” memory state),
memory will be reclaimed through ballooning and other memory
management techniques. All these techniques require breaking
host large pages into small pages.
• Leverage hyperthreading; make sure there is hardware
and BIOS support
– Hyperthreading can improve performance by up to 20%
• Do not set memory limits on production servers.
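The 6% reservation above can be turned into a quick sizing aid. This is an illustrative sketch (the function name and host sizes are invented, and the 6% figure is the conservative rule of thumb from this slide, not a hypervisor-specific formula):

```shell
# usable_vm_memory_gb HOST_GB
# Reserves ~6% of host RAM for the hypervisor and VM memory overhead
# and prints the whole GB left for allocating to VMs.
usable_vm_memory_gb() {
  echo $(( $1 * 94 / 100 ))
}

usable_vm_memory_gb 256   # a 256 GB host leaves ~240 GB for VMs
```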
11. © Hortonworks Inc. 2013
Configure Environment Properly (2)
• Run latest version of hypervisor, BIOS and virtual tools
• Verify BIOS settings enable all populated processor
sockets and enable all cores in each socket.
• Enable “Turbo Boost” in BIOS if processors support it.
• Disabling hardware devices (in BIOS) can free interrupt
resources.
– COM and LPT ports, USB controllers, floppy drives, network
interfaces, optical drives, storage controllers, etc.
• Enable virtualization features in BIOS (VT-x, AMD-V, EPT,
RVI)
• Initially leave memory scrubbing rate at manufacturer’s
default setting.
12. © Hortonworks Inc. 2013
More Best Practices
• Configure an OS kernel as a single-core or multi-core
kernel based on the number of vCPUs being used.
• Understand how NUMA affects your VMs – try to keep the
VM size within the NUMA node
– Look at disabling node interleaving (leave NUMA enabled)
– Maintain memory locality
• Let the hypervisor control power management by setting BIOS
power management to "OS Controlled Mode"
• Enable C1E in BIOS
• Have a very good reason for using CPU affinity; otherwise,
avoid it like the plague
13. © Hortonworks Inc. 2013
Linux Best Practices
• Kernel parameters:
– nofile=16384
– nproc=32000
– Mount file systems with the noatime and nodiratime options so
access-time updates are disabled
– File descriptors set to 65535
– File system read-ahead buffer increased to 1,024 or 2,048
– Epoll file descriptor limit increased to 4,096
• Turn off swapping
• Use ext4 or XFS (mount with noatime)
– ext4 can be about 5% better on reads than XFS
– XFS can be 12–25% better on writes (and auto-defragments in the
background)
• Linux kernels 2.6.30+ can deliver roughly 60% better energy
efficiency.
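The limits and mount options above map onto standard Linux configuration files. A hedged example; the hdfs user, device name, and mount point are illustrative placeholders to adapt for your environment:

```
# /etc/security/limits.conf (the "hdfs" user is a placeholder)
hdfs  soft  nofile  16384
hdfs  hard  nofile  16384
hdfs  soft  nproc   32000
hdfs  hard  nproc   32000

# /etc/sysctl.conf: turn off swapping pressure for Hadoop JVMs
vm.swappiness = 0

# /etc/fstab: data disk mounted with access-time updates disabled
# (device and mount point are illustrative)
/dev/sdb1  /grid/0  ext4  defaults,noatime,nodiratime  0 0
```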
14. © Hortonworks Inc. 2013
Networking Best Practices
• Separate VM traffic from live migration and management
traffic
– Separate NICs with separate vSwitches
• Leverage NIC teaming (at least 2 NICS per vSwitch)
• Leverage latest adapters and drivers from hypervisor
vendor
• Be careful with multi-queue networking: Hadoop drives a
high packet rate, but not high enough to justify the
overhead of multi-queue.
• Network:
– Channel bonding two GbE ports can give better I/O performance
– 8 Queues per port
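Channel bonding two GbE ports, as suggested above, is typically configured through the Linux bonding driver. A RHEL-style sketch; the bonding mode, device names, and IP address are illustrative choices, not a recommendation from this deck:

```
# /etc/modprobe.d/bonding.conf: load the bonding driver for bond0
alias bond0 bonding
options bond0 mode=balance-alb miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0 (IP illustrative)
DEVICE=bond0
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0: first slave NIC (repeat for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```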
15. © Hortonworks Inc. 2013
Networking Best Practices (2)
• Evaluate these features with network adapters to leverage
hardware features:
– Checksum offload
– TCP segmentation offload (TSO)
– Jumbo frames (JF)
– Large receive offload (LRO)
– Ability to handle high-memory DMA (that is, 64-bit DMA
addresses)
– Ability to handle multiple Scatter Gather elements per Tx frame
• Optimize 10 Gigabit Ethernet network adapters
– Features like NetQueue can significantly improve performance of
10 Gigabit Ethernet network adapters in virtualized environments.
16. © Hortonworks Inc. 2013
Storage Best Practices
• Make good storage decisions
– e.g., VMFS (VMDK) or Raw Device Mappings (RDM)
– VMDK – leverages all features of virtualization
– RDM – leverages features of storage vendors (replication,
snapshots, …)
– Run in Advanced Host Controller Interface (AHCI) mode
– Enable Native Command Queuing (NCQ)
• Use multiple vSCSI adapters and evenly distribute target
devices
• Use eagerzeroedthick for VMDK files or uncheck
Windows “Quick Format” option
• Make sure storage partitions are block-aligned
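Block alignment comes down to where a partition starts: modern partitioning tools begin partitions at sector 2048 (a 1 MiB boundary), while the legacy DOS default of sector 63 straddles array stripe boundaries. A minimal sketch of the check (the function name is invented):

```shell
# partition_alignment START_SECTOR
# A partition is block-aligned when its starting sector (512-byte
# units) falls on a 1 MiB boundary, i.e. is divisible by 2048.
partition_alignment() {
  if [ $(( $1 % 2048 )) -eq 0 ]; then
    echo "aligned"
  else
    echo "misaligned"
  fi
}

partition_alignment 2048   # modern default start sector: aligned
partition_alignment 63     # legacy DOS default: misaligned
```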
17. © Hortonworks Inc. 2013
Virtualizing Data Servers
• HVE is a new feature that extends the Hadoop topology
awareness mechanism to support rack and node groups
with hosts containing VMs.
– Data locality-related policies maintained within a virtual layer
• HVE merged into branch-1
– Available in Apache Hadoop 1.2, HDP 1.2
– https://issues.apache.org/jira/browse/HADOOP-8817
• Extensions include:
– Block placement and removal policies
– Balancer policies
– Task scheduling
– Network topology awareness
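The node-group awareness above is typically fed by a topology script that returns a /rack/node-group path for each host, so HDFS can treat VMs on the same physical host as a single failure domain. A hedged sketch of the script's shape; all hostnames and paths below are invented for illustration:

```shell
# Print a /rack/nodegroup location for each hostname argument.
resolve_topology() {
  case "$1" in
    vm-a1|vm-a2) echo "/rack1/nodegroup1" ;;  # VMs sharing physical host A
    vm-b1|vm-b2) echo "/rack1/nodegroup2" ;;  # VMs sharing physical host B
    vm-c1|vm-c2) echo "/rack2/nodegroup3" ;;  # VMs sharing physical host C
    *)           echo "/default-rack/default-nodegroup" ;;
  esac
}

for host in "$@"; do
  resolve_topology "$host"
done
```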
18. © Hortonworks Inc. 2013
HVE: Virtualization Topology Awareness
[Diagram: a data center with two racks (Rack1 and Rack2); each rack
contains two node groups (NodeG 1–4), each node group two physical
hosts (Host1–Host8), and each host four VMs.]
19. © Hortonworks Inc. 2013
HVE: Replica Policies
Standard replica policies:
– 1st replica is on the local (closest) node of the writer
– 2nd replica is on a separate rack from the 1st replica
– 3rd replica is on the same rack as the 2nd replica
– Remaining replicas are placed randomly across racks to meet the
minimum restriction
Extension replica policies:
– Multiple replicas are not placed on the same node or on nodes under
the same node group
– 1st replica is on the local node or local node group of the writer
– 2nd replica is on a remote rack from the 1st replica
Multiple replicas are not placed on the same node with either the standard
or the extension replica placement/removal policies. The same rules are
maintained for the balancer.
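The key extension rule above, that no two replicas may share a node group, can be illustrated with a small check (the function and node-group names are invented):

```shell
# check_placement NODEGROUP...   (one argument per replica)
# Under the extended policy a placement is valid only when no two
# replicas land in the same node group, so VMs on one physical host
# never hold two copies of the same block.
check_placement() {
  if [ $(printf '%s\n' "$@" | sort | uniq -d | wc -l) -eq 0 ]; then
    echo "valid"
  else
    echo "invalid"
  fi
}

check_placement nodegroup1 nodegroup2 nodegroup3   # valid
check_placement nodegroup1 nodegroup1 nodegroup2   # invalid
```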
20. © Hortonworks Inc. 2013
Follow Virtualization Best Practices
§ Hardware: Validate virtualization and Hadoop configurations with
vendor hardware compatibility lists.
§ Hadoop: Follow recommended Hadoop reference architectures.
§ Storage: Review storage vendor recommendations.
§ Virtualization: Follow virtualization vendors' best practices,
deployment guides and workload characterizations.
§ Internal: Validate internal guidelines and best practices for
configuring and managing corporate VMs.
21. Benefits of Running Hadoop in a Private Cloud
Elastic Hadoop:
• Create a pool of cluster nodes
• On-demand cluster scale up/down
Multi-tenant Hadoop:
• Better isolate workloads and enforce organizational security
boundaries
CapEx reduction:
• Better utilization of physical servers
• Cluster 'timeshare'
• Promote responsible usage through chargeback/showback
OpEx reduction:
• Rapid provisioning and self-provisioning
• Simplified cluster maintenance
22. Hortonworks & Rackspace Partnership
• Goal:
– Enable Hadoop to run efficiently in OpenStack based
public and private cloud environments
• Where we stand
– Rackspace public cloud service available soon
(Q3 CY13)
– Continued work on enabling Hortonworks data
platform to run efficiently on Rackspace OpenStack
private cloud platform
• Project Savannah
– Automate the deployment of Hadoop on enterprise
class OpenStack clouds.
23. © Hortonworks Inc. 2013
Final Thoughts
• Virtualization features can provide operational advantages
to a Hadoop cluster.
• Many companies have expertise in virtualizing tier-two and
tier-three platforms but not tier one. Be wary of growing
pains.
• Can your organization handle the jump of moving to
Hadoop and managing an enterprise virtual infrastructure
at the same time?
• Give Hadoop Virtual Extensions time to bake.
• Organizations are increasing their percentage of virtual
servers and cloud deployments. They do not want to take
a step back into physical servers unless they have to.
24. © Hortonworks Inc. 2013
Next Steps
Download Hortonworks Sandbox
www.hortonworks.com/sandbox
Download Hortonworks Data Platform
www.hortonworks.com/download
Register for Hadoop Series
www.hortonworks.com/webinars
25. Hadoop Summit
Architecting the Future of Big Data
• June 26–27, 2013, San Jose Convention Center
• Co-hosted by Hortonworks & Yahoo!
• Theme: Enabling the Next Generation
Enterprise Data Platform
• 90+ Sessions and 7 Tracks
• Community Focused Event
– Sessions selected by a Conference Committee
– Community Choice allowed public to vote for
sessions they want to see
• Training classes offered pre event
– Apache Hadoop Essentials: A Technical
Understanding for Business Users
– Understanding Microsoft HDInsight and Apache
Hadoop
– Developing Solutions with Apache Hadoop –
HDFS and MapReduce
– Applying Data Science using Apache Hadoop
hadoopsummit.org
26. Thank You
For Attending
Best Practices for Virtualizing Hadoop
George Trujillo
Blog: http://cloud-dba-journey.blogspot.com
Twitter: GeorgeTrujillo