Hadoop as a Service


Jun Ping Du
Richard McDougall
VMware, Inc.




                      © 2009 VMware Inc. All rights reserved
Cloud: Big Shifts in Simplification and Optimization


1. Reduce the Complexity to simplify operations and maintenance
2. Dramatically Lower Costs to redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business




 2
Infrastructure, Apps and now Data…




[Diagram: Build, Run, Manage across Private and Public cloud]

 Simplify Infrastructure With Cloud
 Simplify App Platform Through PaaS
 Next Trend: Simplify Data



 3
Trend 1/3: New Data Growing at 60% Y/Y

Exabytes of information stored: 20 Zetta by 2015, 1 Yotta by 2030
Yes, you are part of the yotta generation…

[Chart: growth in stored data by source: audio; digital TV; digital photos; camera phones, RFID; medical imaging, sensors; satellite images, games, scanners, Twitter; CAD/CAM, appliances, videoconferencing, digital movies]

Source: The Information Explosion, 2009


4
Trend 2/3: Big Data – Driven by Real-World Benefit




5
Trend 3/3: Value from Data Exceeds Hardware Cost

 Value from the intelligence of data analytics now outstrips the cost of hardware
    • Hadoop enables the use of lower cost hardware
    • Hardware cost halving every 18 months
 [Chart: Value vs. Cost. Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU]




6
Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware

 Trend is "not just Hadoop" for big data
    • Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
    • Unify the infrastructure platform for all
 Common Hardware Base
    • Eliminate the hardware/driver/testing phase
    • Use existing team for ordering, diagnosis, capacity management of hardware farm
 [Diagram: SQL Cluster, NoSQL Cluster, Hadoop Cluster and DSS Cluster alongside a Unified Big Data Infrastructure (Big SQL, NoSQL, Hadoop; Private, Public)]

 7
Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning

I WANT MY HADOOP CLUSTER NOW!

                                 Instant Cluster Provisioning
                                  • Provision Hadoop clusters instantly
                                  • Automatable using provisioning engines/scripts, e.g. Whirr




  8
Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities

 Increase Utilization
    • Hadoop cluster only uses resources it needs
    • Extra resources can be used by other applications when not in use
 Eliminate single points of failure
    • Use vSphere HA for Namenode and Jobtracker
 Use VM Isolation
    • Create separate clusters with defensible security
    • Enables multiple versions of Hadoop on the same infrastructure
    • Extends to Hadoop and Linux Environments
 Leverage Resource Management
    • Control/assign resources through resource pools
    • E.g. Use spare cycles for Hadoop Processing through priority control



9
What? Hadoop in a VM? Really?




        Actually, Hadoop performs well in a virtual machine




10
Performance Test: Cluster Configuration



                Mellanox 10 GbE switch



     AMAX ClusterMax
     2X X5650, 96 GB
     12X SATA 500 GB
     Mellanox 10 GbE adapter




11
Cluster Configuration
 Hardware
 • AMAX ClusterMax, 7 nodes
 • 2X X5650 2.67 GHz hex-core, 96 GB memory
 • 12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4
 • Mellanox ConnectX VPI (MT26418), 10 GbE
 • Mellanox Vantage 6048, 10 GbE
 OS/Hypervisor
 • RHEL 6.1 x86_64 (native and guest)
 • ESX 5.0 RTM with devel Mellanox driver
 VMs (HT off/on)
 • 1 VM: 92000 MB, (12/24) vCPUs, 10 PRDM disks
 • 2 VMs: 46000 MB, (6/12) vCPUs, 5 PRDM disks
 • 4 VMs (HT on only):
     • 2 small: 18400 MB, 5 vCPUs, 2 disks
     • 2 large: 27600 MB, 7 vCPUs, 3 disks
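
As a quick consistency check on the VM layouts above, the following sketch adds up memory, vCPUs and disks per host for each layout (using the hyper-threading-on vCPU counts). The tuple layout and the names `LAYOUTS` and `totals` are illustrative assumptions, not part of the original test harness.

```python
# VM layouts per host, copied from the slide (hyper-threading on).
# Each entry is (vm_count, memory_mb, vcpus, disks).
LAYOUTS = {
    "1 VM": [(1, 92000, 24, 10)],
    "2 VMs": [(2, 46000, 12, 5)],
    "4 VMs": [(2, 18400, 5, 2), (2, 27600, 7, 3)],
}

def totals(layout):
    """Sum memory (MB), vCPUs and disks over every VM in a layout."""
    mem = sum(count * mem_mb for count, mem_mb, _, _ in layout)
    cpus = sum(count * vcpus for count, _, vcpus, _ in layout)
    disks = sum(count * d for count, _, _, d in layout)
    return mem, cpus, disks

for name, layout in LAYOUTS.items():
    print(name, totals(layout))
```

Each layout hands the same per-host resources to Hadoop, so differences between the configurations should come from how the resources are partitioned rather than how much is allocated.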
12
Hadoop Configuration
Distribution
  • Cloudera CDH3u0
  • Based on Apache open-source 0.20.2
Parameters
 • dfs.datanode.max.xcievers=4096
 • dfs.replication=2
 • dfs.block.size=134217728
 • io.file.buffer.size=131072
 • mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
 • mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
 Network topology
  • Hadoop uses info for reliability and performance
  • Multiple VMs/host: Each host is a “rack”
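
The parameters listed above can be rendered into Hadoop's XML property format with a short script. This is a minimal sketch: the helper name `render_site_xml` is hypothetical, the virtual-machine heap setting is the one used, and everything is written into a single document for brevity, whereas a real deployment would normally split the `dfs.*`, `io.*` and `mapred.*` properties across the HDFS, core and MapReduce site files.

```python
from xml.etree import ElementTree as ET

# Values copied from the slide; the "virtual" JVM options are used here.
PARAMS = {
    "dfs.datanode.max.xcievers": "4096",
    "dfs.replication": "2",
    "dfs.block.size": "134217728",    # 128 MB
    "io.file.buffer.size": "131072",  # 128 KB
    "mapred.child.java.opts": "-Xmx1900m -Xmn512m",
}

def render_site_xml(params):
    """Return a Hadoop-style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in params.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

xml_text = render_site_xml(PARAMS)
print(xml_text)
```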


13
Benchmarks
 Derived from test apps included in distro
 Pi
 • Direct-exec Monte-Carlo estimation of pi (~ 4*R/(R+G) = 22/7)
 • # map tasks = # logical processors
 • 1.68 T samples
 TestDFSIO
 • Streaming write and read
 • 1 TB
 • More tasks than processors
 Terasort
 • 3 phases: teragen, terasort, teravalidate
 • 10B or 35B records, each 100 Bytes (1 TB, 3.5 TB)
 • More tasks than processors
 • CPU, networking, and storage I/O
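
The idea behind the Pi benchmark can be illustrated on a single machine. The sketch below throws random points into the unit square and applies the ratio 4*R/(R+G) from the slide, where R counts points inside the quarter circle and G counts points outside it. It uses Python's pseudo-random generator and a hypothetical function name `estimate_pi`; the sampling used by the Hadoop example program may differ in detail.

```python
import random

def estimate_pi(samples, seed=0):
    """Monte Carlo estimate of pi from points in the unit square."""
    rng = random.Random(seed)
    inside = 0  # R: points that fall inside the quarter circle
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    outside = samples - inside  # G: points that fall outside
    return 4.0 * inside / (inside + outside)

print(estimate_pi(200_000))
```

In the benchmark the samples are split across map tasks, one per logical processor, which is why this workload is almost purely CPU bound.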

14
Performance of Hadoop for Several Workloads

[Chart: ratio of time taken relative to native for 1 VM and 2 VMs across workloads; lower is better. Y axis: Ratio to Native, 0 to 1.2]




15
Architecting Hadoop as a Service using Virtualization

 Goals
 • Make it fast and easy to provision new Hadoop Clusters on Demand
 • Leverage virtual machines to provide isolation (esp. for Multi-tenant)
 • Optimize Hadoop’s performance based on virtual topologies
 • Make the system reliable based on virtual topologies
 Leveraging Virtualization
 • Elastic scale in/out
 • Use high-availability to protect namenode/job tracker
 • Resource controls and sharing: re-use underutilized memory, cpu
 • Prioritize Workloads: limit or guarantee resource usage in a mixed
     environment




16
Provisioning

 Leverage the vSphere APIs to auto-deploy a cluster
 • Whirr, HOD, or custom using ruby, chef, etc,…
 Use linked-clones to rapidly fork many nodes




17
Fast Provisioning

 From a "seed" node to a cluster

 Thin Provisioning: 60 GB => 3.5 GB
 Linked Clone: ~6 seconds

18
SAN, NAS or Local Disk?

 Shared Storage: SAN or NAS
 • Easy to provision
 • Automated cluster rebalancing
 Hybrid Storage
 • SAN for boot images, VMs, other workloads
 • Local disk for HDFS
 • Scalable Bandwidth, Lower Cost/GB

 [Diagram: six hosts, each running Hadoop VMs alongside other VMs]




     19
Enable Automatic Rack awareness through vSphere

 Important for a robust Hadoop cluster

 Automatic network topology detection is an important vSphere feature

 Rack script is generated automatically
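
A generated rack script could look roughly like the sketch below. Hadoop invokes the configured topology script with one or more node addresses as arguments and reads one rack path per address from standard output. The address-to-host table here is entirely hypothetical; in the approach described on the slide it would be produced from vSphere's knowledge of which VM runs on which physical host.

```python
import sys

# Hypothetical mapping from VM address to physical host (illustrative only).
VM_TO_HOST = {
    "10.0.0.11": "esx-host-1",
    "10.0.0.12": "esx-host-1",
    "10.0.0.21": "esx-host-2",
}
DEFAULT_RACK = "/default-rack"

def resolve(addresses, vm_to_host=VM_TO_HOST):
    """Map each address to a rack path, treating each physical host as a rack."""
    return [
        "/" + vm_to_host[addr] if addr in vm_to_host else DEFAULT_RACK
        for addr in addresses
    ]

if __name__ == "__main__":
    # One rack path per input address, in the same order.
    print("\n".join(resolve(sys.argv[1:])))
```

Treating each host as a rack means two VMs on the same physical machine are never counted as independent failure domains.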




20
Multi-tenant: share cluster or not

 Shared big cluster vs. isolated small clusters
 • Shared big cluster: high performance, large scale, pre-job provisioning
 • Isolated small clusters: secure, flexible, post-job provisioning

 Combination, as customers' requirements differ

21
Elastic Hadoop Cluster

 Traditional hadoop cluster
     • Easy to scale out
       • Fast-provision new hadoop nodes and join into existing cluster
     • Hard to scale in
 while (clusterIsTooLarge) {
      choose node k;
      kill(node k);
      wait(k's data blocks are recovered);
      if necessary, hadoop.rebalance();
 }
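
The scale-in loop above can be simulated on a toy model to show why it is slow: every removed node forces its blocks to be copied elsewhere before the next node can go. This is a sketch under simplifying assumptions: the cluster is a dictionary of node name to block IDs, recovery is instantaneous, and the function name `scale_in` is hypothetical.

```python
import random

def scale_in(nodes, target_size, replication, rng=None):
    """Shrink a toy cluster one node at a time, restoring replication each step.

    nodes maps a node name to the set of block IDs it stores.
    """
    rng = rng or random.Random(0)
    nodes = {name: set(blocks) for name, blocks in nodes.items()}
    while len(nodes) > target_size:           # while cluster is too large
        victim = rng.choice(sorted(nodes))    # choose node k
        lost = nodes.pop(victim)              # kill node k
        for block in sorted(lost):            # wait for k's blocks to recover
            holders = [n for n in nodes if block in nodes[n]]
            candidates = [n for n in sorted(nodes) if block not in nodes[n]]
            if len(holders) < replication and candidates:
                # Copy to the least-loaded node, a crude stand-in for rebalancing.
                target = min(candidates, key=lambda n: len(nodes[n]))
                nodes[target].add(block)
    return nodes

# Five nodes, ten blocks, two replicas per block on neighbouring nodes.
cluster = {f"node{i}": set() for i in range(5)}
for block in range(10):
    for i in (block % 5, (block + 1) % 5):
        cluster[f"node{i}"].add(block)

small = scale_in(cluster, target_size=3, replication=2)
print({name: sorted(blocks) for name, blocks in small.items()})
```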

 Elastic hadoop cluster
 [Diagram: NN, JT, normal nodes and elastic nodes; legend: TaskTracker, DataNode]

22
Replica Placement

 Second Replica
 • Different rack
 • Rack-awareness required


 Third Replica
 • Same rack, different physical host
 • Nodes share host (in virtualized
     environment)
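
The placement rules above can be sketched as a small selection function. The assumptions are stated plainly: the first replica stays on the writer's node, "same rack" for the third replica is read as the rack of the second replica (as in HDFS's default policy), and the `Node` type and `place_replicas` name are hypothetical. The extra host check is what keeps two replicas from landing on VMs that share one physical machine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    host: str  # physical host the VM runs on

def place_replicas(writer, nodes):
    """Pick three replica locations; raises StopIteration if none qualify."""
    first = writer
    # Second replica: any node on a different rack.
    second = next(n for n in nodes if n.rack != first.rack)
    # Third replica: same rack as the second, but a different physical host.
    third = next(
        n for n in nodes if n.rack == second.rack and n.host != second.host
    )
    return [first, second, third]

nodes = [
    Node("vm1", "rackA", "host1"),
    Node("vm2", "rackA", "host1"),
    Node("vm3", "rackB", "host2"),
    Node("vm4", "rackB", "host2"),  # shares a host with vm3, so it is skipped
    Node("vm5", "rackB", "host3"),
]
print([n.name for n in place_replicas(nodes[0], nodes)])
```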




23
Demo




24
Performance

 Create more, smaller VMs
 • Makes Hadoop scale better
 • Allows for easier/faster adjustment of packing of VMs across hosts by vSphere (including through DRS)
 Sizing/Configuration of storage is critical
 • Plan on ~50 MB/sec of bandwidth per core
 • SANs are typically configured by default for IOPS, not bandwidth
 • Ensure SAN ports/switch topology allows required aggregate bandwidth
 • Performance of the backend storage should be tested/sized
 • Local disks will give ~100-140 MB/sec per disk: pick correct controller
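
The storage rule of thumb above turns into simple arithmetic. The sketch below computes the bandwidth a host needs at roughly 50 MB/sec per core and how many local disks that implies at the low and high ends of the per-disk range. The function names are hypothetical, and the 12-core host is just the benchmark machine from earlier in the deck used as an example.

```python
import math

MB_PER_SEC_PER_CORE = 50  # planning figure from the slide

def required_bandwidth(cores, per_core=MB_PER_SEC_PER_CORE):
    """Aggregate storage bandwidth (MB/sec) a host should be able to sustain."""
    return cores * per_core

def disks_needed(cores, per_disk_mb_s, per_core=MB_PER_SEC_PER_CORE):
    """Minimum number of local disks that meets the bandwidth target."""
    return math.ceil(required_bandwidth(cores, per_core) / per_disk_mb_s)

cores = 12  # two hex-core sockets, as in the benchmark hosts
print(required_bandwidth(cores))   # 600 MB/sec
print(disks_needed(cores, 100))    # 6 disks at 100 MB/sec each
print(disks_needed(cores, 140))    # 5 disks at 140 MB/sec each
```

The same target applied to shared storage is what makes the SAN port and switch topology matter: the aggregate across all hosts has to fit through the fabric.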




25
Summary

 Hadoop does work well in a virtual environment
 Plan a virtual cluster, enable other big-data solutions on the same infrastructure
 Leverage the recipes to automate your configuration and deployment




26

Hadoop World 2011: Hadoop as a Service in Cloud

  • 1. Hadoop as a Service. Jun Ping Du, Richard McDougall, VMware, Inc. © 2009 VMware Inc. All rights reserved
  • 2. Cloud: Big Shifts in Simplification and Optimization. 1. Reduce the Complexity, to simplify operations and maintenance. 2. Dramatically Lower Costs, to redirect investment into value-add opportunities. 3. Enable Flexible, Agile IT Service Delivery, to meet and anticipate the needs of the business.
  • 3. Infrastructure, Apps and now Data… Simplify Infrastructure With Cloud; Simplify App Platform Through PaaS; Next Trend: Simplify Data. Diagram labels: Private, Public; Build, Run, Manage.
  • 4. Trend 1/3: New Data Growing at 60% Y/Y. Exabytes of information stored: 20 Zetta by 2015, 1 Yotta by 2030. Yes, you are part of the yotta generation… Chart labels: audio, digital tv, digital photos, camera phones, rfid, medical imaging, sensors, satellite images, games, scanners, twitter, cad/cam, appliances, videoconferencing, digital movies. Source: The Information Explosion, 2009
  • 5. Trend 2/3: Big Data – Driven by Real-World Benefit
  • 6. Trend 3/3: Value from Data Exceeds Hardware Cost. Value from the intelligence of data analytics now outstrips the cost of hardware: Hadoop enables the use of lower-cost hardware; hardware cost is halving every 18 months. Chart (Value vs. Cost): Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU.
  • 7. Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware. Trend is "not just Hadoop" for big data: Hadoop is often combined with other technologies (Big SQL, NoSQL, etc.); unify the infrastructure platform for all. Common Hardware Base: eliminate the hardware/driver/testing phase; use the existing team for ordering, diagnosis, and capacity management of the hardware farm. Diagram labels: SQL Cluster, NoSQL Cluster, Hadoop Cluster, DSS Cluster; Unified Big Data Infrastructure (Big SQL, NoSQL, Hadoop); Private, Public.
  • 8. Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning. "I WANT MY HADOOP CLUSTER NOW!" Instant Cluster Provisioning: provision Hadoop clusters instantly; automatable using provisioning engines/scripts, e.g. Whirr.
  • 9. Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities. Increase Utilization: a Hadoop cluster uses only the resources it needs; extra resources can be used by other applications when not in use. Eliminate single points of failure: use vSphere HA for the NameNode and JobTracker. Use VM Isolation: create separate clusters with defensible security; enables multiple versions of Hadoop on the same infrastructure; extends to Hadoop and Linux environments. Leverage Resource Management: control/assign resources through resource pools, e.g. use spare cycles for Hadoop processing through priority control.
  • 10. What? Hadoop in a VM? Really? Actually, Hadoop performs well in a virtual machine.
  • 11. Performance Test: Cluster Configuration. Diagram labels: Mellanox 10 GbE switch; AMAX ClusterMax; 2X X5650, 96 GB; 12X SATA 500 GB; Mellanox 10 GbE adapter.
  • 12. Cluster Configuration. Hardware: AMAX ClusterMax, 7 nodes; 2X X5650 2.67 GHz hex-core, 96 GB memory; 12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4; Mellanox ConnectX VPI (MT26418), 10 GbE; Mellanox Vantage 6048, 10 GbE. OS/Hypervisor: RHEL 6.1 x86_64 (native and guest); ESX 5.0 RTM with devel Mellanox driver. VMs (HT off/on): 1 VM: 92000 MB, (12/24) vCPUs, 10 PRDM disks; 2 VMs: 46000 MB, (6/12) vCPUs, 5 PRDM disks; 4 VMs (HT on only): 2 small (18400 MB, 5 vCPUs, 2 disks) and 2 large (27600 MB, 7 vCPUs, 3 disks).
  • 13. Hadoop Configuration. Distribution: Cloudera CDH3u0, based on Apache open-source 0.20.2. Parameters: dfs.datanode.max.xcievers=4096; dfs.replication=2; dfs.block.size=134217728; io.file.buffer.size=131072; mapred.child.java.opts="-Xmx2048m -Xmn512m" (native); mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual). Network topology: Hadoop uses topology information for reliability and performance; with multiple VMs per host, each host is a "rack".
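The parameters on slide 13 can be written out as Hadoop site files. The sketch below is a minimal illustration, not the presenters' tooling: it renders the listed values into the `<configuration>`/`<property>` XML layout that Hadoop reads. The grouping into three dictionaries follows the usual hdfs-site.xml, core-site.xml, and mapred-site.xml split; the helper name `render_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Values copied from slide 13; the dictionary names are illustrative.
HDFS_SITE = {
    "dfs.datanode.max.xcievers": "4096",
    "dfs.replication": "2",
    "dfs.block.size": str(128 * 1024 * 1024),   # 134217728 bytes = 128 MiB
}
CORE_SITE = {
    "io.file.buffer.size": str(128 * 1024),     # 131072 bytes = 128 KiB
}
MAPRED_SITE_NATIVE = {"mapred.child.java.opts": "-Xmx2048m -Xmn512m"}
MAPRED_SITE_VIRTUAL = {"mapred.child.java.opts": "-Xmx1900m -Xmn512m"}


def render_site_xml(props):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


hdfs_xml = render_site_xml(HDFS_SITE)
```

The only difference between the native and virtual runs in the slide is the task JVM heap (2048 MB versus 1900 MB), which leaves a little memory headroom inside each guest.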
  • 14. Benchmarks. Derived from test apps included in the distro. Pi: direct-exec Monte-Carlo estimation of pi, ~ 4*R/(R+G) = 22/7; # map tasks = # logical processors; 1.68 T samples. TestDFSIO: streaming write and read; 1 TB; more tasks than processors. Terasort: 3 phases (teragen, terasort, teravalidate); 10B or 35B records, each 100 bytes (1 TB, 3.5 TB); more tasks than processors; exercises CPU, networking, and storage I/O.
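The Pi benchmark on slide 14 and the Terasort data sizes can both be checked with a few lines. This is a plain pseudo-random sketch for illustration, not the distribution's own Pi job; it reads the slide's 4*R/(R+G) expression as the standard estimator, with R the number of sample points that fall inside the quarter circle and G the number outside (that reading is an assumption). The function names are hypothetical.

```python
import random


def estimate_pi(samples, seed=0):
    """Monte Carlo estimate of pi as 4*R/(R+G) over points in the unit square."""
    rng = random.Random(seed)
    inside = 0                       # R: points inside the quarter circle
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    outside = samples - inside       # G: points outside the quarter circle
    return 4.0 * inside / (inside + outside)


def terasort_bytes(records, record_size=100):
    """Total Terasort input size for a given record count (100-byte records)."""
    return records * record_size


pi_estimate = estimate_pi(200_000)
```

With 100-byte records, 10 billion records come to 10^12 bytes (1 TB) and 35 billion to 3.5 TB, matching the sizes quoted on the slide.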
  • 15. Performance of Hadoop for Several Workloads. [Chart: ratio of time taken relative to native, lower is better; series: 1 VM, 2 VMs]
  • 16. Architecting Hadoop as a Service using Virtualization. Goals: make it fast and easy to provision new Hadoop clusters on demand; leverage virtual machines to provide isolation (esp. for multi-tenant); optimize Hadoop's performance based on virtual topologies; make the system reliable based on virtual topologies. Leveraging Virtualization: elastic scale in/out; use high availability to protect the NameNode/JobTracker; resource controls and sharing: re-use underutilized memory and CPU; prioritize workloads: limit or guarantee resource usage in a mixed environment.
  • 17. Provisioning. Leverage the vSphere APIs to auto-deploy a cluster: Whirr, HOD, or custom using Ruby, Chef, etc. Use linked clones to rapidly fork many nodes.
  • 18. Fast Provisioning. From a "seed" node to a cluster. Thin Provisioning: 60GB => 3.5GB. Linked Clone: ~6 seconds.
  • 19. SAN, NAS or Local Disk? Shared Storage (SAN or NAS): easy to provision; automated cluster rebalancing. Hybrid Storage: SAN for boot images, VMs, other workloads; local disk for HDFS; scalable bandwidth, lower cost/GB. [Diagram: hosts running Hadoop VMs alongside other VMs]
  • 20. Enable Automatic Rack Awareness through vSphere. Important for a robust Hadoop cluster. Automatic network topology detection is an important vSphere feature. The rack script is generated automatically.
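For context on the rack script mentioned on slide 20: Hadoop calls a user-supplied topology script with one or more node addresses as arguments and expects one rack path per argument on standard output. The sketch below shows what such a script might look like when each physical host is treated as a "rack", as in the test setup. The IP addresses and host names in `VM_TO_HOST` are invented for illustration; in the talk's design this table would be generated from vSphere rather than written by hand.

```python
import sys

# Hypothetical VM-address -> physical-host table (illustrative values only).
VM_TO_HOST = {
    "10.0.1.11": "esx-host-1",
    "10.0.1.12": "esx-host-1",
    "10.0.1.21": "esx-host-2",
    "10.0.1.22": "esx-host-2",
}
DEFAULT_RACK = "/default-rack"   # Hadoop's fallback rack name


def resolve(node):
    """Map a VM address to a rack path; each physical host acts as one rack."""
    host = VM_TO_HOST.get(node)
    return "/" + host if host else DEFAULT_RACK


def main(argv):
    # One rack path per input address, in the same order as the arguments.
    print(" ".join(resolve(arg) for arg in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

In Hadoop 0.20-era releases the script path is set through the `topology.script.file.name` property. Two VMs on the same physical host resolve to the same rack, so HDFS will not count them as independent failure domains.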
  • 21. Multi-tenant: share cluster or not. Shared big cluster vs. isolated small clusters. Shared big cluster: high performance, large scale, pre-job provisioning. Isolated small clusters: secure, flexible, post-job provisioning. Combination, as customers' requirements are different.
  • 22. Elastic Hadoop Cluster. Traditional Hadoop cluster: easy to scale out (fast-provision new Hadoop nodes and join them into the existing cluster); hard to scale in: while (ClusterIsTooLarge) { choose node k; kill(node k); wait(k's data blocks are recovered); if necessary, hadoop.rebalance(); }. Elastic Hadoop cluster: diagram labels: Normal node, NN, JT, Elastic node, TaskTracker, DataNode.
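The scale-in loop on slide 22 can be illustrated with a toy in-memory model. This sketch is a simulation only and makes no Hadoop calls: block placement is a dictionary, "waiting for recovery" is modelled as immediately re-replicating each affected block onto a surviving node, and the optional rebalance step is omitted. The victim-selection rule (fewest replicas held) is an assumption, since the slide just says "choose node k".

```python
import random


def scale_in(placement, nodes, target_size, replication=2, seed=0):
    """Shrink a simulated cluster one node at a time.

    placement: dict mapping block id -> set of node names holding a replica.
    Returns the set of surviving nodes; placement is updated in place.
    """
    rng = random.Random(seed)
    nodes = set(nodes)
    while len(nodes) > target_size:                      # while (ClusterIsTooLarge)
        # choose node k: the node holding the fewest replicas (assumed policy)
        victim = min(sorted(nodes),
                     key=lambda n: sum(n in r for r in placement.values()))
        nodes.remove(victim)                             # kill(node k)
        for replicas in placement.values():              # recover k's data blocks
            replicas.discard(victim)
            while len(replicas) < min(replication, len(nodes)):
                replicas.add(rng.choice(sorted(nodes - replicas)))
    return nodes


cluster = ["n1", "n2", "n3", "n4", "n5"]
blocks = {0: {"n1", "n2"}, 1: {"n2", "n3"}, 2: {"n4", "n5"}, 3: {"n1", "n5"}}
survivors = scale_in(blocks, cluster, target_size=3)
```

Every removed node forces its blocks to be copied elsewhere before the next node can go, which is why the slide calls scale-in hard for nodes that hold data.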
  • 23. Replica Placement. Second Replica: different rack; rack-awareness required. Third Replica: same rack, different physical host; nodes share a host (in a virtualized environment).
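The placement rule on slide 23 can be sketched as a small selection function. This is an illustrative model under stated assumptions, not HDFS code: each node carries a rack and a physical-host label, the first replica is assumed to stay on the writing node, the second goes to a different rack, and the third goes to the second replica's rack but on a different physical host. All node, rack, and host names are hypothetical.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    name: str
    rack: str
    host: str   # physical host the VM runs on


def place_replicas(writer: Node, nodes: List[Node]) -> List[Node]:
    """Pick three replica targets using rack and physical-host awareness."""
    first = writer
    # Second replica: any node on a different rack from the writer.
    second = next((n for n in nodes if n.rack != first.rack), None)
    if second is None:
        raise ValueError("no node on a different rack")
    # Third replica: same rack as the second, but not the same physical host.
    third = next((n for n in nodes
                  if n.rack == second.rack and n.host != second.host), None)
    if third is None:
        raise ValueError("no node on a different physical host in that rack")
    return [first, second, third]


vms = [
    Node("vm-a", "rack1", "h1"),
    Node("vm-b", "rack1", "h1"),
    Node("vm-c", "rack2", "h2"),
    Node("vm-d", "rack2", "h2"),   # shares host h2 with vm-c
    Node("vm-e", "rack2", "h3"),
]
targets = place_replicas(vms[0], vms)
```

Without the host check, the third replica could land on `vm-d`, leaving two copies on one physical machine; that is the virtualization-specific risk the slide points out.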
  • 25. Performance. Create more, smaller VMs: makes Hadoop scale better; allows easier/faster adjustment of the packing of VMs across hosts by vSphere (including through DRS). Sizing/configuration of storage is critical: plan on ~50 MB/s of bandwidth per core; SANs are typically configured by default for IOPS, not bandwidth; ensure the SAN ports/switch topology allows the required aggregate bandwidth; performance of the backend storage should be tested/sized; local disks will give ~100-140 MB/s per disk: pick the correct controller.
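The sizing guidance on slide 25 reduces to simple arithmetic. The sketch below applies the ~50 MB/s per core rule to a 12-core host (two hex-core CPUs, as in the test cluster on slide 12) and asks how many local disks are needed at the low and high ends of the quoted 100-140 MB/s range. The function names are hypothetical, and the result is a planning estimate, not a measured figure.

```python
import math


def required_bandwidth_mb_s(cores, per_core_mb_s=50):
    """Aggregate storage bandwidth to plan for, at ~50 MB/s per core."""
    return cores * per_core_mb_s


def disks_needed(bandwidth_mb_s, per_disk_mb_s):
    """Number of local disks needed to supply the given streaming bandwidth."""
    return math.ceil(bandwidth_mb_s / per_disk_mb_s)


host_bandwidth = required_bandwidth_mb_s(cores=12)   # 600 MB/s per host
disks_slow = disks_needed(host_bandwidth, 100)       # disks at 100 MB/s each
disks_fast = disks_needed(host_bandwidth, 140)       # disks at 140 MB/s each
```

For a 12-core host this comes to 600 MB/s, or five to six local disks, which is comfortably covered by the 10 data disks per node in the test configuration as long as the controller can sustain the aggregate rate.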
  • 26. Summary. Hadoop does work well in a virtual environment. Plan a virtual cluster, and enable other big-data solutions on the same infrastructure. Leverage the recipes to automate your configuration and deployment.