Proven Tools to Simplify Hadoop Environments
Joey Jablonski & Vin Sharma
Intros and Bios
• Joey Jablonski
  – Dell Principal Solution Architect
  – http://www.linkedin.com/in/joeyjablonski
  – http://mergingbusinessandit.com
• Vin Sharma
  – Intel Open Source Enterprise Strategist
  – http://www.linkedin.com/in/c1f3rt3xt
  – http://www.intel.com/opensource
Agenda
• Why Hadoop is difficult for IT to operate
• How the right tools can make this easier
  – Deployment & Configuration with Dell Crowbar
  – Monitoring and Management with Dell Barclamps
  – Performance Tuning with Intel HiTune
Hadoop Operational Models
Traditional Datacenter   Cloud Infrastructure
• Assigned Servers       • Elastic Resources
• Rigid Policies         • Services (APIs)
• Tiered Software        • Distributed Software
Operational Challenges
• Deployment
   – Complex because of scale (60 nodes to 1000 nodes)
   – Cumbersome because of high-touch processes
• Configuration & Tuning
   – Error-prone configuration management
   – State management
• Monitoring and Management
   – Complex troubleshooting and diagnostics
   – No proactive notification of problems
• Performance Optimization
   – Limitations of traditional tools
CloudOps Framework
Three aspects of revolutionary clouds
Two Sides of Cloud
[Diagram labels: Ecosystem + API, Ops, Black Box; Cloud = Operations; HW, SW, OPS, CloudOps; stack of APIs, Cloud, Ops, O/S, Physical]
Images vs. Layers: Overview
• Images: Single Unit
  – Configuration
  – Integrations + Applications + Utilities + Operating System (one bundled image)
• Layers: Stacked Pieces
  – Integrations
  – Application Foo
  – Application Bar
  – Utilities
  – Operating System
  – Configuration (drawn beside the stack)
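Images vs. Layers: Lifecycle
• Images: Replacement
  – Each image bundles Config with I+A+U+O/S (Integrations + Applications + Utilities + Operating System)
  – A change swaps in a whole new Config + I+A+U+O/S image
• Layers: Upgrade
  – Before: I, Foo, Bar v1, U, OS, plus Config
  – After: I, Foo, Bar v2, U, OS, plus Config
  – Only the Bar layer is replaced (Bar v2)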
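The contrast between replacing an image and upgrading a layer can be sketched in a few lines of Python. This is only an illustration of the idea on the slide: the dictionary keys and the two function names are hypothetical, and the sketch does not model how Crowbar or any real packaging tool works.

```python
# Hypothetical model of the stack from the slide: I, Foo, Bar, U, OS.
STACK = {
    "os": "OS",
    "utilities": "U",
    "app_bar": "Bar v1",
    "app_foo": "Foo",
    "integrations": "I",
}


def upgrade_as_image(stack, layer, version):
    """Image model: any change rebuilds and redeploys the whole bundle."""
    new_stack = {**stack, layer: version}
    redeployed = list(new_stack)  # every component ships again
    return new_stack, redeployed


def upgrade_as_layers(stack, layer, version):
    """Layer model: only the changed layer is redeployed."""
    new_stack = {**stack, layer: version}
    redeployed = [layer]
    return new_stack, redeployed


image_stack, image_redeployed = upgrade_as_image(STACK, "app_bar", "Bar v2")
layer_stack, layer_redeployed = upgrade_as_layers(STACK, "app_bar", "Bar v2")
print(len(image_redeployed), "components redeployed with images")
print(len(layer_redeployed), "component redeployed with layers")
```

Both approaches end with the same software on the node; the difference is how much has to be rebuilt and pushed to get there.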
Modular Design: Barclamps
[Diagram, side label: Dell “Crowbar” Ops Management]
• Layers, top to bottom: APIs, User Access, & Ecosystem Partners; Cloud Infrastructure & Dell IP Extensions; Core Components & Operating Systems; Physical Resources
• Barclamps shown, top to bottom: Nagios, Ganglia, Dashboard; Hadoop; Crowbar, DNS, Logging; Deployer, NTP; Provisioner, BIOS, IPMI; Network, RAID
Crowbar = Install State Machine
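As a rough illustration of what "install state machine" means, the sketch below walks a node through a fixed sequence of install states. The state names, the `NodeInstall` class, and its methods are all hypothetical and chosen for readability; they are not taken from the slides or from Crowbar's code.

```python
# Hypothetical install states; each state has exactly one successor.
TRANSITIONS = {
    "discovered": "hardware_configured",
    "hardware_configured": "os_installed",
    "os_installed": "ready",
}


class NodeInstall:
    """Tracks one node as it moves through the install sequence."""

    def __init__(self, name):
        self.name = name
        self.state = "discovered"
        self.history = [self.state]

    def advance(self):
        """Move to the next state, or fail if the node is already done."""
        next_state = TRANSITIONS.get(self.state)
        if next_state is None:
            raise RuntimeError(f"{self.name} has no transition from {self.state}")
        self.state = next_state
        self.history.append(next_state)
        return next_state

    def run_to_ready(self):
        """Advance until the node reaches the final state."""
        while self.state != "ready":
            self.advance()


node = NodeInstall("node-01")
node.run_to_ready()
print(node.history)
```

Encoding the sequence as data means a node cannot skip a step, and its current state says exactly which step runs next.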
Cloud = Ops
We have capable hardware & software; the real question is how we are going to operate it as a service.
• This is CloudOps
• Software mindset to infrastructure
   – Software is constantly changing
   – Fluid resources instead of servers
   – Manual touch is unacceptable
Ultimately, all the rules for operating the data center become encoded as automation software.
[Diagram labels: HW, SW, OPS, Cloud Ops]
Second Act
Platform Selection
Dell PowerEdge C2100 for Hadoop based on Intel® Xeon®
Dell PowerEdge C2100
• Designed with big data in mind
• Compact 2U form factor
• 2-socket 6-core
• Intel® Xeon® 5620 processor
• High performance memory system
• Expansive disk storage
Recommended Configuration
• Intel Xeon Processor 5600 series
• 4-6 1TB or 2TB 7200 RPM SATA drives
• 12-24GB DDR3 R-ECC RAM
• 1-2 dual-port 1GigE
• Linux kernel 2.6.30 or later
• Sun Java 6u14 or later
• Hadoop version 0.20.x or later
Intel Whitepaper: “Optimizing Hadoop Deployments” (http://software.intel.com/file/31124)
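A recommended configuration like this can be turned into an automated check. The sketch below is one possible way to do that in Python; the `NodeSpec` fields and the `check_node` function are hypothetical, and it assumes the lower end of each range on the slide is a minimum, which the slide does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Hypothetical description of one worker node."""
    data_disks: int      # number of SATA data disks
    ram_gb: int          # installed RAM in GB
    kernel: tuple        # Linux kernel version, e.g. (2, 6, 32)
    java6_update: int    # Java 6 update number, e.g. 14 for 6u14
    hadoop: tuple        # Hadoop version, e.g. (0, 20, 2)


def check_node(spec):
    """Return a list of deviations from the recommended minimums."""
    problems = []
    if spec.data_disks < 4:
        problems.append("fewer than 4 data disks")
    if spec.ram_gb < 12:
        problems.append("less than 12 GB RAM")
    if spec.kernel < (2, 6, 30):
        problems.append("kernel older than 2.6.30")
    if spec.java6_update < 14:
        problems.append("Java older than 6u14")
    if spec.hadoop < (0, 20):
        problems.append("Hadoop older than 0.20.x")
    return problems


good = NodeSpec(data_disks=6, ram_gb=24, kernel=(2, 6, 32),
                java6_update=26, hadoop=(0, 20, 2))
bad = NodeSpec(data_disks=2, ram_gb=8, kernel=(2, 6, 18),
               java6_update=10, hadoop=(0, 19, 0))
print(check_node(good))
print(check_node(bad))
```

Versions are compared as tuples so that, for example, 2.6.9 sorts before 2.6.30, which a plain string comparison would get wrong.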
So what seems to be the problem?
• Dataflow and high-level abstraction make it difficult to understand runtime behaviors

• A large distributed system makes it difficult to correlate concurrent performance-related activities
HiTune: Hadoop Performance Analyzer
• Collects metrics from each node
• Aggregates data using Chukwa
• Analyzes results using Hadoop
• Generates reports for visualization

• System metrics (CPU, Disk I/O, Network I/O, Memory)
• Hadoop metrics (NameNode, DataNode, JobTracker, TaskTracker, JVM metrics)
• Dataflow-based statistics (Job, Map Tasks, Reduce Tasks, thread dumps for M/R)
• Summary view of a single job
• Summary view comparing multiple jobs

Apache 2.0 License
HiTune Architecture
• Tracker
    – Lightweight agent running on each node in the Hadoop cluster
         • Sysstat, Hadoop logs and metrics, Java instrumentation
• Aggregation engine
    – Merges the results of all the trackers in a distributed fashion
• Analysis engine
    – Generates reports based on data flow model
[Diagram: groups of Task/Sampler pairs, each group with a Tracker; the Trackers connect to the Aggregation engine and Analysis engine; further labels: Specification file, Dataflow diagram]
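The tracker, aggregation, and analysis stages above can be sketched as a small in-memory pipeline. This is only a toy model under stated assumptions: the sample tuple layout, the function names, and the node and task names are hypothetical, and a plain merge stands in for the Chukwa and Hadoop machinery that HiTune actually uses.

```python
import heapq
from collections import defaultdict

# Assumed sample layout: (timestamp, node, task_id, stage, cpu_percent).
# Each tracker's list is already sorted by timestamp.
node_a = [
    (1, "node-a", "map-0", "map", 80.0),
    (3, "node-a", "map-0", "spill", 40.0),
]
node_b = [
    (2, "node-b", "reduce-0", "shuffle", 10.0),
    (4, "node-b", "reduce-0", "shuffle", 30.0),
]


def aggregate(tracker_streams):
    """Merge the per-node sample streams into one time-ordered list."""
    return list(heapq.merge(*tracker_streams))


def analyze(samples, stages):
    """Average CPU per dataflow stage, limited to the requested stages."""
    by_stage = defaultdict(list)
    for _ts, _node, _task, stage, cpu in samples:
        if stage in stages:
            by_stage[stage].append(cpu)
    return {stage: sum(values) / len(values) for stage, values in by_stage.items()}


merged = aggregate([node_a, node_b])
report = analyze(merged, stages={"map", "shuffle"})
print(report)
```

The `stages` argument plays the role of a very small specification: it tells the analysis step which parts of the dataflow to report on.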
Case Study
[Diagram: MapReduce dataflow. Partitioned input → map tasks (map, spill) → reduce tasks (shuffle with copier, merge, sort, reduce) → aggregated output. Annotations: "Streaming dataflow" and "Sequential dataflow".]
Terasort with zlib
• Large gap between end of map and end of shuffle
• No CPU, I/O, or network bandwidth bottlenecks
• Adding copiers does not change “shuffle fetchers busy percent”, which stays at 100
Terasort with LZO
• Copier threads are idle 80% of the time, waiting for memory merge threads
• Memory merge threads are busy mostly due to compression
• Changing compression codec to LZO closes the gap
• Improves job running time by 2.3x
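For readers who want to try the same codec change, the sketch below renders a Hadoop-style configuration fragment from Python. Several things here are assumptions rather than details from the slides: that the compression in question is map output compression, that the Hadoop 0.20-era property names `mapred.compress.map.output` and `mapred.map.output.compression.codec` apply to your version, and that the LZO codec class `com.hadoop.compression.lzo.LzoCodec` is available from a separately installed LZO library. The helper name `render_properties` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed settings for LZO-compressed map output (verify for your Hadoop version).
LZO_MAP_OUTPUT = {
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec": "com.hadoop.compression.lzo.LzoCodec",
}


def render_properties(props):
    """Render name/value pairs as a <configuration> XML fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_fragment = render_properties(LZO_MAP_OUTPUT)
print(xml_fragment)
```

Whether LZO helps depends on where the time goes; in the case study the gain came from relieving the memory merge threads, so it is worth confirming the same bottleneck before expecting a similar speedup.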
Have at it
• Pull Crowbar
  – https://github.com/dellcloudedge/crowbar



• Pull HiTune
  – https://github.com/HiTune/HiTune
Q&A

Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonski, Dell & Vin Sharma, Intel

Openstack in action2! Automate and accelerate Cloud deployments with Dell Cro...
Openstack in action2! Automate and accelerate Cloud deployments with Dell Cro...Openstack in action2! Automate and accelerate Cloud deployments with Dell Cro...
Openstack in action2! Automate and accelerate Cloud deployments with Dell Cro...eNovance
 
Dell openstack boston meetup dell crowbar and open stack
Dell openstack boston meetup   dell crowbar and open stackDell openstack boston meetup   dell crowbar and open stack
Dell openstack boston meetup dell crowbar and open stackDellCloudEdge
 
Wicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with CrowbarWicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with CrowbarCeph Community
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftLee Stott
 
Operating the Hyperscale Cloud
Operating the Hyperscale CloudOperating the Hyperscale Cloud
Operating the Hyperscale CloudOpen Stack
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationYahoo Developer Network
 
Deploying OpenStack using Crowbar
Deploying OpenStack using CrowbarDeploying OpenStack using Crowbar
Deploying OpenStack using Crowbaropenstackindia
 
Oracle10g new features
Oracle10g  new featuresOracle10g  new features
Oracle10g new featuresTanvi_Agrawal
 
2016 - Open Mic - IGNITE - Open Infrastructure = ANY Infrastructure
2016 - Open Mic - IGNITE - Open Infrastructure = ANY Infrastructure2016 - Open Mic - IGNITE - Open Infrastructure = ANY Infrastructure
2016 - Open Mic - IGNITE - Open Infrastructure = ANY Infrastructuredevopsdaysaustin
 
OpenStack Preso: DevOps on Hybrid Infrastructure
OpenStack Preso: DevOps on Hybrid InfrastructureOpenStack Preso: DevOps on Hybrid Infrastructure
OpenStack Preso: DevOps on Hybrid Infrastructurerhirschfeld
 
OSCON 2012 OpenStack Automation and DevOps Best Practices
OSCON 2012 OpenStack Automation and DevOps Best PracticesOSCON 2012 OpenStack Automation and DevOps Best Practices
OSCON 2012 OpenStack Automation and DevOps Best PracticesMatt Ray
 
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure PlatformVitor Tomaz
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopHadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopYahoo Developer Network
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesNicola Ferraro
 
Partitioning CCGrid 2012
Partitioning CCGrid 2012Partitioning CCGrid 2012
Partitioning CCGrid 2012Weiwei Chen
 
20120524 cern data centre evolution v2
20120524 cern data centre evolution v220120524 cern data centre evolution v2
20120524 cern data centre evolution v2Tim Bell
 

Semelhante a Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonski, Dell & Vin Sharma, Intel (20)

Openstack in action2! Automate and accelerate Cloud deployments with Dell Cro...
Openstack in action2! Automate and accelerate Cloud deployments with Dell Cro...Openstack in action2! Automate and accelerate Cloud deployments with Dell Cro...
Openstack in action2! Automate and accelerate Cloud deployments with Dell Cro...
 
Dell openstack boston meetup dell crowbar and open stack
Dell openstack boston meetup   dell crowbar and open stackDell openstack boston meetup   dell crowbar and open stack
Dell openstack boston meetup dell crowbar and open stack
 
Wicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with CrowbarWicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop Microsoft
 
Operating the Hyperscale Cloud
Operating the Hyperscale CloudOperating the Hyperscale Cloud
Operating the Hyperscale Cloud
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
 
Deploying OpenStack using Crowbar
Deploying OpenStack using CrowbarDeploying OpenStack using Crowbar
Deploying OpenStack using Crowbar
 
Oracle10g new features
Oracle10g  new featuresOracle10g  new features
Oracle10g new features
 
2016 - Open Mic - IGNITE - Open Infrastructure = ANY Infrastructure
2016 - Open Mic - IGNITE - Open Infrastructure = ANY Infrastructure2016 - Open Mic - IGNITE - Open Infrastructure = ANY Infrastructure
2016 - Open Mic - IGNITE - Open Infrastructure = ANY Infrastructure
 
OpenStack Preso: DevOps on Hybrid Infrastructure
OpenStack Preso: DevOps on Hybrid InfrastructureOpenStack Preso: DevOps on Hybrid Infrastructure
OpenStack Preso: DevOps on Hybrid Infrastructure
 
OSCON 2012 OpenStack Automation and DevOps Best Practices
OSCON 2012 OpenStack Automation and DevOps Best PracticesOSCON 2012 OpenStack Automation and DevOps Best Practices
OSCON 2012 OpenStack Automation and DevOps Best Practices
 
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopHadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
 
Partitioning CCGrid 2012
Partitioning CCGrid 2012Partitioning CCGrid 2012
Partitioning CCGrid 2012
 
20120524 cern data centre evolution v2
20120524 cern data centre evolution v220120524 cern data centre evolution v2
20120524 cern data centre evolution v2
 

Mais de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Mais de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 

Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonski, Dell & Vin Sharma, Intel

  • 1. Proven Tools to Simplify Hadoop Environments Joey Jablonski & Vin Sharma
  • 2. Intros and Bios • Joey Jablonski – Dell Principal Solution Architect – http://www.linkedin.com/in/joeyjablonski – http://mergingbusinessandit.com • Vin Sharma – Intel Open Source Enterprise Strategist – http://www.linkedin.com/in/c1f3rt3xt – http://www.intel.com/opensource
  • 3. Agenda • Why Hadoop is difficult for IT to operate • How the right tools can make this easier – Deployment & Configuration with Dell Crowbar – Monitoring and Management with Dell Barclamps – Performance Tuning with Intel HiTune
  • 4. Hadoop Operational Models • Traditional Datacenter: Assigned Servers, Rigid Policies, Tiered Software • Cloud Infrastructure: Elastic Resources, Services (APIs), Distributed Software
  • 5. Operational Challenges • Deployment – Complex because of scale (60 nodes to 1000 nodes) – Cumbersome because of high-touch processes • Configuration & Tuning – Error-prone configuration management – State management • Monitoring and Management – Complex troubleshooting and diagnostics – No proactive notification of problems • Performance Optimization – Limitations of traditional tools
  • 7. Three aspects of revolutionary clouds: Two Sides of Cloud [Diagram labels: Ecosystem + API, Ops, Black Box; Cloud = Operations; HW, SW, OPS; CloudOps spanning APIs, Cloud, Ops, O/S, Physical]
  • 8. Images vs. Layers: Overview • Images: Single Unit (Configuration + Integrations + Applications + Utilities + Operating System) • Layers: Stacked Pieces (Configuration, Integrations, Application Foo, Application Bar, Utilities, Operating System)
  • 9. Images vs. Layers: Lifecycle • Images: Replacement • Layers: Upgrade [Diagram labels: image stacks of Config over I+A+U+O/S; layer stacks of Config, I, Foo, Bar v1 and Bar v2, U, OS]
  • 10. Modular Design: Barclamps [Diagram: Dell “Crowbar” Ops Management. Layers: APIs, User Access, & Ecosystem Partners; Cloud Infrastructure & Dell IP Extensions; Core Components & Operating Systems; Physical Resources. Barclamps shown: Nagios, Ganglia, Dashboard, Hadoop, Crowbar, DNS, Logging, Deployer, NTP, Provisioner, BIOS, IPMI, Network, RAID]
  • 11. Crowbar = Install State Machine
  • 12. Cloud = Ops: We have capable hardware & software; the real question is how are we going to operate it as a service? • This is CloudOps • Software mindset to infrastructure • Software is constantly changing • Fluid resources instead of servers • Manual touch is unacceptable. Ultimately, all the rules for operating the data center become encoded as automation software. [Diagram labels: HW, SW, Ops, Cloud]
  • 14. Platform Selection: Dell PowerEdge C2100 for Hadoop based on Intel® Xeon® • Dell PowerEdge C2100 – Designed with big data in mind – Compact 2U form factor – 2-socket 6-core – Intel® Xeon® 5620 processor – High performance memory system – Expansive disk storage • Recommended Configuration – Intel Xeon Processor 5600 series – 4-6 1TB or 2TB 7200 RPM SATA drives – 12-24GB DDR3 R-ECC RAM – 1-2 dual-port 1GigE – Linux kernel 2.6.30 or later – Sun Java 6u14 or later – Hadoop version 0.20.x or later • Intel Whitepaper: “Optimizing Hadoop Deployments” (http://software.intel.com/file/31124)
  • 15. So what seems to be the problem? • Dataflow and high-level abstraction make it difficult to understand runtime behaviors • Large distributed system makes it difficult to correlate concurrent performance-related activities
  • 16. HiTune: Hadoop Performance Analyzer • Collects metrics from each node • Aggregates data using Chukwa • Analyzes results using Hadoop • Generates reports for visualization • System metrics (CPU, Disk I/O, Network I/O, Memory) • Hadoop metrics (NameNode, DataNode, JobTracker, TaskTracker, JVM metrics) • Dataflow-based statistics (Job, Map Tasks, Reduce Tasks, thread dump for M/R) • Summary view of a single job • Summary view by comparing multiple jobs • Apache 2.0 License
  • 17. HiTune Architecture • Tracker – Lightweight agent running on each node in the Hadoop cluster – Sysstat, Hadoop logs and metrics, Java instrumentation • Aggregation engine – Merges the results of all the trackers in a distributed fashion • Analysis engine – Generates reports based on data flow model [Diagram labels: Sampler, Task Tracker, Aggregation engine, Analysis engine, Specification file, Dataflow diagram]
  • 18. Case Study [Diagram: MapReduce dataflow from Partitioned Input through Map Tasks (map, spill), shuffle, and Reduce Tasks (copier, merge, sort, reduce) to Aggregated Output, marking streaming and sequential dataflow] • Terasort with zlib – Large gap between end of map and end of shuffle – No CPU, I/O, or network bandwidth bottlenecks – Adding copiers does not change “shuffle fetchers busy percent” = 100 • Terasort with LZO – Copier threads idle 80% waiting for memory merge threads – Memory merge threads busy mostly due to compression – Changing compression codec to LZO closes the gap – Improves job running time by 2.3x
  • 19. Have at it • Pull Crowbar – https://github.com/dellcloudedge/crowbar • Pull HiTune – https://github.com/HiTune/HiTune
  • 20. Q&A
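
The codec change described in the case study (slide 18) can be sketched as a small configuration generator. This is a minimal illustration, not the presenters' actual setup: it assumes the codec in question is the map-output (shuffle) codec, uses the Hadoop 0.20.x property names `mapred.compress.map.output` and `mapred.map.output.compression.codec`, and assumes the LZO codec class is supplied by a separately installed hadoop-lzo package. The helper names `build_mapred_site` and `render` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hadoop 0.20.x property names for compressing intermediate map output.
# Assumption: the codec class below comes from a separately installed
# hadoop-lzo package; the slides do not name the class.
LZO_SHUFFLE_PROPERTIES = {
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec": "com.hadoop.compression.lzo.LzoCodec",
}


def build_mapred_site(properties):
    """Return a <configuration> element with one <property> per entry."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


def render(root):
    """Serialize the configuration element to an XML string."""
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print a fragment that could be merged into mapred-site.xml by hand.
    print(render(build_mapred_site(LZO_SHUFFLE_PROPERTIES)))
```

Generating the fragment rather than editing it by hand fits the deck's point that operating rules should end up encoded as automation; whether LZO helps a given job still needs to be confirmed by measurement, as the case study did with HiTune.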

Editor's Notes

  1. Hadoop Operations (10 min): Struggles and Challenges (Dell). Operations Framework (25 min): DevOps-inspired operations framework (Dell); Crowbar (Dell); Monitoring and Management (Intel); Power & Cooling (Dell). Hadoop Lifecycle Management (10 min): Performance Testing with HiTune (Intel); Hadoop Tuning (Intel).
  2. For NoSQL data warehouses using Hadoop, you can see the benefit of modern servers versus legacy. On these two tests, the Xeon 5600-based server cluster significantly outperformed the legacy server cluster, and offered many more features and greater energy efficiency than the older model. It pays to optimize around the right hardware. Legacy servers will forgo a lot of performance and energy efficiency, potentially limiting the SLA, number of users, and amount of data that can be processed for analysis. • Intel® Xeon® 5600 improves Hadoop workload performance • Choosing an optimized server board can reduce power consumption • Use Intel® X25-E SATA SSDs to improve performance. Software & configurations: • Use latest Linux kernel • Turn on Intel® Hyper-Threading • Optimize Hadoop configuration • Tuning may be different for different workload types