SlideShare uma empresa Scribd logo
1 de 21
PRODUCTIONIZING HADOOP
New Lessons Learned
Eric Sammer
General Announcements

• All lines are muted
• Ask questions any time using the “questions”
  pane on your GoToWebinar panel
• Recording of this webinar will be available on
  demand at www.cloudera.com
The Universe of Operations


System Operations       Architecture and App Ops
• Server and network    • Data architecture
• Operating system      • Data integration
• Identity and access   • Data quality monitoring
• Resource management   • Resource management
• Maintenance           • Pipeline maintenance
• Cluster monitoring    • Governance
• Backup and DR
Scope for Today

• A focus on common stumbling blocks
   • Workload-oriented planning and identification
   • Network architecture
   • Host management
   • Configuration management
   • Identity, Access, and Authorization
   • Cluster and resource sharing
• Time for questions
Proper Planning

• Develop an understanding of your use cases
   • What you (will) do defines what you need
   • Analog: OLTP RDBMS versus OLAP
• Prototype if necessary
Understanding Cluster Usage
…by use case



           Data Mining / IR


                                             ETL


                              Report Generation

         Analytics
…by use case



                      Data Mining / IR
    Network utilization is a
    function of job size, its
    profile, and the number                             ETL
    of concurrent jobs
                                         Report Generation

                  Analytics
Network Architecture

• Your current architecture is probably fine
   • Typical: traditional L2 tree (fine for North/South)
   • Emerging: L3 spine/leaf (optimized for East/West)
• Minimize oversubscription (normal: 1:1.2)
• Deep port buffers       (with fair allocation for shared memory)

• Do not collocate low-latency apps with MR
• Monitor, monitor, monitor
   • Bandwidth, buffer, packet count, and size deciles
Host Configuration

• OS version and patches
• Java 6   (HotSpot VM)

• PAM limits    (nofile, nproc)

• Naming    (nsswitch.conf, resolv.conf, hosts, gethostname())

• OS filesystem selection and tuning
• Time service
• Users, groups, and identity management
• Machines should not be unique snowflakes
Configuration Management

• Puppet/Chef/<your favorite> for OS config
   • Package installation
   • Identity and authorization wiring
• Cloudera Manager for platform management
   • Deployment and configuration
   • Service lifecycle
   • Platform-specific service monitoring and diagnostics
   • Activity monitoring
• Complementary systems
   • Differentiating factors: centralized
     coordination, service awareness, orchestration
Identity, Access, and Authorization

• MapReduce is a code execution engine
• Identity management and access control is hard
  (in distributed systems like Hadoop)
• Hadoop uses the OS (or Kerberos) for identity
   • Lots of entry points
   • Comparatively low level
• Access control is a function of each service
   • HDFS: Unix-style octal permissions on objects
   • MapReduce: ACLs on job queues
Resource Sharing

• One cluster, many groups
• Pros
   • Benefit from aggregate resources
   • Greater utilization
   • Reduced cap/op-ex
Resource Sharing
• Three dimensions of sharing a cluster
    • Collocation of services (e.g. MapReduce and HBase)
    • Collocation of groups of users
    • Collocation of workload profiles (ETL, analytics)
• In an ideal world, collocate all and enforce policy
    • Not currently possible
• Problems
    • System utilization varies wildly
    • Fair distribution of shared resources
    • Increased access control complexity
    • SLA of most sensitive group applies to all
    • …but nothing new
Resource Sharing
• Reasons to collocate groups / applications:
   • Similar system utilization profiles
   • Time-based utilization (e.g. daily ETL and office hour
      analytics)
   • Maintain similar SLAs
   • Extensively data sharing
   • When it’s trivially easy with current control mechanisms
• Reasons to segregate groups / applications:
   • Compliance, regulation, or where security is paramount
   • Wildly dissimilar utilization profiles (notably HBase and
      MapReduce)
• A significant area of interest for Cloudera
Now What?

• There’s a lot (more) to think about
• We can help
   • Education
   • Services
   • Software
   • Support
• Strata + Hadoop World 2012
• Look for upcoming webinars
Questions?
Type them in the “Questions” panel.

Congratulations to the winners
of the book drawing!
• Vani Mahobia
• Ken Gayler
• Richard Zhang
• Anand Rajan
• Erica Muxlow
Questions?
Type them in the “Questions” panel.



To learn more about Hadoop
Operations, A Guide for
Developers and
Administrators, or about the
spotted cavy, go to
www.oreilly.com
THANK YOU!
Eric Sammer, Principal Solutions Architect
@esammer
For more information: www.cloudera.com
Sales: (888)789-1488
@cloudera
Hardware Planning

• CPU
• Disk capacity and configuration
• Spindle count
• Memory (amount and configuration)
• NIC configuration
• Hadoop’s hardware preferences tend to be
 controversial until the architecture is understood
Baseline Hardware

• Disk
   • SATA II 7200RPM (SAS controller)
   • JBOD (OS on R1)
   • Option 1: 12x3.5” LFF 3TB
   • Option 2: 24x2.5” SFF 1TB
   • Option: MDL/NL SAS drives
• 2x2.2Ghz 6C 20MB cache
• 48GB+ DDR3-1600 ECC
• 1GbE vs. 10GbE
   • Is there new info here?

Mais conteúdo relacionado

Mais procurados

Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Tableau on Hadoop Meet Up: Advancing from Extracts to Live ConnectTableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Tableau on Hadoop Meet Up: Advancing from Extracts to Live ConnectRemy Rosenbaum
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...DataWorks Summit
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computingSachin Gowda
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentBlueData, Inc.
 
JethroData technical white paper
JethroData technical white paperJethroData technical white paper
JethroData technical white paperJethroData
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in AzureMostafa
 
ETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure DatabricksETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure DatabricksDatabricks
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azureMostafa
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Data Con LA
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Learning UML with Enterprise Architect
Learning UML with Enterprise ArchitectLearning UML with Enterprise Architect
Learning UML with Enterprise ArchitectGerald R. Gray
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkMatt Ingenthron
 
Operationalizing Data Science Using Cloud Foundry
Operationalizing Data Science Using Cloud FoundryOperationalizing Data Science Using Cloud Foundry
Operationalizing Data Science Using Cloud FoundryVMware Tanzu
 
Accelerate Data Science Initiatives: Databricks & Privacera
Accelerate Data Science Initiatives: Databricks & PrivaceraAccelerate Data Science Initiatives: Databricks & Privacera
Accelerate Data Science Initiatives: Databricks & PrivaceraDatabricks
 
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Cloudera, Inc.
 
Ankus, bigdata deployment and orchestration framework
Ankus, bigdata deployment and orchestration frameworkAnkus, bigdata deployment and orchestration framework
Ankus, bigdata deployment and orchestration frameworkAshrith Mekala
 

Mais procurados (19)

Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Tableau on Hadoop Meet Up: Advancing from Extracts to Live ConnectTableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 
Data engineering
Data engineeringData engineering
Data engineering
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computing
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
JethroData technical white paper
JethroData technical white paperJethroData technical white paper
JethroData technical white paper
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in Azure
 
ETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure DatabricksETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure Databricks
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azure
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Learning UML with Enterprise Architect
Learning UML with Enterprise ArchitectLearning UML with Enterprise Architect
Learning UML with Enterprise Architect
 
Kudu demo
Kudu demoKudu demo
Kudu demo
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
Operationalizing Data Science Using Cloud Foundry
Operationalizing Data Science Using Cloud FoundryOperationalizing Data Science Using Cloud Foundry
Operationalizing Data Science Using Cloud Foundry
 
Accelerate Data Science Initiatives: Databricks & Privacera
Accelerate Data Science Initiatives: Databricks & PrivaceraAccelerate Data Science Initiatives: Databricks & Privacera
Accelerate Data Science Initiatives: Databricks & Privacera
 
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
 
Ankus, bigdata deployment and orchestration framework
Ankus, bigdata deployment and orchestration frameworkAnkus, bigdata deployment and orchestration framework
Ankus, bigdata deployment and orchestration framework
 

Semelhante a Productionizing Hadoop - New Lessons Learned

Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha Talagala
 
Survey of Big Data Infrastructures
Survey of Big Data InfrastructuresSurvey of Big Data Infrastructures
Survey of Big Data Infrastructuresm.a.kirn
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
How to Build a Compute Cluster
How to Build a Compute ClusterHow to Build a Compute Cluster
How to Build a Compute ClusterRamsay Key
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindAvere Systems
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learnJohn D Almon
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
Hadoop and IDW - When_to_use_which
Hadoop and IDW - When_to_use_whichHadoop and IDW - When_to_use_which
Hadoop and IDW - When_to_use_whichDan TheMan
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
What ya gonna do?
What ya gonna do?What ya gonna do?
What ya gonna do?CQD
 
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...confluent
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketDremio Corporation
 
Apache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouseApache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehousehadoopsphere
 
Implementing Private Database Clouds
Implementing Private Database CloudsImplementing Private Database Clouds
Implementing Private Database CloudsRoland Slee
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
StreamHorizon overview
StreamHorizon overviewStreamHorizon overview
StreamHorizon overviewStreamHorizon
 

Semelhante a Productionizing Hadoop - New Lessons Learned (20)

Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016
 
Survey of Big Data Infrastructures
Survey of Big Data InfrastructuresSurvey of Big Data Infrastructures
Survey of Big Data Infrastructures
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
How to Build a Compute Cluster
How to Build a Compute ClusterHow to Build a Compute Cluster
How to Build a Compute Cluster
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learn
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Hadoop and IDW - When_to_use_which
Hadoop and IDW - When_to_use_whichHadoop and IDW - When_to_use_which
Hadoop and IDW - When_to_use_which
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
What ya gonna do?
What ya gonna do?What ya gonna do?
What ya gonna do?
 
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C...
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current Market
 
Apache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouseApache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouse
 
Implementing Private Database Clouds
Implementing Private Database CloudsImplementing Private Database Clouds
Implementing Private Database Clouds
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
StreamHorizon overview
StreamHorizon overviewStreamHorizon overview
StreamHorizon overview
 

Mais de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Mais de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Último

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Último (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Productionizing Hadoop - New Lessons Learned

  • 2. General Announcements • All lines are muted • Ask questions any time using the “questions” pane on your GoToWebinar panel • Recording of this webinar will be available on demand at www.cloudera.com
  • 3. The Universe of Operations System Operations Architecture and App Ops • Server and network • Data architecture • Operating system • Data integration • Identity and access • Data quality monitoring • Resource management • Resource management • Maintenance • Pipeline maintenance • Cluster monitoring • Governance • Backup and DR
  • 4. Scope for Today • A focus on common stumbling blocks • Workload-oriented planning and identification • Network architecture • Host management • Configuration management • Identity, Access, and Authorization • Cluster and resource sharing • Time for questions
  • 5. Proper Planning • Develop an understanding of your use cases • What you (will) do defines what you need • Analog: OLTP RDBMS versus OLAP • Prototype if necessary
  • 7. …by use case Data Mining / IR ETL Report Generation Analytics
  • 8. …by use case Data Mining / IR Network utilization is a function of job size, its profile, and the number ETL of concurrent jobs Report Generation Analytics
  • 9. Network Architecture • Your current architecture is probably fine • Typical: traditional L2 tree (fine for North/South) • Emerging: L3 spine/leaf (optimized for East/West) • Minimize oversubscription (normal: 1:1.2) • Deep port buffers (with fair allocation for shared memory) • Do not collocate low-latency apps with MR • Monitor, monitor, monitor • Bandwidth, buffer, packet count, and size deciles
  • 10. Host Configuration • OS version and patches • Java 6 (HotSpot VM) • PAM limits (nofile, nproc) • Naming (nsswitch.conf, resolv.conf, hosts, gethostname()) • OS filesystem selection and tuning • Time service • Users, groups, and identity management • Machines should not be unique snowflakes
  • 11. Configuration Management • Puppet/Chef/<your favorite> for OS config • Package installation • Identity and authorization wiring • Cloudera Manager for platform management • Deployment and configuration • Service lifecycle • Platform-specific service monitoring and diagnostics • Activity monitoring • Complementary systems • Differentiating factors: centralized coordination, service awareness, orchestration
  • 12. Identity, Access, and Authorization • MapReduce is a code execution engine • Identity management and access control is hard (in distributed systems like Hadoop) • Hadoop uses the OS (or Kerberos) for identity • Lots of entry points • Comparatively low level • Access control is a function of each service • HDFS: Unix-style octal permissions on objects • MapReduce: ACLs on job queues
  • 13. Resource Sharing • One cluster, many groups • Pros • Benefit from aggregate resources • Greater utilization • Reduced cap/op-ex
  • 14. Resource Sharing • Three dimensions of sharing a cluster • Collocation of services (e.g. MapReduce and HBase) • Collocation of groups of users • Collocation of workload profiles (ETL, analytics) • In an ideal world, collocate all and enforce policy • Not currently possible • Problems • System utilization varies wildly • Fair distribution of shared resources • Increased access control complexity • SLA of most sensitive group applies to all • …but nothing new
  • 15. Resource Sharing • Reasons to collocate groups / applications: • Similar system utilization profiles • Time-based utilization (e.g. daily ETL and office hour analytics) • Maintain similar SLAs • Extensively data sharing • When it’s trivially easy with current control mechanisms • Reasons to segregate groups / applications: • Compliance, regulation, or where security is paramount • Wildly dissimilar utilization profiles (notably HBase and MapReduce) • A significant area of interest for Cloudera
  • 16. Now What? • There’s a lot (more) to think about • We can help • Education • Services • Software • Support • Strata + Hadoop World 2012 • Look for upcoming webinars
  • 17. Questions? Type them in the “Questions” panel. Congratulations to the winners of the book drawing! • Vani Mahobia • Ken Gayler • Richard Zhang • Anand Rajan • Erica Muxlow
  • 18. Questions? Type them in the “Questions” panel. To learn more about Hadoop Operations, A Guide for Developers and Administrators, or about the spotted cavy, go to www.oreilly.com
  • 19. THANK YOU! Eric Sammer, Principal Solutions Architect @esammer For more information: www.cloudera.com Sales: (888)789-1488 @cloudera
  • 20. Hardware Planning • CPU • Disk capacity and configuration • Spindle count • Memory (amount and configuration) • NIC configuration • Hadoop’s hardware preferences tend to be controversial until the architecture is understood
  • 21. Baseline Hardware • Disk • SATA II 7200RPM (SAS controller) • JBOD (OS on R1) • Option 1: 12x3.5” LFF 3TB • Option 2: 24x2.5” SFF 1TB • Option: MDL/NL SAS drives • 2x2.2Ghz 6C 20MB cache • 48GB+ DDR3-1600 ECC • 1GbE vs. 10GbE • Is there new info here?

Notas do Editor

  1. INTERNAL NOTES – DELETE BEFORE POSTING!Set expectation that this is targeted to relatively beginner audience?What’s new? What are the NEW lessons learned? Example war story to start it off would help audience get into it.Scope? Core Hadoop (MR &amp; HDFS) vs. the entire CDH stack (Hive, ZK, HBase, etc.) and how do they co-locate deployment-wise. i.e. Do I need separate HW to run other components?(MapR depositioning): Mention: HA, performance, DR, data integrity, federation, MR2,
  2. SCRIPT for Zoo/Moderator (go through this as quickly as you can)Before we get started I’d like to let you know thatAll lines are mutedAsk questions any time by typing them into the QUESTIONS pane on your GoToWebinar panelThis webinar is being recorded and will be available later at cloudera.comLet me pass you to Eric Sammer, who is a Principal Solutions Architect and Cloudera and author of the recently published book “Hadoop Operations” by O’Reilly Media.
  3. - Do I need to dedicated rack/network for Hadoop? Or can I run other apps services running on same rack/network?
  4. Why not use Puppet/Chef for Hadoop config as well? Why is CM better? If I use Puppet/Chef for ALL my config mgmt (systems &amp; apps), why point solution CM for Hadoop?
  5. SCRIPT Zoo/moderator (speak fast):Thank you Eric. Let’s now move quickly into the Q&amp;A portion of this webinar. Please type your questions into the QUESTIONS PANEL and we’ll get to as many questions as we have time for. While Eric is reviewing the questions I’d like to congratulate the winners of the book drawing. If you see your name listed here your book will be mailed to you by the last week of October. It’s being printed now so when you receive it it’ll be “hot off the press”.MOVE TO NEXT SLIDE – get winners’ names off the screen
  6. SCRIPT Zoo/moderator (speak fast):Eric, are you ready to answer some questions?MOVE TO THANK YOU SLIDE WHILE CLOSING