    Million Monkeys
    Jesse Anderson | Curriculum Developer and Instructor
    November 2012




1
About Me
    • Cloudera - Educational Services Team
    • Twitter - @jessetanderson
    • Blog and more info: http://www.jesse-anderson.com
    • Screencasts on Pragmatic Programmers: Buy It Now on
      http://www.jesse-anderson.com
    • President – Northern Nevada Software Developers Group




2
About Cloudera
    • Cloudera is “The commercial Hadoop company”
    • Founded by leading experts on Hadoop from
      Facebook, Google, Oracle and Yahoo
    • Provides consulting and training services for Hadoop users
    • Staff includes committers to virtually all Hadoop projects




3
Introduction

    • Infinite Monkey Theorem
    • Hadoop
    • Million Monkeys Algorithm
    • Business Case




4
Infinite Monkey Theorem




5
Exponential Growth (aka Big Data)


     Odds of finding a group of characters is 1 in 26 raised to the
     power of the number of contiguous characters

     Contiguous Characters    Combinations
               8                  208,827,064,576
               9                5,429,503,678,976
              10              141,167,095,653,376
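
The figures in this table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming a 26-letter alphabet as the slide states; the names `ALPHABET_SIZE` and `combinations` are made up for illustration.

```python
# Assumption: a 26-letter alphabet, as stated on the slide.
ALPHABET_SIZE = 26


def combinations(contiguous_chars: int) -> int:
    """Number of distinct groups of the given length: 26 ** length."""
    return ALPHABET_SIZE ** contiguous_chars


# Print the same three rows the slide shows.
for n in (8, 9, 10):
    print(f"{n:>2}  {combinations(n):>22,}")
```

Each additional character multiplies the search space by 26, which is why the problem is compute-heavy even though the input text is small.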




6
Hadoop

    •   Apache Project
    •   Reliable, Scalable, Distributed Computing
    •   Software Framework
    •   MapReduce
    •   Distributed File System (HDFS)
    •   Other projects

7
Map
    Create or process the input data




8
Reduce
    Process data from Map into something usable
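
The Map and Reduce roles described on these two slides can be sketched in plain Python, without Hadoop. This is a hypothetical illustration, not the presenter's implementation: the sample text, the two-character group length, and the names `map_monkey`, `shuffle`, `reduce_matches` and `run` are all assumptions chosen to keep the example small.

```python
import random
import string
from collections import defaultdict

# Hypothetical sample text (lowercase letters only) and a tiny group
# length so that random matches are likely in a short run.
TARGET_TEXT = "tobeornottobethatisthequestion"
GROUP_LEN = 2


def map_monkey(seed, attempts):
    """Map: create the input data. One virtual monkey types random
    groups and emits (group, 1) for each group found in the text."""
    rng = random.Random(seed)  # seeded so every run is repeatable
    for _ in range(attempts):
        group = "".join(
            rng.choice(string.ascii_lowercase) for _ in range(GROUP_LEN)
        )
        if group in TARGET_TEXT:
            yield group, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_matches(group, counts):
    """Reduce: turn Map output into something usable, a hit count."""
    return group, sum(counts)


def run(monkeys=4, attempts=2000):
    """Run every mapper, group the output, then reduce each group."""
    pairs = (
        pair
        for seed in range(monkeys)
        for pair in map_monkey(seed, attempts)
    )
    return dict(reduce_matches(k, v) for k, v in shuffle(pairs).items())
```

Calling `run()` returns a dictionary that maps each matched group to the number of times the monkeys typed it. In a real cluster each call to `map_monkey` would run as a separate task on a different node.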




9
Data Flow




10
Million Monkeys Algorithm




11
Business Case




12
Hadoop Scalability
    [Chart: Percent of Linear Scalability. Y-axis: Percent (0 to 100). X-axis: Nodes (1 to 20). Series: RDBMS, Hadoop. RDBMS = Relational Database]
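
One common way to compute a "percent of linear scalability" figure is to divide the measured speedup by the number of nodes. The sketch below assumes that definition; the slide does not say how its values were derived, and the function name and sample timings are hypothetical.

```python
def percent_of_linear(single_node_time, cluster_time, nodes):
    """Measured speedup as a percentage of the ideal (linear) speedup.

    Assumed definition: speedup = single_node_time / cluster_time,
    and the ideal speedup on N nodes is N.
    """
    speedup = single_node_time / cluster_time
    return 100.0 * speedup / nodes


# Hypothetical timings in hours, for illustration only.
print(percent_of_linear(100.0, 5.0, 20))   # job ran 20x faster on 20 nodes
print(percent_of_linear(100.0, 10.0, 20))  # job ran 10x faster on 20 nodes
```

A system that stays near 100 as nodes are added is scaling linearly; a falling value means each extra node contributes less.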

13
Business Value of Scalability

    • Scaling does not require massive re-engineering and complete rewrites of code
    • Adding more computers to the cluster gets a predictable increase in computational power and storage

        SAVE                         SAVE




14
Going Viral (and taking over the world)


     • Covered internationally in BBC, Wall Street Journal, Wired and Slashdot
     • 26,000 unique visits from 119 countries in one day




15
Next Steps
     •   Books
          •   Hadoop: The Definitive Guide - Tom White
          •   Hadoop Operations - Eric Sammer
     •   Cloudera Training
          •   Developer, Admin, Hive and Pig, HBase, Essentials
     •   CDH
          •   Cloudera's Apache Distribution Including Hadoop
          •   Open Source
          •   VM Image

16
Conclusion

     • MapReduce breaks up problem efficiently
     • No code changes to scale
     • Incredible scalability
     • Enables previously impossible tasks




17


Editor's Notes

  1. Interesting statistical question. Thought about since Aristotle. Randomness + Resources + Time = Anything Possible. No real monkeys – need virtual monkeys.
  2. Shakespeare lazy. Heavily influenced English literature. Big Data isn’t always a huge file. It can be high computation.
  3. This is not a map of MT and ID. 1 to 20 node testing. Keep efficiency up. RDBMS efficiency in gutter.
  4. Engineers not spending time coding to scale. Busy adding new features. No code changes for scaling. Took 1.5 months on one computer and 3.5 days on 20 nodes. Spending on new computers gives a consistent, linear increase. Compare spending on RDBMS and Hadoop.