Distributed Computing the Google way
     An introduction to Apache Hadoop
3 million images
are uploaded every day.




…enough images to fill a
375,000-page photo album.
Over 210 billion emails
                    are sent out daily.




…which is more than a year’s worth
                of letter mail in the US.
Bloggers post 900,000 new
articles every day.




  Enough posts to fill the
New York Times for 19 years!
43,339 TB are sent across all
mobile phones globally every day.




That is enough to fill...

• 1.7 million Blu-rays
• 9.2 million DVDs
• 63.9 trillion 3.5” diskettes
700,000 new members sign up
on Facebook every day.




          It’s the approximate population of Guyana.
Agenda



1   Introduction
2   MapReduce
3   Apache Hadoop
4   RDBMS & MapReduce
5   Questions & Discussion
Eduard Hildebrandt
    Consultant, Architect, Coach
                     Freelancer




                  +49 160 6307253
       mail@eduard-hildebrandt.de
  http://www.eduard-hildebrandt.de
Why should I care?
It’s not just Google!



New York Stock Exchange: 1 TB of trade data per day

Internet Archive (www.archive.org): growing by 20 TB per month

Large Hadron Collider (Switzerland): producing 15 PB per year
It’s a growing job market!
It may be the future of distributed computing!

Think about…

• GPS trackers
• RFID
• genome analysis
• medical monitors

The amount of data we produce will rise from year to year!
It’s about performance!



BEFORE
Development: 2-3 Weeks
Runtime: 26 days




AFTER
Development: 2-3 Days
Runtime: 20 minutes
Grid computing
                 focus on: distributing workload




• one SAN drive, many compute nodes
• works well for small data sets and long processing times
• examples: SETI@home, Folding@home
Problem: Sharing data is slow!
  Google processed 400 PB per month in 2007 with an
average job size of 180 GB. It takes ~ 45 minutes to read a
                 180 GB file sequentially.
Modern approach
               focus on: distributing the data




• stores data locally
• parallel read / write
     1 HDD → ~75 MB/s
     1,000 HDDs → ~75,000 MB/s
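The throughput claim can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it takes the slide's 75 MB/s and 180 GB figures as given and assumes reads scale linearly across drives.

```java
public class ReadTime {
    // Seconds needed to read `gigabytes` of data when it is striped
    // evenly across `drives` disks, each reading at `mbPerSec` MB/s.
    static double readSeconds(double gigabytes, double mbPerSec, int drives) {
        double totalMb = gigabytes * 1024;     // GB -> MB
        return totalMb / (mbPerSec * drives);  // parallel reads scale linearly
    }

    public static void main(String[] args) {
        // One drive: roughly the ~45 minutes quoted above.
        System.out.printf("1 drive:     %.0f s (~%.0f min)%n",
                readSeconds(180, 75, 1), readSeconds(180, 75, 1) / 60);
        // 1,000 drives: the same file in a few seconds.
        System.out.printf("1000 drives: %.1f s%n", readSeconds(180, 75, 1000));
    }
}
```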
The MAP and REDUCE algorithm


Map                       Group                     Reduce

Do  1                     As  1   As  1             as  2
As  1                     Do  1   Do  1             do  2
I   1                     I   1   I   1             i   2
Say 1                     Not 1                     not 1
Not 1                     Say 1                     say 1
As  1
I   1
Do  1




  It‘s really map – group – reduce!
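The three phases can be sketched in plain Java, without any Hadoop machinery. The class and method names here are purely illustrative:

```java
import java.util.*;
import java.util.stream.*;

public class MapGroupReduce {
    // Counts words by running the three phases explicitly.
    static Map<String, Integer> wordCount(String text) {
        // MAP: emit a (word, 1) pair for every token.
        List<Map.Entry<String, Integer>> pairs =
                Arrays.stream(text.toLowerCase().split("\\s+"))
                      .map(w -> Map.entry(w, 1))
                      .collect(Collectors.toList());

        // GROUP: collect all values that share the same key.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // REDUCE: sum each key's values.
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("Do As I Say Not As I Do"));
        // {as=2, do=2, i=2, not=1, say=1}
    }
}
```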
Implementation of the MAP algorithm


public static class MapClass extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {

        // Emit (word, 1) for every token in the input line.
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}




                 Could it be even simpler?
Implementation of the REDUCE algorithm



public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {

        // Sum all the 1s emitted for this word.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}




                                                     Just REDUCE it!
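The mapper and reducer above are wired together by a job driver. A minimal sketch using the old `org.apache.hadoop.mapred` API these slides use; the enclosing `WordCount` class, the input/output paths in `args`, and the combiner setting are assumptions, and the usual Hadoop imports are omitted:

```java
public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);   // WordCount: assumed enclosing class
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);           // reducer doubles as a local combiner
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);                        // submit and wait for completion
}
```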
Apache Hadoop




        http://hadoop.apache.org/



Hadoop is an open-source Java framework for
 parallel processing of large data sets on
      clusters of commodity hardware.
Hadoop History


      02/2003 first MapReduce library @ Google
             10/2003 GFS paper
                        12/2004 MapReduce paper
                                  07/2005 Nutch uses MapReduce
                                        02/2006 Hadoop moves out of Nutch




      2003       2004      2005        2006      2007      2008      2009     2010


04/2007 Yahoo! running Hadoop on 10,000 nodes

    01/2008 Hadoop becomes an Apache top-level project

                 07/2008 Hadoop wins the terasort benchmark

                                                  07/2010 this presentation
Who is using Apache Hadoop?
“Failure is the defining difference between
   distributed and local programming.”
                 -- Ken Arnold, CORBA designer
mean time between failures of an HDD: 1,200,000 hours




If your cluster has 10,000 hard drives,
then you have a hard-drive crash every 5 days on average.
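That failure rate follows directly from the MTBF. A back-of-the-envelope sketch, assuming independent drives with a constant failure rate:

```java
public class FailureRate {
    // With `drives` disks, each with the given MTBF in hours, the expected
    // time between failures anywhere in the cluster shrinks proportionally.
    static double hoursBetweenClusterFailures(double mtbfHours, int drives) {
        return mtbfHours / drives;
    }

    public static void main(String[] args) {
        double h = hoursBetweenClusterFailures(1_200_000, 10_000);
        System.out.printf("one failure every %.0f hours = %.0f days%n", h, h / 24);
        // one failure every 120 hours = 5 days
    }
}
```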
HDFS

[Diagram: HDFS architecture]

Client ── metadata ──▶ NameNode
                         sample1.txt → blocks 1, 3, 5
                         sample2.txt → blocks 2, 4

Client ── read/write ──▶ DataNodes

Rack 1: DataNode 1 │ DataNode 2 │ DataNode 3
Rack 2: DataNode 4 │ DataNode 5 │ DataNode 6

Each block is replicated across several DataNodes on different racks;
the NameNode keeps only the metadata (which blocks make up which file).
[Diagram: a file is divided into splits; each split is read block by
block and processed by its own map task. The intermediate (word, 1)
pairs are grouped by key and reduced to the final counts:
as 2, do 2, i 2, not 1, say 1.]
                   How does it fit together?
Hadoop architecture



[Diagram: Hadoop job flow]

1. The client copies the input files into HDFS (NameNode / DataNodes).
2. The client submits the job to the job queue.
3. TaskTrackers send heartbeats to signal that they are alive.
4. The job is initialized.
5. TaskTrackers read the input files from the DataNodes.
6. TaskTrackers run the MapReduce tasks.
7. The result is saved back to HDFS.
Reduce it to the max!




Performance improves as you scale your Hadoop system out
1  Initial public launch
   Move from local workstation to a server.

2  Service becomes more popular
   Cache common queries. Reads are no longer strongly ACID.

3  Service continues to grow in popularity
   Scale the DB server vertically by buying a costly server.

4  New features increase query complexity
   Denormalize your data to reduce joins.

5  Rising popularity swamps the server
   Stop doing any server-side computation.

6  Some queries are still too slow
   Periodically prematerialize the most complex queries.

7  Reads are OK, but writes are getting slower and slower
   Drop secondary indexes and triggers.
How can we solve this scaling problem?
Join


page_view                          user                        pv_users
pageid  userid  time               userid  age  gender         pageid  age
  1      111    10:18:21      X     111    22   female    =      1     22
  2      111    10:19:53            222    33   male             2     22
  1      222    11:05:12                                         1     33
SQL:
  INSERT INTO TABLE pv_users
  SELECT pv.pageid, u.age
  FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Join with MapReduce

         page_view
pageid userid        time
                                  key   value              key   value
  1       111    10:18:21         111   <1,1>              111   <1,1>
  2       111    10:19:53         111   <1,2>              111   <1,2>
  1       222    11:05:12         222   <1,1>              111   <2,22>
                            map                  shuffle                  reduce
         user
userid   age    gender            key    value             key    value
 111      22    female            111   <2,22>             222   <1,1>
 222      33    male              222   <2,33>             222   <2,33>
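The tagging scheme above can be sketched in plain Java. This is a simplified reduce-side join: the table tags 1 (page_view) and 2 (user) follow the slide, while the class and method names are illustrative:

```java
import java.util.*;

public class ReduceSideJoin {
    // Reduce step for one userid: combine every page_view value (tag 1)
    // with the single user value (tag 2), as in the diagram above.
    static List<int[]> joinOneKey(List<int[]> taggedValues) {
        List<Integer> pageids = new ArrayList<>();
        int age = -1;
        for (int[] v : taggedValues) {
            if (v[0] == 1) pageids.add(v[1]);   // <1, pageid> from page_view
            else           age = v[1];          // <2, age>    from user
        }
        List<int[]> joined = new ArrayList<>();
        for (int pageid : pageids) joined.add(new int[]{pageid, age});
        return joined;                           // rows of pv_users: (pageid, age)
    }

    public static void main(String[] args) {
        // userid 111 after the shuffle: <1,1>, <1,2>, <2,22>
        List<int[]> rows = joinOneKey(List.of(
                new int[]{1, 1}, new int[]{1, 2}, new int[]{2, 22}));
        for (int[] r : rows)
            System.out.println("pageid=" + r[0] + " age=" + r[1]);
        // pageid=1 age=22
        // pageid=2 age=22
    }
}
```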
HBase
   HBase is an open-source, distributed, versioned,
 column-oriented store modeled after Google's Bigtable.

• No real indexes

• Automatic partitioning

• Scale linearly and automatically with new nodes

• Commodity hardware

• Fault tolerant

• Batch processing
RDBMS vs. MapReduce

             RDBMS                   MapReduce

Data size    gigabytes               petabytes

Access       interactive and batch   batch

Updates      read and write          write once,
             many times              read many times

Structure    static schema           dynamic schema

Integrity    high                    low

Scaling      nonlinear               linear
Use the right tool!

MapReduce is a screwdriver.
good for:
• unstructured data
• data-intensive computation
• batch operations
• horizontal scaling

Databases are hammers.
good for:
• structured data
• transactions
• interactive requests
• vertical scaling
Where is the bridge?



user profiles
                        ?                 log files




           RDBMS                 Hadoop
Sqoop
SQL-to-Hadoop database import tool
  user profiles                                     log files



                          Sqoop




             RDBMS                        Hadoop




$ sqoop --connect jdbc:mysql://database.example.com/users \
    --username aaron --password 12345 --all-tables \
    --warehouse-dir /common/warehouse
Sqoop
SQL-to-Hadoop database import tool
  user profiles                           log files



                     Sqoop




             RDBMS               Hadoop




Sqoop inspects the DB schema and generates Java classes for the imported records.
What is common across
              Hadoop-able problems?

• nature of the data
    • complex data
    • multiple data sources
    • lots of it

• nature of the analysis
    • batch processing
    • parallel execution
    • spread data over nodes in a cluster
    • take computation to the data
TOP 10 Hadoop-able problems

1   modeling true risk        6   network data analysis

2   customer churn analysis   7   fraud detection

3   recommendation engine     8   trade surveillance

4   ad targeting              9   search quality

5   point of sale analysis    10 data “sandbox”
“Appetite comes with eating.”
                 -- François Rabelais
Case Study 1



Listening data:
user id   track id   scrobble   radio   skip
  123       456         0        1       1
  451       789         1        0       1
  241       234         0        1       1




Hadoop jobs for:
• number of unique listeners
• number of times the track was:
    • scrobbled
    • listened to on the radio
    • listened to in total
    • skipped on the radio
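These counters can be sketched as a single pass over the listening data in plain Java. The class name, method, and sample rows are illustrative, not the case study's actual jobs:

```java
import java.util.*;

public class TrackStats {
    // One listening record: (userId, trackId, scrobble, radio, skip),
    // mirroring the columns of the table above.
    record Play(int userId, int trackId, int scrobble, int radio, int skip) {}

    // Per-track counters the slide lists, computed in one pass:
    // [unique listeners, scrobbled, radio plays, total plays, skips].
    static int[] statsForTrack(List<Play> plays, int trackId) {
        Set<Integer> listeners = new HashSet<>();
        int scrobbled = 0, radio = 0, skipped = 0, total = 0;
        for (Play p : plays) {
            if (p.trackId() != trackId) continue;
            listeners.add(p.userId());
            scrobbled += p.scrobble();
            radio     += p.radio();
            skipped   += p.skip();
            total++;
        }
        return new int[]{listeners.size(), scrobbled, radio, total, skipped};
    }

    public static void main(String[] args) {
        List<Play> plays = List.of(
                new Play(123, 456, 0, 1, 1),
                new Play(451, 456, 1, 0, 0),
                new Play(123, 456, 0, 1, 1));
        // [unique listeners, scrobbled, radio, total, skipped]
        System.out.println(Arrays.toString(statsForTrack(plays, 456)));
        // [2, 1, 2, 3, 2]
    }
}
```

In Hadoop, the same logic becomes a mapper keyed on trackId and a reducer that accumulates these counters.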
Case Study 2



User data:
• 12 TB of compressed data added per day
• 800 TB of compressed data scanned per day
• 25,000 map-reduce jobs per day
• 65 million files in HDFS
• 30,000 simultaneous clients to the
  HDFS NameNode




Hadoop jobs for:
• friend recommendations
• Insights for the Facebook Advertisers
Enterprise

[Diagram: Hadoop in the enterprise]

1. High-volume data (XML, CSV, EDI, logs, objects, SQL, TXT, JSON,
   binary) from science, industry, and legacy systems flows into the
   Hadoop subsystem running on a server cloud.
2. MapReduce algorithms process the data.
3. Results are imported into an RDBMS and consumed through the internet,
   mashups, dashboards, ERP, SOA, and BI tools.
That was just the tip of the iceberg!




Core · HDFS · HBase · Hive · Pig · Thrift · Avro · Nutch · ZooKeeper ·
Chukwa · Jaql · Cassandra · Dumbo · Solr · KosmosFS · Cascading ·
Mahout · Scribe · Ganglia · Katta · Hypertable
Hadoop is a good choice for:

• analyzing log files
• sorting large amounts of data
• search engines
• contextual ads
• image analysis
• protein folding
• classification
Hadoop is a poor choice for:
• computing pi to 1,000,000 digits
• calculating Fibonacci sequences
• a general RDBMS replacement
Final thoughts
    Data-intensive computation is a fundamentally different challenge
1
    from CPU-intensive computation over small datasets.

2   New ways of thinking about problems are needed.

    Failure is acceptable and inevitable.
3
    Go cheap! Go distributed!

4   RDBMS is not dead!
    It just got new friends and helpers.

5   Give Hadoop a chance!
Time for questions!
Let’s get connected!



http://www.soa-at-work.com
http://www.eduard-hildebrandt.de

Slack Application Development 101 Slides
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Distributed computing the Google way

  • 1. Distributed Computing the Google way An introduction to Apache Hadoop
  • 2. 3 million images are uploaded every day. …enough images to fill a 375.000 page photo album.
  • 3. Over 210 billion emails are sent out daily. …which is more than a year’s worth of letter mail in the US.
  • 4. Bloggers post 900.000 new articles every day. Enough posts to fill the New York Times for 19 years!
  • 5. 43.339 TB are sent across all mobile phones globally every day. That is enough to fill... 1.7 million Blu-rays, 9.2 million DVDs, or 63.9 trillion 3.5” diskettes.
  • 6. 700.000 new members are signing up on Facebook every day. It’s the approximate population of Guyana.
  • 7.
  • 8. Agenda 1 Introduction. 2 MapReduce. 3 Apache Hadoop. 4 RDBMS & MapReduce. 5 Questions & Discussion.
  • 9. Eduard Hildebrandt Consultant, Architect, Coach Freelancer +49 160 6307253 mail@eduard-hildebrandt.de http://www.eduard-hildebrandt.de
  • 10. Why should I care?
  • 11. It’s not just Google! New York Stock Exchange: 1 TB trade data per day • Internet Archive (www.archive.org): growing by 20 TB per month • Hadron Collider (Switzerland): producing 15 PB per year
  • 12. It’s a growing job market!
  • 13. It may be the future of distributed computing! Think about… GPS trackers, genome analysis, RFID, medical monitors. The amount of data we produce will rise from year to year!
  • 14. It’s about performance! BEFORE Development: 2-3 Weeks Runtime: 26 days AFTER Development: 2-3 Days Runtime: 20 minutes
  • 15. Grid computing focus on: distributing workload • one SAN drive, many compute nodes • works well for small data sets and long processing time • examples: SETI@home, Folding@home
  • 16. Problem: Sharing data is slow! Google processed 400 PB per month in 2007 with an average job size of 180 GB. It takes ~ 45 minutes to read a 180 GB file sequentially.
  • 17. Modern approach focus on: distributing the data • stores data locally • parallel read / write • 1 HDD → ~75 MB/sec • 1000 HDD → ~75000 MB/sec
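The throughput figures on these two slides can be sanity-checked with a quick back-of-the-envelope calculation. This is only a sketch: the 75 MB/sec per disk and the 180 GB average job size come from the slides, while the class and method names are illustrative.

```java
// Back-of-the-envelope check of the slides' numbers: reading a 180 GB job
// input at 75 MB/sec, sequentially on one disk vs. spread over many disks.
public class ReadTime {
    static final double MB_PER_SEC_PER_DISK = 75.0;
    static final double JOB_SIZE_MB = 180.0 * 1024;   // 180 GB expressed in MB

    // Seconds needed to read the job input when striped over `disks` drives.
    static double readSeconds(int disks) {
        return JOB_SIZE_MB / (MB_PER_SEC_PER_DISK * disks);
    }

    public static void main(String[] args) {
        System.out.printf("1 disk:     %.0f minutes%n", readSeconds(1) / 60);
        System.out.printf("1000 disks: %.1f seconds%n", readSeconds(1000));
    }
}
```

One disk comes out at roughly 41 minutes, which matches the "~45 minutes" on the previous slide; a thousand disks reading in parallel finish the same input in a few seconds.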
  • 18. The MAP and REDUCE algorithm [diagram: a word-count example — Map emits a (word, 1) pair for every word, the pairs are grouped by word, and Reduce sums the counts per word] It‘s really map – group – reduce!
  • 19. Implementation of the MAP algorithm

        public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value,
              OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              output.collect(word, one);
            }
          }
        }

    Could it be even simpler?
  • 20. Implementation of the REDUCE algorithm

        public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

          public void reduce(Text key, Iterator<IntWritable> values,
              OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }

    Just REDUCE it!
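The same map → group → reduce flow from slides 18–20 can be simulated in plain Java without any Hadoop dependency, which makes the three phases easy to step through locally. This is a hypothetical sketch of the idea, not the Hadoop API; the class name and sample input are illustrative.

```java
import java.util.*;

// Plain-Java simulation of map -> group -> reduce word counting.
public class WordCountSketch {
    static Map<String, Integer> wordCount(List<String> lines) {
        // MAP: emit a (word, 1) pair for every token
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String token : line.toLowerCase().split("\\s+"))
                if (!token.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(token, 1));

        // GROUP: collect all emitted values under the same key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // REDUCE: sum the values for each key
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("do not do as i say", "do as i do")));
    }
}
```

In a real Hadoop job the GROUP step is the framework's shuffle and sort phase, which is why the user only ever writes the map and reduce functions.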
  • 21. Apache Hadoop http://hadoop.apache.org/ Hadoop is an open-source Java framework for parallel processing of large data sets on clusters of commodity hardware.
  • 22. Hadoop History 02/2003 first MapReduce library @ Google 10/2003 GFS paper 12/2004 MapReduce paper 07/2005 Nutch uses MapReduce 02/2006 Hadoop moves out of Nutch 04/2007 Yahoo running Hadoop on 10.000 nodes 01/2008 Hadoop becomes an Apache top-level project 07/2008 Hadoop wins the terasort benchmark 07/2010 this presentation
  • 23. Who is using Apache Hadoop?
  • 24. “Failure is the defining difference between distributed and local programming.” -- Ken Arnold, CORBA designer
  • 25. mean time between failures of a HDD: 1.200.000 hours If your cluster has 10.000 hard drives, then you have a hard drive crash every 5 days on average.
  • 26. HDFS [diagram: the NameNode holds the metadata mapping files to blocks (sample1.txt → blocks 1, 3, 5; sample2.txt → blocks 2, 4); clients read and write the blocks on DataNodes 1–6 spread over two racks, with each block replicated across nodes and racks]
  • 27. How does it fit together? [diagram: a file is split into blocks; each block goes through map, the (word, 1) pairs are grouped by word, and reduce sums the counts]
  • 28. Hadoop architecture [diagram: Client, NameNode, DataNodes, TaskTrackers and a job queue — 1. the client copies files into HDFS, 2. submits the job, 3. TaskTrackers send heartbeats, 4. the job is initialized from the job queue, 5. TaskTrackers read the files, 6. run MapReduce, 7. save the result]
  • 29. Reduce it to the max! Performance improvement when scaling your Hadoop system
  • 30. 1 Initial public launch — move from local workstation to a server. 2 Service becomes more popular — cache common queries; reads are no longer strongly ACID. 3 Service continues to grow in popularity — scale the DB server vertically by buying a costly server. 4 New features increase query complexity — denormalize your data to reduce joins. 5 Rising popularity swamps the server — stop doing any server-side computation. 6 Some queries are still too slow — periodically prematerialize the most complex queries. 7 Reads are OK, but writes are getting slower and slower — drop secondary indexes and triggers.
  • 31. How can we solve this scaling problem?
  • 32. Join page_view ⋈ user = pv_users — page_view(pageid, userid, time): (1, 111, 10:18:21), (2, 111, 10:19:53), (1, 222, 11:05:12); user(userid, age, gender): (111, 22, female), (222, 33, male); result pv_users(pageid, age): (1, 22), (2, 22), (1, 33). SQL: INSERT INTO TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid);
  • 33. Join with MapReduce — map: each page_view row is emitted as (userid, <1, pageid>) and each user row as (userid, <2, age>): 111 → <1,1>, <1,2>; 222 → <1,1>; 111 → <2,22>; 222 → <2,33>. shuffle: all values for the same userid are grouped together. reduce: for each userid, the user value <2, age> is paired with every page_view value <1, pageid> to produce the joined (pageid, age) rows.
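The reduce-side join on this slide can also be simulated in plain Java to make the tag-group-pair logic concrete. This is a hedged sketch, not Hadoop code: the class, record, and method names are illustrative, and the tags (1 = page_view, 2 = user) follow the slide's key/value notation.

```java
import java.util.*;

// Plain-Java simulation of a reduce-side join: both tables are mapped to
// (userid, tagged value) pairs, grouped by userid, and the reducer pairs
// every page_view value with the user value to produce (pageid, age) rows.
public class ReduceSideJoinSketch {
    // tag 1 = page_view row (value = pageid), tag 2 = user row (value = age)
    record Tagged(int tag, int value) {}

    static List<int[]> join(Map<Integer, Integer> userAge,   // userid -> age
                            List<int[]> pageViews) {         // {pageid, userid}
        // MAP + GROUP: gather all tagged values per userid
        Map<Integer, List<Tagged>> grouped = new TreeMap<>();
        for (int[] pv : pageViews)
            grouped.computeIfAbsent(pv[1], k -> new ArrayList<>()).add(new Tagged(1, pv[0]));
        userAge.forEach((uid, age) ->
            grouped.computeIfAbsent(uid, k -> new ArrayList<>()).add(new Tagged(2, age)));

        // REDUCE: cross each userid's page views with its user record
        List<int[]> out = new ArrayList<>();
        for (List<Tagged> values : grouped.values()) {
            Integer age = null;
            List<Integer> pageids = new ArrayList<>();
            for (Tagged t : values) {
                if (t.tag() == 2) age = t.value(); else pageids.add(t.value());
            }
            if (age != null)
                for (int pid : pageids) out.add(new int[]{pid, age});
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> pvUsers = join(
            Map.of(111, 22, 222, 33),
            List.of(new int[]{1, 111}, new int[]{2, 111}, new int[]{1, 222}));
        for (int[] row : pvUsers) System.out.println(row[0] + ", " + row[1]);
    }
}
```

Run against the slide's sample data, this yields the three pv_users rows (1, 22), (2, 22), and (1, 33). In a real Hadoop job, the grouping step is again the framework's shuffle phase.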
  • 34. HBase HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable. • No real indexes • Automatic partitioning • Scales linearly and automatically with new nodes • Commodity hardware • Fault tolerant • Batch processing
  • 35. RDBMS vs. MapReduce — Data size: gigabytes vs. petabytes. Access: interactive and batch vs. batch. Updates: read and write many times vs. write once, read many times. Structure: static schema vs. dynamic schema. Integrity: high vs. low. Scaling: nonlinear vs. linear.
  • 36. Use the right tool! MapReduce is a screwdriver — good for: • unstructured data • data intensive computation • batch operations • scaling horizontally. Databases are hammers — good for: • structured data • transactions • interactive requests • scaling vertically.
  • 37. Where is the bridge? user profiles ? log files RDBMS Hadoop
  • 38. Sqoop SQL-to-Hadoop database import tool user profiles log files Sqoop RDBMS Hadoop $ sqoop --connect jdbc:mysql://database.example.com/users --username aaron --password 12345 --all-tables --warehouse-dir /common/warehouse
  • 39. Sqoop SQL-to-Hadoop database import tool user profiles log files Sqoop RDBMS Hadoop Java classes DB schema
  • 40. What is common across Hadoop-able problems? • nature of the data • complex data • multiple data sources • lots of it • nature of the analysis • batch processing • parallel execution • spread data over nodes in a cluster • take the computation to the data
  • 41. TOP 10 Hadoop-able problems: 1 modeling true risk, 2 customer churn analysis, 3 recommendation engine, 4 ad targeting, 5 point of sale analysis, 6 network data analysis, 7 fraud detection, 8 trade surveillance, 9 search quality, 10 data “sandbox”
  • 42. “Appetite comes with eating.” -- François Rabelais
  • 43. Case Study 1 Listening data: user id track id scrobble radio skip 123 456 0 1 1 451 789 1 0 1 241 234 0 1 1 Hadoop jobs for: • number of unique listeners • number of times the track was: • scrobbled • listened to on the radio • listened to in total • skipped on the radio
  • 44. Case Study 2 User data: • 12 TB of compressed data added per day • 800 TB of compressed data scanned per day • 25,000 map-reduce jobs per day • 65 millions files in HDFS • 30,000 simultaneous clients to the HDFS NameNode Hadoop jobs for: • friend recommendations • Insights for the Facebook Advertisers
  • 45. [diagram: enterprise data from science, industry and legacy systems (XML, CSV, EDI, logs, SQL, TXT, JSON, binary, server cloud, ERP/SOA/BI) flows into the Hadoop subsystem — 1. import high-volume data, 2. run the MapReduce algorithm, 3. consume the results in dashboards, maps, objects and the RDBMS]
  • 46. That was just the tip of the iceberg! Core HDFS HBase Hive Pig Thrift Avro Nutch ZooKeeper Chukwa Jaql Cassandra Dumbo Solr KosmosFS Cascading Mahout Scribe Ganglia Katta Hypertable
  • 47. Hadoop is a good choice for: • analyzing log files • sorting large amounts of data • search engines • contextual ads • image analysis • protein folding • classification
  • 48. Hadoop is a poor choice for: • computing PI to 1.000.000 digits • calculating Fibonacci sequences • a general RDBMS replacement
  • 49. Final thoughts 1 Data intensive computation is a fundamentally different challenge than doing CPU intensive computation over a small data set. 2 New ways of thinking about problems are needed. 3 Failure is acceptable and inevitable. Go cheap! Go distributed! 4 RDBMS is not dead! It just got new friends and helpers. 5 Give Hadoop a chance!