Hadoop
                 divide and conquer petabyte-scale data




© Matthew McCullough, Ambient Ideas, LLC
Metadata

Matthew McCullough
  Ambient Ideas, LLC
  matthewm@ambientideas.com
  http://ambientideas.com/blog
  @matthewmccull
Code
  http://github.com/matthewmccullough/hadoop-intro
http://delicious.com/matthew.mccullough/hadoop
"I use Hadoop often. That's the sound my TiVo makes every time I skip a commercial."

   –Brian Goetz, author of Java Concurrency in Practice
Why?
The computer you are using right now
may very well have the fastest GHz processor
you'll ever own
Scale up?
Scale out
Talk Roadmap

History
MapReduce
Filesystem
DSLs
History
Doug Cutting
open source
Lucene → Nutch → Hadoop
Lucene → Mahout
Today
0.21.0 current version

Dozens of companies contributing

Hundreds of companies using
Virtual Machine
VM Sources

 Yahoo
  True to the OSS distribution
 Cloudera
  Desktop tools
 Both VMWare based
Set up the Environment


      • Launch “cloudera-training 0.3.3” VMWare instance
      • Open a terminal window
Lab
Log Files



      •Attach tail to all the Hadoop logs
Lab
processing / Data

Processing Framework      Data Storage
MapReduce                 HDFS
Hive QL                   Hive
Chukwa                    HBase
Flume                     Avro
ZooKeeper

Hybrid: Pig
Hadoop
Components
Hadoop Components
   Tool         Purpose
   Common       MapReduce
   HDFS         Filesystem
   Pig          Analyst MapReduce language
   HBase        Column-oriented data storage
   Hive         SQL-like query language over Hadoop data
   ZooKeeper    Workflow & distributed transactions
   Chukwa       Log file processing
the Players



Chukwa · ZooKeeper · Common · HBase · Hive · HDFS
Motivations
Original Goals


 Web Indexer
 Yahoo Search Engine
Pre-Hadoop
Shortcomings

 Search engine update frequency
 Storage costs
 Expense of durability
 Ad-hoc queries
Anti-Patterns Cured


 RAM-heavy RDBMS boxes
 Sharding
 Archiving
 Ever-smaller-range SQL queries
1 Gigabyte
1 Terabyte
1 Petabyte
16 Petabytes
Near-Linear Hardware Scalability
Applications

 Protein folding
 pharmaceutical research

 Search Engine Indexing
 walking billions of web pages

 Product Recommendations
 based on other customer purchases

 Sorting
 terabytes to petabytes in size

 Classification
 government intelligence
Contextual Ads
SELECT reccProd.name, reccProd.id
 FROM products reccProd
 WHERE reccProd.id IN

  (SELECT p.productId
    FROM purchases p
    WHERE p.customerId IN
      (SELECT p2.customerId
        FROM purchases p2
        WHERE p2.productId = thisProd))

 LIMIT 5
30%
of Amazon sales are from
recommendations
ACID


 ATOMICITY

 CONSISTENCY

 ISOLATION

 DURABILITY
CAP


 Consistency
 Availability
 Partition Tolerance
MapReduce
MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com

Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis...

To appear in OSDI 2004
“A programming model and implementation for processing and generating large data sets”
MapReduce
a word counting conceptual example
The Goal


Provide the occurrence count
of each distinct word across
all documents
Raw Data
             a folder of documents

mydoc1.txt     mydoc2.txt            mydoc3.txt
Map
break documents into words
Shuffle
physically group (relocate) by key
Reduce
count word occurrences
Reduce Again
  sort occurrences alphabetically
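
Those four steps reduce to just two user-supplied functions once expressed in code. Below is a minimal word-count sketch, written against the same old-style org.apache.hadoop.mapred API as the Grep listing that follows; the class itself is illustrative, not taken from the deck.

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map: break each document line into words, emitting (word, 1).
  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        output.collect(word, ONE);   // shuffled and grouped by word
      }
    }
  }

  // Reduce: the shuffle has grouped pairs by word; sum the ones.
  public static class ReduceClass extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      output.collect(key, new IntWritable(sum));
    }
  }
}

The shuffle between map and reduce is supplied by the framework; only the emit-ones and sum-the-ones logic is user code.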
Grep.java

package org.apache.hadoop.examples;

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
  private Grep() {}                               // singleton

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    Path tempDir =
      new Path("grep-temp-"+
          Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    JobConf grepJob = new JobConf(getConf(), Grep.class);

    try {
      // First job: count occurrences of the regex matches.
      grepJob.setJobName("grep-search");

      FileInputFormat.setInputPaths(grepJob, args[0]);

      grepJob.setMapperClass(RegexMapper.class);
      grepJob.set("mapred.mapper.regex", args[2]);
      if (args.length == 4)
        grepJob.set("mapred.mapper.regex.group", args[3]);

      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);

      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormat(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      JobClient.runJob(grepJob);

      // Second job: invert the (match, count) pairs and sort.
      JobConf sortJob = new JobConf(Grep.class);
      sortJob.setJobName("grep-sort");

      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);

      sortJob.setMapperClass(InverseMapper.class);

      sortJob.setNumReduceTasks(1);                 // write a single file
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      sortJob.setOutputKeyComparatorClass           // sort by decreasing freq
        (LongWritable.DecreasingComparator.class);

      JobClient.runJob(sortJob);
    }
    finally {
      FileSystem.get(grepJob).delete(tempDir, true);
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }
}
RegexMapper.java
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.mapred.lib;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
/** A {@link Mapper} that extracts text matching a regular expression. */
public class RegexMapper<K> extends MapReduceBase
    implements Mapper<K, Text, Text, LongWritable> {

  private Pattern pattern;
  private int group;

  public void configure(JobConf job) {
    pattern = Pattern.compile(job.get("mapred.mapper.regex"));
    group = job.getInt("mapred.mapper.regex.group", 0);
  }

  public void map(K key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter)
    throws IOException {
    String text = value.toString();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      output.collect(new Text(matcher.group(group)), new LongWritable(1));
    }
  }

}
Have Code,
  Will Travel


 Code travels to the data
 Opposite of traditional systems
Filesystem
“scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware”
HDFS Basics

 Open Source implementation of the
 Google File System (GFS)
 Replicated data store
 Stored in 64MB blocks
HDFS

 Rack location aware
 Configurable redundancy factor
 Self-healing
 Looks almost like *NIX filesystem
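
Because HDFS looks almost like a *NIX filesystem, its Java API reads like one too. A minimal sketch of the upload-and-list operations the labs below perform by hand, assuming a cluster reachable through the usual core-site.xml configuration; the paths are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTour {
  public static void main(String[] args) throws Exception {
    // Picks up the default filesystem from the cluster configuration.
    FileSystem fs = FileSystem.get(new Configuration());

    // Upload a file, then list the directory -- the same two
    // operations as the labs that follow.
    fs.copyFromLocalFile(new Path("/tmp/mydoc1.txt"),
                         new Path("/user/training/mydoc1.txt"));
    for (FileStatus stat : fs.listStatus(new Path("/user/training"))) {
      System.out.println(stat.getPath() + "  " + stat.getLen() + " bytes");
    }
  }
}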
Why HDFS?


 Random reads
 Parallel reads
 Redundancy
Data Overload
Test HDFS


      •   Upload a file
      •   List directories
Lab
HDFS Upload


      •   Show contents of file in HDFS
      •   Show vocabulary of HDFS
Lab
HDFS Distcp


      •   Copy file to other nodes
      •   MapReduce job starts
Lab
HDFS Challenges

 Writes are re-writes today
  Append is planned
 Block size, alignment
 Small file inefficiencies
 NameNode SPOF
FSCK



      •Check filesystem
Lab
NO SQL
Data Categories
Structured




Semi-structured




 Unstructured
NOSQL is...
Semi-structured


                   not death of the RDBMS
                   no JOINing
                   no Normalization
                   big-data tools
                   solving different issues than
                   RDBMSes
Grid Benefits
Scalable


 Data storage is pipelined
 Code travels to data
 Near linear hardware scalability
Optimization


 Preemptive execution
 Maximizes use of faster hardware
 New jobs trump P.E.
Fault Tolerant

 Configurable data redundancy
 Minimizes hardware failure impact
 Automatic job retries
 Self healing filesystem
Sproinnnng!

Bzzzt!

                       Poof!
server Funerals



No pagers go off when machines die
Report of dead machines once a week
 Clean out the carcasses
Robustness attributes
prevented from bleeding
into application code
             Data redundancy
             Node death
             Retries
             Data geography
             Parallelism
             Scalability
Node TYPES
NameNode
SecondaryNameNode
DataNode
JobTracker
TaskTracker
[Diagram: the robust primary (1) runs the NameNode and JobTracker; a hot backup (0..1) runs the Secondary NameNode and a standby JobTracker, with the NameNode journaling to NFS; commodity nodes (1..N) each run a TaskTracker and a DataNode]
NameNode


 Catalog of data blocks
  Prefers large files
  High memory consumption

 Journaled file system
  Saves state to disk
[Diagram detail: the robust primary (1) hosts the NameNode and JobTracker; each commodity node (1..N) hosts a TaskTracker and a DataNode]
Secondary NameNode

 Near-hot backup for NN
 Snapshots
  Missing journal entries
  Data loss in non-optimal setup
 Grid reconfig?
  Big IP device
[Diagram detail: NameNode and JobTracker on the robust primary (1); Secondary NameNode and standby JobTracker on the hot backup (0..1)]
NameNode, Cont’d

 Backup via SNN
  Journal to NFS with replication?
 Single point of failure
  Failover plan
  Big IP device, virtual IP
[Diagram detail: the NameNode journals to NFS; a virtual IP fronts the primary (1) and the hot backup (0..1) for failover; commodity nodes (1..N) run the TaskTracker and DataNode]
DataNode

 Anonymous
 Data block storage
 “No identity” is positive trait
 Commodity equipment
[Diagram: the robust primary (1) runs the NameNode and JobTracker; the hot backup (0..1) runs the Secondary NameNode and a standby JobTracker, journaling via NFS; commodity nodes (1..N) each run a TaskTracker and a DataNode]
JobTracker
 Singleton
 Commonly co-located with NN
 Coordinates TaskTracker
 Load balances
 Fair-share scheduling (Fair Scheduler)
 Preemptive execution
[Diagram: the JobTracker runs alongside the NameNode on the robust primary (1), with a standby on the hot backup (0..1); it coordinates the TaskTrackers on the commodity nodes (1..N)]
TaskTracker


 Runs MapReduce jobs
 Reports back to JobTracker
[Diagram: each commodity node's TaskTracker (1..N) reports back to the JobTracker on the robust primary (1); the hot backup (0..1) holds the Secondary NameNode and standby JobTracker]
Listing Nodes


      •Use Java’sa JPS tool to get list of
       nodes on box
      •sudo jps -l
Lab
Processing Nodes


 Anonymous
 “No identity” is positive trait
 Commodity equipment
Master Node

 Master is a special machine
 Use high quality hardware
 Single point of failure
 Recoverable
Key-value
  Store
HBase
HBase Basics
 Map-oriented storage
 Key value pairs
 Column families
 Stores to HDFS
 Fast
 Usable for synchronous responses
Direct Comparison



 Amazon SimpleDB
 Google BigTable (Datastore)
Competitors


 Facebook Cassandra
 LinkedIn Project Voldemort
Shortcomings

 HQL
 WHERE clause limited to key
 Other column filters are full-table scans
 No ad-hoc queries
hbase>
help

create 'mylittletable', 'mylittlecolumnfamily'
describe 'mylittletable'

put 'mylittletable', 'r2', 'mylittlecolumnfamily', 'x'

get 'mylittletable', 'r2'
scan 'mylittletable'
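
The same session can be driven from Java with the HBase client API of that era (HTable, Put, Get). A hedged sketch, assuming the table and column family created above; the empty qualifier is an assumption standing in for the shell's bare column-family syntax.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseTour {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mylittletable");

    // put 'mylittletable', 'r2', 'mylittlecolumnfamily', 'x'
    Put put = new Put(Bytes.toBytes("r2"));
    put.add(Bytes.toBytes("mylittlecolumnfamily"),
            Bytes.toBytes(""), Bytes.toBytes("x"));
    table.put(put);

    // get 'mylittletable', 'r2'
    Result row = table.get(new Get(Bytes.toBytes("r2")));
    System.out.println(row);
  }
}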
HBase


      •   Create column family
      •   Insert data
      •   Select data
Lab
DSLs
Pig
Pig Basics

 Yahoo-authored add-on
 High-level language for authoring
 data analysis programs
 Console
PIG Questions
 Ask big questions on unstructured
 data
  How many ___?
  Should we ____?
 Decide on the questions you want to
 ask long after you’ve collected the
 data.
Pig Data

999991,female,Mary,T,Hargrave,600 Quiet Valley Lane,Los Angeles,CA,90017,US,Mary.T.Hargrave@dodgit.com,
999992,male,Harold,J,Candelario,294 Ford Street,OAKLAND,CA,94607,US,Harold.J.Candelario@dodgit.com,ad2U
999993,female,Ruth,G,Carter,4890 Murphy Court,Shakopee,MN,55379,US,Ruth.G.Carter@mailinator.com,uaseu8e
999994,male,Lionel,J,Carter,2701 Irving Road,Saint Clairsville,OH,43950,US,Lionel.J.Carter@trashymail.c
999995,female,Georgia,C,Medina,4541 Cooks Mine Road,CLOVIS,NM,88101,US,Georgia.C.Medina@trashymail.com,
999996,male,Stanley,S,Cruz,1463 Jennifer Lane,Durham,NC,27703,US,Stanley.S.Cruz@pookmail.com,aehoh5rooG
999997,male,Justin,A,Delossantos,3169 Essex Court,MANCHESTER,VT,05254,US,Justin.A.Delossantos@mailinato
999998,male,Leonard,K,Baker,4672 Margaret Street,Houston,TX,77063,US,Leonard.K.Baker@trashymail.com,Aep
999999,female,Charissa,J,Thorne,2806 Cedar Street,Little Rock,AR,72211,US,Charissa.J.Thorne@trashymail.
1000000,male,Michael,L,Powell,2797 Turkey Pen Road,New York,NY,10013,US,Michael.L.Powell@mailinator.com
Pig Sample

Person = LOAD 'people.csv' using PigStorage(',');
Names = FOREACH Person GENERATE $2 AS name;
OrderedNames = ORDER Names BY name ASC;
GroupedNames = GROUP OrderedNames BY name;
NameCount = FOREACH GroupedNames
 GENERATE group, COUNT(OrderedNames);
store NameCount into 'names.out';
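
The script can also be embedded in Java through Pig's PigServer API rather than typed at the console. A sketch under that assumption; the job wiring is illustrative, and the queries mirror the script above line for line.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class RunNameCount {
  public static void main(String[] args) throws Exception {
    // MAPREDUCE mode submits to the cluster; LOCAL also works for testing.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("Person = LOAD 'people.csv' USING PigStorage(',');");
    pig.registerQuery("Names = FOREACH Person GENERATE $2 AS name;");
    pig.registerQuery("OrderedNames = ORDER Names BY name ASC;");
    pig.registerQuery("GroupedNames = GROUP OrderedNames BY name;");
    pig.registerQuery("NameCount = FOREACH GroupedNames "
        + "GENERATE group, COUNT(OrderedNames);");
    pig.store("NameCount", "names.out");
  }
}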
Pig Scripting



      •Re-column and sort data
Lab
Hive
Hive Basics
 Authored by Facebook
 SQL-like interface to data stored in HDFS
 Hive is low-level
 Hive-specific metadata
 Data warehousing
hive
  -e 'select t1.x from mylittletable t1'
hive -S
  -e 'select a.col from tab1 a'
  > dump.txt
SELECT * FROM shakespeare
WHERE freq > 100
SORT BY freq ASC
LIMIT 10;
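
Beyond the CLI, the same query can be issued from Java over Hive's JDBC driver. A sketch assuming a Hive server on the default port 10000 and the (word, freq) schema the query above implies; the driver class and URL scheme are the ones Hive shipped at the time.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcTour {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10");
    while (rs.next()) {
      // Assumes the two columns are (word STRING, freq INT).
      System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
    }
    con.close();
  }
}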
Load Hive Data



      • Load and select Hive data
Lab
Sync, Async


RDBMS SQL is realtime
Hive is primarily asynchronous
Sqoop
Sqoop


 A Cloudera utility
 Imports from RDBMS
 Outputs plaintext, SequenceFile, or Hive
sqoop --connect jdbc:mysql://database.example.com/
Data
Warehousing
Sync, Async


RDBMS is realtime
HBase is near realtime
Hive is asynchronous
Monitoring
Web Status Panels


 NameNode
 http://localhost:50070/
 JobTracker
 http://localhost:50030/
View Consoles


      •   Start job
      •   Watch consoles
      •   Browse HDFS via web interface
Lab
Configuration
     Files
[Diagram: configuration spans the robust primary (1) running NameNode and JobTracker, the hot backup (0..1) running Secondary NameNode and standby JobTracker via NFS, and the commodity nodes (1..N) running TaskTracker and DataNode]
Grid Execution
Single Node
 start-all.sh
 Needs XML config
 ssh account setup
 Multiple JVMs
 List of processes
  jps -l
Start and Stop Grid

      •cd /usr/lib/hadoop-0.20/bin
      •sudo -u hadoop ./stop-all.sh
      •sudo jps -l
      •tail
      •sudo -u hadoop ./start-all.sh
      •sudo jps -l
Lab
SSH Setup


 create key
 ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
 cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
[Diagram: start-all.sh on the robust primary (1) invokes start-dfs.sh and start-mapred.sh, launching the NameNode and JobTracker locally and a TaskTracker and DataNode on the commodity node (1..N)]
Complete Grid


 start-all.sh
 Host names, ssh setup
 Default out-of-box experience
[Diagram: start-all.sh on the robust primary (1) runs start-dfs.sh and start-mapred.sh against every commodity node (1..N), starting each node's TaskTracker and DataNode]
Grid on the Cloud
EMR Pricing
Summary
Sapir-Whorf Hypothesis... Remixed



The ability to store and process
massive data influences what you
decide to store.
Hadoop
                 divide and conquer petabyte-scale data




© Matthew McCullough, Ambient Ideas, LLC
Metadata

Matthew McCullough
  Ambient Ideas, LLC
  matthewm@ambientideas.com
  http://ambientideas.com/blog
  @matthewmccull
Code
  http://github.com/matthewmccullough/hadoop-intro

Mais conteúdo relacionado

Destaque

LCI Brand DNA Guide[1]
LCI Brand DNA Guide[1]LCI Brand DNA Guide[1]
LCI Brand DNA Guide[1]
Sonoco
 
Welcome to london
Welcome to londonWelcome to london
Welcome to london
corynik
 
Dasarengine 131021130211-phpapp01 (1) (1)
Dasarengine 131021130211-phpapp01 (1) (1)Dasarengine 131021130211-phpapp01 (1) (1)
Dasarengine 131021130211-phpapp01 (1) (1)
cecep supriadi
 
ชีทสรุปกฎหมายปกครองเบื้องต้น
ชีทสรุปกฎหมายปกครองเบื้องต้นชีทสรุปกฎหมายปกครองเบื้องต้น
ชีทสรุปกฎหมายปกครองเบื้องต้น
Mac Legendlaw
 

Destaque (19)

Geralnews13jan
Geralnews13janGeralnews13jan
Geralnews13jan
 
Global big data bootcamp-april2016
Global big data bootcamp-april2016Global big data bootcamp-april2016
Global big data bootcamp-april2016
 
2015 HPX BlindNavi
2015 HPX BlindNavi2015 HPX BlindNavi
2015 HPX BlindNavi
 
LCI Brand DNA Guide[1]
LCI Brand DNA Guide[1]LCI Brand DNA Guide[1]
LCI Brand DNA Guide[1]
 
Arum W --- Portfolio
Arum W --- PortfolioArum W --- Portfolio
Arum W --- Portfolio
 
Greenhouse gas trade-offs and N cycling in low-disturbance soils with long te...
Greenhouse gas trade-offs and N cycling in low-disturbance soils with long te...Greenhouse gas trade-offs and N cycling in low-disturbance soils with long te...
Greenhouse gas trade-offs and N cycling in low-disturbance soils with long te...
 
Welcome to london
Welcome to londonWelcome to london
Welcome to london
 
Dasarengine 131021130211-phpapp01 (1) (1)
Dasarengine 131021130211-phpapp01 (1) (1)Dasarengine 131021130211-phpapp01 (1) (1)
Dasarengine 131021130211-phpapp01 (1) (1)
 
Blut abnehmen (Deutsch für Ärzte)
Blut abnehmen (Deutsch für Ärzte)Blut abnehmen (Deutsch für Ärzte)
Blut abnehmen (Deutsch für Ärzte)
 
Chp4. Contextual Analysis
Chp4. Contextual AnalysisChp4. Contextual Analysis
Chp4. Contextual Analysis
 
The state of twitter 2016
The state of twitter 2016The state of twitter 2016
The state of twitter 2016
 
ชีทสรุปกฎหมายปกครองเบื้องต้น
ชีทสรุปกฎหมายปกครองเบื้องต้นชีทสรุปกฎหมายปกครองเบื้องต้น
ชีทสรุปกฎหมายปกครองเบื้องต้น
 
COURS
COURSCOURS
COURS
 
[Dl輪読会]video pixel networks
[Dl輪読会]video pixel networks[Dl輪読会]video pixel networks
[Dl輪読会]video pixel networks
 
[輪読会]Multilingual Image Description with Neural Sequence Models
[輪読会]Multilingual Image Description with Neural Sequence Models[輪読会]Multilingual Image Description with Neural Sequence Models
[輪読会]Multilingual Image Description with Neural Sequence Models
 
GoogLeNet Insights
GoogLeNet InsightsGoogLeNet Insights
GoogLeNet Insights
 
AIRCOM LTE Webinar 3 - LTE Carriers
AIRCOM LTE Webinar 3 - LTE CarriersAIRCOM LTE Webinar 3 - LTE Carriers
AIRCOM LTE Webinar 3 - LTE Carriers
 
Scenario Design Process
Scenario Design ProcessScenario Design Process
Scenario Design Process
 
Andrew Ng, Chief Scientist at Baidu
Andrew Ng, Chief Scientist at BaiduAndrew Ng, Chief Scientist at Baidu
Andrew Ng, Chief Scientist at Baidu
 

Semelhante a Hadoop at JavaZone 2010

Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large Clusters
Abhishek Singh
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
Dan Harvey
 
Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop
Mochi: Visual Log-Analysis Based Tools for Debugging HadoopMochi: Visual Log-Analysis Based Tools for Debugging Hadoop
Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop
George Ang
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
ijwscjournal
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
cscpconf
 

Semelhante a Hadoop at JavaZone 2010 (20)

Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large Clusters
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Mapreduce Osdi04
Mapreduce Osdi04Mapreduce Osdi04
Mapreduce Osdi04
 
Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop
Mochi: Visual Log-Analysis Based Tools for Debugging HadoopMochi: Visual Log-Analysis Based Tools for Debugging Hadoop
Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop
 
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
 
43_Sameer_Kumar_Das2
43_Sameer_Kumar_Das243_Sameer_Kumar_Das2
43_Sameer_Kumar_Das2
 
International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)
 
CLOUD BIOINFORMATICS Part1
 CLOUD BIOINFORMATICS Part1 CLOUD BIOINFORMATICS Part1
CLOUD BIOINFORMATICS Part1
 
Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers, Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers,
 
sigmod08
sigmod08sigmod08
sigmod08
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
 
Scalable Parallel Computing on Clouds
Scalable Parallel Computing on CloudsScalable Parallel Computing on Clouds
Scalable Parallel Computing on Clouds
 
Exploiting dynamic resource allocation for
Exploiting dynamic resource allocation forExploiting dynamic resource allocation for
Exploiting dynamic resource allocation for
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Mapreduce2008 cacm
Mapreduce2008 cacmMapreduce2008 cacm
Mapreduce2008 cacm
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 

Mais de Matthew McCullough

Mais de Matthew McCullough (20)

Using Git and GitHub Effectively at Emerge Interactive
Using Git and GitHub Effectively at Emerge InteractiveUsing Git and GitHub Effectively at Emerge Interactive
Using Git and GitHub Effectively at Emerge Interactive
 
All About GitHub Pull Requests
All About GitHub Pull RequestsAll About GitHub Pull Requests
All About GitHub Pull Requests
 
Adam Smith Builds an App
Adam Smith Builds an AppAdam Smith Builds an App
Adam Smith Builds an App
 
Git's Filter Branch Command
Git's Filter Branch CommandGit's Filter Branch Command
Git's Filter Branch Command
 
Git Graphs, Hashes, and Compression, Oh My
Git Graphs, Hashes, and Compression, Oh MyGit Graphs, Hashes, and Compression, Oh My
Git Graphs, Hashes, and Compression, Oh My
 
Git and GitHub at the San Francisco JUG
 Git and GitHub at the San Francisco JUG Git and GitHub at the San Francisco JUG
Git and GitHub at the San Francisco JUG
 
Finding Things in Git
Finding Things in GitFinding Things in Git
Finding Things in Git
 
Git and GitHub for RallyOn
Git and GitHub for RallyOnGit and GitHub for RallyOn
Git and GitHub for RallyOn
 
Migrating from Subversion to Git and GitHub
Migrating from Subversion to Git and GitHubMigrating from Subversion to Git and GitHub
Migrating from Subversion to Git and GitHub
 
Git Notes and GitHub
Git Notes and GitHubGit Notes and GitHub
Git Notes and GitHub
 
Intro to Git and GitHub
Intro to Git and GitHubIntro to Git and GitHub
Intro to Git and GitHub
 
Build Lifecycle Craftsmanship for the Transylvania JUG
Build Lifecycle Craftsmanship for the Transylvania JUGBuild Lifecycle Craftsmanship for the Transylvania JUG
Build Lifecycle Craftsmanship for the Transylvania JUG
 
Git Going for the Transylvania JUG
Git Going for the Transylvania JUGGit Going for the Transylvania JUG
Git Going for the Transylvania JUG
 
Transylvania JUG Pre-Meeting Announcements
Transylvania JUG Pre-Meeting AnnouncementsTransylvania JUG Pre-Meeting Announcements
Transylvania JUG Pre-Meeting Announcements
 
Game Theory for Software Developers at the Boulder JUG
Game Theory for Software Developers at the Boulder JUGGame Theory for Software Developers at the Boulder JUG
Game Theory for Software Developers at the Boulder JUG
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUG
 
JQuery Mobile
JQuery MobileJQuery Mobile
JQuery Mobile
 
R Data Analysis Software
R Data Analysis SoftwareR Data Analysis Software
R Data Analysis Software
 
Please, Stop Using Git
Please, Stop Using GitPlease, Stop Using Git
Please, Stop Using Git
 
Dr. Strangedev
Dr. StrangedevDr. Strangedev
Dr. Strangedev
 

Último

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Último (20)

SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 

Hadoop at JavaZone 2010

  • 1. Hadoop divide and conquer petabyte-scale data © Matthew McCullough, Ambient Ideas, LLC
  • 2. Metadata Matthew McCullough Ambient Ideas, LLC matthewm@ambientideas.com http://ambientideas.com/blog @matthewmccull Code http://github.com/matthewmccullough/hadoop-intro
  • 4.
  • 5.
  • 6.
  • 7. ado op o ften. I u se H d my Tivo the s oun Th at’s I s kip a very time ma kes e com mer cial. rrency in Practice Ja va Concu n Goetz, author of -Bria
  • 9. u are using mpu ter yo The co right now ell hav e the very w sor may GHz pro ces fas test u’ll ever own yo
  • 14.
  • 15. Do ug Cu tt in g
  • 22. Today 0.21.0 current version Dozens of companies contributing Hundreds of companies using
  • 23.
  • 24.
  • 25.
  • 27. VM Sources Yahoo True to the OSS distribution Cloudera Desktop tools Both VMWare based
  • 28. Set up the Environment • Launch “cloudera-training 0.3.3” VMWare instance • Open a terminal window Lab
  • 29. Log Files •Attach tail to all the Hadoop logs Lab
  • 31. Processing Framework Data Storage MapReduce HDFS Hive QL Hive Chuckwa HBase Flume Avro ZooKeeper Hybrid Pig
  • 33. Hadoop Components Tool Purpose Common MapReduce HDFS Filesystem Pig Analyst MapReduce language HBase Column-oriented data storage Hive SQL-like language for HBase ZooKeeper Workflow & distributed transactions Chukwa Log file processing
  • 34. the Players Comm C hukwa ZooKeeper on H Base Hive HDFS
  • 36. Original Goals Web Indexer Yahoo Search Engine
  • 37. Pre-Hadoop Shortcomings Search engine update frequency Storage costs Expense of durability Ad-hoc queries
  • 38. Anti-Patterns Cured RAM-heavy RDBMS boxes Sharding Archiving Ever-smaller-range SQL queries
  • 44. Applications Protein folding pharmaceutical research Search Engine Indexing walking billions of web pages Product Recommendations based on other customer purchases Sorting terabytes to petabyes in size Classification government intelligence
  • 46. SELECT reccProd.name, reccProd.id FROM products reccProd WHERE purchases.customerId = (SELECT customerId FROM customers WHERE purchases.productId = thisProd) LIMIT 5
  • 47. 30% of Amazon sales are from recommendations
  • 48. ACID ATOMICITY CONSISTENCY ISOLATION DURABILITY
  • 49. CAP Consistency Availability Partition Tolerance
  • 51. MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghem awat jeff@google.com, sanjay@goog le.com Google, Inc. Abstract given day, etc. Most such com MapReduce is a programmin putations are conceptu- g model and an associ- ally straightforward. However ated implementation for proc , the input data is usually essing and generating large large and the computations have data sets. Users specify a map to be distributed across function that processes a hundreds or thousands of mac key/value pair to generate a set hines in order to finish in of intermediate key/value a reasonable amount of time. pairs, and a reduce function that The issues of how to par- merges all intermediate allelize the computation, dist values associated with the sam ribute the data, and handle e intermediate key. Many failures conspire to obscure the real world tasks are expressible original simple compu- in this model, as shown tation with large amounts of in the paper. complex code to deal with these issues. Programs written in this func As a reaction to this complex tional style are automati- ity, we designed a new cally parallelized and executed abstraction that allows us to exp on a large cluster of com- ress the simple computa- modity machines. The run-time tions we were trying to perform system takes care of the but hides the messy de- details of partitioning the inpu tails of parallelization, fault-tol t data, scheduling the pro- erance, data distribution gram’s execution across a set and load balancing in a libra of machines, handling ma- ry. Our abstraction is in- chine failures, and managing spired by the map and reduce the required inter-machine primitives present in Lisp communication. This allows and many other functional lang programmers without any uages. We realized that experience with parallel and most of our computations invo distributed systems to eas- lved applying a map op- ily utilize the resources of a larg eration to each logical “record” e distributed system. in our input in order to Our implementation of Map compute a set of intermediat Reduce runs on a large e key/value pairs, and then cluster of commodity machine applying a reduce operation to s and is highly scalable: all the values that shared a typical MapReduce computa the same key, in order to com tion processes many ter- bine the derived data ap- abytes of data on thousands of propriately. Our use of a func machines. Programmers tional model with user- find the system easy to use: hun specified map and reduce ope dreds of MapReduce pro- rations allows us to paral- grams have been implemente lelize large computations easi d and upwards of one thou- ly and to use re-execution sand MapReduce jobs are exec as the primary mechanism for uted on Google’s clusters fault tolerance. every day. The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined 1 Introduction with an implementation of this interface that achieves high performance on large clus ters of commodity PCs. Over the past five years, the Section 2 describes the basic authors and many others at programming model and Google have implemented hun gives several examples. Sec dreds of special-purpose tion 3 describes an imple- computations that process larg mentation of the MapReduce e amounts of raw data, interface tailored towards such as crawled documents, our cluster-based computing web request logs, etc., to environment. 
Section 4 de- compute various kinds of deri scribes several refinements of ved data, such as inverted the programming model indices, various representatio that we have found useful. Sec ns of the graph structure tion 5 has performance of web documents, summaries measurements of our implem of the number of pages entation for a variety of crawled per host, the set of tasks. Section 6 explores the most frequent queries in a use of MapReduce within Google including our experie nces in using it as the basis To appear in OSDI 2004 1
  • 52. mode l and gram ming “ A pro for ment ation imple nd ge nera ting pro cess ing a dat a s ets ” la rg e
  • 53.
  • 54. MapReduce a word counting conceptual example
  • 55. The Goal Provide the occurrence count of each distinct word across all documents
  • 56. Raw Data a folder of documents mydoc1.txt mydoc2.txt mydoc3.txt
  • 60. Reduce Again sort occurrences alphabetically
  • 61.
  • 63-67.
package org.apache.hadoop.examples;

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
  private Grep() {}  // singleton

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    Path tempDir = new Path("grep-temp-" +
        Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    JobConf grepJob = new JobConf(getConf(), Grep.class);

    try {
      // Job 1: map with the regex, combine/reduce by summing match counts
      grepJob.setJobName("grep-search");
      FileInputFormat.setInputPaths(grepJob, args[0]);
      grepJob.setMapperClass(RegexMapper.class);
      grepJob.set("mapred.mapper.regex", args[2]);
      if (args.length == 4)
        grepJob.set("mapred.mapper.regex.group", args[3]);
      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);
      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormat(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);
      JobClient.runJob(grepJob);

      // Job 2: invert (match, count) to (count, match) and sort
      JobConf sortJob = new JobConf(Grep.class);
      sortJob.setJobName("grep-sort");
      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);
      sortJob.setMapperClass(InverseMapper.class);
      sortJob.setNumReduceTasks(1);  // write a single file
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      sortJob.setOutputKeyComparatorClass(  // sort by decreasing freq
          LongWritable.DecreasingComparator.class);
      JobClient.runJob(sortJob);
    } finally {
      FileSystem.get(grepJob).delete(tempDir, true);
    }
    return 0;
  }

  // standard ToolRunner entry point, as in the stock Hadoop example
  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }
}
  • 69-70.
/**
 * Licensed to the Apache Software Foundation (ASF) under the Apache
 * License, Version 2.0; see http://www.apache.org/licenses/LICENSE-2.0
 */
package org.apache.hadoop.mapred.lib;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** A {@link Mapper} that extracts text matching a regular expression. */
public class RegexMapper<K> extends MapReduceBase
    implements Mapper<K, Text, Text, LongWritable> {

  private Pattern pattern;
  private int group;

  public void configure(JobConf job) {
    pattern = Pattern.compile(job.get("mapred.mapper.regex"));
    group = job.getInt("mapred.mapper.regex.group", 0);
  }

  public void map(K key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    String text = value.toString();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      // every regex hit becomes a (match, 1) pair for the summing reducer
      output.collect(new Text(matcher.group(group)), new LongWritable(1));
    }
  }
}
  • 71. Have Code, Will Travel Code travels to the data Opposite of traditional systems
  • 74. “A scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware”
  • 75. HDFS Basics Open source implementation of the Google File System (GFS) Replicated data store Stored in 64MB blocks
  • 76. HDFS Rack location aware Configurable redundancy factor Self-healing Looks almost like *NIX filesystem
  • 77. Why HDFS? Random reads Parallel reads Redundancy
  • 80. Test HDFS • Upload a file • List directories Lab
  • 81. HDFS Upload • Show contents of file in HDFS • Show vocabulary of HDFS Lab
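Both labs above come down to a handful of hadoop fs subcommands; a sketch, with the file and directory names invented for illustration:

hadoop fs -put mydoc1.txt /user/training/        # upload a local file
hadoop fs -ls /user/training                     # list a directory
hadoop fs -cat /user/training/mydoc1.txt         # show a file's contents
hadoop fs -help                                  # the full HDFS shell vocabulary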
  • 82. HDFS Distcp • Copy file to other nodes • MapReduce job starts Lab
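distcp is itself a MapReduce job, which is why the lab has you watch one start. A sketch of the invocation; both cluster URIs here are assumptions:

hadoop distcp hdfs://namenode1:8020/user/training/big-data \
              hdfs://namenode2:8020/user/training/big-data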
  • 83. HDFS Challenges Writes are re-writes today Append is planned Block size, alignment Small file inefficiencies NameNode SPOF
  • 84. FSCK •Check filesystem Lab
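The lab maps to the hadoop fsck command; the paths below are illustrative:

hadoop fsck /                                          # health of the whole namespace
hadoop fsck /user/training -files -blocks -locations   # per-file block placement detail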
  • 89. NOSQL is... Semi-structured Not the death of the RDBMS No JOINing No normalization Big-data tools solving different issues than RDBMSes
  • 91. Scalable Data storage is pipelined Code travels to data Near linear hardware scalability
  • 92. Optimization Speculative execution Maximizes use of faster hardware New jobs trump speculative tasks
  • 93. Fault Tolerant Configurable data redundancy Minimizes hardware failure impact Automatic job retries Self healing filesystem
  • 95. Server Funerals No pagers go off when machines die Report of dead machines once a week Clean out the carcasses
  • 96. Robustness attributes prevented from bleeding into application code Data redundancy Node death Retries Data geography Parallelism Scalability
  • 99. [Diagram] Grid roles: a robust, primary machine (1) running the NameNode and JobTracker; an optional hot backup (0..1) running the Secondary NameNode and a standby JobTracker, fed over NFS; commodity machines (1..N) each running a TaskTracker and a DataNode
  • 100. NameNode Catalog of data blocks Prefers large files High memory consumption Journaled file system Saves state to disk
  • 101. [Diagram] The robust, primary machine (1) alone: NameNode and JobTracker, above the commodity TaskTracker/DataNode machines (1..N)
  • 102. Secondary NameNode Near-hot backup for the NN Snapshots can miss recent journal entries Data loss possible in a non-optimal setup Grid reconfig? Big IP device
  • 103. [Diagram] The primary NameNode/JobTracker pair (1) beside its hot backup (0..1): the Secondary NameNode and standby JobTracker
  • 104. NameNode, Cont’d Backup via SNN Journal to NFS with replication? Single point of failure Failover plan Big IP device, virtual IP
  • 105. [Diagram] The grid with a virtual IP in front: primary NameNode/JobTracker, NFS-fed Secondary NameNode hot backup, and the commodity TaskTracker/DataNode machines (1..N)
  • 106. DataNode Anonymous Data block storage “No identity” is positive trait Commodity equipment
  • 107. [Diagram] The full grid again, here highlighting the DataNode role on the commodity machines
  • 108. JobTracker Singleton Commonly co-located with the NN Coordinates the TaskTrackers Load balances Fair scheduling Speculative execution
  • 109. [Diagram] The full grid again, here highlighting the JobTracker on the primary machine
  • 110. TaskTracker Runs MapReduce jobs Reports back to JobTracker
  • 111. [Diagram] The full grid again, here highlighting the TaskTracker role on the commodity machines
  • 112. Listing Nodes • Use Java's jps tool to list the Hadoop daemon processes on the box • sudo jps -l Lab
  • 113. Processing Nodes Anonymous “No identity” is positive trait Commodity equipment
  • 114. Master Node Master is a special machine Use high quality hardware Single point of failure Recoverable
  • 116. HBase
  • 119. HBase Basics Map-oriented storage Key value pairs Column families Stores to HDFS Fast Usable for synchronous responses
  • 120. Direct Comparison Amazon SimpleDB Google BigTable (Datastore)
  • 121. Competitors Facebook Cassandra LinkedIn Project Voldemort
  • 122. Shortcomings HQL WHERE clause limited to key Other column filters are full-table scans No ad-hoc queries
  • 123.
hbase> help
hbase> create 'mylittletable', 'mylittlecolumnfamily'
hbase> describe 'mylittletable'
hbase> put 'mylittletable', 'r2', 'mylittlecolumnfamily', 'x'
hbase> get 'mylittletable', 'r2'
hbase> scan 'mylittletable'
  • 124. HBase • Create column family • Insert data • Select data Lab
  • 125. DSLs
  • 126. Pig
  • 127. Pig Basics Yahoo-authored add-on High-level language for authoring data analysis programs Console
  • 128. Pig Questions Ask big questions on unstructured data How many ___? Should we ____? Decide on the questions you want to ask long after you've collected the data.
  • 129. Pig Data 999991,female,Mary,T,Hargrave,600 Quiet Valley Lane,Los Angeles,CA,90017,US,Mary.T.Hargrave@dodgit.com, 999992,male,Harold,J,Candelario,294 Ford Street,OAKLAND,CA,94607,US,Harold.J.Candelario@dodgit.com,ad2U 999993,female,Ruth,G,Carter,4890 Murphy Court,Shakopee,MN,55379,US,Ruth.G.Carter@mailinator.com,uaseu8e 999994,male,Lionel,J,Carter,2701 Irving Road,Saint Clairsville,OH,43950,US,Lionel.J.Carter@trashymail.c 999995,female,Georgia,C,Medina,4541 Cooks Mine Road,CLOVIS,NM,88101,US,Georgia.C.Medina@trashymail.com, 999996,male,Stanley,S,Cruz,1463 Jennifer Lane,Durham,NC,27703,US,Stanley.S.Cruz@pookmail.com,aehoh5rooG 999997,male,Justin,A,Delossantos,3169 Essex Court,MANCHESTER,VT,05254,US,Justin.A.Delossantos@mailinato 999998,male,Leonard,K,Baker,4672 Margaret Street,Houston,TX,77063,US,Leonard.K.Baker@trashymail.com,Aep 999999,female,Charissa,J,Thorne,2806 Cedar Street,Little Rock,AR,72211,US,Charissa.J.Thorne@trashymail. 1000000,male,Michael,L,Powell,2797 Turkey Pen Road,New York,NY,10013,US,Michael.L.Powell@mailinator.com
  • 130. Pig Sample
Person = LOAD 'people.csv' USING PigStorage(',');
Names = FOREACH Person GENERATE $2 AS name;
OrderedNames = ORDER Names BY name ASC;
GroupedNames = GROUP OrderedNames BY name;
NameCount = FOREACH GroupedNames GENERATE group, COUNT(OrderedNames);
STORE NameCount INTO 'names.out';
  • 133. Pig Scripting •Re-column and sort data Lab
  • 134. Hive
  • 135. Hive Basics Authored by Facebook SQL-like interface to Hadoop data Hive is low-level Hive-specific metadata Data warehousing
  • 136. hive -e 'select t1.x from mylittletable t1'
  • 137. hive -S -e 'select a.col from tab1 a' > dump.txt
  • 138. SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10;
  • 139. Load Hive Data • Load and select Hive data Lab
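A sketch of what the lab amounts to, assuming a tab-delimited word-frequency file already sits in HDFS; the table and path names are illustrative:

hive> CREATE TABLE shakespeare (freq INT, word STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user/training/shakespeare-freq' INTO TABLE shakespeare;
hive> SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10;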
  • 141. Sync, Async RDBMS SQL is realtime Hive is primarily asynchronous
  • 142. Sqoop
  • 144. Sqoop A command-line utility Imports tables from an RDBMS Outputs plaintext, SequenceFile, or Hive tables
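A hedged sketch of an import; the JDBC URL, credentials, and table name are placeholders, and --hive-import asks Sqoop to create and load a matching Hive table rather than plain files:

sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --username dbuser \
  --table employees \
  --hive-import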
  • 147. Sync, Async RDBMS is realtime HBase is near realtime Hive is asynchronous
  • 149. Web Status Panels NameNode http://localhost:50070/ JobTracker http://localhost:50030/
  • 154. View Consoles • Start job • Watch consoles • Browse HDFS via web interface Lab
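To give the status panels something to display, launch any small job first; the bundled examples jar works (the jar path matches the 0.20 training VM layout and may differ elsewhere):

hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar pi 4 1000

Then watch it progress at the JobTracker console above.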
  • 155. Configuration Files
  • 156. [Diagram] The full grid architecture once more, as context for where each configuration file applies
  • 158. Single Node start-all.sh Needs XML config ssh account setup Multiple JVMs List of processes jps -l
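The XML config the slide calls for is three small files under the Hadoop conf/ directory. A minimal pseudo-distributed sketch, assuming common 0.20 defaults; the port numbers vary by distribution:

<!-- conf/core-site.xml: where the filesystem lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: one machine means one copy of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: where the JobTracker listens -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>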
  • 159. Start and Stop Grid •cd /usr/lib/hadoop-0.20/bin •sudo -u hadoop ./stop-all.sh •sudo jps -l •tail •sudo -u hadoop ./start-all.sh •sudo jps -l Lab
  • 160. SSH Setup
Create a key:
  ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Authorize it for passwordless login:
  cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  • 161. [Diagram] start-all.sh on the robust, primary machine invokes start-dfs.sh and start-mapred.sh, bringing up the NameNode and JobTracker there and the DataNode and TaskTracker on each commodity machine (1..N)
  • 162. Complete Grid start-all.sh Host names, ssh setup Default out-of-box experience
  • 163. [Diagram] The same start-all.sh flow fanning out across a complete grid: start-dfs.sh and start-mapred.sh reach every commodity node's DataNode and TaskTracker
  • 164. Grid on the Cloud
  • 168. Sapir-Whorf Hypothesis... Remixed The ability to store and process massive data influences what you decide to store.
  • 170. Hadoop divide and conquer petabyte-scale data © Matthew McCullough, Ambient Ideas, LLC
  • 171. Metadata Matthew McCullough Ambient Ideas, LLC matthewm@ambientideas.com http://ambientideas.com/blog @matthewmccull Code http://github.com/matthewmccullough/hadoop-intro