Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011



                      Lecture 4
                 September 15, 2011

        Jason Baldridge                      Matt Lease
   Department of Linguistics           School of Information
  University of Texas at Austin    University of Texas at Austin
Jasonbaldridge at gmail dot com   ml at ischool dot utexas dot edu
Acknowledgments
        Course design and slides based on
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures courtesy of the following
excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)
Today’s Agenda
• Practical Hadoop
  – Input/Output
  – Splits: small file and whole file operations
  – Compression
  – Mounting HDFS
  – Hadoop Workflow and EC2/S3
Practical Hadoop
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)



    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);
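
In Hadoop's Java API the pseudocode above translates almost line for line. A minimal sketch using the old org.apache.hadoop.mapred interfaces (class names and the whitespace tokenizer are illustrative, not part of the original deck):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Mapper: emit (word, 1) for every whitespace-delimited token.
    public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();   // re-used across calls

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
          if (token.isEmpty()) continue;
          word.set(token);
          output.collect(word, ONE);
        }
      }
    }

    // Reducer: sum the counts for each word.
    class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }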
Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 17
Courtesy of Chuck Lam’s Hadoop In Action (2010), pp. 48-49
Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 51
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 191
Command-Line Parsing




      Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 135
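
The figure from White (not reproduced here) covers command-line parsing with GenericOptionsParser via the Tool / ToolRunner pattern. A minimal sketch of that pattern, with a hypothetical MyJob driver:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // ToolRunner parses the standard Hadoop options (-D key=value,
    // -files, -libjars, ...) before handing control to run().
    public class MyJob extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        Configuration conf = getConf();   // already reflects -D overrides
        // ... configure and submit the MapReduce job here ...
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJob(), args));
      }
    }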
Data Types in Hadoop

          Writable        Defines a de/serialization protocol.
                          Every data type in Hadoop is a Writable.


     WritableComparable Defines a sort order. All keys must be
                        of this type (but not values).




        IntWritable       Concrete classes for different data types.
        LongWritable
        Text
        …



        SequenceFiles      Binary encoding of a sequence of
                           key/value pairs
Hadoop basic types




             Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 46
Complex Data Types in Hadoop
   How do you implement complex data types?
   The easiest way:
        Encode it as Text, e.g., (a, b) = “a:b”
       Use regular expressions to parse and extract data
       Works, but pretty hack-ish
   The hard way:
        Define a custom implementation of WritableComparable (see sketch below)
       Must implement: readFields, write, compareTo
       Computationally efficient, but slow for rapid prototyping
   Alternatives:
       Cloud9 offers two other choices: Tuple and JSON
       (Actually, not that useful in practice)
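
For concreteness, a minimal sketch of “the hard way”: a hypothetical (a, b) string-pair key implementing the three required methods (hashCode/equals, which a real key should also override, are omitted):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical (a, b) pair usable as a Hadoop key.
    public class TextPair implements WritableComparable<TextPair> {
      private String a = "";
      private String b = "";

      public TextPair() {}                                  // required no-arg constructor
      public TextPair(String a, String b) { this.a = a; this.b = b; }

      @Override
      public void write(DataOutput out) throws IOException {     // serialization
        out.writeUTF(a);
        out.writeUTF(b);
      }

      @Override
      public void readFields(DataInput in) throws IOException {  // deserialization
        a = in.readUTF();
        b = in.readUTF();
      }

      @Override
      public int compareTo(TextPair other) {                     // sort order for keys
        int cmp = a.compareTo(other.a);
        return (cmp != 0) ? cmp : b.compareTo(other.b);
      }
    }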
InputFormat & RecordReader
                                     Courtesy of Tom White’s
                                     Hadoop: The Definitive Guide,
                                     2nd Edition (2010), pp. 198-199




                         Split is logical; atomic
                         records are never split




                        Note re-use of key & value objects!
Courtesy of Tom White’s
Hadoop: The Definitive Guide,
2nd Edition (2010), p. 201
Input




        Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 53
Output




         Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 58
OutputFormat: each Reducer writes its records through a RecordWriter to its own Output File

Source: redrawn from a slide by Cloudera, cc-licensed
Creating Input Splits (White pp. 202-203)




    FileInputFormat: large files split into blocks
        isSplitable() – default TRUE
        computeSplitSize() = max(minSize, min(maxSize, blockSize))
        getSplits()…
    How to prevent splitting?
        Option 1: set mapred.min.split.size=Long.MAX_VALUE
        Option 2: subclass FileInputFormat, override isSplitable() to return FALSE (sketch below)
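
Option 2 amounts to a one-method subclass. A sketch against the old-API TextInputFormat (the class name is made up):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical input format that never splits its files:
    // each file is consumed whole by a single map task.
    public class NonSplittableTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }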
How to process whole file as a single record?

   e.g. file conversion


   Preventing splitting is necessary, but not sufficient
       Need a RecordReader that delivers entire file as a record


   Implement WholeFile input format & record reader recipe
       See White pp. 206-209
       Overrides getRecordReader() in FileInputFormat
       Defines new WholeFileRecordReader
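
A condensed sketch of that recipe in the old mapred API (see White pp. 206-209 for the full version; error handling is trimmed):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Each input file becomes one (NullWritable, BytesWritable) record
    // holding the entire file contents.
    public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) { return false; }

      @Override
      public RecordReader<NullWritable, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
      }
    }

    class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
      private final FileSplit split;
      private final Configuration conf;
      private boolean processed = false;

      WholeFileRecordReader(FileSplit split, Configuration conf) {
        this.split = split;
        this.conf = conf;
      }

      public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) return false;                 // exactly one record per file
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = fs.open(file);
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
          value.set(contents, 0, contents.length);
        } finally {
          IOUtils.closeStream(in);
        }
        processed = true;
        return true;
      }

      public NullWritable createKey() { return NullWritable.get(); }
      public BytesWritable createValue() { return new BytesWritable(); }
      public long getPos() { return processed ? split.getLength() : 0; }
      public float getProgress() { return processed ? 1.0f : 0.0f; }
      public void close() { }
    }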
Small Files
   Files < Hadoop block size are never split (by default)
        Note this is with the default mapred.min.split.size = 1 byte
       Could extend FileInputFormat to override this behavior
   Using many small files inefficient in Hadoop
       Overhead for TaskTracker, JobTracker, Map object, …
       Requires more disk seeks
       Wasteful for NameNode memory
   How to deal with small files??
Dealing with small files
    Pre-processing: merge into one or more bigger files
        Doubles disk space, unless clever (can delete after merge)
        Create Hadoop Archive (White pp. 72-73)
          • Doesn’t solve splitting problem, just reduces NameNode memory
        Simple text: just concatenate (e.g. each record on a single line)
        XML: concatenate, specify start/end tags
          StreamXmlRecordReader (start/end tags delimit records, like newline does for plain Text)
        Create a SequenceFile (see White pp. 117-118)
          • Sequence of records, all with same (key,value) type
          • E.g. Key=filename, Value=text or bytes of original file
          • Can also use for larger files, e.g. if block processing is really fast
    Use CombineFileInputFormat
        Reduces map overhead, but not seeks or NameNode memory…
        Only an abstract class provided, you get to implement it… :-<
        Could use to speed up the pre-processing above…
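
As a sketch of the SequenceFile route, a hypothetical packing tool that writes one (filename, raw bytes) record per small file (paths come from the command line; recursion and error handling omitted):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Pack every file directly under args[0] into one SequenceFile at args[1],
    // keyed by file name with the raw bytes as the value.
    public class SmallFilePacker {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
        try {
          for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            if (status.isDir()) continue;
            byte[] contents = new byte[(int) status.getLen()];
            FSDataInputStream in = fs.open(status.getPath());
            try {
              IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
              IOUtils.closeStream(in);
            }
            writer.append(new Text(status.getPath().getName()),
                          new BytesWritable(contents));
          }
        } finally {
          writer.close();
        }
      }
    }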
Multiple File Formats?
    What if you have multiple formats for same content type?
    MultipleInputs (White pp. 214-215)
        Specify InputFormat & Mapper to use on a per-path basis
          • Path could be a directory or a single file
              • Even a single file could have many records (e.g. Hadoop archive or
                SequenceFile)
        All mappers must have the same output signature!
          • Same reducer used for all (only input format is different, not the
            logical records being processed by the different mappers)
    What about multiple file formats stored in the same
     Archive or SequenceFile?
    Multiple formats stored in the same directory?
    How are multiple file types typically handled in general?
        e.g. factory pattern, White p. 80
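
A sketch of the MultipleInputs wiring in the old mapred API; the two mapper classes and the paths are placeholders for whatever formats you actually have:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class MultiFormatSetup {
      // PlainTextMapper and TabSeparatedMapper are hypothetical; both must
      // emit the same (key, value) output types so one reducer handles all.
      static void configure(JobConf conf) {
        MultipleInputs.addInputPath(conf, new Path("/data/plain"),
            TextInputFormat.class, PlainTextMapper.class);
        MultipleInputs.addInputPath(conf, new Path("/data/tabbed"),
            KeyValueTextInputFormat.class, TabSeparatedMapper.class);
        // conf.setReducerClass(...) — one reducer shared by both inputs
      }
    }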
White 77-86, Lam 153-155

Data Compression
   Big data = big disk space & I/O-bound transfer times
       Affects both intermediate (mapper output) and persistent data
   Compression makes big data less big (but still cool)
       Often 1/4th size of original data
   Main issues
       Does the compression format support splitting?
         • What happens to parallelization if an entire 8GB compressed file has
           to be decompressed before we can access the splits?
       Compression/decompression ratio vs. speed
         • More compression reduces disk space and transfer times, but…
         • Slow compression can take longer than reduced transfer time savings
         • Use native libraries!
Courtesy of Tom White’s
                           Hadoop: The Definitive Guide,
                           2nd Edition (2010), Ch. 4




Slow; decompression can’t keep pace with disk reads
Compression Speed
      LZO 2x faster than gzip
      LZO ~15-20x faster than bzip2
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/




 http://arunxjacob.blogspot.com/2011/04/rolling-out-splittable-lzo-on-cdh3.html
Splittable LZO to the rescue
   LZO format not internally splittable, but we can create
    a separate, accompanying index of split points
Recipe
   Get LZO from Cloudera or elsewhere, and setup
       See URL on last slide for instructions
   LZO compress files, copy to HDFS at /path
   Index them: $ hadoop jar /path/to/hadoop-lzo.jar
    com.hadoop.compression.lzo.LzoIndexer /path
   Use hadoop-lzo’s LzoTextInputFormat instead of TextInputFormat
   Voila!
Compression API for persistent data
   JobConf helper functions –or– set properties
   Input
       conf.setInputFormatClass(LzoTextInputFormat.class);
   Persistent (reducer) output
       FileOutputFormat.setCompressOutput(conf, true)
       FileOutputFormat.setOutputCompressorClass(conf, LzopCodec.class)




                                                              Courtesy of Tom White’s
                                                              Hadoop: The Definitive Guide,
                                                              2nd Edition (2010), p. 85
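
Pulled together, the persistent-output calls might look like the sketch below. For concreteness it uses GzipCodec from core Hadoop; the LzopCodec and LzoTextInputFormat classes from the third-party hadoop-lzo package would slot into the same calls.

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class OutputCompressionSetup {
      static void configure(JobConf conf) {
        conf.setInputFormat(TextInputFormat.class);            // or LzoTextInputFormat
        FileOutputFormat.setCompressOutput(conf, true);        // compress reducer output
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);  // or LzopCodec
      }
    }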
Compression API for intermediate data
    Similar JobConf helper functions –or– set properties
         conf.setCompressMapOutput(true)
         conf.setMapOutputCompressorClass(LzopCodec.class)




                                                     Courtesy of Chuck Lam’s
                                                     Hadoop In Action(2010),
                                                     pp. 153-155
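
The corresponding sketch for intermediate data, again with a core codec standing in where LzopCodec would go:

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOutputCompressionSetup {
      // Compress the intermediate map output that is shuffled to reducers.
      static void configure(JobConf conf) {
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);   // or LzopCodec
      }
    }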
SequenceFile & compression
   Use SequenceFile for passing data between Hadoop jobs
       Optimized for this usage case
       conf.setOutputFormat(SequenceFileOutputFormat.class)
   With compression, one more parameter to set
        Default compression is per-record; per-block compression is almost always preferable
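
A minimal sketch of both settings together (old API):

    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class SequenceFileOutputSetup {
      // Write job output as a block-compressed SequenceFile -- a good default
      // when chaining Hadoop jobs.
      static void configure(JobConf conf) {
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(conf, true);
        SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
      }
    }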
From “hadoop fs X” -> Mounted HDFS




                       See White p. 50;
                       hadoop: src/contrib/fuse-dfs
Hadoop Workflow

                                        1. Load data into HDFS


    2. Develop code locally



                   3. Submit MapReduce job
                   3a. Go back to Step 2
                                                Hadoop Cluster
    You


                                        4. Retrieve data from HDFS
On Amazon: With EC2
                                          0. Allocate Hadoop cluster
                                          1. Load data into HDFS

                                                        EC2
      2. Develop code locally



                     3. Submit MapReduce job
                     3a. Go back to Step 2
      You                                       Your Hadoop Cluster


                                          4. Retrieve data from HDFS
                                          5. Clean up!


Uh oh. Where did the data go?
On Amazon: EC2 and S3

                          Copy from S3 to HDFS

   EC2                                                  S3
 (The Cloud)                                     (Persistent Store)




        Your Hadoop Cluster




                          Copy from HDFS to S3
