Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011



                      Lecture 4
                 September 15, 2011

        Jason Baldridge                      Matt Lease
   Department of Linguistics           School of Information
  University of Texas at Austin    University of Texas at Austin
Jasonbaldridge at gmail dot com   ml at ischool dot utexas dot edu
Acknowledgments
        Course design and slides based on
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures courtesy of the following
excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)
Today’s Agenda
• Practical Hadoop
  – Input/Output
  – Splits: small file and whole file operations
  – Compression
  – Mounting HDFS
  – Hadoop Workflow and EC2/S3
Practical Hadoop
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)



    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);
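
In Hadoop's Java API the pseudocode above translates almost line for line. A minimal sketch using the old org.apache.hadoop.mapred interfaces (class names and the whitespace tokenizer are illustrative, not part of the original deck):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Mapper: emit (word, 1) for every whitespace-delimited token.
    public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();   // re-used across calls

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
          if (token.isEmpty()) continue;
          word.set(token);
          output.collect(word, ONE);
        }
      }
    }

    // Reducer: sum the counts for each word.
    class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }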
Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 17
Courtesy of Chuck Lam’s Hadoop In Action (2010), pp. 48-49
Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 51
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 191
Command-Line Parsing




      Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 135
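
The figure from White (not reproduced here) covers command-line parsing with GenericOptionsParser via the Tool / ToolRunner pattern. A minimal sketch of that pattern, with a hypothetical MyJob driver:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // ToolRunner parses the standard Hadoop options (-D key=value,
    // -files, -libjars, ...) before handing control to run().
    public class MyJob extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        Configuration conf = getConf();   // already reflects -D overrides
        // ... configure and submit the MapReduce job here ...
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJob(), args));
      }
    }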
Data Types in Hadoop

          Writable        Defines a de/serialization protocol.
                          Every data type in Hadoop is a Writable.


     WritableComparable Defines a sort order. All keys must be
                        of this type (but not values).




        IntWritable       Concrete classes for different data types.
        LongWritable
        Text
        …



        SequenceFiles      Binary encoding of a sequence of
                           key/value pairs
Hadoop basic types




             Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 46
Complex Data Types in Hadoop
   How do you implement complex data types?
   The easiest way:
        Encode it as Text, e.g., (a, b) = “a:b”
       Use regular expressions to parse and extract data
       Works, but pretty hack-ish
   The hard way:
        Define a custom implementation of WritableComparable (see sketch below)
       Must implement: readFields, write, compareTo
       Computationally efficient, but slow for rapid prototyping
   Alternatives:
       Cloud9 offers two other choices: Tuple and JSON
       (Actually, not that useful in practice)
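
For concreteness, a minimal sketch of “the hard way”: a hypothetical (a, b) string-pair key implementing the three required methods (hashCode/equals, which a real key should also override, are omitted):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical (a, b) pair usable as a Hadoop key.
    public class TextPair implements WritableComparable<TextPair> {
      private String a = "";
      private String b = "";

      public TextPair() {}                                  // required no-arg constructor
      public TextPair(String a, String b) { this.a = a; this.b = b; }

      @Override
      public void write(DataOutput out) throws IOException {     // serialization
        out.writeUTF(a);
        out.writeUTF(b);
      }

      @Override
      public void readFields(DataInput in) throws IOException {  // deserialization
        a = in.readUTF();
        b = in.readUTF();
      }

      @Override
      public int compareTo(TextPair other) {                     // sort order for keys
        int cmp = a.compareTo(other.a);
        return (cmp != 0) ? cmp : b.compareTo(other.b);
      }
    }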
InputFormat & RecordReader
                                     Courtesy of Tom White’s
                                     Hadoop: The Definitive Guide,
                                     2nd Edition (2010), pp. 198-199




                         Split is logical; atomic
                         records are never split




                        Note re-use of key & value objects!
Courtesy of Tom White’s
Hadoop: The Definitive Guide,
2nd Edition (2010), p. 201
Input




        Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 53
Output




         Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 58
OutputFormat: each Reducer writes its records through a RecordWriter to its own Output File

Source: redrawn from a slide by Cloudera, cc-licensed
Creating Input Splits (White pp. 202-203)




    FileInputFormat: large files split into blocks
        isSplitable() – default TRUE
        computeSplitSize() = max(minSize, min(maxSize, blockSize))
        getSplits()…
    How to prevent splitting?
        Option 1: set mapred.min.split.size=Long.MAX_VALUE
        Option 2: subclass FileInputFormat, override isSplitable() to return FALSE (sketch below)
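
Option 2 amounts to a one-method subclass. A sketch against the old-API TextInputFormat (the class name is made up):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical input format that never splits its files:
    // each file is consumed whole by a single map task.
    public class NonSplittableTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }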
How to process whole file as a single record?

   e.g. file conversion


   Preventing splitting is necessary, but not sufficient
       Need a RecordReader that delivers entire file as a record


   Implement WholeFile input format & record reader recipe
       See White pp. 206-209
       Overrides getRecordReader() in FileInputFormat
       Defines new WholeFileRecordReader
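
A condensed sketch of that recipe in the old mapred API (see White pp. 206-209 for the full version; error handling is trimmed):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Each input file becomes one (NullWritable, BytesWritable) record
    // holding the entire file contents.
    public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) { return false; }

      @Override
      public RecordReader<NullWritable, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
      }
    }

    class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
      private final FileSplit split;
      private final Configuration conf;
      private boolean processed = false;

      WholeFileRecordReader(FileSplit split, Configuration conf) {
        this.split = split;
        this.conf = conf;
      }

      public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) return false;                 // exactly one record per file
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = fs.open(file);
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
          value.set(contents, 0, contents.length);
        } finally {
          IOUtils.closeStream(in);
        }
        processed = true;
        return true;
      }

      public NullWritable createKey() { return NullWritable.get(); }
      public BytesWritable createValue() { return new BytesWritable(); }
      public long getPos() { return processed ? split.getLength() : 0; }
      public float getProgress() { return processed ? 1.0f : 0.0f; }
      public void close() { }
    }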
Small Files
   Files < Hadoop block size are never split (by default)
        Note this is with the default mapred.min.split.size = 1 byte
       Could extend FileInputFormat to override this behavior
   Using many small files inefficient in Hadoop
       Overhead for TaskTracker, JobTracker, Map object, …
       Requires more disk seeks
       Wasteful for NameNode memory
   How to deal with small files??
Dealing with small files
    Pre-processing: merge into one or more bigger files
        Doubles disk space, unless clever (can delete after merge)
        Create Hadoop Archive (White pp. 72-73)
          • Doesn’t solve splitting problem, just reduces NameNode memory
        Simple text: just concatenate (e.g. each record on a single line)
        XML: concatenate, specify start/end tags
          StreamXmlRecordReader (start/end tags delimit records, like newline does for plain Text)
        Create a SequenceFile (see White pp. 117-118)
          • Sequence of records, all with same (key,value) type
          • E.g. Key=filename, Value=text or bytes of original file
          • Can also use for larger files, e.g. if block processing is really fast
    Use CombineFileInputFormat
        Reduces map overhead, but not seeks or NameNode memory…
        Only an abstract class provided, you get to implement it… :-<
        Could use to speed up the pre-processing above…
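
As a sketch of the SequenceFile route, a hypothetical packing tool that writes one (filename, raw bytes) record per small file (paths come from the command line; recursion and error handling omitted):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Pack every file directly under args[0] into one SequenceFile at args[1],
    // keyed by file name with the raw bytes as the value.
    public class SmallFilePacker {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
        try {
          for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            if (status.isDir()) continue;
            byte[] contents = new byte[(int) status.getLen()];
            FSDataInputStream in = fs.open(status.getPath());
            try {
              IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
              IOUtils.closeStream(in);
            }
            writer.append(new Text(status.getPath().getName()),
                          new BytesWritable(contents));
          }
        } finally {
          writer.close();
        }
      }
    }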
Multiple File Formats?
    What if you have multiple formats for same content type?
    MultipleInputs (White pp. 214-215)
        Specify InputFormat & Mapper to use on a per-path basis
          • Path could be a directory or a single file
              • Even a single file could have many records (e.g. Hadoop archive or
                SequenceFile)
        All mappers must have the same output signature!
          • Same reducer used for all (only input format is different, not the
            logical records being processed by the different mappers)
    What about multiple file formats stored in the same
     Archive or SequenceFile?
    Multiple formats stored in the same directory?
    How are multiple file types typically handled in general?
        e.g. factory pattern, White p. 80
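
A sketch of the MultipleInputs wiring in the old mapred API; the two mapper classes and the paths are placeholders for whatever formats you actually have:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class MultiFormatSetup {
      // PlainTextMapper and TabSeparatedMapper are hypothetical; both must
      // emit the same (key, value) output types so one reducer handles all.
      static void configure(JobConf conf) {
        MultipleInputs.addInputPath(conf, new Path("/data/plain"),
            TextInputFormat.class, PlainTextMapper.class);
        MultipleInputs.addInputPath(conf, new Path("/data/tabbed"),
            KeyValueTextInputFormat.class, TabSeparatedMapper.class);
        // conf.setReducerClass(...) — one reducer shared by both inputs
      }
    }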
White 77-86, Lam 153-155

Data Compression
   Big data = big disk space & I/O-bound transfer times
       Affects both intermediate (mapper output) and persistent data
   Compression makes big data less big (but still cool)
       Often 1/4th size of original data
   Main issues
       Does the compression format support splitting?
         • What happens to parallelization if an entire 8GB compressed file has
           to be decompressed before we can access the splits?
       Compression/decompression ratio vs. speed
         • More compression reduces disk space and transfer times, but…
         • Slow compression can take longer than reduced transfer time savings
         • Use native libraries!
Courtesy of Tom White’s
                           Hadoop: The Definitive Guide,
                           2nd Edition (2010), Ch. 4




Slow; decompression can’t keep pace with disk reads
Compression Speed
      LZO 2x faster than gzip
      LZO ~15-20x faster than bzip2
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/




 http://arunxjacob.blogspot.com/2011/04/rolling-out-splittable-lzo-on-cdh3.html
Splittable LZO to the rescue
   LZO format not internally splittable, but we can create
    a separate, accompanying index of split points
Recipe
   Get LZO from Cloudera or elsewhere, and setup
       See URL on last slide for instructions
   LZO compress files, copy to HDFS at /path
   Index them: $ hadoop jar /path/to/hadoop-lzo.jar
    com.hadoop.compression.lzo.LzoIndexer /path
   Use hadoop-lzo’s LzoTextInputFormat instead of TextInputFormat
   Voila!
Compression API for persistent data
   JobConf helper functions –or– set properties
   Input
       conf.setInputFormatClass(LzoTextInputFormat.class);
   Persistent (reducer) output
       FileOutputFormat.setCompressOutput(conf, true)
       FileOutputFormat.setOutputCompressorClass(conf, LzopCodec.class)




                                                              Courtesy of Tom White’s
                                                              Hadoop: The Definitive Guide,
                                                              2nd Edition (2010), p. 85
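
Pulled together, the persistent-output calls might look like the sketch below. For concreteness it uses GzipCodec from core Hadoop; the LzopCodec and LzoTextInputFormat classes from the third-party hadoop-lzo package would slot into the same calls.

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class OutputCompressionSetup {
      static void configure(JobConf conf) {
        conf.setInputFormat(TextInputFormat.class);            // or LzoTextInputFormat
        FileOutputFormat.setCompressOutput(conf, true);        // compress reducer output
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);  // or LzopCodec
      }
    }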
Compression API for intermediate data
    Similar JobConf helper functions –or– set properties
         conf.setCompressMapOutput(true)
         conf.setMapOutputCompressorClass(LzopCodec.class)




                                                     Courtesy of Chuck Lam’s
                                                     Hadoop In Action(2010),
                                                     pp. 153-155
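
The corresponding sketch for intermediate data, again with a core codec standing in where LzopCodec would go:

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOutputCompressionSetup {
      // Compress the intermediate map output that is shuffled to reducers.
      static void configure(JobConf conf) {
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);   // or LzopCodec
      }
    }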
SequenceFile & compression
   Use SequenceFile for passing data between Hadoop jobs
       Optimized for this usage case
       conf.setOutputFormat(SequenceFileOutputFormat.class)
   With compression, one more parameter to set
        Default compression is per-record; per-block compression is almost always preferable
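
A minimal sketch of both settings together (old API):

    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class SequenceFileOutputSetup {
      // Write job output as a block-compressed SequenceFile -- a good default
      // when chaining Hadoop jobs.
      static void configure(JobConf conf) {
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(conf, true);
        SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
      }
    }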
From “hadoop fs X” -> Mounted HDFS




                       See White p. 50;
                       hadoop: src/contrib/fuse-dfs
Hadoop Workflow

                                        1. Load data into HDFS


    2. Develop code locally



                   3. Submit MapReduce job
                   3a. Go back to Step 2
                                                Hadoop Cluster
    You


                                        4. Retrieve data from HDFS
On Amazon: With EC2
                                          0. Allocate Hadoop cluster
                                          1. Load data into HDFS

                                                        EC2
      2. Develop code locally



                     3. Submit MapReduce job
                     3a. Go back to Step 2
      You                                       Your Hadoop Cluster


                                          4. Retrieve data from HDFS
                                          5. Clean up!


Uh oh. Where did the data go?
On Amazon: EC2 and S3

                          Copy from S3 to HDFS

   EC2                                                  S3
 (The Cloud)                                     (Persistent Store)




        Your Hadoop Cluster




                          Copy from HDFS to S3
