MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney

MapReduce
frameworks and methods
Presenter:
Adam Horvath
(adam@teamleadnet.com)

23 May 2012, Sydney Adam Horvath 1

Count the flowers
• Given a flower field
• Given unlimited resources
• How to count the different colours of flowers
in minimal time?


Original problem
• Building product recommendations
– For a ‘web shop’
• Looking for easy to use frameworks
– Not too many generic ones
• Data warehouse
– Way too complex, way too expensive
• Maximum 1 screen to write the whole processing
– If it’s longer, you wouldn’t understand
• Functional languages - steep learning curve
– Neither you nor others would understand


MapReduce based solutions
• Mapreduce data processing
– Takes a while to get used to it
• Hadoop
– Way too hard to start with
• Lightweight frameworks
– A little limitation here and there
• MapReduce.NET
– Roll your own


How can I learn to fly/MapReduce?
• Hadoop, petabytes of data
– Get a Boeing 747
– Pass the required 1500 hours of flying
– Ready to go!
• Simple frameworks, gigabytes of data
– Order a plane model online for $30
– Play with a simulator for 60 minutes
– Ready to go!


What is MapReduce
• Simple framework to parallel batch process
data (even semi- or unstructured data)
• The output is usually reasonable size (human
readable or for other systems)
• Working with key-value pairs
• Two major steps: Map and Reduce


What is not MapReduce
• Database
– The input is usually plain text files
• Online tool
– The process starts in batches, might run for hours
• A language
– Any programming language can be used to create
Map and Reduce steps (depending on the
framework used)
• Compressing cartographic data


Map
• Transform the input data into something that can be
aggregated (counted, grouped…)
• The “SELECT T1,T2… FROM X…” part of the query
• Map from Key1-Value1 to
– Key1-Value2 OR
– Key2-Value2
• Example, log file line:
– “20120501 /urlxxx”
– Input key: might be null or file name
– Input value: the line itself
• Output
– Key: “urlxxx” Value: “20120501”


Reduce
• Aggregates the output of the Map step
• The simplest case is just to output the Map result
without processing
• The “…SUM() … GROUP BY…” part of the query
• Reduces from Key1-Value1 to
– Key1-Value1 (ideally, value1 is a list or a sum)
– Key1-Value2 (input and output are incompatible)
• Input
• Output
– Key: “urlxxx” Values: “20120501, 20120502”


Taken from Google Code University


“Embarrassingly parallel”
• MapReduce only makes sense when the data
can be processed parallel
• Easy to slice the input into smaller
chunks/problems
• Nothing is too big/complex if can be taken
apart
• There is no overlapping/dependence between
the chunks


Structured vs Unstructured data
• MapReduce frequently works on unstructured
data or semi structured data
• Structured is great for answering requests
– Normalised SQL tables
– Typically quick answers
– Reasonable size of data
• Unstructured is great for finding hidden trends
– Logs, documents, images…
– Typically batch processed, slow answers
– Any size of data


History of MapReduce
• Google described it in 2004
• Map and reduce exists in many languages,
manly in functional ones
• Input/output is stored on a distributed file
system (GFS)
• Google Inc. holds PN/7,650,331 - System and
method for efficient large-scale data
processing


Questions for MapReduce
• Create a database with inverted index
• Recommend content based on previous visits
• Parse logs, create visitor analytics
• Process text documents, categorise them
• Process images
• Find partners for life


When to choose SQL
• Small structured data
– Normalised tables
– The database engine is capable of running the
JOINs and GROUP BYs
• Online requirement
– If the answer needs to be served within a second


Typical framework limitations
• Same reduce output as input
– Reduce input: [K1, [V1]], [K1, [V2]], output: [K1,[V1,V2]]
– No need to create mergers but limits the complexity
• Memory
– The hard disk is the new tape
• Debugging
– No framework is flawless but usually hard to debug
• Typed/untyped, generics
– Very hard to understand the process without types
• Network usage
• Faulty/dead machines slow the whole process down


Well known frameworks
• Hadoop – Java based, data stored on HDFS
– HDFS – open source distributed file system for
large files (~Google File System)
– HBase – built on top of HDFS, provides fast key-
value lookup for tables (~Google Bigtable)
• Pig – Map Reduce expressed in Pig Latin
• Hive - SQL over Hadoop
• Sqoop - data from SQL to Hadoop


MapReduce components
• Input
– Reads data storage (files, database…)
• Slicer
– Distributes the huge data into small chunks
• Map
– Transfers the data for easy processing
• Partition/sort
– Sorts the output of Map operation to transfer to Reducers
• Reduce
– Aggregates the data
• Merger
– Merges two or more outputs of Reduces
• After process
– Some cleanup
• Output
– Writes the output of the MapReduce to disk (files, database…)


MapReduce.NET
• Open source
• Written in c#
• Made for couple of machines with several Gb
of data
• Can process TB with batches
• In comparison
– Facebook: Hadoop, 4800 cores, 5.5 PetaBytes,
– 12 TB per node

Counting flowers
• Make small areas on the field
• Place Mappers on each areas
• Have one or more reducer for some areas
• Mapper yells color, reducer takes note
• Another reducer/merger merges the output of
reducers


Counting flowers - code
// MAP
public override void Map(string key, string value, IQueue<Color, int> result)
{
Flower[] flowers = Flower.ParseArr(value);
foreach (var item in flowers)
{
result.Push(item.Color, 1);
}
}

// REDUCE
public override int Reduce(Color key, int value, int result)
{
return result + value;
}


Inverted index
• Mapping from content to location
– “expression” -> file.txt, 44
• Lucene is a great tool utilising inverted indices
• Index types
– DocId list - which documents contain the term
– Positional list - where do the documents contain the terms
– Schema-independent - unique position of the word in the
whole data store
• Stop words
– Can crate noise, like the word “the”
– Omitting it can cause problems
• “The veronicas”


Demo
Inverted index


Product recommendation
• Recommend content based on user behaviour
– Use visit log, but other factors like clicking previous
recommendations and aging are important too
• Process
– MAP
Log  Session-ProdId
– REDUCE
Session-ProdId  Session (ProdId 1, ProdId 2...)
– MAP
Session (ProdId 1, ProdId 2...) 
Transition (ProdId 1-ProdId 2), Transition(ProdId 2 -ProdId 2)
– REDUCE
Transition(ProdId 1-ProdId 2)  ProdId 1 (ProdId 2, ProdId 3…)


Demo
Product recommendation


How to get started
• Download MapReduce.NET, write simple
processors
• Download Hadoop VM from Cloudera
– Fix all the bugs in the VM…
– Start creating distributed processing
• Create theories, prove them using your logs!


Case studies
• Works for me
– Find churning customers/visitors
– Detecting keywords and searching by keyword
– Item recommendation
• Yahoo
– Search, search assist
• eHarmony
– Matchmaking
• Nugg.ad
– Predictive Behavioral Targeting
• Facebook
– Recommendations, etc.


CS101 – why is it slow?
• Use custom data structures
– Built in one like LinkedList, Stack, Queue will blow
up – create fixed size structures
• Serialization is always slow, deserialization is
even worse
– Avro IDL, Apache Thrift IDL, ProtoBuf
• Dictionary*“slow”+++
– is a find/read/modify/find/write operation
– Write custom dictionary with direct indexing


• Do not use synchronisation
– Few locks and waits are okay, thousands are not
• Use CPU native data types
– String is not
• Do not use in-built string operations
– Split, replace, … creates extra String instances
• Don’t use RegEx
• Store data in the file system
– Anything that resembles a database (even SQLite) is at
least a magnitude slower


• Allow thread scheduler to do his job
– Use Thread.Sleep(0) and not Thread.Sleep(1)
• Do not use sorted data structures
– Sort the data when you are done
• Reflection is fine to call a long running method
– But slow to call a function many times
• Choose hash function very carefully
– Validate the choice, collision is not your friend
• Never performance test under Virtual Machines
• Win32 on 32 bit platform
– 2Gb max process size
– On some Linux distributions depending on Kernel this holds too


URLs
• This presentation
http://slidesha.re/mapreduce-intro
• MapReduce.NET
http://code.google.com/p/mapreduce-net
• Apache Hadoop
http://hadoop.apache.org
• Apache Pig
http://pig.apache.org
• Apache Hive
http://hive.apache.org
• Cloudera Hadoop demo VM
https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+De
mo+VM
• A Comparison of Approaches to Large-Scale Data Analysis
http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/be
nchmarks-sigmod09.pdf
• Information Retrieval: Implementing and Evaluating Search Engines
http://www.ir.uwaterloo.ca/book


Questions

Adam Horvath


After hour labs

Adam Horvath


MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney

Recomendados

Recomendados

Mais conteúdo relacionado

Semelhante a MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney

Semelhante a MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney (20)

Último

Último (20)

MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney