SlideShare uma empresa Scribd logo
MapReduce
   frameworks and methods
    Presenter:
          Adam Horvath
          (adam@teamleadnet.com)


23 May 2012, Sydney    Adam Horvath   1
Count the flowers
• Given a flower field
• Given unlimited resources
• How to count the different colours of flowers
  in minimal time?




23 May 2012, Sydney         Adam Horvath          2
Original problem
• Building product recommendations
      – For a ‘web shop’
• Looking for easy to use frameworks
      – Not too many generic ones
• Data warehouse
      – Way too complex, way too expensive
• Maximum 1 screen to write the whole processing
      – If it’s longer, you wouldn’t understand
• Functional languages - steep learning curve
      – Neither you nor others would understand

23 May 2012, Sydney          Adam Horvath          3
MapReduce based solutions
• Mapreduce data processing
      – Takes a while to get used to it
• Hadoop
      – Way too hard to start with
• Lightweight frameworks
      – A little limitation here and there
• MapReduce.NET
      – Roll your own

23 May 2012, Sydney         Adam Horvath     4
How can I learn to fly/MapReduce?
• Hadoop, petabytes of data
      – Get a Boeing 747
      – Pass the required 1500 hours of flying
      – Ready to go!
• Simple frameworks, gigabytes of data
      – Order a plane model online for $30
      – Play with a simulator for 60 minutes
      – Ready to go!

23 May 2012, Sydney        Adam Horvath          5
What is MapReduce
• Simple framework to parallel batch process
  data (even semi- or unstructured data)
• The output is usually reasonable size (human
  readable or for other systems)
• Working with key-value pairs
• Two major steps: Map and Reduce



23 May 2012, Sydney         Adam Horvath         6
What is not MapReduce
• Database
      – The input is usually plain text files
• Online tool
      – The process starts in batches, might run for hours
• A language
      – Any programming language can be used to create
        Map and Reduce steps (depending on the
        framework used)
• Compressing cartographic data

23 May 2012, Sydney          Adam Horvath                    7
Map
• Transform the input data into something that can be
  aggregated (counted, grouped…)
• The “SELECT T1,T2… FROM X…” part of the query
• Map from Key1-Value1 to
      – Key1-Value2 OR
      – Key2-Value2
• Example, log file line:
      – “20120501 /urlxxx”
      – Input key: might be null or file name
      – Input value: the line itself
• Output
      – Key: “urlxxx” Value: “20120501”


23 May 2012, Sydney              Adam Horvath           8
Reduce
• Aggregates the output of the Map step
• The simplest case is just to output the Map result
  without processing
• The “…SUM() … GROUP BY…” part of the query
• Reduces from Key1-Value1 to
      – Key1-Value1 (ideally, value1 is a list or a sum)
      – Key1-Value2 (input and output are incompatible)
• Input
      – Key: “urlxxx” Value: “20120501”
      – Key: “urlxxx” Value: “20120502”
• Output
      – Key: “urlxxx” Values: “20120501, 20120502”

23 May 2012, Sydney           Adam Horvath                 9
Taken from Google Code University

      23 May 2012, Sydney           Adam Horvath   10
“Embarrassingly parallel”
• MapReduce only makes sense when the data
  can be processed parallel
• Easy to slice the input into smaller
  chunks/problems
• Nothing is too big/complex if can be taken
  apart
• There is no overlapping/dependence between
  the chunks

23 May 2012, Sydney       Adam Horvath      11
Structured vs Unstructured data
• MapReduce frequently works on unstructured
  data or semi structured data
• Structured is great for answering requests
      – Normalised SQL tables
      – Typically quick answers
      – Reasonable size of data
• Unstructured is great for finding hidden trends
      – Logs, documents, images…
      – Typically batch processed, slow answers
      – Any size of data


23 May 2012, Sydney         Adam Horvath            12
History of MapReduce
• Google described it in 2004
• Map and reduce exists in many languages,
  manly in functional ones
• Input/output is stored on a distributed file
  system (GFS)
• Google Inc. holds PN/7,650,331 - System and
  method for efficient large-scale data
  processing

23 May 2012, Sydney           Adam Horvath       13
Questions for MapReduce
•   Create a database with inverted index
•   Recommend content based on previous visits
•   Parse logs, create visitor analytics
•   Process text documents, categorise them
•   Process images
•   Find partners for life



23 May 2012, Sydney     Adam Horvath             14
When to choose SQL
• Small structured data
      – Normalised tables
      – The database engine is capable of running the
        JOINs and GROUP BYs
• Online requirement
      – If the answer needs to be served within a second




23 May 2012, Sydney          Adam Horvath                  15
Typical framework limitations
• Same reduce output as input
      – Reduce input: [K1, [V1]], [K1, [V2]], output: [K1,[V1,V2]]
      – No need to create mergers but limits the complexity
• Memory
      – The hard disk is the new tape
• Debugging
      – No framework is flawless but usually hard to debug
• Typed/untyped, generics
      – Very hard to understand the process without types
• Network usage
• Faulty/dead machines slow the whole process down

23 May 2012, Sydney             Adam Horvath                         16
Well known frameworks
• Hadoop – Java based, data stored on HDFS
      – HDFS – open source distributed file system for
        large files (~Google File System)
      – HBase – built on top of HDFS, provides fast key-
        value lookup for tables (~Google Bigtable)
• Pig – Map Reduce expressed in Pig Latin
• Hive - SQL over Hadoop
• Sqoop - data from SQL to Hadoop

23 May 2012, Sydney        Adam Horvath                    17
MapReduce components
• Input
      – Reads data storage (files, database…)
• Slicer
      – Distributes the huge data into small chunks
• Map
      – Transfers the data for easy processing
• Partition/sort
      – Sorts the output of Map operation to transfer to Reducers
• Reduce
      – Aggregates the data
• Merger
      – Merges two or more outputs of Reduces
• After process
      – Some cleanup
• Output
      – Writes the output of the MapReduce to disk (files, database…)


23 May 2012, Sydney                    Adam Horvath                     18
MapReduce.NET
• Open source
• Written in c#
• Made for couple of machines with several Gb
  of data
• Can process TB with batches
• In comparison
      – Facebook: Hadoop, 4800 cores, 5.5 PetaBytes,
      – 12 TB per node
23 May 2012, Sydney       Adam Horvath                 19
Counting flowers
•   Make small areas on the field
•   Place Mappers on each areas
•   Have one or more reducer for some areas
•   Mapper yells color, reducer takes note
•   Another reducer/merger merges the output of
    reducers



23 May 2012, Sydney         Adam Horvath      20
Counting flowers - code
// MAP
public override void Map(string key, string value, IQueue<Color, int> result)
{
         Flower[] flowers = Flower.ParseArr(value);
         foreach (var item in flowers)
         {
                  result.Push(item.Color, 1);
         }
}

// REDUCE
public override int Reduce(Color key, int value, int result)
{
         return result + value;
}

23 May 2012, Sydney                 Adam Horvath                                21
Inverted index
• Mapping from content to location
      – “expression” -> file.txt, 44
• Lucene is a great tool utilising inverted indices
• Index types
      – DocId list - which documents contain the term
      – Positional list - where do the documents contain the terms
      – Schema-independent - unique position of the word in the
        whole data store
• Stop words
      – Can crate noise, like the word “the”
      – Omitting it can cause problems
             • “The veronicas”

23 May 2012, Sydney              Adam Horvath                    22
Demo
                      Inverted index




23 May 2012, Sydney       Adam Horvath   23
Product recommendation
• Recommend content based on user behaviour
      – Use visit log, but other factors like clicking previous
        recommendations and aging are important too
• Process
      – MAP
             Log  Session-ProdId
      – REDUCE
             Session-ProdId  Session (ProdId 1, ProdId 2...)
      – MAP
             Session (ProdId 1, ProdId 2...) 
              Transition (ProdId 1-ProdId 2), Transition(ProdId 2 -ProdId 2)
      – REDUCE
             Transition(ProdId 1-ProdId 2)  ProdId 1 (ProdId 2, ProdId 3…)


23 May 2012, Sydney                  Adam Horvath                              24
Demo
                      Product recommendation




23 May 2012, Sydney           Adam Horvath     25
How to get started
• Download MapReduce.NET, write simple
  processors
• Download Hadoop VM from Cloudera
      – Fix all the bugs in the VM…
      – Start creating distributed processing
• Create theories, prove them using your logs!



23 May 2012, Sydney          Adam Horvath        26
Case studies
• Works for me
      – Find churning customers/visitors
      – Detecting keywords and searching by keyword
      – Item recommendation
• Yahoo
      – Search, search assist
• eHarmony
      – Matchmaking
• Nugg.ad
      – Predictive Behavioral Targeting
• Facebook
      – Recommendations, etc.


23 May 2012, Sydney              Adam Horvath         27
CS101 – why is it slow?
• Use custom data structures
      – Built in one like LinkedList, Stack, Queue will blow
        up – create fixed size structures
• Serialization is always slow, deserialization is
  even worse
      – Avro IDL, Apache Thrift IDL, ProtoBuf
• Dictionary*“slow”+++
      – is a find/read/modify/find/write operation
      – Write custom dictionary with direct indexing

23 May 2012, Sydney            Adam Horvath                    28
CS101 – why is it slow?
• Do not use synchronisation
      – Few locks and waits are okay, thousands are not
• Use CPU native data types
      – String is not
• Do not use in-built string operations
      – Split, replace, … creates extra String instances
• Don’t use RegEx
• Store data in the file system
      – Anything that resembles a database (even SQLite) is at
        least a magnitude slower

23 May 2012, Sydney            Adam Horvath                  29
CS101 – why is it slow?
• Allow thread scheduler to do his job
      – Use Thread.Sleep(0) and not Thread.Sleep(1)
• Do not use sorted data structures
      – Sort the data when you are done
• Reflection is fine to call a long running method
      – But slow to call a function many times
• Choose hash function very carefully
      – Validate the choice, collision is not your friend
• Never performance test under Virtual Machines
• Win32 on 32 bit platform
      – 2Gb max process size
      – On some Linux distributions depending on Kernel this holds too



23 May 2012, Sydney               Adam Horvath                       30
URLs
• This presentation
  http://slidesha.re/mapreduce-intro
• MapReduce.NET
  http://code.google.com/p/mapreduce-net
• Apache Hadoop
  http://hadoop.apache.org
• Apache Pig
  http://pig.apache.org
• Apache Hive
  http://hive.apache.org
• Cloudera Hadoop demo VM
  https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+De
  mo+VM
• A Comparison of Approaches to Large-Scale Data Analysis
  http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/be
  nchmarks-sigmod09.pdf
• Information Retrieval: Implementing and Evaluating Search Engines
  http://www.ir.uwaterloo.ca/book

23 May 2012, Sydney          Adam Horvath                        31
Questions




    Adam Horvath
    (adam@teamleadnet.com)



23 May 2012, Sydney     Adam Horvath   32
After hour labs




    Adam Horvath
    (adam@teamleadnet.com)



23 May 2012, Sydney        Adam Horvath   33

Mais conteúdo relacionado

Semelhante a MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney

Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Andrew Brust
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
Csaba Toth
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
Andrew Brust
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
Cloudera, Inc.
 

Semelhante a MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney (20)

Hadoop
HadoopHadoop
Hadoop
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at mylife
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache Hadoop
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 

Último

Industrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training ReportIndustrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training Report
Avinash Rai
 
ppt your views.ppt your views of your college in your eyes
ppt your views.ppt your views of your college in your eyesppt your views.ppt your views of your college in your eyes
ppt your views.ppt your views of your college in your eyes
ashishpaul799
 

Último (20)

Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdfDanh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
 
Basic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumersBasic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumers
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
NCERT Solutions Power Sharing Class 10 Notes pdf
NCERT Solutions Power Sharing Class 10 Notes pdfNCERT Solutions Power Sharing Class 10 Notes pdf
NCERT Solutions Power Sharing Class 10 Notes pdf
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
 
Basic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
Basic Civil Engg Notes_Chapter-6_Environment Pollution & EngineeringBasic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
Basic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
 
Gyanartha SciBizTech Quiz slideshare.pptx
Gyanartha SciBizTech Quiz slideshare.pptxGyanartha SciBizTech Quiz slideshare.pptx
Gyanartha SciBizTech Quiz slideshare.pptx
 
B.ed spl. HI pdusu exam paper-2023-24.pdf
B.ed spl. HI pdusu exam paper-2023-24.pdfB.ed spl. HI pdusu exam paper-2023-24.pdf
B.ed spl. HI pdusu exam paper-2023-24.pdf
 
Basic_QTL_Marker-assisted_Selection_Sourabh.ppt
Basic_QTL_Marker-assisted_Selection_Sourabh.pptBasic_QTL_Marker-assisted_Selection_Sourabh.ppt
Basic_QTL_Marker-assisted_Selection_Sourabh.ppt
 
Industrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training ReportIndustrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training Report
 
ppt your views.ppt your views of your college in your eyes
ppt your views.ppt your views of your college in your eyesppt your views.ppt your views of your college in your eyes
ppt your views.ppt your views of your college in your eyes
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Open Educational Resources Primer PowerPoint
Open Educational Resources Primer PowerPointOpen Educational Resources Primer PowerPoint
Open Educational Resources Primer PowerPoint
 
Benefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational ResourcesBenefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational Resources
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
[GDSC YCCE] Build with AI Online Presentation
[GDSC YCCE] Build with AI Online Presentation[GDSC YCCE] Build with AI Online Presentation
[GDSC YCCE] Build with AI Online Presentation
 
Operations Management - Book1.p - Dr. Abdulfatah A. Salem
Operations Management - Book1.p  - Dr. Abdulfatah A. SalemOperations Management - Book1.p  - Dr. Abdulfatah A. Salem
Operations Management - Book1.p - Dr. Abdulfatah A. Salem
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
The Last Leaf, a short story by O. Henry
The Last Leaf, a short story by O. HenryThe Last Leaf, a short story by O. Henry
The Last Leaf, a short story by O. Henry
 

MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney

  • 1. MapReduce frameworks and methods Presenter: Adam Horvath (adam@teamleadnet.com) 23 May 2012, Sydney Adam Horvath 1
  • 2. Count the flowers • Given a flower field • Given unlimited resources • How to count the different colours of flowers in minimal time? 23 May 2012, Sydney Adam Horvath 2
  • 3. Original problem • Building product recommendations – For a ‘web shop’ • Looking for easy to use frameworks – Not too many generic ones • Data warehouse – Way too complex, way too expensive • Maximum 1 screen to write the whole processing – If it’s longer, you wouldn’t understand • Functional languages - steep learning curve – Neither you nor others would understand 23 May 2012, Sydney Adam Horvath 3
  • 4. MapReduce based solutions • Mapreduce data processing – Takes a while to get used to it • Hadoop – Way too hard to start with • Lightweight frameworks – A little limitation here and there • MapReduce.NET – Roll your own 23 May 2012, Sydney Adam Horvath 4
  • 5. How can I learn to fly/MapReduce? • Hadoop, petabytes of data – Get a Boeing 747 – Pass the required 1500 hours of flying – Ready to go! • Simple frameworks, gigabytes of data – Order a plane model online for $30 – Play with a simulator for 60 minutes – Ready to go! 23 May 2012, Sydney Adam Horvath 5
  • 6. What is MapReduce • Simple framework to parallel batch process data (even semi- or unstructured data) • The output is usually reasonable size (human readable or for other systems) • Working with key-value pairs • Two major steps: Map and Reduce 23 May 2012, Sydney Adam Horvath 6
  • 7. What is not MapReduce • Database – The input is usually plain text files • Online tool – The process starts in batches, might run for hours • A language – Any programming language can be used to create Map and Reduce steps (depending on the framework used) • Compressing cartographic data 23 May 2012, Sydney Adam Horvath 7
  • 8. Map • Transform the input data into something that can be aggregated (counted, grouped…) • The “SELECT T1,T2… FROM X…” part of the query • Map from Key1-Value1 to – Key1-Value2 OR – Key2-Value2 • Example, log file line: – “20120501 /urlxxx” – Input key: might be null or file name – Input value: the line itself • Output – Key: “urlxxx” Value: “20120501” 23 May 2012, Sydney Adam Horvath 8
  • 9. Reduce • Aggregates the output of the Map step • The simplest case is just to output the Map result without processing • The “…SUM() … GROUP BY…” part of the query • Reduces from Key1-Value1 to – Key1-Value1 (ideally, value1 is a list or a sum) – Key1-Value2 (input and output are incompatible) • Input – Key: “urlxxx” Value: “20120501” – Key: “urlxxx” Value: “20120502” • Output – Key: “urlxxx” Values: “20120501, 20120502” 23 May 2012, Sydney Adam Horvath 9
  • 10. Taken from Google Code University 23 May 2012, Sydney Adam Horvath 10
  • 11. “Embarrassingly parallel” • MapReduce only makes sense when the data can be processed parallel • Easy to slice the input into smaller chunks/problems • Nothing is too big/complex if can be taken apart • There is no overlapping/dependence between the chunks 23 May 2012, Sydney Adam Horvath 11
  • 12. Structured vs Unstructured data • MapReduce frequently works on unstructured data or semi structured data • Structured is great for answering requests – Normalised SQL tables – Typically quick answers – Reasonable size of data • Unstructured is great for finding hidden trends – Logs, documents, images… – Typically batch processed, slow answers – Any size of data 23 May 2012, Sydney Adam Horvath 12
  • 13. History of MapReduce • Google described it in 2004 • Map and reduce exists in many languages, manly in functional ones • Input/output is stored on a distributed file system (GFS) • Google Inc. holds PN/7,650,331 - System and method for efficient large-scale data processing 23 May 2012, Sydney Adam Horvath 13
  • 14. Questions for MapReduce • Create a database with inverted index • Recommend content based on previous visits • Parse logs, create visitor analytics • Process text documents, categorise them • Process images • Find partners for life 23 May 2012, Sydney Adam Horvath 14
  • 15. When to choose SQL • Small structured data – Normalised tables – The database engine is capable of running the JOINs and GROUP BYs • Online requirement – If the answer needs to be served within a second 23 May 2012, Sydney Adam Horvath 15
  • 16. Typical framework limitations • Same reduce output as input – Reduce input: [K1, [V1]], [K1, [V2]], output: [K1,[V1,V2]] – No need to create mergers but limits the complexity • Memory – The hard disk is the new tape • Debugging – No framework is flawless but usually hard to debug • Typed/untyped, generics – Very hard to understand the process without types • Network usage • Faulty/dead machines slow the whole process down 23 May 2012, Sydney Adam Horvath 16
  • 17. Well known frameworks • Hadoop – Java based, data stored on HDFS – HDFS – open source distributed file system for large files (~Google File System) – HBase – built on top of HDFS, provides fast key- value lookup for tables (~Google Bigtable) • Pig – Map Reduce expressed in Pig Latin • Hive - SQL over Hadoop • Sqoop - data from SQL to Hadoop 23 May 2012, Sydney Adam Horvath 17
  • 18. MapReduce components • Input – Reads data storage (files, database…) • Slicer – Distributes the huge data into small chunks • Map – Transfers the data for easy processing • Partition/sort – Sorts the output of Map operation to transfer to Reducers • Reduce – Aggregates the data • Merger – Merges two or more outputs of Reduces • After process – Some cleanup • Output – Writes the output of the MapReduce to disk (files, database…) 23 May 2012, Sydney Adam Horvath 18
  • 19. MapReduce.NET • Open source • Written in c# • Made for couple of machines with several Gb of data • Can process TB with batches • In comparison – Facebook: Hadoop, 4800 cores, 5.5 PetaBytes, – 12 TB per node 23 May 2012, Sydney Adam Horvath 19
  • 20. Counting flowers • Make small areas on the field • Place Mappers on each areas • Have one or more reducer for some areas • Mapper yells color, reducer takes note • Another reducer/merger merges the output of reducers 23 May 2012, Sydney Adam Horvath 20
  • 21. Counting flowers - code // MAP public override void Map(string key, string value, IQueue<Color, int> result) { Flower[] flowers = Flower.ParseArr(value); foreach (var item in flowers) { result.Push(item.Color, 1); } } // REDUCE public override int Reduce(Color key, int value, int result) { return result + value; } 23 May 2012, Sydney Adam Horvath 21
  • 22. Inverted index • Mapping from content to location – “expression” -> file.txt, 44 • Lucene is a great tool utilising inverted indices • Index types – DocId list - which documents contain the term – Positional list - where do the documents contain the terms – Schema-independent - unique position of the word in the whole data store • Stop words – Can crate noise, like the word “the” – Omitting it can cause problems • “The veronicas” 23 May 2012, Sydney Adam Horvath 22
  • 23. Demo Inverted index 23 May 2012, Sydney Adam Horvath 23
  • 24. Product recommendation • Recommend content based on user behaviour – Use visit log, but other factors like clicking previous recommendations and aging are important too • Process – MAP Log  Session-ProdId – REDUCE Session-ProdId  Session (ProdId 1, ProdId 2...) – MAP Session (ProdId 1, ProdId 2...)  Transition (ProdId 1-ProdId 2), Transition(ProdId 2 -ProdId 2) – REDUCE Transition(ProdId 1-ProdId 2)  ProdId 1 (ProdId 2, ProdId 3…) 23 May 2012, Sydney Adam Horvath 24
  • 25. Demo Product recommendation 23 May 2012, Sydney Adam Horvath 25
  • 26. How to get started • Download MapReduce.NET, write simple processors • Download Hadoop VM from Cloudera – Fix all the bugs in the VM… – Start creating distributed processing • Create theories, prove them using your logs! 23 May 2012, Sydney Adam Horvath 26
  • 27. Case studies • Works for me – Find churning customers/visitors – Detecting keywords and searching by keyword – Item recommendation • Yahoo – Search, search assist • eHarmony – Matchmaking • Nugg.ad – Predictive Behavioral Targeting • Facebook – Recommendations, etc. 23 May 2012, Sydney Adam Horvath 27
  • 28. CS101 – why is it slow? • Use custom data structures – Built in one like LinkedList, Stack, Queue will blow up – create fixed size structures • Serialization is always slow, deserialization is even worse – Avro IDL, Apache Thrift IDL, ProtoBuf • Dictionary*“slow”+++ – is a find/read/modify/find/write operation – Write custom dictionary with direct indexing 23 May 2012, Sydney Adam Horvath 28
  • 29. CS101 – why is it slow? • Do not use synchronisation – Few locks and waits are okay, thousands are not • Use CPU native data types – String is not • Do not use in-built string operations – Split, replace, … creates extra String instances • Don’t use RegEx • Store data in the file system – Anything that resembles a database (even SQLite) is at least a magnitude slower 23 May 2012, Sydney Adam Horvath 29
  • 30. CS101 – why is it slow? • Allow thread scheduler to do his job – Use Thread.Sleep(0) and not Thread.Sleep(1) • Do not use sorted data structures – Sort the data when you are done • Reflection is fine to call a long running method – But slow to call a function many times • Choose hash function very carefully – Validate the choice, collision is not your friend • Never performance test under Virtual Machines • Win32 on 32 bit platform – 2Gb max process size – On some Linux distributions depending on Kernel this holds too 23 May 2012, Sydney Adam Horvath 30
  • 31. URLs • This presentation http://slidesha.re/mapreduce-intro • MapReduce.NET http://code.google.com/p/mapreduce-net • Apache Hadoop http://hadoop.apache.org • Apache Pig http://pig.apache.org • Apache Hive http://hive.apache.org • Cloudera Hadoop demo VM https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+De mo+VM • A Comparison of Approaches to Large-Scale Data Analysis http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/be nchmarks-sigmod09.pdf • Information Retrieval: Implementing and Evaluating Search Engines http://www.ir.uwaterloo.ca/book 23 May 2012, Sydney Adam Horvath 31
  • 32. Questions Adam Horvath (adam@teamleadnet.com) 23 May 2012, Sydney Adam Horvath 32
  • 33. After hour labs Adam Horvath (adam@teamleadnet.com) 23 May 2012, Sydney Adam Horvath 33