SlideShare uma empresa Scribd logo
1 de 20
Baixar para ler offline
Elastic MapReduce
   Outsourcing BigData

        Nathan McCourtney
            @beaknit
What is MapReduce?
From Wikipedia:

MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use
different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a
database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker
nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the
smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way
to form the output – the answer to the problem it was originally trying to solve.
The Map
Mapping involves taking raw data and converting it into a
series of symbols.

For example, DNA sequencing:
ddATP   ->   A
ddGTP   ->   G
ddCTP   ->   C
ddTTP   ->   T

Results in representations like "GATTACA"
Practical Mapping
Inputs are generally flat-files containing lines of text.
   clever_critters.txt:
       foxes are clever
       cats are clever




Files are read in and fed to a mapper one line at a time via
STDIN.
   cat clever_critters.txt | mapper.rb
Practical Mapping Cont'd
The mapper processes the line and outputs a key/value
pair to STDOUT for each symbol it maps
   foxes 1
   are 1
   clever 1
   cats 1
   are 1
   clever 1
Work Partitioning
These key/value pairs are passed to a "partition function"
which organizes the output and assigns it to reducer nodes

   foxes -> node 1
   are -> node 2
   clever -> node 3
   cat -> node 4
Practical Reduction
The Reducers each receive the sharded
workload assigned to them by the partitioning.

Typically the work is received as a stream of
key/value pairs via STDIN:
 "foxes 1" -> node 1
 "are 1|are 1" -> node 2
 "clever 1|clever 1" -> node 3
 "cats 1|cats 1" -> node 4
Practical Reduction Cont'd
The reduction is essentially whatever you want it to be.
There are common patterns that are often pre-solved by
the map-reduce framework.

See Hadoop's Built-In Reducers

eg, "Aggregate" - give me a total of all the key/values
  foxes - 1
  are - 2
  clever -2
  cats - 1
What is Hadoop?
From wikipedia:
Apache Hadoop is a software framework that supports data-intensive distributed applications under a
free license.[1] It enables applications to work with thousands of computational independent
computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File
System (GFS) papers.


Essentially, Hadoop is a practical implementation of all the pieces you'd need to
accomplish everything we've discussed thus far. It takes in the data, organizes
the tasks, passes the data through its entire path and finally outputs the
reduction.
Hadoop's Guts




source: http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html
Fun to build?



    No
Solution?
Amazon's Elastic MapReduce
Look complex? It's not
1.   Sign up for the service
2.   Download the tools (requires ruby 1.8)
3.   mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli
4.   Create your credentials.json file
      {
      "access_id": "<key>",
      "private_key": "<secret key>",
      "keypair": "<name of keypair>",
      "key-pair-file": "~/.ssh/<key>.pem",
      "log_uri": "s3://<unique s3 bucket/",
      "region": "us-east-1"
      }

5. unzip ~/Downloads/elastic-mapreduce-ruby.zip
Run it

  ruby   elastic-mapreduce        --list
  ruby   elastic-mapreduce        --create --alive
  ruby   elastic-mapreduce        --list
  ruby   elastic-mapreduce        --terminate <JobFlowID>

  Note you can also view it in the Amazon EMR web interface

  Logs can be viewed by looking into the s3 bucket you specified in your
  credentials.json file. Just drill down via the s3 web interface and double-
  click the file.
Creating a minimal job
1. Set up a dedicated s3 bucket

2. Create a folder called "input" in that bucket

3. Upload your inputs into s3://bucket/input
     s3cmd put *log s3://bucket/input
Minimal Job Cont'd
4. Write a mapper
     eg:
     ARGF.each do |line|

        # remove any newline
        line = line.chomp

        if /ERROR/.match(line)
           puts "ERRORt1"
        end
        if /INFO/.match(line)
           puts "INFOt1"
        end
        if /DEBUG/.match(line)
           puts "DEBUGt1"
        end
     end


See http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/ for
examples
Minimal Job Cont'd
5. Upload your mapper to your s3 bucket
     s3cmd put mapper.rb s3://bucket


6. Run it
     elastic-mapreduce --create --stream 
          --mapper s3://bucket/mapper.rb 
          --input   s3://bucket/input 
          --output s3://bucket/output 
          --reducer aggregate


      NOTE: This job uses the built-in aggregator.
      NOTE: The output directory must NOT exist at the time of the run

      Amazon will scale ec2 instances to consume the load dynamically.

7. Pick up your results in the output folder
AWS Demo App
AWS has a very cool publicly-available app to
run:

elastic-mapreduce --create --stream 
     --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py 
     --input   s3://elasticmapreduce/samples/wordcount/input 
     --output s3://bucket/output 
     --reducer aggregate



See Amazon Example Doc
Possibilities
EMR is a fully-functional Hadoop
implementation.

Mappers and reducers can be written in python,
ruby, PHP and Java

Go crazy.
Further Reading
Tom White's O'Reilly on Hadoop

AWS EMR Getting Started Guide

Hadoop Wiki

Mais conteúdo relacionado

Mais procurados

Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceBhupesh Chawda
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2Tianwei Liu
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windowsMuhammad Shahid
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explainedDmytro Sandu
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advancedChirag Ahuja
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsLeila panahi
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014soujavajug
 
Map Reduce
Map ReduceMap Reduce
Map Reduceschapht
 

Mais procurados (18)

Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explained
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Hadoop map reduce v2
Hadoop map reduce v2Hadoop map reduce v2
Hadoop map reduce v2
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 

Destaque

Mi vida (sebastián)
Mi vida (sebastián)Mi vida (sebastián)
Mi vida (sebastián)najuldb
 
hello ( julián)
hello ( julián)hello ( julián)
hello ( julián)najuldb
 
ALL ABOUT ME
ALL ABOUT ME ALL ABOUT ME
ALL ABOUT ME najuldb
 
ALL ABOUT ME ( PAULA)
ALL ABOUT ME ( PAULA) ALL ABOUT ME ( PAULA)
ALL ABOUT ME ( PAULA) najuldb
 
Aws dc elastic-mapreduce
Aws dc elastic-mapreduceAws dc elastic-mapreduce
Aws dc elastic-mapreducebeaknit
 
all about me
all about meall about me
all about menajuldb
 

Destaque (7)

Mi vida (sebastián)
Mi vida (sebastián)Mi vida (sebastián)
Mi vida (sebastián)
 
hello ( julián)
hello ( julián)hello ( julián)
hello ( julián)
 
ALL ABOUT ME
ALL ABOUT ME ALL ABOUT ME
ALL ABOUT ME
 
ALL ABOUT ME ( PAULA)
ALL ABOUT ME ( PAULA) ALL ABOUT ME ( PAULA)
ALL ABOUT ME ( PAULA)
 
Aws dc elastic-mapreduce
Aws dc elastic-mapreduceAws dc elastic-mapreduce
Aws dc elastic-mapreduce
 
all about me
all about meall about me
all about me
 
Mi vida
Mi vidaMi vida
Mi vida
 

Semelhante a Aws dc elastic-mapreduce

Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche SparkAlex Thompson
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Yahoo Developer Network
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionSubhas Kumar Ghosh
 
Hadoop tutorial hand-outs
Hadoop tutorial hand-outsHadoop tutorial hand-outs
Hadoop tutorial hand-outspardhavi reddy
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkVincent Poncet
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 

Semelhante a Aws dc elastic-mapreduce (20)

Scala+data
Scala+dataScala+data
Scala+data
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
Hadoop tutorial hand-outs
Hadoop tutorial hand-outsHadoop tutorial hand-outs
Hadoop tutorial hand-outs
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Eedc.apache.pig last
Eedc.apache.pig lastEedc.apache.pig last
Eedc.apache.pig last
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 

Último

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Último (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Aws dc elastic-mapreduce

  • 1. Elastic MapReduce Outsourcing BigData Nathan McCourtney @beaknit
  • 2. What is MapReduce? From Wikipedia: MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
  • 3. The Map Mapping involves taking raw data and converting it into a series of symbols. For example, DNA sequencing: ddATP -> A ddGTP -> G ddCTP -> C ddTTP -> T Results in representations like "GATTACA"
  • 4. Practical Mapping Inputs are generally flat-files containing lines of text. clever_critters.txt: foxes are clever cats are clever Files are read in and fed to a mapper one line at a time via STDIN. cat clever_critters.txt | mapper.rb
  • 5. Practical Mapping Cont'd The mapper processes the line and outputs a key/value pair to STDOUT for each symbol it maps foxes 1 are 1 clever 1 cats 1 are 1 clever 1
  • 6. Work Partitioning These key/value pairs are passed to a "partition function" which organizes the output and assigns it to reducer nodes foxes -> node 1 are -> node 2 clever -> node 3 cat -> node 4
  • 7. Practical Reduction The Reducers each receive the sharded workload assigned to them by the partitioning. Typically the work is received as a stream of key/value pairs via STDIN: "foxes 1" -> node 1 "are 1|are 1" -> node 2 "clever 1|clever 1" -> node 3 "cats 1|cats 1" -> node 4
  • 8. Practical Reduction Cont'd The reduction is essentially whatever you want it to be. There are common patterns that are often pre-solved by the map-reduce framework. See Hadoop's Built-In Reducers eg, "Aggregate" - give me a total of all the key/values foxes - 1 are - 2 clever -2 cats - 1
  • 9. What is Hadoop? From wikipedia: Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license.[1] It enables applications to work with thousands of computational independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. Essentially, Hadoop is a practical implementation of all the pieces you'd need to accomplish everything we've discussed thus far. It takes in the data, organizes the tasks, passes the data through its entire path and finally outputs the reduction.
  • 13. Look complex? It's not 1. Sign up for the service 2. Download the tools (requires ruby 1.8) 3. mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli 4. Create your credentials.json file { "access_id": "<key>", "private_key": "<secret key>", "keypair": "<name of keypair>", "key-pair-file": "~/.ssh/<key>.pem", "log_uri": "s3://<unique s3 bucket/", "region": "us-east-1" } 5. unzip ~/Downloads/elastic-mapreduce-ruby.zip
  • 14. Run it ruby elastic-mapreduce --list ruby elastic-mapreduce --create --alive ruby elastic-mapreduce --list ruby elastic-mapreduce --terminate <JobFlowID> Note you can also view it in the Amazon EMR web interface Logs can be viewed by looking into the s3 bucket you specified in your credentials.json file. Just drill down via the s3 web interface and double- click the file.
  • 15. Creating a minimal job 1. Set up a dedicated s3 bucket 2. Create a folder called "input" in that bucket 3. Upload your inputs into s3://bucket/input s3cmd put *log s3://bucket/input
  • 16. Minimal Job Cont'd 4. Write a mapper eg: ARGF.each do |line| # remove any newline line = line.chomp if /ERROR/.match(line) puts "ERRORt1" end if /INFO/.match(line) puts "INFOt1" end if /DEBUG/.match(line) puts "DEBUGt1" end end See http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/ for examples
  • 17. Minimal Job Cont'd 5. Upload your mapper to your s3 bucket s3cmd put mapper.rb s3://bucket 6. Run it elastic-mapreduce --create --stream --mapper s3://bucket/mapper.rb --input s3://bucket/input --output s3://bucket/output --reducer aggregate NOTE: This job uses the built-in aggregator. NOTE: The output directory must NOT exist at the time of the run Amazon will scale ec2 instances to consume the load dynamically. 7. Pick up your results in the output folder
  • 18. AWS Demo App AWS has a very cool publicly-available app to run: elastic-mapreduce --create --stream --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --input s3://elasticmapreduce/samples/wordcount/input --output s3://bucket/output --reducer aggregate See Amazon Example Doc
  • 19. Possibilities EMR is a fully-functional Hadoop implementation. Mappers and reducers can be written in python, ruby, PHP and Java Go crazy.
  • 20. Further Reading Tom White's O'Reilly on Hadoop AWS EMR Getting Started Guide Hadoop Wiki