Elastic MapReduce
   Outsourcing BigData

        Nathan McCourtney
            @beaknit
What is MapReduce?
From Wikipedia:

MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use
different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a
database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker
nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the
smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way
to form the output – the answer to the problem it was originally trying to solve.
The Map
Mapping involves taking raw data and converting it into a
series of symbols.

For example, DNA sequencing:
ddATP   ->   A
ddGTP   ->   G
ddCTP   ->   C
ddTTP   ->   T

Results in representations like "GATTACA"
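
The symbol mapping above can be sketched in a few lines of Ruby. This is only an illustration: the table name and the `map_reads` helper are made up for this example, and the input list is an assumed sequence of readings.

```ruby
# Hypothetical lookup table built from the ddNTP -> base pairs on the slide.
BASE_FOR_TERMINATOR = {
  "ddATP" => "A",
  "ddGTP" => "G",
  "ddCTP" => "C",
  "ddTTP" => "T"
}.freeze

# Map each raw reading to its symbol; fetch raises KeyError on unknown input.
def map_reads(reads)
  reads.map { |read| BASE_FOR_TERMINATOR.fetch(read) }.join
end

# Assumed sample input, chosen to spell the slide's example string.
reads = %w[ddGTP ddATP ddTTP ddTTP ddATP ddCTP ddATP]
puts map_reads(reads)
```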
Practical Mapping
Inputs are generally flat-files containing lines of text.
   clever_critters.txt:
       foxes are clever
       cats are clever




Files are read in and fed to a mapper one line at a time via
STDIN.
   cat clever_critters.txt | ruby mapper.rb
Practical Mapping Cont'd
The mapper processes the line and outputs a key/value
pair to STDOUT for each symbol it maps
   foxes 1
   are 1
   clever 1
   cats 1
   are 1
   clever 1
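
A minimal sketch of such a mapper in Ruby, assuming words are separated by whitespace. The `map_line` helper is a name invented here, and the input is held in an array instead of STDIN so the snippet runs on its own. The key and value are separated by a tab, which is the default separator Hadoop streaming looks for.

```ruby
# Emit one "key<TAB>1" pair per whitespace-separated word on a line.
def map_line(line)
  line.split.map { |word| "#{word}\t1" }
end

# The two lines of clever_critters.txt, inlined for the example.
input = ["foxes are clever\n", "cats are clever\n"]
pairs = input.flat_map { |line| map_line(line) }
puts pairs
```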
Work Partitioning
These key/value pairs are passed to a "partition function"
which organizes the output and assigns it to reducer nodes

   foxes -> node 1
   are -> node 2
   clever -> node 3
   cats -> node 4
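
One way to picture a partition function is a hash of the key taken modulo the number of reducers. The sketch below is an illustrative stand-in, not Hadoop's own partitioner: the byte-sum checksum and the `partition` name are assumptions made for this example. Different keys may land on the same node, so the one-key-per-node layout above is the idealized case.

```ruby
# Illustrative partitioner: checksum of the key's bytes modulo the reducer count.
# Returns a zero-based reducer index.
def partition(key, num_reducers)
  key.bytes.sum % num_reducers
end

keys = %w[foxes are clever cats are clever]
keys.each { |key| puts "#{key} -> node #{partition(key, 4) + 1}" }
```

The property that matters is that every occurrence of a key goes to the same reducer, so that reducer sees all of the values for that key.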
Practical Reduction
The Reducers each receive the sharded
workload assigned to them by the partitioning.

Typically the work is received as a stream of
key/value pairs via STDIN:
 "foxes 1" -> node 1
 "are 1|are 1" -> node 2
 "clever 1|clever 1" -> node 3
 "cats 1" -> node 4
Practical Reduction Cont'd
The reduction is essentially whatever you want it to be.
There are common patterns that are often pre-solved by
the map-reduce framework.

See Hadoop's Built-In Reducers

e.g., "Aggregate": give me a total of the values for each key
  foxes - 1
  are - 2
  clever - 2
  cats - 1
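
For comparison, a hand-written summing reducer might look like the Ruby sketch below. It is not Hadoop's built-in aggregator, only an assumed equivalent for tab-separated `key<TAB>value` lines, and the `aggregate` helper name is invented for this example.

```ruby
# Sum the values for each key from a stream of "key<TAB>value" lines.
def aggregate(lines)
  totals = Hash.new(0)
  lines.each do |line|
    key, value = line.chomp.split("\t", 2)
    totals[key] += Integer(value)
  end
  totals
end

# Pairs as a reducer would see them: already grouped by key.
lines = ["are\t1", "are\t1", "cats\t1", "clever\t1", "clever\t1", "foxes\t1"]
aggregate(lines).each { |key, total| puts "#{key} - #{total}" }
```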
What is Hadoop?
From Wikipedia:
Apache Hadoop is a software framework that supports data-intensive distributed applications under a
free license.[1] It enables applications to work with thousands of computationally independent
computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File
System (GFS) papers.


Essentially, Hadoop is a practical implementation of all the pieces you'd need to
accomplish everything we've discussed thus far. It takes in the data, organizes
the tasks, passes the data through its entire path and finally outputs the
reduction.
Hadoop's Guts




source: http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html
Fun to build?



    No
Solution?
Amazon's Elastic MapReduce
Look complex? It's not
1.   Sign up for the service
2.   Download the tools (requires ruby 1.8)
3.   mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli
4.   Create your credentials.json file
      {
      "access_id": "<key>",
      "private_key": "<secret key>",
      "keypair": "<name of keypair>",
      "key-pair-file": "~/.ssh/<key>.pem",
      "log_uri": "s3://<unique s3 bucket>/",
      "region": "us-east-1"
      }

5. unzip ~/Downloads/elastic-mapreduce-ruby.zip
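
Before running anything, it can help to check that credentials.json parses and has every field filled in. The Ruby sketch below assumes the six keys shown in step 4 are the ones required. The `missing_credentials` helper and the sample values are made up for illustration.

```ruby
require "json"

# Keys listed in the credentials.json example in step 4.
REQUIRED_KEYS = %w[access_id private_key keypair key-pair-file log_uri region].freeze

# Return the required keys that are absent or blank in the given JSON text.
def missing_credentials(json_text)
  config = JSON.parse(json_text)
  REQUIRED_KEYS.select { |key| config[key].to_s.strip.empty? }
end

# Deliberately incomplete sample with placeholder values.
sample = '{"access_id": "<key>", "private_key": "<secret key>", "region": "us-east-1"}'
puts "missing: #{missing_credentials(sample).join(', ')}"
```

In practice the text would come from `File.read` on the real file; a `JSON::ParserError` here usually points to a stray comma or quote.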
Run it

  ruby   elastic-mapreduce        --list
  ruby   elastic-mapreduce        --create --alive
  ruby   elastic-mapreduce        --list
  ruby   elastic-mapreduce        --terminate <JobFlowID>

  Note you can also view it in the Amazon EMR web interface

  Logs can be viewed by looking into the s3 bucket you specified in your
  credentials.json file. Just drill down via the s3 web interface and
  double-click the file.
Creating a minimal job
1. Set up a dedicated s3 bucket

2. Create a folder called "input" in that bucket

3. Upload your inputs into s3://bucket/input
     s3cmd put *log s3://bucket/input/
Minimal Job Cont'd
4. Write a mapper
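     e.g.:
     ARGF.each do |line|

        # remove any newline
        line = line.chomp

        # emit "key<TAB>1" for each match; the "LongValueSum:" prefix is what
        # Hadoop's built-in aggregate reducer expects in order to sum the values
        if /ERROR/.match(line)
           puts "LongValueSum:ERROR\t1"
        end
        if /INFO/.match(line)
           puts "LongValueSum:INFO\t1"
        end
        if /DEBUG/.match(line)
           puts "LongValueSum:DEBUG\t1"
        end
     end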


See http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/ for
examples
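
It is cheap to dry-run the map and reduce logic locally before uploading anything. The Ruby sketch below is an assumed stand-in for the whole pipeline: it counts log levels in a few invented log lines and does its own summing, so it does not depend on the key format of any built-in reducer. All helper names and the sample log are hypothetical.

```ruby
LEVELS = %w[ERROR INFO DEBUG].freeze

# Emit one "LEVEL<TAB>1" pair for each log level found on the line.
def map_log_line(line)
  LEVELS.select { |level| line.include?(level) }.map { |level| "#{level}\t1" }
end

# Local stand-in for the sort and reduce stages: sort the pairs, then sum per key.
def dry_run(lines)
  pairs = lines.flat_map { |line| map_log_line(line.chomp) }.sort
  pairs.each_with_object(Hash.new(0)) do |pair, totals|
    key, value = pair.split("\t", 2)
    totals[key] += Integer(value)
  end
end

# Invented sample log lines.
sample_log = [
  "2012-01-01 INFO service started\n",
  "2012-01-01 ERROR disk full\n",
  "2012-01-01 INFO retrying\n"
]
dry_run(sample_log).each { |level, count| puts "#{level}\t#{count}" }
```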
Minimal Job Cont'd
5. Upload your mapper to your s3 bucket
     s3cmd put mapper.rb s3://bucket


6. Run it
     elastic-mapreduce --create --stream \
          --mapper s3://bucket/mapper.rb \
          --input   s3://bucket/input \
          --output s3://bucket/output \
          --reducer aggregate


      NOTE: This job uses the built-in aggregator.
      NOTE: The output directory must NOT exist at the time of the run

      Amazon will scale ec2 instances to consume the load dynamically.

7. Pick up your results in the output folder
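
Hadoop writes reducer output as `part-NNNNN` files containing one `key<TAB>value` line per key. Once such a file is downloaded, a short Ruby sketch like the following could turn it into a hash. The `parse_results` helper and the sample contents are assumptions for illustration, not real job output.

```ruby
# Parse "key<TAB>count" lines into a hash, skipping lines without a tab.
def parse_results(text)
  text.each_line.map { |line| line.chomp.split("\t", 2) }
      .reject { |fields| fields.length < 2 }
      .to_h { |key, count| [key, Integer(count)] }
end

# Hypothetical contents of one downloaded part file.
sample_output = "DEBUG\t12\nERROR\t3\nINFO\t40\n"
parse_results(sample_output).each { |level, count| puts "#{level}: #{count}" }
```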
AWS Demo App
AWS has a very cool publicly-available app to
run:

elastic-mapreduce --create --stream \
     --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
     --input   s3://elasticmapreduce/samples/wordcount/input \
     --output s3://bucket/output \
     --reducer aggregate



See Amazon Example Doc
Possibilities
EMR is a fully-functional Hadoop
implementation.

Mappers and reducers can be written in Python,
Ruby, PHP, and Java

Go crazy.
Further Reading
Tom White's O'Reilly on Hadoop

AWS EMR Getting Started Guide

Hadoop Wiki
