Possible real-world situation
● We have big data and/or very long,
  embarrassingly parallel computation
● Our data may grow fast
● We want to start and try Hadoop asap

● We do not have our own infrastructure
● We do not have Hadoop administrators
● We have limited funds
Possible solution
Amazon Elastic MapReduce (EMR)
● Hadoop framework running on Amazon's
  web-scale infrastructure
EMR Benefits
Elastic (scalable)
● Use one, a hundred, or even thousands of
   instances to process even petabytes of data
● Modify the number of instances while the job
   flow is running
● Start computation within minutes
EMR Benefits
Easy to use
● No configuration necessary
  ○ Do not worry about setting up hardware and
       networking, running, managing and tuning the
       performance of Hadoop cluster
● Easy-to-use tools and plugins available
   ○   AWS Web Management Console
   ○   Command Line Tools by Amazon
   ○   Amazon EMR API, SDK, Libraries
   ○   Plugins for IDEs (e.g. Eclipse & Karmasphere Studio
       for EMR)
EMR Benefits
Reliable
● Built on Amazon's highly available and
  battle-tested infrastructure
● Provision new nodes to replace those that
  fail
● Used by, e.g.: (company logos on the original slide)
EMR Benefits
Cost effective
● Pay for what you use (for each started hour)
● Choose from various instance types to meet
  your requirements
● Possibility to reserve instances for 1 or 3
  years to pay less per hour
EMR Overview
Amazon Elastic MapReduce (Amazon EMR)
works in conjunction with
● Amazon EC2 to rent computing instances
  (with Hadoop installed)
● Amazon S3 to store input and output data,
  scripts/applications and logs
EMR Architectural Overview




* image from the Internet
EC2 Instance Types




* image from Big Data University, Course: "Hadoop and the Amazon Cloud"
EMR Pricing - "On-demand" instances
Standard Family Instances (US East Region)




http://aws.amazon.com/elasticmapreduce/pricing/
EC2 & S3 Pricing - Real-world example
New York Times wanted to host all public
domain articles from 1851 to 1922.
● 11 million articles
● 4 TB of raw image TIFF input data converted
  to 1.5 TB of PDF documents
● 100 EC2 Instances rented
● < 24 hours of computation
● $240 paid (not including storage & bandwidth)
● 1 employee assigned to this task
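The figures above imply a compute price that can be checked with simple arithmetic. The sketch below assumes the full 24 hours were billed (the slide only says "< 24 hours") and that every started hour is charged in full, as stated on the "Cost effective" slide; the function name `billed_cost` is made up for illustration.

```python
import math

def billed_cost(n_instances, runtime_hours, rate_per_instance_hour):
    """Cost of a job when every started instance-hour is billed in full."""
    billed_hours = math.ceil(runtime_hours)  # a started hour counts as a whole hour
    return n_instances * billed_hours * rate_per_instance_hour

# Figures quoted on the slide.
instances = 100
hours = 24        # assumption: upper bound of "< 24 hours"
paid_usd = 240

# Price per instance-hour implied by the slide's numbers.
implied_rate = paid_usd / (instances * hours)
print(f"implied rate: ${implied_rate:.2f} per instance-hour")

# A run of 23.2 hours would still be billed as 24 started hours.
print(f"cost of a 23.2-hour run: ${billed_cost(instances, 23.2, implied_rate):.2f}")
```

Under these assumptions the implied rate is about $0.10 per instance-hour; storage and bandwidth are billed separately.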
EC2 & S3 Pricing - Real-world example



How much did they pay for storage and bandwidth?
S3 Pricing




http://aws.amazon.com/s3/pricing/
EC2 & S3 Pricing Calculator
Simple Monthly Calculator:
http://calculator.s3.amazonaws.com/calc5.html
AWS Free Usage Tier (Per Month)
Available for free to new AWS customers for 12
months following AWS sign-up date e.g.:
● 750 hours of Amazon EC2 Micro Instance
  usage
    ○ 613 MB of memory and 32-bit or 64-bit platform
● 5 GB of Amazon S3 standard storage,
  20,000 Get and 2,000 Put Requests
● 15 GB of bandwidth out aggregated across
  all AWS services
EMR - Support for Hadoop Ecosystem
Develop and run MapReduce applications using:
● Java
● Streaming (e.g. Ruby, Perl, Python, PHP, R,
  or C++)
● Pig
● Hive

HBase can be easily installed using a set of EC2
scripts
EMR - Featured Users




* logos from http://aws.amazon.com/elasticmapreduce/
EMR - Case Study - Yelp

● help people connect
  with great local businesses
● share reviews and insights

● as of November 2010:
  ○ 39 million monthly unique visitors
  ○ in total, 14 million reviews posted
EMR - Case Study - Yelp
● uses S3 to store daily logs (~100GB/day)
  and photos
● uses EMR to power features like
    ○   People who viewed this also viewed
    ○   Review highlights
    ○   Autocomplete in search box
    ○   Top searches
●   implements jobs in Python and uses their
    own open-source library, mrjob, to run them
    on EMR
mrjob - WordCount example
from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        # emit (word, 1) for every word on the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        # add up the 1s emitted for this word
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()
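To see what the framework does with such a mapper and reducer, the same logic can be run locally in plain Python. This is only a sketch of the map, shuffle and reduce phases, not how mrjob or Hadoop implement them; `run_local` is a hypothetical helper that does not exist in mrjob.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word on the line.
    for word in line.split():
        yield word, 1

def reducer(word, occurrences):
    # Reduce phase: sum the counts collected for one word.
    yield word, sum(occurrences)

def run_local(lines):
    """Hypothetical in-memory stand-in for the MapReduce framework."""
    # Shuffle phase: group all mapper values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Call the reducer once per key, in sorted key order.
    result = {}
    for key in sorted(groups):
        for word, total in reducer(key, groups[key]):
            result[word] = total
    return result

counts = run_local(["to be or not to be", "to do"])
print(counts)  # {'be': 2, 'do': 1, 'not': 1, 'or': 1, 'to': 3}
```

On a cluster the grouping step is distributed across machines, but the contract seen by the mapper and the reducer is the same.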
mrjob - run on EMR
$ python wordcount.py -r emr \
    --ec2_instance_type c1.medium \
    --num-ec2-instances 10 \
    's3://input-bucket/*.txt' > output
Demo
Million Song Dataset
● Contains detailed acoustic and contextual
  data for one million popular songs
● ~300 GB of data
● Publicly available
  ○ for download:
      http://www.infochimps.com/collections/million-songs
  ○ for processing using EMR:
      http://tbmmsd.s3.amazonaws.com/
Million Song Dataset
Contains data such as:
● Song's title, year and hotness
● Song's tempo, duration, danceability,
  energy, loudness, segments count, preview
  (URL to mp3 file) and so on
● Artist's name and hotness
Million Song Dataset - Song's density
Song's density* can be defined as the average
number of notes or atomic sounds (called
segments) per second in a song.

        density = segmentCnt / duration
 
 
 
* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
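The density formula is a single division. A minimal sketch follows; the function name and the guard against a non-positive duration are illustrative additions, and the example numbers are made up.

```python
def song_density(segment_count, duration_seconds):
    """Average number of segments per second of a song."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return segment_count / duration_seconds

# Made-up example: 900 segments in a 240-second song.
print(song_density(900, 240))  # 3.75
```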
Million Song Dataset - Task*
Simple music recommendation system
● Calculate density for each song
● Find hot songs with similar density




* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
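One possible reading of "find hot songs with similar density" is sketched below. The tolerance, the hotness threshold and the sample songs are all made-up values for illustration, and the slides do not say how the demo ranks its results.

```python
def similar_hot_songs(songs, target_density, tolerance=0.25, min_hotness=0.5):
    """Return hot songs whose density is within `tolerance` of the target,
    closest density first. Both thresholds are illustrative assumptions."""
    matches = [
        s for s in songs
        if abs(s["density"] - target_density) <= tolerance
        and s["hotness"] >= min_hotness
    ]
    return sorted(matches, key=lambda s: abs(s["density"] - target_density))

# Made-up sample records, not taken from the dataset.
songs = [
    {"title": "A", "density": 3.0,  "hotness": 0.9},
    {"title": "B", "density": 3.5,  "hotness": 0.8},
    {"title": "C", "density": 3.75, "hotness": 0.6},
    {"title": "D", "density": 3.5,  "hotness": 0.1},
    {"title": "E", "density": 5.0,  "hotness": 0.95},
]

print([s["title"] for s in similar_hot_songs(songs, target_density=3.5)])  # ['B', 'C']
```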
Million Song Dataset - MapReduce
Input data
● 339 files
● Each file contains ~3,000 songs
● Each song is represented by one line in
   input file
● Fields are separated by a tab character
Million Song Dataset - MapReduce
Mapper
● Reads a song's data from each line of input
  text
● Calculates the song's density
● Emits the song's density as key, with some other
  details as value

<line_offset, song_data> ->
           <density, (artist_name, song_title, song_url)>
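// Old-style (org.apache.hadoop.mapred) map method. The fields song,
// denstyWritable and songWritable are declared elsewhere in the Mapper class.
public void map(LongWritable key, Text value,
        OutputCollector<FloatWritable, TripleTextWritable> output,
        Reporter reporter) throws IOException {

    // parse one input line into the reusable song object
    song.parseLine(value.toString());
    // skip records without a usable tempo or duration
    if (song.tempo > 0 && song.duration > 0) {
        // calculate density: segments per second
        float density = ((float) song.segmentCnt) / song.duration;

        denstyWritable.set(density);
        songWritable.set(song.artistName, song.title, song.preview);

        // emit <density, (artist_name, song_title, song_url)>
        output.collect(denstyWritable, songWritable);
    }
}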
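For readers more at home in Python, the same mapper logic can be sketched as a generator over tab-separated lines. The column positions below are hypothetical, since the slides do not give the actual field order of the input files, and `map_song` is an illustrative name.

```python
# Hypothetical column positions; the real field order is not given in the slides.
ARTIST_NAME, TITLE, PREVIEW, TEMPO, DURATION, SEGMENT_CNT = range(6)

def map_song(line):
    """Yield (density, (artist_name, song_title, song_url)) for one input line."""
    fields = line.rstrip("\n").split("\t")
    tempo = float(fields[TEMPO])
    duration = float(fields[DURATION])
    segment_cnt = int(fields[SEGMENT_CNT])
    # Same filter as the Java mapper: skip songs without tempo or duration.
    if tempo > 0 and duration > 0:
        density = segment_cnt / duration
        yield density, (fields[ARTIST_NAME], fields[TITLE], fields[PREVIEW])

# Made-up input line in the assumed column order.
line = "Some Artist\tSome Title\thttp://example.com/p.mp3\t120.0\t200.0\t700\n"
print(list(map_song(line)))
```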
Million Song Dataset - MapReduce
Reducer
● Identity Reducer
● Each Reducer gets density values from a
  different range: [i, i+1)*,**

<density, [(artist_name, song_title, song_url)]> ->
               <density, (artist_name, song_title, song_url)>


* thanks to a custom Partitioner
** not optimal partitioning (partitions are not balanced)
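The range-based partitioning can be sketched as follows. This is an assumption about how such a partitioner might work, not the demo's actual Java Partitioner: reducer i receives densities in the half-open range from i to i+1, and anything beyond the last range is clamped into the final reducer. The densities are made up.

```python
import math
from collections import Counter

def density_partition(density, num_reducers):
    """Map a density in the range i <= density < i+1 to reducer i,
    clamping values outside 0..num_reducers-1 into the first or last reducer."""
    index = int(math.floor(density))
    return max(0, min(index, num_reducers - 1))

# Made-up densities, clustered around 3 segments per second.
densities = [0.4, 2.1, 2.9, 3.0, 3.2, 3.7, 3.99, 4.5, 9.8]
load = Counter(density_partition(d, num_reducers=5) for d in densities)
print(sorted(load.items()))  # [(0, 1), (2, 2), (3, 4), (4, 2)]
```

The uneven counts illustrate the footnote about unbalanced partitions: with fixed-width ranges, reducers covering common densities get most of the records while others may get none.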
Demo - used software
● Karmasphere Studio for EMR (Eclipse
  plugin)
  ○ graphical environment that supports the complete
    lifecycle for developing for Amazon Elastic
    MapReduce, including prototyping, developing,
    testing, debugging, deploying and optimizing
    Hadoop Jobs
    (http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)
Demo - used software
● Karmasphere Studio for EMR (Eclipse
  plugin)




images from:
http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html
Video
Please watch the video on the WHUG channel on
YouTube:

http://www.youtube.com/watch?v=Azwilbn8GCs
Thank you!
Join us!
whug.org

Introduction To Elastic MapReduce at WHUG
