How KKBOX uses mrjob to link Python, Hadoop, and AWS


This document summarizes Aaron Lin's presentation about using the mrjob library to write and run Python map-reduce jobs on Hadoop clusters and AWS EMR. It discusses how mrjob allows testing jobs locally, running jobs on Hadoop, and optimizing jobs by trying different AWS instance types to minimize costs. Key points are that mrjob provides an easy way to write map-reduce jobs in Python and run them on various systems, and brute force testing different AWS configurations can help identify lower cost options but may be inefficient.


Aaron Lin
How KKBOX uses mrjob to link Python, Hadoop, and AWS

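Through innovation in networking and technology, we give singers, artists, and their music more platforms and channels for promotion.

We create the most comprehensive music experience for music lovers.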

About me
•  Aaron Lin
–  Head of the KKBOX Research Center
–  aaronlin@kkbox.com
–  http://about.me/aaron.yclin
•  Past work of the KKBOX Research Center

When two huge datasets meet
•  We need map-reduce to run our experiments
–  map-reduce: map → sort → reduce
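The map → sort → reduce flow can be sketched in plain Python, without Hadoop or mrjob. This is only an illustration of the three phases: the play log, the `mapper`/`reducer` functions, and `run_map_reduce` are all invented for this sketch and are not taken from the talk.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical play log: (user_id, song_id) pairs, invented for this sketch
plays = [
    ("u1", "song_a"),
    ("u2", "song_b"),
    ("u3", "song_a"),
    ("u1", "song_c"),
    ("u2", "song_a"),
]


def mapper(record):
    """Map: turn one input record into (key, value) pairs."""
    _user_id, song_id = record
    yield song_id, 1


def reducer(key, values):
    """Reduce: combine every value that shares a key."""
    yield key, sum(values)


def run_map_reduce(records):
    # Map phase: every record is processed independently
    mapped = [pair for record in records for pair in mapper(record)]
    # Sort phase: bring identical keys next to each other
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results


play_counts = run_map_reduce(plays)
print(play_counts)
```

On a real cluster the map and reduce phases run in parallel on many machines, and the sort step is what guarantees that each reducer sees all values for its key together.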

Why use mrjob?
•  What is mrjob?
–  An open-source project started by Yelp
•  https://github.com/Yelp/mrjob
•  Docs: https://pythonhosted.org/mrjob/
–  A Python library for writing map-reduce jobs
–  Works easily with both a Hadoop cluster and AWS

Why use mrjob?
•  Why Python?
–  Because we love Python
•  Why AWS Elastic MapReduce (EMR)?
–  If the Hadoop cluster has no resources left, use EMR
–  If the Hadoop cluster cannot finish the job in time, use EMR
–  mrjob can audit the cost and effectiveness of each job

First mrjob program
•  Three steps
–  Express your problem in map-reduce terms
–  Write your mapper(s)
–  Write your reducer(s)
•  That’s it!
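The slide that showed the actual program did not survive extraction, so what follows is a minimal word-count sketch in the usual mrjob style, not the speaker's code. The file name `wordcount.py` comes from the commands on the later slides; the class name `MRWordCount`, the `WORD_RE` pattern, and the import fallback (which only lets the file load on a machine without mrjob) are assumptions made for this example.

```python
import re

try:
    from mrjob.job import MRJob
except ImportError:
    # Assumption for this sketch: fall back to a plain base class so the
    # mapper/reducer logic can still be imported where mrjob is missing.
    MRJob = object

WORD_RE = re.compile(r"[\w']+")


class MRWordCount(MRJob):
    """Counts how often each word appears in the input text."""

    def mapper(self, _, line):
        # The input key is unused; the value is one line of text
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # counts iterates over every 1 emitted for this word
        yield word, sum(counts)


if __name__ == "__main__" and MRJob is not object:
    # Standard mrjob entry point: parses the command line and runs the job
    MRWordCount.run()
```

The problem is expressed as "emit a 1 per word, then sum per word", which maps directly onto one mapper and one reducer.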

First mrjob program
•  mrjob can run jobs in three ways
–  Locally
–  On a Hadoop cluster
–  On AWS EMR

Run mrjob locally
•  Any of these works
–  python wordcount.py news.txt
–  cat news.txt | python wordcount.py
–  cat news.txt | python wordcount.py --mapper | sort | python wordcount.py --reducer

Run mrjob locally
•  Easy to test, since the mapper and reducer can be run individually
–  cat news.txt | python wordcount.py --mapper
–  cat news.txt | python wordcount.py --mapper | sort | python wordcount.py --reducer
•  Good for development
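To see why the `sort` step sits between `--mapper` and `--reducer`, here is a plain-Python imitation of that shell pipeline. It does not call mrjob; `run_mapper` and `run_reducer` are hypothetical stand-ins, and the tab-separated JSON line format is only meant to resemble what mrjob writes by default, so treat it as an assumption.

```python
import json
from itertools import groupby


def run_mapper(lines):
    """Imitates `--mapper`: read raw text lines, write 'key<TAB>value' lines."""
    for line in lines:
        for word in line.split():
            yield f"{json.dumps(word.lower())}\t{json.dumps(1)}"


def run_reducer(input_lines):
    """Imitates `--reducer`: only merges keys that are adjacent in the input."""
    decoded = (line.split("\t", 1) for line in input_lines)
    for raw_key, group in groupby(decoded, key=lambda kv: kv[0]):
        total = sum(json.loads(raw_value) for _, raw_value in group)
        yield f"{raw_key}\t{json.dumps(total)}"


# Stand-in for the contents of news.txt
news = ["Python loves Hadoop", "hadoop loves python"]

mapper_out = list(run_mapper(news))         # cat news.txt | ... --mapper
shuffled = sorted(mapper_out)               # | sort
reducer_out = list(run_reducer(shuffled))   # | ... --reducer

# Skipping the sort leaves equal keys apart, so counts come out fragmented
unsorted_out = list(run_reducer(mapper_out))

for line in reducer_out:
    print(line)
```

Because each stage only reads and writes lines of text, the mapper can be checked on its own before the reducer is written, which is what makes this workflow convenient during development.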

Run mrjob on EMR
•  Write a .mrjob.conf file in your HOME folder

Instance type of each group
[Figure not recovered; surviving labels: task, c3.2xlarge, c3.2xlarge, m1.small]
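The configuration file itself was shown as an image, so its exact contents are unknown. As a rough sketch, the snippet below builds a comparable EMR section and writes it as JSON, which mrjob accepts as a config format alongside YAML. The option names are the ones used by the mrjob 0.4.x series that was current in 2014; later releases renamed them. Which instance type belongs to which group, and the instance counts, are assumptions. Credentials are deliberately left out.

```python
import json
import os
import tempfile

# Option names follow the mrjob 0.4.x series (assumption about the version
# used in the talk); check the documentation for the release you run.
emr_options = {
    "ec2_master_instance_type": "m1.small",   # assumed role of m1.small
    "ec2_core_instance_type": "c3.2xlarge",   # assumed role
    "ec2_task_instance_type": "c3.2xlarge",   # assumed role
    "num_ec2_core_instances": 2,              # placeholder count
    "num_ec2_task_instances": 2,              # placeholder count
}
config = {"runners": {"emr": emr_options}}

# Written to a scratch directory here so nothing in HOME is overwritten;
# mrjob itself looks for the file at ~/.mrjob.conf
conf_path = os.path.join(tempfile.mkdtemp(), "mrjob.conf")
with open(conf_path, "w") as conf_file:
    json.dump(config, conf_file, indent=2)

with open(conf_path) as conf_file:
    print(conf_file.read())
```

Keeping the instance types in the config file rather than on the command line means the same job script can be rerun against a different cluster layout by editing one file.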

Run mrjob on EMR
•  Use -r to specify the runner
–  python wordcount.py -r emr news.txt
–  python wordcount.py -r emr s3://xxxx/news.txt

Run mrjob on EMR
•  How to audit EMR usage
–  mrjob audit-emr-usage
•  If you get a ValueError caused by a mismatched datetime format
–  Fix it in audit_usage.py in the mrjob folder

Tragedy!
•  Write a cool program to compute it
•  But we don’t know which AWS instance type is best

If you check the official documentation
•  http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-instances.html

I like brute force…
[Figure not recovered; surviving labels: Memory optimized, Compute optimized, General purpose]

Focus on compute-optimized instances
•  For instances with similar cost and the same number of vCPUs, the current-generation instance is better

Focus on compute-optimized instances
•  The configured number of mappers/reducers differs between instance types
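The brute-force comparison amounts to running the same job on several instance types and ranking the results by cost. A minimal sketch of that ranking follows. The `Trial` record, the prices, the runtimes, and the instance counts are all placeholders invented to exercise the code, not measurements from the talk, and billing per started instance-hour is an assumption about the pricing model.

```python
import math
from dataclasses import dataclass


@dataclass
class Trial:
    instance_type: str
    hourly_price: float     # price per instance-hour (placeholder value)
    num_instances: int
    runtime_minutes: float  # wall-clock time of one test run (placeholder)


def cost_of(trial):
    """Cost of one run, assuming each started instance-hour is billed in full."""
    billed_hours = math.ceil(trial.runtime_minutes / 60)
    return trial.hourly_price * trial.num_instances * billed_hours


# Made-up numbers purely for illustration
trials = [
    Trial("c3.2xlarge", 0.50, 4, 50),
    Trial("c3.xlarge", 0.25, 4, 130),
    Trial("m1.small", 0.05, 4, 700),
]

ranked = sorted(trials, key=cost_of)
for trial in ranked:
    print(f"{trial.instance_type}: {cost_of(trial):.2f} per run")

cheapest = ranked[0]
```

With hourly rounding, a pricier instance that finishes inside one billed hour can end up cheaper than a low-cost one that runs for many hours, which is why the comparison has to be measured rather than read off the price list.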

Conclusion
•  This evaluation is specific to this task
•  Brute-force search is too lazy…
•  It cost about 1,500 NTD per run…
•  Hadoop/AWS are buzzwords
–  The money you spend is real
–  Buying some low-cost computers is always an option

Reference
•  mrjob
–  https://github.com/Yelp/mrjob
–  Docs: https://pythonhosted.org/mrjob/
•  Hardware specs of each instance type
–  http://aws.amazon.com/ec2/instance-types/
–  http://aws.amazon.com/ec2/previous-generation/
•  Number of mappers/reducers for each instance type
–  http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H1.0.3.html

Reference
•  Slides and script
–  https://github.com/KKBOX/coscup.tw.2014

We are hiring!
http://www.kkbox.com/jobs/



2. About KKBOX!


7. Why is this talk happening today?

8. It all comes from the awkward problems that KK Tech has run into

9. MORE THAN 10 MILLION USERS

10. MORE THAN 10 MILLION SONGS


24. But in practice there are still a few problems to solve first

