Apache Spark on EMR
Yuyang Lan
SmartNews Inc.
Contents
• Intro
• Recent Spark
• How we use Spark at SmartNews
• Best Practices
Who am I
• @y2_lan
• Engineer at SmartNews Inc. (AD team)
• Hacker, Data Engineer, Beer Lover
If you have any requests or problems, contact @kaiseh :)
About Apache Spark
• Maybe just skip?
• Quick catch-up: RDDs, transformations, and actions
Recent Spark at a glance
• Databricks Cloud goes public
• Spark 1.4.x
• Project Tungsten
• AWS adds support for Apache Spark on EMR
• …
Spark 1.4.x
• SparkR
• DataFrame API
• ML Pipeline
• Streaming UI
• …
Spark at SmartNews
• Ad CTR prediction (logistic regression)
• Scoring articles with Kinesis + Spark Streaming
• Ad-hoc analysis with faster (and Hive-compatible) SQL
• Real-time stats with Kinesis + Spark Streaming
• ML experiments: ad targeting, user clustering, recommendation, …
Best Practices
#1
• Should you use the default Spark that ships with EMR?
• Yes, sure
• EMR 4.0 is great! (Released today?!)
• Hadoop 2.6 + Hive 1.0 + Spark 1.4.1
• Use a custom-built Spark only if you need one of:
• A cutting-edge version
• Native netlib-java (mvn -Pnetlib-lgpl)
• Custom dependency versions
• …
• Install the custom build with --bootstrap-actions bootstrap.json
• Remember to start the SparkHistoryServer yourself
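As a rough illustration of the bootstrap.json mentioned above, here is a minimal sketch that generates one. The Name/Path/Args keys follow the JSON shape the AWS CLI accepts for bootstrap actions; the S3 path, the script name, and its arguments are hypothetical placeholders, not the author's actual setup.

```python
import json


def bootstrap_actions(script_s3_path, args=(), name="Install custom Spark"):
    """Return a one-element bootstrap action list for EMR."""
    return [{"Name": name, "Path": script_s3_path, "Args": list(args)}]


# Hypothetical bucket, script, and arguments, for illustration only.
actions = bootstrap_actions(
    "s3://my-bucket/install-custom-spark.sh",
    ["--spark-version", "1.4.1"],
)
print(json.dumps(actions, indent=2))
```

Saving the printed JSON as bootstrap.json lets the cluster creation command reference it; with the AWS CLI this is typically done through a file:// reference.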
Best Practices
#2
• Run Spark on YARN
• Use yarn-cluster mode to distribute drivers across the cluster
• Specify --jars and --files to distribute the necessary resources
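The submit options above could be assembled roughly as follows. This is only a sketch: the --master yarn-cluster, --jars, and --files options are standard spark-submit options in Spark 1.x, while the class name, jar names, and file names are hypothetical.

```python
import shlex


def build_spark_submit(app_jar, main_class, jars=(), files=(), app_args=()):
    """Build a spark-submit argument list for yarn-cluster mode."""
    cmd = ["spark-submit", "--master", "yarn-cluster", "--class", main_class]
    if jars:
        # Extra jars are passed as one comma-separated list.
        cmd += ["--jars", ",".join(jars)]
    if files:
        # Extra files are shipped to each executor's working directory.
        cmd += ["--files", ",".join(files)]
    cmd.append(app_jar)
    cmd.extend(app_args)
    return cmd


# Hypothetical application and resources.
cmd = build_spark_submit(
    "ctr-training.jar",
    "com.example.CtrTraining",
    jars=["libs/feature-utils.jar"],
    files=["conf/model.properties"],
)
print(" ".join(shlex.quote(part) for part in cmd))
```

In yarn-cluster mode the driver runs inside a YARN container, so anything it needs locally has to be shipped this way rather than read from the submitting machine.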
Best Practices
#3
• Tuning Memory
• A CPU shortage only slows your program down, but a memory shortage makes it crash
• You can even set --executor-cores higher than your number of CPU cores
• Cache-able heap != JVM’s Xmx (normally about 50% of it)
• (Memory layout diagram from: http://0x0fff.com/spark-architecture/)
• Settings to check:
• spark.yarn.executor.memoryOverhead
• spark.executor.memory
• spark.storage.memoryFraction
• …
• Split your executors if HEAP_SIZE > 64GB (GC)
• Log GC activity with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
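To make the "cache-able heap != Xmx" point concrete, here is a back-of-the-envelope sketch. It assumes the Spark 1.x legacy memory model and what I believe are the 1.4 defaults: spark.storage.memoryFraction = 0.6, a storage safety fraction of 0.9, and a YARN overhead of max(384 MB, 10% of executor memory). Treat these numbers as assumptions and check them against your Spark version; the function name is hypothetical.

```python
def executor_memory_budget(executor_memory_mb,
                           storage_fraction=0.6,   # assumed spark.storage.memoryFraction default
                           safety_fraction=0.9,    # assumed storage safety fraction default
                           overhead_mb=None):      # spark.yarn.executor.memoryOverhead
    """Estimate the YARN container size and cache-able heap of one executor."""
    if overhead_mb is None:
        # Assumed default: 10% of executor memory, at least 384 MB.
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    cacheable_mb = int(executor_memory_mb * storage_fraction * safety_fraction)
    return {
        "yarn_container_mb": executor_memory_mb + overhead_mb,
        "cacheable_mb": cacheable_mb,
        "cacheable_ratio": round(cacheable_mb / executor_memory_mb, 2),
    }


budget = executor_memory_budget(10240)  # spark.executor.memory = 10g
print(budget)
```

Under these assumptions a 10 GB executor asks YARN for about 11 GB yet can cache only a little over half of its heap, which matches the "about 50%" rule of thumb above.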
Best Practices
#4
• If your ML job is really CPU-bound
• Try using OpenBLAS + netlib.NativeSystemBLAS
• 4 to 5 times faster
Best Practices
#5
• Minimize data shuffle
• Prefer reduceByKey over groupByKey + map
• Call RDD.repartition(NUM_OF_CORES) before cache
• Filter as early as possible
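Why reduceByKey shuffles less than groupByKey + map can be seen with a small pure-Python model. This is not the Spark API: partitions are plain lists, and both function names are hypothetical. It only counts how many records would cross the network in each case.

```python
def shuffled_records_group_by_key(partitions):
    """groupByKey sends every (key, value) record through the shuffle."""
    return sum(len(partition) for partition in partitions)


def shuffled_records_reduce_by_key(partitions, combine):
    """reduceByKey combines values per key inside each partition first,
    so at most one record per key leaves each partition."""
    total = 0
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = combine(local[key], value) if key in local else value
        total += len(local)
    return total


# Two partitions of (word, 1) pairs, as in a word count.
partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
print(shuffled_records_group_by_key(partitions))                        # 6
print(shuffled_records_reduce_by_key(partitions, lambda x, y: x + y))   # 4
```

The gap grows with the number of repeated keys per partition, which is why the saving tends to be large for aggregations such as counts and sums.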
Best Practices
#6
• Prefer DataFrame APIs over low-level RDD APIs
• Better DAG optimization
• Same interface & same performance
Best Practices
#7
• Use Kryo serialization if possible
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
Best Practices
#8
• Pick a notebook tool (IPython, Zeppelin, or ...?)
• For notes, sharing, and visualisation
• Convenient for non-engineer users
Best Practices
#9
• Multiple small & task-driven EMR clusters
Best Practices
#10
• Use dynamic scaling with Spark Streaming
• spark.dynamicAllocation.enabled = true
• spark.shuffle.service.enabled = true
• Be careful if you use cached data: blocks cached on a removed executor are lost
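The two properties above can be passed as --conf options at submit time. The helper below is a small sketch (its name is hypothetical) that renders a settings dictionary into such options; only the two property names come from the slide.

```python
def to_conf_args(settings):
    """Render Spark properties as a flat list of --conf key=value options."""
    args = []
    for key, value in sorted(settings.items()):
        if isinstance(value, bool):
            # Spark expects lowercase true/false.
            value = "true" if value else "false"
        args += ["--conf", "{}={}".format(key, value)]
    return args


dynamic_allocation = {
    "spark.dynamicAllocation.enabled": True,
    "spark.shuffle.service.enabled": True,
}
print(" ".join(to_conf_args(dynamic_allocation)))
# --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true
```

The external shuffle service also has to be running on the YARN NodeManagers for dynamic allocation to work, so setting the property alone may not be enough on a custom cluster.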
Best Practices
#11
• Use Spot Instances
• Be more aggressive with your bid price :p
• BID_PRICE != MONEY_TO_PAY
• Check the Spot Instance pricing history
• Find an instance type with a relatively stable price
• Often a previous-generation instance type?
• Prepare for failure; don’t use them for mission-critical jobs
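One possible way to compare price stability across instance types is to rank them by the coefficient of variation of their price history. The sketch below assumes the history has already been fetched; the type names and prices are made-up placeholders, not real Spot prices, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def rank_by_price_stability(price_history):
    """Return instance types ordered from most to least stable price.

    Stability is measured as the coefficient of variation
    (standard deviation divided by mean) of the observed prices.
    """
    def variation(prices):
        return pstdev(prices) / mean(prices)

    return sorted(price_history, key=lambda name: variation(price_history[name]))


# Made-up price samples (USD per hour) for two placeholder instance types.
history = {
    "type_a": [0.10, 0.11, 0.10, 0.12],
    "type_b": [0.05, 0.40, 0.06, 0.90],
}
print(rank_by_price_stability(history))
```

Here type_b is cheaper at its lowest point but far more volatile, so a bid on type_a is less likely to be interrupted by a price spike.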
Further Reading
• To use Spark Streaming in production
• http://www.slideshare.net/SparkSummit/recipes-for-running-spark-streaming-apploications-in-production-tathagata-daspptx
• If you’re interested in the new ML pipelines
• http://www.slideshare.net/SparkSummit/building-debugging-and-tuning-spark-machine-leaning-pipelinesjoseph-bradley
Thanks!
We’re hiring!
http://about.smartnews.com/ja/careers/
iOS engineer / Android engineer
/ Web application engineer
/ Productivity engineer
/ Machine learning / NLP engineer
/ Growth hack engineer
/ Server-side engineer
/ Ad engineer…

AWS meetup「Apache Spark on EMR」
