SlideShare a Scribd company logo
1 of 12
Download to read offline
Spark and Hadoop at Yahoo:
Brought to you by YARN

Andy Feng
Yahoo! Hadoop
(afeng@yahoo-inc.com)
Personalized Web
Big-Data in Yahoo!

3

9/10/13
Hadoop + Spark:
Empowered by YARN

30k+ Yahoo! production nodes on YARN since Q1 2013
Shark Pilot: Advertising Data Analytics
§  Business questions
› 

Are two sets of audience cohorts similar to each other?

› 

What audience segment is most likely to be interested in this ad
campaign?

› 

In what way was the new front page rollout different than the
previous front page as far as audience engagement goes?

› 

What are the right metrics to define user engagement?

§  Shark pilot
› 

50 nodes, each w/ 96GB RAM
•  Currently loaded w/ 3.2 TB sample data in memory

› 

Homegrown BI tools for ad-hoc queries
•  Using Shark Server (contributed to community by Yahoo!)
Shark Perf: TCP-H Benchmark
Average
Seconds
600
500
400
300
200
100
0
Spark Pilot: Model Training Pipeline
§  A DAG of M/R jobs in Hadoop Streaming
› 

Feature extraction

› 

Train models

› 

Score and analyze models

§  Initial Spark prototype
› 

3x speedup on feature extraction

§  Production launch
› 

Apply Spark against complete pipeline

› 

Spark on 80 node cluster
•  Thanks to the enhanced UI and metrics in Spark 0.8

7

9/10/13
Use Case: Ad Targeting

Spark

M/R and Storm

8

9/10/13
Use Case: Content Recommendation
w/ Collaborative Filtering
Input

CF Learning

Ranking

Spark

Spark

9

9/10/13

Output
Spark-YARN: Deployment Simplified
run spark.deploy.yarn.Client --jar … --class … --args …
--queue …--num-workers … --worker-memory …

Spark-YARN (contributed by Yahoo!) is being adopted by
community (ex. Taobao) for production use. You should try it
on your Hadoop cluster.
10

9/10/13
Acknowledgement
§  AMPLab team
› 

Outstanding collaboration: Ion, Matei, Reynold, Patrick, Matt, …

§  Yahoo! Hadoop team
› 

Thomas, Bobby, Paul, Rajiv, Mithun, …

§  Yahoo! Lab.
› 

Mridul, Nathan, …

§  Yahoo! data analytics
› 

Supreeth, Ram, Tim, …

§  Yahoo! spark users
› 

Gavin, Jay, Hirakendu, …

11

9/10/13
We Are Hiring!
http://careers.yahoo.com/

More Related Content

What's hot

Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 
Hive Bucketing in Apache Spark
Hive Bucketing in Apache SparkHive Bucketing in Apache Spark
Hive Bucketing in Apache Spark
Tejas Patil
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
prabakaranbrick
 

What's hot (20)

Oracle RAC Virtualized - In VMs, in Containers, On-premises, and in the Cloud
Oracle RAC Virtualized - In VMs, in Containers, On-premises, and in the CloudOracle RAC Virtualized - In VMs, in Containers, On-premises, and in the Cloud
Oracle RAC Virtualized - In VMs, in Containers, On-premises, and in the Cloud
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
 
Hive Bucketing in Apache Spark
Hive Bucketing in Apache SparkHive Bucketing in Apache Spark
Hive Bucketing in Apache Spark
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Dm from databases perspective u 1
Dm from databases perspective u 1Dm from databases perspective u 1
Dm from databases perspective u 1
 
Database Basics Theory
Database Basics TheoryDatabase Basics Theory
Database Basics Theory
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data Solution
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
 
Introduction to Big Data and hadoop
Introduction to Big Data and hadoopIntroduction to Big Data and hadoop
Introduction to Big Data and hadoop
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentation
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
 
03 spark rdd operations
03 spark rdd operations03 spark rdd operations
03 spark rdd operations
 
New in Oracle Universal Installer (OUI)
New in Oracle Universal Installer (OUI) New in Oracle Universal Installer (OUI)
New in Oracle Universal Installer (OUI)
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 

Similar to Yahoo spark

Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
yhadoop
 
Capstone Project Slides- Yelper
Capstone Project Slides- YelperCapstone Project Slides- Yelper
Capstone Project Slides- Yelper
Chuan Sun
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!
Cloudera, Inc.
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
yhadoop
 

Similar to Yahoo spark (20)

Fikrimuhal TRHUG 2016 Machine Learning
Fikrimuhal TRHUG 2016 Machine LearningFikrimuhal TRHUG 2016 Machine Learning
Fikrimuhal TRHUG 2016 Machine Learning
 
Clickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache SparkClickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache Spark
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data Processing
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
Big Data Trend and Open Data
Big Data Trend and Open DataBig Data Trend and Open Data
Big Data Trend and Open Data
 
Capstone Project Slides- Yelper
Capstone Project Slides- YelperCapstone Project Slides- Yelper
Capstone Project Slides- Yelper
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Introduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on HadoopIntroduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on Hadoop
 
Power Platform Leeds - November 2019 - Microsoft Ignite Announcements
Power Platform Leeds - November 2019 - Microsoft Ignite AnnouncementsPower Platform Leeds - November 2019 - Microsoft Ignite Announcements
Power Platform Leeds - November 2019 - Microsoft Ignite Announcements
 
The Future Of SEO/Content Marketing
The Future Of SEO/Content MarketingThe Future Of SEO/Content Marketing
The Future Of SEO/Content Marketing
 
Hortonworks Big Data & Hadoop
Hortonworks Big Data & HadoopHortonworks Big Data & Hadoop
Hortonworks Big Data & Hadoop
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Yahoo spark

  • 1. Spark and Hadoop at Yahoo: Brought to you by YARN Andy Feng Yahoo! Hadoop (afeng@yahoo-inc.com)
  • 4. Hadoop + Spark: Empowered by YARN 30k+ Yahoo! production nodes on YARN since Q1 2013
  • 5. Shark Pilot: Advertising Data Analytics §  Business questions ›  Are two sets of audience cohorts similar to each other? ›  What audience segment is most likely to be interested in this ad campaign? ›  In what way was the new front page rollout different than the previous front page as far as audience engagement goes? ›  What are the right metrics to define user engagement? §  Shark pilot ›  50 nodes, each w/ 96GB RAM •  Currently loaded w/ 3.2 TB sample data in memory ›  Homegrown BI tools for ad-hoc queries •  Using Shark Server (contributed to community by Yahoo!)
  • 6. Shark Perf: TCP-H Benchmark Average Seconds 600 500 400 300 200 100 0
  • 7. Spark Pilot: Model Training Pipeline §  A DAG of M/R jobs in Hadoop Streaming ›  Feature extraction ›  Train models ›  Score and analyze models §  Initial Spark prototype ›  3x speedup on feature extraction §  Production launch ›  Apply Spark against complete pipeline ›  Spark on 80 node cluster •  Thanks to the enhanced UI and metrics in Spark 0.8 7 9/10/13
  • 8. Use Case: Ad Targeting Spark M/R and Storm 8 9/10/13
  • 9. Use Case: Content Recommendation w/ Collaborative Filtering Input CF Learning Ranking Spark Spark 9 9/10/13 Output
  • 10. Spark-YARN: Deployment Simplified run spark.deploy.yarn.Client --jar … --class … --args … --queue …--num-workers … --worker-memory … Spark-YARN (contributed by Yahoo!) is being adopted by community (ex. Taobao) for production use. You should try it on your Hadoop cluster. 10 9/10/13
  • 11. Acknowledgement §  AMPLab team ›  Outstanding collaboration: Ion, Matei, Reynold, Patrick, Matt, … §  Yahoo! Hadoop team ›  Thomas, Bobby, Paul, Rajiv, Mithun, … §  Yahoo! Lab. ›  Mridul, Nathan, … §  Yahoo! data analytics ›  Supreeth, Ram, Tim, … §  Yahoo! spark users ›  Gavin, Jay, Hirakendu, … 11 9/10/13