SlideShare uma empresa Scribd logo
1 de 19
Baixar para ler offline
HADOOP AND SPARK
JOIN FORCES AT YAHOO
Andy Feng (afeng@yahoo-inc.com)
Distinguished Architect, Platforms, Yahoo

1
Monday, December 2, 13
YAHOO 2012: TODAY MODULE

http://visualize.yahoo.com/core/#
Monday, December 2, 13

2
YAHOO 2013: PERSONALIZED HOMEPAGE
http://www.yahoo.com

Mobile

3
Monday, December 2, 13
YAHOO 2013: PERSONALIZED PROPERTIES
http://finance.yahoo.com

Mobile

4
Monday, December 2, 13
YAHOO 2013: IMPROVED WEB SEARCH
W/ VERTICAL CONTENT
http://search.yahoo.com
Mobile

5
Monday, December 2, 13
DATA SCIENCE AT SCALE

Data Scientist

6
Monday, December 2, 13
1. CHALLENGE: SCIENCE
✦ Single model for all items in homepage stream
✴ Millions of items
✴ 1000’s of item/user features
• Yahoo content categories
• Wikipedia entity names

✴ Over 800 million users

✦ Objective function
✴ Relevance & user engagement
✴ Freshness & popularity
✴ Diversity

★ Algorithm exploration
✴ Logistic regression?
✴ Collaborative filtering?
✴ Decision trees?
✴ Hybrid?

7
Monday, December 2, 13
II. CHALLENGE: SPEED
✦Ex. Item CTR in Yahoo homepage Today Module
* Short Lifetimes
* Temporal effect

✦Models should be constructed hourly or faster
8
Monday, December 2, 13

* Breaking news
III. CHALLENGE: SCALE
✦150 PB of data on Yahoo Hadoop clusters
✴Yahoo data scientists need the data for
‣ Model building
‣ BI analytics
✴Such datasets should be accessed efficiently
‣ avoid latency caused by data movement
✦35,000 servers in Hadoop cluster
✴Science projects need to leverage
9
Monday, December 2, 13

all these servers for computation
SOLUTION: HADOOP + SPARK

I. science ... Spark API & MLlib ease development of ML algorithms
II.speed ... Spark reduces latency of model training via in-memory RDD etc
III.scale ... YARN brings Hadoop datasets & servers at scientists’ fingertips
10
Monday, December 2, 13
PILOT PROJECT: E-COMMERCE
Yahoo Taiwan Shopping & Auction
✦ Collaborative

filtering algorithms for
✴ Viewed-also-viewed
✴ Bought-also-bought
✴ Bought-after-viewed

✦ 30

LOC in Spark/Scala
✴ 14 min. on 10 servers
- Hadoop-based algorithm: 106 min.

11
Monday, December 2, 13
PILOT PROJECT: STREAM ADS
✦A logistic regression algorithm
✴120 LOC in Spark/Scala
‣ Alternative: Vowpal Wabbit ... Difficult to extend
✴30 min. on model creation for 100M samples
and 13K features with 30 iterations

✦“I used Spark-on-Yarn package
today, works great.” Amit (July 26, 2013 2:51 PM)

✴Initial algorithm was launched within 2 hours
after Spark-YARN package announcement
✴Compare: Several weeks on system setup and
data movement

12
Monday, December 2, 13
SUMMARY
✦Spark plays an important role in machine learning at Yahoo
✴Hadoop continues to be the core of our big-data platform
✦Yahoo is excited about your continued contribution to Apache
✴4 committers
✴ex. Spark-on-Yarn, Shark, security, scalability, operability etc.

13
Monday, December 2, 13

Spark
KEY TECHNOLOGY:
SPARK ON YARN

Thomas Graves (tgraves@yahoo-inc.com)
Spark Committer & Hadoop PMC
Yahoo

14
Monday, December 2, 13
SPARK-ON-YARN: ROADMAP
✦ Spark-0.6 & 0.7 - Experimental support of YARN
✴ Hadoop 0.23.X and early Hadoop 2.X releases
✦ Spark-0.8.0

- Spark-on-Yarn merged into master Spark branch

✦ Spark-0.8.X

- Future integration with Hadoop

✦ Spark-0.9.X

- Client-mode introduced

✴ Secure HDFS access
✴ Use YARN approved directories
✴ Link Spark UI to YARN UI

✴ Add authentication to Spark
✴ Support running spark on YARN from HDFS
✴ Support files/archives in Hadoop distributed cache
✴ Support Spark Shell on YARN
✴ Spark on YARN running on Hadoop

2.2.X
15

Monday, December 2, 13
ARCHITECTURE: STANDALONE MODE

16
Monday, December 2, 13
ARCHITECTURE: CLIENT MODE

17
Monday, December 2, 13
FUTURE DIRECTIONS
✦Support long running
✴Shark
✴Spark Streaming
✦Dynamic

jobs

resource allocation

✦Integrate with Hadoop enhancements
✴Generic History Server
✴Preemption, etc.
18
Monday, December 2, 13
MORE INFO:
✦http://spark.incubator.apache.org/docs/latest/running-on-yarn.html

19
Monday, December 2, 13

Mais conteúdo relacionado

Mais procurados

The hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarnThe hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarn
Michael Joseph
 

Mais procurados (20)

Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Apache Hadoop YARN 2015: Present and Future
Apache Hadoop YARN 2015: Present and FutureApache Hadoop YARN 2015: Present and Future
Apache Hadoop YARN 2015: Present and Future
 
Apache Spark at Viadeo
Apache Spark at ViadeoApache Spark at Viadeo
Apache Spark at Viadeo
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Apache spark
Apache sparkApache spark
Apache spark
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
 
The hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarnThe hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarn
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 

Destaque

Destaque (13)

Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
 
Food Recommendation System Using Clustering Analysis for Diabetic patients
Food Recommendation System Using Clustering Analysis for Diabetic patientsFood Recommendation System Using Clustering Analysis for Diabetic patients
Food Recommendation System Using Clustering Analysis for Diabetic patients
 
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
 
Some Important Streaming Algorithms You Should Know About-(Ted Dunning, MapR)
Some Important Streaming Algorithms You Should Know About-(Ted Dunning, MapR)Some Important Streaming Algorithms You Should Know About-(Ted Dunning, MapR)
Some Important Streaming Algorithms You Should Know About-(Ted Dunning, MapR)
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At Airbnb
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
 
iRIS: A Large-Scale Food and Recipe Recommendation System Using Spark-(Joohyu...
iRIS: A Large-Scale Food and Recipe Recommendation System Using Spark-(Joohyu...iRIS: A Large-Scale Food and Recipe Recommendation System Using Spark-(Joohyu...
iRIS: A Large-Scale Food and Recipe Recommendation System Using Spark-(Joohyu...
 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with Spark
 
Cassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityCassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data Locality
 

Semelhante a December 2013 HUG: Spark at Yahoo!

Proud to be polyglot!
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!
NLJUG
 
Drupal for Project Managers, Part 3: Launching
Drupal for Project Managers, Part 3: LaunchingDrupal for Project Managers, Part 3: Launching
Drupal for Project Managers, Part 3: Launching
Acquia
 
Drupal and the semantic web - SemTechBiz 2012
Drupal and the semantic web - SemTechBiz 2012Drupal and the semantic web - SemTechBiz 2012
Drupal and the semantic web - SemTechBiz 2012
scorlosquet
 
Case Study: Digital Asset Management in Drupal
Case Study: Digital Asset Management in DrupalCase Study: Digital Asset Management in Drupal
Case Study: Digital Asset Management in Drupal
Eric Sembrat
 

Semelhante a December 2013 HUG: Spark at Yahoo! (20)

Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
Drupal: Internet Lego - What is Drupal?
Drupal: Internet Lego - What is Drupal?Drupal: Internet Lego - What is Drupal?
Drupal: Internet Lego - What is Drupal?
 
Proud to be polyglot!
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!
 
Treasure Data Cloud Strategy
Treasure Data Cloud StrategyTreasure Data Cloud Strategy
Treasure Data Cloud Strategy
 
Productionalizing Spark Streaming
Productionalizing Spark StreamingProductionalizing Spark Streaming
Productionalizing Spark Streaming
 
For a Social Local and Mobile Drupal
For a Social Local and Mobile DrupalFor a Social Local and Mobile Drupal
For a Social Local and Mobile Drupal
 
Lightweight javaEE with Guice
Lightweight javaEE with GuiceLightweight javaEE with Guice
Lightweight javaEE with Guice
 
Drupal for Project Managers, Part 3: Launching
Drupal for Project Managers, Part 3: LaunchingDrupal for Project Managers, Part 3: Launching
Drupal for Project Managers, Part 3: Launching
 
Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
 
Graph Gurus Episode 1: Enterprise Graph
Graph Gurus Episode 1: Enterprise GraphGraph Gurus Episode 1: Enterprise Graph
Graph Gurus Episode 1: Enterprise Graph
 
Softshake 2013: Introduction to NoSQL with Couchbase
Softshake 2013: Introduction to NoSQL with CouchbaseSoftshake 2013: Introduction to NoSQL with Couchbase
Softshake 2013: Introduction to NoSQL with Couchbase
 
Power of open source, Business perspective
Power of open source, Business perspectivePower of open source, Business perspective
Power of open source, Business perspective
 
Big data oracle_introduccion
Big data oracle_introduccionBig data oracle_introduccion
Big data oracle_introduccion
 
Drupal and the semantic web - SemTechBiz 2012
Drupal and the semantic web - SemTechBiz 2012Drupal and the semantic web - SemTechBiz 2012
Drupal and the semantic web - SemTechBiz 2012
 
Python + NoSQL in Animations
Python + NoSQL in AnimationsPython + NoSQL in Animations
Python + NoSQL in Animations
 
Oracle GoldenGate Studio Intro
Oracle GoldenGate Studio IntroOracle GoldenGate Studio Intro
Oracle GoldenGate Studio Intro
 
Introduction to Drupal for Absolute Beginners
Introduction to Drupal for Absolute BeginnersIntroduction to Drupal for Absolute Beginners
Introduction to Drupal for Absolute Beginners
 
Case Study: Digital Asset Management in Drupal
Case Study: Digital Asset Management in DrupalCase Study: Digital Asset Management in Drupal
Case Study: Digital Asset Management in Drupal
 
Infrastructure as Code with Chef / Puppet
Infrastructure as Code with Chef / PuppetInfrastructure as Code with Chef / Puppet
Infrastructure as Code with Chef / Puppet
 

Mais de Yahoo Developer Network

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
Yahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
Yahoo Developer Network
 

Mais de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Último (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 

December 2013 HUG: Spark at Yahoo!

  • 1. HADOOP AND SPARK JOIN FORCES AT YAHOO Andy Feng (afeng@yahoo-inc.com) Distinguished Architect, Platforms, Yahoo 1 Monday, December 2, 13
  • 2. YAHOO 2012: TODAY MODULE http://visualize.yahoo.com/core/# Monday, December 2, 13 2
  • 3. YAHOO 2013: PERSONALIZED HOMEPAGE http://www.yahoo.com Mobile 3 Monday, December 2, 13
  • 4. YAHOO 2013: PERSONALIZED PROPERTIES http://finance.yahoo.com Mobile 4 Monday, December 2, 13
  • 5. YAHOO 2013: IMPROVED WEB SEARCH W/ VERTICAL CONTENT http://search.yahoo.com Mobile 5 Monday, December 2, 13
  • 6. DATA SCIENCE AT SCALE Data Scientist 6 Monday, December 2, 13
  • 7. 1. CHALLENGE: SCIENCE ✦ Single model for all items in homepage stream ✴ Millions of items ✴ 1000’s of item/user features • Yahoo content categories • Wikipedia entity names ✴ Over 800 million users ✦ Objective function ✴ Relevance & user engagement ✴ Freshness & popularity ✴ Diversity ★ Algorithm exploration ✴ Logistic regression? ✴ Collaborative filtering? ✴ Decision trees? ✴ Hybrid? 7 Monday, December 2, 13
  • 8. II. CHALLENGE: SPEED ✦Ex. Item CTR in Yahoo homepage Today Module * Short Lifetimes * Temporal effect ✦Models should be constructed hourly or faster 8 Monday, December 2, 13 * Breaking news
  • 9. III. CHALLENGE: SCALE ✦150 PB of data on Yahoo Hadoop clusters ✴Yahoo data scientists need the data for ‣ Model building ‣ BI analytics ✴Such datasets should be accessed efficiently ‣ avoid latency caused by data movement ✦35,000 servers in Hadoop cluster ✴Science projects need to leverage 9 Monday, December 2, 13 all these servers for computation
  • 10. SOLUTION: HADOOP + SPARK I. science ... Spark API & MLlib ease development of ML algorithms II.speed ... Spark reduces latency of model training via in-memory RDD etc III.scale ... YARN brings Hadoop datasets & servers at scientists’ fingertips 10 Monday, December 2, 13
  • 11. PILOT PROJECT: E-COMMERCE Yahoo Taiwan Shopping & Auction ✦ Collaborative filtering algorithms for ✴ Viewed-also-viewed ✴ Bought-also-bought ✴ Bought-after-viewed ✦ 30 LOC in Spark/Scala ✴ 14 min. on 10 servers - Hadoop-based algorithm: 106 min. 11 Monday, December 2, 13
  • 12. PILOT PROJECT: STREAM ADS ✦A logistic regression algorithm ✴120 LOC in Spark/Scala ‣ Alternative: Vowpal Wabbit ... Difficult to extend ✴30 min. on model creation for 100M samples and 13K features with 30 iterations ✦“I used Spark-on-Yarn package today, works great.” Amit (July 26, 2013 2:51 PM) ✴Initial algorithm was launched within 2 hours after Spark-YARN package announcement ✴Compare: Several weeks on system setup and data movement 12 Monday, December 2, 13
  • 13. SUMMARY ✦Spark plays an important role in machine learning at Yahoo ✴Hadoop continues to be the core of our big-data platform ✦Yahoo is excited about your continued contribution to Apache ✴4 committers ✴ex. Spark-on-Yarn, Shark, security, scalability, operability etc. 13 Monday, December 2, 13 Spark
  • 14. KEY TECHNOLOGY: SPARK ON YARN Thomas Graves (tgraves@yahoo-inc.com) Spark Committer & Hadoop PMC Yahoo 14 Monday, December 2, 13
  • 15. SPARK-ON-YARN: ROADMAP ✦ Spark-0.6 & 0.7 - Experimental support of YARN ✴ Hadoop 0.23.X and early Hadoop 2.X releases ✦ Spark-0.8.0 - Spark-on-Yarn merged into master Spark branch ✦ Spark-0.8.X - Future integration with Hadoop ✦ Spark-0.9.X - Client-mode introduced ✴ Secure HDFS access ✴ Use YARN approved directories ✴ Link Spark UI to YARN UI ✴ Add authentication to Spark ✴ Support running spark on YARN from HDFS ✴ Support files/archives in Hadoop distributed cache ✴ Support Spark Shell on YARN ✴ Spark on YARN running on Hadoop 2.2.X 15 Monday, December 2, 13
  • 18. FUTURE DIRECTIONS ✦Support long running ✴Shark ✴Spark Streaming ✦Dynamic jobs resource allocation ✦Integrate with Hadoop enhancements ✴Generic History Server ✴Preemption, etc. 18 Monday, December 2, 13