Hadoop Summit 2011 Online Content Optimization using Hadoop Shail Aditya shailg@yahoo-inc.com
What do we do?
Effectively and “pro-actively” learn from user interactions with the content that is displayed, in order to maximize our objectives
A new scientific discipline at the interface of:
Large-scale machine learning and statistics
Multi-objective optimization in the presence of uncertainty
User understanding
Content understanding
Content Ranking Problems
Most Popular: most engaging overall, based on objective metrics
Most Popular + Per User History: rotate stories I’ve already seen
Light Personalization: more relevant to me based on my age, gender, location, and property usage
Deep Personalization: most relevant to me based on my deep interests (entities, sources, categories, keywords)
Related Items and Context-Sensitive Models: behavioral affinity (people who did X, did Y); most engaging in this page/section/property/device/referral context?
Layout Optimization: which modules/ad units should be shown to this user in this context?
Revenue Optimization
Voice and Business Rules
Real-time Dashboard
Yahoo Frontpage
Trending Now (Most Popular)
Today Module (Light Personalization)
Personal Assistant (Light Personalization)
National News (Most Popular + User History bucket)
Deals (Most Popular)
Recommendation: A Match-making Problem
Search: Web, Vertical
Online advertising
…
Item Inventory: articles, web pages, ads, …
Opportunity: users, queries, pages, …
Use an automated algorithm to select item(s) to show
Get feedback (click, time spent, ...)
Refine the models
Repeat (a large number of times)
Measure metric(s) of interest (total clicks, total revenue, …)
Problem Characteristics: Today Module
Traffic obtained from a controlled randomized experiment
Things to note: a) short lifetimes, b) temporal effects, c) often breaking news stories
Scale: Why use Hadoop?
Million events per second (user view/click, content update)
Hundreds of GB of data collected and modeled per run
Millions of items in pool
Millions of user profiles
Tens of thousands of features (content and/or user)
Data Flow
Diagram labels: Optimization Engine; Content feed with biz rules; Rules Engine; Content Metadata; Exploit ~99%; Explore ~1%; Near Real-time Feedback; Real-time Insights Dashboard; Optimized Module
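
The "Exploit ~99% / Explore ~1%" split in the data flow resembles an epsilon-greedy policy. The sketch below is only an illustration under that assumption; the class name, the smoothed click-through estimate, and the bucket labels are hypothetical and are not taken from the talk.

```python
import random
from collections import defaultdict

EXPLORE_FRACTION = 0.01  # "Explore ~1%" on the slide; the remainder is exploit traffic


class ExploreExploitRanker:
    """Serve the best-known item most of the time and a random item otherwise."""

    def __init__(self, item_ids, explore_fraction=EXPLORE_FRACTION, seed=0):
        self.item_ids = list(item_ids)
        self.explore_fraction = explore_fraction
        self.views = defaultdict(int)
        self.clicks = defaultdict(int)
        self.rng = random.Random(seed)

    def ctr(self, item_id):
        # Smoothed click-through rate; the +1/+2 prior is an illustrative assumption.
        return (self.clicks[item_id] + 1) / (self.views[item_id] + 2)

    def select(self):
        # Explore bucket: uniform random item, which yields unbiased feedback.
        if self.rng.random() < self.explore_fraction:
            return self.rng.choice(self.item_ids), "explore"
        # Exploit bucket: item with the highest estimated click-through rate.
        return max(self.item_ids, key=self.ctr), "exploit"

    def feedback(self, item_id, clicked):
        # Near-real-time feedback: update counts after each impression.
        self.views[item_id] += 1
        if clicked:
            self.clicks[item_id] += 1
```

Each call to `select()` followed by `feedback()` is one pass through the select, observe, refine loop described in the match-making slide.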

Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoop" by Shail Aditya

  • 18. How it happens? Diagram labels: User Events; Feature Generation; Additional Content & User Feature Generation; Modeling; Ranking; B-Rules; Request; SLA 50 ms – 200 ms; ITEM Model (STORE: HBASE, 5 min latency); USER Model (STORE: PNUTS); 5 – 30 min latency; Item Metadata. Event record: at time ‘t’, user ‘u’ (user attributes: age, gender, location) interacted with content ‘id’ at position ‘o’, property/site ‘p’, section ‘s’, module ‘m’, international ‘i’. Content ‘id’ has associated metadata ‘meta’, where meta = {entity, keyword, geo, topic, category}.
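
The event and metadata records on this slide can be written down as plain data types. This is a minimal sketch: the field names follow the slide's symbols (t, u, id, o, p, s, m, i), while the concrete types, the `action` field, and the helper function are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class UserAttributes:
    age: int
    gender: str
    location: str


@dataclass(frozen=True)
class InteractionEvent:
    timestamp: float           # 't'
    user_id: str               # 'u'
    user_attr: UserAttributes  # age, gender, location
    content_id: str            # 'id'
    position: int              # 'o'
    property_site: str         # 'p'
    section: str               # 's'
    module: str                # 'm'
    international: str         # 'i'
    action: str = "view"       # assumption: the slide does not name this field


@dataclass
class ItemMetadata:
    content_id: str
    # Keys follow the slide: entity, keyword, geo, topic, category.
    meta: Dict[str, List[str]] = field(default_factory=dict)


def attach_metadata(event, metadata_by_id):
    """Return the metadata dict for the event's item, or {} when the item is unknown."""
    item = metadata_by_id.get(event.content_id)
    return item.meta if item is not None else {}
```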
  • 20. Technology Stack: Ingest; Analytics and Debugging
  • 22. Hadoop processing via a collection of Pig UDFs
  • 23. Different flows for modeling, or stages, assembled in Pig
  • 24. OLR, clustering, affinity, regression models, decompositions (Cholesky, …)
  • 25. Time-series models (generally trends extracted from user activity on content)
  • 26. Configuration-based behavior for the various stages of modeling
  • 27. Types of features to generate
  • 28. Types of joins to perform: User / Item / Feature
  • 29. Input: DFS and/or HBase
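
Among the model types listed above, "OLR" is read here as online logistic regression; that reading is an assumption, since the slides do not expand the acronym. The sketch below shows one stochastic-gradient update per observed event over sparse named features. The class, feature names, and learning rate are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class OnlineLogisticRegression:
    """Click model updated one event at a time (plain SGD, no regularization)."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}  # feature name -> weight; missing features weigh 0

    def predict(self, features):
        z = sum(self.weights.get(name, 0.0) * value for name, value in features.items())
        return sigmoid(z)

    def update(self, features, clicked):
        # Gradient of the log loss for one example is (prediction - label) * feature value.
        error = self.predict(features) - (1.0 if clicked else 0.0)
        for name, value in features.items():
            current = self.weights.get(name, 0.0)
            self.weights[name] = current - self.learning_rate * error * value
```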
  • 31. Stores item related features
  • 32. Stores ITEM x USER FEATURES model
  • 33. Stores per-item parameters such as view count, click count, and unique user count
  • 34. Tens of millions of items
  • 35. Updated every 5 minutes
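
The per-item parameters named above (view count, click count, unique user count) could be produced by a small aggregation over one batch of events. This is a sketch in plain Python rather than the Pig/HBase pipeline from the talk; the tuple layout and the action labels are assumptions.

```python
from collections import defaultdict


def aggregate_item_stats(events):
    """Summarize one batch of (item_id, user_id, action) tuples per item.

    `action` is assumed to be either "view" or "click".
    """
    views = defaultdict(int)
    clicks = defaultdict(int)
    users = defaultdict(set)
    for item_id, user_id, action in events:
        if action == "view":
            views[item_id] += 1
        elif action == "click":
            clicks[item_id] += 1
        users[item_id].add(user_id)
    return {
        item_id: {
            "view_count": views[item_id],
            "click_count": clicks[item_id],
            "unique_user_count": len(users[item_id]),
        }
        for item_id in users
    }
```

Running this over each 5-minute window of events would yield the rows that refresh the item table.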
  • 37. Store USER x CONTENT FEATURES model for each individual user by either a Unique ID
  • 38. Stores summarized user history – Essential for Modeling in terms of item decay
  • 40. Updated every 5 to 30 minutes
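
One possible reading of "item decay" is that an item's score for a user drops each time that user has already seen it, which would match the "rotate stories I've already seen" goal. The sketch below assumes that reading; the geometric decay factor and the function name are hypothetical.

```python
def rank_with_user_history(item_scores, seen_counts, decay=0.5):
    """Rank items for one user, discounting items the user has already seen.

    item_scores: item_id -> global score (for example, estimated click-through rate)
    seen_counts: item_id -> number of times this user has seen the item
    decay:       per-view multiplier in (0, 1]; 0.5 is an illustrative value
    """
    adjusted = {
        item_id: score * (decay ** seen_counts.get(item_id, 0))
        for item_id, score in item_scores.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)
```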
  • 42. Inverts the Item Table and stores statistics for the terms.
  • 43. Used to find the trending features and provide baselines for user features
  • 44. Millions of terms and hundreds of parameters tracked
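
Inverting the item table means regrouping per-item statistics by the terms attached to each item. A minimal sketch follows, assuming each item row carries a list of terms plus view and click counts; the row layout and the click-based notion of "trending" are assumptions.

```python
from collections import defaultdict


def invert_item_table(item_table):
    """Turn {item_id: {"terms", "views", "clicks"}} into per-term statistics."""
    term_table = defaultdict(lambda: {"items": 0, "views": 0, "clicks": 0})
    for row in item_table.values():
        for term in set(row["terms"]):  # count each term once per item
            stats = term_table[term]
            stats["items"] += 1
            stats["views"] += row["views"]
            stats["clicks"] += row["clicks"]
    return dict(term_table)


def trending_terms(term_table, top_n=3):
    """Return the terms with the most clicks, highest first."""
    ranked = sorted(term_table, key=lambda term: term_table[term]["clicks"], reverse=True)
    return ranked[:top_n]
```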
  • 46. Provide a way to easily deploy and control solutions that cannot be run on the grid ("non-gridifiable")
  • 47. Have different scaling characteristics (e.g., memory, CPU)
  • 48. Provide a gateway for accessing external data sources in M/R
  • 49. Map and/or Reduce steps interact with Edge Services using a standard client
  • 54. Run complex queries for analysis
  • 55. Easy to use interface
  • 56. PMs, engineers, and researchers use this cluster to get near-real-time insights
  • 57. Tens of modeling, monitoring, and reporting queries every 5 minutes
  • 59. Made it simple to build different kinds of science models
  • 60. Point lookup using HBase has proven to be very useful
  • 61. Modeling = Matrices
  • 62. HBase provides a natural way to represent and access them
  • 64. Have provided simplicity to the whole stack
  • 65. Management (upgrades, outages) has been easy
  • 66. Hive has provided us with a great way to analyze the results
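
The "Modeling = Matrices" point above can be illustrated with a sparse matrix stored row by row, where each row key maps to a set of column/value pairs, much like a row in a wide-column store. The class below is an in-memory stand-in written for this illustration; it is not the HBase client API, and all names are hypothetical.

```python
class SparseRowStore:
    """Sparse matrix kept as row_key -> {column: value}, with point lookups by row."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        # Set one cell; rows and columns are created on first write.
        self._rows.setdefault(row_key, {})[column] = value

    def get_row(self, row_key):
        # Point lookup: return a copy of one matrix row (empty if absent).
        return dict(self._rows.get(row_key, {}))

    def dot(self, row_key, feature_vector):
        # Score = sum over stored columns of weight * feature value.
        row = self._rows.get(row_key, {})
        return sum(weight * feature_vector.get(column, 0.0) for column, weight in row.items())
```

For example, a row keyed by item ID with one column per user feature gives an ITEM x USER FEATURES model, and scoring an item for a user is a single row lookup plus a dot product.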

Editor's Notes

  1. This is the title slide. Please use the name of the presentation that was used in the abstract submission.
  2–9. This is the agenda slide. There is only one of these in the deck. NOTES: What does "X stories to run" mean? Can we be clearer on that? Also, this should be more of a punch line of what we do. This slide to me is very broad and not clear. Following are the things that I would describe: the problem of matching the best content to the interest of a user; scale (millions of content slices, millions of users).
  10. This is the final slide, generally for questions at the end of the talk. Please post your contact information here.