Big Data Day LA 2016 Keynote - Andy Feng/ Yahoo


Data Con LA

Big Data Day LA 2016 Keynote: Andy Feng, VP of Architecture at Yahoo, talks about Hadoop and big-data innovation.

Technology



1. Big-Data Innovation: Hadoop, Real-time, and Machine Learning
   Andy Feng, Yahoo

2. Hadoop Clusters at Yahoo

3. How does Yahoo use its Big Data Tech?
   Yahoo Confidential & Proprietary

4. Personalized Content
   • Content augmentation
   • User profiling
   • Recommendation

5. Mail Smart Views

6. Flickr: flickr.com/cameraroll

7. Weather (Weather App, Yahoo Weather App)
   • Beauty
     › Computational assessed
   • Relevant
     › Location
     › Time
     › Cloudy
     › Shower
     › …

8. Search
   • Query intention
   • Page ranking
   • Ads matching

9. Search2Vec: Cosine similarity b/w ads and queries

10. Big-Data Impact: Search2Vec (http://bit.ly/28SLdjU)

    Bucket Tests                     Query Coverage   Auction Depth   Revenue per Search
    Simple model vs. Baseline        +1.14%           +2.13%          +7.07%
    Advanced model vs. Small model   +2.44%           +2.39%          +9.39% (= +17.12% vs. Baseline)

11. Open Source

12. CaffeOnSpark: Distributed Deep Learning (github.com/yahoo/caffeonspark)
    • Powerful DL Platform
    • Fully Distributed
    • High-level API
    • Incremental Learning
    • Existing Clusters

13. Interactive Analytics
    • Data Sketches Algorithms Library: datasketches.github.io
    • Sub-second User Facing Analytics: druid.io

14. Apache Storm: Real-time Processing (https://storm.apache.org)
    • MT & RA Scheduler
    • Dist. Cache API
    • 8x Throughput
    • Improved Debuggability
    • Pacemaker Server
    • Streaming Benchmark [1]
    [1] github.com/yahoo/streaming-benchmarks

15. Apache Omid: Transactions for NoSQL DB (http://omid.incubator.apache.org/)
    ACID Transactions
    • Multi-row/multi-table transactions
    • Snapshot isolation
    • Lock-free

16. Yahoo Hadoop Stack

17. (no text on slide)

18. Thanks!
    yahoohadoop.tumblr.com
    bigdata@yahoo-inc.com
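
The "+17.12% vs. Baseline" figure on slide 10 is consistent with compounding the two revenue-per-search lifts multiplicatively rather than adding them. A minimal Python check, assuming the two bucket tests stack in that way (the function name is illustrative and not from the talk):

```python
def compound_lift(*lifts_pct: float) -> float:
    """Combine successive percentage lifts multiplicatively.

    Assumption: each lift is measured relative to the previous stage,
    so the stages multiply instead of adding.
    """
    factor = 1.0
    for lift in lifts_pct:
        factor *= 1.0 + lift / 100.0
    return (factor - 1.0) * 100.0


# Revenue per search: +7.07% (first model vs. baseline),
# then +9.39% (advanced model vs. the first model).
revenue_lift = compound_lift(7.07, 9.39)
print(f"{revenue_lift:.2f}%")  # 17.12%
```

Simply adding the two percentages would give +16.46%, so the slide's total only matches under the multiplicative reading.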

Editor's Notes

http://storm.apache.org/2016/04/12/storm100-released.html

Big-Data Technology Innovation: Hadoop, Real-time, and Machine Learning
Andy Feng, VP Architecture, Yahoo!

Yahoo started developing big-data technology with Hadoop MapReduce and File System in 2006, and made it an Apache open source project in 2009. Since then, big data has become a major component of the global tech industry, and Yahoo is leading the way. In the past three years, Yahoo has been a leading contributor to Apache Storm for event processing, Apache HBase for distributed NoSQL stores, Apache Spark for faster processing, and Druid for sub-second analytics. We have created new open source projects such as Apache Omid for transactional support of NoSQL stores, Yahoo Data Sketches for approximate analytics, and Yahoo CaffeOnSpark for distributed deep learning.

In this talk, we walk through Yahoo use cases (search, advertising, personalization, and Flickr) where our big-data technologies are best exemplified. We explain how Yahoo leverages these technologies to perform real-time processing and advanced machine learning against 600 petabytes of data, and describe the system architecture of our heterogeneous clusters of 40,000 servers for supporting a variety of workloads. We provide an overview of open source technologies (Apache Storm, Apache HBase, Apache Omid, and Yahoo CaffeOnSpark) and our in-house technology for large-scale machine learning. We discuss how academic researchers and industry technologists can help advance big-data technologies further.

Bio: Dr. Andy Feng is a VP of Architecture at Yahoo leading the architecture and design of big data and machine learning initiatives. He’s architected major platforms for personalization, ad serving, NoSQL, and cloud infrastructure. Prior to Yahoo, he was a Chief Architect at Netscape/AOL, and Principal Scientist at Xerox.
(1 min – T 2 min) Yahoo has a long history of involvement with Hadoop and Big Data: from our initial work to open source Hadoop in Apache, to our effort stabilizing YARN at scale, to recent efforts working with the Apache community to drive scale and utilization in Hadoop, the Hadoop stack, and Storm. I plan to give an overview of several areas that were all talks this year at Hadoop Summit: overcommit of hardware, migration to Tez to improve Pig/Hive job utilization, and finally Storm’s Resource Aware Scheduler to improve Storm hardware utilization, with the hope that you will be intrigued enough to attend the talk or go back and watch it after the summit.

TODOs:
• Work in a reference to the last Sumeet keynote?
• Work in a reference to the YARN talk from 2-3 years ago and breaking the scale benchmark.
• Maybe mention the community effort (find a way to mention Hortonworks on Tez, Twitter? others?)
We released the magic view as part of the Flickr 4.0 release last April, and it is the most visible user-facing feature that exposes our image recognition capabilities. Our users can switch from the traditional timeline view of their photos to an experience where their photos are arranged according to 70 categories. For example, you can see here that landscape photos are sub-categorized into different types such as mountain, rock, or shore. This is a great feature for serendipitous photo discovery. Most of us have thousands of photos that we don’t get to see very often but are emotionally very attached to, and these types of groupings help us rediscover photos.
Let’s start with WHY machine learning. Search is one of the key applications for Yahoo. For a user’s search phrase, we construct a result page with organic content together with ads. To generate the result page, we rank content based on its relevance to the query terms, match ads against the query, and predict the probability of an ad click. Several machine learning algorithms are applied in this process: decision trees, logistic regression, and neural networks.
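
The ad-matching step described above, together with the Search2Vec slide's "cosine similarity b/w ads and queries", can be illustrated with toy vectors. This is only a sketch: the vectors, ad names, and the `rank_ads` helper are made up for illustration, and the source does not show how the real embeddings are trained or served.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for a zero vector: no similarity
    return dot / (norm_u * norm_v)


def rank_ads(query_vec, ad_vecs):
    """Return (ad_id, score) pairs sorted from most to least similar."""
    scored = [(ad_id, cosine_similarity(query_vec, vec))
              for ad_id, vec in ad_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings for one query and two ads.
query = [0.9, 0.1, 0.0]
ads = {
    "ad_hiking_boots": [0.8, 0.2, 0.1],
    "ad_car_insurance": [0.0, 0.1, 0.9],
}
ranking = rank_ads(query, ads)
for ad_id, score in ranking:
    print(ad_id, round(score, 3))
```

Because cosine similarity ignores vector length, an ad and a query can match even when their embeddings have very different magnitudes; only the direction matters.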
(2 min – T 7 min) And, best yet, CaffeOnSpark was open sourced last month under the Apache 2.0 license.
(1 min – T 11 min) A certain class of problems in big-data analytics doesn’t scale well because the queries take too much time or too many resources: count distinct, most frequent, quantiles, etc. That’s where Sketches algorithms come in, where “good enough” approximate answers work great for interactivity (and real-time stream data). We have used Sketches successfully for several use cases, such as audience analytics and Flurry analytics for our Mobile Developer Suite. Sketches integrates really well with Druid for sub-second OLAP, where we have made many contributions recently, such as dimension joins, reliable pull-based real-time ingestion, and schema introspection. Sketches is now available in open source, and integrates well with Pig and Hive from the Hadoop ecosystem.
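
To make the "good enough" approximate count-distinct idea concrete, here is a minimal sketch of a K-Minimum-Values (KMV) estimator, a textbook technique. It is not the Data Sketches library API; the class name, the choice of SHA-1 as the hash, and the value of `k` are all assumptions made for illustration.

```python
import hashlib
import heapq


class KMVSketch:
    """Approximate distinct counter that keeps only the k smallest hashes."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, to skip duplicates

    @staticmethod
    def _hash01(item):
        """Map an item to a pseudo-uniform float in [0, 1)."""
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        h = self._hash01(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


# 100,000 events generated by 20,000 distinct (hypothetical) users.
sketch = KMVSketch(k=1024)
for event in range(100_000):
    sketch.update(f"user-{event % 20_000}")
approx_users = sketch.estimate()
print(round(approx_users))
```

Memory stays bounded by `k` no matter how many events arrive, and the relative error shrinks roughly as 1/sqrt(k), which is the trade-off the note describes.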
(2 min – T 10 min) In the absence of one, we established a real-world streaming benchmark; the code is available on GitHub. I am excited to tell you that most of these multi-tenancy, scale, and security changes are available in the community releases or are on their way to being released.
(1 min – T 12 min) HBase is another cornerstone technology that we rely on extensively, and there are applications on HBase that need to bundle multiple read and write operations into a single unit of work. That’s exactly where Omid comes in. With Omid, applications can execute transactions with ACID properties without worrying about performance and fault tolerance. Omid executes millions of transactions per day for our incremental content management platform for next-gen search and personalization products. And I am pleased to say that the same technology is now available as a new Apache incubator project.

Applications that need to bundle multiple read and write operations on HBase into logically indivisible units of work can use Omid to execute transactions with ACID (Atomicity, Consistency, Isolation, Durability) properties, just as they would use transactions in the relational database world. Omid extends the HBase key-value access API with transaction semantics. It can be exercised either directly or via higher-level data management APIs. For example, Apache Phoenix (SQL-on-top-of-HBase) might use Omid as its transaction management component.

• Development. Omid is backward-compatible with HBase APIs, making it developer friendly. Minimal extensions are introduced to enable transactions.
• Semantics. Omid implements a popular, well-tested Snapshot Isolation (SI) consistency paradigm that is supported by major SQL and NoSQL technologies (for example, Percolator).
• Scalability. Omid provides a highly scalable, lock-free implementation of SI. To the best of our knowledge, it is the only open source NoSQL platform that can scale beyond 100K transactions per second.
• Reliability. Omid has a high-availability (HA) mode, in which the core service operates as a primary-backup process pair with automatic failover. The HA support has zero overhead on the mainstream operation.
• Simplicity. Omid leverages the HBase infrastructure for managing its own metadata. It entails no additional services apart from those used by HBase.
• Track Record. Omid is already in use by very-large-scale production systems at Yahoo.
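
The snapshot-isolation behaviour described above can be sketched with a toy in-memory multi-version store. This is an illustration of the general technique only, not Omid's API or architecture: the `Store`, `Transaction`, and `ConflictError` names are hypothetical, everything runs in a single thread, and conflicts are detected optimistically at commit time instead of with locks.

```python
class ConflictError(Exception):
    """Raised when another transaction committed a write to the same key first."""


class Store:
    def __init__(self):
        self._clock = 0
        self._versions = {}  # key -> list of (commit_ts, value), oldest first

    def _next_ts(self):
        self._clock += 1
        return self._clock

    def begin(self):
        return Transaction(self, self._next_ts())


class Transaction:
    def __init__(self, store, start_ts):
        self._store = store
        self.start_ts = start_ts
        self._writes = {}  # buffered until commit

    def get(self, key):
        if key in self._writes:
            return self._writes[key]
        # Read the newest version committed before this transaction started.
        for commit_ts, value in reversed(self._store._versions.get(key, [])):
            if commit_ts < self.start_ts:
                return value
        return None

    def put(self, key, value):
        self._writes[key] = value

    def commit(self):
        versions = self._store._versions
        # Write-write conflict check: abort if any written key changed
        # after our snapshot was taken.
        for key in self._writes:
            history = versions.get(key, [])
            if history and history[-1][0] > self.start_ts:
                raise ConflictError(key)
        commit_ts = self._store._next_ts()
        for key, value in self._writes.items():
            versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts


store = Store()
setup = store.begin()
setup.put("row_a", 1)
setup.put("row_b", 1)  # multi-row write applied under one commit timestamp
setup.commit()

t1 = store.begin()
t2 = store.begin()
t1.put("row_a", 10)
t1.commit()

seen_by_t2 = t2.get("row_a")  # t2 still reads its own snapshot
t2.put("row_a", 20)
try:
    t2.commit()
    conflicted = False
except ConflictError:
    conflicted = True
print(seen_by_t2, conflicted)
```

Readers never block writers in this scheme: each transaction reads from the snapshot fixed at its start timestamp, and only the later of two overlapping writers to the same key is aborted.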
