YAHOO &
HADOOP
USING AND IMPROVING
APACHE HADOOP AT YAHOO!

                Eric Baldeschwieler
                VP, Hadoop Software
AGENDA

         •  Brief Overview
         •  Hadoop @ Yahoo!
         •  Hadoop Momentum
         •  The Future of Hadoop
WHAT’S
    happening

                      - Big Data is here!
                      - unstructured data
                      - petabyte scale
                      - operationally critical




Flickr : sub_lime79
TURNING DATA
   INTO INSIGHTS

        machine learning
logistic regression                         time series
      content clustering
      algorithms ad inventory modeling
            user interest prediction
                                        factorization models
Flickr : NASA Goddard Photo and Video
MAKING YAHOO
    RELEVANT




Flickr : ogimogi
HADOOP:
    POWERING
    YAHOO!
                 science + big data + insight =
                 personal relevance = VALUE




Flickr : DDFic
WHAT IS HADOOP?
Stack (top to bottom)
•  Programming Languages: Pig, Hive
•  Computation: MapReduce
•  Storage: HDFS

Commodity
•  Computers
•  Network

Focus on
•  Simplicity
•  Redundancy
•  Scale
•  Availability


Transforms commodity equipment into a service that:
•  HDFS – Stores petabytes of data reliably
•  MapReduce – Allows huge distributed computations

Key Attributes
•  Redundant and reliable – Doesn’t stop or lose data even as hardware fails
•  Easy to program – Our rocket scientists use it directly!
•  Very powerful – Allows the development of big data algorithms & tools
•  Batch processing centric
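
A minimal sketch of the map / shuffle / reduce model behind MapReduce, written in plain Python and run on a single machine. It does not use Hadoop's API and is not taken from the slides; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are hypothetical, chosen only to illustrate the idea. On a real cluster, the map and reduce steps would run in parallel across many machines, with input and output stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input standing in for files that would live in HDFS.
    logs = ["hadoop stores data", "hadoop processes data"]
    print(word_count(logs))
```

The programmer writes only the map and reduce functions; grouping by key and distributing the work are the framework's job, which is one reading of the "easy to program" claim above.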
WHAT HADOOP ISN’T

•  A replacement for relational and data warehouse systems
•  A transactional / online / serving system
•  A low latency or streaming solution

HADOOP IN THE ENTERPRISE
                         Business Intelligence Applications

HADOOP CLUSTER(S)                                  RDBMS, EDW, Data Marts

Interactions                                       Transactions, Structured Data
Semi-Structured or Un-Structured Data

Web Logs, Server Logs,                             Business
Social Media, etc…                                 Applications

HADOOP @ YAHOO!




HADOOP @
YAHOO!
“Where Science meets Data”

HADOOP CLUSTERS
Tens of thousands of servers
10s of Petabytes

PRODUCTS
Data Analytics
Content Optimization
Content Enrichment
Yahoo! Mail Anti-Spam
Advertising Products
Ad Optimization
Ad Selection
Big Data Processing & ETL

APPLIED SCIENCE
User Interest Prediction
Ad inventory prediction
Machine learning - search ranking
Machine learning - ad targeting
Machine learning - spam filtering

FROM PROJECT TO
CORE PLATFORM
40K+ Servers
170 PB Storage
5M+ Monthly Jobs

“Behind every click”

[Chart: growth of Hadoop at Yahoo!, 2006 to 2010. Left axis: Thousands of Servers (0 to 90). Right axis: Petabytes (0 to 250). Phases labeled in order: Research, Science Impact, Daily Production.]

HADOOP POWERS THE
YAHOO! NETWORK



    advertising optimization data analytics
           machine learning search ranking
 advertising data systems   Yahoo! Mail anti-spam
  audience, ad and search pipelines          ad selection

 Yahoo! Homepage Content Optimization
                   ad inventory prediction
         user interest prediction

CASE STUDY
  YAHOO! HOMEPAGE

Personalized for each visitor

Result: twice the engagement

Recommended links          News Interests             Top Searches
+79% clicks                +160% clicks               +43% clicks
vs. randomly selected      vs. one size fits all      vs. editor selected

CASE STUDY
 YAHOO! HOMEPAGE

•  Serving Maps
       •  Users - Interests

•  Five Minute Production

•  Weekly Categorization models

[Diagram]
SCIENCE HADOOP CLUSTER: USER BEHAVIOR, CATEGORIZATION MODELS (weekly)
   »  Machine learning to build ever better categorization models
PRODUCTION HADOOP CLUSTER: USER BEHAVIOR, SERVING MAPS (every 5 minutes)
   »  Identify user interests using Categorization models
SERVING SYSTEMS, ENGAGED USERS

Build customized home pages with latest data (thousands / second)

CASE STUDY
YAHOO! MAIL

    Enabling quick response in the spam arms race

•  450M mail boxes
•  5B+ deliveries/day
•  Antispam models retrained every few hours on Hadoop

40% less spam than Hotmail and 55% less spam than Gmail

[Diagram labels: SCIENCE, PRODUCTION]

YAHOO! & APACHE HADOOP
Yahoo! has contributed 70+% of Apache Hadoop code to date

Hadoop is not our business, but Hadoop is key to our business
•  Yahoo! benefits from open source ecosystem around Hadoop
•  Hadoop drives revenue at Yahoo! by making our core products better

We need Hadoop to be rock solid
•  We invest heavily in core Hadoop development
•  We focus on scalability, reliability, availability

We fix bugs before you see them
•  We run very large clusters
•  We have a large QA effort
•  We run a huge variety of workloads

We are good Apache Hadoop citizens
•  We contribute our work to Apache
•  We share the exact code we run

HADOOP
MOMENTUM




HADOOP IS GOING
MAINSTREAM

[Chart: 2007 to 2010. Source: The Datagraph Blog]

THE PLATFORM EFFECT
  BIRTH OF AN ECOSYSTEM

Apache Hadoop: Enhance Hadoop Ecosystem

•  [logo] and other Early Adopters
   Scale and productize Hadoop
•  Orgs with Internet Scale Problems
   Add tools / frameworks, enhance Hadoop
•  Service Providers
   Grow ecosystem - Training, support, enhancements
•  Mainstream / Enterprise adoption
   Drive further development, enhancements

Virtuous Circle!
•  Investment -> Adoption
•  Adoption -> Investment

THE FUTURE OF
HADOOP




MAKING HADOOP ENTERPRISE-READY
WHAT’S NEXT
Hadoop is far from “done”
       •  Current implementation is showing its age
       •  Need to address several deficiencies in scalability, flexibility,
          ease of use & performance

Yahoo! is working on Next Generation of Hadoop
       •  MapReduce: Rewrite to improve performance;
          pluggable support for new programming models
       •  HDFS: Adding volumes to improve scalability;
          Flush & sync support for applications that log to HDFS

Apache should remain the hub of Hadoop ecosystem
       •  Yahoo! contributes all Hadoop changes back to Apache Hadoop
       •  Everyone benefits from shared neutral foundation

Questions?





Mais conteúdo relacionado

Mais procurados

Building Big Data Applications
Building Big Data ApplicationsBuilding Big Data Applications
Building Big Data ApplicationsRichard McDougall
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...yaevents
 
Emergent Distributed Data Storage
Emergent Distributed Data StorageEmergent Distributed Data Storage
Emergent Distributed Data Storagehybrid cloud
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopSavvycom Savvycom
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduceRyan Tabora
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?Hortonworks
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousingDataWorks Summit
 
Flexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinFlexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinDmitriy Ryaboy
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetupiwrigley
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopFebiyan Rachman
 
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopEnrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopHortonworks
 
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Cloudera, Inc.
 

Mais procurados (20)

201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
 
Building Big Data Applications
Building Big Data ApplicationsBuilding Big Data Applications
Building Big Data Applications
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
 
Emergent Distributed Data Storage
Emergent Distributed Data StorageEmergent Distributed Data Storage
Emergent Distributed Data Storage
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousing
 
Flexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinFlexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant Twin
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopEnrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
 
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
 

Semelhante a hadoop @ Ibmbigdata

Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingm_hepburn
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondTeradata Aster
 
2012 10 bigdata_overview
2012 10 bigdata_overview2012 10 bigdata_overview
2012 10 bigdata_overviewjdijcks
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Hortonworks
 
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2Calpont Corporation
 
Big Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage StrategyBig Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage StrategyHitachi Vantara
 
Introduction to Hortonworks Data Platform for Windows
Introduction to Hortonworks Data Platform for WindowsIntroduction to Hortonworks Data Platform for Windows
Introduction to Hortonworks Data Platform for WindowsHortonworks
 
Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesDataWorks Summit
 
Hadoop for shanghai dev meetup
Hadoop for shanghai dev meetupHadoop for shanghai dev meetup
Hadoop for shanghai dev meetupRoby Chen
 
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendIntroducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendCaserta
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013Michael Hiskey
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio
 
Anexinet Big Data Solutions
Anexinet Big Data SolutionsAnexinet Big Data Solutions
Anexinet Big Data SolutionsMark Kromer
 
Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHortonworks
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overviewRohit Jain
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop siliconsudipt
 
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsCombining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsDataWorks Summit
 

Semelhante a hadoop @ Ibmbigdata (20)

Yahoo & Hadoop
Yahoo & HadoopYahoo & Hadoop
Yahoo & Hadoop
 
Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-banking
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and Beyond
 
2012 10 bigdata_overview
2012 10 bigdata_overview2012 10 bigdata_overview
2012 10 bigdata_overview
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
 
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
 
Big Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage StrategyBig Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage Strategy
 
Introduction to Hortonworks Data Platform for Windows
Introduction to Hortonworks Data Platform for WindowsIntroduction to Hortonworks Data Platform for Windows
Introduction to Hortonworks Data Platform for Windows
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data Architectures
 
Hadoop for shanghai dev meetup
Hadoop for shanghai dev meetupHadoop for shanghai dev meetup
Hadoop for shanghai dev meetup
 
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendIntroducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013
 
Anexinet Big Data Solutions
Anexinet Big Data SolutionsAnexinet Big Data Solutions
Anexinet Big Data Solutions
 
Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - Jaspersoft
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overview
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop
 
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsCombining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
 

Último

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 

hadoop @ Ibmbigdata

  • 1. YAHOO & HADOOP: Using and Improving Apache Hadoop at Yahoo! Eric Baldeschwieler, VP, Hadoop Software
  • 2. AGENDA: Brief Overview; Hadoop @ Yahoo!; Hadoop Momentum; The Future of Hadoop
  • 3. WHAT'S HAPPENING: Big Data is here! Unstructured data; petabyte scale; operationally critical. (Flickr: sub_lime79)
  • 4. TURNING DATA INTO INSIGHTS: machine learning, logic regression, time series, content clustering, algorithms, ad inventory modeling, user interest prediction, factorization models. (Flickr: NASA Goddard Photo and Video)
  • 5. MAKING YAHOO RELEVANT (Flickr: ogimogi)
  • 6. HADOOP: POWERING YAHOO! science + big data + insight = personal relevance = VALUE (Flickr: DDFic)
  • 7. WHAT IS HADOOP?
      Stack: programming languages (Pig, Hive); computation (MapReduce); storage (HDFS).
      Commodity: computers, network.
      Focus on: simplicity, redundancy, scale, availability.
      Transforms commodity equipment into a service that:
      •  HDFS: stores petabytes of data reliably
      •  MapReduce: allows huge distributed computations
      Key attributes:
      •  Redundant and reliable: doesn't stop or lose data even as hardware fails
      •  Easy to program: our rocket scientists use it directly!
      •  Very powerful: allows the development of big data algorithms & tools
      •  Batch processing centric
  • 8. WHAT HADOOP ISN'T: a replacement for relational and data warehouse systems; a transactional / online / serving system; a low latency or streaming solution
  • 9. HADOOP IN THE ENTERPRISE (diagram labels): Business Intelligence Applications; Hadoop Cluster(s); Data Marts; RDMS; EDW; Interactions; Transactions, Structured Data; Semi-Structured or Un-Structured Data (Web Logs, Server Logs, Social Media, etc.); Business Applications
  • 11. HADOOP @ YAHOO! "Where Science meets Data"
      •  PRODUCTS: Data Analytics; Content Optimization; Content Enrichment; Yahoo! Mail Anti-Spam; Advertising Products; Ad Optimization; Ad Selection; Big Data Processing & ETL
      •  HADOOP CLUSTERS: tens of thousands of servers; 10s of petabytes
      •  APPLIED SCIENCE: User Interest Prediction; Ad inventory prediction; Machine learning - search ranking; Machine learning - ad targeting; Machine learning - spam filtering
  • 12. FROM PROJECT TO CORE PLATFORM: 40K+ servers; 170 PB storage; 5M+ monthly jobs. "Behind every click". [Chart: thousands of servers and petabytes by year, 2006 to 2010, annotated Research, Science Impact, Daily Production]
  • 13. HADOOP POWERS THE YAHOO! NETWORK: advertising optimization; data analytics; machine learning; search ranking; advertising data systems; Yahoo! Mail anti-spam; audience, ad and search pipelines; ad selection; Yahoo! Homepage; content optimization; ad inventory prediction; user interest prediction
  • 14. CASE STUDY: YAHOO! HOMEPAGE. Personalized for each visitor. Result: twice the engagement.
      •  Recommended links: +79% clicks vs. randomly selected
      •  News Interests: +160% clicks vs. one size fits all
      •  Top Searches: +43% clicks vs. editor selected
  • 15. CASE STUDY: YAHOO! HOMEPAGE
      •  SCIENCE HADOOP CLUSTER: machine learning to build ever better categorization models. User behavior -> categorization models (weekly).
      •  PRODUCTION HADOOP CLUSTER: identify user interests using categorization models. User behavior -> serving maps (every 5 minutes).
      •  SERVING SYSTEMS: build customized home pages with latest data (thousands / second) -> engaged users.
      •  Serving maps: users - interests. Five-minute production. Weekly categorization models.
  • 16. CASE STUDY: YAHOO! MAIL. Enabling quick response in the spam arms race.
      •  450M mail boxes
      •  5B+ deliveries/day
      •  SCIENCE: Antispam models retrained every few hours on Hadoop
      •  PRODUCTION: 40% less spam than Hotmail and 55% less spam than Gmail
  • 17. YAHOO! & APACHE HADOOP. Yahoo! has contributed 70+% of Apache Hadoop code to date.
      Hadoop is not our business, but Hadoop is key to our business
      •  Yahoo! benefits from open source eco-system around Hadoop
      •  Hadoop drives revenue at Yahoo! by making our core products better
      We need Hadoop to be rock solid
      •  We invest heavily in core Hadoop development
      •  We focus on scalability, reliability, availability
      We fix bugs before you see them
      •  We run very large clusters
      •  We have a large QA effort
      •  We run a huge variety of workloads
      We are good Apache Hadoop citizens
      •  We contribute our work to Apache
      •  We share the exact code we run
  • 19. HADOOP IS GOING MAINSTREAM [Chart by year, 2007 to 2010; source: The Datagraph Blog]
  • 20. THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM
      •  [logo] and other Early Adopters: scale and productize Hadoop
      •  Orgs with Internet Scale Problems: add tools / frameworks, enhance Hadoop
      •  Service Providers: grow ecosystem - training, support, enhancements
      •  Mainstream / Enterprise adoption: drive further development, enhancements
      Virtuous Circle!
      •  Investment -> Adoption
      •  Adoption -> Investment
      (Diagram labels: Apache Hadoop; Enhance Hadoop Ecosystem)
  • 22. MAKING HADOOP ENTERPRISE-READY: WHAT'S NEXT
      Hadoop is far from "done"
      •  Current implementation is showing its age
      •  Need to address several deficiencies in scalability, flexibility, ease of use & performance
      Yahoo! is working on Next Generation of Hadoop
      •  MapReduce: rewrite to improve performance; pluggable support for new programming models
      •  HDFS: adding volumes to improve scalability; flush & sync support for applications that log to HDFS
      Apache should remain the hub of Hadoop ecosystem
      •  Yahoo! contributes all Hadoop changes back to Apache Hadoop
      •  Everyone benefits from shared neutral foundation
  • 23. Questions?
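
Slide 7 describes MapReduce as the layer that allows huge distributed computations over data stored in HDFS. As a rough illustration of that programming model only, the sketch below runs the three phases (map, shuffle, reduce) of a word count in a single Python process. It is not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical, the input is an in-memory list rather than files in HDFS, and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(record) for record in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input standing in for log lines stored in HDFS.
    lines = ["hadoop stores data", "hadoop processes data"]
    print(run_job(lines))
```

In a real cluster the map and reduce functions would run on many machines in parallel, with the framework handling the shuffle and re-running tasks when hardware fails; the user-supplied logic keeps roughly this shape.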