Data Analysis at Facebook


                  Jeff Hammerbacher, Ding Zhou*
                  Facebook Inc.
Outline
• How does Facebook work
• Managing Big Data
• Data Analysis for Business Intelligence
• Data Analysis for “Artificial Intelligence”
• Questions
How does Facebook work?
Profile page - content generation portal
Newsfeed page - content consumption portal
Friends page - social graph portal
App page - social app platform
Facebook Data
▪   Social Graph Data
    ▪   The Nodes:
        ▪   100m+ users; 100+ dimensions per user (numerical, text, categorical);
        ▪   350k registrations daily;
    ▪   The Edges:
        ▪   200+ friends per user (median);
        ▪   20 categories of edges (fb friends, co-workers, family, etc.);

▪   Social Behavior Data
    ▪   Social Interactions: interactions among users, via 100+ interaction types;
    ▪   Social Actions: between users and 33k+ facebook apps, via 200+ action types;

▪   Social Content Data
    ▪   Content of Posts, Notes, Photos, Video, etc
Managing Big Data
▪   Data scale [backend]:
    ▪   Over 1.3 PB raw capacity in largest cluster;
    ▪   Nearly 2 TB uncompressed data per day;
    ▪   Over 20 TB read/write per day;
▪   Distributed Data management:
    ▪   HDFS/Hadoop (MapReduce in Java);
    ▪   MetaStore (MetaData management);
    ▪   Hive QL (Query language on Hadoop+MetaStore);
    ▪   Usage:
        ▪   at least 50 engineers have run Hadoop jobs
        ▪   3,514 jobs weekly
        ▪   821 projections, 152 joins, 800 aggregates, 600 loaders weekly
Hadoop - MapReduce in Java


[Figure: word-count example]
Input:    "facebook data team" / "uses hadoop for" / "data analysis"
Map:      facebook:1, data:1, team:1 / uses:1, hadoop:1, for:1 / data:1, analysis:1
Shuffle:  analysis:1 / data:1, data:1 / facebook:1 / for:1 / hadoop:1 / team:1 / uses:1
Reduce:   analysis:1, data:2, facebook:1, for:1, hadoop:1, team:1, uses:1



                          MapReduce Execution Flow
                           [Dean, J and Ghemawat, S, 2004]
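
The word-count flow in the figure can be imitated on one machine. The following is a minimal sketch in plain Python, not Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative only, and the three input lines are the ones shown on the slide.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# The three input splits from the slide.
lines = ["facebook data team", "uses hadoop for", "data analysis"]
counts = word_count(lines)
print(sorted(counts.items()))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here everything runs in one process purely to show the data flow.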
Data Analysis for Business Intelligence
Data for Business Intelligence
▪   General Goal:
    ▪   support growth and monetization strategies, and product decisions
▪   User Behavior Studies
    ▪   NUX: Longitudinal study using LARS and recursive partitioning to identify features predictive
        of engagement;
    ▪   Identity*: Unsupervised learning over user session data to identify common usage patterns.
        Techniques employed include K-Means, PageRank, dimension reduction methods;
▪   Experimentation Platform
    ▪   Columbus*: Top-level site health metrics; drill down by user groups (country, age, gender...);
    ▪   Columbus++*: A/B testing for impact of site change on site health metrics;

▪   Reporting System
    ▪   ad-hoc analysis done by Hive queries
                                                              * - underlined are projects that Ding Zhou participates in;
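
The slides do not say which statistics Columbus++ uses. As one common way to judge an A/B test on a rate metric, here is a minimal sketch of a two-proportion z-test; the function name and all counts are hypothetical.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in rates B - A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that A and B share one rate.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical example: 10.0% vs 11.0% of 10,000 users each hit the metric.
z, p = two_proportion_z(1000, 10_000, 1100, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A drill-down by user group would repeat the same comparison within each group, which raises multiple-comparison concerns that this sketch does not address.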
Columbus
[Figure: geographical bird's-eye view of growth by country]
[Figure: comparison between user groups]
Data Analysis for “Artificial Intelligence”
                       -- predicting user social behavior
who the user will interact with

• predict interactions between friends

• features are user profile and browsing history

• tried linear models and tree models

• applied for search, newsfeed, etc
who the user hasn’t found yet

• missing edge prediction problem

• observations are friend/non-friend pairs

• features include profile and local graph info

• profile info more informative

• graph info supplemental if profile incomplete
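
To make "profile and local graph info" concrete, here is a minimal sketch of features for one candidate friend/non-friend pair. The specific features (common friends, Jaccard overlap, matching profile fields), the function name, and the toy data are assumptions for illustration; the slides do not list the features actually used.

```python
def pair_features(u, v, friends, profiles):
    """Build a small feature dict for a candidate pair (u, v)."""
    fu, fv = friends.get(u, set()), friends.get(v, set())
    common = fu & fv
    union = (fu | fv) - {u, v}
    pu, pv = profiles.get(u, {}), profiles.get(v, {})
    # Profile fields that both users filled in with the same value.
    shared_fields = [k for k in pu if k in pv and pu[k] == pv[k]]
    return {
        "common_friends": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "profile_matches": len(shared_fields),
    }

# Toy graph: ann and dan are not friends but share two friends.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
}
profiles = {
    "ann": {"school": "State U", "city": "Austin"},
    "dan": {"school": "State U", "city": "Boston"},
}
features = pair_features("ann", "dan", friends, profiles)
print(features)
```

When a profile is missing, `profile_matches` is 0 and only the graph features carry signal, which mirrors the slide's point that graph info is supplemental when the profile is incomplete.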
what applications the user may like*

• 33k apps, only 0.1% of them used;

• a different recommendation problem;

• prediction model not applicable, user preference unavailable;

• build a prediction model to infer “user ratings”;

• user-based + item-based recommendation

• how to combine profile, social graph, ratings?



                  * projects that Ding Zhou participates in;
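
As an illustration of the item-based half of the approach, here is a minimal sketch that scores unseen apps by cosine similarity to apps a user already has inferred ratings for. The rating values, app names, and function names are hypothetical, and the step that infers the ratings is taken as given.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0

def recommend(user, ratings, top_n=1):
    """Rank apps the user has no rating for by similarity to rated apps."""
    # Invert user -> {app: rating} into app -> {user: rating}.
    by_app = {}
    for u, apps in ratings.items():
        for app, r in apps.items():
            by_app.setdefault(app, {})[u] = r
    seen = ratings[user]
    scores = {}
    for candidate, vec in by_app.items():
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(vec, by_app[app]) * r for app, r in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical inferred ratings in [0, 1].
ratings = {
    "u1": {"chess": 1.0, "quiz": 0.8},
    "u2": {"chess": 0.9, "poker": 0.7},
    "u3": {"quiz": 0.6, "farm": 0.9},
    "u4": {"chess": 0.8, "poker": 0.9},
}
top = recommend("u1", ratings)
print(top)
```

A user-based variant would compare user vectors instead of app vectors; how to blend either with profile and social graph signals is the open question the slide raises.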
what content is interesting*
• newsfeed as the main content distribution channel

• stories generated by 100s of social actions: on the site, platform, or the Web

• <0.1% of possible stories are shown

• predictions built on story features and user browsing history




                    * projects that Ding Zhou participates in;
Challenges in Data
- 100s of TBs of meaningful data available
- 1,000s of non-trivial features
- sampling not always applicable (e.g. small app has no user data)
- prediction requirements
 ▪   models regularly applied for 10 billion novel samples
 ▪   models used on-the-fly for 100k samples in 50 ms
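
The two prediction requirements above imply a per-sample budget that is easy to work out. The sketch below does the arithmetic; applying the online rate to the 10 billion sample batch on a single stream is an assumption made only to give a sense of scale, not something the slides state.

```python
ONLINE_SAMPLES = 100_000          # samples scored per on-the-fly request
ONLINE_BUDGET_MS = 50             # time allowed for those samples
BATCH_SAMPLES = 10_000_000_000    # novel samples scored regularly

# Time available per sample in the online case, in nanoseconds.
per_sample_ns = ONLINE_BUDGET_MS * 1_000_000 // ONLINE_SAMPLES

# Equivalent sustained throughput, in samples per second.
samples_per_second = ONLINE_SAMPLES * 1_000 // ONLINE_BUDGET_MS

# Assumption: one stream scoring the batch at the online rate.
batch_seconds = BATCH_SAMPLES // samples_per_second

print(per_sample_ns, samples_per_second, batch_seconds)
```

So each online prediction has roughly half a microsecond, and a batch of 10 billion samples at the same rate would take about 5,000 seconds on one such stream.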
Special Machine Learning Problems
- use machine learning to predict user behavior
 ▪   labels: insufficient; inferred implicitly; imbalanced;
 ▪   features: high-dimensional; strongly correlated; noisy;


- scale requires distributed algorithms
 ▪   in-house implementation of tree ensemble methods (bagging predictors)
 ▪   larger training sets grant performance improvements


- speed and accuracy improvements underway
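
For readers unfamiliar with bagging predictors, here is a minimal single-machine sketch: each model is trained on a bootstrap sample and predictions are combined by majority vote. It uses one-split decision stumps on toy data and is not the in-house distributed implementation mentioned above; all names are illustrative.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    n_features = len(rows[0][0])
    for f in range(n_features):
        for x, _ in rows:
            t = x[f]
            left = [y for xi, y in rows if xi[f] <= t]
            right = [y for xi, y in rows if xi[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = (sum(y != l_lab for y in left)
                      + sum(y != r_lab for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda x: l_lab if x[f] <= t else r_lab

def bagging_fit(rows, n_models=25, seed=0):
    """Train one stump per bootstrap sample of the training rows."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Majority vote over the ensemble."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Toy one-feature data: label 0 below 0.5, label 1 above.
rows = [((v,), 0) for v in (0.1, 0.2, 0.3, 0.4)] + \
       [((v,), 1) for v in (0.6, 0.7, 0.8, 0.9)]
models = bagging_fit(rows)
print(bagging_predict(models, (0.05,)), bagging_predict(models, (0.95,)))
```

Because each bootstrap model is trained independently, the training loop parallelizes naturally across machines, which is one reason bagging suits the scale described on this slide.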
tip of the iceberg

    Questions?
(c) 2004-2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

joint statistical meeting 2008
