Joint Statistical Meetings 2008

Data Analysis at Facebook

Jeff Hammerbacher, Ding Zhou*
Facebook Inc.

Outline
• How does Facebook work
• Managing Big Data
• Data Analysis for Business Intelligence
• Data Analysis for “Artificial Intelligence”
• Questions

Profile page - content generation portal

Newsfeed page - content consumption portal

Friends page - social graph portal

App page - social app platform

Facebook Data
▪ Social Graph Data
▪ The Nodes:
▪ 100m+ users; 100+ dimensions each user (numerical, text, categorical);
▪ 350k registrations daily;
▪ The Edges:
▪ 200+ friends each user (median);
▪ 20 categories of edges (fb friends, co-workers, family, etc.);

▪ Social Behavior Data
▪ Social Interactions: interactions among users, via 100+ interaction types;
▪ Social Actions: between users and 33k+ facebook apps, via 200+ action types;

▪ Social Content Data
▪ Content of Posts, Notes, Photos, Video, etc

Managing Big Data
▪ Data scale [backend]:
▪ Over 1.3 PB raw capacity in largest cluster;
▪ Nearly 2 TB uncompressed data per day;
▪ Over 20 TB read/write per day;
▪ Distributed Data management:
▪ HDFS/Hadoop (MapReduce in Java);
▪ MetaStore (MetaData management);
▪ Hive QL (Query language on Hadoop+MetaStore);
▪ Usage:
▪ at least 50 engineers have run Hadoop jobs
▪ 3,514 jobs weekly
▪ 821 projections, 152 joins, 800 aggregates, 600 loaders weekly

Hadoop - MapReduce in Java

[Figure: word-count example. The input words (facebook, data, team, uses, hadoop, for, data, analysis) are mapped to (word, 1) pairs, then reduced to per-word totals: facebook:1, data:2, analysis:1, team:1, uses:1, hadoop:1, for:1]
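The word-count example on this slide can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's Java API; the function names and the split of the input into three records are assumptions made for the sketch.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Assumed split of the slide's example text into input records.
lines = ["facebook data team", "uses hadoop for", "data analysis"]
counts = word_count(lines)
```

In a real cluster each record would go to a separate mapper and each key to a reducer; here everything runs in one process so the data flow is easy to follow.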

MapReduce Execution Flow
[Dean, J and Ghemawat, S, 2004]

Data Analysis for Business Intelligence

Data for Business Intelligence
▪ General Goal:
▪ support growth and monetization strategies, and product decisions
▪ User Behavior Studies
▪ NUX: Longitudinal study using LARS and recursive partitioning to identify features predictive of engagement;
▪ Identity*: Unsupervised learning over user session data to identify common usage patterns. Techniques employed include K-Means, PageRank, dimension reduction methods;
▪ Experimentation Platform
▪ Columbus*: Top-level site health metrics; drill down by user groups (country, age, gender...);
▪ Columbus++*: A/B testing for impact of site change on site health metrics;

▪ Reporting System
▪ ad-hoc analysis done by Hive queries
* projects that Ding Zhou participates in;
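The slides do not say which statistical test the A/B testing platform uses. As one common choice, a two-proportion z-test on a binary site health metric could look like the sketch below; the function name and the user counts are hypothetical.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare conversion rates of control (a) and treatment (b).

    Returns the z statistic and the two-sided p-value under the
    normal approximation with a pooled variance estimate.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: users who returned the next day in each group.
z, p_value = two_proportion_z_test(1000, 10000, 1100, 10000)
```

With these made-up counts the treatment rate is one percentage point higher, and the test reports how likely a gap of that size would be if the site change had no effect.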

Columbus
Geographical bird's-eye view of growth by country

Comparison between user groups

Data Analysis for “Artificial Intelligence”
-- predicting user social behavior

who the user will interact with

• predict interactions between friends

• features are user profile and browsing history

• tried linear models and tree models

• applied for search, newsfeed, etc

who the user hasn’t found yet

• missing edge prediction problem

• observations are friend/non-friend pairs

• features include profile and local graph info

• profile info more informative

• graph info supplemental if profile incomplete
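To make "profile and local graph info" concrete, here is a minimal sketch of features one might compute for a candidate friend/non-friend pair. The specific features (common neighbors, Jaccard overlap, matching profile fields), the function name, and the toy data are assumptions; the slides do not list the features actually used.

```python
def pair_features(graph, profiles, u, v):
    """Build a small feature dict for a candidate (u, v) pair."""
    shared_friends = graph[u] & graph[v]
    all_friends = graph[u] | graph[v]
    # Count profile fields where both users report the same value.
    shared_fields = sum(
        1 for field, value in profiles[u].items()
        if profiles[v].get(field) == value
    )
    return {
        "common_neighbors": len(shared_friends),
        "jaccard": len(shared_friends) / len(all_friends) if all_friends else 0.0,
        "shared_profile_fields": shared_fields,
    }

# Toy friendship graph (adjacency sets) and profiles, both hypothetical.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"bob", "carol"},
}
profiles = {
    "alice": {"school": "MIT", "city": "Boston"},
    "bob": {"school": "MIT", "city": "Austin"},
    "carol": {"school": "UCLA", "city": "Boston"},
    "dave": {"school": "MIT", "city": "Boston"},
}
features = pair_features(graph, profiles, "alice", "dave")
```

Feature dicts like this, labeled as friend or non-friend, would then feed a classifier. When a profile is sparse, the graph features still carry signal, which matches the "supplemental" role described above.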

what applications the user may like*

• 33k apps, only 0.1% of them used;

• a different recommendation problem;

• prediction model not applicable, user preference unavailable;

• build a prediction model to infer “user ratings”;

• user-based + item-based recommendation

• how to combine profile, social graph, ratings?

* projects that Ding Zhou participates in;
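As an illustration of the item-based half of the recommendation approach, the sketch below scores unused apps by their cosine similarity to the apps a user already uses. It treats app usage as a binary implicit rating, which is an assumption standing in for the inferred "user ratings"; the function names and data are hypothetical.

```python
import math
from collections import defaultdict

def item_similarity(usage):
    """Cosine similarity between apps over binary user-usage vectors."""
    users_by_app = defaultdict(set)
    for user, apps in usage.items():
        for app in apps:
            users_by_app[app].add(user)
    sims = {}
    for a in users_by_app:
        for b in users_by_app:
            if a != b:
                overlap = len(users_by_app[a] & users_by_app[b])
                norm = math.sqrt(len(users_by_app[a]) * len(users_by_app[b]))
                sims[(a, b)] = overlap / norm
    return sims

def recommend(usage, user, top_n=2):
    """Rank apps the user has not used by summed similarity to apps they use."""
    sims = item_similarity(usage)
    scores = defaultdict(float)
    for used in usage[user]:
        for (a, b), s in sims.items():
            if a == used and b not in usage[user]:
                scores[b] += s
    # Sort by descending score, breaking ties by app name.
    return sorted(scores, key=lambda app: (-scores[app], app))[:top_n]

# Hypothetical usage data: user -> set of apps used.
usage = {
    "u1": {"quiz", "poker"},
    "u2": {"quiz", "poker", "chess"},
    "u3": {"quiz", "chess"},
    "u4": {"poker"},
}
suggestions = recommend(usage, "u4")
```

A user-based variant would swap the roles of users and apps. How to fold profile and social graph signals into these scores is the open question the slide raises.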

what content is interesting*
• newsfeed as the main content distribution channel

• stories generated by 100s of social actions: on the site, platform, or the Web

• <0.1% of possible stories are shown

• predictions built on story features, and user browsing history

* projects that Ding Zhou participates in;

Challenges in Data
- 100s of TBs of meaningful data available
- 1,000s of non-trivial features
- sampling not always applicable (e.g. small app has no user data)
- prediction requirements
▪ models regularly applied to 10 billion novel samples
▪ models used on-the-fly for 100k samples in 50 ms

Special Machine Learning Problems
- use machine learning to predict user behavior
▪ labels: insufficient; inferred implicitly; imbalanced;
▪ features: high-dimensional; strongly correlated; noisy;

- scale requires distributed algorithms
▪ in-house implementation of tree ensemble methods (bagging predictors)
▪ larger training sets yield performance improvements

- speed and accuracy improvements underway
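The in-house tree ensemble implementation is not shown in the slides. To illustrate the bagging idea only, here is a single-machine sketch that trains one-feature decision stumps on bootstrap resamples and combines them by majority vote; all names and the toy data are assumptions.

```python
import random
from collections import Counter

def fit_stump(samples):
    """Pick the threshold with the fewest training errors.

    samples: list of (x, label) with label in {0, 1}.
    The stump predicts 1 when x > threshold.
    """
    best = None
    for threshold, _ in samples:
        errors = sum((x > threshold) != bool(y) for x, y in samples)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def fit_bagging(samples, n_models=25, seed=0):
    """Train one stump per bootstrap resample of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(fit_stump(bootstrap))
    return models

def predict(models, x):
    """Majority vote over the stumps' predictions."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]

# Toy data: label is 1 when x >= 5.
samples = [(x, int(x >= 5)) for x in range(10)]
models = fit_bagging(samples)
```

Because each stump depends only on its own resample, the training loop parallelizes naturally, which is one reason bagging suits a distributed setting.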

tip of the iceberg

Questions?
