Joint Statistical Meetings 2008
2. Data Analysis at Facebook
Jeff Hammerbacher, Ding Zhou*
Facebook Inc.
3. Outline
• How does Facebook work
• Managing Big Data
• Data Analysis for Business Intelligence
• Data Analysis for “Artificial Intelligence”
• Questions
9. Facebook Data
▪ Social Graph Data
▪ The Nodes:
▪ 100m+ users; 100+ dimensions per user (numerical, text, categorical);
▪ 350k registrations daily;
▪ The Edges:
▪ 200+ friends per user (median);
▪ 20 categories of edges (fb friends, co-workers, family, etc.);
▪ Social Behavior Data
▪ Social Interactions: interactions among users, via 100+ interaction types;
▪ Social Actions: between users and 33k+ facebook apps, via 200+ action types;
▪ Social Content Data
▪ Content of Posts, Notes, Photos, Video, etc
10. Managing Big Data
▪ Data scale [backend]:
▪ Over 1.3 PB raw capacity in largest cluster;
▪ Nearly 2 TB uncompressed data per day;
▪ Over 20 TB read/write per day;
▪ Distributed Data management:
▪ HDFS/Hadoop (MapReduce in Java);
▪ MetaStore (MetaData management);
▪ Hive QL (Query language on Hadoop+MetaStore);
▪ Usage:
▪ at least 50 engineers have run Hadoop jobs;
▪ 3,514 Jobs weekly;
▪ 821 Projections, 152 Joins, 800 Aggregates, 600 Loaders weekly
11. Hadoop - MapReduce in Java
[Figure: MapReduce word-count example. Input text "facebook data team uses hadoop for data analysis"; mappers emit (word, 1) pairs such as facebook:1, data:1, team:1; reducers sum the pairs per word, e.g. data:2, hadoop:1, analysis:1.]
MapReduce Execution Flow
[Dean, J and Ghemawat, S, 2004]
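The word-count example on this slide can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop code; the function names and the way the input is split into three lines are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Assumed input split: each line stands in for the input of one mapper.
lines = ["facebook data team", "uses hadoop for", "data analysis"]
counts = word_count(lines)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here all three steps share one process so the data flow is easy to follow.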
13. Data for Business Intelligence
▪ General Goal:
▪ support growth and monetization strategies, and product decisions
▪ User Behavior Studies
▪ NUX: Longitudinal study using LARS and recursive partitioning to identify features predictive of engagement;
▪ Identity*: Unsupervised learning over user session data to identify common usage patterns. Techniques employed include K-Means, PageRank, dimension reduction methods;
▪ Experimentation Platform
▪ Columbus*: Top-level site health metrics; drill down by user groups (country, age, gender...);
▪ Columbus++*: A/B testing for the impact of site changes on site health metrics;
▪ Reporting System
▪ ad-hoc analysis done by Hive queries
* projects that Ding Zhou participates in;
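A/B testing of a site change against a health metric can be illustrated with a two-proportion z-test. This is a minimal sketch of one common way to compare a control and a treatment group; the slides do not say which test Columbus++ uses, and the function name and the visitor counts below are invented for the example.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Compare two conversion rates; return (z statistic, two-sided p-value)."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical data: 10,000 users per group, the metric is "user was active".
z, p_value = two_proportion_z_test(1000, 10000, 1100, 10000)
```

Drilling down by user group (country, age, gender) amounts to running the same comparison on each slice, which increases the number of tests and therefore the chance of a spurious result.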
14. Columbus
Geographical bird's-eye view of growth by country
Comparison between user groups
15. Data Analysis for “Artificial Intelligence”
-- predicting user social behavior
16. who the user will interact with
• predict interactions between friends
• features are user profile and browsing history
• tried linear models and tree models
• applied for search, newsfeed, etc
17. who the user hasn’t found yet
• missing edge prediction problem
• observations are friend/non-friend pairs
• features include profile and local graph info
• profile info more informative
• graph info supplemental if profile incomplete
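The "local graph info" used for missing-edge prediction can be illustrated with two standard neighborhood features, common-neighbor count and Jaccard similarity, computed for every non-friend pair. This is a sketch under assumptions: the slides do not list the actual features, the toy graph and all names are invented, and a real model would feed these values in alongside profile features rather than rank on one of them.

```python
from itertools import combinations


def graph_features(graph, u, v):
    """Local graph features for a candidate pair (u, v)."""
    shared = graph[u] & graph[v]
    union = graph[u] | graph[v]
    return {
        "common_neighbors": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


def rank_missing_edges(graph):
    """Score every non-friend pair and sort by common-neighbor count."""
    candidates = []
    for u, v in combinations(sorted(graph), 2):
        if v not in graph[u]:  # only pairs that are not friends yet
            candidates.append(((u, v), graph_features(graph, u, v)))
    candidates.sort(key=lambda item: item[1]["common_neighbors"], reverse=True)
    return candidates


# Toy undirected friendship graph stored as adjacency sets (hypothetical users).
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}
ranked = rank_missing_edges(graph)
```

Here the pair with the most shared friends comes first, which matches the intuition that graph information helps most when a profile is too sparse to compare.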
18. what applications the user may like*
• 33k apps, only 0.1% of them used;
• a different recommendation problem;
• prediction model not applicable, user preference unavailable;
• build a prediction model to infer “user ratings”;
• user-based + item-based recommendation
• how to combine profile, social graph, ratings?
* projects that Ding Zhou participates in;
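One way to picture inferring "user ratings" and then applying item-based recommendation is sketched below. The rating rule (a log-damped action count), the cosine similarity, the app names, and the usage numbers are all assumptions chosen for illustration; the slides do not describe the model that was actually built.

```python
import math


def infer_rating(action_count):
    """Hypothetical implicit rating: log-damped number of actions in an app."""
    return math.log1p(action_count)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(
        sum(x * x for x in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(usage, user, top_n=1):
    """Item-based recommendation: score unseen apps by similarity to used apps."""
    # Build one vector per app: app -> {user: inferred rating}.
    items = {}
    for u, apps in usage.items():
        for app, count in apps.items():
            items.setdefault(app, {})[u] = infer_rating(count)
    seen = usage[user]
    scores = {}
    for candidate in items:
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(items[candidate], items[used]) * infer_rating(count)
            for used, count in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical usage log: user -> {app: number of actions}.
usage = {
    "u1": {"chess": 10, "quiz": 3},
    "u2": {"chess": 8, "poker": 5},
    "u3": {"quiz": 4, "farm": 6},
    "u4": {"chess": 2},
}
suggestions = recommend(usage, "u4", top_n=3)
```

A user-based variant would compare user vectors instead of app vectors; combining either with profile and social graph signals is the open question the slide raises.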
19. what content is interesting*
• newsfeed as the main content distribution channel
• stories generated by 100s of social actions: on the site, platform, or the Web
• <0.1% of possible stories are shown
• predictions built on story features and user browsing history
* projects that Ding Zhou participates in;
20. Challenges in Data
- 100s of TBs of meaningful data available
- 1,000s of non-trivial features
- sampling not always applicable (e.g. small app has no user data)
- prediction requirements
▪ models regularly applied to 10 billion novel samples
▪ models used on-the-fly for 100k samples in 50 ms
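The prediction requirements translate into a tight per-sample budget, which the arithmetic below makes explicit. The helper names are hypothetical, and the last line is only a scale comparison: it assumes, purely for illustration, that the 10 billion batch samples were scored by a single stream running at the on-the-fly rate.

```python
def per_sample_budget_us(samples, budget_ms):
    """Time available per sample, in microseconds."""
    return budget_ms * 1000 / samples


def throughput_per_second(samples, budget_ms):
    """Samples that must be scored per second to meet the budget."""
    return samples * 1000 / budget_ms


# On-the-fly requirement from the slide: 100k samples in 50 ms.
online_us = per_sample_budget_us(100_000, 50)
online_rate = throughput_per_second(100_000, 50)

# Scale comparison only (assumed single stream at the on-the-fly rate).
batch_hours = 10_000_000_000 / online_rate / 3600
```

Half a microsecond per sample leaves room for little more than a few arithmetic operations per feature, which is why model evaluation cost matters as much as accuracy here.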
21. Special Machine Learning Problems
- use machine learning to predict user behavior
▪ labels: insufficient; inferred implicitly; imbalanced;
▪ features: high-dimensional; strongly correlated; noisy;
- scale requires distributed algorithms
▪ in-house implementation of tree ensemble methods (bagging predictors)
▪ larger training sets grant performance improvements
- speed and accuracy improvements underway
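Bagging, the ensemble technique named above, trains each model on a bootstrap resample of the data and combines the models by majority vote. The sketch below uses one-split decision stumps on a single machine to keep it short; it is not the in-house distributed implementation, and the data, names, and parameters are invented for the example.

```python
import random
from collections import Counter


def fit_stump(rows):
    """Fit a one-split tree (decision stump) by exhaustive search over splits."""
    best = None
    for f in range(len(rows[0][0])):
        for threshold in sorted({x[f] for x, _ in rows}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum(
                    (left if x[f] <= threshold else right) != y for x, y in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, left, right)
    return best[1:]


def predict_stump(stump, x):
    """Apply one stump: compare a single feature against its threshold."""
    f, threshold, left, right = stump
    return left if x[f] <= threshold else right


def fit_bagging(rows, n_models=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_models)]


def predict_bagging(models, x):
    """Combine the ensemble by majority vote."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (feature vector, binary label).
rows = [
    ((0.1, 1.0), 0), ((0.2, 0.8), 0), ((0.3, 0.9), 0),
    ((0.7, 0.2), 1), ((0.8, 0.1), 1), ((0.9, 0.3), 1),
]
models = fit_bagging(rows)
low = predict_bagging(models, (0.05, 1.0))
high = predict_bagging(models, (0.95, 0.0))
```

Because every model is trained independently on its own resample, the training loop parallelizes naturally, which is one reason bagging suits distributed settings.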