Facebook Hadoop Data & Applications

Hadoop and Hive at Facebook
Data and Applications
Dhruba Borthakur, Ding Zhou


Wednesday, June 10, 2009

Santa Clara Marriott

Who generates this data?

Lots of data is generated on Facebook
»  200 million active users
»  20 million users update their statuses at least once each day
»  More than 850 million photos uploaded to the site each month
»  More than 8 million videos uploaded each month
»  More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week

http://www.slideshare.net/guest5b1607/text-analytics-summit-2009-roddy-lindsay-social-media-happiness-petabytes-and-lols

Where do we store parts of this data?

»  Hadoop/Hive Warehouse
›  4800 cores, 2 petabytes total size

»  Other Hadoop Clusters
•  HDFS-Scribe cluster: 320 cores, 160 TB total size
•  Hadoop Archival Cluster: 80 cores, 200 TB total size
•  Test cluster: 800 cores, 150 TB total size

Data Collection using Scribe

[Diagram components: Web Servers; Scribe MidTier; Network Storage and Servers; Hadoop Hive Warehouse; Oracle RAC; MySQL]

Data Collection using Scribe and HDFS

[Diagram components: Web Servers; Scribe MidTier; Realtime Hadoop Cluster; Hadoop Hive Warehouse; Oracle RAC; MySQL. Label: Hadoop Scribe Integration]

Data Archive: Move old data to cheap storage

[Diagram components: Hadoop Warehouse; distcp; Hadoop Archive Node; NFS; Cheap NAS; Hadoop Archival Cluster (20 TB per node); HADOOP-5048; Hive Query]

Hive User Interfaces

Hive shell access

Hive Web UI

Data Analysis at Facebook

»  Business Intelligence
›  Growth and monetization strategies
›  Product insights & decisions
›  Philosophy: build meta tools and provide easy access to data

»  Artificial Intelligence
›  Recommendation & ranking products
›  Advertising optimization
›  Text analytics
›  Philosophy: model inference; data preparation; model building;

BI: Build centralized reporting tools

»  Top-level site metrics
›  Bird's-eye view of user growth by country
›  Comparison of selected metrics between user groups

BI: Make ad hoc reporting easy

»  Example: “Find the number of status updates mentioning ‘swine flu’ per day last month”

    SELECT a.date, count(1)
    FROM status_updates a
    WHERE a.status LIKE '%swine flu%'
      AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
    GROUP BY a.date
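
To show the shape of the result such a query returns, here is a minimal sketch that runs the same statement against an in-memory SQLite database. SQLite is only a stand-in for Hive, the sample rows are hypothetical, and the `ORDER BY` clause is added so the output order is deterministic. Note that SQLite's `LIKE` is case-insensitive for ASCII text by default, which may differ from Hive's behavior.

```python
import sqlite3

# Hypothetical sample rows; the real status_updates table lives in Hive.
ROWS = [
    ("2009-05-01", "worried about swine flu"),
    ("2009-05-01", "swine flu closed my school"),
    ("2009-05-02", "great weather today"),
    ("2009-05-02", "is swine flu overhyped?"),
    ("2009-06-01", "swine flu again"),  # outside the date range
]

# Same query as on the slide, plus ORDER BY for a stable result order.
QUERY = """
SELECT a.date, count(1)
FROM status_updates a
WHERE a.status LIKE '%swine flu%'
  AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
GROUP BY a.date
ORDER BY a.date
"""


def daily_mentions(rows):
    """Load rows into a throwaway SQLite table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")
    conn.executemany("INSERT INTO status_updates VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for day, mentions in daily_mentions(ROWS):
        print(day, mentions)
```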

Build site metric dashboard in a day
»  Data collection:
›  Define metrics and log format (Hive schema)
›  Add logging to the site (Scribe logging)
›  Create a Hive table partitioned by date
›  Set up metric ETL cron job (Hive -> MySQL/Oracle)
»  Data visualization (using MySQL)
»  Data access (ad hoc queries using Hive)
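
The collection steps above can be sketched end to end in a few lines. This is an illustration under stated assumptions, not the pipeline Facebook ran: the tab-separated log format, the metric names, and the `site_metrics` table are hypothetical, a dict keyed by date stands in for the date-partitioned Hive table, and SQLite stands in for the MySQL/Oracle reporting database.

```python
import sqlite3
from collections import defaultdict

# Hypothetical log lines in an assumed "date<TAB>user_id<TAB>event" format.
LOG = [
    "2009-06-01\tu1\tstatus_update",
    "2009-06-01\tu1\tphoto_upload",
    "2009-06-01\tu2\tstatus_update",
    "2009-06-02\tu3\tstatus_update",
]


def partition_by_date(log_lines):
    """Group log rows by date, mimicking a table partitioned by date."""
    partitions = defaultdict(list)
    for line in log_lines:
        date, user_id, event = line.rstrip("\n").split("\t")
        partitions[date].append((user_id, event))
    return partitions


def compute_metrics(rows):
    """Metrics for one partition: total events and distinct users."""
    return {"events": len(rows), "active_users": len({u for u, _ in rows})}


def load_metrics(conn, date, metrics):
    """Idempotent load: re-running the job for a day replaces that day's values."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS site_metrics ("
        "date TEXT, metric TEXT, value INTEGER, PRIMARY KEY (date, metric))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO site_metrics VALUES (?, ?, ?)",
        [(date, name, value) for name, value in metrics.items()],
    )
    conn.commit()


def run_etl(log_lines, date, conn):
    """What a daily cron job might do: read one partition, aggregate, load."""
    partitions = partition_by_date(log_lines)
    load_metrics(conn, date, compute_metrics(partitions.get(date, [])))


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    run_etl(LOG, "2009-06-01", db)
    print(db.execute("SELECT * FROM site_metrics ORDER BY metric").fetchall())
```

Keying the metrics table on (date, metric) keeps the job safe to re-run after a failure, which matters for anything driven by cron.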

Build Machine Learning Products on Hadoop/Hive
•  Recommendation & ranking
•  Advertising optimization
•  Text analytics

Which applications the user may like
»  Recommend apps based on social and demographic popularity

»  User-app log is huge
»  Joining user-app log with user demographics is difficult

»  Hive for data aggregation
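
The join-then-aggregate step can be illustrated in miniature. The sketch below covers only the demographic half of the idea (popularity within a user's demographic group); the record layout, group definition, and function names are hypothetical, and at warehouse scale this would be a Hive join and GROUP BY rather than in-memory Python.

```python
from collections import Counter, defaultdict

# Hypothetical data: user -> demographic group, and (user, app) usage events.
DEMOGRAPHICS = {
    "u1": ("18-24", "US"),
    "u2": ("18-24", "US"),
    "u3": ("18-24", "US"),
    "u4": ("35-44", "DE"),
}
USER_APP_LOG = [
    ("u1", "chess"),
    ("u2", "chess"), ("u2", "quiz"),
    ("u3", "chess"), ("u3", "quiz"), ("u3", "farm"),
    ("u4", "poker"),
]


def popularity_by_group(user_app_log, demographics):
    """Join usage events with demographics and count app usage per group."""
    counts = defaultdict(Counter)
    for user_id, app in user_app_log:
        group = demographics.get(user_id)
        if group is not None:  # inner join: skip users without demographics
            counts[group][app] += 1
    return counts


def recommend(user_id, user_app_log, demographics, k=2):
    """Suggest the k most popular apps in the user's group that they don't use yet."""
    group = demographics[user_id]
    already_used = {app for uid, app in user_app_log if uid == user_id}
    ranked = popularity_by_group(user_app_log, demographics)[group].most_common()
    return [app for app, _ in ranked if app not in already_used][:k]


if __name__ == "__main__":
    print(recommend("u1", USER_APP_LOG, DEMOGRAPHICS))
```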

Who the user wants to connect with
»  Take existing edges and user feedback as labels
»  Build regression models based on user profile and local graph features

»  Too many friends of friends
»  Model trained by sampling

»  Hive for model inference
»  Hive for feature selection
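
As an illustration of a "local graph feature" and of training on a sample, the sketch below enumerates friends-of-friends candidates, computes a mutual-friend count for each, and draws a fixed-size sample of candidate pairs. The choice of mutual-friend count as the feature, the toy graph, and the function names are assumptions; the slides do not say which features or sampling scheme were used, and no regression model is fitted here.

```python
import random
from collections import Counter

# Hypothetical undirected friendship graph as adjacency sets.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}


def fof_candidates(graph, user):
    """Return {candidate: mutual_friend_count} for friends of friends
    who are not already friends with the user."""
    friends = graph[user]
    mutual = Counter()
    for friend in friends:
        for fof in graph[friend]:
            if fof != user and fof not in friends:
                mutual[fof] += 1
    return dict(mutual)


def sample_training_pairs(graph, sample_size, seed=0):
    """Sample (user, candidate, mutual_count) rows instead of using every pair,
    since the number of friend-of-friend pairs grows very quickly."""
    pairs = [
        (user, cand, count)
        for user in sorted(graph)
        for cand, count in sorted(fof_candidates(graph, user).items())
    ]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(pairs, min(sample_size, len(pairs)))


if __name__ == "__main__":
    print(fof_candidates(GRAPH, "a"))
    print(sample_training_pairs(GRAPH, 3))
```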

What users are talking about (Lexicon)
»  Market research & ad tool

»  Extract popular words from user content
»  Slice by age, gender, region
»  Sentiment analysis
»  Keyword association

»  Hadoop used for text analytics

[Figure labels: "laid-off"; "Words associated with vodka"]
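
A toy version of two of these steps, word counts sliced by a demographic field and keyword association by co-occurrence, might look like the sketch below. The post structure, tokenizer, and stopword handling are assumptions made for illustration; the actual Lexicon implementation is not described in the slides, and at scale the counting would run as Hadoop jobs.

```python
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-z']+")

# Hypothetical posts with one demographic field attached.
POSTS = [
    {"text": "Vodka and lemonade tonight", "gender": "f"},
    {"text": "vodka lemonade party", "gender": "m"},
    {"text": "studying tonight", "gender": "f"},
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN.findall(text.lower())


def word_counts_by_slice(posts, key):
    """Count word occurrences separately for each value of a demographic field."""
    counts = defaultdict(Counter)
    for post in posts:
        counts[post[key]].update(tokenize(post["text"]))
    return counts


def associated_words(posts, keyword, stopwords=frozenset()):
    """Count words that appear in the same post as the keyword."""
    assoc = Counter()
    for post in posts:
        words = set(tokenize(post["text"]))
        if keyword in words:
            assoc.update(words - {keyword} - set(stopwords))
    return assoc


if __name__ == "__main__":
    print(word_counts_by_slice(POSTS, "gender")["f"].most_common(3))
    print(associated_words(POSTS, "vodka", stopwords={"and"}).most_common(3))
```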

What ads the user might click on
»  Predict user-ad click-through

»  Ad click data is sparse, so sampling can miss information
»  Many ML algorithms are iterative and thus not easy to run on Hadoop

»  Hadoop for model training

Build ensemble ML models on Hadoop

»  Each mapper trains a number of models
»  Each model is output as an intermediate feature

»  Model selection at reducer
»  A regression model is built on selected features

[Diagram: data shards ds1, ds2, ds3, ds4; labels: "Train models locally", "Cross-Test models locally", "Models assembled by ensemble methods", "ensembles"]

Model inference in a second Hadoop job
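
The map and reduce roles described here can be simulated in plain Python. The sketch below is a rough illustration, not the authors' implementation: the one-feature threshold models, the toy shards, and all function names are hypothetical, and the final combiner is a plain average of the selected models' outputs standing in for the regression model the slide mentions.

```python
# Hypothetical shards (ds1..ds4): rows are ((feature0, feature1), label).
SHARDS = [
    [((1.0, 5.0), 0), ((2.0, 1.0), 0), ((8.0, 4.0), 1), ((9.0, 2.0), 1)],
    [((1.5, 2.0), 0), ((2.5, 6.0), 0), ((7.5, 1.0), 1), ((8.5, 5.0), 1)],
    [((0.5, 3.0), 0), ((3.0, 3.5), 0), ((7.0, 3.0), 1), ((9.5, 3.5), 1)],
    [((2.0, 4.0), 0), ((1.0, 2.0), 0), ((8.0, 6.0), 1), ((9.0, 1.0), 1)],
]


def train_stump(shard, j):
    """Fit a one-feature model: threshold at the midpoint of the class means."""
    pos = [x[j] for x, y in shard if y == 1]
    neg = [x[j] for x, y in shard if y == 0]
    if not pos or not neg:
        return None  # cannot fit without both classes
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    sign = 1 if mean_pos >= mean_neg else -1
    return {"feature": j, "threshold": (mean_pos + mean_neg) / 2, "sign": sign}


def predict(model, x):
    """Model output, usable as an intermediate 0/1 feature."""
    return 1 if model["sign"] * (x[model["feature"]] - model["threshold"]) > 0 else 0


def accuracy(model, rows):
    return sum(predict(model, x) == y for x, y in rows) / len(rows)


def mapper(shard_id, shards):
    """Train one model per feature on the local shard, then cross-test each
    model on the rows of all other shards."""
    shard = shards[shard_id]
    held_out = [row for i, s in enumerate(shards) if i != shard_id for row in s]
    scored = []
    for j in range(len(shard[0][0])):
        model = train_stump(shard, j)
        if model is not None:
            scored.append((accuracy(model, held_out), shard_id, model))
    return scored


def reducer(scored_models, k):
    """Model selection: keep the k models with the best cross-test accuracy."""
    ranked = sorted(scored_models, key=lambda t: (-t[0], t[1], t[2]["feature"]))
    return [model for _, _, model in ranked[:k]]


def ensemble_predict(models, x):
    """Stand-in combiner: average the selected models' outputs."""
    votes = [predict(m, x) for m in models]
    return 1 if sum(votes) / len(votes) >= 0.5 else 0


if __name__ == "__main__":
    scored = [s for i in range(len(SHARDS)) for s in mapper(i, SHARDS)]
    selected = reducer(scored, 3)
    print(selected, ensemble_predict(selected, (8.2, 0.0)))
```

In this toy data only the first feature carries signal, so the cross-test step is what filters out the models trained on the noisy second feature.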

In summary

»  Hadoop and Hive at Facebook
»  Support product strategy and decisions;
»  Recommendation & ranking products;
»  Advertising optimization;
»  Text analytics tools;

»  So Zuckerberg’s urgent questions are answered;
»  So celebrities know where their fans are from;
»  So we know one can like vodka and lemonade at the same time;
»  It’s fun playing with the data;

Dhruba Borthakur, Ding Zhou
dhruba@, dzhou@
