Twitter Protobufs And Hadoop Hug 021709

Hadoop and Protocol Buffers at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter

Wednesday, February 17, 2010

Outline
‣ Problem Statement
‣ CSV? XML? JSON? Regex?
‣ Protocol Buffers
‣ Codegen, Hadoop and You
‣ Applications
‣ Conclusions and Next Steps


My Background
‣ Studied Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, large-scale data analysis and visualization, social graph analysis, machine learning, lots more data


The Challenge
‣ Store 100 billion tweets in a way that is:
‣ Robust to changes
‣ Efficient in size and speed
‣ Amenable to large-scale analysis
‣ Reusable (especially for other classes of data, like logs, where the size gets really large)


The System
‣ Your (friend’s) Hadoop cluster


The Data
‣ Each tweet has 12 fields, 3 of which (user, geo, contributors) have subfields
‣ It can change as we add new features

    kevin@tw-mbp-kweil ~ $ curl http://api.twitter.com/1/statuses/show/9225259353.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <status>
      <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
      <id>9225259353</id>
      <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group: Protobufs and Hadoop at Twitter. See you there? http://bit.ly/9DJcd9</text>
      <source><a href="http://www.tweetdeck.com/" rel="nofollow">TweetDeck</a></source>
      <truncated>false</truncated>
      <in_reply_to_status_id></in_reply_to_status_id>
      <in_reply_to_user_id></in_reply_to_user_id>
      <favorited>false</favorited>
      <in_reply_to_screen_name></in_reply_to_screen_name>
      <user>
        <id>3452911</id>
        <name>Kevin Weil</name>
        <screen_name>kevinweil</screen_name>
        <location>Portola Valley, CA</location>
        <description>Analytics Lead at Twitter. Ultra-marathons, cycling, hadoop, lolcats.</description>
        <profile_image_url>http://a3.twimg.com/profile_images/220257539/n206489_34325699_8572_normal.jpg</profile_image_url>
        <url></url>
        <protected>false</protected>
        <followers_count>3122</followers_count>
        <profile_background_color>B2DFDA</profile_background_color>
        <profile_text_color>333333</profile_text_color>
        <profile_link_color>93A644</profile_link_color>
        <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color>
        <profile_sidebar_border_color>eeeeee</profile_sidebar_border_color>
        <friends_count>436</friends_count>
        <created_at>Wed Apr 04 19:29:46 +0000 2007</created_at>
        <favourites_count>721</favourites_count>
        <utc_offset>-28800</utc_offset>
        <time_zone>Pacific Time (US &amp; Canada)</time_zone>
        <profile_background_image_url>http://s.twimg.com/a/1266345225/images/themes/theme13/bg.gif</profile_background_image_url>
        <profile_background_tile>false</profile_background_tile>
        <notifications>false</notifications>
        <geo_enabled>true</geo_enabled>
        <verified>false</verified>
        <following>false</following>
        <statuses_count>2556</statuses_count>
        <lang>en</lang>
        <contributors_enabled>false</contributors_enabled>
      </user>
      <geo/>
      <contributors/>
    </status>


The Requirements
‣ Splittability
‣ Parsing efficiency
‣ Reusability
‣ Ability to add new fields
‣ Ability to ignore unused fields
‣ Small data size
‣ Hierarchical


Common Formats
[Table: each format rated against each requirement]
Columns: Splittable | Parsing efficiency | Reusability | Add new fields | Ignore unused fields | Small data size | Hierarchical
Rows: XML | JSON | CSV | Custom regex (Apache)


Enter Protocol Buffers
‣ “Protocol Buffers are a way of encoding structured data in an
efficient yet extensible format. Google uses Protocol Buffers for
almost all of its internal RPC protocols and file formats.”
‣ http://code.google.com/p/protobuf
‣ You write IDL describing your data structure
‣ It generates code in your languages of choice to construct, serialize, deserialize, reflect across, etc., your data structure
‣ Like Thrift, but richer and more efficient (except no RPC)
‣ Avro is an exciting up-and-coming alternative


Protobuf IDL Example
    message Status {
      optional string created_at = 1;
      optional int64 id = 2;
      optional string text = 3;
      optional string source = 4;
      optional bool truncated = 5;
      optional int64 in_reply_to_status_id = 6;
      optional int64 in_reply_to_user_id = 7;
      optional bool favorited = 8;
      optional string in_reply_to_screen_name = 9;
      // Nested messages are declared as "<type> <field name> = <tag>".
      optional User user = 10;
      optional Geo geo = 11;
      optional Contributors contributors = 12;

      message User {
        optional int64 id = 1;
        optional string name = 2;
        ...
      }
      message Geo { ... }
      message Contributors { ... }
    }
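
To make the "small data size" point concrete, here is a minimal sketch of how a few Status fields land on the wire. It hand-rolls the standard protobuf encoding (a key of field number and wire type, then a varint or length-prefixed payload) rather than using generated code; the helper names (encode_varint, encode_status, etc.) are illustrative only and are not part of any library.

```python
VARINT = 0            # wire type for int64, bool, ...
LENGTH_DELIMITED = 2  # wire type for string, bytes, nested messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """Every field starts with (field_number << 3) | wire_type as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_int64(field_number: int, value: int) -> bytes:
    return encode_key(field_number, VARINT) + encode_varint(value)


def encode_bool(field_number: int, value: bool) -> bytes:
    return encode_key(field_number, VARINT) + encode_varint(1 if value else 0)


def encode_string(field_number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    return (encode_key(field_number, LENGTH_DELIMITED)
            + encode_varint(len(payload))
            + payload)


def encode_status(created_at: str, status_id: int, text: str,
                  truncated: bool) -> bytes:
    """Encode four of the Status fields, using the tags from the IDL above."""
    return (encode_string(1, created_at)
            + encode_int64(2, status_id)
            + encode_string(3, text)
            + encode_bool(5, truncated))
```

With this encoding the tweet id 9225259353 takes 6 bytes (1 key byte plus a 5-byte varint), against 19 characters for `<id>9225259353</id>` in the XML form. Field names never appear in the payload; only the numeric tags do.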


Protobuf Generated Code
‣ The generated code is:
  ‣ Efficient (Google quotes 80x vs. |-delimited format) [1, 2]
  ‣ Extensible
  ‣ Backwards compatible
  ‣ Polymorphic (in Java, C++, Python)
  ‣ Metadata-rich

1. http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
2. http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
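
"Extensible" and "backwards compatible" follow from the wire format: every field carries its own tag and wire type, so a reader can step over fields it does not know. The sketch below is a hand-written reader, not the generated code, and assumes only the standard wire types; parse_known_fields and the sample tags 13 and 14 are hypothetical.

```python
def read_varint(buf: bytes, pos: int):
    """Decode one base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def parse_known_fields(buf: bytes, known: dict) -> dict:
    """Return {name: value} for tags listed in `known`; skip everything else.

    `known` maps field numbers to names, standing in for an older schema.
    Varint fields come back as ints, all other fields as raw bytes.
    """
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number in known:
            fields[known[field_number]] = value
        # Unknown tags fall through: the bytes were consumed, nothing is kept.
    return fields


# A message written by a "newer" producer: id = 150, text = "hi",
# plus two fields (tags 13 and 14) that the old reader has never heard of.
message = (bytes([0x10, 0x96, 0x01])
           + bytes([0x1A, 0x02]) + b"hi"
           + bytes([0x68, 0x07])
           + bytes([0x72, 0x01]) + b"x")
old_schema = {2: "id", 3: "text"}
parsed = parse_known_fields(message, old_schema)
```

Here `parsed` holds only `id` and `text`; the two newer fields are skipped without error, which is what lets producers add fields before every consumer is upgraded.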


Common Formats
[Table: each format rated against each requirement]
Columns: Splittable | Parsing efficiency | Reusability | Add new fields | Ignore unused fields | Small data size | Hierarchical
Rows: XML | JSON | CSV | Custom regex (Apache) | Protocol Buffers

But Wait, There’s More
‣ Codegen for data structures is nice...
‣ Next step: codegen for all Hadoop-related code
  ‣ Protocol Buffer InputFormats
  ‣ OutputFormats
  ‣ Writables
  ‣ Pig LoadFuncs and StoreFuncs
  ‣ Cascading, Streaming, Dumbo, etc.
  ‣ Per Protocol Buffer

‣ All objects (hierarchical data, inheritance, etc.)
‣ All automatically generated
‣ Efficient, extensible storage and serialization


Pig LoadFuncs
‣ All objects (hierarchical data, inheritance, etc.)
‣ All automatically generated
‣ Even the load statement itself is codegen
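
As a rough illustration of what "the load statement itself is codegen" could mean, the sketch below derives a Pig LOAD statement from a list of (field name, protobuf type) pairs. This is not the generator from the talk: the function name, the loader class com.example.StatusPigLoader, the path, and the protobuf-to-Pig type mapping (including bool mapped to int) are all assumptions made for the example.

```python
# Assumed mapping from protobuf scalar types to Pig types (illustrative only).
PIG_TYPES = {
    "string": "chararray",
    "int64": "long",
    "int32": "int",
    "bool": "int",
    "double": "double",
    "float": "float",
}


def pig_load_statement(alias: str, path: str, loader_class: str,
                       fields: list) -> str:
    """Build a Pig LOAD ... USING ... AS (...) statement from a field list."""
    schema = ", ".join(f"{name}: {PIG_TYPES[ptype]}" for name, ptype in fields)
    return f"{alias} = LOAD '{path}' USING {loader_class}() AS ({schema});"


statement = pig_load_statement(
    "statuses",
    "/data/statuses",               # hypothetical HDFS path
    "com.example.StatusPigLoader",  # hypothetical generated loader class
    [("created_at", "string"), ("id", "int64"), ("truncated", "bool")],
)
```

The point of generating this text is that the Pig schema can never drift from the IDL: add a field to the message, regenerate, and the load statement picks it up.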


Where do these work?
‣ Java MapReduce APIs (InputFormats, OutputFormats, Writables)
‣ Deprecated Java MapReduce APIs (same)
  ‣ Enables Streaming, Dumbo, Cascading
‣ Pig
‣ HBase


Counting Big Data
‣ standard counts, min, max, std dev
‣ How many requests do we serve in a day?
‣ What is the average latency? 95% latency?
‣ Group by response code. What is the hourly distribution?
‣ How many searches happen each day on Twitter?
‣ How many unique queries, how many unique users?
‣ What is their geographic distribution?
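
The counting questions above reduce to a handful of aggregates. A minimal single-machine sketch, assuming hypothetical request records of (hour, response code, latency in ms); on a cluster the same grouping would be expressed as a Pig or MapReduce job, and the nearest-rank percentile used here is just one common definition.

```python
import math
import statistics
from collections import Counter, defaultdict


def summarize(requests):
    """Compute counts, latency statistics, and per-code hourly distribution.

    `requests` is a list of (hour, response_code, latency_ms) tuples.
    """
    latencies = sorted(latency for _, _, latency in requests)
    # Nearest-rank 95th percentile: smallest value with >= 95% at or below it.
    rank = math.ceil(0.95 * len(latencies))
    by_code = Counter(code for _, code, _ in requests)
    hourly_by_code = defaultdict(Counter)
    for hour, code, _ in requests:
        hourly_by_code[code][hour] += 1
    return {
        "count": len(requests),
        "min": latencies[0],
        "max": latencies[-1],
        "mean": statistics.mean(latencies),
        "std_dev": statistics.pstdev(latencies),  # population std dev
        "p95": latencies[rank - 1],
        "by_code": dict(by_code),
        "hourly_by_code": {c: dict(h) for c, h in hourly_by_code.items()},
    }


# Hypothetical sample data, not real traffic.
sample = [
    (0, 200, 10.0),
    (0, 200, 20.0),
    (0, 500, 300.0),
    (1, 200, 30.0),
    (1, 404, 40.0),
]
summary = summarize(sample)
```

Note how far the 95th percentile (300 ms) sits from the mean (80 ms) even in this toy sample, which is why the slide asks for both.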


Correlating Big Data
‣ probabilities, covariance, influence
‣ How does usage differ for mobile users?
‣ How about for users with 3rd party desktop clients?
‣ Cohort analyses
‣ Site problems: what goes wrong at the same time?
‣ Which features get users hooked?
‣ Which features do successful users use often?
‣ Search corrections, search suggestions
‣ A/B testing

Research on Big Data
‣ prediction, graph analysis, natural language
‣ What can we tell about a user from their tweets?
‣ From the tweets of those they follow?
‣ From the tweets of their followers?
‣ From the ratio of followers/following?
‣ What graph structures lead to successful networks?
‣ User reputation


Research on Big Data
‣ prediction, graph analysis, natural language
‣ Sentiment analysis
‣ What features get a tweet retweeted?
‣ How deep is the corresponding retweet tree?
‣ Long-term duplicate detection
‣ Machine learning
‣ Language detection
‣ ... the list goes on.


Resolution
‣ All we do now is write IDL for the data schema
‣ Get efficient, forward/backwards compatible, splittable data structures automatically generated for us
‣ Get loaders, input formats, output formats, writables, and schemas automatically generated for us
‣ Helps the Twitter analytics team stay agile
  ‣ Can handle new, complex data without the need for new code, new tests, new bugs
  ‣ Focus on the analysis, not data formats


Twitter Open Source
‣ Coming soon! (1-2 weeks) http://github.com/kevinweil
‣ All base classes for InputFormats, OutputFormats, Writables, Pig Loaders, etc.
‣ For new and deprecated MapReduce API
‣ With and without LZO compression (see http://github.com/kevinweil/hadoop-lzo)
‣ Protobuf reflection helpers
‣ Serialized block storage format for HDFS
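
Serialized protobuf messages are not self-delimiting, so any file of them needs some framing. The sketch below shows the simplest version, a varint length prefix before each record. This is an assumption for illustration and not the block format referred to above; a splittable format would additionally need sync markers so a reader can start mid-file. The function names are hypothetical.

```python
import io


def write_delimited(stream, record: bytes) -> None:
    """Write one record as <varint length><payload>."""
    length = len(record)
    prefix = bytearray()
    while True:
        low7 = length & 0x7F
        length >>= 7
        if length:
            prefix.append(low7 | 0x80)  # more length bytes follow
        else:
            prefix.append(low7)
            break
    stream.write(bytes(prefix))
    stream.write(record)


def read_delimited(stream):
    """Yield records until the stream ends cleanly on a record boundary."""
    while True:
        length = 0
        shift = 0
        first = True
        while True:
            byte = stream.read(1)
            if not byte:
                if first:
                    return  # clean end of stream
                raise EOFError("stream ended inside a length prefix")
            first = False
            length |= (byte[0] & 0x7F) << shift
            if not byte[0] & 0x80:
                break
            shift += 7
        record = stream.read(length)
        if len(record) != length:
            raise EOFError("stream ended inside a record")
        yield record


# Round-trip three opaque payloads (stand-ins for serialized messages).
records = [b"first", b"", b"x" * 300]
buffer = io.BytesIO()
for record in records:
    write_delimited(buffer, record)
buffer.seek(0)
round_tripped = list(read_delimited(buffer))
```

The 300-byte record needs a two-byte length prefix, which exercises the multi-byte varint path.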


Questions? Follow me at
twitter.com/kevinweil

‣ If this sounded interesting to you -- that’s because it is. And we’re hiring.
