1. Hadoop and Protocol Buffers at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter
Wednesday, February 17, 2010
2. Outline
‣ Problem Statement
‣ CSV? XML? JSON? Regex?
‣ Protocol Buffers
‣ Codegen, Hadoop and You
‣ Applications
‣ Conclusions and Next Steps
3. My Background
‣ Studied Mathematics and Physics at Harvard, Physics at
Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms,
GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, large-scale data analysis and
visualization, social graph analysis, machine learning, lots more
data
5. The Challenge
‣ Store some tweets
6. The Challenge
‣ Store 100 billion tweets (not just some tweets)
7. The Challenge
‣ Store 100 billion tweets in a way that is
‣ Robust to changes
8. The Challenge
‣ Store 100 billion tweets in a way that is
‣ Robust
‣ Efficient in size and speed
9. The Challenge
‣ Store 100 billion tweets in a way that is
‣ Robust
‣ Efficient
‣ Amenable to large-scale analysis
10. The Challenge
‣ Store 100 billion tweets in a way that is
‣ Robust
‣ Efficient
‣ Amenable to large-scale analysis
‣ Reusable (especially for other classes of data, like logs, where the size gets
really large)
11. The System
‣ Your (friend’s) Hadoop cluster
12. The Data
‣ Each tweet has 12 fields, 3 of which (user, geo, contributors) have subfields
‣ It can change as we add new features
kevin@tw-mbp-kweil ~ $ curl http://api.twitter.com/1/statuses/show/9225259353.xml
<?xml version="1.0" encoding="UTF-8"?>
<status>
  <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
  <id>9225259353</id>
  <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group: Protobufs and Hadoop at Twitter. See you there? http://bit.ly/9DJcd9</text>
  <source><a href="http://www.tweetdeck.com/" rel="nofollow">TweetDeck</a></source>
  <truncated>false</truncated>
  <in_reply_to_status_id></in_reply_to_status_id>
  <in_reply_to_user_id></in_reply_to_user_id>
  <favorited>false</favorited>
  <in_reply_to_screen_name></in_reply_to_screen_name>
  <user>
    <id>3452911</id>
    <name>Kevin Weil</name>
    <screen_name>kevinweil</screen_name>
    <location>Portola Valley, CA</location>
    <description>Analytics Lead at Twitter. Ultra-marathons, cycling, hadoop, lolcats.</description>
    <profile_image_url>http://a3.twimg.com/profile_images/220257539/n206489_34325699_8572_normal.jpg</profile_image_url>
    <url></url>
    <protected>false</protected>
    <followers_count>3122</followers_count>
    <profile_background_color>B2DFDA</profile_background_color>
    <profile_text_color>333333</profile_text_color>
    <profile_link_color>93A644</profile_link_color>
    <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color>
    <profile_sidebar_border_color>eeeeee</profile_sidebar_border_color>
    <friends_count>436</friends_count>
    <created_at>Wed Apr 04 19:29:46 +0000 2007</created_at>
    <favourites_count>721</favourites_count>
    <utc_offset>-28800</utc_offset>
    <time_zone>Pacific Time (US &amp; Canada)</time_zone>
    <profile_background_image_url>http://s.twimg.com/a/1266345225/images/themes/theme13/bg.gif</profile_background_image_url>
    <profile_background_tile>false</profile_background_tile>
    <notifications>false</notifications>
    <geo_enabled>true</geo_enabled>
    <verified>false</verified>
    <following>false</following>
    <statuses_count>2556</statuses_count>
    <lang>en</lang>
    <contributors_enabled>false</contributors_enabled>
  </user>
  <geo/>
  <contributors/>
</status>
13. The Requirements
‣ Splittability
‣ Parsing efficiency
‣ Reusability
‣ Ability to add new fields
‣ Ability to ignore unused fields
‣ Small data size
‣ Hierarchical
21. Common Formats
Format | Splittable | Parsing efficiency | Reusability | Add new fields | Ignore unused fields | Small data size | Hierarchical
XML
JSON
CSV
Custom regex (Apache)
26. Enter Protocol Buffers
‣ “Protocol Buffers are a way of encoding structured data in an
efficient yet extensible format. Google uses Protocol Buffers for
almost all of its internal RPC protocols and file formats.”
  ‣ http://code.google.com/p/protobuf
‣ You write IDL describing your data structure
‣ The protobuf compiler generates code in your languages of choice to construct, serialize,
deserialize, reflect across, etc., your data structure
‣ Like Thrift, but richer and more efficient (except no RPC)
‣ Avro is an exciting up-and-coming alternative
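To make "efficient yet extensible" concrete, here is a minimal Python sketch of how the protobuf wire format lays out a message: each field is a key (field number and wire type packed into a varint) followed by its value. The two-field Tweet schema in the comment is a hypothetical illustration, not the schema used at Twitter, and this hand-rolled encoder stands in for the code that the protobuf compiler would normally generate.

```python
# Hypothetical schema, shown as protobuf IDL for reference only:
#   message Tweet {
#     optional uint64 id   = 1;
#     optional string text = 2;
#   }

WIRE_VARINT = 0            # wire type for integer fields
WIRE_LENGTH_DELIMITED = 2  # wire type for strings, bytes, nested messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles only non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A field key is (field_number << 3) | wire_type, itself a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_tweet(tweet_id: int, text: str) -> bytes:
    """Serialize the hypothetical Tweet message by hand."""
    payload = text.encode("utf-8")
    return (
        encode_key(1, WIRE_VARINT) + encode_varint(tweet_id)
        + encode_key(2, WIRE_LENGTH_DELIMITED)
        + encode_varint(len(payload)) + payload
    )


if __name__ == "__main__":
    encoded = encode_tweet(9225259353, "See you there?")
    xml = "<status><id>9225259353</id><text>See you there?</text></status>"
    print(len(encoded), "bytes as protobuf vs", len(xml.encode("utf-8")), "bytes as XML")
```

Field names never appear on the wire, only field numbers, which is where much of the size saving over XML or JSON comes from.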
28. Protobuf Generated Code
‣ The generated code is:
  ‣ Efficient (Google quotes 80x faster parsing vs. a |-delimited text format) [1, 2]
  ‣ Extensible
  ‣ Backwards compatible
  ‣ Polymorphic (in Java, C++, Python)
  ‣ Metadata-rich
1. http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
2. http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
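The "extensible" and "backwards compatible" properties follow from the wire format: every field carries its number and wire type, so a reader can step over fields it does not know. The Python sketch below illustrates that idea only. The field numbers and names are hypothetical, and real generated code does more than this (for example, it can retain unknown fields rather than drop them).

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode_known_fields(buf: bytes, known: dict) -> dict:
    """Decode the fields listed in `known` (number -> name); skip the rest."""
    out = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number in known:
            out[known[field_number]] = value
        # Unknown field numbers are stepped over without failing.
    return out


# An old reader knows two fields; a newer writer added hypothetical field 3.
OLD_READER = {1: "id", 2: "text"}
NEW_READER = {1: "id", 2: "text", 3: "source"}
NEW_MESSAGE = b"\x08\x96\x01" + b"\x12\x02hi" + b"\x1a\x03web"

if __name__ == "__main__":
    print(decode_known_fields(NEW_MESSAGE, OLD_READER))
    print(decode_known_fields(NEW_MESSAGE, NEW_READER))
```

The old reader still parses data written with the newer schema, which is the behavior the "add new fields" and "ignore unused fields" requirements ask for.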
29. Common Formats
Format | Splittable | Parsing efficiency | Reusability | Add new fields | Ignore unused fields | Small data size | Hierarchical
XML
JSON
CSV
Custom regex (Apache)
Protocol Buffers
37. But Wait, There’s More
‣ Codegen for data structures is nice...
‣ Next step: codegen for all Hadoop-related code
  ‣ Protocol Buffer InputFormats
  ‣ OutputFormats
  ‣ Writables
  ‣ Pig LoadFuncs and StoreFuncs
  ‣ Cascading, Streaming, Dumbo, etc.
  ‣ Per Protocol Buffer
38. Protocol Buffer InputFormats
‣ All objects (hierarchical data, inheritance, etc.)
‣ All automatically generated
‣ Efficient, extensible storage and serialization
39. Pig LoadFuncs
‣ All objects (hierarchical data, inheritance, etc.)
‣ All automatically generated
‣ Even the load statement itself is codegen
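As a rough illustration of what generating the load statement could look like, the Python sketch below turns a list of (field name, protobuf type) pairs into a Pig LOAD statement. The type mapping, the loader class name com.example.pig.TweetProtobufLoader, and the input path are all hypothetical. This is not the generator used at Twitter.

```python
# Assumed mapping from a few protobuf scalar types to Pig types.
PROTO_TO_PIG = {
    "int32": "int",
    "int64": "long",
    "float": "float",
    "double": "double",
    "string": "chararray",
    "bytes": "bytearray",
}


def pig_load_statement(alias, path, loader_class, fields):
    """Build a Pig LOAD statement from (field_name, proto_type) pairs."""
    schema = ", ".join(f"{name}:{PROTO_TO_PIG[ptype]}" for name, ptype in fields)
    return f"{alias} = LOAD '{path}' USING {loader_class}() AS ({schema});"


# Hypothetical description of a Tweet message.
TWEET_FIELDS = [("id", "int64"), ("text", "string"), ("created_at", "string")]

if __name__ == "__main__":
    print(pig_load_statement(
        "tweets",
        "/data/tweets",                         # hypothetical HDFS path
        "com.example.pig.TweetProtobufLoader",  # hypothetical loader class
        TWEET_FIELDS,
    ))
```

Because the schema is derived from the message description, adding a field to the IDL changes the generated statement without anyone editing Pig scripts by hand.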
40. Where do these work?
‣ Java MapReduce APIs (InputFormats, OutputFormats, Writables)
‣ Deprecated Java MapReduce APIs (same)
  ‣ Enables Streaming, Dumbo, Cascading
‣ Pig
‣ HBase
42. Counting Big Data
‣ standard counts, min, max, std dev
‣ How many requests do we serve in a day?
‣ What is the average latency? 95% latency?
‣ Group by response code. What is the hourly distribution?
‣ How many searches happen each day on Twitter?
‣ How many unique queries, how many unique users?
‣ What is their geographic distribution?
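The counting questions above reduce to grouping and simple statistics. The Python sketch below computes them on a handful of made-up request records with hypothetical fields (hour, response code, latency in ms). At Twitter's scale the same logic would run as a Pig script or MapReduce job over the stored data rather than in a single process.

```python
from collections import Counter
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


def summarize(requests):
    """requests: iterable of (hour, response_code, latency_ms) tuples."""
    requests = list(requests)
    latencies = [latency for _, _, latency in requests]
    return {
        "requests": len(requests),
        "mean_latency_ms": mean(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        "by_code": Counter(code for _, code, _ in requests),
        "by_hour": Counter(hour for hour, _, _ in requests),
    }


# Made-up sample data for illustration only.
SAMPLE = [
    (0, 200, 12.0),
    (0, 200, 20.0),
    (0, 500, 250.0),
    (1, 200, 15.0),
    (1, 404, 8.0),
]

if __name__ == "__main__":
    for name, value in summarize(SAMPLE).items():
        print(name, value)
```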
43. Correlating Big Data
‣ probabilities, covariance, influence
‣ How does usage differ for mobile users?
‣ How about for users with 3rd party desktop clients?
‣ Cohort analyses
‣ Site problems: what goes wrong at the same time?
‣ Which features get users hooked?
‣ Which features do successful users use often?
‣ Search corrections, search suggestions
‣ A/B testing
44. Research on Big Data
‣ prediction, graph analysis, natural language
‣ What can we tell about a user from their tweets?
‣ From the tweets of those they follow?
‣ From the tweets of their followers?
‣ From the ratio of followers/following?
‣ What graph structures lead to successful networks?
‣ User reputation
45. Research on Big Data
‣ prediction, graph analysis, natural language
‣ Sentiment analysis
‣ What features get a tweet retweeted?
‣ How deep is the corresponding retweet tree?
‣ Long-term duplicate detection
‣ Machine learning
‣ Language detection
‣ ... the list goes on.
47. Resolution
‣ All we do now is write IDL for the data schema
‣ Get efficient, forward/backwards compatible, splittable data structures
automatically generated for us
‣ Get loaders, input formats, output formats, writables, and schemas
automatically generated for us
‣ Helps the Twitter analytics team stay agile
  ‣ Can handle new, complex data without the need for new code, new tests, new bugs
  ‣ Focus on the analysis, not data formats
48. Twitter Open Source
‣ Coming soon! (1-2 weeks) http://github.com/kevinweil
‣ All base classes for InputFormats, OutputFormats, Writables, Pig
Loaders, etc
‣ For the new and deprecated MapReduce APIs
‣ With and without LZO compression (see http://github.com/kevinweil/hadoop-lzo)
‣ Protobuf reflection helpers
‣ Serialized block storage format for HDFS
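One way a block storage format can make binary records splittable is to group length-prefixed records into blocks, each preceded by a sync marker, so that a reader starting at an arbitrary byte offset can scan forward to the next marker. The Python sketch below shows that general technique under stated assumptions: the marker bytes, the block layout, and the function names are all hypothetical and do not describe the actual format released by Twitter.

```python
import io
import struct

# Hypothetical sync marker; a real format would use one that is very
# unlikely to occur inside record data.
SYNC = b"\x00SYNC-MARKER\xff"


def write_blocks(records, records_per_block=2) -> bytes:
    """Layout per block: SYNC, 4-byte block length, then length-prefixed records."""
    out = io.BytesIO()
    for start in range(0, len(records), records_per_block):
        chunk = records[start:start + records_per_block]
        block = b"".join(struct.pack(">I", len(r)) + r for r in chunk)
        out.write(SYNC + struct.pack(">I", len(block)) + block)
    return out.getvalue()


def read_split(data: bytes, start: int, end: int):
    """Yield records from every block whose sync marker begins in [start, end)."""
    pos = data.find(SYNC, start)
    while pos != -1 and pos < end:
        pos += len(SYNC)
        (block_len,) = struct.unpack_from(">I", data, pos)
        pos += 4
        block_end = pos + block_len
        while pos < block_end:
            (rec_len,) = struct.unpack_from(">I", data, pos)
            pos += 4
            yield data[pos:pos + rec_len]
            pos += rec_len
        pos = data.find(SYNC, block_end)


if __name__ == "__main__":
    records = [b"a", b"bb", b"ccc", b"dddd", b"eeeee"]
    data = write_blocks(records)
    middle = len(data) // 2
    print(list(read_split(data, 0, middle)))
    print(list(read_split(data, middle, len(data))))
```

Each block belongs to exactly one split (the one containing the first byte of its marker), so two mappers reading adjacent byte ranges neither drop nor duplicate records.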
49. Questions? Follow me at twitter.com/kevinweil
‣ If this sounded interesting to you -- that’s because it is. And we’re hiring.