Hadoop and Protocol Buffers at Twitter
                        Kevin Weil -- @kevinweil
                        Analytics Lead, Twitter




Outline
                   ‣     Problem Statement
                   ‣     CSV? XML? JSON? Regex?
                   ‣     Protocol Buffers
                   ‣     Codegen, Hadoop and You
                   ‣     Applications
                   ‣     Conclusions and Next Steps




My Background
                   ‣     Studied Mathematics and Physics at Harvard, Physics at
                         Stanford
                   ‣     Tropos Networks (city-wide wireless): mesh routing algorithms,
                         GBs of data
                   ‣     Cooliris (web media): Hadoop and Pig for analytics, TBs of data
                   ‣     Twitter: Hadoop, Pig, HBase, large-scale data analysis and
                         visualization, social graph analysis, machine learning, lots more
                         data



The Challenge
                    ‣     Store 100 billion tweets in a way that is
                    ‣          Robust to changes
                    ‣          Efficient in size and speed
                    ‣          Amenable to large-scale analysis
                    ‣          Reusable (especially for other classes of data, like logs, where the size gets
                               really large)
The System
                    ‣     Your (friend’s) Hadoop cluster
The Data
                    ‣     kevin@tw-mbp-kweil ~ $ curl http://api.twitter.com/1/statuses/show/9225259353.xml
                    ‣     <?xml version="1.0" encoding="UTF-8"?>
                          <status>
                            <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
                            <id>9225259353</id>
                            <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group: Protobufs and Hadoop at Twitter. See you there? http://bit.ly/9DJcd9</text>
                            <source>&lt;a href=&quot;http://www.tweetdeck.com/&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;</source>
                            <truncated>false</truncated>
                            <in_reply_to_status_id></in_reply_to_status_id>
                            <in_reply_to_user_id></in_reply_to_user_id>
                            <favorited>false</favorited>
                            <in_reply_to_screen_name></in_reply_to_screen_name>
                            <user>
                              <id>3452911</id>
                              <name>Kevin Weil</name>
                              <screen_name>kevinweil</screen_name>
                              <location>Portola Valley, CA</location>
                              <description>Analytics Lead at Twitter. Ultra-marathons, cycling, hadoop, lolcats.</description>
                              <profile_image_url>http://a3.twimg.com/profile_images/220257539/n206489_34325699_8572_normal.jpg</profile_image_url>
                              <url></url>
                              <protected>false</protected>
                              <followers_count>3122</followers_count>
                              <profile_background_color>B2DFDA</profile_background_color>
                              <profile_text_color>333333</profile_text_color>
                              <profile_link_color>93A644</profile_link_color>
                              <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color>
                              <profile_sidebar_border_color>eeeeee</profile_sidebar_border_color>
                              <friends_count>436</friends_count>
                              <created_at>Wed Apr 04 19:29:46 +0000 2007</created_at>
                              <favourites_count>721</favourites_count>
                              <utc_offset>-28800</utc_offset>
                              <time_zone>Pacific Time (US &amp; Canada)</time_zone>
                              <profile_background_image_url>http://s.twimg.com/a/1266345225/images/themes/theme13/bg.gif</profile_background_image_url>
                              <profile_background_tile>false</profile_background_tile>
                              <notifications>false</notifications>
                              <geo_enabled>true</geo_enabled>
                              <verified>false</verified>
                              <following>false</following>
                              <statuses_count>2556</statuses_count>
                              <lang>en</lang>
                              <contributors_enabled>false</contributors_enabled>
                            </user>
                            <geo/>
                            <contributors/>
                          </status>
                    ‣     Each tweet has 12 fields, 3 of which (user, geo, contributors) have subfields
                    ‣     It can change as we add new features
The Requirements
                                      ‣   Splittability
                                      ‣   Parsing efficiency
                                      ‣   Reusability
                                      ‣   Ability to add new fields
                                      ‣   Ability to ignore unused fields
                                      ‣   Small data size
                                      ‣   Hierarchical



Common Formats
                    ‣     Formats compared: XML, JSON, CSV, custom regex (Apache)
                    ‣     Compared against: splittability, parsing efficiency, reusability, ability to add new
                          fields, ability to ignore unused fields, small data size, hierarchical structure
Enter Protocol Buffers
                    ‣     “Protocol Buffers are a way of encoding structured data in an
                          efficient yet extensible format. Google uses Protocol Buffers for
                          almost all of its internal RPC protocols and file formats.”
                          http://code.google.com/p/protobuf
                    ‣     You write IDL describing your data structure
                    ‣     It generates code in your languages of choice to construct, serialize,
                          deserialize, reflect across, etc., your data structure
                    ‣     Like Thrift, but richer and more efficient (except no RPC)
                    ‣     Avro is an exciting up-and-coming alternative
Protobuf IDL Example
                    ‣     message Status {
                            optional string created_at               = 1;
                            optional int64  id                       = 2;
                            optional string text                     = 3;
                            optional string source                   = 4;
                            optional bool   truncated                = 5;
                            optional int64  in_reply_to_status_id    = 6;
                            optional int64  in_reply_to_user_id      = 7;
                            optional bool   favorited                = 8;
                            optional string in_reply_to_screen_name  = 9;
                            optional User   user                     = 10;
                            optional Geo    geo                      = 11;
                            optional Contributors contributors       = 12;

                            message User {
                              optional int64  id   = 1;
                              optional string name = 2;
                              ...
                            }
                            message Geo { ... }
                            message Contributors { ... }
                          }
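A minimal sketch of using the classes protoc generates from the IDL above. The protobuf runtime calls (newBuilder, toByteArray, parseFrom) are the real Java API; the assumption is that the generated class is reachable simply as Status (outer-class and package options are omitted for brevity):

      // Sketch only: assumes the IDL above was compiled with protoc --java_out.
      import com.google.protobuf.InvalidProtocolBufferException;

      public class StatusExample {
        public static void main(String[] args) throws InvalidProtocolBufferException {
          Status status = Status.newBuilder()
              .setCreatedAt("Wed Feb 17 08:01:13 +0000 2010")
              .setId(9225259353L)
              .setText("Preparing slides for tomorrow's talk at Y! ...")
              .setUser(Status.User.newBuilder()       // nested message
                  .setId(3452911L)
                  .setName("Kevin Weil"))
              .build();

          byte[] bytes = status.toByteArray();        // compact binary encoding
          Status parsed = Status.parseFrom(bytes);    // lossless round trip
          System.out.println(parsed.getUser().getName());
        }
      }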
Protobuf Generated Code
                    ‣     The generated code is:
                    ‣     Efficient (Google quotes 80x vs. |-delimited format)1,2
                    ‣     Extensible
                    ‣     Backwards compatible
                    ‣     Polymorphic (in Java, C++, Python)
                    ‣     Metadata-rich

                     1. http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
                     2. http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
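One illustration of the “metadata-rich” point: generated messages carry descriptors, so generic code can walk any message's fields without knowing the type at compile time. This is a sketch of the kind of helper that enables (not Twitter's code), using only the standard protobuf Java reflection API:

      import com.google.protobuf.Descriptors.FieldDescriptor;
      import com.google.protobuf.Message;

      // Works for any generated message type: prints every field that is set.
      public final class ProtoDump {
        public static void dump(Message m) {
          for (java.util.Map.Entry<FieldDescriptor, Object> e
                   : m.getAllFields().entrySet()) {
            System.out.println(e.getKey().getName() + " = " + e.getValue());
          }
        }
      }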
Common Formats
                    ‣     Formats compared: XML, JSON, CSV, custom regex (Apache), Protocol Buffers
                    ‣     Compared against: splittability, parsing efficiency, reusability, ability to add new
                          fields, ability to ignore unused fields, small data size, hierarchical structure
But Wait, There’s More
                    ‣     Codegen for data structures is nice...
                    ‣     Next step: codegen for all Hadoop-related code
                    ‣     Protocol Buffer InputFormats
                    ‣     OutputFormats
                    ‣     Writables
                    ‣     Pig LoadFuncs and StoreFuncs
                    ‣     Cascading, Streaming, Dumbo, etc.
                    ‣     One of each, generated per Protocol Buffer
Protocol Buffer InputFormats
                    ‣     All objects (hierarchical data, inheritance, etc)
                    ‣     All automatically generated
                    ‣     Efficient, extensible storage and serialization
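A hedged sketch of how a generated per-protobuf InputFormat slots into a stock MapReduce job. ProtobufStatusBlockInputFormat and ProtobufStatusWritable are hypothetical names standing in for the generated classes (the talk does not name them); everything else is the ordinary Hadoop (new-API) surface.

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

      public class TweetSourceCount {

        // Counts tweets per "source" (client), reading protobuf-serialized Status records.
        public static class SourceMapper
            extends Mapper<LongWritable, ProtobufStatusWritable, Text, LongWritable> {
          private static final LongWritable ONE = new LongWritable(1);

          @Override
          protected void map(LongWritable key, ProtobufStatusWritable value, Context ctx)
              throws IOException, InterruptedException {
            Status status = value.get();  // unwrap the generated protobuf object (hypothetical wrapper)
            ctx.write(new Text(status.getSource()), ONE);
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "tweet source count");
          job.setJarByClass(TweetSourceCount.class);
          job.setInputFormatClass(ProtobufStatusBlockInputFormat.class);  // hypothetical generated class
          job.setMapperClass(SourceMapper.class);
          job.setReducerClass(LongSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(LongWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }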
Pig LoadFuncs
                    ‣     All objects (hierarchical data, inheritance, etc)
                    ‣     All automatically generated
                    ‣     Even the load statement itself is codegen
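To make the generated-LoadFunc idea concrete, here is a rough sketch of the shape such a loader could take against the Pig 0.7+ LoadFunc API. The class names, the wrapped InputFormat, and the choice of projected fields are all assumptions for illustration, not the actual generated code:

      import java.io.IOException;
      import org.apache.hadoop.mapreduce.InputFormat;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.RecordReader;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.pig.LoadFunc;
      import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
      import org.apache.pig.data.Tuple;
      import org.apache.pig.data.TupleFactory;

      public class StatusProtobufPigLoader extends LoadFunc {
        private RecordReader reader;
        private final TupleFactory tupleFactory = TupleFactory.getInstance();

        @Override
        public void setLocation(String location, Job job) throws IOException {
          FileInputFormat.setInputPaths(job, location);
        }

        @Override
        public InputFormat getInputFormat() throws IOException {
          return new ProtobufStatusBlockInputFormat();  // hypothetical generated InputFormat
        }

        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
          this.reader = reader;
        }

        @Override
        public Tuple getNext() throws IOException {
          try {
            if (!reader.nextKeyValue()) {
              return null;  // end of split
            }
            Status status = ((ProtobufStatusWritable) reader.getCurrentValue()).get();
            // Project a few top-level fields into a Pig tuple; nested messages would
            // map to nested tuples in the generated schema.
            Tuple t = tupleFactory.newTuple(3);
            t.set(0, status.getId());
            t.set(1, status.getCreatedAt());
            t.set(2, status.getText());
            return t;
          } catch (InterruptedException e) {
            throw new IOException(e);
          }
        }
      }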
Where do these work?
                    ‣     Java MapReduce APIs (InputFormats, OutputFormats, Writables)
                    ‣     Deprecated Java MapReduce APIs (same)
                    ‣          Enables Streaming, Dumbo, Cascading
                    ‣     Pig
                    ‣     HBase
Counting Big Data
                   ‣                  standard counts, min, max, std dev
                   ‣     How many requests do we serve in a day?
                   ‣     What is the average latency? 95% latency?
                   ‣     Group by response code. What is the hourly distribution?
                   ‣     How many searches happen each day on Twitter?
                   ‣     How many unique queries, how many unique users?
                   ‣     What is their geographic distribution?



Correlating Big Data
                  ‣                   probabilities, covariance, influence
                  ‣     How does usage differ for mobile users?
                  ‣     How about for users with 3rd party desktop clients?
                  ‣     Cohort analyses
                  ‣     Site problems: what goes wrong at the same time?
                  ‣     Which features get users hooked?
                  ‣     Which features do successful users use often?
                  ‣     Search corrections, search suggestions
                  ‣     A/B testing
Research on Big Data
                  ‣                 prediction, graph analysis, natural language
                  ‣     What can we tell about a user from their tweets?
                  ‣            From the tweets of those they follow?
                  ‣            From the tweets of their followers?
                  ‣            From the ratio of followers/following?
                  ‣     What graph structures lead to successful networks?
                  ‣     User reputation



Research on Big Data
                  ‣                 prediction, graph analysis, natural language
                  ‣     Sentiment analysis
                  ‣     What features get a tweet retweeted?
                  ‣            How deep is the corresponding retweet tree?
                  ‣     Long-term duplicate detection
                  ‣     Machine learning
                  ‣     Language detection
                  ‣     ... the list goes on.

Resolution
                    ‣     All we do now is write IDL for the data schema
                    ‣     Get efficient, forward/backwards compatible, splittable data structures
                          automatically generated for us
                    ‣     Get loaders, input formats, output formats, writables, and schemas
                          automatically generated for us
                    ‣     Helps the Twitter analytics team stay agile
                    ‣     Can handle new, complex data without the need for new code, new
                          tests, new bugs
                    ‣     Focus on the analysis, not data formats
Twitter Open Source
                    ‣     Coming soon! (1-2 weeks) http://github.com/kevinweil
                    ‣     All base classes for InputFormats, OutputFormats, Writables, Pig
                          Loaders, etc
                    ‣     For new and deprecated MapReduce API
                    ‣     With and without LZO compression (see
                          http://github.com/kevinweil/hadoop-lzo)
                    ‣     Protobuf reflection helpers
                    ‣     Serialized block storage format for HDFS
Questions?
                    ‣     Follow me at twitter.com/kevinweil
                    ‣     If this sounded interesting to you -- that’s because it is. And we’re hiring.
