Analyzing Big Data at Twitter
Web 2.0 Expo, 2010
Kevin Weil @kevinweil
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
My Background
‣   Studied Mathematics and Physics at Harvard, Physics at Stanford
‣   Tropos Networks (city-wide wireless): GBs of data
‣   Cooliris (web media): Hadoop for analytics, TBs of data
‣   Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, PBs of data
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
Data, Data Everywhere
‣   You guys generate a lot of data
‣   Anybody want to guess?
‣   	     12 TB/day (4+ PB/yr)
‣   	     20,000 CDs
‣   	     10 million floppy disks
‣   	     450 GB while I give this talk
Syslog?
‣   Started with syslog-ng
‣   As our volume grew, it didn’t scale
‣   Resources overwhelmed
‣   Lost data
Scribe
‣   Surprise! FB had same problem, built and open-sourced Scribe
‣   Log collection framework over Thrift
‣   You “scribe” log lines, with categories
‣   It does the rest
Scribe
‣   Runs locally; reliable in network outage
‣   Nodes only know downstream writer; hierarchical, scalable
‣   Pluggable outputs

    [Diagram: FE nodes → Agg nodes → File / HDFS outputs]
Scribe at Twitter
‣   Solved our problem, opened new vistas
‣   Currently 40 different categories logged from JavaScript, Ruby, Scala, Java, etc
‣   We improved logging, monitoring, behavior during failure conditions, writes to HDFS, etc
‣   Continuing to work with FB to make it better

    http://github.com/traviscrawford/scribe
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
How Do You Store 12 TB/day?
‣   Single machine?
‣   What’s hard drive write speed?
‣   ~80 MB/s
‣   42 hours to write 12 TB
‣   Uh oh.
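(Sanity check: 12 TB ≈ 12,000,000 MB, and 12,000,000 MB ÷ 80 MB/s ≈ 150,000 s ≈ 42 hours — far more than the 24 hours in which that day of data arrives.)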
Where Do I Put 12 TB/day?
‣   Need a cluster of machines
‣   ... which adds new layers of complexity
Hadoop
‣   Distributed file system
    ‣   Automatic replication
    ‣   Fault tolerance
    ‣   Transparently read/write across multiple machines
‣   MapReduce-based parallel computation
    ‣   Key-value based computation interface allows for wide applicability
    ‣   Fault tolerance, again
Hadoop
‣   Open source: top-level Apache project
‣   Scalable: Y! has a 4000 node cluster
‣   Powerful: sorted 1TB random integers in 62 seconds


‣   Easy packaging/install: free Cloudera RPMs
MapReduce Workflow
‣   Challenge: how many tweets per user, given tweets table?
‣   Input: key=row, value=tweet info
‣   Map: output key=user_id, value=1
‣   Shuffle: sort by user_id
‣   Reduce: for each user_id, sum
‣   Output: user_id, tweet count
‣   With 2x machines, runs 2x faster

    [Diagram: Inputs → Map → Shuffle/Sort → Reduce → Outputs]
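For example, if user 7 wrote three tweets and user 12 wrote one, the map phase emits (7, 1), (7, 1), (7, 1), (12, 1); the shuffle groups those pairs by user_id; and the reduce phase sums each group, producing (7, 3) and (12, 1).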
Two Analysis Challenges
‣   Compute mutual followings in Twitter’s interest graph
    ‣   grep, awk? No way.
    ‣   If data is in MySQL... self join on an n-billion row table?
    ‣   n,000,000,000 x n,000,000,000 = ?
    ‣   I don’t know either.
Two Analysis Challenges
‣   Large-scale grouping and counting
    ‣   select count(*) from users? maybe.
    ‣   select count(*) from tweets? uh...
    ‣   Imagine joining these two.
    ‣   And grouping.
    ‣   And sorting.
Back to Hadoop
‣   Didn’t we have a cluster of machines?
‣   Hadoop makes it easy to distribute the calculation
‣   Purpose-built for parallel calculation
‣   Just a slight mindset adjustment
‣   But a fun one!
Analysis at Scale
‣   Now we’re rolling
‣   Count all tweets: 20+ billion, 5 minutes
‣   Parallel network calls to FlockDB to compute interest graph aggregates
‣   Run PageRank across users and interest graph
But...
‣   Analysis typically in Java
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom code
‣   Joins lengthy, error-prone
‣   n-stage jobs hard to manage
‣   Data exploration requires compilation
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
Pig
‣   High level language
‣   Transformations on sets of records
‣   Process data one step at a time
‣   Easier than SQL?
Why Pig?
Because I bet you can read the following script
A Real Pig Script




‣   Just for fun... the same calculation in Java next
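The script itself appeared as an image on this slide and isn’t reproduced in the text. As an illustrative stand-in, here is a minimal Pig sketch of the tweets-per-user count from the MapReduce example earlier; the file path and schema are assumptions, not taken from the talk:

    -- Load the tweets table (hypothetical path and schema)
    tweets  = LOAD 'tweets' USING PigStorage('\t') AS (user_id:long, text:chararray);
    -- Group by user and count: the same map/shuffle/reduce flow as before
    grouped = GROUP tweets BY user_id;
    counts  = FOREACH grouped GENERATE group AS user_id, COUNT(tweets) AS tweet_count;
    STORE counts INTO 'tweet_counts';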
No, Seriously.
Pig Makes it Easy
‣   5% of the code
‣   5% of the dev time
‣   Within 20% of the running time
‣   Readable, reusable
‣   As Pig improves, your calculations run faster
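To tie this back to the first analysis challenge, a hypothetical sketch of how the mutual-followings self-join might look in Pig (the edge file and schema are assumptions, not from the talk):

    -- One record per follow edge: src follows dst (illustrative schema)
    edges        = LOAD 'follow_edges' AS (src:long, dst:long);
    -- Reverse each edge so a mutual follow lines up with itself
    reversed     = FOREACH edges GENERATE dst AS src, src AS dst;
    -- Join on both fields: (a, b) survives only if (b, a) also exists
    mutual       = JOIN edges BY (src, dst), reversed BY (src, dst);
    pairs        = FOREACH mutual GENERATE edges::src AS user_a, edges::dst AS user_b;
    -- Each mutual pair appears twice ((a,b) and (b,a)); keep one ordering
    mutual_pairs = FILTER pairs BY user_a < user_b;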
One Thing I’ve Learned
‣   It’s easy to answer questions
‣   It’s hard to ask the right questions
‣   Value the system that promotes innovation and iteration
‣   More minds contributing = more value from your data
Counting Big Data
‣   How many requests per day?
‣   Average latency? 95% latency?
‣   Response code distribution per hour?
‣   Twitter searches per day?
‣   Unique users searching, unique queries?
‣   Links tweeted per day? By domain?
‣   Geographic distribution of all of the above
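For a concrete flavor of one of these counts in Pig, here is a hypothetical sketch of the per-hour response code distribution (log path and schema are assumptions, not from the talk):

    -- One record per request (illustrative schema)
    logs    = LOAD 'web_logs' AS (epoch_sec:long, uri:chararray, status:int, latency_ms:int);
    hourly  = FOREACH logs GENERATE (epoch_sec / 3600) AS hour, status;
    grouped = GROUP hourly BY (hour, status);
    dist    = FOREACH grouped GENERATE FLATTEN(group) AS (hour, status), COUNT(hourly) AS requests;
    STORE dist INTO 'status_code_distribution';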
Correlating Big Data
‣   Usage difference for mobile users?
‣   ... for users on desktop clients?
‣   ... for users of #newtwitter?
‣   Cohort analyses
‣   What features get users hooked?
‣   What features power Twitter users use often?
Research on Big Data
‣   What can we tell from a user’s tweets?
‣   ... from the tweets of their followers?
‣   ... from the tweets of those they follow?
‣   What influences retweets? Depth of the retweet tree?
‣   Duplicate detection (spam)
‣   Language detection (search)
‣   Machine learning
‣   Natural language processing
Diving Deeper
‣   HBase and building products from Hadoop
‣   LZO Compression
‣   Protocol Buffers and Hadoop
‣   Our analytics-related open source: hadoop-lzo, elephant-bird
‣   Moving analytics to realtime

    http://github.com/kevinweil/hadoop-lzo
    http://github.com/kevinweil/elephant-bird
Questions?
    Follow me at twitter.com/kevinweil
