Taming the Social Media Firehose

Scott Hendrickson
Data Scientist
Gnip
Social media firehoses

Connect, move and store lots of data

Filter and analyze

E.g. How a social media story evolves

Dig deeper
Obtain: pointing and clicking does not scale.
Scrub: the world is a messy place.
Explore: you can see a lot by looking.
Models: always bad, sometimes ugly.
iNterpret: insight, not numbers.

Hilary Mason & Chris Wiggins
http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
Obtain, Parse, Store, Filter, Analyze, Structure, Aggregate, iNterpret
Continuous streams of flexibly structured social media activities in near-real time.
Continuous




Twitter Full Firehose:
      
300M+ activities/day
      
3,500 activities/second
      
or 1 activity every 290 μsec

Wordpress and Disqus Comments:
      
400K+ activities/day
      
4.6 activities/second
      
or 1 activity every 0.22 s
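The per-second rates and inter-arrival times above follow directly from the daily volumes. A minimal sketch of that arithmetic, assuming activity is spread evenly over a 24-hour day (real streams are bursty, so these are averages only):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rate_per_second(activities_per_day):
    """Average number of activities per second for a daily volume."""
    return activities_per_day / SECONDS_PER_DAY


def mean_interval_seconds(activities_per_day):
    """Average time between consecutive activities, in seconds."""
    return SECONDS_PER_DAY / activities_per_day


# Twitter full firehose: 300M+ activities/day
twitter_rate = rate_per_second(300e6)                 # about 3,470 per second
twitter_gap_us = mean_interval_seconds(300e6) * 1e6   # about 288 microseconds

# Wordpress and Disqus comments: 400K+ activities/day
comments_rate = rate_per_second(400e3)                # about 4.6 per second
comments_gap_s = mean_interval_seconds(400e3)         # about 0.22 seconds
```

The slide rounds the Twitter figures to 3,500 per second and 290 μsec.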
streams




E.g. Streaming HTTP
      
Not your familiar 1-shot web APIs
      
      

      
A step from stateless sessions
          •  Connection monitoring
          •  “Keep alive” records
          •  Caching-on-disconnect


(Ping → gniP)
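A consumer of a streaming HTTP connection has to tell real activities apart from "keep alive" records. A minimal sketch, assuming one JSON activity per line and a blank line as the keep-alive record (both are assumptions for illustration, and `parse_stream` is a hypothetical helper name):

```python
import json


def parse_stream(lines):
    """Yield one dict per activity, skipping keep-alive records.

    Assumption: each activity is a single line of JSON, and a
    keep-alive is a blank (whitespace-only) line.
    """
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        line = raw.strip()
        if not line:
            # Keep-alive: carries no data, only shows the connection is up.
            continue
        yield json.loads(line)


# Stand-in for lines read off a long-lived HTTP response.
sample = [
    '{"id": 1, "body": "quake!"}\r\n',
    "\r\n",
    b'{"id": 2, "body": "terremoto"}\r\n',
]
activities = list(parse_stream(sample))
```

In a real client, the time since the last line of any kind (activity or keep-alive) is what connection monitoring would watch.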
  
flexibly
structured

Vis-à-vis firehoses: 
   Emphasis on time-ordered events

   Combination of data and meta-data
      
      
E.g. Tweet and number of Retweets

   Activity encapsulation
      
      
Hierarchical structures within activity


Flexibly Structured = “Unstructured” in the normalized set-based database sense
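To illustrate the gap between a hierarchical activity and a flat, set-based row, here is a small sketch that flattens a nested activity into dotted column names. The activity and its field names are hypothetical, not a real publisher's format:

```python
def flatten(activity, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted paths."""
    flat = {}
    for key, value in activity.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical activity: data (body) and meta-data (counts) side by side,
# with a nested structure for the author.
activity = {
    "postedTime": "2012-03-20T18:09:13Z",
    "body": "Did anyone else feel that quake?",
    "actor": {"displayName": "example_user", "followersCount": 42},
    "retweetCount": 3,
}
flat = flatten(activity)
```

Because different activities can carry different nested fields, the set of flattened columns varies from record to record, which is why a fixed relational schema fits poorly.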
  
social media activities

Tweets, micro-blogs
Blog/rich-media posts
Comments/threaded discussions
Rich media-sharing (urls, reposts)
Location data (place, long/lat)
Friend/follower relationships
Engagement (e.g. Likes, up- and down-votes, reputation)
Tagging
near-real time


Twitter (Tweet-through-firehose-spigot)
      

      
~1.6 s (as low as 300 msec)
      

Wordpress Posts: (Post-through-firehose-spigot)

      
~2.5 s (as low as 1 sec)
      


Is anything realtime?
  
1.  Compare time-evolution of social media
    reactions across firehoses

2.  Compare richness of content across
    firehoses
Firehoses:
       
Twitter
       
Wordpress Posts and Comments
       
Newsgator
Filter content on key terms:
       
“quake”
       
“terremoto”
Extract the date and time posted, group in 1-minute buckets,
and plot
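The filter-and-bucket step can be sketched as follows. The field names `body` and `postedTime` and the timestamp format are assumptions for illustration; the key terms are the ones listed above:

```python
from collections import Counter
from datetime import datetime

KEY_TERMS = ("quake", "terremoto")


def matches(text, terms=KEY_TERMS):
    """Case-insensitive substring match on any key term."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def minute_bucket(posted_time):
    """Truncate a 'YYYY-MM-DDTHH:MM:SSZ' timestamp to the minute."""
    parsed = datetime.strptime(posted_time, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(second=0)


def count_per_minute(activities):
    """Count matching activities in each 1-minute bucket."""
    counts = Counter()
    for activity in activities:
        if matches(activity["body"]):
            counts[minute_bucket(activity["postedTime"])] += 1
    return counts


# Made-up sample activities.
sample = [
    {"postedTime": "2012-03-20T18:09:13Z", "body": "Big QUAKE just now"},
    {"postedTime": "2012-03-20T18:09:48Z", "body": "terremoto!"},
    {"postedTime": "2012-03-20T18:10:02Z", "body": "lunch time"},
    {"postedTime": "2012-03-20T18:10:30Z", "body": "Another quake report"},
]
counts = count_per_minute(sample)
```

Sorting `counts.items()` gives the time series to plot.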
Surprise events fit a “double-exponential” pulse in
activity rate that enables consistent comparison
between events and sources
R0 = 1288.150591
alpha=0.001470
beta=0.000195

# t0=1332266953
# TPeak=1332268410
Time-to-peak = 24.3 min
Peak rate=855

Mass=5816206.183899
# T 1/2life=1332272593
1/2Life = 69.7 min
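The slide does not state the fitted formula. One form that reproduces the reported time-to-peak and peak rate from these parameters is R(t) = R0 (1 - exp(-alpha t)) exp(-beta t), with t in seconds since t0. The sketch below assumes that form and those units:

```python
import math

# Fitted parameters as reported above.
R0 = 1288.150591
ALPHA = 0.001470
BETA = 0.000195


def rate(t):
    """Assumed pulse shape: fast rise (alpha), slow decay (beta)."""
    return R0 * (1.0 - math.exp(-ALPHA * t)) * math.exp(-BETA * t)


def time_to_peak():
    """Solve dR/dt = 0, which gives exp(-alpha t) = beta / (alpha + beta)."""
    return math.log((ALPHA + BETA) / BETA) / ALPHA


def half_life_after_peak(step=1.0):
    """Seconds after the peak until the rate falls to half the peak rate.

    Uses a simple forward scan in `step`-second increments.
    """
    t_peak = time_to_peak()
    target = rate(t_peak) / 2.0
    t = t_peak
    while rate(t) > target:
        t += step
    return t - t_peak


def mass():
    """Integral of rate(t) from 0 to infinity."""
    return R0 * (1.0 / BETA - 1.0 / (ALPHA + BETA))


t_peak_min = time_to_peak() / 60.0          # about 24.3 minutes
peak_rate = rate(time_to_peak())            # about 855
half_life_min = half_life_after_peak() / 60.0
total = mass()
```

With the rounded alpha and beta shown on the slide, the half-life and mass come out close to, but not exactly at, the reported 69.7 min and 5816206.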
	
  
1.  Connect and stream data from firehoses
2.  Preliminary filter
3.  Store to file
4.  Extract post times
5.  Count activities in 1-minute buckets
6.  Proxy of “richness”: count the number of
    characters in content
7.  Visualize
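Step 6 can be sketched as an average character count per time bucket. The input shape, a sequence of (bucket, content) pairs, is an assumption made for illustration:

```python
from collections import defaultdict


def mean_length_per_bucket(records):
    """Return {bucket: mean number of characters in content}.

    `records` is an iterable of (bucket, content) pairs, where bucket
    is any hashable label such as a minute timestamp.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [chars, count]
    for bucket, content in records:
        totals[bucket][0] += len(content)
        totals[bucket][1] += 1
    return {bucket: chars / n for bucket, (chars, n) in totals.items()}


# Made-up sample: short tweets versus a longer blog post.
sample = [
    ("18:09", "quake!"),
    ("18:09", "terremoto"),
    ("18:10", "A longer blog post about the quake."),
]
richness = mean_length_per_bucket(sample)
```

Character count is a crude proxy; it says nothing about what the content means, only how much of it there is.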
Connecting

Simple HTTP streaming with cURL
       curl --compressed -v \
         -ushendrickson@gnip.com \
         "https://stream.gnip.com:443/accounts/shendrickson/publishers/twitter/streams/sample10/decahose.json"

Build based on libraries

Off-the-shelf (OTS) solutions
Connecting


Considerations:
  Disconnects 
  Redundancy
  Latency
  Bandwidth
  Data bursts
  Costs
  Publisher TOS – Deletes
  De-dups, missing activities
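As one example of these considerations, a reconnect may replay activities that were already delivered. A minimal de-duplication sketch, assuming every activity carries a unique id (the class name and the size limit are hypothetical):

```python
from collections import OrderedDict


class Deduplicator:
    """Remember recently seen activity ids, with bounded memory."""

    def __init__(self, max_ids=100_000):
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def is_new(self, activity_id):
        """Return True the first time an id is seen, False on repeats."""
        if activity_id in self._seen:
            return False
        self._seen[activity_id] = None
        if len(self._seen) > self._max_ids:
            # Forget the oldest id so memory stays bounded.
            self._seen.popitem(last=False)
        return True


dedup = Deduplicator(max_ids=2)
results = [dedup.is_new(i) for i in ["a", "a", "b", "c", "a"]]
```

The bound is a trade-off: an id older than the window is treated as new again, as the last call in the usage line shows.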
Moving and Storing

Volumes (JSON, gzip’d)
      
100M Tweets = 25 GB 
      
< 2 min @300 MB/s (SATA II)
      
< 6 hrs @10 Mb/s (Ethernet)
      
1 day Wordpress.com posts = 350MB
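The transfer times above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 GB = 10^9 bytes), that "MB/s" means megabytes per second, and that "Mb/s" means megabits per second:

```python
GB = 1e9  # bytes
MB = 1e6  # bytes


def transfer_seconds(size_bytes, bytes_per_second):
    """Time to move size_bytes at a sustained throughput."""
    return size_bytes / bytes_per_second


tweets_100m = 25 * GB

# SATA II at 300 MB/s
sata_seconds = transfer_seconds(tweets_100m, 300 * MB)        # about 83 s

# Ethernet at 10 Mb/s: divide by 8 to convert bits to bytes
ethernet_seconds = transfer_seconds(tweets_100m, 10e6 / 8)    # 20,000 s
ethernet_hours = ethernet_seconds / 3600                      # about 5.6 h
```

Both results fall inside the bounds quoted above (under 2 minutes and under 6 hours), ignoring protocol overhead.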

File system
NoSQL/Key-Value Stores – Flexible structure
Relational DB Stores – Indexes rock
Message Queues
Filter


 Model – guess at structure and process
        

 Parse – sort out the pieces
        

 Filter – reduce to what matters
 
 Aggregate – cluster, sum, average…
 
 Analyze – tell the story with data
Speed vs. Depth

Evolution
Network dynamics
     
Influencers, path analysis, viral spread…

Time dynamics
     
Time to peak, story half-life…

Natural language processing
     
“Aboutness” is hard, but gets easier as the domain narrows

Explore and deploy
     
Master skills, shorten cycles of exploration
     
Move learning to production
www.gnip.com
Twitter: @drskippy27
