Taming the Social Media Firehose

Scott Hendrickson
Data Scientist
Gnip
Social media firehoses

Connect, move and store lots of data

Filter and analyze

E.g. How a social media story evolves

Dig deeper
Obtain: pointing and clicking does not scale.
Scrub: the world is a messy place.
Explore: you can see a lot by looking.
Models: always bad, sometimes ugly.
iNterpret: insight, not numbers.

Hilary Mason & Chris Wiggins
http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
[Diagram: Obtain, Parse, Store, Filter, Analyze, Structure, Aggregate, iNterpret]
Continuous streams of flexibly structured social media activities in near-real time.
Continuous




Twitter Full Firehose:
      
300M+ activities/day
      
3,500 activities/second
      
or 1 activity every 290 μsec

Wordpress and Disqus Comments:
      
400K+ activities/day
      
4.6 activities/second
      
or 1 activity every 0.22 s
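
The per-second and per-activity figures above follow directly from the daily volumes. A minimal sketch of the arithmetic; the function names are illustrative, and the daily totals are the rounded values from the slide:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(activities_per_day: float) -> float:
    """Average arrival rate in activities per second."""
    return activities_per_day / SECONDS_PER_DAY


def interval_seconds(activities_per_day: float) -> float:
    """Average gap between consecutive activities, in seconds."""
    return SECONDS_PER_DAY / activities_per_day


# Daily volumes as quoted on the slide (rounded figures).
for name, daily in [("Twitter full firehose", 300e6),
                    ("Wordpress and Disqus comments", 400e3)]:
    rate = per_second(daily)
    gap_us = interval_seconds(daily) * 1e6
    print(f"{name}: {rate:,.1f} activities/s, one every {gap_us:,.0f} microseconds")
```

These are averages only; bursts during a breaking story run well above the mean rate.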
streams




E.g. Streaming HTTP
      
Not your familiar 1-shot web APIs
      
      

      
A step from stateless sessions
          •  Connection monitoring
          •  “Keep alive” records
          •  Caching-on-disconnect
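
A minimal sketch of the client side of a long-lived stream. It assumes newline-delimited JSON in which blank lines serve as keep-alive records; that wire format, and the `id` and `body` field names, are assumptions for illustration and are not taken from the slides.

```python
import json
from typing import Iterable, Iterator


def activities(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield parsed activities from a newline-delimited JSON stream.

    Blank lines are treated as keep-alive records and skipped
    (an assumption about the wire format).
    """
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # keep-alive: the connection is healthy, nothing to parse
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # Truncated record, e.g. cut off by a disconnect; skip it.
            continue


# Hypothetical sample: two activities, one keep-alive, one truncated record.
sample = [
    b'{"id": "1", "body": "quake!"}\r\n',
    b"\r\n",
    b'{"id": "2", "body": "terremoto"}\r\n',
    b'{"id": "3", "bo',
]
parsed = list(activities(sample))
print([a["id"] for a in parsed])
```

In a live client the same generator could wrap the line iterator of a streaming HTTP response, with reconnect logic around it.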


(Ping → gniP)
flexibly structured

Vis-à-vis firehoses:
   Emphasis on time-ordered events
   Combination of data and meta-data
      E.g. Tweet and number of Retweets
   Activity encapsulation
      Hierarchical structures within activity

Flexibly structured = “unstructured” in the normalized, set-based database sense
social media activities

Tweets, micro-blogs
Blog/rich-media posts
Comments/threaded discussions
Rich media-sharing (urls, reposts)
Location data (place, long/lat)
Friend/follower relationships
Engagement (e.g. Likes, up- and down-votes, reputation)
Tagging
near-real time


Twitter (Tweet-through-firehose-spigot)
      

      
~1.6 s (as low as 300 msec)
      

Wordpress Posts: (Post-through-firehose-spigot)

      
~2.5 s (as low as 1 sec)
      


Is anything real-time?
1.  Compare time-evolution of social media
    reactions across firehoses

2.  Compare richness of content across
    firehoses
Firehoses:
       
Twitter
       
Wordpress Posts and Comments
       
Newsgator
Filter content on key terms:
       
“quake”
       
“terremoto”
Extract date time posted, group in 1 min buckets
and plot
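
A minimal sketch of the filter-and-bucket step. The field names `body` and `postedTime`, the timestamp format, and the sample records are hypothetical; real activity payloads differ by publisher.

```python
from collections import Counter
from datetime import datetime, timezone

KEY_TERMS = ("quake", "terremoto")


def matches(text: str) -> bool:
    """True if any key term appears in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term in lowered for term in KEY_TERMS)


def minute_bucket(posted: str) -> str:
    """Truncate a timestamp to its minute (assumes 'YYYY-MM-DDTHH:MM:SSZ' in UTC)."""
    ts = datetime.strptime(posted, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return ts.strftime("%Y-%m-%d %H:%M")


def counts_per_minute(activities) -> Counter:
    """Count matching activities in 1-minute buckets."""
    counts = Counter()
    for activity in activities:
        if matches(activity["body"]):
            counts[minute_bucket(activity["postedTime"])] += 1
    return counts


# Made-up sample records for illustration only.
sample = [
    {"postedTime": "2012-03-20T18:09:13Z", "body": "Earthquake! #quake"},
    {"postedTime": "2012-03-20T18:09:48Z", "body": "Terremoto en la ciudad"},
    {"postedTime": "2012-03-20T18:10:02Z", "body": "lunch time"},
    {"postedTime": "2012-03-20T18:10:30Z", "body": "another quake report"},
]
counts = counts_per_minute(sample)
for bucket in sorted(counts):
    print(bucket, counts[bucket])
```

The sorted bucket counts are what gets plotted as activity rate against time.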
Surprise events fit a “double-exponential” pulse in
activity rate that enables consistent comparison
between events and sources
R0 = 1288.150591
alpha=0.001470
beta=0.000195

# t0=1332266953
# TPeak=1332268410
Time-to-peak = 24.3 min
Peak rate=855

Mass=5816206.183899
# T 1/2life=1332272593
1/2Life = 69.7 min
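
The slides do not give the functional form of the pulse. One form that approximately reproduces the reported time-to-peak, peak rate and half-life from R0, alpha and beta is r(t) = R0 (1 - exp(-alpha t)) exp(-beta t), with t in seconds since t0. The sketch below assumes that form and assumes the half-life is measured from the peak to the point where the rate falls to half its peak value; both are inferences, not statements from the source.

```python
import math

# Fit parameters as reported on the slide.
R0 = 1288.150591
ALPHA = 0.001470   # per second (assumed: rise constant)
BETA = 0.000195    # per second (assumed: decay constant)


def rate(t: float) -> float:
    """Assumed pulse shape; t is seconds since t0."""
    if t < 0:
        return 0.0
    return R0 * (1.0 - math.exp(-ALPHA * t)) * math.exp(-BETA * t)


def time_to_peak() -> float:
    """Seconds from t0 to the maximum, from setting dr/dt = 0."""
    return math.log((ALPHA + BETA) / BETA) / ALPHA


def half_life_after_peak() -> float:
    """Seconds after the peak at which the rate drops to half the peak rate."""
    t_peak = time_to_peak()
    target = rate(t_peak) / 2.0
    lo, hi = t_peak, t_peak + 1e6  # the rate is far below target at the upper bound
    for _ in range(100):           # bisection on the monotonically decaying side
        mid = (lo + hi) / 2.0
        if rate(mid) > target:
            lo = mid
        else:
            hi = mid
    return lo - t_peak


print(f"time to peak: {time_to_peak() / 60:.1f} min")
print(f"peak rate:    {rate(time_to_peak()):.0f}")
print(f"half-life:    {half_life_after_peak() / 60:.1f} min")
```

Small differences from the slide's values are expected because the published parameters are rounded.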
	
  
1.  Connect and stream data from firehoses
2.  Preliminary filter
3.  Store to file
4.  Extract post times
5.  Count activities in 1-minute buckets
6.  Proxy of “richness”: count the number of characters in content
7.  Visualize
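
A minimal sketch of the “richness” proxy from step 6: character count of the content, averaged per source. The (source, content) pairs are made-up examples, and averaging per source is one possible way to compare firehoses, not necessarily the one used for the plots.

```python
from statistics import mean


def richness(content: str) -> int:
    """Proxy for richness: number of characters in the content."""
    return len(content)


def mean_richness_by_source(activities) -> dict:
    """Average character count per source, given (source, content) pairs."""
    by_source = {}
    for source, content in activities:
        by_source.setdefault(source, []).append(richness(content))
    return {source: mean(lengths) for source, lengths in by_source.items()}


# Hypothetical sample for illustration only.
sample = [
    ("twitter", "quake!"),
    ("twitter", "terremoto"),
    ("wordpress", "We felt the quake at noon."),
]
result = mean_richness_by_source(sample)
print(result)
```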
Connecting

Simple 
     
HTTP streaming with cURL
       curl --compressed -v -ushendrickson@gnip.com \
         "https://stream.gnip.com:443/accounts/shendrickson/publishers/twitter/streams/sample10/decahose.json"
   
Build based on libraries

Off-the-shelf (OTS) solutions
Connecting


Considerations:
  Disconnects 
  Redundancy
  Latency
  Bandwidth
  Data bursts
  Costs
  Publisher TOS – Deletes
  De-dups, missing activities
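
Redundant connections and replays after a disconnect can deliver the same activity more than once. A minimal sketch of de-duplication with a bounded window of recently seen IDs; the class name, the capacity and the assumption that each activity carries a unique string ID are all illustrative.

```python
from collections import OrderedDict


class RecentIds:
    """Remember the most recently seen activity IDs, up to a fixed capacity."""

    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_new(self, activity_id: str) -> bool:
        """Return True the first time an ID is seen within the window."""
        if activity_id in self._seen:
            self._seen.move_to_end(activity_id)  # refresh its position
            return False
        self._seen[activity_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)       # evict the oldest ID
        return True


window = RecentIds(capacity=2)
results = [window.is_new(i) for i in ["a", "b", "a", "c", "b"]]
print(results)
```

The bounded window trades memory for accuracy: a duplicate that arrives after its ID has been evicted is treated as new, as the final "b" in the example shows.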
Moving and Storing

Volumes (JSON, gzip’d)
      
100M Tweets = 25 GB 
      
< 2 min @300 MB/s (SATA II)
      
< 6 hrs @10 Mb/s (Ethernet)
      
1 day Wordpress.com posts = 350MB
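
The transfer times above can be checked with a short calculation. This sketch assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes, 1 Mb = 10^6 bits) and ignores protocol overhead.

```python
GB = 1e9          # bytes (decimal units assumed)
MB = 1e6          # bytes
MEGABIT = 1e6     # bits


def transfer_seconds(size_bytes: float, bytes_per_second: float) -> float:
    """Idealized time to move size_bytes at a sustained rate."""
    return size_bytes / bytes_per_second


size = 25 * GB                                   # 100M Tweets, JSON, gzip'd
sata_s = transfer_seconds(size, 300 * MB)        # 300 MB/s
ethernet_s = transfer_seconds(size, 10 * MEGABIT / 8)  # 10 Mb/s is 1.25 MB/s

print(f"SATA II:  {sata_s / 60:.1f} min")
print(f"Ethernet: {ethernet_s / 3600:.1f} hrs")
```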

File system
NoSQL/Key-Value Stores – Flexible structure
Relational DB Stores – Indexes rock
Message Queues
Filter


 Model – guess at structure and process
        

 Parse – sort out the pieces
        

 Filter – reduce to what matters
 
 Aggregate – cluster, sum, average…
 
 Analyze – tell the story with data
Speed vs. Depth

Evolution
Network dynamics
     
Influencers, path analysis, viral spread…

Time dynamics
     
Time to peak, story half-life…

Natural language processing
     
“Aboutness” is hard, but gets easier as the domain narrows

Explore and deploy
     
Master skills, shorten cycles of exploration
     
Move learning to production
www.gnip.com
Twitter: @drskippy27
