Hendrickson data2 2012-gnip
1. Taming the Social Media Firehose
Scott Hendrickson
Data Scientist
Gnip
2. Social media firehoses
Connect, move and store lots of data
Filter and analyze
E.g. How a social media story evolves
Dig deeper
3. Obtain: pointing and clicking does not scale.
Scrub: the world is a messy place.
Explore: you can see a lot by looking.
Models: always bad, sometimes ugly.
iNterpret: insight, not numbers.
Hilary Mason & Chris Wiggins
http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
4. Obtain
Parse
Store
Filter
Analyze
Structure
Aggregate
iNterpret
6. Continuous
Twitter Full Firehose:
300M+ activities/day
3,500 activities/second
or 1 activity every 290 μsec
Wordpress and Disqus Comments:
400K+ activities/day
4.6 activities/second
or 1 activity every 0.22 s
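The per-second and per-activity figures follow from the daily volumes. A minimal Python sketch of that arithmetic, using the rounded daily counts quoted on the slide (the function name `stream_rate` is illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def stream_rate(activities_per_day):
    """Return (activities per second, seconds between activities)."""
    per_second = activities_per_day / SECONDS_PER_DAY
    return per_second, 1.0 / per_second

twitter_rate, twitter_gap = stream_rate(300e6)    # Twitter full firehose
comments_rate, comments_gap = stream_rate(400e3)  # Wordpress and Disqus comments

print(f"Twitter: {twitter_rate:,.0f}/s, one every {twitter_gap * 1e6:.0f} microseconds")
print(f"Comments: {comments_rate:.1f}/s, one every {comments_gap:.2f} s")
```

The results agree with the rounded slide figures: roughly 3,500 activities/second for Twitter and about 4.6 per second for comments.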
7. streams
E.g. Streaming HTTP
Not your familiar 1-shot web APIs
A step from stateless sessions
• Connection monitoring
• “Keep alive” records
• Caching-on-disconnect
(Ping → gniP)
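A long-lived stream interleaves data with "keep alive" records, so a consumer has to tell them apart. The sketch below shows one way to do that in Python. It assumes one JSON activity per line and that a keep-alive is a blank line; both are assumptions made for illustration, not details stated on the slide.

```python
import json

def iter_activities(lines):
    """Yield parsed JSON activities, skipping keep-alive records.

    Assumption: each activity is one JSON document per line, and a
    keep-alive record is a blank (whitespace-only) line.
    """
    for raw in lines:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        text = text.strip()
        if not text:
            continue  # keep-alive: shows the connection is up, carries no data
        yield json.loads(text)

# Simulated stream fragment; a real client would iterate over the HTTP response.
sample = [
    b'{"id": 1, "body": "quake!"}\r\n',
    b"\r\n",
    b'{"id": 2, "body": "terremoto"}\r\n',
]
activities = list(iter_activities(sample))
print(len(activities), "activities parsed")
```

A production client would also track the time since the last record of either kind and reconnect when that gap grows too long, which is the connection-monitoring role of the keep-alive.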
8. flexibly structured
Vis-à-vis firehoses:
Emphasis on time-ordered events
Combination of data and meta-data
E.g. Tweet and number of Retweets
Activity encapsulation
Hierarchical structures within activity
Flexibly Structured = “Unstructured” in the normalized set-based database sense
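To make "hierarchical structures within activity" concrete, here is a small Python sketch that reads fields from a nested record and tolerates missing ones. The activity and all of its field names are hypothetical, invented for illustration rather than taken from any real firehose schema.

```python
import json

# Hypothetical activity; field names are illustrative, not a real schema.
raw = """{
  "postedTime": "2012-01-01T12:00:00Z",
  "body": "Felt the quake downtown",
  "actor": {"displayName": "example_user", "followersCount": 120},
  "retweetCount": 3,
  "location": {"geo": {"coordinates": [10.0, 20.0]}}
}"""

def get_path(record, path, default=None):
    """Follow a sequence of keys into nested dicts; return default if any key is missing."""
    node = record
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

activity = json.loads(raw)
print(get_path(activity, ["actor", "displayName"]))
print(get_path(activity, ["location", "geo", "coordinates"]))
print(get_path(activity, ["object", "link"], default="(no link)"))
```

Because fields may be present in one activity and absent in the next, lookups with a default are usually safer than direct indexing.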
9. social media activities
Tweets, micro-blogs
Blog/rich-media posts
Comments/threaded discussions
Rich media-sharing (urls, reposts)
Location data (place, long/lat)
Friend/follower relationships
Engagement (e.g. Likes, up- and down-votes, reputation)
Tagging
11. 1. Compare time-evolution of social media reactions across firehoses
2. Compare richness of content across firehoses
12. Firehoses:
Twitter
Wordpress Posts and Comments
Newsgator
Filter content on key terms:
“quake”
“terremoto”
Extract the date and time posted, group into 1-minute buckets, and plot
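The filter-and-bucket step can be sketched in a few lines of Python. The key terms come from the slide; the record layout (`body`, `postedTime`) and the timestamp format are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

KEY_TERMS = ("quake", "terremoto")  # key terms from the slide

def matches(body, terms=KEY_TERMS):
    """Case-insensitive substring match on any key term."""
    lowered = body.lower()
    return any(term in lowered for term in terms)

def minute_bucket(posted_time):
    """Truncate a timestamp such as 2012-01-01T12:00:41Z to the minute."""
    parsed = datetime.strptime(posted_time, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(second=0)

def count_per_minute(activities):
    """Count matching activities per 1-minute bucket, sorted by time."""
    counts = Counter()
    for activity in activities:
        if matches(activity["body"]):
            counts[minute_bucket(activity["postedTime"])] += 1
    return sorted(counts.items())

# Hypothetical sample records.
sample = [
    {"postedTime": "2012-01-01T12:00:05Z", "body": "Earthquake! Did anyone feel that?"},
    {"postedTime": "2012-01-01T12:00:41Z", "body": "TERREMOTO en la ciudad"},
    {"postedTime": "2012-01-01T12:01:10Z", "body": "Lunch time"},
    {"postedTime": "2012-01-01T12:01:30Z", "body": "Another quake report"},
]
buckets = count_per_minute(sample)
for minute, count in buckets:
    print(minute.isoformat(), count)
```

Substring matching means "quake" also catches "earthquake". The sorted (minute, count) pairs can be handed to any plotting tool.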
13.
14. Surprise events fit a “double-exponential” pulse in
activity rate that enables consistent comparison
between events and sources
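The slide does not give the functional form of the pulse. One common way to write a double-exponential pulse, assumed here purely for illustration, is a slow decay minus a fast rise, r(t) = A (exp(-(t - t0)/tau_decay) - exp(-(t - t0)/tau_rise)) for t >= t0. All parameter names below are hypothetical.

```python
import math

def pulse(t, amplitude, t0, tau_rise, tau_decay):
    """Assumed double-exponential activity rate: zero before t0, fast rise, slow decay."""
    if t < t0:
        return 0.0
    dt = t - t0
    return amplitude * (math.exp(-dt / tau_decay) - math.exp(-dt / tau_rise))

def time_to_peak(tau_rise, tau_decay):
    """Time after t0 at which the assumed pulse peaks (requires tau_decay > tau_rise)."""
    return tau_rise * tau_decay / (tau_decay - tau_rise) * math.log(tau_decay / tau_rise)

# Illustrative parameters in minutes: rise over ~2 min, decay over ~20 min.
peak_offset = time_to_peak(tau_rise=2.0, tau_decay=20.0)
peak_rate = pulse(peak_offset, amplitude=100.0, t0=0.0, tau_rise=2.0, tau_decay=20.0)
print(f"peak {peak_offset:.2f} min after onset, rate {peak_rate:.1f}")
```

Under this form, a fitted pair of time constants summarizes each event, which is what makes events and sources comparable on the same terms.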
17. 1. Connect and stream data from firehoses
2. Preliminary filter
3. Store to file
4. Extract post times
5. Count activities in 1-minute buckets
6. Proxy of “richness”: count the number of characters in content
7. Visualize
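Step 6, the character-count proxy for richness, can be sketched as a mean content length per source. The `source` and `body` field names and the sample records are hypothetical.

```python
from statistics import mean

def richness_by_source(activities):
    """Mean number of characters in the content, grouped by source."""
    lengths = {}
    for activity in activities:
        lengths.setdefault(activity["source"], []).append(len(activity["body"]))
    return {source: mean(values) for source, values in lengths.items()}

# Hypothetical sample records from two firehoses.
sample = [
    {"source": "twitter", "body": "quake!"},
    {"source": "twitter", "body": "terremoto"},
    {"source": "wordpress", "body": "A longer post describing the quake in some detail."},
]
richness = richness_by_source(sample)
for source, avg_chars in sorted(richness.items()):
    print(f"{source}: {avg_chars:.1f} characters on average")
```

Character count is a crude measure, which is why the slide calls it a proxy. It is cheap to compute on every activity as it streams past.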
18. Connecting
Simple
HTTP streaming with cURL
curl --compressed -v -ushendrickson@gnip.com \
  "https://stream.gnip.com:443/accounts/shendrickson/publishers/twitter/streams/sample10/decahose.json"
Build based on libraries
OTS solutions
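For the library-based route, a rough Python counterpart of the cURL command might look like the sketch below. It only builds the request (HTTP Basic auth plus a request for gzip compression) using the standard library. The credentials are placeholders, and the actual connection is left commented out because it needs a valid account and network access.

```python
import base64
import urllib.request

# URL as shown on the slide.
STREAM_URL = ("https://stream.gnip.com:443/accounts/shendrickson/publishers/"
              "twitter/streams/sample10/decahose.json")

def build_stream_request(url, user, password):
    """Build a request with HTTP Basic auth that asks for a gzip-compressed body."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept-Encoding", "gzip")
    return request

# Placeholder credentials, not real ones.
request = build_stream_request(STREAM_URL, "user@example.com", "not-a-real-password")
print(request.full_url)

# Opening the stream requires valid credentials and network access:
# response = urllib.request.urlopen(request)
# The body then arrives gzip-compressed and has to be decompressed incrementally.
```

Unlike a one-shot API call, the response here never completes on its own, so the reading loop needs its own reconnect and timeout handling.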
20. Moving and Storing
Volumes (JSON, gzip’d)
100M Tweets = 25 GB
< 2 min @300 MB/s (SATA II)
< 6 hrs @10 Mb/s (Ethernet)
1 day Wordpress.com posts = 350MB
File system
NoSQL/Key-Value Stores – Flexible structure
Relational DB Stores – Indexes rock
Message Queues
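The transfer times quoted for 100M Tweets can be checked with a short calculation. The sketch assumes decimal units (1 GB = 10^9 bytes, 1 Mb = 10^6 bits) and ignores protocol overhead.

```python
def transfer_seconds(size_bytes, rate_bytes_per_second):
    """Idealized transfer time, ignoring protocol overhead."""
    return size_bytes / rate_bytes_per_second

GB = 1e9  # decimal gigabyte (assumption)
size = 25 * GB  # 100M Tweets as gzip'd JSON, per the slide

sata_seconds = transfer_seconds(size, 300e6)         # 300 MB/s
ethernet_seconds = transfer_seconds(size, 10e6 / 8)  # 10 Mb/s = 1.25 MB/s
bytes_per_tweet = size / 100e6

print(f"SATA II:  {sata_seconds:.0f} s")
print(f"Ethernet: {ethernet_seconds / 3600:.1f} h")
print(f"Average compressed size: {bytes_per_tweet:.0f} bytes per Tweet")
```

Both results sit inside the bounds on the slide: under 2 minutes over SATA II and under 6 hours over 10 Mb/s Ethernet.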
21. Filter
Model – guess at structure and process
Parse – sort out the pieces
Filter – reduce to what matters
Aggregate – cluster, sum, average…
Analyse – tell the story with data
23. Network dynamics
Influencers, path analysis, viral spread…
Time dynamics
Time to peak, story half-life…
Natural language processing
“Aboutness” is hard, but gets easier as the domain narrows
Explore and deploy
Master skills, shorten cycles of exploration
Move learning to production