this presentation is a quick story about how we used beancounter.io to perform a realtime analysis of the #debate hashtag during the 3rd Obama-Romney debate.
4. during peak time ~35
persons/second top up
their Oyster card*
http://www.tfl.gov.uk/corporate/modesoftransport/londonunderground/1608.aspx
5. every second ~58 new
pictures are uploaded on
Instagram*
http://www.digitalbuzzblog.com/infographic-instagram-stats/
6. the night of the first
#debate, 2615 tweets
per second were
recorded*
http://www.nbcnews.com/technology/technolog/presidential-debate-sets-twitter-record-6281796
16. extract such information,
making it explicit,
analysing it
and doing it at a rate of
~2000 tweets/sec?
17. real-time analytics
Storm, a free and open source
distributed realtime computation
system. Storm makes it easy to
reliably process unbounded
streams of data, doing for realtime
processing what Hadoop did for
batch processing.
18. batch analyses
The Apache Hadoop software library is a
framework that allows for the distributed
processing of large data sets across
clusters of computers using simple
programming models.
+ hdfs, a distributed FS
19. data gathering from the Social Web
crunching the Social Web, in real-time.
formerly known as Beancounter
20. beancounter.io is a SaaS
platform to profile your
users from their activities on
the Social Web
22. (a quick parenthesis)
or ...
“how a butterfly flapping
its wings in Asia might
cause a hurricane in the
Atlantic” *
http://www.amazon.com/Strategic-Thinking-New-Science-Complexity/dp/0684842688
24. while beancounter.io was
handling more than ~100
check-ins per minute
at 13:32 UTC-8 Twitter had
an outage *
https://status.io.watchmouse.com/7617/125017//statuses/home_timeline-(OAuth-1.0a)
25. Facebook and Twitter check-ins rate
Nov 6, 2012 13:32 UTC-8 Twitter service disruption
[chart: check-ins rate, y-axis from 0 to 200, x-axis from 2012-11-06T20:45 to 2012-11-06T23:30]
26. Facebook and Twitter overall comments
Nov 6, 2012 13:32 UTC-8 Twitter service disruption
[chart: overall comments, one series each for Facebook and Twitter, y-axis from 0 to 1500, x-axis from 2012-11-06T20:45 to 2012-11-06T23:00]
27. lesson learnt: the real-time
Web is a hyper-connected
graph of a myriad of different
live systems
always mind the butterflies,
even if you can’t see them
30. we’ve tied together beancounter.io,
Storm and Hadoop
please note, this was only
10% of the firehose
real-time analytics
hdfs, distributed FS
Storm
batch analytics
31. more than ~500k tweets
processed in 2h for an average
rate of ~70 tweets/sec
each tweet produced a
snapshot (~10 KB each) for an
overall size of 4.6 GB of data
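As a quick sanity check, the figures on this slide can be cross-checked with a few lines of arithmetic. This is only a sketch: it takes the rounded numbers quoted above at face value and assumes "GB" means decimal gigabytes; the function names are made up for illustration.

```python
# Figures quoted on the slide (approximate, as stated there).
TWEETS = 500_000          # "more than ~500k tweets"
DURATION_S = 2 * 60 * 60  # "2h", in seconds
TOTAL_BYTES = 4.6e9       # "4.6 GB", assuming decimal gigabytes


def average_rate(count: int, seconds: int) -> float:
    """Average number of items processed per second."""
    return count / seconds


def average_size(total_bytes: float, count: int) -> float:
    """Average size of one snapshot, in bytes."""
    return total_bytes / count


rate = average_rate(TWEETS, DURATION_S)
size_kb = average_size(TOTAL_BYTES, TWEETS) / 1000

print(f"average rate: {rate:.1f} tweets/sec")     # average rate: 69.4 tweets/sec
print(f"average snapshot: {size_kb:.1f} KB")      # average snapshot: 9.2 KB
```

Both results line up with the rounded "~70 tweets/sec" and "~10 KB each" on the slide.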
32. more than ~18k
different URLs shared
highest peak: 253 tweets/sec
5 Amazon EC2 x-large instances
+ 2 mid-sized for HDFS
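A peak figure like the one above can be obtained by bucketing tweet timestamps into one-second windows and taking the busiest one. The sketch below is not the pipeline used in the talk; `peak_rate` is a hypothetical helper and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime


def peak_rate(timestamps: list[str]) -> tuple[str, int]:
    """Return the busiest one-second bucket and how many items fell in it."""
    buckets = Counter(
        # Truncate each ISO 8601 timestamp to whole seconds.
        datetime.fromisoformat(ts).replace(microsecond=0).isoformat()
        for ts in timestamps
    )
    second, count = buckets.most_common(1)[0]
    return second, count


# Made-up timestamps, for illustration only.
sample = [
    "2012-01-01T21:00:00.120000",
    "2012-01-01T21:00:00.870000",
    "2012-01-01T21:00:01.050000",
    "2012-01-01T21:00:01.400000",
    "2012-01-01T21:00:01.930000",
    "2012-01-01T21:00:02.300000",
]

print(peak_rate(sample))  # ('2012-01-01T21:00:01', 3)
```

In a streaming setting the same counting would be done incrementally rather than over a full list, but the idea is the same.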
33. recurring concepts
[chart: recurring concepts, axis from 0 to 70000; concepts shown: Osama Bin Laden, Iran, Israel, Middle East, Pakistan, Iraq, Afghanistan, Russia]
34. most co-occurring concepts
Iran - Israel 35.356%
Russia - Middle East 24.7%
...
...
Wikileaks - Richard Nixon 93.5%
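The slide does not say how these percentages were defined. One plausible reading, assumed here purely for illustration, is the share of tweets mentioning at least one of the two concepts that mention both. A minimal sketch under that assumption, with a hypothetical `cooccurrence` helper and made-up input:

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs: list[set[str]]) -> dict[tuple[str, str], float]:
    """Co-occurrence percentage for every concept pair seen together.

    Assumed definition (not from the slides):
    documents mentioning both / documents mentioning at least one.
    """
    single = Counter()
    pair = Counter()
    for concepts in docs:
        single.update(concepts)
        # Sort so each pair has one canonical (a, b) ordering.
        pair.update(combinations(sorted(concepts), 2))
    return {
        (a, b): 100.0 * both / (single[a] + single[b] - both)
        for (a, b), both in pair.items()
    }


# Made-up concept sets, one per tweet.
tweets = [
    {"Iran", "Israel"},
    {"Iran", "Israel", "Russia"},
    {"Iran"},
    {"Russia", "Middle East"},
]

scores = cooccurrence(tweets)
for concepts, pct in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(concepts, f"{pct:.1f}%")
```

Other definitions (for example, conditional on only one of the two concepts) would give different numbers, so the figures on the slide should not be read as coming from this formula.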
36. facts
data viz is a completely different job
mining data requires science skills, it’s not
just about technology: it’s about math
forget about controlling everything when data
flows at that speed: make reasoned
approximations