Learn how to build a Twitter-like analytics system that meets real-time needs in a simple way, using frameworks such as Spring Social, an Active In-Memory Data Grid for Big Data event processing, and a NoSQL database.
Hadoop's batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn't need to be up-to-the-minute. However, batch processing isn't always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising.
In the same way that Hadoop was born out of large-scale web applications, a new class of scalable frameworks and platforms for streaming and real-time analysis and processing has emerged to handle the needs of large-scale location-aware mobile, social, and sensor use. Do we want to limit ourselves to just these use cases?
Facebook, Twitter, and Google have pioneered this area and have recently launched new analytics services designed to meet real-time needs.
In this session we will review the common patterns and architecture that drive these platforms and learn how to build a Twitter-like analytics system in a simple way, using frameworks such as Spring Social, an Active In-Memory Data Grid for Big Data event processing, and a NoSQL database such as Cassandra or HBase for managing the historical data.
3. The Two Vs of Big Data
Velocity Volume
3 ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
4. We’re Living in a Real Time World…
Social
User Tracking & Engagement
Homeland Security
eCommerce
Financial Services
Real Time Search
5. The Flavors of Big Data Analytics
Counting Correlating Research
6. Analytics @ Twitter – Counting
How many signups, tweets, retweets for a topic?
What’s the average latency?
Demographics
Countries and cities
Gender
Age groups
Device types
…
7. Analytics @ Twitter – Correlating
What devices fail at the same time?
What features get users hooked?
What places on the globe are “happening”?
8. Analytics @ Twitter – Research
Sentiment analysis: “Obama is popular”
Trends: “People like to tweet after watching American Idol”
Spam patterns: How can you tell when a user spams?
9. It’s All about Timing
“Real time” (< a few seconds)
Reasonably quick (seconds - minutes)
Batch (hours/days)
10. It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution (aggregations)
This is what we’re here to discuss
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
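The difference between high and medium resolution can be sketched in a few lines of Python. This is only an illustration, not the platform shown in the talk: the function names, the 60-second bucket, and the sample events are all assumptions made for the example. Every event is first counted in the second it arrived, and those counts are then rolled up into wider buckets.

```python
from collections import Counter

def per_second_counts(events):
    """High resolution: count every event in the second it arrived."""
    counts = Counter()
    for timestamp, word in events:
        counts[(int(timestamp), word)] += 1
    return counts

def roll_up(counts, bucket_seconds):
    """Medium resolution: aggregate per-second counts into wider buckets."""
    rolled = Counter()
    for (second, word), n in counts.items():
        bucket_start = second - second % bucket_seconds
        rolled[(bucket_start, word)] += n
    return rolled

# Hypothetical (timestamp_in_seconds, word) events.
events = [(0.2, "hadoop"), (0.9, "hadoop"), (61.5, "hadoop"), (75.0, "spring")]
fine = per_second_counts(events)   # one counter per second and word
coarse = roll_up(fine, 60)         # one counter per minute and word
```

The coarse counters can always be rebuilt from the fine ones, but not the other way round, which is why the resolution has to be chosen before the data is discarded.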
11. Challenge – Word Count
[Diagram: Tweets → ? → Word:Count]
Count
• Hottest topics
• URL mentions
• etc.
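As a single-process baseline for this challenge, the sketch below counts words and URL mentions and reads off the hottest topics. It is a minimal illustration under stated assumptions: the regular expressions, the `count_tweet` helper, and the sample tweets are invented for the example and ignore scale and fault tolerance entirely.

```python
import re
from collections import Counter

# Assumed, simplified patterns; real tweet tokenisation is more involved.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
WORD_RE = re.compile(r"[a-z0-9']+")

def count_tweet(text, words, urls):
    """Update the running word and URL counters with one tweet."""
    urls.update(URL_RE.findall(text))
    # Strip URLs before tokenising so their fragments are not counted as words.
    stripped = URL_RE.sub(" ", text).lower()
    words.update(WORD_RE.findall(stripped))

words, urls = Counter(), Counter()
tweets = [
    "Big data is big http://example.com/a",
    "Real time big data http://example.com/a",
    "Batch is fine too http://example.com/b",
]
for tweet in tweets:
    count_tweet(tweet, words, urls)

hottest = words.most_common(2)  # the two most frequent words so far
```

Because the counters are updated per tweet, the hottest topics can be read at any moment without waiting for a batch job to finish.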
12. URL Mentions – Here’s One Use Case
13. Twitter in Numbers (March 2011)
It takes a week for users to send 1 billion Tweets.
Source: http://blog.twitter.com/2011/03/numbers.html
14. Twitter in Numbers (March 2011)
On average, 140 million tweets get sent every day.
Source: http://blog.twitter.com/2011/03/numbers.html
15. Twitter in Numbers (March 2011)
The highest throughput to date is 6,939 tweets/sec.
Source: http://blog.twitter.com/2011/03/numbers.html
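To put the daily and peak figures on the same scale, the short calculation below converts the March 2011 numbers quoted on these slides into per-second rates. Only the arithmetic is added here; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 140_000_000   # average, from the slides
peak_per_second = 6_939        # highest throughput to date, from the slides

average_per_second = tweets_per_day / SECONDS_PER_DAY    # about 1,620 tweets/sec
peak_to_average = peak_per_second / average_per_second   # about 4.3x
tweets_per_week = tweets_per_day * 7                      # 980 million, roughly 1 billion
```

The peak is therefore a bit more than four times the average rate, so a system sized only for the average would fall behind during bursts.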
16. Twitter in Numbers (March 2011)
460,000 new accounts are created daily.
Source: http://blog.twitter.com/2011/03/numbers.html
17. Twitter in Numbers
5% of the users generate 75% of the content.
Source: http://www.sysomos.com/insidetwitter/
18. Analyze the Problem
(Tens of) thousands of tweets per second to process
Assumption: Need to process in near real time
Aggregate counters for each word
A few tens of thousands of words (or hundreds of thousands if we include URLs)
System needs to scale linearly
System needs to be fault tolerant
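One common way to make per-word counters scale out is to partition them by a hash of the word, so that each partition owns a disjoint slice of the vocabulary. The sketch below simulates this inside one process. It is an assumption-laden illustration rather than the architecture presented in the talk: the `PartitionedWordCount` class is hypothetical, the partitions are plain in-memory counters, and fault tolerance (for example, a backup copy of each partition) is not modelled.

```python
import zlib
from collections import Counter

class PartitionedWordCount:
    """Hypothetical word counter split across a fixed number of partitions."""

    def __init__(self, partitions):
        self.partitions = [Counter() for _ in range(partitions)]

    def _index(self, word):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        return zlib.crc32(word.encode("utf-8")) % len(self.partitions)

    def add(self, word, n=1):
        """Route the update to the single partition that owns this word."""
        self.partitions[self._index(word)][word] += n

    def count(self, word):
        """Read from the owning partition only; no cross-partition traffic."""
        return self.partitions[self._index(word)][word]

    def top(self, k):
        """Merge all partitions to find the k most frequent words."""
        merged = Counter()
        for partition in self.partitions:
            merged.update(partition)
        return merged.most_common(k)

wc = PartitionedWordCount(4)
for word in "to be or not to be".split():
    wc.add(word)
```

Updates and single-word reads touch exactly one partition, so adding partitions spreads the load. Only the `top` query needs to visit every partition.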
19. Key Elements in Real Time Big Data Analytics
21. Keep Things In Memory
Facebook keeps 80% of its data in memory (Stanford research)
RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
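A rough calculation shows why these latencies matter at Twitter's rates. It assumes, purely for illustration, that each tweet costs one random access and that accesses are served one at a time. The latency figures and the peak rate are the ones quoted on the slides; everything else is an assumption of the example.

```python
import math

DISK_SEEK_MS = (5.0, 10.0)      # random seek range, from the slide
RAM_ACCESS_MS = 0.001           # approximate RAM access time, from the slide
PEAK_TWEETS_PER_SECOND = 6_939  # peak throughput, from the earlier slide

# Serial random accesses a single device can serve per second.
disk_ops_per_second = [1000.0 / ms for ms in DISK_SEEK_MS]
ram_ops_per_second = 1000.0 / RAM_ACCESS_MS

# Disks needed to absorb the peak if every tweet costs one random seek.
disks_needed = [
    math.ceil(PEAK_TWEETS_PER_SECOND / ops) for ops in disk_ops_per_second
]
```

Under these assumptions a single disk serves 100 to 200 random accesses per second, so the peak would need roughly 35 to 70 disks, while a single in-memory store has ample headroom.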
22. Use EDA (Event Driven Architecture)
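A minimal sketch of the event-driven idea, assuming a simple in-process publish/subscribe bus: each stage reacts to an event and emits new events instead of being called in a fixed batch sequence. The `EventBus` class, the event names, and the handlers are hypothetical; a real deployment would dispatch asynchronously across machines.

```python
from collections import Counter, defaultdict

class EventBus:
    """Hypothetical synchronous publish/subscribe bus."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Deliver the event to every handler registered for its type.
        for handler in self._handlers[event_type]:
            handler(payload)

bus = EventBus()
word_counts = Counter()

def tokenize(tweet):
    """React to a tweet event by emitting one word event per token."""
    for word in tweet.lower().split():
        bus.publish("word", word)

def count(word):
    """React to a word event by updating its counter."""
    word_counts[word] += 1

bus.subscribe("tweet", tokenize)
bus.subscribe("word", count)
bus.publish("tweet", "Real time is real")
```

Because the stages only know about event types, a new consumer (for example, a URL counter) can be subscribed without changing the existing handlers.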
23. Putting it all together