1. Big Data and NoSQL in REAL TIME
Facebook and Twitter Examples
Ron Zavner
2. Agenda
Our real time world…
Flavors of Big Data
Facebook messaging and real time analytics system
Twitter analytics system
Winning architecture?
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
3. What is Real Time?
4. We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
5. Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’Reilly
6. The Two Vs of Big Data
Velocity
Volume
7. The Flavors of Big Data Analytics
Counting Correlating Research
8. Analytics – Counting
How many signups, tweets, and retweets for a topic?
What’s the average latency?
Demographics
Countries and cities
Gender
Age groups
Device types
…
9. Analytics – Correlating
What devices fail at the same time?
What features get users hooked?
What places on the globe are “happening”?
10. Analytics – Research
Sentiment analysis
“Obama is popular”
Trends
“People like to tweet after watching American Idol”
Spam patterns
How can you tell when a user spams?
11. It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
This is what we’re here to discuss
13. Store 135+ Billion Messages A Month
14. The actual analytics…
Like button analytics
Comments box analytics
15. Goals
Show why plugins are valuable
Make the data more actionable
Make the data more timely
Remove points of failure
Handle massive load - 200K events per second
16. Technology Evaluation
MySQL DB Counters
In-Memory Counters
MapReduce
Cassandra
HBase
18. Keep Things In Memory
Facebook keeps 80% of its data in memory (Stanford research)
RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
20. Twitter Reach – Here’s One Use Case
21. Let’s start with some statistics…
Twitter in Numbers (March 2011)
Source: http://blog.twitter.com/2011/03/numbers.html
22. It takes a week for users to send 1 billion Tweets.
23. On average, 140 million tweets are sent every day.
24. The highest throughput to date is 6,939 tweets/sec.
25. 460,000 new accounts are created daily.
26. 5% of the users generate 75% of the content.
Twitter in Numbers
Source: http://www.sysomos.com/insidetwitter/
27. Challenge – Word Count
Tweets → Count → Word:Count
• Hottest topics
• URL mentions
• etc.
28. Word Count – Analyze the Problem
(Tens of) thousands of tweets per second to process
Assumption: need to process in near real time
Aggregate counters for each word
A few tens of thousands of words (or hundreds of thousands if we include URLs)
System needs to scale linearly
System needs to be fault tolerant
29. Use EDA (Event Driven Architecture)
Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter
30. Sharding (Partitioning)
Tokenizer 1 → Filterer 1 → Counter Updater 1
Tokenizer 2 → Filterer 2 → Counter Updater 2
Tokenizer 3 → Filterer 3 → Counter Updater 3
Tokenizer n → Filterer n → Counter Updater n
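A minimal sketch of this sharded pipeline (the shard count, stopword list, and stage names are illustrative): each stage is a function, and words are routed to counter shards by hash, so every occurrence of a word always lands on the same counter.

```python
from collections import Counter

NUM_SHARDS = 4
STOPWORDS = {"the", "a", "an", "and", "to", "rt"}

def tokenize(tweet):
    """Tokenizer stage: raw tweet -> lowercase tokens."""
    return tweet.lower().split()

def filter_tokens(tokens):
    """Filterer stage: drop stopwords and empty tokens."""
    return [t for t in tokens if t and t not in STOPWORDS]

# One counter per shard; in a real deployment each would be a
# separate process or machine fed by a partitioned event stream.
shards = [Counter() for _ in range(NUM_SHARDS)]

def shard_for(word):
    """Route every occurrence of a word to the same shard."""
    return hash(word) % NUM_SHARDS

def process(tweet):
    for word in filter_tokens(tokenize(tweet)):
        shards[shard_for(word)][word] += 1

for t in ["RT the storm is coming", "the storm hit twitter"]:
    process(t)

# Aggregating across shards gives the global counts.
total = sum(shards, Counter())
print(total["storm"])  # -> 2
```

Because routing is by word hash, no two shards ever count the same word, so the final aggregation is a simple merge with no duplicate resolution.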
31. Computing Reach with Event Streams
33. Twitter Storm
34. Storm Overview
35. Storm Cluster
36. Streaming word count with Storm
37. Storm Limitations
Storage
Data Persistency
Querying
Spouts
Bolts
Topologies
38. Winner is… Storm & In-Memory Data Grids
Event driven / flow
Reliable
Storage
Data Persistency
Querying
39. References
Facebook messages:
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
Facebook real-time analytics:
http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
Learn and fork the code on GitHub:
https://github.com/Gigaspaces/rt-analytics
Detailed blog post:
http://bit.ly/gs-bigdata-analytics
Twitter in numbers:
http://blog.twitter.com/2011/03/numbers.html
Twitter Storm:
http://bit.ly/twitter-storm
Real time ideally means less than a second: not 30 seconds, not 5 seconds.
We live almost every aspect of our lives in a real-time world. Think about our social communications: we update our friends online via social networks and micro-blogging, we text from our mobiles, we message from our laptops. And it's not just our social lives: we shop online whenever we want, we search the web for immediate answers to our questions, we trade stocks online, we pay our bills, and we do our banking. All online, and all in real time. Real time doesn't just affect our personal lives, either. Enterprises and government agencies need real-time insights to be successful, whether they are investment firms that need fast access to market views and risk analysis, or retailers that need to adjust their online campaigns and recommendations. Even homeland security has come to rely increasingly on real-time monitoring. The amount of data that flows through these systems is huge. Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or the resources to build its own Twitter-like solution.
Big data will only continue to grow and expand: the amount of data is increasing, and so is the demand for it. Real-time analytics is fast becoming a requirement.
The Two Vs of Big Data are velocity and volume. As noted before, the volume of data we need to handle is huge, and at the same time we need to handle it fast. We are required to make very complex calculations in real time, and to perform them over a very large amount of data. The data is usually distributed among many servers; each server performs its own calculation and the results are then aggregated – map/reduce. This is a very common pattern for real-time analytics. Having said that, sometimes the latency requirement is more challenging still, and we need to improve the time these calculations take. You can't go straight to a relational DB – it isn't designed to handle the speed and volumes we're talking about – which is why we look at NoSQL or a cache. NoSQL can go further: without the constraints of a relational DB, we can store the data as-is (in JSON, the format used by Twitter). But processing the sheer amount of data in the timeframes we need is still incredibly challenging.
I think analytics – when we're talking about Big Data and something like Twitter – can be split into three categories, or buckets. The first bucket is "Counting": how many signups, tweets, or retweets are there for a topic? I might also be interested in counting in relation to demographic information – for example, how many people are tweeting right now at this event, and on what types of devices? The "Correlating" bucket might contain questions like: how many Twitter users are on desktop vs. mobile, and what's the trend within the week, or within the month? The third bucket, "Research", is similar to the second, but looks in more depth at the past – here we require a lot of processing of historical data.
Counting calculations – we expect to see results in real time. The challenge is reliability: it's not that we lose money directly, but if the accuracy of the system is damaged, the value of the report becomes meaningless. Counting requires very high resolution – every tweet counts, and we don't know in advance which one will be important. If we lose something, the accuracy of the system is damaged.
Correlating – we expect to see most results in real time too. These are the interactive queries where we expect a result we can lay out in a browser or a BI tool.
Research calculations are historical, and Hadoop (for example) is a very popular framework for this kind of batch analytics. We don't expect a real-time response here, but you never know what's next.
It's all about timing. We expect to see real-time results for many of our calculations. We also need to make sure our architecture is scalable: today we might need to handle 100K TPS, and that can easily grow to 200K TPS. We need to be highly available as well, ensuring zero downtime. For this we can use event-driven and stream-processing architectures. Correlation and research calculations are very interesting topics where we can accept longer response times; here, however, we are going to examine the real-time challenge.
We are going to talk about Facebook's real-time analytics system, and also about how they chose to store 135+ billion messages a month.
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
You may have read somewhere that Facebook has introduced a new Social Inbox integrating email, IM, SMS, text messages, and on-site Facebook messages. All in all, they need to store over 135 billion messages a month. Where do they store all that stuff? One of the posts gave the surprise answer: HBase beat out MySQL, Cassandra, and a few others.
Why a surprise? Facebook created Cassandra, and it was purpose-built for an inbox-type application, but they found Cassandra's eventual consistency model wasn't a good match for their new real-time Messages product. Facebook also has an extensive MySQL infrastructure, but they found performance suffered as data sets and indexes grew larger. And they could have built their own, but they chose HBase.
HBase is a scale-out table store supporting very high rates of row-level updates over massive amounts of data – exactly what a messaging system needs. HBase is also a column-based key-value store built on the BigTable model. It's good at fetching rows by key or scanning ranges of rows and filtering – also what a messaging system needs. Complex queries are not supported, however. Queries are generally handed over to an analytics tool like Hive, which Facebook created to make sense of their multi-petabyte data warehouse. Hive is based on Hadoop's file system, HDFS, which is also used by HBase.
Over the past year, social plugins have become an important and growing source of traffic for millions of websites. Today we're releasing a new version of Insights for Websites to give you better analytics on how people interact with your content and to help you optimize your website in real time.
Like button analytics: for the first time, you can now access real-time analytics to optimize Like buttons across both your site and Facebook. We use anonymized data to show you the number of times people saw Like buttons, clicked Like buttons, saw Like stories on Facebook, and clicked Like stories to visit your website.
Plugins are valuable: social plugins have become an important and growing source of traffic for millions of websites over the past year. We released a new version of Insights for Websites last week to give site owners better analytics on how people interact with their content and to help them optimize their websites in real time. To accomplish this, we had to engineer a system that could process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds.
Make the data actionable: help users take action to make their content more valuable – how many people see a plugin, how many take action on it, and how many are converted to traffic back on your site.
Make the data more timely: they went from a 48-hour turnaround to 30 seconds. Multiple points of failure were removed to reach this goal.
http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
MySQL DB counters: have a row with a key and a counter. This results in lots of database activity. Stats are kept at a day-bucket granularity, so every day at midnight the stats roll over. When the roll-over period is reached, this produces a burst of writes to the database, which causes a lot of lock contention. They tried to spread the work by taking time zones into account, and tried to shard things differently. The high write rate led to lock contention; it was easy to overload the databases; they had to constantly monitor the databases and rethink their sharding strategy. The solution was not well tailored to the problem.
In-memory counters: if you are worried about IO bottlenecks, throw it all in memory. There are no scale issues: counters are stored in memory, so writes are fast and the counters are easy to shard. But they felt in-memory counters, for reasons not explained, weren't as accurate as other approaches – even a 1% failure rate would be unacceptable. Analytics drive money, so the counters have to be highly accurate. They didn't implement this system; it was a thought experiment, and the accuracy issue caused them to move on.
MapReduce: they used Hadoop/Hive for the previous solution. Flexible, easy to get running, and it can handle IO – both massive writes and reads. You don't have to know ahead of time how you will query: the data can be stored first and queried later. But it's not real-time, it has many dependencies and lots of points of failure, and it's a complicated system – not dependable enough to hit real-time goals.
Cassandra: HBase seemed a better solution based on availability and the write rate – and the write rate was the huge bottleneck being solved.
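The rejected MySQL design above can be sketched to show where the contention comes from (the key names and dates here are made up for illustration): one counter cell per (key, day) bucket means that at midnight every active key starts writing to a brand-new bucket at once.

```python
from collections import defaultdict
from datetime import datetime, timezone

# One counter cell per (key, day) bucket, mirroring the
# "row with a key and a counter" MySQL design described above.
counters = defaultdict(int)

def bump(key, when=None):
    when = when or datetime.now(timezone.utc)
    bucket = when.strftime("%Y-%m-%d")   # day-bucket granularity
    counters[(key, bucket)] += 1         # in MySQL: UPDATE ... SET n = n + 1

bump("like:example.com", datetime(2011, 3, 22, tzinfo=timezone.utc))
bump("like:example.com", datetime(2011, 3, 22, tzinfo=timezone.utc))
# Midnight passes: every key rolls to a new bucket simultaneously.
# In the database version this synchronized burst of new rows and
# hot UPDATEs is what caused the lock contention.
bump("like:example.com", datetime(2011, 3, 23, tzinfo=timezone.utc))
print(counters[("like:example.com", "2011-03-22")])  # -> 2
```

In a plain dict this is harmless; against a relational table, every `bump` on a hot key is a row lock, which is exactly the pattern the text says did not scale.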
http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
The winner: HBase + Scribe + Ptail + Puma. At a high level: HBase stores data across distributed machines. They use a tailing architecture: new events are stored in log files, and the logs are tailed. A system rolls the events up and writes them into storage, and a UI pulls the data out and displays it to users.
Data flow: a user clicks Like on a web page, which fires an AJAX request to Facebook. The request is written to a log file using Scribe, which handles issues like file roll-over. Scribe is built on the same HDFS file store Hadoop is built on. They write extremely lean log lines: the more compact the log lines, the more can be stored in memory.
Ptail: data is read from the log files using Ptail, an internal tool built to aggregate data from multiple Scribe stores. It tails the log files and pulls data out. Ptail data is separated into three streams so they can eventually be sent to their own clusters in different datacenters: plugin impressions, news feed impressions, and actions (plugin + news feed).
Puma: batches data to lessen the impact of hot keys. Even though HBase can handle a lot of writes per second, they still want to batch data. A hot article will generate a lot of plugin and news feed impressions, causing huge data skews and, in turn, IO issues. The more batching the better. They batch for 1.5 seconds on average; they would like to batch longer, but they have so many URLs that they run out of memory when building the hashtable. They wait for the last flush to complete before starting a new batch, to avoid lock contention issues.
UI renders data: the frontends are all written in PHP. The backend is written in Java, and Thrift is used as the messaging format so PHP programs can query the Java services. Caching solutions are used to make the web pages display more quickly. Performance varies by statistic: a counter can come back quickly, while finding the top URL in a domain can take longer – anywhere from 0.5 to a few seconds.
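The Puma-style batching idea can be sketched as follows (a simplified illustration, not Facebook's implementation; `storage` stands in for HBase): increments accumulate in memory for roughly the batch window, so a hot key hit thousands of times produces one batched storage write instead of thousands.

```python
import time
from collections import Counter

FLUSH_INTERVAL = 1.5  # seconds, matching the average batch window above

class BatchingCounter:
    """Accumulate increments in memory and flush them as one
    batched write, collapsing many hits on a hot key into a
    single storage update."""

    def __init__(self, storage):
        self.storage = storage          # stand-in for the HBase table
        self.pending = Counter()
        self.last_flush = time.monotonic()

    def increment(self, key, n=1):
        self.pending[key] += n
        if time.monotonic() - self.last_flush >= FLUSH_INTERVAL:
            self.flush()

    def flush(self):
        # One write per key per batch, however hot the key was.
        for key, delta in self.pending.items():
            self.storage[key] = self.storage.get(key, 0) + delta
        self.pending.clear()
        self.last_flush = time.monotonic()

storage = {}
c = BatchingCounter(storage)
for _ in range(1000):            # a hot URL generating many impressions
    c.increment("impression:hot-article")
c.flush()                        # final flush at shutdown
print(storage["impression:hot-article"])  # -> 1000
```

The trade-off is exactly the one described in the notes: a longer window means fewer writes but more memory held in the pending table.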
The more, and the longer, data is cached, the less real-time it is; they set different caching TTLs in memcache.
MapReduce: the data is then sent to MapReduce servers so it can be queried via Hive. This also serves as a backup plan, since the data can be recovered from Hive. Raw logs are removed after a period of time.
HBase is a distributed column store, a database interface to Hadoop. Facebook has people working internally on HBase. Unlike a relational database, you don't create mappings between tables and you don't create indexes: the only index you have is the primary row key. From the row key you can have millions of sparse columns of storage. It's very flexible: you don't have to specify a schema; you define column families, to which you can add keys at any time.
A key feature for scalability and reliability is the WAL (write-ahead log), a log of the operations that are supposed to occur. Based on the key, data is sharded to a region server and written to the WAL first. The data is then put into memory, and at some point in time, or once enough data has accumulated, it is flushed to disk. If the machine goes down, the data can be recreated from the WAL, so there is no permanent data loss. Using a combination of the log and in-memory storage, they can handle an extremely high rate of IO reliably. HBase handles failure detection and automatically routes around failures.
Currently, HBase resharding is done manually. Automatic hot-spot detection and resharding is on the roadmap for HBase, but it's not there yet: every Tuesday someone looks at the keys and decides what changes to make in the sharding plan.
Schema: store a bunch of counters on a per-URL basis. The row key, which is the only lookup key, is the MD5 hash of the reverse domain. Selecting the proper key structure helps with scanning and sharding. A problem they have is sharding data properly onto different machines; using an MD5 hash makes it easy to say this range goes here and that range goes there.
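The WAL mechanics described above can be shown in a toy sketch (illustrative only, nothing like HBase's actual implementation): append the operation to the log first, then update the in-memory table, and recover after a crash by replaying the log.

```python
class WalStore:
    """Toy write-ahead log: every put is appended to the log
    before it touches the in-memory table, so the table can be
    rebuilt by replaying the log after a crash."""

    def __init__(self):
        self.wal = []      # on a real region server this is a file on HDFS
        self.memtable = {}

    def put(self, key, value):
        self.wal.append((key, value))  # 1. durable log entry first
        self.memtable[key] = value     # 2. then the in-memory update

    def recover(self):
        """Rebuild in-memory state from the log, e.g. after a failure."""
        self.memtable = {}
        for key, value in self.wal:
            self.memtable[key] = value

store = WalStore()
store.put("com.example/page", 7)
store.memtable = {}   # simulate losing the in-memory state in a crash
store.recover()
print(store.memtable["com.example/page"])  # -> 7
```

Because the log is written before the memory update, there is no window in which an acknowledged write exists only in RAM, which is the property that makes the in-memory path safe at high IO rates.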
For URLs they do something similar, plus they add an ID on top of that: every URL in Facebook is represented by a unique ID, which is used to help with sharding. A reverse domain, com.facebook/ for example, is used so that the data is clustered together. HBase is really good at scanning clustered data, so if they store the data clustered together they can efficiently calculate stats across domains.
Think of every row as a URL and every cell as a counter; you can set different TTLs (time to live) for each cell. If you're keeping an hourly count, there's no reason to keep it around for every URL forever, so they set a TTL of two weeks. Typically, TTLs are set on a per-column-family basis. Per server they can handle 10,000 writes per second.
Checkpointing is used to prevent data loss when reading data from log files: tailers save log-stream checkpoints in HBase and replay from them on startup, so no data is lost. This is useful for detecting click fraud, though the system doesn't have fraud detection built in.
Tailer hot spots: in a distributed system there's a chance one part of the system is hotter than another. One example is region servers that run hot because more keys are being directed their way. One tailer can also lag behind another. If one tailer is an hour behind and the others are up to date, what numbers do you display in the UI? For example, impressions have a much higher volume than actions, so CTR rates were way higher in the last hour. The solution is to figure out the least up-to-date tailer and use that when querying metrics.
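A sketch of the row-key scheme described above – MD5 of the reverse domain as the prefix, plus the URL's unique ID (the exact layout Facebook uses isn't public, so the format here is an assumption): same-domain rows share a prefix, so they cluster on the same region and a prefix scan covers the whole domain.

```python
import hashlib

def row_key(url_id, url):
    """Build a row key in the spirit of the scheme above:
    MD5(reverse domain) prefix + the URL's unique ID
    (illustrative layout)."""
    host = url.split("/")[2]                     # e.g. "www.facebook.com"
    reverse_domain = ".".join(reversed(host.split(".")))  # com.facebook.www
    prefix = hashlib.md5(reverse_domain.encode()).hexdigest()
    return f"{prefix}:{url_id}"

k1 = row_key(101, "http://www.facebook.com/some-page")
k2 = row_key(102, "http://www.facebook.com/another-page")

# Same domain -> same prefix, so the rows sort next to each other
# and hash ranges are easy to assign to region servers.
print(k1.split(":")[0] == k2.split(":")[0])  # -> True
```

The hash also gives a uniform keyspace, which is what makes "this range goes here, that range goes there" sharding decisions straightforward.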
In Twitter, the primary relationship between entities is many-to-many: every post is sent to the numerous followers of the user who sent it, and at the same time each user can follow many other users. This causes Twitter to behave like a living organism, growing unexpectedly in many different directions.
Let me give you an example. One analytic where I need to process tweets is Twitter Reach: how many unique Twitter accounts received tweets about my topic. So how do I compute my reach? There are several stages in the processing:
1. First, I need to record every tweet.
2. Then I can count how many followers got that tweet.
3. Then I need to determine the distinct reach: for each follower, I look at each of their followers and remove the duplicates.
Try to imagine what it takes to produce that number. If my tweet is retweeted by 100 users, each of whom has 100 followers, it starts to take a fair bit of number crunching.
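The three stages can be sketched with sets (the follower graph here is made up): the set union in step 3 is what removes accounts that would otherwise be counted once per retweet.

```python
# Hypothetical follower graph: user -> the accounts following them.
followers = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"carol", "eve"},    # carol follows both alice and bob
    "carol": {"dave", "frank"},   # dave follows both alice and carol
}

def reach(original_poster, retweeters):
    """Unique accounts that received the tweet: the union of the
    followers of everyone who (re)tweeted it. The set union
    deduplicates accounts reached via multiple paths."""
    reached = set(followers.get(original_poster, set()))
    for user in retweeters:
        reached |= followers.get(user, set())
    return len(reached)

# alice tweets; bob and carol retweet. carol and dave each receive
# the tweet twice but are counted once.
print(reach("alice", ["bob", "carol"]))  # -> 5
```

At Twitter scale the sets are far too large for one machine, which is why the notes describe doing this with distributed event streams rather than a single union.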
Read-mostly: duplicate the data so you can optimize for reads.
Let's analyze the problems that a simple Twitter word count presents. The challenge here seems straightforward: tens of thousands of tweets need to be stored and parsed every second, and word counters need to be aggregated continuously. Even though tweets are limited to 140 characters, we are dealing with hundreds of thousands of words per second. This is big.
In many ways this is the benchmark for other systems, because it stretches the limits. There is a huge amount of activity to analyze – the scale is enormous – and we want to extract a lot of information from it. That is the challenge:
> How do we grab the stream in real time without affecting latency?
> How do we deal with that stream in real time?
> How do we handle write scalability in real time?
> How do we make the system bullet-proof and easily scalable?
> How do we begin to do analytics on this?
Storm is a real-time, open-source data-streaming framework that operates entirely in memory. Storm is designed to run on several machines to provide parallelism. Real-time processing is becoming very popular, and Storm is a popular open-source framework and runtime used by Twitter for processing real-time data streams. Storm addresses the complexity of running real-time streams across a compute cluster by providing an elegant set of abstractions that make it easier to reason about your problem domain, letting you focus on data flows rather than implementation details.
Storm constructs a processing graph that feeds data from an input source through processing nodes. The processing graph is called a "topology". The input data sources are called "spouts", and the processing nodes are called "bolts". The data model consists of tuples: tuples flow from spouts to bolts, which execute user code. Besides being places where data is transformed or accumulated, bolts can also join and branch streams. Storm topologies are deployed somewhat like a webapp: a jar file is presented to a deployer, which distributes it around the cluster, where it is loaded and executed. A topology runs until it is killed.
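The spout/bolt/topology idea can be mirrored in a short Python sketch (purely conceptual: Storm itself runs on the JVM, and this is not its API – just generators standing in for tuple streams):

```python
def tweet_spout():
    """Spout: the source of tuples for the topology."""
    for tweet in ["storm is fast", "storm scales"]:
        yield (tweet,)

def split_bolt(stream):
    """Bolt: emit one (word,) tuple per word in each tweet tuple."""
    for (tweet,) in stream:
        for word in tweet.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: accumulate a running count per word."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# The "topology": wire spout -> bolt -> bolt. In Storm, each node
# would run as many parallel tasks spread across the cluster, with
# the framework routing tuples between them.
counts = count_bolt(split_bolt(tweet_spout()))
print(counts["storm"])  # -> 2
```

The point of the abstraction is visible even in the toy version: each stage only knows about tuples in and tuples out, so the wiring (and its parallelism) can be changed without touching stage logic.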
ZooKeeper – Storm uses ZooKeeper to communicate between the "Nimbus" (master) and the "Supervisors" (workers), as well as to store its current state. ZooKeeper coordinates activity in the cluster and provides operational state storage.
storm-nimbus – the topology execution coordinator for the cluster. The Nimbus is a singleton in the cluster (i.e. not elastic). It is stateless, however (since state is stored in ZooKeeper), and can therefore fail and be restarted without consequence, even to running jobs.
storm-supervisor – the supervisors actually run the topology code. There can and should be many of these (i.e. elastic). The parallelism attributes of a given topology are specified in the topology itself.
Data grids are more event-driven, while Storm is built for flow/streaming. Storm is very specifically directed at the streaming problem and is optimized for that use case. In order to produce extremely high throughput, it pushes responsibility for reliability outside its own framework. Also, because of its streaming focus, it provides higher-level abstractions that make reasoning about streaming easier than in XAP.
Reliable – XAP's architecture is oriented toward making data in memory nearly as reliable as data on disk. Thus, writing into XAP involves some level of serialization, and perhaps a network hop as well. Storm doesn't aspire to this level of reliability; instead it provides the means for the suppliers and consumers of data to provide it. Storm is "optimistic" in roughly the same sense that an optimistic lock in a database is optimistic: it assumes success is far more likely than failure, and so is willing to take big hits to performance when failures occur, because they are rare. XAP is more pessimistic in this sense: XAP is designed to be a source of truth for the data it holds, and goes to great lengths to achieve that.
For the reasons cited above, there is no way, even in principle, for XAP to match Storm's throughput – at least when there is no persistence. This caveat is critical, however, since real-world systems almost always need persistence, and ultra-fast in-memory persistence is one of XAP's main strengths. I also mentioned that Storm has higher-level abstractions for streaming, which make programming streaming applications more straightforward. Whereas in XAP you could implement streaming as a series of event-driven processing stages, there is no concept of a "stream" or any kind of "flow" at the API level.
Storm with XAP – basically, spouts provide the source of tuples for Storm processing.
For spouts to be maximally performant and reliable, they need to provide tuples in batches and be able to replay failed batches when necessary. Of course, in order to have batches you need storage, and to be able to replay batches you need reliable storage. XAP is about the highest-performing, most reliable source of data out there, so a spout that serves tuples from XAP is a natural combination. Recall that Storm is a stream-processing framework and runtime, which presupposes the existence of a stream for it to read from. So there are really two artifacts needed for XAP to provide a spout to Storm: a "stream" in XAP, and of course the spout that reads from it. Realizing this, I wrote a simple service for XAP that leverages XAP's FIFO capabilities, called XAPStream. It is a standalone (Storm-independent) service that lets clients dynamically create, destroy, and of course read and write from streams, in both batch and non-batch modes.
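The two properties a reliable spout source needs – batched reads and replay of failed batches – can be sketched like this (an illustrative toy, not the XAPStream API; in XAP the ordered storage would be the durable FIFO space):

```python
class ReplayableStream:
    """Toy FIFO stream that serves tuples in batches and can
    replay a failed batch: reads don't advance the cursor until
    the consumer acknowledges the batch."""

    def __init__(self):
        self.entries = []   # durable, ordered storage in the real thing
        self.cursor = 0     # start of the current in-flight batch

    def write(self, item):
        self.entries.append(item)

    def read_batch(self, size):
        """Return the next batch without advancing past it."""
        return self.entries[self.cursor:self.cursor + size]

    def ack(self, size):
        """Batch processed successfully: advance the cursor."""
        self.cursor += size

stream = ReplayableStream()
for t in ["t1", "t2", "t3"]:
    stream.write(t)

batch = stream.read_batch(2)          # -> ["t1", "t2"]
# Simulate a processing failure: no ack, so the same batch
# is served again on the next read (the "replay").
assert stream.read_batch(2) == batch
stream.ack(2)                         # success: move past the batch
print(stream.read_batch(2))           # -> ["t3"]
```

Separating read from ack is the whole trick: replay is free, because failure simply means never advancing the cursor.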