Choosing a data storage solution as your web application scales can be one of the most difficult parts of web development, and it takes time away from building application features. MongoDB, Redis, Postgres, Riak, Cassandra, Voldemort, NoSQL, MySQL, NewSQL — the options are overwhelming, and all claim to be elastic, fault-tolerant, durable, and to deliver great performance for both reads and writes. In the first portion of this talk I’ll discuss these storage solutions and explain what is really important when choosing a datastore: your application’s data schema and feature requirements.
4. In This Talk
• What to think about when choosing a datastore for a web application
• Myths and legends
• Executing distributed queries
• My research: a consistent cache
18. What to Think about When You’re Choosing a Datastore
Hint: Not Scaling
19. Development Cycle
• Prototyping
– Ease of use
• Developing
– Flexibility, multiple developers
• Running a real production system
– Reliability
• SUCCESS! Sharding and specialized data storage
– We should only be so lucky
20. Getting Started
First five minutes – what’s it like?
Idea credit: Adam Marcus, The NoSQL Ecosystem, HPTS 2011
via Justin Sheehy from Basho
26. Questions
• What is your mix of reads and writes?
• How much data do you have?
• Do you need transactions?
• What are your access patterns?
• What will grow and change?
• What do you already know?
27. Reads and Writes
• Write optimized vs. Read optimized
• MongoDB has a global write lock per process
• No concurrent writes!
28. Size of Data
• Does it fit in memory?
• Disk-based solutions will be slower
• Worst case, Redis needs 2X the memory of your data!
– It forks with copy-on-write
echo 1 | sudo tee /proc/sys/vm/overcommit_memory
30. Performance
• Relational databases are considered slow
• But they are doing a lot of work!
• Sometimes you need that work, sometimes you don’t.
31. Simplest Scenario
• Responding to user requests in real time
• Frequently read data fits in memory
• High read rate
• No need to go to disk or scan lots of data
Datastore CPU is the bottleneck
32. Cost of Executing a Primary Key Lookup
Query overhead:
• Receiving the message
• Instantiating a thread
• Parsing the query
• Optimizing
• Taking out read locks
• (Sometimes) a lookup in an index btree
• Responding to the client
…plus actually retrieving the data.
33. Options to Make This Fast
• Query cache
• Prepared statements
• HandlerSocket
– 750K lookups/sec on a single server
Lesson: Primary key lookups can be fast no matter what datastore you use.
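As a rough illustration of that lesson, a primary-key lookup through a parameterized statement keeps the per-query overhead small and constant. This is only a sketch — Python’s built-in sqlite3 stands in for whatever datastore you actually run, and the table and names are invented for the example:

```python
import sqlite3

# In-memory SQLite stands in for any relational store here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Sam'), (2, 'Vicki')")

# A parameterized query: the statement text is constant, so the
# driver can cache its parsed form and skip re-parsing per lookup.
row = conn.execute("SELECT name FROM users WHERE id = ?", (2,)).fetchone()
print(row[0])
```

The same shape of lookup is what HandlerSocket makes fast by skipping the SQL layer entirely.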
34. Flexibility vs. Performance
• We might want to pay the overhead for query flexibility
• In a primary-key datastore, we can only issue queries on the primary key
• SQL gives us flexibility to change our queries
35. Durability
• Persistent datastores
– Client: write
– Server: flush to disk, then send “I completed your write”
– CRASH
– Recover: see the write
• By default, MongoDB does not fsync() before returning to the client on a write
– Need j:true
• By default, MySQL uses MyISAM instead of InnoDB
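The flush-before-acknowledge behavior above is a knob in most stores. As a hedged sketch, SQLite exposes the analogous setting as PRAGMA synchronous (MongoDB’s j:true and InnoDB’s innodb_flush_log_at_trx_commit play a similar role elsewhere):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# PRAGMA synchronous controls whether SQLite syncs to stable storage
# before acknowledging a commit.
conn.execute("PRAGMA synchronous = FULL")    # sync before "done": durable
# conn.execute("PRAGMA synchronous = OFF")   # ack first, flush later: fast, lossy
mode = conn.execute("PRAGMA synchronous").fetchone()[0]
print(mode)  # 2 means FULL
```

The point is not the specific pragma but that "durable by default" is something you must check, not assume.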
36. Specialization
• You know your
– query access patterns and traffic
– consistency requirements
• Design specialized lookups
– Transactional datastore for consistent data
– Memcached for static, mostly unchanging content
– Redis for a data processing pipeline
• Know what tradeoffs to make
37. Ways to Scale
• Reads
– Cache
– Replicate
– Partition
• Writes
– Partition data amongst multiple servers
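Partitioning writes usually starts with a routing function from key to server. A minimal sketch — the shard names here are hypothetical:

```python
import hashlib

SHARDS = ["db0", "db1", "db2"]  # hypothetical shard names

def shard_for(key):
    # Hash the key and route it to exactly one shard, so writes for
    # different keys spread across all servers.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# The same key always routes to the same shard.
print(shard_for("user:42"), shard_for("comment:7"))
```

Range partitioning (like the 0-99 / 100-199 shards later in this talk) is the other common choice; it keeps adjacent keys together at the cost of hot spots.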
39. Myths of Sharded Datastores
• NoSQL scales better than a relational database
• You can’t do a JOIN on a sharded datastore
40. MYTH: NoSQL scales better than a relational database
• Scaling isn’t about the datastore, it’s about the application
– Examples: FriendFeed, Quora, Facebook
• Complex queries that go to all shards don’t scale
• Simple queries that partition well and use only one shard do scale
47. Query Goes to All Partitions
[Diagram: the query for Sam’s comments goes to all three servers – MySQL 0-99, MySQL 100-199, MySQL 200-299]
48. Costs for Query Go Up When Adding a New Server
[Diagram: with a fourth server (MySQL 300-399), the query for Sam’s comments now goes to four servers]
49. CPU Cost on Server of Retrieving One Row
Two costs: query overhead and row retrieval
[Chart: on MySQL, query overhead is 97% of the CPU cost]
50. Idea: Multiple Partitionings
Partition the comments table on post_id and on user.
Reads choose the appropriate copy, so queries go to only one server.
Writes go to all copies.
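A toy sketch of the idea, with in-memory dicts standing in for the two MySQL copies (all names are invented for illustration):

```python
from collections import defaultdict

# Two copies of the comments table, partitioned different ways:
# one sharded by post_id, one by user.
by_post = defaultdict(list)   # shard key: post_id
by_user = defaultdict(list)   # shard key: user

def write_comment(post_id, user, text):
    # Writes go to every copy (cost grows with T, the number of copies)...
    by_post[post_id].append((user, text))
    by_user[user].append((post_id, text))

def comments_on_post(post_id):
    # ...but each read picks the copy whose partitioning matches the
    # query, so it touches only one shard.
    return by_post[post_id]

def comments_by_user(user):
    return by_user[user]

write_comment(100, "Sam", "Nice work!")
write_comment(100, "Vicki", "First post!")
print(comments_on_post(100))
print(comments_by_user("Sam"))
```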
52. All Read Queries Go to Only One Partition
[Diagram: “comments on post 100” goes only to MySQL 100-199 (the copy partitioned by post_id); “Sam’s comments” goes only to MySQL S-Z (the copy partitioned by user)]
53. Writes Go to Both Table Copies
[Diagram: a new comment by Vicki on post 100 is written to both copies – MySQL 100-199 (by post_id) and MySQL S-Z (by user)]
54. Scaling
• Reads scale by N – the number of servers
• Writes are slowed down by T – the number of table copies
• Big win if N >> T
• Table copies use more memory
– Often don’t have to copy large tables – usually metadata
• How to execute these queries?
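The N-vs-T tradeoff above can be sketched as a back-of-the-envelope model; the numbers are purely illustrative, not measurements:

```python
def read_throughput(per_server_qps, n_servers):
    # Reads spread across N servers (and across table copies).
    return per_server_qps * n_servers

def write_throughput(per_server_qps, t_copies):
    # Every write must be applied to all T table copies.
    return per_server_qps / t_copies

# Big win if N >> T: 10 servers, but only 2 copies of a small table.
reads = read_throughput(1000, 10)   # reads scale up 10x
writes = write_throughput(1000, 2)  # writes slow down 2x
print(reads, writes)
```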
55. Dixie
• A query planner which executes SQL queries over a partitioned database
• Optimizes read queries to minimize query overhead and data retrieval costs
• Uses a novel cost formula which takes advantage of data partitioned in different ways
56. Wikipedia Workload with Dixie
• Database dump from 2008
• Real HTTP traces
• Sharded across 1, 2, 5, and 10 servers
• Compare by adding a copy of the page table (< 100 MB), sharded another way
[Chart: QPS (0–25,000) vs. number of partitions (1, 2, 5, 10). With one copy of the page table, many queries go to all servers; with Dixie + a page.title copy, each query goes to one server – a 3.2X throughput improvement at 10 partitions]
57. Myths of Sharded Datastores
• NoSQL scales better than a relational database
• You can’t do a JOIN on a sharded datastore
58. MYTH: You can’t execute a JOIN on a sharded database
• What are the JOIN keys? What are your partition keys?
• Bad to move lots of data
• Not bad to look up a few pieces of data
• Index lookup JOINs expressed as primary key lookups are fast
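One way to see why such JOINs are cheap: rewrite the join as two primary-key-style lookups. Here is a sketch against a single sqlite3 database; on a sharded system each step would fan out only to the relevant shards (the data mirrors a subset of the example tables in this talk):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, link TEXT);
CREATE TABLE comments (post_id INTEGER, user TEXT, text TEXT);
INSERT INTO posts VALUES (1,'Max','http://a'), (3,'Max','http://b'), (22,'Max','http://c');
INSERT INTO comments VALUES (3,'Alice','First post!'), (22,'Alice','Nice work!'), (5,'Bob','Hm.');
""")

# Step 1: fetch the ids of Max's posts (would hit only the posts shards).
ids = [r[0] for r in conn.execute(
    "SELECT id FROM posts WHERE author = ?", ("Max",))]

# Step 2: the join becomes primary-key-style IN lookups on comments.
marks = ",".join("?" * len(ids))
rows = conn.execute(
    f"SELECT text FROM comments WHERE post_id IN ({marks}) AND user = ?",
    (*ids, "Alice")).fetchall()
print([t for (t,) in rows])
```

This is the "index lookup join" plan that the next slides walk through.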
59. Join Query
Fetch all of Alice's comments on Max's posts.
comments table:
post_id | user  | text
3       | Alice | First post!
6       | Alice | Like.
7       | Alice | You think?
22      | Alice | Nice work!
posts table:
id | author | link
1  | Max    | http://…
3  | Max    | www.
22 | Max    | Ask Hacker
37 | Max    | http://…
61. Distributed Query Planning
• R*
• Shore
• The State of the Art in Distributed Query Processing, by Donald Kossmann
• Gizzard
62. Conventional Wisdom
Partition on JOIN keys, send the query as is to each server.
SELECT comments.text
FROM posts, comments
WHERE comments.post_id = posts.id
AND posts.author = 'Max'
AND comments.user = 'Alice'
[Diagram: the query goes to every server, each holding matching posts and comments partitions (0-99, 100-199, 200-299)]
64. Index Lookup Join
Partition on filter keys, retrieve Max's posts, then retrieve Alice's comments on Max's posts.
SELECT posts.id
FROM posts
WHERE posts.author = 'Max'
[Diagram: three servers, each holding a posts partition (0-99, 100-199, 200-299) and a comments partition (A-I, J-R, S-Z); the first query goes to the posts partitions]
65. Index Lookup Join
[Diagram: the ids of Max’s posts come back from the posts partitions]
66. Index Lookup Join
SELECT comments.text
FROM comments
WHERE comments.post_id IN […]
AND comments.user = 'Alice'
[Diagram: the second query goes only to the comments partition holding Alice’s comments (A-I)]
67. Index Lookup Join
[Diagram: Alice’s comments on Max’s posts come back]
68. Index Lookup Join Plan Scaling
[Diagram: Alice’s comments on Max’s posts are retrieved from one MySQL server out of many]
69. Comparison
Conventional Plan vs. Index Lookup Plan
[Diagram: the index lookup plan transfers intermediate data – the ids of Max’s posts, including ones Alice DIDN’T comment on]
71. Dixie
• Cost model and query optimizer which predicts the costs of the two plans
• Query executor which executes the original query, designed for a single database, on a partitioned database with table copies
72. Dixie's Predictions for the Two Plans
• Varying the size of the intermediate data – Max’s posts Alice didn’t comment on
• Fixing the number of results returned
• Partitioned over 10 servers
[Chart: queries/sec (0–10,000) vs. posts/author (0–150). The index lookup plan sends each query to two servers; the conventional plan sends each query to all servers. Lines show the 2-step join estimate and the pushdown join estimate]
73. Performance of the Two Plans
[Chart: queries/sec (0–10,000) vs. posts/author (0–150) for the 2-step join and the pushdown join, with each plan’s estimate overlaid. Past a crossover point the pushdown join beats the 2-step join – and Dixie predicts it!]
74. JOINs Are Not Evil
• When only transferring small amounts of data
• And with carefully partitioned data
75. Lessons Learned
• Don’t worry at the beginning: use what you know
• Not about SQL vs. NoSQL systems – you can scale any system
• More about complex vs. simple queries
• When you do scale, try to make every query use one (or a few) shards
76. Research: Caching
• Applications use a cache to store results computed from a webserver or database rows
• Expiration or invalidation?
• Annoying and difficult for app developers to invalidate cache items correctly
Work in progress with Bryan Kate, Eddie Kohler, Yandong Mao, and Robert Morris (Harvard and MIT)
77. Solution: Dependency Tracking Cache
• Application indicates what data went into a cached result
• The system will take care of invalidations
• Don’t need to recalculate expired data if it has not changed
• Always get the freshest data when data has changed
• Track ranges, even if empty!
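A toy sketch of the dependency-tracking idea — the class and method names are invented for illustration, and the real system also handles ranges, distribution, and expiration:

```python
# Each cached result records which data keys it was computed from;
# a write to any of those keys invalidates the result, so the
# application never guesses TTLs or issues invalidations itself.
class DepCache:
    def __init__(self):
        self.results = {}   # result name -> cached value
        self.deps = {}      # data key -> set of dependent result names

    def put(self, name, value, depends_on):
        self.results[name] = value
        for key in depends_on:
            self.deps.setdefault(key, set()).add(name)

    def get(self, name):
        return self.results.get(name)  # None means "recompute"

    def write(self, key):
        # Invalidate every cached result built from this key.
        for name in self.deps.pop(key, set()):
            self.results.pop(name, None)

cache = DepCache()
cache.put("timeline:alice", ["tweet1"], depends_on=["tweets:max"])
print(cache.get("timeline:alice"))  # cached timeline
cache.write("tweets:max")           # Max tweets: timeline invalidated
print(cache.get("timeline:alice"))  # None -> recompute on next read
```

This matches the Twitter example on the next slide: no tweets means no invalidations, and many tweets still cost only one recomputation at read time.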
78. Example Application: Twitter
• Cache a user’s view of their Twitter homepage (timeline)
• Only invalidate when a followed user tweets
• Even if many followed users tweet, only recalculate the timeline once when read
• No need to recalculate on read if no one has tweeted
79. Challenges
• Invalidation + expiration times
– Reading stale data
• Distributed caching
• Cache coherence on the original database rows
80. Benefits
• Application gets all the benefits of caching
– Saves on computation
– Less traffic to the backend database
• Doesn’t have to worry about freshness
First, a bit about me. Former Software Engineer at Google – worked on Froogle, a product aggregator and shopping website; Blobstore, a storage system for large binary objects like Gmail attachments and images; and Native Client, a way of running fast native code safely in the browser. Left to be a PhD student at MIT, studying distributed systems at CSAIL (the computer science and AI lab). I’ve been there for four years! This summer I decided to do an internship at news.me, as a data scientist, because I’m really interested in how we get our news and content these days. News.me then bought digg.com (interesting story), so what I really did last summer was work on powering real-time news by looking at social sharing on Twitter and Facebook, and relaunching Digg in six weeks.
Here’s what I’m going to cover in this talk. First, what to think about when choosing a datastore for a web application. Then, some myths I’ve heard about NoSQL, SQL, and partitioning on the internet.
But I kind of understand where they are coming from – there are so many options out there. Apologies if I left your favorite out. It’s totally overwhelming, and there is a lot of hype.
So I’m here to tell you what to think about when you’re choosing a datastore. It’s definitely not scaling!
We can see what is really important by looking at the development cycle of a project. When prototyping, we really care about iterating quickly, how easy the storage system is to use, and how soon you can get up and running with it. When developing, we care a lot about flexibility, the ability to develop new features, and having a codebase that multiple developers can use. When we’re running our system in the real world, our first priority is reliability and performance. If we even get to the stage where we need to think about sharding, it means we’ve really succeeded, and it’s a great problem to have! But we might not even get there.
One thing a lot of developers pay attention to is how easy it is to get started with a datastore. This is something I saw in someone else’s presentation, and I thought it was really great – what are the first five minutes like with your datastore?
Here we have redis, and it’s pretty awesome. This guy has created a 30 second guide -- A few commands and I’m storing keys and getting them back out. Very little startup overhead.
Contrast that with MySQL – I didn’t even show you the actual commands to run to get to the same point, because we have to go through so much other stuff first. Namely, setting up permissions, creating a database, and choosing a schema, which is something I might not even be ready to do yet! Redis seems like a big win here.
But – what happens when we get to the next stage, developing? At this point, we might have multiple people working on the code, and we’re testing new features and accessing the data in different ways. Someone new might want to see how your data is structured. Let’s go back and look at Redis and MySQL again.
This is a redis server I used at Digg for testing. I came in not knowing what was in this datastore – some of the key patterns were strewn about the code, which is not ideal. And the only way to tell the type of the key (set, sorted set, etc) was to look at the operations on it – zadd vs. set and get.How do I find out what’s going on?Redis has a keys command…. But this datastore is 24 GB of smallish keys. Running “keys” will take forever (click) and overload my server. This isn’t a good thing to do on a production system.So it’s hard to find out what’s going on here.
Contrast that with MySQL – because SQL is a flexible query language, I can learn about the data really quickly. Here I’m learning about the table “cards”. I can describe it, and select a few rows to see what’s in it. I can also easily add an index if I want to look up stuff in different ways – no need to completely rewrite everything in the database or change my code a lot. So if it’s not scaling and it’s not initial impressions that are important, what is?
What we should really think about are some of the following key questions. I’m going to present this as a list, and then go into detail on a few of them. First off, we would want to choose a datastore optimized for how many reads/writes the app is doing. The size of our data is incredibly important when choosing between in-memory and disk-optimized datastores. Do we require multi-statement transactions? If so, there are only a few datastores that provide that out of the box. However, it is possible to build out this layer in the application. Our access patterns are important too, as is how you predict your workload will grow and change, and I’ll describe this in more detail later in the talk when I talk about scaling. Finally, what do you already know? The best datastore is the one you’re most familiar with. A lot of them have pretty funny semantics, and I’ll give some examples.
It’s important to figure out whether you need a read-optimized or a write-optimized datastore. For example, if your application needs to do a lot of writes, you might be in trouble if you choose MongoDB – [click] MongoDB has a global write lock per process. What this means is that you can’t do concurrent writes, so your write-heavy application will be really slowed down!
So let’s talk about the size of our data. Disk based solutions are slower, but it’s really important to pick a datastore that can handle our data size if it might not fit in memory. And figuring this out can be a bit tricky – a store like Redis might actually require 2X the memory of our data size! This is because it forks a process to do background persistence to disk, and in the worst case, we might need to replicate all of that memory between the two processes. If Redis doesn’t get this memory, it will become totally unresponsive, and we might wish we went with a slower disk-based solution.There are ways around this, by changing the way our virtual memory manager allocates memory, but that’s scary. Also, it might not work if we’re changing all the keys while this happens
Another great thing to do is lay out our application requirements. What can we tolerate, if we have to make a tradeoff? Here are some axes we should think about.First, performance, or latency tolerance. How slow is ok?Second, durability or data loss tolerance. How bad is it if we lose a few writes, in the case of a crash?Freshness, or staleness tolerance – really this means, can we serve stale data from a cache? For how long?Finally, uptime, or downtime tolerance. Do we need to be able to switch over to a replica in the case of failure immediately, or can we handle being read-only or down for a while?
First, let’s discuss performance. Oftentimes relational databases are considered very slow. But they are doing a lot of work! Some of which you might not be interested in. As an example, consider the base case for an application: doing very simple primary key lookups.
The application needs the data fast, since it is responding to user requests in real time. The data mostly fits in memory [click], with a high read rate [click] and it’s primary key lookups, so there is no need [click] to go to disk or scan lots of data.What this means is that we have removed the constraints of disk and network, and really the CPU is the bottleneck here.What does this mean for a relational database? How much does it cost in server CPU time to do one of these lookups?
If we don’t need all of the fancy query flexibility, we have some options. First off, the query cache: skips optimization and parsing, good if we are doing the same query over and over. Prepared statements avoid parsing, but not the optimization phase, and can’t use the query cache, but work for the same query with different arguments. HandlerSocket – made by Yoshinori Matsunobu, incredibly fast. It bypasses all of the querying machinery and talks directly to the storage engine, so we still have all of the durability and error recovery that the storage engine guarantees! Also, what’s really cool is we have the flexibility of using SQL on that data if we want. Lesson:
Now, you might be wondering WHY we would bother with a relational database if we’re just not going to use half of it. CLICK. Well, we might be willing to pay that query overhead for flexibility, so we can access our data in many different ways. [Read 2nd point] We have to know queries up front, otherwise we might find ourselves in a situation where we can’t execute certain queries because they require scanning all the data. [click] SQL makes it easier to change what queries we are issuing, which is especially important if we don’t fully know all the features of our application yet.
Another really important issue is durability. How is this defined? Well, in a persistent datastore, here is what you should expect. When a client does a write, the server flushes that write to disk BEFORE sending “done” to the client. That way, if the server were to crash, when it recovers we can see the write. [click, read 2nd point] In fact, MongoDB doesn’t even return to the client AT ALL after a write, so the client has no idea if the write succeeded or not without making another request. [CLICK, mysql] The thing is, some applications might not care about every single write making it to disk. Maybe they are just logging things to a database. But I really think making this the default, for supposedly persistent datastores, was a bad idea. This is something we really have to be aware of when choosing a datastore and building our application.
What happens when we’ve gotten to the point where we have so much traffic that we need to shard? This is the time when we start thinking about customized datastore solutions. At this point we know our application very well, [click] and we know where we use simple queries and where we need something more complex. We know where to create copies for fast reads and where we need to keep one copy for consistency. We can use a combination of datastores [click] to get the best performance, but we know what tradeoffs we can make to use one datastore for ease of development. Hopefully we will reach this stage of tremendous growth. I’ve told you before that you don’t need to worry about scaling in the beginning, and here’s why: no matter what we chose initially, we can scale it! I’ll show you what I mean.
First, let’s discuss some ways of scaling. When scaling reads, we have a few options. We can set up a cache, like memcached. We can replicate our servers, so reads can be spread across all of them, like with MongoDB replica sets or MySQL master-slave replication. Or we can partition the data amongst different servers. For writes we only have the latter option, so I’m going to talk about partitioning when I talk about scaling.
Now, there’s a lot of folklore out there about which systems do what. I’m going to address two myths around scaling.
First, the idea that we had to have chosen a NoSQL database from the beginning, because relational databases don’t scale well. This is not true.Second, that you can’t do a join if you have a sharded database. Also not true!
Scaling isn’t about the datastore, it’s about the application. [click] Plenty of applications have scaled using a relational database. They just use them in a different way. FriendFeed and Quora both appreciated the tested solution of MySQL, and built partitioning into their applications. Facebook has thousands of MySQL servers. (Though I guess Stonebraker would say they are trapped in a fate worse than death… handling 1B users is nothing to laugh at.) The issue here is complex vs. simple queries. [click] Complex queries that go to all shards don’t scale. [click] Simple… read. [PAUSE] Great. So all we need to do is pick a good partition key and send every lookup and query to one server! But…
We still have a problem. LOTS of applications need to look up data in different ways. Nobody really talks about this. What do I mean? Well, let’s look at an example.
This is a post page in Hacker News. We need to get all of the comments on a post. Let’s even ignore nested comments for now.Here is a SQL query, getting all the comments for post 100.Here’s how we might do it in redis if you had comments stored in a sorted set by post id.I’d also just like to point out that I got this snapshot from hacker news today, and it’s about a guy getting bitten by the fact that mongo doesn’t return anything to the client after a write. HA.
[Read slide] Now it becomes a bit confusing to figure out how to use these table copies. Particularly if you are still using SQL.
To solve this problem we created a system called Dixie. [Read 1st point] SQL queries are originally written as though for a single database with no table copies. [Read rest] Pause after this. So does Dixie actually help?
Well, to evaluate this, we ran an experiment using Wikipedia.Here is a graph showing how throughput scales as we add more servers, using one copy per table or using Dixie and table copies.First off, a bit about this experiment – we used a Wikipedia database dump from 2008, and real HTTP traces of traffic. We partitioned this across 1, 2, 5, and 10 servers, both with one copy of each table and then with an added copy of one of the tables, the page table, partitioned in a different way.The yellow bars are most queries going to all servers [click] while the blue bars are most queries going to one server [click] because of the extra table copyOn 10 servers, Dixie plus a copy provides a 3.2 X improvement [click] in throughput over not having table copies.This shows that as we scale to more partitions, the benefit of table copies grows. When there are only one or two servers table copies don’t help much.
So we’ve shown that a SQL datastore can scale perfectly well, if we add table copies and make sure queries only need to go to one server. Now let’s talk about JOINs.
[read slide]Let’s look at an example
A little background on distributed query planning – there has been a lot of research in this area, and systems like Oracle and DB2 have sophisticated distributed query planners. Not a ton of open source solutions. Some things out there are very simple, like Gizzard.
This is what Dixie does!Dixie has a [read 1st point]And a [read 2nd point]
[Read slide] So, now I’m going to switch gears and talk about research I’m currently working on, involving caching.
And this is research with people from Harvard and MIT. [Read slide] The problem is it’s often confusing as to whether we should set an expire time on the data, or whether we should pay the cost of invalidating it on writes.
[last point] And we do tracking at the level of ranges, so that if a result is based on a range, it will get invalidated when new things are added to that range. Even if the range was originally empty!
[click] So here’s an example of how this might work. We cache a user’s view of their twitter homepage, their timeline.[click] We only invalidate this cached object when another user the original user is following tweets[click]So if a user reads their timeline infrequently, the invalidations don’t matter, we only recalculate the timeline once on a read[click] And if the people the user follows don’t tweet often, the timeline won’t change and won’t need to be recalculated
There are many challenges involved in this work. First off, we want to offer applications the option of using invalidation and expiration times together, if they are willing to occasionally serve stale data.Next, we are creating a distributed cache, so there is all of the work that goes along with that.Finally, we need to properly do cache coherence on the original database rows, which are spread out amongst all of these distributed caches.
But this could be really helpful – it would mean that an application could get all of the benefits of caching, which include saving on computation, like not having to regenerate HTML pages, and sending less traffic to the backend database, while it doesn’t have to worry about freshness or invalidating results. This is still a work in progress. Hopefully we’ll have a paper soon.
So in summary, I’ve told you about choosing a datastore, scaling it, Dixie, and a dependency-tracking cache. I’d love to talk further about any of these things if you’re interested.
Thank you so much for your time! I’m really excited to talk about any of this stuff, and I don’t know anyone at Strange Loop, so please feel free to say hi! Here’s my contact information, with pointers to some of the things I talked about.