Describes the process of selecting a NoSQL product for use as part of LucidMedia's ad serving platform. Details pros/cons of several products and tips for general use.
5. The Use Case
• Server-side user database (cookie store)
• Hundreds of millions of users
• Fast access - 5-10ms
• Cloud hardware
6. What is NoSQL?
• Data storage tools created in reaction to
common web scaling problems with
relational databases
• Widely differing purposes and feature sets
7. Problem: Scaling Writes
• Relational databases scale vertically - all
records must be on the same machine
• Solution - distribute data across machines,
scaling horizontally
• This solves the scaling problem, but makes
joins, grouping, transactions difficult
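A minimal sketch of the trade-off above, assuming a hypothetical four-machine cluster in which plain dicts stand in for the real stores. The names `machine_for`, `put`, `get` and `count_by_segment` are illustrative only. A key lookup touches one machine, while a grouping query has to visit all of them:

```python
import hashlib
from collections import defaultdict

NUM_MACHINES = 4  # hypothetical cluster size

def machine_for(user_id: str) -> int:
    """Pick a machine from a stable hash of the key."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_MACHINES

# One in-memory dict per machine stands in for a real store.
machines = [dict() for _ in range(NUM_MACHINES)]

def put(user_id, record):
    machines[machine_for(user_id)][user_id] = record

def get(user_id):
    # A key lookup touches exactly one machine.
    return machines[machine_for(user_id)].get(user_id)

def count_by_segment():
    # A grouping query must visit every machine and merge the results.
    totals = defaultdict(int)
    for store in machines:
        for record in store.values():
            totals[record["segment"]] += 1
    return dict(totals)

for i in range(1000):
    put(f"user-{i}", {"segment": "sports" if i % 2 else "news"})

print(get("user-42"))
print(count_by_segment())
```

Writes spread across machines by key, which is what lets them scale. The merge step in `count_by_segment` is the part a single relational server would otherwise do for free.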
8. Problem: High Latency
• Database is usually biggest contributor to
server-side application latency
• memcached pioneered the low-latency key-value store
• Solution - compromise functionality for
speed
• Usually sacrifice transactions, advanced
query types
9. Problem: Inflexible Schemas
• Relational databases require schema to be
defined ahead of time
• A flexible schema gives developers more options: upgrades can be
handled in code instead of by writing SQL
• Storing custom formats can save lots of
space for records with sparse fields
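A rough illustration of the space argument, assuming a hypothetical list of profile fields (`FIELDS`) and JSON as the custom format. A fixed layout reserves a slot for every field, a sparse layout stores only what is set, and `load` shows one way to handle missing fields in code:

```python
import json

# Hypothetical user-profile fields; a fixed schema reserves a slot for each.
FIELDS = ["age", "gender", "zip", "segment_a", "segment_b", "segment_c", "last_seen"]

def fixed_row(record):
    """Every field is present, with null for missing values."""
    return json.dumps([record.get(f) for f in FIELDS], separators=(",", ":"))

def sparse_row(record):
    """Only the fields that are actually set are stored."""
    return json.dumps(record, separators=(",", ":"))

def load(raw):
    """Read a sparse row, filling fields it lacks (e.g. added later) with None."""
    data = json.loads(raw)
    return {f: data.get(f) for f in FIELDS}

record = {"last_seen": 1300000000}  # a user with one known field
print(len(fixed_row(record)), len(sparse_row(record)))
print(load(sparse_row(record)))
```

The saving only holds for sparse records. A fully populated record is larger in the sparse form, because each value carries its field name.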
10. General NoSQL Features
• Storage Format / Operations
• Memory / Disk Utilization
• Atomic Operations
• Auto-Sharding - Partitioning data across
servers, scales reads and writes
• Replication - Copying data between
servers, scales reads
12. Evaluation - LucidMedia
• Query latency is priority #1
• Disk access is suspect, since we’re in the
cloud
• Transactions not necessary - it’s OK to be
briefly inconsistent or even lose a few
updates
• Replication and auto-sharding are nice, but
also can be done manually
13. Products Evaluated
| Product | Type | Data Storage | Complex Operations | Storage Profile | Scalability Features | License | Used By |
|---|---|---|---|---|---|---|---|
| memcached | Key-Value | LRU cache mapping string to binary data | Check and set (CAS) | All in memory | None | Open Source (BSD) | Facebook, Twitter, YouTube |
| MongoDB | Document Store | BSON objects (binary format similar to JSON) | Indexing on multiple fields, MapReduce, atomic operations (single object) | Disk and memory | Auto-Sharding, Replication | Commercial, AGPL | FourSquare, bit.ly, ShutterFly |
| Cassandra | Key-Value (Column Store) | Column family store - similar to BigTable, multiple data types for columns | Tunable consistency, atomic operations (row level) | Disk and memory | Clustered, Replication | Open Source (Apache) | Facebook Inbox Search, Digg, Twitter |
| Redis | Key-Value | Simple key-value, supports list, set, sorted set, hash | Many atomic operations (single key) | All in memory, saved to disk for persistence | Replication, Cluster (unreleased) will provide auto-sharding | Open Source (BSD) | GitHub, Digg, LucidMedia |
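To make "check and set (CAS)" concrete, here is a small in-memory sketch of the semantics. `CasStore` is a hypothetical class, not a real client API. Its `gets` and `cas` methods are named after the memcached protocol commands, and the sketch ignores networking and threads:

```python
class CasStore:
    """Toy key-value store that hands out a token with every read."""

    def __init__(self):
        self._data = {}        # key -> (token, value)
        self._next_token = 1

    def set(self, key, value):
        # Every write gets a fresh token, invalidating older ones.
        self._data[key] = (self._next_token, value)
        self._next_token += 1

    def gets(self, key):
        """Return (value, token) for the key."""
        token, value = self._data[key]
        return value, token

    def cas(self, key, value, token):
        """Write only if nobody has written since the token was read."""
        current = self._data.get(key)
        if current is None or current[0] != token:
            return False
        self.set(key, value)
        return True

store = CasStore()
store.set("user:1", {"visits": 1})

# Two web servers read the same record, then both try to update it.
value_a, token_a = store.gets("user:1")
value_b, token_b = store.gets("user:1")
ok_a = store.cas("user:1", {"visits": value_a["visits"] + 1}, token_a)
ok_b = store.cas("user:1", {"visits": value_b["visits"] + 1}, token_b)
print(ok_a, ok_b)
```

The second writer holds a stale token, so its update is rejected instead of silently overwriting the first. A caller would normally re-read and retry.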
14. Findings
| Product | Pros | Cons | Using It? |
|---|---|---|---|
| memcached | Fast, widely used, great for caching | We need more than a cache, MemcacheDB didn’t seem widely used at the time | Yes (for other things) |
| MongoDB | Great data model and feature set, strong commercial support | Early versions had performance issues | No |
| Cassandra | Great for storing and searching huge amounts of data | Not optimized for our problem, so performance didn’t fit our needs | No |
| Redis | Lightning fast, very active development, useful feature set | No auto-sharding (yet), memory footprint (per key) is a little high | Yes |
17. Performance Testing
• Use real application data
• Approximate real conditions - run against
your web servers, not a simple test
program
• Averages hide important details - use
percentiles to measure latency
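A small sketch of why averages mislead, using made-up latency samples (98 fast requests and 2 slow ones) and a nearest-rank `percentile` helper. Both the data and the helper name are assumptions for illustration:

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has
    at least pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [2.0] * 98 + [250.0] * 2

mean = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean:.2f} p50={p50} p95={p95} p99={p99}")
```

With this data the mean lands between the two groups and describes neither. The median shows what a typical request sees, and the 99th percentile exposes the slow tail.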
18. Drivers
• Huge performance difference between drivers for the same language
• Use asynchronous driver when possible to parallelize requests
[Chart: latency (ms) against concurrency (1 to 30 threads) for the Whalin and SpyMemcached drivers]
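The benefit of parallelizing requests can be sketched with Python's `asyncio`. This is not either of the drivers named above. `fake_get` is a hypothetical stand-in that sleeps for 50 ms in place of a network round trip:

```python
import asyncio
import time

async def fake_get(key, delay=0.05):
    """Stand-in for one network round trip to a key-value store."""
    await asyncio.sleep(delay)
    return f"value-for-{key}"

async def fetch_sequential(keys):
    # Each request waits for the previous one to finish.
    return [await fake_get(k) for k in keys]

async def fetch_parallel(keys):
    # All requests are in flight at the same time.
    return list(await asyncio.gather(*(fake_get(k) for k in keys)))

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

keys = ["a", "b", "c", "d", "e"]
seq_result, seq_time = timed(fetch_sequential(keys))
par_result, par_time = timed(fetch_parallel(keys))
print(f"sequential: {seq_time:.3f}s, parallel: {par_time:.3f}s")
```

Sequential time grows with the number of keys, while parallel time stays close to a single round trip. That difference is what an asynchronous driver buys on a page that needs several lookups.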
19. Sharding
• Split into a large number of shards initially,
since you’re going to reshard eventually
• Automate shard management processes
• Measure performance and utilization
metrics in production to predict scaling
needs
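One common way to follow the first tip is to hash keys into a large, fixed number of logical shards and keep a separate table from shard to server. The sketch below assumes 1024 logical shards and hypothetical server names (`redis-0`, `redis-1`, `redis-2`). It only updates the routing table, whereas a real move would also have to copy the shard's data:

```python
import zlib

NUM_LOGICAL_SHARDS = 1024  # chosen once, never changed

# Routing table: logical shard -> physical server (hypothetical names).
shard_to_server = {s: f"redis-{s % 2}" for s in range(NUM_LOGICAL_SHARDS)}

def shard_for(key: str) -> int:
    """Keys map to a logical shard by a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_LOGICAL_SHARDS

def server_for(key: str) -> str:
    return shard_to_server[shard_for(key)]

def move_shards(shards, new_server):
    """Resharding: reassign whole logical shards to another server."""
    for s in shards:
        shard_to_server[s] = new_server

keys = [f"user-{i}" for i in range(10000)]
shards_before = {k: shard_for(k) for k in keys}

# Add a third server and hand it every fourth logical shard.
move_shards(range(0, NUM_LOGICAL_SHARDS, 4), "redis-2")

shards_after = {k: shard_for(k) for k in keys}
print(sorted(set(shard_to_server.values())))
```

Because the key-to-shard mapping never changes, adding a server moves whole shards rather than rehashing every key. The routing table is also a natural thing to automate and monitor.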