Big Data DC - NoSQL at LucidMedia

NoSQL at LucidMedia
Nick Kleinschmidt

@kleinsch
nkleinsch@lucidmedia.com

Overview
• Who is LucidMedia?
• What is NoSQL?
• Major NoSQL Products
• Performance Results
• Pro Tips
• Questions

LucidMedia

• Online Display Advertising Network

• Over 1.5B impressions/day

• Based in Reston, VA

• Hiring engineers!

The Use Case

• Server-side user database (cookie store)
• Hundreds of millions of users
• Fast access - 5-10ms
• Cloud hardware

What is NoSQL?

• Data storage tools created in reaction to
common web scaling problems with
relational databases
• Widely differing purposes and feature sets

Problem: Scaling Writes

• Relational databases scale vertically - all
records must be on the same machine
• Solution - distribute data across machines,
scaling horizontally
• This solves the scaling problem, but makes
joins, grouping, and transactions difficult
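
A minimal sketch of the horizontal approach, assuming hash-modulo routing over a fixed list of servers. The server names and helper functions are hypothetical, and the slides do not say which routing scheme LucidMedia used.

```python
import hashlib

# Hypothetical server names, for illustration only.
SERVERS = ["db-0", "db-1", "db-2", "db-3"]

def server_for(user_id: str) -> str:
    """Pick the server that owns a user record by hashing its key."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SERVERS)
    return SERVERS[index]

# Each dict stands in for one machine's local storage.
stores = {name: {} for name in SERVERS}

def put(user_id: str, record: dict) -> None:
    stores[server_for(user_id)][user_id] = record

def get(user_id: str):
    return stores[server_for(user_id)].get(user_id)

def count_all_users() -> int:
    # A query that spans all records now has to visit every server,
    # which is why joins and grouping become harder.
    return sum(len(store) for store in stores.values())
```

Single-key reads and writes touch exactly one server, so adding servers adds write capacity. Anything that needs a view across keys has to fan out to every server.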

Problem: High Latency
• Database is usually biggest contributor to
server-side application latency
• memcached pioneered the low-latency
key-value store
• Solution - compromise functionality for
speed
• Usually sacrifice transactions, advanced
query types

Problem: Inflexible Schemas
• Relational databases require schema to be
defined ahead of time
• Flexible schemas give developers more
options, such as handling upgrades in code
instead of writing SQL
• Storing custom formats can save lots of
space for records with sparse fields
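
One way to handle upgrades in code is to upgrade each record when it is read. The sketch below assumes JSON records with a `version` field; the field names and the version 1 to version 2 change are hypothetical, not taken from LucidMedia's data.

```python
import json

CURRENT_VERSION = 2

def upgrade(record: dict) -> dict:
    """Bring a stored record up to the current layout when it is read."""
    version = record.get("version", 1)
    if version < 2:
        # Hypothetical change: version 1 kept a single "segment" string,
        # version 2 keeps a list of segments.
        segment = record.pop("segment", None)
        record["segments"] = [segment] if segment is not None else []
        record["version"] = CURRENT_VERSION
    return record

def load(raw: str) -> dict:
    return upgrade(json.loads(raw))

def dump(record: dict) -> str:
    # Sparse fields are omitted entirely instead of being stored as empty columns.
    return json.dumps({k: v for k, v in record.items() if v is not None})
```

Old and new records can sit side by side in the store, and each record is rewritten in the new layout the next time it is saved.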

General NoSQL Features
• Storage Format / Operations
• Memory / Disk Utilization
• Atomic Operations
• Auto-Sharding - Partitioning data across
servers, scales reads and writes
• Replication - Copying data between
servers, scales reads
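
To illustrate what an atomic operation provides, here is a toy in-memory sketch of check-and-set (CAS), the operation this deck lists for memcached. `CasStore` and `increment` are hypothetical names for illustration, not any product's client API.

```python
import threading

class CasStore:
    """Toy in-memory store with check-and-set (CAS); not a real client API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def gets(self, key):
        """Return (version, value); version 0 means the key is absent."""
        with self._lock:
            return self._data.get(key, (0, None))

    def cas(self, key, expected_version, value) -> bool:
        """Write only if nobody else has written since expected_version."""
        with self._lock:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False
            self._data[key] = (version + 1, value)
            return True

def increment(store: CasStore, key: str) -> int:
    """Retry loop: re-read and try again whenever another writer got in first."""
    while True:
        version, value = store.gets(key)
        new_value = (value or 0) + 1
        if store.cas(key, version, new_value):
            return new_value
```

Without a transaction, a read-modify-write can still be made safe this way: a concurrent update makes the `cas` call fail, and the caller retries against the fresh value.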

Types of Products
• Key-Value: memcached, Redis, BerkeleyDB,
HBase, Cassandra, Amazon SimpleDB
• Document: MongoDB, CouchDB
• Graph: FlockDB, Neo4j

Evaluation - LucidMedia
• Query latency is priority #1
• Disk access is suspect, since we’re in the
cloud
• Transactions not necessary - it’s OK to be
briefly inconsistent or even lose a few
updates
• Replication and auto-sharding are nice, but
also can be done manually

Products Evaluated

memcached
• Type: Key-Value
• Data storage: LRU cache mapping string to binary data
• Complex operations: Check and set (CAS)
• Storage profile: All in memory
• Scalability features: None
• License: Open Source (BSD)
• Used by: Facebook, Twitter, YouTube

MongoDB
• Type: Document Store
• Data storage: BSON objects (binary format similar to JSON)
• Complex operations: Indexing on multiple fields, MapReduce, atomic operations (single object)
• Storage profile: Disk and memory
• Scalability features: Auto-Sharding, Replication
• License: Commercial, AGPL
• Used by: FourSquare, bit.ly, ShutterFly

Cassandra
• Type: Key-Value (Column Store)
• Data storage: Column family store - similar to BigTable, multiple data types for columns
• Complex operations: Tunable consistency, atomic operations (row level)
• Storage profile: Disk and memory
• Scalability features: Clustered, Replication
• License: Open Source (Apache)
• Used by: Facebook Inbox Search, Digg, Twitter

Redis
• Type: Key-Value
• Data storage: Simple key-value, supports list, set, sorted set, hash
• Complex operations: Many atomic operations (single key)
• Storage profile: All in memory, saved to disk for persistence
• Scalability features: Replication; Cluster (unreleased) will provide auto-sharding
• License: Open Source (BSD)
• Used by: GitHub, Digg, LucidMedia

Findings

memcached
• Pros: Fast, widely used, great for caching
• Cons: We need more than a cache, MemcacheDB didn’t seem widely used at the time
• Using it? Yes (for other things)

MongoDB
• Pros: Great data model and feature set, strong commercial support
• Cons: Early versions had performance issues
• Using it? No

Cassandra
• Pros: Great for storing and searching huge amounts of data
• Cons: Not optimized for our problem, so performance didn’t fit our needs
• Using it? No

Redis
• Pros: Lightning fast, very active development, useful feature set
• Cons: No auto-sharding (yet), memory footprint (per key) is a little high
• Using it? Yes

Performance - GET
[Chart: throughput (reqs/sec, axis 0 to 6000) versus concurrency (threads; ticks at 10, 20, 30, 40, 60) for MySQL (InnoDB), Memcached, and Redis]
http://www.ruturaj.net/myisam-innodb

Performance - SET
[Chart: throughput (reqs/sec, axis 0 to 6000) for MySQL (InnoDB), Memcached, and Redis; x-axis ticks at 10, 20, 30, 40, 60]
http://www.ruturaj.net/myisam-innodb

Performance Testing

• Use real application data
• Approximate real conditions - run against
your web servers, not a simple test
program
• Averages hide important details - use
percentiles to measure latency
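
A small sketch of why percentiles matter, using made-up latencies (not measurements from these tests) and a nearest-rank percentile helper written for this example:

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up latencies in milliseconds: most requests are fast, a few are slow.
latencies_ms = [2] * 95 + [80] * 5

mean = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean:.1f}ms p50={p50}ms p99={p99}ms")
```

With this data the mean sits inside a 5-10ms target even though one request in twenty takes 80ms. Only the high percentile exposes that.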

Drivers
• Huge performance difference between
drivers for the same language
• Use asynchronous driver when possible to
parallelize requests
[Chart: latency (ms, axis 0 to 6) for the Whalin and SpyMemcached drivers; x-axis ticks at 1, 10, 20, 30]
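
The effect of parallelizing requests can be sketched with Python's `asyncio`. This is not one of the drivers compared above: `fake_get` is a hypothetical stand-in that simulates a network round trip with a sleep.

```python
import asyncio
import time

async def fake_get(key: str, latency_s: float = 0.05) -> str:
    """Stand-in for a network round trip to the store (simulated with sleep)."""
    await asyncio.sleep(latency_s)
    return f"value-for-{key}"

async def fetch_sequential(keys):
    # Each request waits for the previous one to finish.
    return [await fake_get(k) for k in keys]

async def fetch_parallel(keys):
    # All requests are in flight at once; total time is about one round trip.
    return await asyncio.gather(*(fake_get(k) for k in keys))

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    keys = ["user:1", "user:2", "user:3", "user:4"]
    _, sequential_s = timed(fetch_sequential(keys))
    _, parallel_s = timed(fetch_parallel(keys))
    print(f"sequential={sequential_s:.2f}s parallel={parallel_s:.2f}s")
```

When a page needs several keys, issuing the lookups together keeps total latency close to that of the slowest single lookup instead of the sum of all of them.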

Sharding

• Split into a large number of shards initially,
since you’re going to reshard eventually
• Automate shard management processes
• Measure performance and utilization
metrics in production to predict scaling
needs
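
A minimal sketch of starting with many shards, assuming a fixed count of logical shards that are mapped onto a smaller number of physical servers. The shard count, server names, and helper functions are hypothetical.

```python
import zlib

NUM_SHARDS = 1024  # fixed up front, far more shards than servers

def shard_for(key: str) -> int:
    """The key-to-shard mapping never changes, even when servers are added."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

def assign(servers):
    """Spread logical shards over physical servers round-robin."""
    return {shard: servers[shard % len(servers)] for shard in range(NUM_SHARDS)}

def shards_to_move(old, new):
    """Shards whose owner changes; only their data has to be copied."""
    return [s for s in range(NUM_SHARDS) if old[s] != new[s]]

before = assign(["cache-a", "cache-b"])
after = dict(before)
# Grow from two servers to three by handing every third shard to the new one.
for shard in range(0, NUM_SHARDS, 3):
    after[shard] = "cache-c"

moved = shards_to_move(before, after)
print(f"{len(moved)} of {NUM_SHARDS} shards move")
```

Because keys map to shards and shards map to servers, resharding becomes a matter of moving whole shards and updating the assignment table, which is the kind of process that is practical to automate.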

Questions?

• @kleinsch
• nkleinsch@lucidmedia.com
