NoSQL - Life Beyond the Outer Join

Glen Smith
(glen@bytecode.com.au)

Objectives

 Survey the landscape of NoSQL offerings
 Learn some of the terminology
 Look at some of the Java offerings in the space
 Take away source to play with
 Be able to ask questions (but you may not get
answers)

What is NoSQL?

 (N)ot (O)nly SQL, not “Anti SQL”
 A movement rather than “one” technology
 Distributed Storage System
 Much weaker queries
 Scale across many machines
 Much larger data, much faster queries

Why NoSQL?

 Inspired by Distributed Data Storage problems
 Scale easily by adding servers
 Not suited to all problem types, but super-suited to
certain large problem types
 High-write situations (eg activity tracking or timeline
rendering for millions of users)
 A lot of relational uses are really dumbed down (eg
fetch by PK with update)

What’s wrong with RDBMS?

 Nothing ;-)
 To scale RDBMS, your approach is typically:
 Shard your datasource
 Put in a bunch of read replicas
 Put memcached in front of those
 What could possibly go wrong? 
 Complex. Custom caching. Partitioning. Migrating of
shards. Tons of moving parts.

How can I live w/o ACID?

 Atomic (it happens or not, no partial completes)
 Consistent (DB internals, referential integrity, field validation)
 Isolated (can’t see or modify another transaction’s uncommitted data)
 Durable (written to disk/transaction log)

 But in a distributed db, life is not so simple...

The CAP theorem

In a distributed system, when you have state on more
than one machine, pick any two:
 Consistency (easy in read-only states – copy!)
 Availability (can you get at your data? Is it up?)
 Partition Tolerance (3 machines on one net, 3 on the
other, with a broken link. How do you take updates
when you can’t keep both sides up to date? What if the
two sides don’t agree on what’s up?)

How do these NoSQL things work?

 Basically big distributed hashtables
 Push all logic into the write (update two lists – one for
userId, one for email)
 Things don’t happen transactionally. These are two
writes.
 There is no free lunch. The programmer is now
handling consistency problems.
 You were thinking about query optimisation before,
and now even more so.
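
A minimal sketch of "push all logic into the write", assuming a plain in-memory Map as a stand-in for the distributed store. The key naming scheme and the function names are hypothetical, not taken from any particular product.

```javascript
// Hypothetical in-memory stand-in for a distributed key/value store.
const store = new Map();

// Every lookup we want later gets its own key, written at save time.
function saveUser(user) {
  // Write 1: the record itself, keyed by userId.
  store.set(`user:id:${user.userId}`, user);
  // Write 2: an email -> userId entry. This is a separate write with no
  // transaction around the pair; a failure between the two would leave
  // the email entry missing or stale, and the application must cope.
  store.set(`user:email:${user.email}`, user.userId);
}

// "Query by email" is just two key fetches; there is no query engine.
function findByEmail(email) {
  const userId = store.get(`user:email:${email}`);
  return userId === undefined ? undefined : store.get(`user:id:${userId}`);
}

saveUser({ userId: 42, email: "glen@example.com", name: "Glen" });
console.log(findByEmail("glen@example.com").name);
```

The read side stays trivial because the write side did the work up front, which is the trade-off the slide describes.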

How big are we talking?

 Digg – 3 TB
 Facebook Inbox – 50 TB
 eBay – 2 PB
 Think about Twitter’s issues... Billions of queries a
second over TB of data.

The NoSQL Taxonomy

 Key-Value In-Memory stores (Memcached, Redis)
 Key-Value “Eventually Consistent” stores (“Dynamo
Clones” like Cassandra, Voldemort, Riak)
 Document stores (CouchDB, MongoDB, JCR)
 Graph Databases (Neo4j)
 Tabular (“BigTable clones” like Hadoop/HBase)

Memcached

 Developed for the original LiveJournal site
 LRU, distributed hashtable
 Logic is in both client and server
 Used in Google App Engine, Facebook, Twitter
 Ehcache now has a similar service
 Good for things that outlive an app server
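
To make the "LRU" bullet concrete, here is a small least-recently-used cache sketch. It relies on a JavaScript Map keeping insertion order; the class name is hypothetical and this is not how Memcached is implemented internally.

```javascript
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = new Map(); // oldest entry first, newest last
  }

  get(key) {
    if (!this.items.has(key)) return undefined;
    // Re-insert so the key becomes the most recently used entry.
    const value = this.items.get(key);
    this.items.delete(key);
    this.items.set(key, value);
    return value;
  }

  set(key, value) {
    this.items.delete(key);
    this.items.set(key, value);
    if (this.items.size > this.capacity) {
      // Evict the least recently used entry (first key in the Map).
      const oldest = this.items.keys().next().value;
      this.items.delete(oldest);
    }
  }
}

const cache = new LruCache(2);
cache.set("a", 1);
cache.set("b", 2);
cache.get("a");    // "a" is now the most recently used
cache.set("c", 3); // evicts "b"
console.log([...cache.items.keys()]);
```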

How does it work?

 Clients know how to:
 Send items to servers (consistent hashing)
 What to do when a server fails
 How to fetch keys from servers
 Can “weigh” to server capacities
 Servers know how to:
 Store items they receive
 Expire them from the cache
 No inter-server comms – servers are unaware of each other
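
The client-side logic above can be sketched as a consistent hash ring. This is an illustration only: the FNV-1a hash, the class name, the server names and the points-per-weight figure are all assumptions, not what any specific Memcached client does.

```javascript
// 32-bit FNV-1a hash; any well-spread hash function would do here.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

class HashRing {
  // servers: object mapping server name -> weight (hypothetical names).
  constructor(servers, pointsPerWeight = 100) {
    this.ring = [];
    for (const [server, weight] of Object.entries(servers)) {
      // Heavier servers get more points on the ring, so more keys land on them.
      for (let i = 0; i < weight * pointsPerWeight; i++) {
        this.ring.push({ point: fnv1a(`${server}#${i}`), server });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  // Find the first ring point at or after the key's hash, wrapping around.
  serverFor(key) {
    const h = fnv1a(key);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].server;
  }
}

const ring = new HashRing({ "cache-a": 1, "cache-b": 1, "cache-c": 2 });
console.log(ring.serverFor("user:42"));
```

When a server fails, the client rebuilds the ring without it. Only the keys that lived on the failed server move; every other key keeps its server, which is why no inter-server communication is needed.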

Voldemort

 Less than Memcached, but also more!
 Not a cache, but a distributed key/value store
 Developed by LinkedIn
 Works on distributed hashmap w/failover
 Logic can be in client/server or just server
 Pluggable storage (MySQL, BDB, mock)
 Pluggable serialization (JSON, Google Protocol Buffers, etc)

“Relaxed” Consistency

 Eventual consistency – data will come into sync but
not immediately on the write. In practice “pretty
soon” is milliseconds later
 We are actually used to this – eg Google indexes
update every so often.
 Guarantees to read your own writes (eg your profile
on LinkedIn)
 Tuneable to better performance/weaker consistency
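
One common way Dynamo-style stores expose this tuning is through three numbers: N replicas per key, R replicas that must answer a read, and W replicas that must acknowledge a write. The sketch below only does the arithmetic; the function and field names are illustrative assumptions, not a real configuration API.

```javascript
// n = replicas per key, r = replicas required for a read,
// w = replicas required to acknowledge a write.
function describeQuorum({ n, r, w }) {
  if (r > n || w > n) throw new RangeError("r and w cannot exceed n");
  return {
    // When r + w > n the read set and write set must share a replica,
    // so every read reaches at least one replica holding the latest write.
    readsSeeLatestWrite: r + w > n,
    // Number of replicas that can be down while writes still succeed.
    writeSurvivesFailures: n - w,
    // Number of replicas that can be down while reads still succeed.
    readSurvivesFailures: n - r,
  };
}

// Stronger consistency: overlapping read and write quorums.
console.log(describeQuorum({ n: 3, r: 2, w: 2 }));
// Faster and more available, but reads may briefly return stale data.
console.log(describeQuorum({ n: 3, r: 1, w: 1 }));
```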

What’s attractive?

 Data is automatically replicated
 Partitioning means each server holds only a subset of the data
 Server failure is handled transparently
 Data is rebalanced when servers added/removed
 Serialization is pluggable
 Apache License

Impressive Performance

 “We were able to move applications that needed to
handle hundreds of millions of reads and writes per day
from over 400ms to under 10ms while simultaneously
increasing the amount of data we store.”

Performance Info

http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010

Sample Script

 Starting the server (or deploy as a .war)
bin\voldemort-server.bat config\single_node_cluster
 Starting the console
bin\voldemort-shell.bat test tcp://localhost:6666
 Run some queries
put "hello" "world"
get "hello"
put "hello" "world 2.0"
delete "hello"

CouchDB

 Document-Oriented Db – No Schema
 Written in Erlang (!) by a Notes Dev (!!!)
 Everything is stored in JSON, Restful API
 Clever replication concepts – works in disconnected
settings
 Every write creates a new revision of the document
 Map/Reduce baked in
 Apache License
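
The revision idea can be sketched with an in-memory store where an update must quote the revision it was based on. CouchDB answers a stale revision with HTTP 409 Conflict; everything else here (the class name, the revision string format, keeping only the latest revision) is a simplification for illustration.

```javascript
class DocStore {
  constructor() {
    this.docs = new Map();
  }

  // Stores a new revision and returns its rev string.
  // Throws if the caller's rev is not the current one.
  put(id, body, rev) {
    const current = this.docs.get(id);
    const currentRev = current ? current.rev : undefined;
    if (rev !== currentRev) {
      // CouchDB would answer 409 Conflict in this situation.
      throw new Error(`conflict: current rev is ${currentRev}, got ${rev}`);
    }
    const generation = current ? current.generation + 1 : 1;
    // Hypothetical rev format; real CouchDB revs carry a content hash.
    const next = { generation, rev: `${generation}-sketch`, body };
    this.docs.set(id, next);
    return next.rev;
  }

  get(id) {
    return this.docs.get(id);
  }
}

const db = new DocStore();
const rev1 = db.put("post", { title: "Hi!" }, undefined);
const rev2 = db.put("post", { title: "Hi! (edited)" }, rev1);
try {
  // A second writer still holding rev1 loses and must re-read and retry.
  db.put("post", { title: "Stale edit" }, rev1);
} catch (err) {
  console.log(err.message);
}
console.log(rev2, db.get("post").body.title);
```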

What’s attractive?

 Schemaless operation – Adhoc data
 Incremental replication (great for disconnected
settings)
 Great fault-tolerance (with versioned conflicts)
 Fast query with flexibility (MapReduce)

So what is this Map/Reduce thing?

 Popularized by Google’s MapReduce paper (BigTable is the
related storage system)
 In CouchDB, map functions emit key/value rows for matching
documents, and the rows are kept in a B-Tree
 Reduce functions operate on the rows in that B-Tree
 Everything happens in parallel on many machines
 Example: distributed grep

The Naked Couch

 http://127.0.0.1:5984/
 http://127.0.0.1:5984/_all_dbs
 http://127.0.0.1:5984/mydb (PUT)
 http://127.0.0.1:5984/_utils/ (Futon)

Mapping Couch with Ektorp

 You lose some of the joy of schema-less
 But you do get lots of boilerplate ;-)
 Oh, and strong typing.

Writing a Couch MapReduce

 You write a map function to extract data
 You always return a key/value pair

function(doc) {
  // Skip documents without a title before searching it
  if (doc.title && doc.title.indexOf("Hi!") > -1) {
    emit(doc.title, doc);
  }
}
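
To try the map function above without a running CouchDB, one option is to supply a stand-in emit() and feed it a few documents by hand. The sample documents are invented, and the plain string sort only approximates how CouchDB orders view rows by key.

```javascript
const rows = [];

// Stand-in for the emit() that CouchDB supplies to map functions.
function emit(key, value) {
  rows.push({ key, value });
}

// Same logic as the map function above, given a name so it can be called.
function map(doc) {
  if (doc.title && doc.title.indexOf("Hi!") > -1) {
    emit(doc.title, doc);
  }
}

// Hypothetical documents; note that one has no title at all.
const docs = [
  { _id: "1", title: "Hi! Welcome" },
  { _id: "2", title: "Goodbye" },
  { _id: "3" },
  { _id: "4", title: "Another Hi! post" },
];

docs.forEach((doc) => map(doc));

// View rows come back ordered by key; approximate that with a string sort.
rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
console.log(rows.map((row) => row.key));
```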

Neo4j

 Stores data in a graph of nodes and relationships
 Can handle billions of nodes per machine
 Means you can query on relationships!
 Supports ACID transactions
 One 500kb jar (!)
 Dual-licensed GPL/Commercial
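
As a rough picture of what "query on relationships" means, here is a friends-of-friends lookup over a tiny in-memory graph. This is not the Neo4j API; the names and the adjacency Map are assumptions used only to show the shape of a relationship traversal.

```javascript
// Hypothetical "KNOWS" relationships, stored as adjacency lists.
const knows = new Map([
  ["alice", ["bob", "carol"]],
  ["bob", ["dave"]],
  ["carol", ["dave", "erin"]],
  ["dave", []],
  ["erin", []],
]);

// People exactly two KNOWS hops away who are not already direct friends.
function friendsOfFriends(person) {
  const direct = new Set(knows.get(person) || []);
  const result = new Set();
  for (const friend of direct) {
    for (const candidate of knows.get(friend) || []) {
      if (candidate !== person && !direct.has(candidate)) {
        result.add(candidate);
      }
    }
  }
  return [...result];
}

console.log(friendsOfFriends("alice"));
```

In a relational schema the same question needs a self-join per hop; a graph store follows the relationships directly.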

Blogvertising

 http://blogs.bytecode.com.au/glen
 http://twitter.com/glen_a_smith
 http://grailspodcast.com/

 Download all the source from today:
 http://bitbucket.org/glen_a_smith/cjug-nosql-examples

Q&A

Looking for a good book?
