2. Overview
● Evolution of/motivation for NoSQL
databases
● Characterization of NoSQL databases
● Classification of NoSQL databases
● Popularity/usage of NoSQL systems
3. A brief history of NoSQL
● Term originally coined in 1998 by Carlo Strozzi for
a specific database that did not expose a SQL interface
○ easy to use, free, text based data storage, easy
manipulation of contents of db
● Reintroduced by Eric Evans (Rackspace) in 2009
for a conference on open source distributed
databases
○ in response to increase in interest in non RDBMS
solutions
■ bringing together Cassandra, Mongo, Couch, etc
● Has grown as a movement over last 3 years
4. Current status
● Significant buzz within community in 2010
○ initial development of technology
○ pioneer deployments
○ lots of meetups/conferences/birds of feathers
● Many key technologies matured in late 2010
and 2011
○ more large deployments for some technologies
○ small companies with no legacy systems basing
operations on NoSQL
5. Current status (continued)
● 2012
○ buzz/hype is fading
○ technology continues to mature
○ increased number of deployments
○ skills sought in job market
6. NoSQL - a negative
definition
● NoSQL simply defined by being non-
relational
○ diverse set of technologies fall into NoSQL camp
● Motivations mixed
○ open source
○ scale - TB, PB - particularly for read/write latency
○ increased flexibility over RDBMS systems
○ ability to work with raw data
○ ACID not always most appropriate design choice
■ analytics data is excellent example
● Results in many different NoSQL
technologies
7. Typical characteristics
● Don't use SQL!
● Open Source
● Intended to deliver performance
○ in some dimension
● Typically JOIN not supported
○ performance hit
● Consistency often relaxed
○ eventual consistency
● More flexibility in schema
○ if schema used at all!
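To make "eventual consistency" concrete, here is a minimal sketch, assuming two in-memory replicas that resolve conflicts with a last-write-wins rule on a logical timestamp. The `Replica` class and its method names are hypothetical and only illustrate the idea; real systems use a variety of conflict-resolution schemes.

```python
class Replica:
    """One copy of the data; each value carries a logical timestamp."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # Last-write-wins (an assumed policy): keep only the newest timestamp.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        # Background exchange: pull every entry the other replica holds.
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


a, b = Replica(), Replica()
a.write("user:1", "alice", timestamp=1)
b.write("user:1", "alicia", timestamp=2)   # newer write lands on replica b only

stale = a.read("user:1")                   # replica a still serves the old value

a.sync_from(b)                             # replicas exchange state later on
b.sync_from(a)
converged = (a.read("user:1"), b.read("user:1"))
```

Between the second write and the sync, a read can return stale data; once the replicas have exchanged state they agree again.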
8. Diversity of NoSQL
databases
● 122 separate technologies listed on
http://nosql-database.org/
○ mix of commercial, open source and some
in between
● Vary in many dimensions:
○ architecture
○ interfaces
■ api/languages
○ internal data storage
○ distribution mechanisms
■ redundancy, reliability
○ usage - deployments & support community
○ maturity
9. Classification of NoSQL
systems
● Column based solutions
● Document store solutions
● Key/Value solutions
● Graph based solutions
● Less significantly:
○ XML databases
○ Object databases
○ Multivalue databases
10. Column based solutions
● Structured data
○ similar to classical tables
● Generally much more flexible
○ no rigorous schema necessary
○ can typically add columns in ad hoc fashion
■ often without explicitly declaring column
● However, can result in very different usage
○ eg can have millions of columns associated with
given row
● Examples: Hadoop/HBase, Cassandra,
Hypertable, SimpleDB
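The column model described above can be sketched as a map from row key to a sparse set of named columns. This is a toy illustration only: `ColumnTable` and its methods are hypothetical names, not the API of HBase, Cassandra or any other system listed.

```python
from collections import defaultdict


class ColumnTable:
    """Rows keyed by row key; each row holds its own sparse set of columns."""

    def __init__(self):
        self.rows = defaultdict(dict)  # row key -> {column name: value}

    def put(self, row_key, column, value):
        # Columns are created on first write; nothing is declared up front.
        self.rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        return self.rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        return sorted(self.rows.get(row_key, {}))


table = ColumnTable()
table.put("user:1", "name", "Ada")
table.put("user:1", "email", "ada@example.org")

# A second row with a completely different, much wider set of columns.
table.put("user:2", "name", "Bob")
for i in range(1000):
    table.put("user:2", f"click:{i:04d}", i)
```

Here `user:1` has two columns while `user:2` has 1001, which is the "very different usage" pattern: a wide row acts more like an ordered list of events than a classical table record.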
11. Document based solutions
● Less structured data
○ DB composed of 'documents' containing arbitrary
data
■ usually containing longer form content eg CMS
● Documents contain some structure to
support query/search/filter, etc
● Somewhat less emphasis on a key
○ can be autogenerated
● Quite unlike classical databases
● Examples: MongoDB, CouchDB
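A minimal sketch of the document model, assuming documents are plain Python dicts and keys are autogenerated when not supplied. `DocumentStore`, `insert`, `find` and the `_id` field are hypothetical names chosen for illustration, not the MongoDB or CouchDB API.

```python
import uuid


class DocumentStore:
    """Stores arbitrary dict-shaped documents under an autogenerated key."""

    def __init__(self):
        self.docs = {}

    def insert(self, doc):
        # The key matters less than in a key/value store, so generate one
        # when the caller does not provide it.
        doc_id = doc.get("_id") or uuid.uuid4().hex
        self.docs[doc_id] = {**doc, "_id": doc_id}
        return doc_id

    def find(self, **criteria):
        # Filter on whatever structure the documents happen to contain.
        return [
            doc for doc in self.docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


store = DocumentStore()
article_id = store.insert({
    "type": "article",
    "title": "Horses for courses",
    "body": "Longer form content goes here...",
    "tags": ["nosql", "overview"],
})
store.insert({"type": "comment", "author": "bob", "text": "Nice overview"})

articles = store.find(type="article")
```

The two documents share no schema beyond the `type` field, yet both can be stored and filtered in the same collection.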
12. Key/value stores
● DBs inspired by memcached
○ simple, fast key/value stores
● Attempt to retain most of DB in memory
○ fast response times
● Different designs for scalability
○ single node/multi node
● Much emphasis on the keys in this type of
DB
● Write usually overwrites entire previous entry
● Examples: Redis, Couchbase/Membase,
DynamoDB, Riak
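The key/value behaviour above, in particular a write replacing the whole previous entry, can be sketched with an in-memory dict. `KeyValueStore` is a hypothetical single-node toy, not the interface of Redis or any other store named here.

```python
class KeyValueStore:
    """In-memory store: the value is opaque and always replaced as a whole."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # No merge with the old value: the new one overwrites it entirely.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Returns True if the key existed.
        return self._data.pop(key, None) is not None


kv = KeyValueStore()
kv.set("session:42", {"user": "ada", "cart": [1, 2]})
kv.set("session:42", {"user": "ada"})   # the "cart" field is gone, not merged

session = kv.get("session:42")
```

Because the store never looks inside the value, all lookups go through the key, which is why key design gets so much emphasis in this category.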
13. Graph based solutions
● Obviously different from previous categories
○ Focus specifically on graphs
● Queries supported are graph-specific
○ eg get nodes related to specified node
● Typically support for solving standard graph
problems
○ eg shortest path, general graph traversal
● Can deliver very significant performance gains
over non-graph-specific solutions
○ for graph problems!
● Examples: Neo4j
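The two example queries above (nodes related to a given node, and shortest path) can be sketched on a plain adjacency map. This is an illustrative breadth-first search over an unweighted directed graph, with hypothetical function names; it is not how Neo4j exposes or implements these queries.

```python
from collections import deque


def related_nodes(graph, node):
    """Nodes directly connected to `node` by an outgoing edge."""
    return set(graph.get(node, ()))


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}          # node -> node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(graph.get(node, ())):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                # Walk back from the goal to the start, then reverse.
                path = [neighbour]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


graph = {
    "a": {"b", "c"},
    "b": {"d"},
    "c": {"d"},
    "d": {"e"},
    "e": set(),
}

path = shortest_path(graph, "a", "e")
```

In a relational schema the same traversal would typically need one self-join per hop, which is where graph-specific storage tends to pay off.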
14. It's a noisy space...
● Very many candidate technologies
● Relatively small number of real-world
solutions
● Difference between the classifications above is
one of emphasis...
○ column based and document based arrive at semi-
structured sweet spot from opposite ends of
spectrum
● ...although this results in different preferred
use cases...
○ document based solution better for document
problems, eg CMS
15. Common techniques used
● Hashing techniques used to map data to
nodes in cluster
● Internode communication via Gossip
● Common replication techniques
● Thrift is used in a few cases
● MapReduce often used to search over
distributed system
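One common form of the hashing technique in the first bullet is consistent hashing. The sketch below is an assumption about the general approach rather than the scheme of any particular system: `HashRing`, the `vnodes` parameter and the node names are all hypothetical.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes (a toy sketch)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns several points on the ring, which
        # spreads keys more evenly across nodes.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        index = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


small = HashRing(["node-a", "node-b", "node-c"])
big = HashRing(["node-a", "node-b", "node-c", "node-d"])

keys = [f"key{i}" for i in range(200)]
moved = [k for k in keys if small.node_for(k) != big.node_for(k)]
```

When `node-d` joins, only the keys now owned by `node-d` change location; with a naive `hash(key) % number_of_nodes` scheme most keys would move.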
19. Horses for courses...
● SQL is perfectly good solution for many
problems
○ tried and tested
● Some problems require alternative solution
○ typically driven by scale and/or flexibility
● NoSQL offers (many) alternatives
○ although relatively easy to identify realistic options
● Column based approaches good for mostly
structured data with enhanced flexibility
● Document based approaches good for
document oriented problems