2. My background
• Oracle from July 1994 to June 2003
• MarkLogic from July 2003 to Feb 2011
• 10gen (makers of MongoDB) since Feb 2011
3. In this talk
• Why is everyone and their brother inventing a new database nowadays?
• Meanwhile, lots of great analytics are happening in Hadoop with no database at all
• Why do they all look so different from each other and from what we're used to?
4. Since the dawn of the RDBMS (1970 vs 2012)
• Main memory: Intel 1103, 1K bits (1970); 4GB of RAM for $25.99 (2012)
• Mass storage: IBM 3330 Model 1, 100MB (1970); 3TB Superspeed USB drive for $129 (2012)
• Microprocessor: nearly there in 1970 (the 4004 was in development: 4 bits, 92,000 instructions per second); Westmere EX (2012) has 10 cores, 30MB of L3 cache, and runs at 2.4GHz
5. More recent changes (a decade ago vs now)
• Going faster: buy a bigger server; now, buy more servers
• Faster storage: a SAN with more spindles; now, SSDs
• More reliable storage: a more expensive SAN; now, more copies on local storage
• Deployed in: your data center; now, the cloud (private or public)
• Large data set: millions of rows; now, billions to trillions of rows
• Development: waterfall; now, iterative
• Tasks: simple transactions; now, complex analytics
6. Assumptions behind today’s DBMS
• Relational data model
• Third normal form
• ACID
• Multi-statement transactions
• SQL
• RAM is small and disks are slow
• Runs on one fast computer
7. Yesterday’s assumptions in today’s world
• Scaleout is hard
• Or impossible if you believe the CAP theorem
• Custom solutions proliferate
• Too slow? Just add a cache
• ORM tools everywhere
• Only the database is scale-up
8. Challenging some assumptions
• Do you need a database at all?
• How does it handle transactions and consistency?
• How does it scale out?
• How should it model data?
• How do you query it?
9. My opinions
• Different use cases will produce different answers
• Existing RDBMS solutions will continue to solve a broad set of problems well, but many applications will work better on top of alternative technologies
• Many new technologies will find niches, but only one or two will become mainstream
10. Do you need a database at all?
• Can you better solve your problem with a batch processing framework?
• Can you better solve your problem with an in-memory object store/cache?
11. Is Scaleout Mission Impossible?
• What about the CAP Theorem?
• It says that if a distributed system is partitioned, you can't update everywhere and still have consistency
• Duh
• So, either allow inconsistency or limit where updates can be applied
12. Two choices for consistency
• Eventual consistency
• Allow updates when a system has been partitioned
• Resolve conflicts later
• Example: CouchDB, Cassandra
• Immediate consistency
• Limit the application of updates to a single master node for a given slice of data
• Another node can take over after a failure is detected
• Avoids the possibility of conflicts
• Example: MongoDB
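The two choices can be sketched in a few lines of Python. This is an illustration only, with made-up replica and node names: the eventual-consistency side resolves conflicts with last-write-wins (the default strategy in Cassandra), and the immediate-consistency side simply refuses writes that don't arrive at the current master for a slice of data.

```python
# Eventual consistency: both sides of a partition accept writes;
# conflicts are resolved later (here: last-write-wins by timestamp).
replica_a = {"key": ("v1", 1.0)}          # value, timestamp
replica_b = {"key": ("v1", 1.0)}
replica_a["key"] = ("from_a", 2.0)        # write accepted on side A
replica_b["key"] = ("from_b", 3.0)        # concurrent write on side B

def resolve(a, b):
    """Merge two replicas after the partition heals: newest timestamp wins."""
    return {k: max(a[k], b[k], key=lambda vt: vt[1]) for k in a}

merged = resolve(replica_a, replica_b)    # "from_b" survives, "from_a" is lost

# Immediate consistency: only the current master for a key's slice of
# data accepts writes, so no conflict can ever arise.
MASTER = "node1"                          # reassigned on failover

def write(node, store, value):
    if node != MASTER:
        raise RuntimeError("not the master; retry against the new master")
    store["key"] = value
```

The tradeoff is visible in the sketch: the first approach stays writable during a partition but silently discards the losing update; the second never loses a write to conflict resolution but is unwritable until a failover completes.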
13. Transactions
• Do they exist?
• At what level of granularity?
• MongoDB example
• Transactions are document-level
• Those short transactions are atomic, consistent, isolated, and durable
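What "document-level" buys you can be shown with a toy store in Python. This is the contract, not MongoDB's implementation: every update to a single document applies as a unit, so a reader never sees a document with only some of the changed fields. The store class and field names below are invented for illustration.

```python
import threading

class DocumentStore:
    """Toy store illustrating document-level atomicity."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()   # one global lock, for brevity

    def insert(self, _id, doc):
        with self._lock:
            self._docs[_id] = dict(doc)

    def update(self, _id, changes):
        """Apply all field changes to one document as a single unit."""
        with self._lock:
            self._docs[_id].update(changes)   # all-or-nothing to readers

    def find(self, _id):
        with self._lock:
            return dict(self._docs[_id])

store = DocumentStore()
store.insert(1, {"item": "widget", "qty": 10, "reserved": 0})
# Move stock from qty to reserved; both fields change in one short,
# document-level transaction, so no reader sees qty decremented
# without reserved incremented.
store.update(1, {"qty": 9, "reserved": 1})
```

Note what this does not give you: atomicity across two documents. That is exactly the multi-statement transaction the earlier slides set aside.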
15. Scaleout architecture
• How do you distribute data among many servers?
• Choices
• Hashes (Dynamo style) vs ranges (BigTable style)
• Tradeoff: set-and-forget vs optimizability
• Physical vs logical segments
• Very important with secondary indexes
• Tradeoff: ease of cluster rebalancing vs performance optimization
• MongoDB: BigTable-style range partitioning with logical segmentation
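The hash-vs-range choice above can be sketched in a few lines. Shard names and range boundaries here are made up; the point is the shape of the two placement functions.

```python
import bisect
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]

def hash_shard(key):
    """Dynamo style: hash the key. Placement is automatic (set-and-forget),
    but contiguous keys scatter, so a range scan must hit every shard."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# BigTable/MongoDB style: split the key space into ranges. Contiguous keys
# stay together, and a range can be moved between shards to rebalance or
# optimize, which is the "optimizability" side of the tradeoff.
BOUNDARIES = ["g", "p"]   # keys < "g" -> shard0, "g".."o..." -> shard1, rest -> shard2

def range_shard(key):
    return SHARDS[bisect.bisect_right(BOUNDARIES, key)]
```

A secondary-index lookup shows why logical segmentation matters: under hashing, an index scan over a key range is always scatter-gather, while under range partitioning it can be routed to the few shards that own the range.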
16. Scaleout – no free lunch
• With a large cluster:
• No known solution to the general case of fast distributed joins
• Some subcases can be handled
• No known solution to fast distributed transactions
17. Why mess with the data model?
• Relational minus joins and multi-statement transactions is much less useful
• What about partial solutions to joins and multi-statement transactions?
• Hard to implement
• Complex for developers to understand the performance implications
• Therefore alternatives are worth considering for distributed systems
• Common alternatives
• Key-value
• Document
• Graph
• Column-family
• MongoDB example: JSON-based, document-oriented
18. Change one assumption
• First normal form: no repeating groups
• Why?
• What if that is not a requirement?
• You need many fewer joins
• Transactions are often simplified
• Data locality is often increased
• But at a cost
• Much theory is now moot
• Implementation complexity
• From a different initial assumption, different rules apply
19. Querying a database
• By primary key only
• Ad-hoc queries
• SQL or otherwise, but the language details are a minor choice
• Via map-reduce
• OLTP and BI together
• E.g., SAP HANA
• MongoDB example: ad-hoc queries (based on JSON) and map-reduce
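The map-reduce style of querying can be shown in a short Python sketch (MongoDB itself takes the map and reduce functions as JavaScript; the documents and field names here are made up). The map function emits key/value pairs per document, and the reduce function folds all values for a key, which is exactly what lets the work spread across shards.

```python
from collections import defaultdict

docs = [
    {"cust": "a", "total": 10},
    {"cust": "b", "total": 5},
    {"cust": "a", "total": 7},
]

def map_fn(doc):
    yield doc["cust"], doc["total"]     # emit(key, value) per document

def reduce_fn(key, values):
    return sum(values)                  # fold all values for one key

def map_reduce(documents):
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

totals = map_reduce(docs)               # {"a": 17, "b": 5}
```

In a sharded deployment the map phase runs on each shard against its local documents, and only the emitted pairs travel over the network to be reduced, which is why this style scales where ad-hoc cross-shard joins do not.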