2. What’s so fun about databases? Traditional database discussions talked about employee records and bank records. Now we talk about web search, data mining, the collective intelligence of tweets, and scientific and medical databases.
3. How much data can a database hold? The biggest OLTP databases: 2001: 1.1 – 10.3 TB; 2003: 9.1 – 29.2 TB; 2005: 17.7 – 100.4 TB; 2010: ~2.5 PB. The trend will continue. Very large databases bring new, unique challenges.
4. Historical Context Late 1990’s: the web scales out. Suddenly, databases were not adequate for holding the data being accumulated. Scale out vs. scale up.
5. Brewer’s Conjecture (p1) Source: Eric Brewer’s July 2000 PODC Keynote. Main points: Classic “Distributed Systems” don’t work: they focus on computation, not data. Distributing computation is easy, distributing data is hard. DBMS research is about ACID (mostly): Atomicity, Consistency, Isolation and Durability. But we forfeit “C” and “I” for availability, graceful degradation and performance; this tradeoff is fundamental. BASE: Basically Available, Soft-state, Eventual consistency.
6. Brewer’s Conjecture (p2) ACID: strong consistency; isolation; focus on “commit”; nested transactions; availability?; conservative (pessimistic); difficult evolution (e.g. schema). BASE: weak consistency (stale data OK); availability first; best effort; approximate answers OK; aggressive (optimistic); simpler; faster; easier evolution. “But I think it’s a spectrum” (Eric Brewer).
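To make the "weak consistency, stale data OK" side of BASE concrete, here is a minimal sketch, assuming a hypothetical `Replica` class with last-writer-wins versioning and a manual `sync` pass. None of these names come from Brewer's talk or from any real product; the point is only that a read can be stale until replicas exchange updates, after which they converge.

```python
class Replica:
    """One copy of the data; each key maps to (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Last-writer-wins (an assumption of this sketch):
        # keep whichever entry carries the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions so the replicas converge."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.data.items()):
            dst.write(key, value, version)


east, west = Replica(), Replica()
east.write("cart", ["book"], version=1)
sync(east, west)

# A new write is accepted by one replica without waiting for the other.
east.write("cart", ["book", "pen"], version=2)
stale = west.read("cart")   # the other replica still serves the old value

sync(east, west)
fresh = west.read("cart")   # after the exchange, both replicas agree
```

The write is available immediately on `east`; consistency across replicas arrives later, which is the tradeoff the slide describes.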
7. CAP Theorem Since then, Brewer’s conjecture has been formally proved: Gilbert & Lynch, 2002. Thus Brewer’s conjecture became the CAP theorem… …and contributed to the birth of the NoSQL movement. But the theory is not settled, while http://nosql-database.org/ lists 122 NoSQL databases.
8. What is NoSQL? Stands for “Not Only SQL”. A class of non-relational data storage systems. They usually do not require a fixed table schema, nor do they use the concept of joins. All NoSQL offerings relax one or more of the ACID properties.
9. Forces at Work Three major papers were the seeds of the NoSQL movement: CAP Theorem (discussed above), BigTable (Google), Dynamo (Amazon). Some types of data could not be modeled well in RDBMS: document storage and indexing, recursive data and graphs, time series data, genomics data.
10. NoSQL Databases Key-Value Stores: a storage system that stores values, indexed by a key. Examples: Voldemort, Dynomite, Tokyo Cabinet. BigTable Clones (aka "ColumnFamily"): a tabular model where each row (at least in theory) can have an individual configuration of columns. Examples: HBase, Hypertable, Cassandra, Amazon SimpleDB.
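The two models on this slide can be sketched with plain Python dictionaries. This is an illustration of the data shapes only, under the assumption that in-memory dicts stand in for the store; the helper names `get_column` and `columns_used` are made up for this example and do not correspond to any listed product's API.

```python
# Key-value store: opaque values indexed by a key.
kv = {}
kv["session:42"] = b"serialized blob"

# Column-family model: row key -> {column name: value}.
# Each row carries its own set of columns; there is no fixed schema.
table = {
    "user:alice": {"name": "Alice", "email": "alice@example.org"},
    "user:bob": {"name": "Bob", "phone": "555-0100", "city": "Oslo"},
}


def get_column(table, row_key, column, default=None):
    """Read one column of one row; missing rows or columns yield default."""
    return table.get(row_key, {}).get(column, default)


def columns_used(table):
    """Union of the column names that appear in any row."""
    return sorted({col for row in table.values() for col in row})
```

Here `get_column(table, "user:alice", "phone")` returns `None` because that row simply has no such column, which is the "individual configuration of columns" the slide refers to.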
11. NoSQL Databases Document Databases: collections of "documents", each of which is itself a collection of key-value pairs. Examples: CouchDB, MongoDB, Riak. Graph Databases: nodes and relationships, both of which can hold key-value pairs. Examples: AllegroGraph, InfoGrid, Neo4j.
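As a rough sketch of these two shapes, assuming plain Python lists, dicts, and tuples in place of a real database: the `find` and `neighbors` helpers below are hypothetical and are not the query API of CouchDB, MongoDB, Neo4j, or any other product named on the slide.

```python
# Document model: a collection of documents, each a key-value structure
# that may nest and need not share fields with its siblings.
posts = [
    {"_id": 1, "title": "CAP", "tags": ["theory"], "author": {"name": "Ann"}},
    {"_id": 2, "title": "Dynamo", "tags": ["kv", "amazon"]},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Graph model: nodes and relationships, both carrying key-value properties.
nodes = {
    "ann": {"kind": "person"},
    "bob": {"kind": "person"},
    "neo4j": {"kind": "database"},
}
relationships = [
    # (source node, relationship type, target node, properties)
    ("ann", "KNOWS", "bob", {"since": 2009}),
    ("bob", "USES", "neo4j", {"role": "admin"}),
]


def neighbors(node_id, rel_type):
    """Follow outgoing relationships of one type from a node."""
    return [
        dst for src, rtype, dst, _props in relationships
        if src == node_id and rtype == rel_type
    ]
```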
12. Amazon SimpleDB Key-value store. Written in Erlang (as is CouchDB). Data is modeled in terms of Domain (a container of entities), Item (an entity), and Attribute and Value (a property of an Item). Eventually consistent, except when the ConsistentRead flag is specified. Impressive performance numbers, e.g., 0.7 sec to store 1 million records. SQL-like SELECT: select output_list from domain_name [where expression] [sort_instructions] [limit limit]
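The Domain / Item / Attribute model and the shape of the SELECT statement can be mimicked in a few lines of Python. This is a local sketch only: the `select` function and the `products` domain are hypothetical, it does not call the SimpleDB service or any SDK, and it simplifies by giving each attribute a single string value.

```python
def select(domain, output_list, where=None, sort_key=None, limit=None):
    """Mimic: select output_list from domain [where ...] [sort] [limit]."""
    # where: optional predicate over an item's attribute dict.
    items = [
        (name, attrs) for name, attrs in domain.items()
        if where is None or where(attrs)
    ]
    if sort_key is not None:
        # Values are strings in this sketch, so ordering is lexicographic.
        items.sort(key=lambda pair: pair[1].get(sort_key, ""))
    if limit is not None:
        items = items[:limit]
    # Project each surviving item onto the requested attributes.
    return [
        (name, {a: attrs[a] for a in output_list if a in attrs})
        for name, attrs in items
    ]


# A domain is a container of items; an item is a set of attribute/value pairs.
products = {
    "item1": {"category": "book", "price": "0012"},
    "item2": {"category": "book", "price": "0007"},
    "item3": {"category": "pen", "price": "0002"},
}

cheapest_book = select(
    products,
    ["price"],
    where=lambda attrs: attrs.get("category") == "book",
    sort_key="price",
    limit=1,
)
```

Prices are zero-padded here so that string ordering matches numeric ordering; that padding is a choice of this sketch.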
13. Google Datastore Part of App Engine; also used for internal applications. Used for all storage. Incorporates a transaction model to ensure high consistency: optimistic locking; transactions can fail. CAP implications: Datastore isn’t just “eventually consistent”. They offer two commercial options (with different prices). Master/Slave: low latency but also lower availability; asynchronous replication. High Replication: strong availability at the cost of higher latency.
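"Optimistic locking; transactions can fail" can be illustrated with a version check at commit time and a retry loop in the caller. The sketch below assumes a hypothetical in-memory `Store` and `ConflictError`; it is not the App Engine Datastore API, only the general technique.

```python
class ConflictError(Exception):
    """Raised when the data changed since it was read."""


class Store:
    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys read as version 0 with no value.
        return self._rows.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        # No lock is held between read and commit; instead the commit
        # fails if someone else committed in the meantime.
        current_version, _ = self._rows.get(key, (0, None))
        if current_version != expected_version:
            raise ConflictError(key)
        self._rows[key] = (current_version + 1, new_value)


def increment(store, key, retries=3):
    """Read-modify-write that retries when the transaction fails."""
    for _ in range(retries):
        version, value = store.read(key)
        try:
            store.commit(key, version, (value or 0) + 1)
            return True
        except ConflictError:
            continue  # re-read the fresh version and try again
    return False
```

The design choice is that readers and writers never block each other; the cost is that callers must be prepared for a commit to be rejected.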
15. Datastore Application at Google For more info, see the video of Ryan Barrett’s talk at Google I/O.
19. Hadoop MapReduce An open source project of the Apache Foundation. Other Hadoop-related projects at Apache include: Cassandra™: A scalable multi-master database with no single points of failure. HBase™: A scalable, distributed database that supports structured data storage for large tables. Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. Pig™: A high-level data-flow language and execution framework for parallel computation. See the Apache Hadoop website for more.
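For readers new to the programming model behind Hadoop MapReduce, a single-process word count shows the three stages: map, shuffle (group by key), and reduce. This is a sketch in plain Python with made-up function names; it does not use the Hadoop API and runs on one machine, whereas a real job distributes each stage across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

For example, `word_count(["a b a", "b c"])` counts "a" twice, "b" twice, and "c" once.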
20. Hadoop Availability Run on your laptop. Run on your server. Run on Amazon Cloud. Introduction at IBM developerWorks. Run on Google App Engine (it’s not Hadoop, it’s Google’s implementation of MapReduce).
21. MapReduce Statistics @ GOOG Take-away message: MapReduce is not a “new-fangled technology of the future”. It is here, it is proven, use it!
23. Take Aways NoSQL databases are a solution to web-scale problems. A lot of data lives outside relational databases. With SQLnix.org, we are starting a local resource for NoSQL database knowledge, taking on projects to apply the technology, not just read about it. If you want to work on it, please contact us. Thanks.