NoSQL databases and MapReduce: From Big Data to distributed computing models

NoSQL databases and MapReduce J Singh Early Stage IT

What’s so fun about databases? Traditional database discussions talked about employee records and bank records. Now we talk about web search, data mining, the collective intelligence of tweets, and scientific and medical databases.

How much data can a database hold? The biggest OLTP databases: 2001: 1.1 – 10.3 TB; 2003: 9.1 – 29.2 TB; 2005: 17.7 – 100.4 TB; 2010: ~2.5 PB. The trend will continue. Very large databases bring new, unique challenges.

Historical Context. Late 1990s: the web scales out. Suddenly, databases were not adequate for holding the data being accumulated. Scale out vs. scale up.

Brewer’s Conjecture (p1). Source: Eric Brewer’s July 2000 PODC Keynote. Main points:
Classic “Distributed Systems” don’t work. They focus on computation, not data. Distributing computation is easy; distributing data is hard.
DBMS research is (mostly) about ACID: Atomicity, Consistency, Isolation and Durability. But we forfeit “C” and “I” for availability, graceful degradation and performance; this tradeoff is fundamental.
BASE: Basically Available, Soft-state, Eventual consistency.

Brewer’s Conjecture (p2).
ACID: Strong consistency. Isolation. Focus on “commit”. Nested transactions. Availability? Conservative (pessimistic). Difficult evolution (e.g. schema).
BASE: Weak consistency (stale data OK). Availability first. Best effort. Approximate answers OK. Aggressive (optimistic). Simpler! Faster. Easier evolution.
“But I think it’s a spectrum” (Eric Brewer).

CAP Theorem. Since then, Brewer’s conjecture has been formally proved (Gilbert & Lynch, 2002). Thus Brewer’s conjecture became the CAP theorem, and contributed to the birth of the NoSQL movement. But the theory is not settled, even as http://nosql-database.org/ lists 122 NoSQL databases.

What is NoSQL? Stands for Not Only SQL. A class of non-relational data storage systems. They usually do not require a fixed table schema, nor do they use the concept of joins. All NoSQL offerings relax one or more of the ACID properties.

Forces at Work. Three major papers were the seeds of the NoSQL movement: the CAP Theorem (discussed above), BigTable (Google), and Dynamo (Amazon). Some types of data could not be modeled well in an RDBMS: document storage and indexing, recursive data and graphs, time series data, genomics data.

NoSQL Databases.
Key-Value Stores: a storage system that stores values, indexed by a key. Examples: Voldemort, Dynomite, Tokyo Cabinet.
BigTable Clones (aka "ColumnFamily"): a tabular model where each row (at least in theory) can have an individual configuration of columns. Examples: HBase, Hypertable, Cassandra, Amazon SimpleDB.

NoSQL Databases.
Document Databases: collections of "documents", each of which is itself a collection of key-value pairs. Examples: CouchDB, MongoDB, Riak.
Graph Databases: nodes and relationships, both of which can hold key-value pairs. Examples: AllegroGraph, InfoGrid, Neo4j.

Amazon SimpleDB. Key-value store. Written in Erlang (as is CouchDB). Data is modeled in terms of a Domain (a container of entities), an Item (an entity), and Attributes and Values (the properties of an Item). Eventually consistent, except when the ConsistentRead flag is specified. Impressive performance numbers, e.g., 0.7 sec to store 1 million records. SQL-like SELECT:
select output_list from domain_name [where expression] [sort_instructions] [limit limit]
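The Domain / Item / Attribute model and the shape of the SELECT statement can be illustrated with plain Python dictionaries. This is only a sketch of the data model, not the SimpleDB API: the `select` helper, its parameters, the item names and the attribute values are all hypothetical, and values are kept as strings purely for simplicity.

```python
# Hypothetical in-memory stand-in for one domain:
# item name -> {attribute name: value}
domain = {
    "item_001": {"category": "book", "year": "2010"},
    "item_002": {"category": "video", "year": "2011"},
    "item_003": {"category": "book", "year": "2008"},
}

def select(domain, output_list, where=None, sort_key=None, limit=None):
    """Mimic: select output_list from domain [where] [sort] [limit]."""
    # where: keep only items whose attributes satisfy the predicate
    rows = [(name, attrs) for name, attrs in domain.items()
            if where is None or where(attrs)]
    # sort_instructions: order the surviving items
    if sort_key is not None:
        rows.sort(key=lambda row: sort_key(row[1]))
    # limit: truncate the result
    if limit is not None:
        rows = rows[:limit]
    # output_list: project only the requested attributes
    return [(name, {a: attrs[a] for a in output_list if a in attrs})
            for name, attrs in rows]

oldest_book = select(
    domain,
    output_list=["year"],
    where=lambda attrs: attrs.get("category") == "book",
    sort_key=lambda attrs: attrs["year"],
    limit=1,
)
print(oldest_book)  # [('item_003', {'year': '2008'})]
```

Note that items in the same domain need not share the same attributes, which is the schema-free property mentioned earlier.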

Google Datastore. Part of App Engine; also used for internal applications. Used for all storage. Incorporates a transaction model to ensure high consistency: optimistic locking, so transactions can fail. CAP implications: Datastore isn’t just “eventually consistent”. They offer two commercial options (with different prices):
Master/Slave: low latency but also lower availability; asynchronous replication.
High Replication: strong availability at the cost of higher latency.
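Why optimistic locking means that "transactions can fail" can be seen in a small in-memory sketch. This is not the Datastore API: `VersionedStore`, `ConflictError`, `increment` and the `before_commit` hook are hypothetical names used only to illustrate the general technique of checking a version at commit time and retrying on conflict.

```python
class ConflictError(Exception):
    """Raised when another writer committed first."""

class VersionedStore:
    """Hypothetical store: each key maps to (version, value)."""
    def __init__(self):
        self._data = {}

    def read(self, key):
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        current_version, _ = self.read(key)
        # No lock was held while the caller worked; detect conflicts now.
        if current_version != expected_version:
            raise ConflictError(key)
        self._data[key] = (current_version + 1, new_value)

def increment(store, key, retries=3, before_commit=None):
    """Read, modify, commit; retry if the commit fails."""
    for attempt in range(retries):
        version, value = store.read(key)
        if before_commit is not None:
            before_commit(attempt)  # test hook simulating a rival writer
        try:
            store.commit(key, version, (value or 0) + 1)
            return True
        except ConflictError:
            continue  # someone else won; re-read and try again
    return False

store = VersionedStore()
increment(store, "hits")

def rival_write(attempt):
    # On the first attempt only, another writer sneaks in a commit.
    if attempt == 0:
        version, value = store.read("hits")
        store.commit("hits", version, value + 10)

ok = increment(store, "hits", before_commit=rival_write)
print(ok, store.read("hits"))  # True (3, 12)
```

The first attempt fails because the rival changed the version between read and commit; the retry succeeds. The design choice is to pay for conflicts only when they happen, rather than holding locks on every transaction.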

Datastore Application at Google. For more info, see the video of Ryan Barrett’s talk at Google I/O.

Databases and Key-Value Stores http://browsertoolkit.com/fault-tolerance.png

MapReduce Conceptual Underpinnings. Programming model from Lisp and other functional languages:
(map square '(1 2 3 4)) ⇒ (1 4 9 16)
(reduce + '(1 4 9 16)) ⇒ 30
Easy to distribute. Nice failure/retry semantics.
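The same two expressions can be written in Python as a minimal sketch. The final part, which splits the input into two chunks, is an added illustration (not from the slides) of why the model is easy to distribute: each map call is independent and addition is associative, so partial results can be computed separately and then combined.

```python
from functools import reduce
from operator import add

def square(x):
    return x * x

numbers = [1, 2, 3, 4]

# map: apply a function to every element independently
squares = list(map(square, numbers))

# reduce: fold the mapped values into a single result
total = reduce(add, squares)

# Illustration of distribution: process two chunks separately,
# then combine the partial sums.
chunks = [numbers[:2], numbers[2:]]
partials = [reduce(add, map(square, chunk)) for chunk in chunks]
distributed_total = reduce(add, partials)

print(squares)            # [1, 4, 9, 16]
print(total)              # 30
print(distributed_total)  # 30
```

If one chunk's worker fails, only that chunk needs to be recomputed, which is the intuition behind the failure/retry semantics.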

Hadoop MapReduce. An open source project of the Apache Foundation. Other Hadoop-related projects at Apache include:
Cassandra™: A scalable multi-master database with no single points of failure.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Pig™: A high-level data-flow language and execution framework for parallel computation.
See the Apache Hadoop website for more.
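For readers who have not seen a MapReduce job, the classic word count can be sketched in plain Python. This does not use Hadoop or any of its APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, and everything runs in a single process. It only shows the three conceptual steps a framework would run across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    # Each document could be mapped on a different machine.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    # Each key could be reduced on a different machine.
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["the web scales out", "the data scales"])
print(counts)
```

Here "the" and "scales" each end up with a count of 2 and the remaining words with a count of 1.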

Hadoop Availability. Run on your laptop. Run on your server. Run on Amazon Cloud (introduction at IBM DeveloperWorks). Run on Google App Engine (it’s not Hadoop, it’s Google’s implementation of MapReduce).

MapReduce Statistics @ GOOG. Take-away message: MapReduce is not a “new-fangled technology of the future”. It is here, it is proven, use it!

End of an Era? The relational model is not necessarily the answer. It was excellent for data processing, but it is not a natural fit for data warehouses, web-oriented search, real-time analytics, and semi-structured data (i.e., the Semantic Web). SQL is not the answer. The coupling between modern programming languages and SQL is “ugly beyond belief”. Programming languages have evolved while SQL has remained static: Pascal, C/C++, Java, and the little languages (Python, Perl, PHP, Ruby). A critique of the “one size fits all” assumption in DBMS.
