This document discusses scalable systems and NoSQL databases. It gives examples of companies that run scalable systems, such as Google with BigTable, Amazon with Dynamo and SimpleDB, and Twitter with Hadoop and Pig. It then defines a scalable system as one in which adding resources yields a proportional gain in performance: doubling the resources should double the performance. Finally, it lists some popular technologies used to build scalable systems, including distributed key-value stores, document databases, and MapReduce.
2. Some scalable systems
Google ~ BigTable
Amazon ~ Dynamo ~ SimpleDB
Microsoft ~ Powerset ~ Bing ~ Dynomite
Twitter ~ Hadoop ~ Pig
Facebook ~ Digg ~ Cassandra ~ Thrift
Nasdaq ~ tin ~ text & filesystem
Akamai ~ Riak
Ubuntu ~ LHC ~ BBC ~ CouchDB
Linkedin ~ Gilt ~ Voldemort
Business Insider ~ MongoDB
Stuff built in Erlang by guys with physics degrees
3. How they define scalable
If I add X resources, then I gain X performance.
If I double my nodes (servers), then I should get double the computing power.
If I double my processors, then the processing should take half as long.
If I double my network bandwidth, then I should be able to transmit twice as fast or twice as much data.
If we double the number of developers, then we should get twice as much work done.
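The definition above can be checked numerically. The following is a minimal sketch, not something taken from the slides: the function name `scaling_efficiency` and the throughput figures are assumptions chosen for illustration. It compares the gain in performance with the gain in resources; a result of 1.0 corresponds to the "double the nodes, double the computing power" case.

```python
def scaling_efficiency(base_resources, base_throughput,
                       new_resources, new_throughput):
    """Return performance gain divided by resource gain.

    1.0 means linear scaling as defined on the slide;
    values below 1.0 mean the added resources were only partly converted
    into performance.
    (Hypothetical helper, named here for illustration only.)
    """
    if base_resources <= 0 or base_throughput <= 0 or new_resources <= 0:
        raise ValueError("resources and baseline throughput must be positive")
    resource_gain = new_resources / base_resources
    performance_gain = new_throughput / base_throughput
    return performance_gain / resource_gain


# Doubling nodes from 4 to 8 and doubling throughput: linear scaling.
linear = scaling_efficiency(4, 1000, 8, 2000)

# Doubling nodes but gaining only 50% throughput: sub-linear scaling.
sublinear = scaling_efficiency(4, 1000, 8, 1500)
```

With these made-up numbers, `linear` is 1.0 and `sublinear` is 0.75, so the second system falls short of the slide's definition.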
4. Some chatter dump
No… SQL, ORMs, Schemas, Joins, Foreign Keys, Transactions, ACID, RDBMS
Distributed Key/Value Stores ~ Document-oriented Database ~ MapReduce
Functional Languages ~ Erlang ~ F# ~ No OO
RESTful ~ JSON ~ BSON ~ HTTP
Horizontal vs. Vertical Scaling
Google Bigtable Paper
Dynamo Amazon Paper
CAP Theorem (Consistency, Availability, Partition Tolerance) ~ Only 2 @ a time.
BASE ~ Eventually Consistent for High Availability ~ DNS
SLA ~ Number of 9s
Code for Failure ~ Fault-tolerance ~ Graceful Degradation
SN (Shared Nothing) Architecture ~ No bottlenecks
Sharding ~ Horizontal Partitioning
Distributed Map ~ Consistent Hashing (Ring of Nodes)
Sloppy Quorum ~ Minimum Nodes for R/W
Hinted Handoff ~ Always Writeable ~ Handles Temp failures
Merkle Tree Replication ~ Handles Permanent Failures
Fault-tolerance ~ Read-Repair ~ Replication
Vector Clocks (node, counter) ~ No Wall Clocks
SuperColumns ~ ColumnFamily
Stateless App Servers ~ P2P Bootstrapping
CDN (Content Delivery Network)
MVCC (Multiversion Concurrency Control) ~ B-tree ~ Tail Appends ~ Cluster Rebalancing
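Of the terms above, "Consistent Hashing (Ring of Nodes)" is perhaps the easiest to show in a few lines. What follows is a minimal sketch under stated assumptions, not the implementation used by Dynamo, Cassandra, or any other system named in these slides. The class name `HashRing`, the use of MD5 for ring positions, and the `vnodes` parameter are all choices made for this illustration. Each node is placed at several positions on a ring, and a key belongs to the first node position found clockwise from the key's hash.

```python
import bisect
import hashlib


def _position(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring, for illustration only."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # ring positions per physical node
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _position(f"{node}#{i}")
            if point in self._owner:
                continue  # position already taken; skip it
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _position(f"{node}#{i}")
            if self._owner.get(point) == node:
                del self._owner[point]
                self._points.remove(point)

    def get_node(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First position clockwise from the key, wrapping past the end.
        idx = bisect.bisect_right(self._points, _position(key))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{n}" for n in range(1000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch, every key in `moved` now belongs to `node-d`, and all other keys stay where they were. That is the property that makes cluster rebalancing cheap compared with `hash(key) % number_of_nodes`, where adding a node reassigns most keys.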
5. Some popular reads
(Brewer’s CAP theorem) Towards Robust Distributed Systems
http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
(Google) Bigtable: A Distributed Storage System for Structured Data
http://labs.google.com/papers/bigtable-osdi06.pdf
Dynamo: Amazon’s Highly Available Key-value Store
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf