NoSQL, CAP, and relativity
2013-09-18
Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga
Agenda
• CAP theorem mostly
• A bit about NoSQL databases
• Are triple stores NoSQL?
• Connection with Einstein’s theory of
relativity
• And, finally, a surprise
NoSQL databases
What makes a NoSQL database?
• Doesn’t use SQL as query language
– usually more primitive query language
– sometimes key/value only
• BASE rather than ACID
– that is, sacrifices consistency for availability
– much more about this later
• Schemaless
– that is, data need not conform to a predefined
schema
BASE vs ACID
• ACID
– Atomicity
– Consistency
– Isolation
– Durability
• BASE
– Basically Available
– Soft-state
– Eventual consistency
Eventual consistency
• A key property of non-ACID systems
• Means
– if no further changes made,
– eventually all nodes will be consistent
• In itself eventual consistency is a very weak
guarantee
– when is “eventually”? it doesn’t say
– in practice it means the system can be inconsistent at
any time
• Stronger guarantees are sometimes made
– with prediction and measuring, actual behaviour can
be quantified
– in practice, systems often appear strongly consistent
Implementing ev. consistency
• Nodes must exchange information about
writes
– basically, after performing a write, node must
inform all other replicas of the changed objects
– signal OK to reader before or during replication
– for example, by broadcast to all nodes
• Must have some way to deal with conflicts
– all nodes must agree on conflict resolution
– common solution: embed clock value in write
message, then let last writer win
– clock need not be in sync for all nodes
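The last-writer-wins rule above can be sketched as follows. This is a minimal illustration, not the implementation of any particular database: the `Replica` and `Write` names are hypothetical, and using a logical counter with the node id as tie-breaker is one assumed way to embed a clock value without synchronized wall clocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: object
    clock: int      # logical clock value embedded in the write message
    node_id: str    # tie-breaker when two clocks are equal


class Replica:
    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = 0
        self.store = {}  # key -> winning Write

    def local_write(self, key, value):
        # Accept the write locally; the caller then ships it to other replicas.
        self.clock += 1
        write = Write(key, value, self.clock, self.node_id)
        self.apply(write)
        return write

    def apply(self, write):
        # Every replica uses the same rule, so all converge on the same winner.
        self.clock = max(self.clock, write.clock)
        current = self.store.get(write.key)
        if current is None or (write.clock, write.node_id) > (current.clock, current.node_id):
            self.store[write.key] = write

    def read(self, key):
        write = self.store.get(key)
        return None if write is None else write.value


a, b = Replica("a"), Replica("b")
wa = a.local_write("x", 1)   # concurrent writes on two replicas
wb = b.local_write("x", 2)
a.apply(wb)                  # replication, in either order
b.apply(wa)
print(a.read("x"), b.read("x"))  # both replicas report the same value
```

Note that "last" here means "highest clock value", which is not necessarily the write a human observer would call the latest.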
What’s wrong with ACID?
• Semantics are easier for developers
– applications can lean back and just trust the db
• However, doesn’t scale as well
– that is, doesn’t scale with number of nodes
– requires too much communication and agreement
between nodes
• Bigger web sites have therefore gone BASE
– Facebook, Flickr, Twitter, ...
Other benefits of NoSQL
• Schemaless
– possible to write much more flexible code
– schema evolution vastly easier
• Avoid joins
– document databases allow hierarchical non-normalized
objects to be retrieved directly
Downsides to NoSQL
• Everyone knows SQL
– few people know your specific NoSQL database
• Lack of validation
– code will typically do anything the database lets it
get away with (especially over time)
• No standards
– you can’t easily switch databases
– (well, except with SPARQL)
• Lack of maturity
– lack of supporting tools, unpleasant surprises, ...
• Weak query languages
– means you have to do more in code
– may hurt performance
Triple stores (SPARQL)
• Non-SQL? Yes
• BASE? No
• Schemaless? Yes
Only two out of three, so whether
triple stores are NoSQL databases
is debatable.
At the very least, they differ substantially
from the core examples of NoSQL
databases.
Can triple stores be BASE?
• In theory, yes
– nothing inherent in graph structure that prevents it
• But,
– how do you shard graph data?
– no known way to do it that’s efficient
• So, in practice this is hard
When should you use NoSQL?
• If scalability is a concern
– however, don’t forget that sites like Flickr and
Wikipedia used RDBMSs for years
– relational databases scale a long way
• If schemalessness is important
– sometimes it really is
– seriously consider RDF for this use case
• If fashion is a concern
– for a surprising number of people, using something
new and shiny is the main thing
The CAP Theorem
CAP
• Consistency
– all nodes always give the same answer
• Availability
– nodes always answer queries and accept updates
• Partition-tolerance
– system continues working even if one or more
nodes go quiet
CAP Theorem: You can only have two of these.
Partition tolerance: Without this, the cluster dies the moment one node goes silent. Can’t really drop this one.
C ≠ C
• C in ACID
– means all data obeys constraints in schema
• C in CAP
– means all servers agree on the data
• However,
– ACID implementations also follow the C in CAP
History
• First formulated by Eric Brewer in 2000
– based on experience with Inktomi search engine
– described the SQL/NoSQL divide very well
– coined the BASE acronym
• Formalized and proven in 2002
– by Seth Gilbert and Nancy Lynch
• Today CAP is better understood
– widely considered a key tradeoff in designing
distributed systems
– and particularly databases
– in some ways gave rise to NoSQL databases
Consistent or Available?
Consistency
[Diagram: Client 1, DB node 1, DB node 2, and Client 2. Messages shown: “read account X” (balance -> 100) and “set account X balance = 0”, the latter four times.]
Availability
[Diagram: Client 1, DB node 1, DB node 2, and Client 2. Messages shown: “read account X” (balance -> 100) twice and “set account X balance = 0” four times.]
Happy customer walks away, richer by 200.
Servers eventually agree balance is 0.
What exactly is the problem?
• The ordering of events affects the outcome
• The different nodes do not necessarily
observe the same order
• However, it is possible to impose a
consistent ordering of events
• The trouble is, that involves
communication with all nodes
• Waiting for all of them takes time
Math digression
Time, Clocks and the Ordering of
Events in a Distributed System
The origin of this paper was a note titled The
Maintenance of Duplicate Databases by Paul
Johnson and Bob Thomas. I believe their note
introduced the idea of using message timestamps
in a distributed algorithm. I happen to
have a solid, visceral understanding of
special relativity (see [5]). This enabled me to
grasp immediately the essence of what they
were trying to do. ... I realized that the
essence of Johnson and Thomas's algorithm
was the use of timestamps to provide a total
ordering of events that was consistent with
the causal order. ...
It didn't take me long to realize that an
algorithm for totally ordering events could be
used to implement any distributed system.
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
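The idea in the quotation, timestamps that give a total order consistent with the causal order, can be sketched with a logical clock. This is a simplified illustration in the spirit of the paper, not a transcription of it; the `Process` class and its method names are hypothetical, and ties between equal counters are broken by process id.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def local_event(self):
        # Tick the counter for every event; (time, pid) is the event's timestamp.
        self.time += 1
        return (self.time, self.pid)

    def send(self):
        # Sending is an event too; the counter value travels with the message.
        self.time += 1
        return self.time

    def receive(self, message_time):
        # Jump past the sender's clock so the receive is ordered after the send.
        self.time = max(self.time, message_time) + 1
        return (self.time, self.pid)


p, q = Process("p"), Process("q")
e1 = p.local_event()       # happens on p
sent = p.send()            # p sends a message to q
e2 = q.receive(sent)       # q receives it, so e1 causally precedes e2
e3 = q.local_event()
print(sorted([e3, e2, e1]))  # sorting timestamps respects the causal order
```

Because timestamps are compared as `(time, pid)` tuples, any two events are comparable, which is what turns the causal partial order into a total one.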
Order theory
• An ordering relation is any relation ≤ such
that
– a ≤ a (reflexivity)
– if a ≤ b and b ≤ a then a = b (antisymmetry)
– if a ≤ b and b ≤ c then a ≤ c (transitivity)
• A total order is an order such that
– a ≤ b or b ≤ a (totality)
• A partial order is an order that need not be total
– that is, for some pairs a and b, it may be that neither a ≤ b nor b ≤ a
Examples
• Total orders
– normal ordering of numbers and letters
• Partial orders
– ordering sets by the subset relation (see figure)
– “In general, the ·order-relation· on duration is a
partial order since there is no determinate
relationship between certain durations such as one
month (P1M) and 30 days (P30D)” XML Schema, pt2
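The order axioms and the two examples above can be checked mechanically on small sample sets. The following is a brute-force sketch; the function names and sample values are chosen here for illustration only.

```python
from itertools import product


def is_order(elems, leq):
    """Check reflexivity, antisymmetry and transitivity over a finite sample."""
    reflexive = all(leq(a, a) for a in elems)
    antisymmetric = all(a == b or not (leq(a, b) and leq(b, a))
                        for a, b in product(elems, repeat=2))
    transitive = all(leq(a, c) or not (leq(a, b) and leq(b, c))
                     for a, b, c in product(elems, repeat=3))
    return reflexive and antisymmetric and transitive


def is_total(elems, leq):
    """Check totality: every pair is comparable in at least one direction."""
    return all(leq(a, b) or leq(b, a) for a, b in product(elems, repeat=2))


numbers = [1, 2, 3, 4]
subsets = [frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 2})]


def number_leq(a, b):
    return a <= b


def subset_leq(a, b):
    return a.issubset(b)


print(is_order(numbers, number_leq), is_total(numbers, number_leq))  # True True
print(is_order(subsets, subset_leq), is_total(subsets, subset_leq))  # True False
```

The subset relation fails totality because `{1}` and `{2}` are not comparable: neither is a subset of the other.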
Relativity
History
• 1687
– Isaac Newton publishes Philosophiæ
Naturalis Principia Mathematica
– physics begins
– no changes over next two centuries
• 1905
– Albert Einstein publishes special relativity
– abandons notions of fixed time and space
• 1916
– Einstein’s general relativity
– takes into account gravity
– no changes since
...
(we skip 270 slides)
The barn is too small
• Three people (M, F, and B) own a board
(5m wide) and a barn (4m wide)
• The board doesn’t fit inside the barn!
• What to do?
We have a solution!
As seen by F & B: board is 4m long (relativistic shortening), barn continues to be 4m wide.
When the board is exactly inside the barn, F and B will close their doors simultaneously,
and the problem will be solved.
(As seen by M: board is 5m long (at rest relative to him), barn shortened to 3.2m wide. Pay no attention to this.)
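The lengths quoted above follow from the standard length-contraction formula. The slide does not state the speed; 0.6 c is inferred here from the 5m board appearing 4m long, and the function names are illustrative.

```python
import math


def lorentz_factor(beta):
    """Gamma for a speed given as a fraction of the speed of light."""
    return 1.0 / math.sqrt(1.0 - beta * beta)


def contracted_length(rest_length, beta):
    """Length of a moving object as measured by an observer it moves past."""
    return rest_length / lorentz_factor(beta)


beta = 0.6  # assumed speed as a fraction of c, inferred from 5m -> 4m
board_seen_by_f_and_b = contracted_length(5.0, beta)  # board moves past F and B
barn_seen_by_m = contracted_length(4.0, beta)         # barn moves past M

print(round(lorentz_factor(beta), 3))      # 1.25
print(round(board_seen_by_f_and_b, 3))     # 4.0
print(round(barn_seen_by_m, 3))            # 3.2
```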
What they observe
F and B
• When the board is just
inside, both close their
doors simultaneously
• Right after, the board
crashes through back door
M
• B shuts his door just as
the front of the board
reaches him
• 0.6 seconds later, F closes
his door
• Board crashes through
back door
The key point
• They don’t agree on the order of events!
– and this is not a paradox
– it is in fact how the universe works
• Change the story slightly and the three
people could have three different orders of
events
• No total order of events exists on which all
observers can agree
– the ordering of events in the universe is a partial
order
• What then of causality?
– if A causes B, but some people think B happened
before A, then what?
Resolution
• A, B, and C are events
• The cone is the “light cone” from A
– that is, the spread of light from A
• C is outside the cone
– therefore A cannot influence C
– observers may disagree on order of A & C
• B is inside the cone
– therefore A can influence B
– observers cannot disagree on the order of A & B
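Whether one event lies inside another's light cone is a simple comparison of distance against the distance light can cover in the available time. A minimal sketch, assuming one spatial dimension and events given as `(t, x)` pairs in a single reference frame; the function name and sample coordinates are hypothetical.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def can_influence(a, b):
    """True if event b lies in or on the future light cone of event a.

    Events are (t, x) pairs: time in seconds, position in metres.
    """
    dt = b[0] - a[0]
    dx = abs(b[1] - a[1])
    return dt >= 0 and dx <= SPEED_OF_LIGHT * dt


A = (0.0, 0.0)
B = (1.0, 1000.0)    # one second later, 1 km away: light gets there easily
C = (1e-6, 1000.0)   # one microsecond later, 1 km away: light covers only ~300 m

print(can_influence(A, B))  # True
print(can_influence(A, C))  # False
```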
...
(we skip 532 slides)
Back to CAP
Relevance to the CAP theorem
• Distributed nodes will never agree on the
order of events
– unless a communication delay is introduced
• Basically, only events inside the “light
cone” can be totally ordered
– communications delay can never be less than time
taken by light to traverse physical distance
– in practice, it will be quite a bit bigger
– how big depends on hardware and design
constraints
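The lower bound on communication delay is easy to put a number on. In the sketch below the 4,000 km distance is a made-up example, not a figure from the slides, and the result is a floor for light in vacuum; real networks are slower.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458


def min_one_way_delay_ms(distance_km):
    """Smallest possible one-way delay over a distance, in milliseconds."""
    return distance_km / SPEED_OF_LIGHT_KM_PER_S * 1000.0


distance_km = 4000.0  # hypothetical distance between two data centers
one_way = min_one_way_delay_ms(distance_km)
print(round(one_way, 2))      # about 13.34 ms one way
print(round(2 * one_way, 2))  # about 26.69 ms for a request and its reply
```

Any protocol that needs a round trip between those two sites before answering cannot respond faster than the second figure.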
One solution: Paxos
• Protocol created by Leslie Lamport
• Can be used to introduce logical clock
– all nodes agree to always increase the number
• Which again can order events
• Allowing all nodes to agree on order of events
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#paxos-simple
Another solution: ev. consistency
• Essentially, this solution says some % of
errors is acceptable
– Amazon may have to compensate purchasers with
a gift card once every 100,000 transactions
– business value of remaining available is higher
than cost of errors
• Even in banking this may apply
– ATMs often continue working even if contact with
bank is lost
– allow withdrawals up to some limit
– accept it if customers overdraw their accounts
– (they’ll have to pay fees and interest, anyway)
CALM
• Client code may have to compensate for
database inconsistencies
– this can quickly become complex
– complex means error-prone
• However, there is a way around that
– CALM = consistency as logical monotonicity
– means facts used by clients to make decisions
never change
• A database that never deletes or
overwrites is CALM
– for example because it logs trades or other events
– non-CALM databases may require ACID
A CALM example
• Client 1 reads A = 10
• Client 1 uses this information to write B = 5
• If Client 2 now reads B = 5, that client
cannot read a value of A older than A = 10
• Doing so would violate what’s known as
“causal consistency”
ACID 2.0
• A bit misleading, because it’s not really
ACID at all
• Basically requires update operations to
have these properties
– associativity a + (b + c) = (a + b) + c
– commutativity a + b = b + a
– idempotence f(x) = f(f(x))
– distributed
• One approach is to use datatypes which
guarantee these properties
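The three algebraic properties can be tested by brute force on sample values. This sketch treats an update as a merge of a state with an incoming value; the helper names and samples are illustrative, and passing on samples is evidence rather than proof.

```python
from itertools import product


def is_associative(merge, samples):
    return all(merge(a, merge(b, c)) == merge(merge(a, b), c)
               for a, b, c in product(samples, repeat=3))


def is_commutative(merge, samples):
    return all(merge(a, b) == merge(b, a)
               for a, b in product(samples, repeat=2))


def is_idempotent(merge, samples):
    # Applying the same update twice must equal applying it once.
    return all(merge(merge(state, update), update) == merge(state, update)
               for state, update in product(samples, repeat=2))


def check(merge, samples):
    return (is_associative(merge, samples),
            is_commutative(merge, samples),
            is_idempotent(merge, samples))


numbers = [0, 1, 2, 5]
sets = [frozenset(), frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})]

print(check(max, numbers))                 # (True, True, True)
print(check(lambda a, b: a | b, sets))     # (True, True, True)
print(check(lambda a, b: a + b, numbers))  # (True, True, False)
```

Maximum and set union satisfy all three, which is why they are common building blocks for such datatypes; plain addition is not idempotent, so a repeated message would be counted twice.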
SDShare updates
• These are actually
– associative
– commutative
– idempotent
• So SDShare is already ACID 2.0...
Another solution: datatypes
• CRDTs
– commutative, replicated data types
• Basically,
– data types designed so that the order of
operations doesn’t matter
– the end result is always the same, regardless of the
order of operations
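A grow-only counter is a commonly cited example of such a datatype; it is not taken from the slides. In this sketch each node counts only its own increments and replicas merge by taking the per-node maximum, so merges can arrive in any order and any number of times. The class name `GCounter` is illustrative.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by per-slot maximum."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node id -> increments observed from that node

    def increment(self, amount=1):
        # A node only ever advances its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Per-slot max is associative, commutative and idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # a repeated merge changes nothing
print(a.value(), b.value())  # 5 5
```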
An example
• A problem with our cash withdrawal
example is the overwrite operation
• What if the operation were
“increment(account, -100)” instead?
– this operation is associative and commutative
– (not inherently idempotent, however)
• Nodes can now apply incoming updates as
they get them
– the ordering of updates can be ignored
– once all updates are applied, all nodes will agree
(which is eventual consistency)
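The increment-based version of the withdrawal example can be sketched as below. Since increments are not inherently idempotent, this sketch assumes every operation carries a unique id so that a redelivered message is ignored; the ids, the `Account` class and the starting balance of 100 are illustrative choices.

```python
class Account:
    def __init__(self, balance):
        self.balance = balance
        self.seen = set()  # ids of operations already applied

    def apply(self, op_id, delta):
        # Skipping known ids makes redelivery harmless (idempotence).
        if op_id in self.seen:
            return
        self.seen.add(op_id)
        self.balance += delta


withdrawal_1 = ("w1", -100)
withdrawal_2 = ("w2", -100)

node_1 = Account(100)
node_2 = Account(100)

# Each node sees the updates in a different order; node 1 also gets a duplicate.
for op in (withdrawal_1, withdrawal_2, withdrawal_1):
    node_1.apply(*op)
for op in (withdrawal_2, withdrawal_1):
    node_2.apply(*op)

print(node_1.balance, node_2.balance)  # -100 -100
```

Both nodes end at -100 rather than 0, so the double withdrawal shows up as an overdraft instead of disappearing in an overwrite.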
Thus far, all is well, and everyone agrees
“To go wildly faster, one
must remove all four
sources of the overhead
discussed above. This is
possible in either a SQL
context or some other
context.”
What on earth is he talking about?
Everyone knows SQL and ACID don’t scale!
Meanwhile, at
Google...
AdWords database difficulties
This backend was originally based on a MySQL database that
was manually sharded many ways. The uncompressed dataset
is tens of terabytes, which is small compared to many NoSQL
instances, but was large enough to cause difficulties with
sharded MySQL. The MySQL sharding scheme assigned each
customer and all related data to a fixed shard. This layout
enabled the use of indexes and complex query processing on a
per-customer basis, but required some knowledge of the
sharding in application business logic. Resharding this
revenue-critical database as it grew in the number of customers
and their data was extremely costly. The last resharding took
over two years of intense effort, and involved coordination and
testing across dozens of teams to minimize risk.
More background
We store financial data and have hard requirements on
data integrity and consistency. We also have a lot of
experience with eventual consistency systems at
Google. In all such systems, we find developers spend a
significant fraction of their time building extremely
complex and error-prone mechanisms to cope with
eventual consistency and handle data that may be out
of date. We think this is an unacceptable burden to
place on developers and that consistency problems
should be solved at the database level.
Yet more background
At least 300 applications within Google use
Megastore (despite its relatively low performance)
because its data model is simpler
to manage than Bigtable’s, and because of its
support for synchronous replication across
datacenters. (Bigtable only supports
eventually-consistent replication across datacenters.)
Examples of well-known Google
applications that use Megastore are Gmail,
Picasa, Calendar, Android Market, and
AppEngine.
Requirements
• Scalability
– scale simply by adding hardware
– no manual sharding
• Availability
– no downtime, for any reason
• Consistency
– full ACID transactions
• Usability
– full SQL with indexes
Uh, didn’t we just learn
that this is impossible?
Spanner
• Globally-distributed semi-relational db
– SQL as query language
– versioned data with non-locking read-only
transactions
• Externally consistent reads/writes
• Atomic schema updates
– even while transactions are running
• High availability
– experiment: killing 25 out of 125 servers has no
effect (except on throughput)
Transaction model
• Fairly close to traditional MVCC
– every row has a timestamp
– reads have associated timestamp, see database as
of that point in time
• The key is a consistent order of timestamps
across nodes
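Reading "as of" a timestamp can be illustrated with a tiny multi-version store. This is a generic MVCC-style sketch, not Spanner's storage format; the `VersionedStore` class and its integer timestamps are assumptions made for the example.

```python
import bisect


class VersionedStore:
    """Keeps every version of every key, ordered by timestamp."""

    def __init__(self):
        self._versions = {}  # key -> (sorted timestamps, matching values)

    def write(self, key, value, timestamp):
        stamps, values = self._versions.setdefault(key, ([], []))
        index = bisect.bisect_right(stamps, timestamp)
        stamps.insert(index, timestamp)
        values.insert(index, value)

    def read(self, key, timestamp):
        # Return the newest version written at or before the read timestamp.
        stamps, values = self._versions.get(key, ([], []))
        index = bisect.bisect_right(stamps, timestamp)
        return values[index - 1] if index else None


store = VersionedStore()
store.write("balance", 100, timestamp=10)
store.write("balance", 0, timestamp=20)

print(store.read("balance", 15))  # 100: the write at 20 is not visible yet
print(store.read("balance", 25))  # 0
print(store.read("balance", 5))   # None: nothing had been written
```

Readers never block writers in this scheme, but it only gives sensible answers if all nodes assign timestamps in a consistent order.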
Spanner architecture
Spanner architecture #2
TrueTime
• Enables consistency in Spanner by giving
transactions timestamps
– that is, imposes a consistent ordering on
transactions
• Represents time with uncertainty interval
– the bigger the uncertainty, the more careful nodes
must be
– bigger uncertainty leads to slower transactions
• Uses two kinds of time servers to reduce
uncertainty
– GPS-based servers
– atomic clocks
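Why larger uncertainty means slower transactions can be seen in a toy model, loosely based on the commit-wait idea described in the Spanner paper. Everything here is a simulation under stated assumptions: `SimulatedTrueTime`, `FakeClock`, the 4 ms epsilon and the 1 ms polling step are hypothetical and are not Google's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Toy stand-in for a clock that reports an uncertainty interval."""

    def __init__(self, clock, epsilon):
        self.clock = clock      # callable returning local time in seconds
        self.epsilon = epsilon  # assumed worst-case clock error in seconds

    def now(self):
        t = self.clock()
        return Interval(t - self.epsilon, t + self.epsilon)

    def after(self, timestamp):
        # True only once the timestamp has definitely passed.
        return self.now().earliest > timestamp


def commit(tt, sleep, step=0.001):
    """Pick a commit timestamp, then wait out the uncertainty before returning."""
    timestamp = tt.now().latest
    waited = 0.0
    while not tt.after(timestamp):
        sleep(step)
        waited += step
    return timestamp, waited


class FakeClock:
    """Manually advanced clock so the example runs instantly."""

    def __init__(self, start=100.0):
        self.t = start

    def __call__(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


clock = FakeClock()
tt = SimulatedTrueTime(clock, epsilon=0.004)
timestamp, waited = commit(tt, clock.sleep)
print(timestamp, waited)  # the wait is roughly twice epsilon
```

In this model the wait grows linearly with epsilon, which is the motivation for investing in better time sources.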
Use of Paxos is key
• Combines Paxos with TrueTime to ensure
timestamps are monotonically increasing
• Paxos requires majority votes to agree
– implies less than half of data centers can fail at any
one time
• AdWords therefore runs with 5 data centers
– allows two simultaneous failures without effect
– three on East Coast, two on West Coast
– (in Google East Coast + West Coast = globally)
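The arithmetic behind "5 data centers, two failures" is just majority counting. A small sketch with illustrative function names:

```python
def majority(replicas):
    """Smallest number of votes that is more than half."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """How many replicas can be down while a majority can still vote."""
    return replicas - majority(replicas)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

Going from four to five replicas buys one more tolerated failure, while going from three to four buys none, which is why odd replica counts are the usual choice.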
Data model
F1 – the next layer up
• Builds on Spanner, adds
– distributed SQL queries
– including joins from external sources
– transactionally consistent indexes
– asynchronous schema changes
– optimistic transactions
– automatic change history
Why built-in change history?
“Many database users build mechanisms to log changes, either from
application code or using database features like triggers. In the MySQL system
that AdWords used before F1, our Java application libraries added change
history records into all transactions. This was nice, but it was inefficient and
never 100% reliable. Some classes of changes would not get history records,
including changes written from Python scripts and manual SQL data
changes.”
Application code is not enough to enforce business rules,
because many important changes are made behind the
application code. For example, data conversion.
Look at any database that’s a few years old, and you’ll
find data disallowed by the application code, but allowed
by the schema.
Distributed queries
Two interfaces
• NoSQL interface
– basically a simple key->row lookup
– simpler in code for object lookup
– faster because no SQL parsing
• Full SQL interface
– good for analytics and more complex interactions
Status
• >100 terabyte of uncompressed data
– distributed across 5 data centers
– Five nines (99.999%) uptime
• Serves up to hundreds of thousands of
requests/second
• SQL queries scan trillions of rows/day
• No observable increase of latency
compared to MySQL-based backend
– but change tracking and sharding now invisible to
application
Winding up
Conclusion
• NoSQL is mostly about BASE
– to some degree also schemalessness
• The CAP Theorem is key to understanding
distributed systems
– NoSQL is BASE because of CAP
• The CAP Theorem is a consequence of the
theory of relativity
• New systems seem to indicate that ACID
may scale, after all
– basically, the speed of light is greater than we
thought
Further reading
• NoSQL eMag, InfoQ, pilot issue May 2013
– http://www.infoq.com/minibooks/emag-NoSQL
• Brewer’s original presentation
– http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
• Proof by Lynch & Gilbert
– http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
• Why E=mc2?, Cox & Forshaw
• Eventual Consistency Today: Limitations,
Extensions, and Beyond, ACM Queue
– http://queue.acm.org/detail.cfm?id=2462076
• Spanner paper
– http://research.google.com/archive/spanner.html
• F1 papers
– http://research.google.com/pubs/pub38125.html
– http://research.google.com/pubs/pub41376.html

 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Último (20)

SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

NoSQL databases, the CAP theorem, and the theory of relativity

  • 1. NoSQL, CAP, and relativity 2013-09-18 Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga 1
  • 2. Agenda 2 • CAP theorem mostly • A bit about NoSQL databases • Are triple stores NoSQL? • Connection with Einstein’s theory of relativity • And, finally, a surprise
  • 4. What makes a NoSQL database? 4 • Doesn’t use SQL as query language – usually more primitive query language – sometimes key/value only • BASE rather than ACID – that is, sacrifices consistency for availability – much more about this later • Schemaless – that is, data need not conform to a predefined schema
  • 5. BASE vs ACID • ACID – Atomicity – Consistency – Isolation – Durability • BASE – Basically Available – Soft-state – Eventual consistency 5
  • 6. Eventual consistency • A key property of non-ACID systems • Means – if no further changes made, – eventually all nodes will be consistent • In itself eventual consistency is a very weak guarantee – when is “eventually”? it doesn’t say – in practice it means the system can be inconsistent at any time • Stronger guarantees are sometimes made – with prediction and measuring, actual behaviour can be quantified – in practice, systems often appear strongly consistent 6
  • 7. Implementing ev. consistency • Nodes must exchange information about writes – basically, after performing a write, node must inform all other replicas of the changed objects – signal OK to reader before or during replication – for example, by broadcast to all nodes • Must have some way to deal with conflicts – all nodes must agree on conflict resolution – common solution: embed clock value in write message, then let last writer win – clock need not be in sync for all nodes 7
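The last-writer-wins rule on this slide can be sketched as follows. This is a minimal illustration, not the design of any particular database: the `Replica` class, the `(clock, writer id)` stamp, and the tie-break on writer id are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One replica holding a single value, merged by last-writer-wins."""
    name: str
    value: object = None
    stamp: tuple = (0, "")  # (clock value, writer id); the id breaks ties

    def write(self, value, clock):
        """Accept a local write and return the message to send to the other replicas."""
        message = (value, (clock, self.name))
        self.receive(message)
        return message

    def receive(self, message):
        """Apply a write message only if its stamp is newer than the one we hold."""
        value, stamp = message
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp


a, b = Replica("a"), Replica("b")
m1 = a.write("x=1", clock=10)
m2 = b.write("x=2", clock=12)

# The two messages reach the replicas in different orders.
a.receive(m2)
b.receive(m1)
print(a.value, b.value)
```

Because every replica applies the same comparison, both end up with the write carrying the highest stamp, whatever the delivery order. The write with the lower stamp is silently lost, which is the price of this conflict rule.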
  • 8. What’s wrong with ACID? • Semantics are easier for developers – applications can lean back and just trust the db • However, doesn’t scale as well – that is, doesn’t scale with number of nodes – requires too much communication and agreement between nodes • Bigger web sites have therefore gone BASE – Facebook, Flickr, Twitter, ... 8
  • 9. Other benefits of NoSQL • Schemaless – possible to write much more flexible code – schema evolution vastly easier • Avoid joins – document databases allow hierarchical non-normalized objects to be retrieved directly 9
  • 10. Downsides to NoSQL • Everyone knows SQL – few people know your specific NoSQL database • Lack of validation – code will typically do anything the database lets it get away with (especially over time) • No standards – you can’t easily switch databases – (well, except with SPARQL) • Lack of maturity – lack of supporting tools, unpleasant surprises, ... • Weak query languages – means you have to do more in code – may hurt performance 10
  • 11. Triple stores (SPARQL) • Non-SQL? Yes • BASE? No • Schemaless? Yes 11 Only two out of three, so whether triple stores are NoSQL databases is debatable. At the very least, they differ substantially from the core examples of NoSQL databases.
  • 12. Can triple stores be BASE? • In theory, yes – nothing inherent in graph structure that prevents it • But, – how do you shard graph data? – no known way to do it that’s efficient • So, in practice this is hard 12
  • 13. When should you use NoSQL? • If scalability is a concern – however, don’t forget that sites like Flickr and Wikipedia used RDBMSs for years – relational databases scale a long way • If schemalessness is important – sometimes it really is – seriously consider RDF for this use case • If fashion is a concern – for a surprising number of people, using something new and shiny is the main thing 13
  • 15. CAP 15 • Consistency – all nodes always give the same answer • Availability – nodes always answer queries and accept updates • Partition-tolerance – system continues working even if one or more nodes go quiet CAP Theorem: You can only have two of these. Partition tolerance: Without this, the cluster dies the moment one node goes silent. Can’t really drop this one.
  • 16. C ≠ C 16 • C in ACID – means all data obeys constraints in schema • C in CAP – means all servers agree on the data • However, – ACID implementations also follow the C in CAP
  • 17. History 17 • First formulated by Eric Brewer in 2000 – based on experience with Inktomi search engine – described the SQL/NoSQL divide very well – coined the BASE acronym • Formalized and proven in 2002 – by Seth Gilbert and Nancy Lynch • Today CAP is better understood – widely considered a key tradeoff in designing distributed systems – and particularly databases – in some ways gave rise to NoSQL databases
  • 19. Consistency 19 DB node 1 DB node 2 Client 1 Client 2 read account X balance -> 100 set account X balance = 0 set account X balance = 0 set account X balance = 0 set account X balance = 0
  • 20. Availability 20 DB node 1 DB node 2 Client 1 Client 2 read account X balance -> 100 set account X balance = 0 set account X balance = 0 set account X balance = 0 read account X balance -> 100 set account X balance = 0 Happy customer walks away, richer by 200. Servers eventually agree balance is 0.
  • 21. What exactly is the problem? 21 • The ordering of events affects the outcome • The different nodes do not necessarily observe the same order • However, it is possible to impose a consistent ordering of events • The trouble is, that involves communication with all nodes • Waiting for all of them takes time
  • 23. Time, Clocks and the Ordering of Events in a Distributed System 23 The origin of this paper was a note titled The Maintenance of Duplicate Databases by Paul Johnson and Bob Thomas. I believe their note introduced the idea of using message timestamps in a distributed algorithm. I happen to have a solid, visceral understanding of special relativity (see [5]). This enabled me to grasp immediately the essence of what they were trying to do. ... I realized that the essence of Johnson and Thomas's algorithm was the use of timestamps to provide a total ordering of events that was consistent with the causal order. ... It didn't take me long to realize that an algorithm for totally ordering events could be used to implement any distributed system. http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
  • 24. Order theory 24 • An ordering relation is any relation ≤ such that – a ≤ a (reflexivity) – if a ≤ b and b ≤ a then a = b (antisymmetry) – if a ≤ b and b ≤ c then a ≤ c (transitivity) • A total order is an order such that – a ≤ b or b ≤ a (totality) • A partial order is any order which is not total – that is, for some pairs a and b, neither a ≤ b nor b ≤ a
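The totality condition on this slide can be checked mechanically on small examples. The sketch below uses a hypothetical helper, `is_total`, and relies on Python's `<=` meaning "is a subset of" for `frozenset`. It compares the usual ordering of numbers with the subset ordering.

```python
from itertools import combinations


def is_total(elements, leq):
    """Return True if every pair of elements is comparable under leq."""
    return all(leq(x, y) or leq(y, x) for x, y in combinations(elements, 2))


numbers = [3, 1, 2]
subsets = [frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 2})]

numbers_total = is_total(numbers, lambda x, y: x <= y)
subsets_total = is_total(subsets, lambda x, y: x <= y)  # subset test on frozensets

print(numbers_total, subsets_total)
```

The subset ordering fails the check because `{1}` and `{2}` are incomparable: neither is a subset of the other.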
  • 25. Examples 25 • Total orders – normal ordering of numbers and letters • Partial orders – ordering sets by the subset relation (see figure) – “In general, the ·order-relation· on duration is a partial order since there is no determinate relationship between certain durations such as one month (P1M) and 30 days (P30D)” XML Schema, pt2
  • 27. History 27 • 1687 – Isaac Newton publishes Philosophiæ Naturalis Principia Mathematica – physics begins – no changes over next two centuries • 1905 – Albert Einstein publishes special relativity – abandons notions of fixed time and space • 1916 – Einstein’s general relativity – takes into account gravity – no changes since
  • 29. The barn is too small 29 • Three people (M, F, and B) own a board (5m wide) and a barn (4m wide) • The board doesn’t fit inside the barn! • What to do?
  • 31. We have a solution! 31 As seen by F & B: board is 4m long (relativistic shortening), barn continues to be 4m wide. When the board is exactly inside the barn, F and B will close their doors simultaneously, and the problem will be solved. (As seen by M: board is 5m long (at rest relative to him), barn shortened to 3.2m wide. Pay no attention to this.)
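The lengths quoted on this slide follow from the standard length-contraction formula, L = L0 / γ with γ = 1 / sqrt(1 − β²). The speed is not stated on the slide; the sketch below assumes β = 0.6 (60% of the speed of light), which is the value implied by a 5m board appearing 4m long.

```python
import math


def gamma(beta):
    """Lorentz factor for a speed given as a fraction of the speed of light."""
    return 1.0 / math.sqrt(1.0 - beta ** 2)


beta = 0.6                          # assumed speed, inferred from 5m -> 4m
board_rest, barn_rest = 5.0, 4.0    # rest lengths in metres, from the slides

board_seen_from_barn = board_rest / gamma(beta)  # what F and B measure
barn_seen_from_board = barn_rest / gamma(beta)   # what M measures

print(round(board_seen_from_barn, 3), round(barn_seen_from_board, 3))
```

Under that assumption the same factor of 1.25 accounts for both numbers on the slide: the board shrinks to 4m for F and B, and the barn shrinks to 3.2m for M.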
  • 32. What they observe F and B • When the board is just inside, both close their doors simultaneously • Right after, the board crashes through back door M • B shuts his door just as the front of the board reaches him • 0.6 seconds later, F closes his door • Board crashes through back door 32
  • 33. The key point • They don’t agree on the order of events! – and this is not a paradox – it is in fact how the universe works • Change the story slightly and the three people could have three different orders of events • No total order of events exists on which all observers can agree – the ordering of events in the universe is a partial order • What then of causality? – if A causes B, but some people think B happened before A, then what? 33
  • 34. Resolution 34 • A, B, and C are events • The cone is the “light cone” from A – that is, the spread of light from A • C is outside the cone – therefore A cannot influence C – observers may disagree on order of A&C • B is inside the cone – therefore A can influence B – observers may not disagree on order
  • 37. Relevance to the CAP theorem 37 • Distributed nodes will never agree on the order of events – unless a communication delay is introduced • Basically, only events inside the “light cone” can be totally ordered – communications delay can never be less than time taken by light to traverse physical distance – in practice, it will be quite a bit bigger – how big depends on hardware and design constraints
  • 38. One solution: Paxos 38 • Protocol created by Leslie Lamport • Can be used to introduce logical clock – all nodes agree to always increase the number • Which again can order events • Allowing all nodes to agree on order of events http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#paxos-simple
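To make the idea of a logical clock concrete, here is a minimal sketch of the clock from Lamport's "Time, Clocks" paper quoted earlier. It is not an implementation of Paxos; the `LamportClock` class and the node names are assumptions made for the example.

```python
class LamportClock:
    """Logical clock: a counter that only ever increases."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.time = 0

    def tick(self):
        """Stamp a local event (or a message about to be sent)."""
        self.time += 1
        return (self.time, self.node_id)

    def receive(self, stamp):
        """Stamp the receipt of a message, jumping past the sender's time."""
        self.time = max(self.time, stamp[0]) + 1
        return (self.time, self.node_id)


n1, n2 = LamportClock("n1"), LamportClock("n2")
e1 = n1.tick()        # local event on n1
e2 = n1.tick()        # n1 sends a message stamped e2
e3 = n2.receive(e2)   # n2 receives it
e4 = n2.tick()        # later local event on n2

# Sorting by (time, node id) gives every node the same total order.
ordered = sorted([e4, e3, e1, e2])
print(ordered)
```

The stamps respect causality (a receive is always stamped later than the matching send), and the node id breaks ties, so all nodes that see the same stamps sort them identically.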
  • 39. Another solution: ev. consistency 39 • Essentially, this solution says some % of errors is acceptable – Amazon may have to compensate purchasers with a gift card once every 100,000 transactions – business value of remaining available is higher than cost of errors • Even in banking this may apply – ATMs often continue working even if contact with bank is lost – allow withdrawals up to some limit – accept it if customers overdraw – (they’ll have to pay fees and interest, anyway)
  • 40. CALM 40 • Client code may have to compensate for database inconsistencies – this can quickly become complex – complex means error-prone • However, there is a way around that – CALM = consistency as logical monotonicity – means facts used by clients to make decisions never change • A database that never deletes or overwrites is CALM – for example because it logs trades or other events – non-CALM databases may require ACID
  • 41. A CALM example 41 • Client 1 reads A = 10 • Client 1 uses this information to write B = 5 • If Client 2 now reads B = 5, that client cannot read a value of A older than A = 10 • Doing so would violate what’s known as “causal consistency”
  • 42. ACID 2.0 42 • A bit misleading, because it’s not really ACID at all • Basically requires update operations to have these properties – associativity a + (b + c) = (a + b) + c – commutativity a + b = b + a – idempotence f(x) = f(f(x)) – distributed • One approach is to use datatypes which guarantee these properties
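The three algebraic properties can be spot-checked for a candidate update operation. The sketch below uses set union as the operation, an assumption chosen for illustration because it happens to satisfy all three. Checking a handful of samples is evidence, not proof.

```python
from itertools import product


def merge(a, b):
    """Candidate update operation: set union."""
    return a | b


samples = [frozenset(), frozenset({1}), frozenset({2, 3}), frozenset({1, 3})]

associative = all(
    merge(a, merge(b, c)) == merge(merge(a, b), c)
    for a, b, c in product(samples, repeat=3)
)
commutative = all(
    merge(a, b) == merge(b, a) for a, b in product(samples, repeat=2)
)
# Idempotence: applying the same update twice equals applying it once.
idempotent = all(
    merge(merge(state, update), update) == merge(state, update)
    for state, update in product(samples, repeat=2)
)

print(associative, commutative, idempotent)
```

Replacing `merge` with addition on integers would keep the first two checks passing but fail the third, which is the gap the increment example later in the deck points out.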
  • 43. SDShare updates 43 • These are actually – associative – commutative – idempotent • So SDShare is already ACID 2.0...
  • 44. Another solution: datatypes 44 • CRDTs – commutative, replicated data types • Basically, – data types designed so that the order of operations doesn’t matter – the end result is always the same, regardless of the order of operations
  • 45. An example 45 • A problem with our cash withdrawal example is the overwrite operation • What if the operation were “increment(account, -100)” instead? – this operation is associative and commutative – (not inherently idempotent, however) • Nodes can now apply incoming updates as they get them – the ordering of updates can be ignored – once all updates are applied, all nodes will agree (which is eventual consistency)
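A small simulation of this slide's point: increments can be applied in any order, and idempotence can be added by remembering which operations have already been applied. The `Account` class and the operation ids are hypothetical, introduced only for this sketch.

```python
import itertools


class Account:
    """Replica of one account that applies increments in any order."""

    def __init__(self, balance):
        self.balance = balance
        self.seen = set()  # ids of operations already applied

    def apply(self, op_id, delta):
        """Apply an increment once; repeated deliveries are ignored."""
        if op_id in self.seen:
            return
        self.seen.add(op_id)
        self.balance += delta


ops = [("w1", -100), ("w2", -100), ("d1", 50)]

final_balances = set()
for order in itertools.permutations(ops):
    node = Account(100)
    for op_id, delta in order:
        node.apply(op_id, delta)
    node.apply(*order[0])  # simulate a duplicate delivery
    final_balances.add(node.balance)

print(final_balances)
```

Every delivery order ends at the same balance, so the replicas converge without coordinating. The account can still go negative, though: commutativity removes the ordering problem, not the overdraft.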
  • 46. 46 Thus far, all is well, and everyone agrees
  • 47. 47 “To go wildly faster, one must remove all four sources of the overhead discussed above. This is possible in either a SQL context or some other context.”
  • 48. 48 What on earth is he talking about? Everyone knows SQL and ACID don’t scale!
  • 50. AdWords database difficulties 50 This backend was originally based on a MySQL database that was manually sharded many ways. The uncompressed dataset is tens of terabytes, which is small compared to many NoSQL instances, but was large enough to cause difficulties with sharded MySQL. The MySQL sharding scheme assigned each customer and all related data to a fixed shard. This layout enabled the use of indexes and complex query processing on a per-customer basis, but required some knowledge of the sharding in application business logic. Resharding this revenue-critical database as it grew in the number of customers and their data was extremely costly. The last resharding took over two years of intense effort, and involved coordination and testing across dozens of teams to minimize risk.
  • 51. More background 51 We store financial data and have hard requirements on data integrity and consistency. We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level.
  • 52. Yet more background 52 At least 300 applications within Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable’s, and because of its support for synchronous replication across datacenters. (Bigtable only supports eventually-consistent replication across datacenters.) Examples of well-known Google applications that use Megastore are Gmail, Picasa, Calendar, Android Market, and AppEngine.
  • 53. Requirements • Scalability – scale simply by adding hardware – no manual sharding • Availability – no downtime, for any reason • Consistency – full ACID transactions • Usability – full SQL with indexes 53 Uh, didn’t we just learn that this is impossible?
  • 54. Spanner 54 • Globally-distributed semi-relational db – SQL as query language – versioned data with non-locking read-only transactions • Externally consistent reads/writes • Atomic schema updates – even while transactions are running • High availability – experiment: killing 25 out of 125 servers has no effect (except on throughput)
  • 55. Transaction model 55 • Fairly close to traditional MVCC – every row has a timestamp – reads have associated timestamp, see database as of that point in time • The key is a consistent order of timestamps across nodes
  • 58. TrueTime 58 • Enables consistency in Spanner by giving transactions timestamps – that is, imposes a consistent ordering on transactions • Represents time with uncertainty interval – the bigger the uncertainty, the more careful nodes must be – bigger uncertainty leads to slower transactions • Uses two kinds of time servers to reduce uncertainty – GPS-based servers – atomic clocks
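Why a larger uncertainty interval slows transactions down can be shown with a toy model. This is not Spanner's API. The `Interval` type and both functions are hypothetical, and the waiting rule (hold a transaction until its timestamp is certainly in the past) is a simplified reading of the commit-wait idea described in the Spanner paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Clock reading with uncertainty: true time lies in [earliest, latest]."""
    earliest: float
    latest: float


def definitely_before(a, b):
    """True only if a certainly happened before b (the intervals do not overlap)."""
    return a.latest < b.earliest


def commit_wait(now):
    """Pick timestamp = now.latest and return how long to wait until it has surely passed."""
    timestamp = now.latest
    return timestamp - now.earliest


tight = Interval(99.0, 101.0)   # uncertainty of +/- 1 time unit
loose = Interval(96.0, 104.0)   # uncertainty of +/- 4 time units

print(commit_wait(tight), commit_wait(loose))
print(definitely_before(Interval(90.0, 95.0), loose))
```

In this model the wait is twice the uncertainty, so better time sources (the GPS and atomic-clock servers on the slide) translate directly into shorter waits.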
  • 59. Use of Paxos is key • Combines Paxos with TrueTime to ensure timestamps are monotonically increasing • Paxos requires majority votes to agree – implies less than half of data centers can fail at any one time • AdWords therefore runs with 5 data centers – allows two simultaneous failures without effect – three on East Coast, two on West Coast – (in Google East Coast + West Coast = globally) 59
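The arithmetic behind "5 data centers allows two simultaneous failures" is a simple majority calculation. The function names below are illustrative.

```python
def majority(n_sites):
    """Smallest number of sites that is more than half of n_sites."""
    return n_sites // 2 + 1


def tolerated_failures(n_sites):
    """How many sites can fail while a majority can still vote."""
    return n_sites - majority(n_sites)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
```

Going from 3 to 4 sites buys nothing (both tolerate one failure), which is why odd cluster sizes such as 5 are the usual choice.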
  • 61. F1 – the next layer up 61 • Builds on Spanner, adds – distributed SQL queries – including joins from external sources – transactionally consistent indexes – asynchronous schema changes – optimistic transactions – automatic change history
  • 62. Why built-in change history? 62 “Many database users build mechanisms to log changes, either from application code or using database features like triggers. In the MySQL system that AdWords used before F1, our Java application libraries added change history records into all transactions. This was nice, but it was inefficient and never 100% reliable. Some classes of changes would not get history records, including changes written from Python scripts and manual SQL data changes.” Application code is not enough to enforce business rules, because many important changes are made behind the application code. For example, data conversion. Look at any database that’s a few years old, and you’ll find data disallowed by the application code, but allowed by the schema.
  • 64. Two interfaces • NoSQL interface – basically a simple key->row lookup – simpler in code for object lookup – faster because no SQL parsing • Full SQL interface – good for analytics and more complex interactions 64
  • 65. Status • >100 terabytes of uncompressed data – distributed across 5 data centers – Five nines (99.999%) uptime • Serves up to hundreds of thousands of requests/second • SQL queries scan trillions of rows/day • No observable increase of latency compared to MySQL-based backend – but change tracking and sharding now invisible to application 65
  • 67. Conclusion 67 • NoSQL is mostly about BASE – to some degree also schemalessness • The CAP Theorem is key to understanding distributed systems – NoSQL is BASE because of CAP • The CAP Theorem is a consequence of the theory of relativity • New systems seem to indicate that ACID may scale, after all – basically, the speed of light is greater than we thought
  • 68. Further reading 68 • NoSQL eMag, InfoQ, pilot issue May 2013 – http://www.infoq.com/minibooks/emag-NoSQL • Brewer’s original presentation – http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf • Proof by Lynch & Gilbert – http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf • Why E=mc2?, Cox & Forshaw • Eventual Consistency Today: Limitations, Extensions, and Beyond, ACM Queue – http://queue.acm.org/detail.cfm?id=2462076
  • 69. Further reading 69 • Spanner paper – http://research.google.com/archive/spanner.html • F1 papers – http://research.google.com/pubs/pub38125.html – http://research.google.com/pubs/pub41376.html