This document summarizes the research paper on Spanner, Google's globally distributed database. Spanner provides strong consistency and transactions across globally distributed data, addressing Google's need for a scalable database to replace its manually sharded MySQL deployment. Spanner uses TrueTime to synchronize clocks across datacenters with bounded uncertainty, and it assigns timestamps to transactions using this synchronized time to ensure consistency. Spanner supports different types of transactions, namely read-write, read-only, and snapshot reads, through its consistency and concurrency-control mechanisms. Evaluation results show Spanner can provide low latency, high throughput, and availability even during leader failures.
2. Authors:
+ James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes,
Christopher Frost, … , Google, Inc.
Published in:
+ ACM Transactions on Computer Systems (TOCS), Volume 31, Issue 3,
August 2013.
+ USENIX Symposium on Operating Systems Design and
Implementation (OSDI) 2012, Best Paper.
3. Topics To Be Covered
+ Problem Definition
+ Spanner Database
+ Deployment and Spanserver Software Stack
+ Data Model
+ TrueTime and External Consistency
+ Transactions
+ Evaluation and Results
4. Problem – Need for Scalable MySQL
Google’s advertising backend
- Based on MySQL
• Relations
• Query language
- Manually sharded
• Resharding is very costly
- Global distribution
6. Overview of Available Solutions
Google Bigtable
• Scalability
• Throughput
• Performance
• Eventually-consistent replication support across data-centers
• Transactional scope limited to a single row
Google Megastore
• Replicated ACID transactions
• Schematized semi-relational tables
• Synchronous replication support across data-centers
• Relatively poor write performance
• Lack of a rich query language
7. Bridging the gap between Megastore and Bigtable
Solution: Google Spanner
Removes the need to manually partition data
Synchronous replication and automatic failover
Strong transactional semantics
8. What is Spanner?
Distributed multiversion database
+ General-purpose transactions (ACID)
+ SQL query language
+ Schematized tables
+ Semi-relational data model
Running in production
+ Storage for Google’s ad data
+ Replaced a sharded MySQL database
9. Example: Social Network
[Diagram: friend lists and user posts sharded (x1000) across servers in
datacenters in the US (San Francisco, Seattle, Arizona), Brazil (Sao Paulo,
Santiago, Buenos Aires), Russia (Moscow, Berlin, Krakow), and Spain (London,
Paris, Madrid, Lisbon).]
Sharding is a type of database partitioning that separates very large
databases into smaller, faster, more easily managed parts called data shards.
The word shard means a small part of a whole.
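A minimal Python sketch of hash-based shard routing, to make the definition concrete. This is illustrative only: the function name and shard count are hypothetical, and Spanner itself partitions data by key ranges (directories) rather than by hashing.

import hashlib

NUM_SHARDS = 1000  # matches the "x1000" servers per region in the diagram

def shard_for(key: str) -> int:
    # Hash the key so rows spread evenly across the shards.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user:42/friend_list"))  # some shard id in [0, 1000)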
10. Why Consistency matters?
Generate a page of friends’ recent posts
Consistent view of friend list and their posts
Single Machine
[Diagram: one machine holds the friend list and all friends' posts;
generating my page requires reading Friend1 … Friend1000 posts while
blocking writes.]
11. Multiple Machines
[Diagram: friend lists and user posts spread across several machines;
generating my page must read Friend1 … Friend1000 posts consistently,
blocking writes on every machine involved.]
12. Multiple Datacenters
[Diagram: the same data sharded x1000 across datacenters in the US, Spain,
Russia, and Brazil; generating my page now spans datacenters.]
14. Spanner Deployment - Universe
A Spanner deployment is called a universe.
Spanner is organized as a set of zones, where each zone is the
rough analog of a deployment of Bigtable servers.
15. Spanner Deployment - Universe
A zone has one zonemaster and several thousand spanservers.
The former assigns data to spanservers; the latter serve data
to clients.
Location proxies are used by clients to locate the spanservers
that serve their data.
16. Spanner Deployment - Universe
The universemaster has a console that displays status information.
The placement driver handles automated data movement across zones for
replication or load-balancing purposes.
18. Spanserver Software Stack
+ Each spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet.
+ A tablet maintains the following mapping:
(key: string, timestamp: int64) -> string
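A minimal Python sketch of the multiversion mapping above: each key holds several timestamped versions, and a read at timestamp t returns the latest version at or below t. The class and method names are hypothetical, and a real tablet persists this state on Colossus rather than in memory.

from bisect import bisect_right

class Tablet:
    def __init__(self):
        self._rows = {}  # key -> sorted list of (timestamp, value) versions

    def write(self, key, ts, value):
        self._rows.setdefault(key, []).append((ts, value))
        self._rows[key].sort()  # keep versions ordered by timestamp

    def read(self, key, ts):
        # Return the latest version of key with timestamp <= ts, else None.
        versions = self._rows.get(key, [])
        i = bisect_right(versions, (ts, chr(0x10FFFF)))
        return versions[i - 1][1] if i > 0 else None

t = Tablet()
t.write("user:1/name", 10, "Jerry")
t.write("user:1/name", 20, "Jerry M.")
print(t.read("user:1/name", 15))  # -> "Jerry" (version 20 is too new)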
19. Spanserver Software Stack
+ Data and logs are stored on Colossus (the successor of GFS).
+ A single Paxos state machine sits on top of each tablet to provide consistent replication.
+ Paxos is a consensus protocol, i.e., it lets all participants agree on a common value;
Spanner uses Paxos for consistent replication.
20. Spanserver Software Stack
+ The Paxos implementation supports long-lived leaders with time-based leader leases.
+ Writes must initiate the Paxos protocol at the leader.
+ Reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date.
+ A transaction manager supports distributed transactions.
23. Directories
+ A directory is the unit of data placement.
+ A directory is also the smallest unit whose geographic replication properties can
be specified by an application.
+ All data in a directory has the same replication configuration.
+ A Paxos group may contain multiple directories.
24. Directories
+ Spanner might move a directory:
- To shed load from a Paxos group.
- To put directories that are frequently accessed together into the same group.
- To move a directory into a group that is closer to its accessors.
26. Data Model
+ An application creates one or more databases in a universe.
+ Each database can contain an unlimited number of schematized tables.
+ Table
• Rows and columns
• Must have an ordered set of one or more primary-key columns
• The primary key uniquely identifies each row
+ Hierarchies of tables
• Tables must be partitioned by clients into one or more hierarchies of tables
• The table at the top of a hierarchy is a directory table
27. EXAMPLE
CREATE TABLE Albums {
  user_id INT64 NOT NULL, album_id INT64 NOT NULL,
  name STRING
} PRIMARY KEY (user_id, album_id);
CREATE TABLE Photo {
  user_id INT64 NOT NULL, album_id INT64 NOT NULL,
  photo_id INT64 NOT NULL, name STRING
} PRIMARY KEY (user_id, album_id, photo_id),
  INTERLEAVE IN PARENT Albums ON DELETE CASCADE;
+ Client applications declare the hierarchies in database schemas via the
INTERLEAVE IN declarations.
+ ON DELETE CASCADE says that deleting a row in the directory table
deletes any associated child rows.
30. Semi-Relational Data Model
+ Not purely relational
- Requires rows to have names
- Names are nothing but a set (possibly a singleton) of primary keys
- In a way, it is a key-value store with primary keys mapped to the non-key columns as values
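A tiny Python illustration of the point above, using the Albums/Photo schema from the earlier example: a row's "name" is its primary-key tuple, and the value is the remaining non-key columns. The data is purely illustrative.

# (user_id, album_id) -> non-key columns
albums = {
    (1, 100): {"name": "Summer 2012"},
    (1, 101): {"name": "Ski trip"},
}
# (user_id, album_id, photo_id) -> non-key columns; the shared key prefix
# (1, 100) is what INTERLEAVE IN PARENT exploits for locality.
photos = {
    (1, 100, 5000): {"name": "beach.jpg"},
}
print(albums[(1, 100)]["name"])  # -> "Summer 2012"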
33. Time Synchronization
+ Is synchronizing time at the global scale possible?
+ Synchronizing time within and between datacenters is extremely hard and uncertain.
+ Serializing requests is impossible at global scale.
34. Time Synchronization
+ Idea: accept uncertainty, keep it small, and quantify it (using GPS and atomic clocks).
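A minimal Python sketch of the TrueTime interval API used throughout the rest of the deck (TT.now(), TT.after(), TT.before()). The simulation is hypothetical: it fakes a fixed uncertainty epsilon around the local clock, whereas real TrueTime derives its bounds from GPS and atomic-clock masters.

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # true absolute time is guaranteed >= earliest
    latest: float    # ... and <= latest

class TrueTime:
    def __init__(self, epsilon=0.004):
        self.epsilon = epsilon  # assumed uncertainty bound, in seconds

    def now(self):
        t = time.time()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True iff absolute time t has definitely passed.
        return self.now().earliest > t

    def before(self, t):
        # True iff absolute time t has definitely not arrived yet.
        return self.now().latest < t

TT = TrueTime()
iv = TT.now()
print(iv.latest - iv.earliest)  # the interval width is 2 * epsilon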
38. How TrueTime Is Implemented
• A time daemon on each machine polls a variety of time masters:
• GPS masters chosen from nearby datacenters
• GPS masters from farther datacenters
• Armageddon masters (equipped with atomic clocks)
• The daemon reaches a consensus about the correct timestamp
from the masters it polls.
39. External Consistency
• External consistency: formally, if the commit of T1 precedes the
start of a new transaction T2 in wall-clock (physical) time,
then the commit of T1 must also precede the commit of T2 in the
serial order.
40. External Consistency
• Jerry unfriends Tom before writing a controversial comment:
T1: unfriend Tom (by Jerry)
T2: post comment (by Jerry)
T3: view Jerry's profile (by Tom)
• Physical-time order: T1, then T2, then T3.
• If the serial order is instead T2: post comment, T3: view Jerry's profile,
T1: unfriend Tom, then Jerry will be in trouble: Tom sees the comment
while still a friend!
41. External Consistency
• External consistency: formally, if the commit of T1 precedes the
start of a new transaction T2 in wall-clock (physical) time,
then the commit of T1 must also precede the commit of T2 in the
serial order.
45. Key Innovation
+ The Spanner implementation supports three types of transactions:
1. Read-write transactions
2. Read-only transactions
3. Snapshot reads
+ Standalone writes are implemented as read-write transactions;
+ non-snapshot standalone reads are implemented as read-only
transactions.
46. Snapshot Read
+ A read in the past without locking.
+ The client can specify a timestamp for the read, or an upper bound
on the timestamp.
+ Each replica tracks a value called the safe time t_safe, which is the
maximum timestamp at which the replica is up-to-date.
+ A replica can satisfy a read at any t ≤ t_safe.
47. Read-Only Transactions
+ Assign a timestamp s_read and do a snapshot read at s_read
(see the sketch below).
+ s_read = TT.now().latest
+ This guarantees external consistency.
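A sketch tying the last two slides together, reusing the Tablet and TrueTime classes from the earlier sketches (names are hypothetical): a replica refuses snapshot reads above its safe time, and a read-only transaction simply picks s_read = TT.now().latest and does a snapshot read there.

class Replica:
    def __init__(self, tablet, tt):
        self.tablet = tablet  # a Tablet, as sketched earlier
        self.tt = tt          # a TrueTime instance, as sketched earlier
        self.t_safe = 0       # max timestamp at which this replica is up-to-date

    def snapshot_read(self, key, ts):
        # Lock-free read in the past: only valid at or below the safe time.
        if ts > self.t_safe:
            raise RuntimeError("replica not up-to-date at ts; retry or wait")
        return self.tablet.read(key, ts)

    def read_only_txn(self, key):
        s_read = self.tt.now().latest  # guarantees external consistency
        return self.snapshot_read(key, s_read)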
48. Read-Write Transactions
+ Leader must only assign timestamps within the interval of its
leader lease
+ Timestamps must be assigned in monotonically increasing order.
+ If transaction T1 commits before T2 starts, T2’s commit
timestamp must be greater than T1’s commit timestamp
49. Read-Write Transactions
+ Clients buffer writes.
+ The client chooses a coordinator group that initiates two-phase
commit.
+ Each non-coordinator participant leader chooses a prepare
timestamp, logs a prepare record through Paxos, and notifies
the coordinator.
50. Read-Write Transactions
+ The coordinator assigns a commit timestamp s_i no less than all
prepare timestamps and no less than TT.now().latest.
+ The coordinator ensures that clients cannot see any data
committed by T_i until TT.after(s_i) is true. This is done by commit
wait (waiting until absolute time has passed s_i before committing);
see the sketch below.
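A minimal sketch of the coordinator's timestamp choice and commit wait, reusing the TrueTime class sketched earlier. The function is hypothetical and omits Paxos logging, leases, and locking:

import time

def coordinator_commit(tt, prepare_timestamps):
    # s must be >= every participant's prepare timestamp and >= TT.now().latest.
    s = max(max(prepare_timestamps), tt.now().latest)
    # Commit wait: no client may see the commit until s has definitely
    # passed, i.e. until TT.after(s) is true.
    while not tt.after(s):
        time.sleep(0.001)
    return s  # now safe to apply writes, release locks, and ack at timestamp s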
51. Timestamps, Global Clock
+Strict two-phase locking for write transactions
+Assign timestamp while locks are held
[Diagram: transaction T acquires locks, picks s = now(), then releases locks.]
52. Timestamps and TrueTime
[Diagram: transaction T acquires locks, picks s = TT.now().latest, then
commit-waits until TT.now().earliest > s before releasing locks; the pick
and the wait each cost an average of ε.]
53. Commit Wait and 2-Phase Commit
[Diagram: coordinator TC and participants TP1, TP2 each acquire locks; each
participant computes a prepare timestamp s, starts logging, and reports
"prepared" when logging is done; TC computes the overall s, commits,
notifies the participants of s, and performs commit wait; all then release
locks.]
54. Example
[Diagram: a two-phase-commit transaction removes X from my friend list
(TC, sC = 6) and removes me from X's friend list (TP, sP = 8); the overall
commit timestamp is s = 8. A later transaction T2 makes a risky post P at
s = 15. A snapshot read at time < 8 still sees [X] in my friend list and
[me] in X's; at 8 both friend lists are empty, and at 15 my posts contain
[P].]
55. Assign Timestamp (R/O)
+ Simple: s_read = TT.now().latest
+ Can read at s_read if s_read ≤ t_safe, where
t_safe = min(t_safe^Paxos, t_safe^TM)
+ t_safe^TM = min_i(s_i,g^prepare) − 1, taken over the transactions
prepared (but not yet committed) at group g (see the sketch below).
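A small Python sketch of the safe-time rule reconstructed above (illustrative; timestamps are plain numbers here):

def safe_time(t_safe_paxos, prepared_timestamps):
    # A prepared-but-uncommitted transaction may still commit at (or above)
    # its prepare timestamp, so reads must stay strictly below the earliest one.
    t_safe_tm = min(prepared_timestamps) - 1 if prepared_timestamps else float("inf")
    return min(t_safe_paxos, t_safe_tm)

print(safe_time(100, [95, 120]))  # -> 94
print(safe_time(100, []))         # -> 100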
56. Assign Timestamp (Paxos)
+ A read within one Paxos group: s_read = LastTS(), the timestamp of the
last committed write at that group.
+ Reads across multiple Paxos groups: s_read = TT.now().latest,
+ which may wait for the safe time to advance.
62. Evaluation
+ Spanner’s performance
+ replication
+ transactions
+ availability
+ Some insight on TrueTime behavior
+ Case study of their first client, F1
63. Test Bench
+ Experiments ran on timeshared machines
+ Each spanserver ran on scheduling units of:
+ 4GB RAM
+ 4 cores (AMD Barcelona 2200MHz)
+ Clients were run on separate machines
64. Test Bench
+ Each zone contained 1 spanserver
+ Clients and zones were placed in a set of data
centers with network distance of less than 1ms
+ Test database:
+ 50 Paxos groups
+ 2500 directories
65. Test Bench
+ Operations were standalone reads and writes
of 4KB
+ All reads were served out of memory after a
compaction, so that only the overhead of
Spanner's call stack was measured
+ 1 unmeasured round of reads was done first to
warm location caches
66. Latency
+ Clients issued sufficiently few operations to avoid
queuing at the servers
+ From the 1-replica experiments:
+ Commit wait is about 5ms
+ Paxos latency is about 9ms
68. Latency
+ As the number of replicas increases:
+ latency stays roughly constant
+ with less standard deviation,
+ because latency to achieve a quorum becomes less
sensitive to slowness at one replica
69. Throughput
+ Clients issued sufficiently many operations so as to saturate
the servers’ CPUs
+ Snapshot reads can execute at any up-to-date replicas, so
their throughput increases almost linearly with the number
of replicas
+ Single-read read-only transactions only execute at leaders
because timestamp assignment must happen at leaders
+ Read-only-transaction throughput increases with the
number of replicas because the number of effective
spanservers increases
71. Throughput
+ Write throughput benefits from the same experimental
artifact (which explains the increase in throughput from 3 to
5 replicas)
+ But as replicas increase, this benefit is outweighed by the
linear increase in the amount of work performed per write
73. Throughput
+ two-phase commit can scale to a reasonable number of
participants:
+ Experiments with 3 zones, each with 25 spanservers
+ Scaling up to 50 participants is reasonable in both mean
and 99th-percentile
+ But latencies start to rise noticeably at 100 participants
74. Availability
+ The test universe: 5 zones named Zi,
+ with 25 spanservers each.
+ The test database was sharded into 1250 Paxos groups
+ 100 test clients
+ constantly issued non-snapshot reads at an aggregate
rate of 50K reads/second.
75. Availability
+ All of the leaders were explicitly placed in Z1.
+ Five seconds into each test, all of the servers in one zone
were killed:
+ non-leader kills Z2
+ leader-hard kills Z1
+ leader-soft kills Z1
77. Availability
+ Non-leader kills Z2: no effect on read throughput
+ Leader-soft kills Z1:
+ all of the servers are first notified that they
should hand off leadership
+ the time to hand off leadership to a different zone has a
minor effect: throughput drops by around 3-4%
78. Availability
+ Leader-hard kills Z1:
+ severe effect: the rate of completion drops almost to 0
+ After leaders get re-elected, throughput rises to
approximately 100K reads/second because of:
1. extra capacity in the system
2. operations queued while the leader was
unavailable
+ It then goes back to its steady-state rate
79. Availability
+ Paxos leader leases are set to 10 seconds
+ After killing the zone, the leader-lease expiration times for
the groups should be evenly distributed over the next 10
seconds
+ Soon after each lease from a dead leader expires, a new
leader is elected
+ So, Approximately 10 seconds after the kill time
+ all of the groups have leaders and throughput has recovered
80. Availability
+ Shorter lease times would reduce the effect of server deaths
on availability
+ The trade-off is a greater amount of lease-renewal network
traffic
+ New mechanism is under development
+ mechanism that will cause slaves to release Paxos
leader leases upon leader failure
81. TrueTime
+ Two questions must be answered with respect to TrueTime:
+ Is ε truly a bound on clock uncertainty?
+ How bad does ε get?
+ For context, bad CPUs are 6 times more likely than bad clocks
82. TrueTime
+ TrueTime data was taken at several thousand spanserver
machines across datacenters up to 2200 km apart.
+ The figure plots the 90th, 99th, and 99.9th percentiles of ε, sampled
at timeslave daemons immediately after polling the time
masters.
+ This sampling elides the sawtooth in ε due to local-clock
uncertainty, and therefore measures time-master
uncertainty (which is generally 0) plus communication delay
to the time masters.
83. TrueTime
Fig: Distribution of TrueTime ε values, sampled right after the timeslave
daemon polls the time masters. The 90th, 99th, and 99.9th percentiles are
graphed.
84. F1
+ Google’s advertising backend called F1
+ based on a MySQL database
+ manually sharded many ways
+ uncompressed dataset is tens of terabytes
+ assigned each customer and all related data to a fixed
shard
+ Resharding took over two years of intense effort
+ involving coordination and testing across dozens of teams to
minimize risk
+ Storing some data in external Bigtables compromised:
+ transactional behavior
+ the ability to query across all data
85. F1
+ Spanner for F1
a. Spanner removes the need to manually reshard
b. Spanner provides synchronous replication
c. Automatic failover
d. F1 requires strong transactional semantics, which made using
other NoSQL systems impractical
+ transactions across arbitrary data, and consistent reads
e. The F1 team also needed secondary indexes on their data
+ Spanner does not yet support secondary indexes
+ so the F1 team implemented their own consistent global indexes using
Spanner transactions
86. F1
+ Distribution of the number of fragments per directory in F1:
a. a directory typically corresponds to a customer
b. the vast majority of directories consist of only 1 fragment
+ Directories with more than 100 fragments are all tables that
contain F1 secondary indexes
88. F1
+ All 3 replicas in the east-coast datacenters are given
higher priority in choosing Paxos leaders
+ Latencies were measured from F1 servers in those datacenters
+ The large standard deviation in write latencies is caused by lock conflicts
+ The even larger standard deviation in read latencies is partly because:
+ Paxos leaders are spread across two datacenters,
+ only one of which has machines with SSDs
+ The measurement includes every read in the system from those datacenters:
+ the mean and standard deviation of bytes read were 1.6KB and 119KB, respectively
90. Conclusions
+Reify clock uncertainty in time APIs
+Known unknowns are better than unknown unknowns
+Rethink algorithms to make use of uncertainty
+Stronger semantics are achievable
+Greater scale != weaker semantics
Notes on TrueTime (continued):
+ Bad hosts are evicted:
+ timemasters check themselves against other timemasters
+ clients check themselves against timemasters
+ The data shows that these two factors in determining the base value of ε are generally not a problem. However, significant tail-latency issues can cause higher values of ε. The reduction in tail latencies beginning on March 30 was due to networking improvements that reduced transient network-link congestion.
+ The increase in ε on April 13, approximately one hour in duration, resulted from the shutdown of 2 time masters at a datacenter for routine maintenance.
+ We continue to investigate and remove causes of TrueTime spikes.