This document summarizes the research paper on Spanner, Google's globally distributed database. Spanner provides strong consistency and transactions across globally distributed data, addressing Google's need for a scalable database to replace its manually sharded MySQL deployment. Spanner uses TrueTime to synchronize clocks across datacenters with bounded uncertainty, and it assigns timestamps to transactions using this synchronized time to ensure consistency. Spanner supports different types of transactions, namely read-write, read-only, and snapshot reads, through its consistency and concurrency-control mechanisms. Evaluation results show Spanner can provide low latency, high throughput, and availability even during leader failures.
2. Authors:
+ James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes,
Christopher Frost, … , Google, Inc.
Published in:
+ ACM Transactions on Computer Systems (TOCS), Volume 31, Issue 3,
August 2013.
+ USENIX Symposium on Operating Systems Design and
Implementation (OSDI) 2012, Best Paper.
3. Topics To Be Covered
+ Problem Definition
+ Spanner Database
+ Deployment and Spanserver Software Stack
+ Data Model
+ TrueTime and External Consistency
+ Transactions
+ Evaluation and Results
4. Problem – Need for Scalable MySQL
Google’s advertising backend
- Based on MySQL
• Relations
• Query language
- Manually sharded
• Resharding is very costly
- Global distribution
6. Overview of Available Solutions
Google Bigtable
• Scalability
• Throughput
• Performance
• Eventually-consistent replication support across data-centers
• Transactional scope limited to a single row
Google Megastore
• Replicated ACID transactions
• Schematized semi-relational tables
• Synchronous replication support across data-centers
• Relatively poor write performance
• Lack of a rich query language
7. Bridging the gap between Megastore and Bigtable
Solution: Google Spanner
Removes the need to manually partition data
Synchronous replication and automatic failover
Strong transactional semantics
8. What is Spanner?
Distributed multiversion database
+ General-purpose transactions (ACID)
+ SQL query language
+ Schematized tables
+ Semi-relational data model
Running in production
+ Storage for Google’s ad data
+ Replaced a sharded MySQL database
9. Example: Social Network
[Diagram: friend lists and user posts sharded (x1000) across servers in
datacenters in the US (San Francisco, Seattle, Arizona), Brazil (Sao Paulo,
Santiago, Buenos Aires), Russia (Moscow, Berlin, Krakow), and Spain (London,
Paris, Madrid, Lisbon).]
Sharding is a type of database partitioning that separates very large
databases into smaller, faster, more easily managed parts called data shards.
The word shard means a small part of a whole.
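A minimal Python sketch of hash-based shard routing, to make the definition concrete. This is illustrative only: the function name and shard count are hypothetical, and Spanner itself partitions data by key ranges (directories) rather than by hashing.

import hashlib

NUM_SHARDS = 1000  # matches the "x1000" servers per region in the diagram

def shard_for(key: str) -> int:
    # Hash the key so rows spread evenly across the shards.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user:42/friend_list"))  # some shard id in [0, 1000)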
10. Why Consistency matters?
Generate a page of friends’ recent posts
Consistent view of friend list and their posts
Single Machine
[Diagram: one machine holds the friend list and all friends' posts;
generating my page requires reading Friend1 … Friend1000 posts while
blocking writes.]
11. Multiple Machines
[Diagram: friend lists and user posts spread across several machines;
generating my page must read Friend1 … Friend1000 posts consistently,
blocking writes on every machine involved.]
12. Multiple Datacenters
[Diagram: the same data sharded x1000 across datacenters in the US, Spain,
Russia, and Brazil; generating my page now spans datacenters.]
14. Spanner Deployment - Universe
A Spanner deployment is called a universe.
Spanner is organized as a set of zones, where each zone is the
rough analog of a deployment of Bigtable servers.
15. Spanner Deployment - Universe
A zone has one zonemaster and several thousand spanservers.
The former assigns data to spanservers; the latter serve data
to clients.
Location proxies are used by clients to locate the spanservers
that serve their data.
16. Spanner Deployment - Universe
The universemaster has a console that displays status information.
The placement driver handles automated data movement across zones for
replication or load-balancing purposes.
18. Spanserver Software Stack
+ Each spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet.
+ A tablet maintains the following mapping:
(key: string, timestamp: int64) -> string
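A minimal Python sketch of the multiversion mapping above: each key holds several timestamped versions, and a read at timestamp t returns the latest version at or below t. The class and method names are hypothetical, and a real tablet persists this state on Colossus rather than in memory.

from bisect import bisect_right

class Tablet:
    def __init__(self):
        self._rows = {}  # key -> sorted list of (timestamp, value) versions

    def write(self, key, ts, value):
        self._rows.setdefault(key, []).append((ts, value))
        self._rows[key].sort()  # keep versions ordered by timestamp

    def read(self, key, ts):
        # Return the latest version of key with timestamp <= ts, else None.
        versions = self._rows.get(key, [])
        i = bisect_right(versions, (ts, chr(0x10FFFF)))
        return versions[i - 1][1] if i > 0 else None

t = Tablet()
t.write("user:1/name", 10, "Jerry")
t.write("user:1/name", 20, "Jerry M.")
print(t.read("user:1/name", 15))  # -> "Jerry" (version 20 is too new)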
19. Spanserver Software Stack
+ Data and logs are stored on Colossus (the successor of GFS).
+ A single Paxos state machine sits on top of each tablet to provide consistent replication.
+ Paxos is a consensus protocol, i.e., it lets all participants agree on a common value;
Spanner uses Paxos for consistent replication.
20. Spanserver Software Stack
+ The Paxos implementation supports long-lived leaders with time-based leader leases.
+ Writes must initiate the Paxos protocol at the leader.
+ Reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date.
+ A transaction manager supports distributed transactions.
23. Directories
+ A directory is the unit of data placement.
+ A directory is also the smallest unit whose geographic replication properties can
be specified by an application.
+ All data in a directory has the same replication configuration.
+ A Paxos group may contain multiple directories.
24. Directories
+ Spanner might move a directory:
- To shed load from a Paxos group.
- To put directories that are frequently accessed together into the same group.
- To move a directory into a group that is closer to its accessors.
26. Data Model
+ An application creates one or more databases in a universe.
+ Each database can contain an unlimited number of schematized tables.
+ Table
• Rows and columns
• Must have an ordered set of one or more primary-key columns
• The primary key uniquely identifies each row
+ Hierarchies of tables
• Tables must be partitioned by clients into one or more hierarchies of tables
• The table at the top of a hierarchy is a directory table
27. EXAMPLE
CREATE TABLE Albums {
  user_id INT64 NOT NULL, album_id INT64 NOT NULL,
  name STRING
} PRIMARY KEY (user_id, album_id);
CREATE TABLE Photo {
  user_id INT64 NOT NULL, album_id INT64 NOT NULL,
  photo_id INT64 NOT NULL, name STRING
} PRIMARY KEY (user_id, album_id, photo_id),
  INTERLEAVE IN PARENT Albums ON DELETE CASCADE;
+ Client applications declare the hierarchies in database schemas via the
INTERLEAVE IN declarations.
+ ON DELETE CASCADE says that deleting a row in the directory table
deletes any associated child rows.
30. Semi-Relational Data Model
+ Not purely relational
- Requires rows to have names
- Names are nothing but a set (possibly a singleton) of primary keys
- In a way, it is a key-value store with primary keys mapped to the non-key columns as values
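A tiny Python illustration of the point above, using the Albums/Photo schema from the earlier example: a row's "name" is its primary-key tuple, and the value is the remaining non-key columns. The data is purely illustrative.

# (user_id, album_id) -> non-key columns
albums = {
    (1, 100): {"name": "Summer 2012"},
    (1, 101): {"name": "Ski trip"},
}
# (user_id, album_id, photo_id) -> non-key columns; the shared key prefix
# (1, 100) is what INTERLEAVE IN PARENT exploits for locality.
photos = {
    (1, 100, 5000): {"name": "beach.jpg"},
}
print(albums[(1, 100)]["name"])  # -> "Summer 2012"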
33. Time Synchronization
+ Is synchronizing time at the global scale possible?
+ Synchronizing time within and between datacenters is extremely hard and uncertain.
+ Serializing requests is impossible at global scale.
34. Time Synchronization
+ Idea: accept uncertainty, keep it small, and quantify it (using GPS and atomic clocks).
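A minimal Python sketch of the TrueTime interval API used throughout the rest of the deck (TT.now(), TT.after(), TT.before()). The simulation is hypothetical: it fakes a fixed uncertainty epsilon around the local clock, whereas real TrueTime derives its bounds from GPS and atomic-clock masters.

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # true absolute time is guaranteed >= earliest
    latest: float    # ... and <= latest

class TrueTime:
    def __init__(self, epsilon=0.004):
        self.epsilon = epsilon  # assumed uncertainty bound, in seconds

    def now(self):
        t = time.time()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True iff absolute time t has definitely passed.
        return self.now().earliest > t

    def before(self, t):
        # True iff absolute time t has definitely not arrived yet.
        return self.now().latest < t

TT = TrueTime()
iv = TT.now()
print(iv.latest - iv.earliest)  # the interval width is 2 * epsilon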
38. How TrueTime Is Implemented
• A time daemon on each machine polls a variety of time masters:
• GPS masters chosen from nearby datacenters
• GPS masters from farther datacenters
• Armageddon masters (equipped with atomic clocks)
• The daemon reaches a consensus about the correct timestamp
from the masters it polls.
39. External Consistency
• External consistency: formally, if the commit of T1 precedes the
start of a new transaction T2 in wall-clock (physical) time,
then the commit of T1 must also precede the commit of T2 in the
serial order.
40. External Consistency
• Jerry unfriends Tom before writing a controversial comment:
T1: unfriend Tom (by Jerry)
T2: post comment (by Jerry)
T3: view Jerry's profile (by Tom)
• Physical-time order: T1, then T2, then T3.
• If the serial order is instead T2: post comment, T3: view Jerry's profile,
T1: unfriend Tom, then Jerry will be in trouble: Tom sees the comment
while still a friend!
41. External Consistency
• External consistency: formally, if the commit of T1 precedes the
start of a new transaction T2 in wall-clock (physical) time,
then the commit of T1 must also precede the commit of T2 in the
serial order.
45. Key Innovation
+ The Spanner implementation supports three types of transactions:
1. Read-write transactions
2. Read-only transactions
3. Snapshot reads
+ Standalone writes are implemented as read-write transactions;
+ non-snapshot standalone reads are implemented as read-only
transactions.
46. Snapshot Read
+ A read in the past without locking.
+ The client can specify a timestamp for the read, or an upper bound
on the timestamp.
+ Each replica tracks a value called the safe time t_safe, which is the
maximum timestamp at which the replica is up-to-date.
+ A replica can satisfy a read at any t ≤ t_safe.
47. Read-Only Transactions
+ Assign a timestamp s_read and do a snapshot read at s_read
(see the sketch below).
+ s_read = TT.now().latest
+ This guarantees external consistency.
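A sketch tying the last two slides together, reusing the Tablet and TrueTime classes from the earlier sketches (names are hypothetical): a replica refuses snapshot reads above its safe time, and a read-only transaction simply picks s_read = TT.now().latest and does a snapshot read there.

class Replica:
    def __init__(self, tablet, tt):
        self.tablet = tablet  # a Tablet, as sketched earlier
        self.tt = tt          # a TrueTime instance, as sketched earlier
        self.t_safe = 0       # max timestamp at which this replica is up-to-date

    def snapshot_read(self, key, ts):
        # Lock-free read in the past: only valid at or below the safe time.
        if ts > self.t_safe:
            raise RuntimeError("replica not up-to-date at ts; retry or wait")
        return self.tablet.read(key, ts)

    def read_only_txn(self, key):
        s_read = self.tt.now().latest  # guarantees external consistency
        return self.snapshot_read(key, s_read)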
48. Read-Write Transactions
+ Leader must only assign timestamps within the interval of its
leader lease
+ Timestamps must be assigned in monotonically increasing order.
+ If transaction T1 commits before T2 starts, T2’s commit
timestamp must be greater than T1’s commit timestamp
49. Read-Write Transactions
+ Clients buffer writes.
+ The client chooses a coordinator group that initiates two-phase
commit.
+ Each non-coordinator participant leader chooses a prepare
timestamp, logs a prepare record through Paxos, and notifies
the coordinator.
50. Read-Write Transactions
+ The coordinator assigns a commit timestamp s_i no less than all
prepare timestamps and no less than TT.now().latest.
+ The coordinator ensures that clients cannot see any data
committed by T_i until TT.after(s_i) is true. This is done by commit
wait (waiting until absolute time has passed s_i before committing);
see the sketch below.
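A minimal sketch of the coordinator's timestamp choice and commit wait, reusing the TrueTime class sketched earlier. The function is hypothetical and omits Paxos logging, leases, and locking:

import time

def coordinator_commit(tt, prepare_timestamps):
    # s must be >= every participant's prepare timestamp and >= TT.now().latest.
    s = max(max(prepare_timestamps), tt.now().latest)
    # Commit wait: no client may see the commit until s has definitely
    # passed, i.e. until TT.after(s) is true.
    while not tt.after(s):
        time.sleep(0.001)
    return s  # now safe to apply writes, release locks, and ack at timestamp s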
51. Timestamps, Global Clock
+Strict two-phase locking for write transactions
+Assign timestamp while locks are held
[Diagram: transaction T acquires locks, picks s = now(), then releases locks.]
52. Timestamps and TrueTime
[Diagram: transaction T acquires locks, picks s = TT.now().latest, then
commit-waits until TT.now().earliest > s before releasing locks; the pick
and the wait each cost an average of ε.]
53. Commit Wait and 2-Phase Commit
[Diagram: coordinator TC and participants TP1, TP2 each acquire locks; each
participant computes a prepare timestamp s, starts logging, and reports
"prepared" when logging is done; TC computes the overall s, commits,
notifies the participants of s, and performs commit wait; all then release
locks.]
54. Example
[Diagram: a two-phase-commit transaction removes X from my friend list
(TC, sC = 6) and removes me from X's friend list (TP, sP = 8); the overall
commit timestamp is s = 8. A later transaction T2 makes a risky post P at
s = 15. A snapshot read at time < 8 still sees [X] in my friend list and
[me] in X's; at 8 both friend lists are empty, and at 15 my posts contain
[P].]
55. Assign Timestamp (R/O)
+ Simple: s_read = TT.now().latest
+ Can read at s_read if s_read ≤ t_safe, where
t_safe = min(t_safe^Paxos, t_safe^TM)
+ t_safe^TM = min_i(s_i,g^prepare) − 1, taken over the transactions
prepared (but not yet committed) at group g (see the sketch below).
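A small Python sketch of the safe-time rule reconstructed above (illustrative; timestamps are plain numbers here):

def safe_time(t_safe_paxos, prepared_timestamps):
    # A prepared-but-uncommitted transaction may still commit at (or above)
    # its prepare timestamp, so reads must stay strictly below the earliest one.
    t_safe_tm = min(prepared_timestamps) - 1 if prepared_timestamps else float("inf")
    return min(t_safe_paxos, t_safe_tm)

print(safe_time(100, [95, 120]))  # -> 94
print(safe_time(100, []))         # -> 100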
56. Assign Timestamp (Paxos)
+ A read within one Paxos group: s_read = LastTS(), the timestamp of the
last committed write at that group.
+ Reads across multiple Paxos groups: s_read = TT.now().latest,
+ which may wait for the safe time to advance.
62. Evaluation
+ Spanner’s performance
+ replication
+ transactions
+ availability
+ Some insight on TrueTime behavior
+ Case study of their first client, F1
63. Test Bench
+ Experiments ran on timeshared machines
+ Each spanserver ran on scheduling units of:
+ 4GB RAM
+ 4 cores (AMD Barcelona 2200MHz)
+ Clients were run on separate machines
64. Test Bench
+ Each zone contained 1 spanserver
+ Clients and zones were placed in a set of data
centers with network distance of less than 1ms
+ Test database:
+ 50 Paxos groups
+ 2500 directories
65. Test Bench
+ Operations were standalone reads and writes
of 4KB
+ All reads were served out of memory after a
compaction, so that only the overhead of
Spanner's call stack was measured
+ 1 unmeasured round of reads was done first to
warm location caches
66. Latency
+ Clients issued sufficiently few operations to avoid
queuing at the servers
+ From the 1-replica experiments:
+ Commit wait is about 5ms
+ Paxos latency is about 9ms
68. Latency
+ As the number of replicas increases:
+ latency stays roughly constant
+ with less standard deviation,
+ because latency to achieve a quorum becomes less
sensitive to slowness at one replica
69. Throughput
+ Clients issued sufficiently many operations so as to saturate
the servers’ CPUs
+ Snapshot reads can execute at any up-to-date replicas, so
their throughput increases almost linearly with the number
of replicas
+ Single-read read-only transactions only execute at leaders
because timestamp assignment must happen at leaders
+ Read-only-transaction throughput increases with the
number of replicas because the number of effective
spanservers increases
71. Throughput
+ Write throughput benefits from the same experimental
artifact (which explains the increase in throughput from 3 to
5 replicas)
+ But as replicas increase, this benefit is outweighed by the
linear increase in the amount of work performed per write
73. Throughput
+ two-phase commit can scale to a reasonable number of
participants:
+ Experiments with 3 zones, each with 25 spanservers
+ Scaling up to 50 participants is reasonable in both mean
and 99th-percentile
+ But latencies start to rise noticeably at 100 participants
74. Availability
+ The test universe: 5 zones named Zi,
+ with 25 spanservers each.
+ The test database was sharded into 1250 Paxos groups
+ 100 test clients
+ constantly issued non-snapshot reads at an aggregate
rate of 50K reads/second.
75. Availability
+ All of the leaders were explicitly placed in Z1.
+ Five seconds into each test, all of the servers in one zone
were killed:
+ non-leader kills Z2
+ leader-hard kills Z1
+ leader-soft kills Z1
77. Availability
+ Non-leader kills Z2: no effect on read throughput
+ Leader-soft kills Z1:
+ all of the servers are first notified that they
should hand off leadership
+ the time to hand off leadership to a different zone has a
minor effect: throughput drops by around 3-4%
78. Availability
+ Leader-hard kills Z1:
+ severe effect: the rate of completion drops almost to 0
+ After leaders get re-elected, throughput rises to
approximately 100K reads/second because of:
1. extra capacity in the system
2. operations queued while the leader was
unavailable
+ It then goes back to its steady-state rate
79. Availability
+ Paxos leader leases are set to 10 seconds
+ After killing the zone, the leader-lease expiration times for
the groups should be evenly distributed over the next 10
seconds
+ Soon after each lease from a dead leader expires, a new
leader is elected
+ So, Approximately 10 seconds after the kill time
+ all of the groups have leaders and throughput has recovered
80. Availability
+ Shorter lease times would reduce the effect of server deaths
on availability
+ The trade-off is a greater amount of lease-renewal network
traffic
+ New mechanism is under development
+ mechanism that will cause slaves to release Paxos
leader leases upon leader failure
81. TrueTime
+ Two questions must be answered with respect to TrueTime:
+ Is ε truly a bound on clock uncertainty?
+ How bad does ε get?
+ For context, bad CPUs are 6 times more likely than bad clocks
82. TrueTime
+ TrueTime data was taken at several thousand spanserver
machines across datacenters up to 2200 km apart.
+ The figure plots the 90th, 99th, and 99.9th percentiles of ε, sampled
at timeslave daemons immediately after polling the time
masters.
+ This sampling elides the sawtooth in ε due to local-clock
uncertainty, and therefore measures time-master
uncertainty (which is generally 0) plus communication delay
to the time masters.
83. TrueTime
Fig: Distribution of TrueTime ε values, sampled right after the timeslave
daemon polls the time masters. The 90th, 99th, and 99.9th percentiles are
graphed.
84. F1
+ Google’s advertising backend called F1
+ based on a MySQL database
+ manually sharded many ways
+ uncompressed dataset is tens of terabytes
+ assigned each customer and all related data to a fixed
shard
+ Resharding took over two years of intense effort
+ involving coordination and testing across dozens of teams to
minimize risk
+ Storing some data in external Bigtables compromised:
+ transactional behavior
+ the ability to query across all data
85. F1
+ Spanner for F1
a. Spanner removes the need to manually reshard
b. Spanner provides synchronous replication
c. Automatic failover
d. F1 requires strong transactional semantics, which made using
other NoSQL systems impractical
+ transactions across arbitrary data, and consistent reads
e. The F1 team also needed secondary indexes on their data
+ Spanner does not yet support secondary indexes
+ so the F1 team implemented their own consistent global indexes using
Spanner transactions
86. F1
+ Distribution of the number of fragments per directory in F1:
a. a directory typically corresponds to a customer
b. the vast majority of directories consist of only 1 fragment
+ Directories with more than 100 fragments are all tables that
contain F1 secondary indexes
88. F1
+ All 3 replicas in the east-coast datacenters are given
higher priority in choosing Paxos leaders
+ Latencies were measured from F1 servers in those datacenters
+ The large standard deviation in write latencies is caused by lock conflicts
+ The even larger standard deviation in read latencies is partly because:
+ Paxos leaders are spread across two datacenters,
+ only one of which has machines with SSDs
+ The measurement includes every read in the system from those datacenters:
+ the mean and standard deviation of bytes read were 1.6KB and 119KB, respectively
90. Conclusions
+Reify clock uncertainty in time APIs
+Known unknowns are better than unknown unknowns
+Rethink algorithms to make use of uncertainty
+Stronger semantics are achievable
+Greater scale != weaker semantics
Notes on TrueTime (continued):
+ Bad hosts are evicted:
+ timemasters check themselves against other timemasters
+ clients check themselves against timemasters
+ The data shows that these two factors in determining the base value of ε are generally not a problem. However, significant tail-latency issues can cause higher values of ε. The reduction in tail latencies beginning on March 30 was due to networking improvements that reduced transient network-link congestion.
+ The increase in ε on April 13, approximately one hour in duration, resulted from the shutdown of 2 time masters at a datacenter for routine maintenance.
+ We continue to investigate and remove causes of TrueTime spikes.