Spanner: Google’s Globally-
Distributed Database
Presented BY
Sultan Ahmed (0416052090)
Mazharul Hossain (1015052033)
Authors:
+James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes,
Christopher Frost, … , Google, Inc.
Published in:
+ACM Transactions on Computer Systems (TOCS), Volume 31, Issue 3, August 2013.
+USENIX Symposium on Operating Systems Design and Implementation (OSDI) 2012, Best Paper.
Topics To be Covered
+ Problem Definition
+ Spanner Database
+ Spanner Deployment and Spanserver Software Stack
+ Data Model
+ TrueTime and Consistency
+ Concurrency Control and Transactions
+ Evaluation and Results
Problem – Need for Scalable MySQL
 Google’s advertising backend
- Based on MySQL
• Relations
• Query language
- Manually sharded
•Resharding is very costly
- Global distribution
Available Solutions To Solve Problem
 Google’s Big Table
 Google Megastore
Overview of Available Solutions
Google Bigtable
• Scalability
• Throughput
• Performance
• Eventually-consistent replication support across data-centers
• Transactional scope limited to a single row
Google Megastore
• Replicated ACID transactions
• Schematized semi-relational tables
• Synchronous replication support across data-centers
• Relatively poor write performance
• Lack of a query language
Bridging the gap between Megastore and Bigtable
Solution: Google Spanner
 Removes the need to manually partition data
 Synchronous replication and automatic failover
 Strong transactional semantics
What is Spanner?
 Distributed multiversion database
+ General-purpose transactions (ACID)
+ SQL query language
+ Schematized tables
+ Semi-relational data model
 Running in production
+ Storage for Google’s ad data
+ Replaced a sharded MySQL database
Example: Social Network
Fig: Sharded and replicated social-network data. Shards of user posts and friend lists (x1000) are spread across datacenters in the US, Brazil, Russia, and Spain (San Francisco, Seattle, Arizona, Sao Paulo, Santiago, Buenos Aires, Moscow, Berlin, Krakow, London, Paris, Madrid, Lisbon).
Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.
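To make the idea concrete, here is a minimal Python sketch of hash-based sharding (hypothetical names, not Spanner or F1 code): every row for a user is routed to the shard that a fixed function of user_id selects, which is why changing the number of shards (resharding) forces data to move.

# Minimal illustration of hash sharding; NUM_SHARDS and the helpers are hypothetical.
NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}

def shard_for(user_id: int) -> int:
    # All of a user's rows live on the shard this function picks.
    return user_id % NUM_SHARDS

def put_post(user_id: int, post_id: int, text: str) -> None:
    shards[shard_for(user_id)][(user_id, post_id)] = text

put_post(42, 1, "hello")
print(shard_for(42))  # -> 2: every row for user 42 lives on shard 2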
Why Consistency matters?
Generate a page of friends’ recent posts
 Consistent view of friend list and their posts
Single Machine
Fig: Generating my page on a single machine that holds both the friend lists and the user posts: writes are blocked while the friend list and each friend's posts (Friend1 ... Friend1000) are read.
Multiple Machines
Fig: Generating my page when friend lists and user posts are spread across multiple machines: writes must be blocked on all of the involved machines while the page is generated.
Multiple Datacenters
Fig: Generating my page when the shards (x1000) are replicated across multiple datacenters (US, Spain, Russia, Brazil).
Spanner Deployment
Spanner Deployment - Universe
 A Spanner deployment is called a universe.
 Spanner is organized as a set of zones, where each zone is the
rough analog of a deployment of Bigtable servers .
Spanner Deployment - Universe
 A zone has one zonemaster and several thousand spanservers.
The former assigns data to spanservers; the latter serve data
to clients.
 Proxies are used by clients to locate span servers with the
correct data
Spanner Deployment - Universe
 universemaster has a console with status information
 placement driver handles data movement for replication or
load balancing purposes
Spanserver Software Stack
Spanserver Software Stack
+ Each spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet.
+ A tablet maintains the following mapping (see the sketch below):
(key: string, timestamp: int64) -> string
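A minimal sketch of that mapping (hypothetical class, for illustration only): each key is multiversioned, and a read returns the newest version at or below the requested timestamp.

# Hypothetical sketch of a tablet's (key, timestamp) -> value mapping.
class Tablet:
    def __init__(self):
        self._data = {}  # (key: str, timestamp: int) -> str

    def write(self, key: str, timestamp: int, value: str) -> None:
        self._data[(key, timestamp)] = value

    def read(self, key: str, timestamp: int) -> str:
        # Newest version of `key` at or before `timestamp`.
        versions = [ts for (k, ts) in self._data if k == key and ts <= timestamp]
        if not versions:
            raise KeyError(key)
        return self._data[(key, max(versions))]

t = Tablet()
t.write("users/1/name", 10, "Alice")
t.write("users/1/name", 20, "Alicia")
print(t.read("users/1/name", 15))  # -> "Alice"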
Spanserver Software Stack
+ Data and logs are stored on Colossus (the successor of GFS).
+ A single Paxos state machine sits on top of each tablet to provide consistent replication.
+ Paxos is used to reach consensus, i.e. for all participants to agree on a common value; Spanner uses it for consistent replication.
Spanserver Software Stack
+ The Paxos implementation supports long-lived leaders with time-based leader leases.
+ Writes must initiate the Paxos protocol at the leader.
+ Reads access state directly from the underlying tablet at any sufficiently up-to-date replica.
+ A transaction manager supports distributed transactions.
Directories and Placement
Directories and Placement
Directories
+ A directory is the unit of data placement.
+ A directory is also the smallest unit whose geographic replication properties can be specified by an application.
+ All data in a directory has the same replication configuration.
+ A Paxos group may contain multiple directories.
Directories
+ Spanner might move a directory:
- To shed load from a paxos group.
- To put directories that are frequently accessed together into the same group.
- To move a directory into a group that is closer to its accessors.
Data Model
+ An application creates one or more databases in a universe.
+ Each database can contain an unlimited number of schematized
tables.
+ Table
• Rows and columns
• Must have an ordered set of one or more primary-key columns
• The primary key uniquely identifies each row
+ Hierarchies of tables
• Tables must be partitioned by the client into one or more hierarchies of tables
• The table at the top of a hierarchy is the directory table
Data Model
EXAMPLE
CREATE TABLE Albums {
  user_id INT64 NOT NULL,
  album_id INT64 NOT NULL,
  name STRING
} PRIMARY KEY (user_id, album_id);

CREATE TABLE Photo {
  user_id INT64 NOT NULL,
  album_id INT64 NOT NULL,
  photo_id INT64 NOT NULL,
  name STRING
} PRIMARY KEY (user_id, album_id, photo_id),
  INTERLEAVE IN PARENT Albums ON DELETE CASCADE;
+ Client applications declare the hierarchies in database schemas via the
INTERLEAVE IN declarations.
+ ON DELETE CASCADE says that deleting a row in the directory table
deletes any associated child rows.
Logical data layout

Album
user_id | album_id | name
   1    |    1     | Picnic
   1    |    2     | Birthday
   2    |    1     | Rag
   3    |    1     | Eid

Photo
user_id | album_id | photo_id | name
   1    |    1     |    1     | pic1
   1    |    1     |    2     | pic2
   1    |    1     |    3     | pic3
   1    |    2     |    1     | pic1
Physical data layout

1 1     Picnic      <- Albums row (directory table)
1 1 1   pic1        <- Photo rows interleaved under their parent
1 1 2   pic2
1 1 3   pic3
1 2     Birthday
1 2 1   pic1
1 2 2   pic2
1 2 3   pic3
1 2 4   pic4
1 2 5   pic5

All rows sharing the key prefix (here user_id = 1) form one directory; the Albums rows belong to the directory table.
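A small sketch of why interleaving produces this order (hypothetical key encoding, not Spanner's real one): child Photo rows sort immediately after their parent Albums row because their keys share the parent's primary-key prefix.

# Hypothetical illustration of interleaved key ordering.
rows = [
    (("Albums", 1, 1), "Picnic"),
    (("Albums", 1, 2), "Birthday"),
    (("Photo", 1, 1, 1), "pic1"),
    (("Photo", 1, 1, 2), "pic2"),
    (("Photo", 1, 2, 1), "pic1"),
]

def interleaved_key(row_key):
    table, *pk = row_key
    # Children share the parent's (user_id, album_id) prefix; the 0/1 tag makes
    # the parent row sort before its own children.
    return (pk[0], pk[1], 0 if table == "Albums" else 1, *pk[2:])

for key, value in sorted(rows, key=lambda r: interleaved_key(r[0])):
    print(key, value)
# Prints the Albums(1,1) row, then its photos, then Albums(1,2), then its photos.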
Semi-Relational Data Model
+ Not purely relational
- Requires rows to have names
- Names are nothing but a set (possibly a singleton) of primary keys
- In a way, it is a key-value store with primary keys mapped to non-key columns as values
True Time
and
Consistency
+Spanner knows what time it is.
Key Innovation
+Is synchronizing time at the global scale possible?
+Synchronizing time within and between datacenters is extremely hard and uncertain.
+ Serialization of requests is impossible at global scale
Time Synchronization
+Idea: accept uncertainty, keep it small and quantify (using
GPS and Atomic Clocks).
Time Synchronization
TrueTime
+“Global wall-clock time” with bounded
uncertainty
Fig: TT.now() returns an interval [earliest, latest] whose width is 2*ε.
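A toy model of this interface (hypothetical code; EPSILON is an assumed constant, whereas real TrueTime computes a varying bound):

import time

EPSILON = 0.004  # assumed uncertainty bound in seconds (~4 ms), for illustration

class TTInterval:
    def __init__(self, earliest: float, latest: float):
        self.earliest, self.latest = earliest, latest

class TrueTime:
    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(t - EPSILON, t + EPSILON)  # interval width is 2*epsilon

    def after(self, t: float) -> bool:
        return self.now().earliest > t   # absolute time has definitely passed t

    def before(self, t: float) -> bool:
        return self.now().latest < t     # absolute time is definitely before t

tt = TrueTime()
iv = tt.now()
print(round(iv.latest - iv.earliest, 3))  # -> 0.008, i.e. 2*epsilon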
TrueTime Architecture
Fig: TrueTime architecture. Each datacenter (Datacenter 1 ... Datacenter n) runs GPS timemasters and Atomic-clock timemasters; a client daemon polls several of them and computes a reference interval [earliest, latest] = now ± ε.
How True Time Is Implemented
How TrueTime is Implemented
• A daemon on each machine polls a variety of masters:
• masters chosen from nearby datacenters
• masters from farther datacenters
• Armageddon (atomic-clock) masters
• The daemon combines the responses and reaches a consensus about the correct timestamp.
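A simplified sketch of the agreement step (hypothetical; the real daemon uses a variant of Marzullo's algorithm that also detects and rejects lying masters, and widens the result for clock drift between polls):

# Hypothetical sketch: intersect the [earliest, latest] intervals reported by
# several time masters into one reference interval.
def combine(intervals):
    earliest = max(lo for lo, hi in intervals)
    latest = min(hi for lo, hi in intervals)
    if earliest > latest:
        raise ValueError("masters disagree; a lying master must be rejected first")
    return earliest, latest

# Three masters, each reporting its own uncertainty interval (seconds).
print(combine([(100.001, 100.009), (100.000, 100.007), (100.002, 100.010)]))
# -> (100.002, 100.007)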
External consistency
• External Consistency: formally, if the commit of T1 precedes the initiation of a new transaction T2 in wall-clock (physical) time, then the commit of T1 must also precede the commit of T2 in the serial ordering.
External consistency
• Jerry unfriends Tom so that he can write a controversial comment.
• If the serial order is as below, Jerry will be in trouble!
Physical time order: T1: unfriend Tom (by Jerry), T2: post comment (by Jerry), T3: view Jerry's profile (by Tom).
Bad serial order: T2: post comment, T3: view Jerry's profile, T1: unfriend Tom.
External consistency
• External Consistency: formally, if the commit of T1 precedes the initiation of a new transaction T2 in wall-clock (physical) time, then the commit of T1 must also precede the commit of T2 in the serial ordering.
Concurrency Control
Assigning Timestamps to RW Transactions
External consistency
+ Concept: one transaction starts after another transaction ends (happens-before).
Transaction details
Key Innovation
+ The Spanner implementation supports three types of transactions:
1. Read-write transactions
2. Read-only transactions
3. Snapshot reads
+ Standalone writes are implemented as read-write transactions;
+ non-snapshot standalone reads are implemented as read-only transactions.
Snapshot Read
+ Read in past without locking.
+ Client can specify timestamp for read or an upper bound of
timestamp.
+ Each replica tracks a value called safe time, t_safe, which is the maximum timestamp at which the replica is up-to-date.
+ A replica can satisfy a read at any t ≤ t_safe.
Read-only Transactions
+ Assign a timestamp s_read and do a snapshot read at s_read (see the sketch below).
+ s_read = TT.now().latest
+ This guarantees external consistency.
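A sketch of serving such a read (hypothetical code; FakeReplica is a stand-in, and tt is assumed to be the TrueTime sketch from earlier): pick s_read = TT.now().latest, then wait until the replica's safe time has caught up before reading at s_read.

import time

class FakeReplica:
    # Hypothetical stand-in for a replica, for illustration only.
    def __init__(self):
        self._t_safe = 0.0
        self._versions = {}  # (key, timestamp) -> value

    def safe_time(self) -> float:
        return self._t_safe

    def apply(self, key, ts, value):
        self._versions[(key, ts)] = value
        self._t_safe = max(self._t_safe, ts)

    def read_at(self, key, ts):
        valid = [t for (k, t) in self._versions if k == key and t <= ts]
        return self._versions[(key, max(valid))] if valid else None

def read_only_txn(tt, replica, key):
    s_read = tt.now().latest          # timestamp choice that preserves external consistency
    while replica.safe_time() < s_read:
        time.sleep(0.001)             # replica not yet up-to-date at s_read
    return replica.read_at(key, s_read)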
Read-Write Transactions
+ Leader must only assign timestamps within the interval of its
leader lease
+ Timestamps must be assigned in monotonically increasing order.
+ If transaction T1 commits before T2 starts, T2’s commit
timestamp must be greater than T1’s commit timestamp
Read-Write Transactions
+ Clients buffer writes.
+ The client chooses a coordinator group that initiates two-phase commit.
+ Each non-coordinator participant leader chooses a prepare timestamp, logs a prepare record through Paxos, and notifies the coordinator.
Read-Write Transactions
+ The coordinator assigns a commit timestamp s_i that is no less than all prepare timestamps and no less than TT.now().latest (see the sketch below).
+ The coordinator ensures that clients cannot see any data committed by T_i until TT.after(s_i) is true. This is done by commit wait (wait until absolute time has passed s_i before committing).
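A condensed sketch of that rule (hypothetical code, reusing the TrueTime sketch above): the coordinator picks s as the maximum of the participants' prepare timestamps and TT.now().latest, then performs commit wait before acknowledging.

import time

def coordinator_commit(tt, prepare_timestamps):
    # Commit timestamp: no less than every prepare timestamp and TT.now().latest.
    s = max(max(prepare_timestamps), tt.now().latest)

    # Commit wait: nobody may see the commit until s has definitely passed.
    while not tt.after(s):
        time.sleep(0.0005)
    return s  # now safe to apply at s and acknowledge the client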
Timestamps, Global Clock
+ Strict two-phase locking for write transactions
+ Assign the timestamp while locks are held
Fig: timeline of a transaction T: acquire locks, pick s = now(), release locks.
Timestamps and TrueTime
Fig: timeline of a transaction T: acquire locks, pick s = TT.now().latest, then commit wait (on average ε) until TT.now().earliest > s, release locks.
Commit Wait and 2-Phase Commit
Fig: the coordinator TC and the participants TP1, TP2 each acquire locks; each participant computes its prepare timestamp s, starts logging, and notifies the coordinator when prepared; the coordinator computes the overall s, waits out commit wait, commits, sends s to the participants, and all release locks.
Example
Fig: a two-participant transaction (TC with sC=6, TP with sP=8) removes X from my friend list and removes me from X's friend list, committing at the overall timestamp s=8; a later transaction T2 adds the risky post P to my posts at s=15. My friends and X's friends are empty ([]) as of timestamp 8, and my posts contain [P] only as of timestamp 15, so any consistent read sees the unfriending before the post.
Assign timestamp (R/O)
+ Simple: s_read = TT.now().latest
+ A replica can serve the read at timestamp t if t ≤ t_safe, where
  t_safe = min(t_safe^Paxos, t_safe^TM)
  t_safe^TM = min_i(s_i,g^prepare) − 1   (over transactions prepared but not yet committed at group g)
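A sketch of these two bounds (hypothetical code): t_safe^TM is limited by transactions that are prepared but not yet committed at the group.

def tm_safe(prepare_timestamps_of_pending_txns):
    # No prepared-but-uncommitted transactions: t_safe^TM is unbounded.
    if not prepare_timestamps_of_pending_txns:
        return float("inf")
    return min(prepare_timestamps_of_pending_txns) - 1

def t_safe(paxos_safe, prepare_timestamps_of_pending_txns):
    return min(paxos_safe, tm_safe(prepare_timestamps_of_pending_txns))

def can_serve_read(t, paxos_safe, pending):
    return t <= t_safe(paxos_safe, pending)

print(can_serve_read(100, paxos_safe=120, pending=[105, 130]))
# -> True: t_safe = min(120, 105 - 1) = 104, and 100 <= 104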
Assign timestamp (Paxos)
+ One Paxos group: s_read = LastTS(), the timestamp of the last committed write at that group
+ Multiple Paxos groups: s_read = TT.now().latest
+ Either choice may have to wait for safe time to advance before serving the read.
Transaction Detail
Atomic Schema Change
+ Assign a timestamp t in the future.
+ Do the schema change in the background (prepare phase).
+ Reads and writes with timestamps after t block behind the schema change (see the sketch below).
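A sketch of the blocking rule (hypothetical code): the change is registered at a future timestamp t; operations at timestamps up to t proceed, while operations after t wait until the change has been applied.

class SchemaChangeGate:
    # Hypothetical illustration of a non-blocking schema change at future time t.
    def __init__(self, change_timestamp):
        self.t = change_timestamp
        self.applied = False   # flipped once the background phase completes

    def may_proceed(self, op_timestamp):
        # Before t: never wait. After t: wait behind the schema change.
        return op_timestamp <= self.t or self.applied

gate = SchemaChangeGate(change_timestamp=1000)
print(gate.may_proceed(900))    # -> True: timestamped before the schema change
print(gate.may_proceed(1100))   # -> False: must wait until the change is applied
gate.applied = True
print(gate.may_proceed(1100))   # -> True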
Refinements
Reduce wait time for reads
+ Fine-grained mapping from key ranges to t_safe^TM
+ Fine-grained mapping from key ranges to LastTS()
Evaluation
Evaluation
+ Spanner’s performance
+ replication
+ transactions
+ availability
+ Some insight on TrueTime behavior
+ Case study of their first client, F1
Test Bench
+ Timeshared machines; each spanserver ran on scheduling units of 4 GB RAM and 4 cores (AMD Barcelona 2200 MHz)
+ Clients were run on separate machines
Test Bench
+ Each zone contained 1 spanserver
+ Clients and zones were placed in a set of data
centers with network distance of less than 1ms
+ Test database:
+ 50 Paxos groups
+ 2500 directories
Test Bench
+ Operations were standalone reads and writes
of 4KB
+ All reads were served out of memory after a
compaction for measuring the overhead of
Spanner’s call stack
+ 1 unmeasured round of reads was done first to
warm location caches
Latency
+ Clients issued sufficiently few operations to avoid
queuing at the servers
+ From the 1-replica experiments:
+ Commit wait is about 5ms
+ Paxos latency is about 9ms
Latency
Table: Operation microbenchmarks. Mean and standard deviation over 10 runs. 1D means one replica with
commit wait disabled
Latency
+ As the number of replicas increases:
+ the latency stays roughly constant
+ the standard deviation decreases
+ the latency to achieve a quorum becomes less sensitive to slowness at any one replica
Throughput
+ Clients issued sufficiently many operations so as to saturate
the servers’ CPUs
+ Snapshot reads can execute at any up-to-date replicas, so
their throughput increases almost linearly with the number
of replicas
+ Single-read read-only transactions only execute at leaders
because timestamp assignment must happen at leaders
+ Read-only-transaction throughput increases with the
number of replicas because the number of effective span
servers increases
Throughput
Table: Operation microbenchmarks. Mean and standard deviation over 10 runs. 1D means one replica with
commit wait disabled
Throughput
+ Write throughput benefits from the same experimental artifact (which explains the increase in throughput from 3 to 5 replicas)
+ But as the number of replicas grows further, that benefit is outweighed by the linear increase in the amount of work performed per write
Throughput
Table: Two-phase commit scalability. Mean and standard deviations
over 10 runs
Throughput
+ two-phase commit can scale to a reasonable number of
participants:
+ Experiments with 3 zones, each with 25 spanservers
+ Scaling up to 50 participants is reasonable in both mean
and 99th-percentile
+ But latencies start to rise noticeably at 100 participants
Availability
+ The test universe: 5 zones named Z1 ... Z5, with 25 spanservers each
+ The test database was sharded into 1250 Paxos groups
+ 100 test clients constantly issued non-snapshot reads at an aggregate rate of 50K reads/second
Availability
+ All of the leaders were explicitly placed in Z1.
+ Five seconds into each test, all of the servers in one zone
were killed:
+ non-leader kills Z2
+ leader-hard kills Z1
+ leader-soft kills Z1
Availability
Fig: Effect of killing servers on throughput.
Availability
+ Non-leader kills Z2: no effect on read throughput
+ Leader-soft kills Z1:
+ the servers are first notified that they should hand off leadership
+ the time to hand off leadership to a different zone has only a minor effect: a throughput drop of around 3-4%
Availability
+ Leader-hard kills Z1:
+ severe effect: the rate of completion drops almost to 0
+ After Leaders get re-elected:
+ throughput of the system rises to approximately
100K reads/second because
1. Extra capacity in the system
2. Operations are queued while the leader is
unavailable
+ Goes back to its steady-state rate
Availability
+ Paxos leader leases are set to 10 seconds
+ After killing the zone, the leader-lease expiration times for
the groups should be evenly distributed over the next 10
seconds
+ Soon after each lease from a dead leader expires, a new
leader is elected
+ So, Approximately 10 seconds after the kill time
+ all of the groups have leaders and throughput has recovered
Availability
+ Shorter lease times would reduce the effect of server deaths
on availability
+ Trade-off greater amounts of lease-renewal network
traffic
+ New mechanism is under development
+ mechanism that will cause slaves to release Paxos
leader leases upon leader failure
TrueTime
+ Two questions must be answered with respect to TrueTime:
+ Is ε truly a bound on clock uncertainty?
+ How bad does ε get?
+ Bad CPUs are 6 times more likely than bad clocks
TrueTime
+ TrueTime data taken at several thousand spanserver machines across datacenters up to 2200 km apart
+ It plots the 90th, 99th, and 99.9th percentiles of ε, sampled at timeslave daemons immediately after polling the time masters
+ This sampling elides the sawtooth in ε due to local-clock uncertainty, and therefore measures time-master uncertainty (which is generally 0) plus communication delay to the time masters
TrueTime
Fig: Distribution of TrueTime ε values, sampled right after the timeslave daemon polls the time masters. The 90th, 99th, and 99.9th percentiles are graphed.
F1
+ Google’s advertising backend called F1
+ based on a MySQL database
+ manually sharded many ways
+ uncompressed dataset is tens of terabytes
+ assigned each customer and all related data to a fixed
shard
+ Resharding took over two years of intense effort
+ Involved coordination and testing across dozens of teams to
minimize risk
+ Storing some data in external Bigtables
+ compromising transactional behavior
+ ability to query across all data
F1
+ Spanner for F1
a. Spanner removes the need to manually reshard
b. Spanner provides synchronous replication
c. Automatic failover
d. F1 requires strong transactional semantics, which made using
other NoSQL systems impractical
+ transactions across arbitrary data, and consistent reads
e. F1 team also needed secondary indexes on their data
+ Spanner does not support secondary indexes
+ F1 team implemented their own consistent global indexes using
Spanner transactions
F1
+ Distribution of the number of fragments per directory in F1:
a. a directory typically corresponds to a customer
b. the vast majority of directories consist of only 1 fragment
+ Directories with more than 100 fragments are all tables that contain F1 secondary indexes
F1
Fig: Distribution of directory-fragment counts in F1
F1
+ All 3 replicas in the east-coast data centers are given
higher priority in choosing Paxos leaders
+ Measured from F1 servers in those data centers
+ The standard deviation of write latencies is caused by lock conflicts
+ The larger standard deviation in read latencies is partly because Paxos leaders are spread across two data centers, only one of which has machines with SSDs
+ The measurement includes every read in those data centers: mean read size 1.6 KB, standard deviation 119 KB
F1
Fig: F1-perceived operation latencies measured over the course of 24 hours.
Conclusions
+Reify clock uncertainty in time APIs
+Known unknowns are better than unknown unknowns
+Rethink algorithms to make use of uncertainty
+Stronger semantics are achievable
+Greater scale != weaker semantics
Thanks! Any questions?
Editor's Notes
1. Bad hosts are evicted. Timemasters check themselves against other timemasters; clients check themselves against timemasters.
2. The data shows that these two factors in determining the base value of ε are generally not a problem. However, there can be significant tail-latency issues that cause higher values of ε. The reduction in tail latencies beginning on March 30 was due to networking improvements that reduced transient network-link congestion. The increase in ε on April 13, approximately one hour in duration, resulted from the shutdown of 2 time masters at a datacenter for routine maintenance. We continue to investigate and remove causes of TrueTime spikes.