BIG DATA 101, FOUNDATIONAL
KNOWLEDGE FOR A NEW PROJECT
IN 2017
@doanduyhai
Technical Advocate @ Datastax
Apache Zeppelin™ Committer
Who Am I ?
Duy Hai DOAN
Technical Advocate @ Datastax
•  talks, meetups, confs
•  open-source devs (Achilles, Zeppelin,…)
•  OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
Apache Zeppelin™ committer
Agenda
1) Distributed systems theories & properties
2) Data sharding & replication
3) CAP theorem
4) Distributed systems architecture: master/slave vs masterless
Distributed systems theories
Time
Ordering
Latency
Failure
Consensus
Time
There is no absolute time, even in theory (and even with atomic clocks!)
Time drift is unavoidable
•  unless you provide an atomic clock to each server
•  unless you’re Google
NTP is your friend ☞ configure it properly !
Ordering of operations
How to order operations ?
What does before/after mean ?
•  when clocks are not 100% reliable
•  when operations occur on multiple machines …
•  … that live in multiple continents (1000s km distance)
Ordering of operations
Local/relative ordering is possible
Global ordering ?
•  either execute all operations on a single machine (☞ master)
•  or ensure time is perfectly synchronized on all machines executing the operations (really feasible ?)
Known algorithms
Lamport clock
•  algorithm for message sender
•  algorithm for message receiver
•  partial ordering between a pair of (sender, receiver) is possible
// sender
time = time + 1;
time_stamp = time;
send(message, time_stamp);

// receiver
(message, time_stamp) = receive();
time = max(time_stamp, time) + 1;
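The sender/receiver rules above can be sketched in Python (the class and method names are illustrative, not from the slides):

```python
# Minimal Lamport clock sketch: each process keeps a logical counter.
class LamportClock:
    def __init__(self):
        self.time = 0

    def on_send(self):
        # sender: tick, then stamp the outgoing message
        self.time += 1
        return self.time

    def on_receive(self, msg_timestamp):
        # receiver: jump past the sender's stamp, then tick
        self.time = max(msg_timestamp, self.time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
ts = a.on_send()   # a.time == 1, message stamped 1
b.on_receive(ts)   # b.time == max(1, 0) + 1 == 2
```

Stamps grow along every message chain, which is exactly the partial ordering between a (sender, receiver) pair.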
Known algorithms
Vector clock
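A vector clock extends the Lamport idea with one counter per process, which additionally makes concurrent updates detectable. A minimal sketch (helper names are illustrative):

```python
# Vector clock sketch: a dict mapping process id -> counter.
def vc_increment(vc, pid):
    vc = dict(vc)
    vc[pid] = vc.get(pid, 0) + 1
    return vc

def vc_merge(local, remote, pid):
    # on receive: entry-wise max of both clocks, then tick own entry
    merged = {p: max(local.get(p, 0), remote.get(p, 0))
              for p in set(local) | set(remote)}
    return vc_increment(merged, pid)

def vc_happened_before(a, b):
    # a -> b iff every entry of a <= b and at least one is strictly smaller
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

a = vc_increment({}, "A")   # A does one local event: {"A": 1}
b = vc_merge({}, a, "B")    # B receives A's clock:   {"A": 1, "B": 1}
```

If neither `vc_happened_before(a, b)` nor `vc_happened_before(b, a)` holds, the two events are concurrent, something a single Lamport counter cannot tell you.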
Latency
Def: time interval between request & response.
Latency is composed of
•  network delay: router/switch delay + physical medium delay
•  OS delay (negligible)
•  time to process the query by the target (disk access, computation …)
Latency
Speed of light physics
•  ≈ 300 000 km/s in the void
•  ≈ 197 000 km/s in fiber optic cable (due to the refractive index)
London – New York as-the-crow-flies distance ≈ 5500 km ☞ 28 ms for a one-way trip
Conclusion: a ping between London and New York cannot take less than 56 ms
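The numbers above can be checked with a quick back-of-the-envelope computation:

```python
# Physical floor on London - New York latency, from the figures above.
SPEED_IN_FIBER_KM_S = 197_000   # ~ speed of light / refractive index of fiber
distance_km = 5_500             # approximate great-circle distance

one_way_ms = distance_km / SPEED_IN_FIBER_KM_S * 1000
round_trip_ms = 2 * one_way_ms

print(round(one_way_ms))     # ~28 ms one way
print(round(round_trip_ms))  # ~56 ms: no ping can beat this
```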
"The mean latency is
below 10ms"
Database vendor X
✔︎ ✘︎
Failure modes
•  Byzantine failure: same input, different outputs → application bug !!!
•  performance failure: the response is correct but arrives too late
•  omission failure: special case of performance failure, no response at all (timeout)
•  crash failure: self-explanatory, the server stops responding
Byzantine failure → value issue
Other failures → timing issue
Failure
Root causes
•  hardware: disk, CPU, …
•  software: packet loss, process crash, OS crash …
•  workload-specific: flushing a huge file to a SAN (🙀)
•  JVM-related: long GC pause
Defining failure is hard
"A server fails when it does
not respond to one or
multiple request(s) in a
timely manner"
Usual meaning of failure
Failure detection
Timely manner ☞ timeout!
Failure detector:
•  heartbeat: binary state (up/down), too simple
•  exponential backoff with threshold: better model
•  phi accrual detector: advanced model using statistics
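To illustrate why the phi accrual detector improves on a binary heartbeat, here is a toy sketch: suspicion grows continuously with silence instead of flipping at a single timeout. The exponential inter-arrival assumption and the threshold value are simplifications, not the production algorithm:

```python
import math

# Toy phi accrual sketch: phi = -log10(P(still no heartbeat after t)),
# assuming exponential heartbeat inter-arrival times with a learned mean.
def phi(time_since_last, mean_interval):
    p_later = math.exp(-time_since_last / mean_interval)
    return -math.log10(p_later)

mean = 1.0  # seconds between heartbeats, estimated from history

# suspicion grows smoothly as silence lengthens
assert phi(1.0, mean) < phi(10.0, mean)

# typical rule: declare the node down once phi crosses a threshold
suspect = phi(20.0, mean) > 8
```

Phi of 1 roughly means "1 chance in 10 this is a false positive", phi of 8 means 1 in 10^8, so the operator tunes a probability rather than a raw timeout.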
Distributed consensus protocols
Since time is unreliable, global ordering is hard to achieve and failure is hard to detect ...
... so how can different machines agree on a single value ?
Important properties:
•  validity: the agreed value must have been proposed by some process
•  termination: every non-faulty process eventually decides
•  agreement: all processes agree on the same value
Distributed consensus protocols
Two-phase commit (2PC)
•  termination KO: the protocol can block if the coordinator fails
Three-phase commit (3PC)
•  agreement KO: in case of network partition, possibility of an inconsistent state
Paxos, RAFT & Zab (Zookeeper)
•  OK: satisfy all 3 requirements
•  QUORUM-based: require a strict majority of copies/replicas to be alive
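A quorum is just a strict majority, and two quorums over the same replica set always intersect, which is the property the QUORUM-based protocols above rely on. A trivial sketch:

```python
# QUORUM sketch: a strict majority of replicas.
def quorum(n_replicas):
    return n_replicas // 2 + 1

assert quorum(3) == 2
assert quorum(5) == 3

# any two quorums of the same replica set must overlap on at least one
# replica, so a decided value can never be "forgotten" by a later quorum
assert quorum(3) + quorum(3) > 3
assert quorum(5) + quorum(5) > 5
```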
Data sharding & replication
Data Sharding
Why sharding ?
•  scalability: map logical shards to physical hardware (machines/racks, ...)
•  divide & conquer: each shard represents the DB at a smaller scale
How to shard ?
•  user-defined algorithm: the user chooses both the sharding algorithm & the target columns the algorithm applies to
•  fixed algorithm: the DB imposes the sharding algorithm. The user only decides which columns to apply it to. Ex: user_id
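A fixed sharding algorithm boils down to hashing the sharding key and mapping the hash onto a shard. A minimal sketch (MD5 stands in for whatever hash the DB imposes, and the modulo placement is a simplification of real token ranges):

```python
import hashlib

# Hash-based sharding sketch: hash the sharding key, map onto a shard.
def shard_for(key, n_shards):
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n_shards

# distinct user_ids spread roughly (not perfectly) evenly over 5 shards
counts = [0] * 5
for user_id in range(10_000):
    counts[shard_for(user_id, 5)] += 1

assert sum(counts) == 10_000
assert max(counts) - min(counts) < 500   # roughly even ownership
```

Compare this with a user-defined scheme such as "first letter of the email": the hash destroys any skew in the key values themselves.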
Data Sharding
Example of user-defined sharding
•  user data with sharding key == user_id, sharding algo == MD5 🙂
(Chart: MD5 data distribution, data ownership in % per shard 0-19, 20-39, 40-59, 60-79, 80-99: 18%, 24%, 17%, 19%, 22%, roughly even)
Data Sharding
Example of user-defined sharding
•  user data with sharding key == email, sharding algo == take 1st letter 😱
(Chart: 1st-letter data distribution, data ownership in % per shard a-c, e-h, m-p, q-t, u-x, y-z: 19%, 32%, 27%, 15%, 5%, 2%, heavily skewed)
Data Sharding
Example of fixed sharding algo Murmur3
•  user data with sharding key == user_id or whatever key 😎
(Chart: Murmur3 data distribution, data ownership in % per shard 0-19, 20-39, 40-59, 60-79, 80-99: 19%, 23%, 18%, 19%, 21%, roughly even)
"With Murmur3 we are
guaranteed to have
even data distribution"
✔︎ ✘︎
Dice rolling experiment
It’s all about statistics !
Data Sharding Trade-off
Logical sharding (with ordering)
•  can lead to hotspots & imbalance in data distribution
•  but allows range queries
•  WHERE sharding_key >= xxx AND sharding_key <= yyy
Hash-based sharding
•  guarantees uniform distribution (with sufficiently many distinct shard key values)
•  range queries not possible, only point queries
•  ✘ WHERE sharding_key >= xxx AND sharding_key <= yyy
•  ✔︎ WHERE sharding_key == zzz
Data Sharding and Rebalancing
For some category of NoSQL solutions
•  range queries are mandatory → hotspots are not avoidable !!!
•  mainly K/V databases, some wide-column databases too
Rebalancing is necessary
•  sometimes an automated process
•  sometimes a manual admin process 😭
•  resource-intensive operation (CPU, disk I/O + network) → impacts live production traffic
Data Replication
How ? By having multiple copies
Types of replicas
•  symmetric: no role, every replica is similar to the others
•  asymmetric: "master/slave" style. All operations (read/write) should go through a single server
Replica definition
•  symmetric: 1 replica == 1 copy. 3 replicas == 3 copies in total
•  asymmetric: 1 replica == 1 slave copy. Total copies = master + replica(s)
Data Replication
(Diagram: symmetric replicas, write operations: the client dispatches the write in parallel to Replica1, Replica2, Replica3)
(Diagram: symmetric replicas, read operations: the client reads directly from the replicas)
(Diagram: asymmetric replicas, write operations: client → Master → Replica1, Replica2, Replica3)
(Diagram: asymmetric replicas, read operations: every read goes through the Master → BOTTLENECK !!!)
(Diagram: asymmetric replicas, read operations from slaves: ✘)
Data Replication
Asymmetric replicas, common write failure scenarios:
•  message lost (network) ✘ → Master never receives the ack → KO
•  write dropped (overload) ✘ → Master never receives the ack → KO
•  replica crashed right away ✘ → Master never receives the ack → KO
Data Replication
Asymmetric replicas, tricky write failure scenarios:
•  ack lost (network) ✘ → Master never receives the ack → KO !!! (yet the replica did apply the write)
•  replica crashes AFTER sending the ack but BEFORE flushing data to disk → Master receives the ack → OK ? (yet the write is lost)
CAP Theorem
Pick 2 out of 3
CAP theorem
Conjecture by Brewer, formalized later in a paper (2002):
The CAP theorem states that any networked shared-data system can have at most two of three desirable properties:
•  consistency (C): equivalent to having a single up-to-date copy of the data
•  high availability (A) of that data (for updates)
•  tolerance to network partitions (P)
CAP triangle
CAP theorem revised (2012)
You cannot choose not to be partition-tolerant
The choice is not that binary:
•  in the absence of partitions, you can tend toward CA
•  when a partition occurs, choose your side (C or A)
☞ tunable consistency
What is Consistency ?
The meaning is different from the C of ACID
(Diagram: hierarchy of consistency models: Eventual Consistency, Read Your Writes, Pipelined RAM, Causal, Read Uncommitted, Read Committed, Cursor Stability, Repeatable Read are achievable without coordination; Snapshot Isolation, Linearizability and Serializability require coordination)
Consistency with some CP (supposedly) system
(Censored screenshot: consistency documentation of some DB)
Consistency with some AP system
Cassandra tunable consistency
(Diagram: at Consistency Level ONE, Cassandra sits at Eventual Consistency on the consistency models hierarchy)
Consistency with some AP system
Cassandra tunable consistency
(Diagram: at Consistency Level QUORUM, Cassandra moves up to Read Your Writes on the consistency models hierarchy)
Consistency with some AP system
Cassandra tunable consistency
(Diagram: with LightWeight Transactions, Cassandra reaches Linearizability on the consistency models hierarchy)
Single partition writes are linearizable
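The tunable part can be summarized by the classic rule: with N replicas, a read of R replicas is guaranteed to overlap the latest write of W replicas whenever R + W > N. A trivial sketch:

```python
# Tunable consistency sketch: reads of R and writes of W replicas
# intersect on at least one replica whenever R + W > N.
def overlaps(n, r, w):
    return r + w > n

N = 3
assert overlaps(N, r=2, w=2)        # QUORUM + QUORUM: reads see latest write
assert not overlaps(N, r=1, w=1)    # ONE + ONE: eventual consistency only
assert overlaps(N, r=1, w=3)        # write ALL, read ONE also works
```

This is why QUORUM reads combined with QUORUM writes buy you read-your-writes, while ONE/ONE leaves you at eventual consistency.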
What is availability ?
Ability to:
•  Read in the case of failure ?
•  Write in the case of failure ?
Brewer definition: high availability of the data (for updates)
Real world example
Cassandra claims to be highly available, is it true ?
Some marketing slides even claim continuous availability (100% uptime), is it true ?
Network partition scenario with Cassandra
(Diagram: healthy Cassandra cluster; read/write at consistency level ONE succeeds ✔︎)
Network partition scenario with Cassandra
(Diagram: the same cluster under network partition; read/write at consistency level ONE can fail ✘)
So how can it be highly available ???
(Diagram: two clusters, US DataCenter and EU DataCenter; even with the inter-datacenter link cut ✘, each side keeps serving reads/writes at consistency level ONE)
Datacenter-aware load balancing strategy at driver level
Architecture
Master/Slave vs Masterless
Pure master/slave architecture
A single server handles all writes; reads can be done on the master or any slave
Advantages
•  operations can be serialized
•  easy to reason about
•  pre-aggregation is possible
Drawbacks
•  cannot scale writes (reads can be scaled)
•  single point of failure (SPOF)
Master/slave SPOF
(Diagram: write request → MASTER → SLAVE1, SLAVE2, SLAVE3)
Multi-master/slave layout
(Diagram: a proxy layer routes each write request to the master of the right shard: MASTER1 with SLAVE11-13 for Shard1, MASTER2 with SLAVE21-23 for Shard2, …)
"Failure of a shard-master is
not a problem because it
takes less than 10ms to elect
a slave into a master"
Wrong Objection Rhetoric
The wrong objection rhetoric
How long does it take to detect that a shard-master has failed ?
•  a plain heartbeat is too simple, so it is not used
•  so detection usually happens after a timeout, after several successive retries
Timeouts are usually in the tens of seconds
•  you cannot write during this time period
Multi-master/slave architecture
Distribute data between shards. One master per shard
Advantages
•  operations can still be serialized in a single shard
•  easy to reason about in a single shard
•  no more big SPOF
Drawbacks
•  consistent only within a single shard (unless a global lock is used)
•  multiple small points of failure (SPOF inside each shard)
•  global pre-aggregation is no longer possible
Fake masterless/shared-nothing architecture
In reality, a multi-master architecture …
… but branded as a shared-nothing/masterless architecture
(Censored screenshots: the vendor's official doc and technical overview doc, as of Dec. 2016 and May 2017)
Remember this ?
Beware of marketing!
Shared-nothing architecture
Masterless architecture
Primary-shard == hidden master
Masterless architecture
No master, every node has equal role
☞ how to manage consistency then if there is no master ?
☞ which replica has the right value of my data ?
Some data structures to the rescue:
•  vector clock
•  CRDT (Convergent Replicated Data Type)
Masterless architecture
(Diagram: the client sends its request to one node, the coordinator, which forwards it to the 3 replicas)
Notion of coordinator
•  just a network proxy !
•  what if the coordinator dies ???
Masterless architecture
(Diagram: the client simply picks another node as the new coordinator: anyone can be coordinator !!!)
CRDT
Riak
•  Registers
•  Counters
•  Sets
•  Maps
•  …
Cassandra only proposes the LWW-register (Last Write Wins)
•  based on the write timestamp
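For contrast with the LWW-register, here is a sketch of a G-Counter, one of the simplest convergent CRDTs (the kind of structure Riak exposes as Counters). The helper names are illustrative:

```python
# G-Counter sketch: one grow-only slot per replica; merge is entry-wise
# max, so replicas converge no matter the order of state exchanges.
def g_increment(state, replica_id, amount=1):
    state = dict(state)
    state[replica_id] = state.get(replica_id, 0) + amount
    return state

def g_merge(a, b):
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def g_value(state):
    return sum(state.values())

r1 = g_increment({}, "r1")        # replica r1 counts 1
r2 = g_increment({}, "r2", 2)     # replica r2 counts 2

# merge is commutative: both replicas end up with the same state
assert g_merge(r1, r2) == g_merge(r2, r1)
assert g_value(g_merge(r1, r2)) == 3
```

No timestamps and no conflict: the data structure itself guarantees convergence, at the price of being specialized (this one can only count up).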
Timestamp, again …
But didn’t we say that timestamps are not really reliable ?
Why not implement pure CRDTs ?
Why choose the LWW-register ?
•  because last-write-wins is still the most "intuitive"
•  because conflict resolution with other CRDTs is the user’s responsibility
•  because one should not be required to have a PhD in CS to use Cassandra
Example of write conflict with Cassandra
(Diagram: a client issues UPDATE users SET age=32 WHERE id=1; the coordinator's local time is 10:00:01.050, so the replicas store age=32 @ 10:00:01.050)
(Diagram: another client then issues UPDATE users SET age=33 WHERE id=1 through a coordinator whose local time is 10:00:01.020; the replicas now hold both age=32 @ 10:00:01.050 and age=33 @ 10:00:01.020)
(Diagram: on read, the cell with the highest timestamp wins: age=32, so the later update to 33 is silently discarded)
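This conflict is exactly what an LWW-register produces: the highest write timestamp wins, regardless of arrival order. A minimal sketch reproducing the scenario:

```python
# LWW-register sketch: values are (value, timestamp) pairs and the
# highest timestamp wins, regardless of the order writes arrived in.
def lww_merge(a, b):
    return a if a[1] >= b[1] else b

first  = (32, "10:00:01.050")   # written via a coordinator whose clock is ahead
second = (33, "10:00:01.020")   # issued later, but stamped earlier

winner = lww_merge(first, second)
assert winner == (32, "10:00:01.050")
# the age=33 update is silently lost: timestamps decide, not arrival order
```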
Example of write conflict
How can we cope with this ?
•  it is functionally rare to have updates on the same column by different clients at almost the same time (a few milliseconds apart)
•  one can also force the timestamp client-side (but now the clients need to be synchronized …)
•  one can always use a LightWeight Transaction to guarantee linearizability
UPDATE user SET age = 33 WHERE id = 1 IF age = 32
Masterless architecture
Advantages
•  no SPOF
•  no failover procedure
•  can achieve 0 downtime with correct tuning
Drawbacks
•  hard to reason about
•  requires some knowledge of distributed systems
•  pre-aggregation is no longer possible
Thank You !

More Related Content

What's hot

From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
DataStax
 

What's hot (20)

Datastax day 2016 : Cassandra data modeling basics
Datastax day 2016 : Cassandra data modeling basicsDatastax day 2016 : Cassandra data modeling basics
Datastax day 2016 : Cassandra data modeling basics
 
Spark Cassandra 2016
Spark Cassandra 2016Spark Cassandra 2016
Spark Cassandra 2016
 
Cassandra 3 new features 2016
Cassandra 3 new features 2016Cassandra 3 new features 2016
Cassandra 3 new features 2016
 
Advanced data modeling with apache cassandra
Advanced data modeling with apache cassandraAdvanced data modeling with apache cassandra
Advanced data modeling with apache cassandra
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
 
Introduction to CQL and Data Modeling with Apache Cassandra
Introduction to CQL and Data Modeling with Apache CassandraIntroduction to CQL and Data Modeling with Apache Cassandra
Introduction to CQL and Data Modeling with Apache Cassandra
 
Spark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureSpark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and Furure
 
Spark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-CasesSpark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-Cases
 
Cassandra 2.0 and timeseries
Cassandra 2.0 and timeseriesCassandra 2.0 and timeseries
Cassandra 2.0 and timeseries
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
 
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
 
Spark cassandra integration 2016
Spark cassandra integration 2016Spark cassandra integration 2016
Spark cassandra integration 2016
 
Fast track to getting started with DSE Max @ ING
Fast track to getting started with DSE Max @ INGFast track to getting started with DSE Max @ ING
Fast track to getting started with DSE Max @ ING
 
Cassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series ModelingCassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series Modeling
 
Introduction to Cassandra Basics
Introduction to Cassandra BasicsIntroduction to Cassandra Basics
Introduction to Cassandra Basics
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
 
Getting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of DatastaxGetting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of Datastax
 
Apache cassandra in 2016
Apache cassandra in 2016Apache cassandra in 2016
Apache cassandra in 2016
 
Real data models of silicon valley
Real data models of silicon valleyReal data models of silicon valley
Real data models of silicon valley
 

Similar to Big data 101 for beginners riga dev days

Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
xlight
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Kristofferson A
 

Similar to Big data 101 for beginners riga dev days (20)

Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
NOSQL in the Cloud
NOSQL in the CloudNOSQL in the Cloud
NOSQL in the Cloud
 
Survey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data LandscapeSurvey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data Landscape
 
Navigating Transactions: ACID Complexity in Modern Databases
Navigating Transactions: ACID Complexity in Modern DatabasesNavigating Transactions: ACID Complexity in Modern Databases
Navigating Transactions: ACID Complexity in Modern Databases
 
Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...
Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...
Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...
 
Distributed systems at ok.ru #rigadevday
Distributed systems at ok.ru #rigadevdayDistributed systems at ok.ru #rigadevday
Distributed systems at ok.ru #rigadevday
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010
 
Intro to Cassandra
Intro to CassandraIntro to Cassandra
Intro to Cassandra
 
Open west 2015 talk ben coverston
Open west 2015 talk ben coverstonOpen west 2015 talk ben coverston
Open west 2015 talk ben coverston
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Cassandra drivers and libraries
Cassandra drivers and librariesCassandra drivers and libraries
Cassandra drivers and libraries
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
 
Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)
 

More from Duyhai Doan

Apache zeppelin, the missing component for the big data ecosystem
Apache zeppelin, the missing component for the big data ecosystemApache zeppelin, the missing component for the big data ecosystem
Apache zeppelin, the missing component for the big data ecosystem
Duyhai Doan
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 

More from Duyhai Doan (15)

Pourquoi Terraform n'est pas le bon outil pour les déploiements automatisés d...
Pourquoi Terraform n'est pas le bon outil pour les déploiements automatisés d...Pourquoi Terraform n'est pas le bon outil pour les déploiements automatisés d...
Pourquoi Terraform n'est pas le bon outil pour les déploiements automatisés d...
 
Le futur d'apache cassandra
Le futur d'apache cassandraLe futur d'apache cassandra
Le futur d'apache cassandra
 
Spark zeppelin-cassandra at synchrotron
Spark zeppelin-cassandra at synchrotronSpark zeppelin-cassandra at synchrotron
Spark zeppelin-cassandra at synchrotron
 
Cassandra 3 new features @ Geecon Krakow 2016
Cassandra 3 new features  @ Geecon Krakow 2016Cassandra 3 new features  @ Geecon Krakow 2016
Cassandra 3 new features @ Geecon Krakow 2016
 
Algorithme distribués pour big data saison 2 @DevoxxFR 2016
Algorithme distribués pour big data saison 2 @DevoxxFR 2016Algorithme distribués pour big data saison 2 @DevoxxFR 2016
Algorithme distribués pour big data saison 2 @DevoxxFR 2016
 
Apache Zeppelin @DevoxxFR 2016
Apache Zeppelin @DevoxxFR 2016Apache Zeppelin @DevoxxFR 2016
Apache Zeppelin @DevoxxFR 2016
 
Apache zeppelin the missing component for the big data ecosystem
Apache zeppelin the missing component for the big data ecosystemApache zeppelin the missing component for the big data ecosystem
Apache zeppelin the missing component for the big data ecosystem
 
Cassandra UDF and Materialized Views
Cassandra UDF and Materialized ViewsCassandra UDF and Materialized Views
Cassandra UDF and Materialized Views
 
Data stax academy
Data stax academyData stax academy
Data stax academy
 
Apache zeppelin, the missing component for the big data ecosystem
Apache zeppelin, the missing component for the big data ecosystemApache zeppelin, the missing component for the big data ecosystem
Apache zeppelin, the missing component for the big data ecosystem
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Distributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeConDistributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeCon
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practice
 
Algorithmes distribues pour le big data @ DevoxxFR 2015
Algorithmes distribues pour le big data @ DevoxxFR 2015Algorithmes distribues pour le big data @ DevoxxFR 2015
Algorithmes distribues pour le big data @ DevoxxFR 2015
 
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisReal time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
Big data 101 for beginners riga dev days

  • 1. BIG DATA 101, FOUNDATIONAL KNOWLEDGE FOR A NEW PROJECT IN 2017 @doanduyhai Technical Advocate @ Datastax Apache Zeppelin™ Committer @doanduyhai1
  • 2. Who Am I ? Duy Hai DOAN Technical Advocate @ Datastax •  talks, meetups, confs •  open-source devs (Achilles, Zeppelin,…) •  OSS Cassandra point of contact ☞ duy_hai.doan@datastax.com ☞ @doanduyhai Apache Zeppelin™ committer @doanduyhai2
  • 3. Agenda 1) Distributed systems theories & properties 2) Data sharding & replication 3) CAP theorem 4) Distributed systems architecture: master/slave vs masterless @doanduyhai3
  • 5. Time There is no absolute time in theory (even with atomic clocks!) Time-drift is unavoidable •  unless you provide atomic clock to each server •  unless you’re Google NTP is your friend ☞ configure it properly ! @doanduyhai5
  • 6. Ordering of operations How to order operations ? What does before/after mean ? •  when clock is not 100% reliable •  when operations occur on multiple machines … •  … that live in multiple continents (1000s km distance) @doanduyhai6
  • 7. Ordering of operations Local/relative ordering is possible Global ordering ? •  either execute all operations on single machine (☞ master) •  or ensure time is perfectly synchronized on all machines executing the operations (really feasible ?) @doanduyhai7
  • 8. Known algorithms Lamport clock •  algorithm for message sender •  algorithm for message receiver •  partial ordering between a pair of (sender, receiver) is possible @doanduyhai8 time = time+1; time_stamp = time; send(Message, time_stamp); (message, time_stamp) = receive(); time = max(time_stamp, time)+1;
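The sender/receiver rules above can be turned into a minimal Python sketch (class and variable names are illustrative, not from any library):

```python
class LamportClock:
    """Logical clock: gives a partial ordering of events between
    a (sender, receiver) pair without relying on wall-clock time."""

    def __init__(self):
        self.time = 0

    def send(self):
        # sender rule: time = time + 1, stamp the outgoing message
        self.time += 1
        return self.time

    def receive(self, time_stamp):
        # receiver rule: time = max(time_stamp, time) + 1
        self.time = max(self.time, time_stamp) + 1

# one message from process A to process B
a, b = LamportClock(), LamportClock()
ts = a.send()
b.receive(ts)
print(a.time, b.time)   # 1 2: the receive event is ordered after the send
```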
  • 10. Latency Def: time interval between request & response. Latency is composed of •  network delay: router/switch delay + physical medium delay •  OS delay (negligible) •  time to process the query by the target (disk access, computation …) @doanduyhai10
  • 11. Latency Speed of light physics •  ≈ 300 000 km/s in a vacuum •  ≈ 197 000 km/s in fiber optic cable (due to the refractive index) London – New York bird flight distance ≈ 5500 km → 28ms for a one way trip Conclusion: a ping between London – New York cannot take less than 56ms @doanduyhai11
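The slide's arithmetic, as a quick sanity check (numbers taken from the slide):

```python
# speed of light in fiber and the London-New York distance, from the slide
FIBER_KM_PER_S = 197_000
LONDON_NY_KM = 5_500

one_way_ms = LONDON_NY_KM / FIBER_KM_PER_S * 1_000
round_trip_ms = 2 * one_way_ms
print(round(one_way_ms), round(round_trip_ms))   # 28 56: physics bounds any ping
```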
  • 12. @doanduyhai12 "The mean latency is below 10ms" Database vendor X ✔︎ ✘︎
  • 13. @doanduyhai13 "The mean latency is below 10ms" Database vendor X ✔︎ ✘︎
  • 14. Failure modes •  Byzantine failure: same input, different outputs → application bug !!! •  Performance failure: response correct but arrives too late •  Omission failure: special case of performance failure, no response (timeout) •  Crash failure: self-explanatory, server stops responding Byzantine failure → value issue Other failures → timing issue @doanduyhai14
  • 15. Failure Root causes •  Hardware: disk, CPU, … •  Software: packet lost, process crash, OS crash … •  Workload-specific: flushing huge file to SAN (🙀) •  JVM-related: long GC pause Defining failure is hard @doanduyhai15
  • 16. @doanduyhai16 "A server fails when it does not respond to one or multiple request(s) in a timely manner" Usual meaning of failure
  • 17. Failure detection Timely manner ☞ timeout! Failure detector: •  heart beat: binary state (up/down), too simple •  exponential backoff with threshold: better model •  phi accrual detector: advanced model using statistics @doanduyhai17
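A toy "timeout with adaptive threshold" detector illustrates the idea (a deliberate simplification, not the real phi accrual algorithm; all names are made up):

```python
import statistics

class AdaptiveDetector:
    """Toy failure detector: a node is suspected when the silence since its
    last heartbeat exceeds a multiple of its mean inter-arrival time.
    A phi accrual detector refines this with a probability distribution."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.arrivals = []   # heartbeat timestamps, in seconds

    def heartbeat(self, now):
        self.arrivals.append(now)

    def suspects(self, now):
        if len(self.arrivals) < 2:
            return False
        gaps = [b - a for a, b in zip(self.arrivals, self.arrivals[1:])]
        mean_gap = statistics.mean(gaps)
        return (now - self.arrivals[-1]) > self.threshold * mean_gap

d = AdaptiveDetector()
for t in (0, 1, 2, 3):      # steady heartbeat every second
    d.heartbeat(t)
print(d.suspects(3.5))      # False: still within 3x the mean gap
print(d.suspects(10))       # True: 7s of silence vs a 1s mean gap
```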
  • 18. Distributed consensus protocols Since time is unreliable, global ordering is hard to achieve & failure is hard to detect ... ... how can different machines agree on a single value ? Important properties: •  validity: the agreed value must have been proposed by some process •  termination: at least one non-faulty process eventually decides •  agreement: all processes agree on the same value @doanduyhai18
  • 19. Distributed consensus protocols 2-phase commit •  termination KO: the protocol can be blocked if the coordinator fails 3-phase commit •  agreement KO: in case of network partition, possibility of inconsistent state Paxos, RAFT & Zab (Zookeeper) •  OK: satisfies the 3 requirements •  QUORUM-based: requires a strict majority of copies/replicas to be alive @doanduyhai19
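The strict-majority rule of quorum-based protocols is simple arithmetic; a small sketch:

```python
def quorum(replicas: int) -> int:
    """Strict majority needed by quorum-based protocols (Paxos, RAFT, Zab)."""
    return replicas // 2 + 1

for n in (3, 4, 5):
    # a cluster of n replicas survives the loss of n - quorum(n) of them
    print(f"{n} replicas: quorum={quorum(n)}, tolerated failures={n - quorum(n)}")
```

This is why odd cluster sizes are preferred: 4 replicas tolerate no more failures than 3.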
  • 20. Data sharding & replication @doanduyhai20
  • 21. Data Sharding Why sharding ? •  scalability: map logical shards to physical hardware (machines/racks,...) •  divide & conquer: each shard represents the DB at a smaller scale How to shard ? •  user-defined algorithm: the user chooses the sharding algorithm & the target columns the algorithm applies to. •  fixed algorithm: the DB imposes the sharding algorithm. The user only decides which columns to apply the algorithm to. Ex: user_id @doanduyhai21
  • 22. Data Sharding Example of user-defined sharding •  user data with sharding key == user_id, sharding algo == MD5 🙂 @doanduyhai22 [bar chart "MD5 Data Distribution", data ownership in % per shard: 0-19 → 18, 20-39 → 24, 40-59 → 17, 60-79 → 19, 80-99 → 22]
  • 23. Data Sharding Example of user-defined sharding •  user data with sharding key == email, sharding algo == take 1st letter 😱 @doanduyhai23 [bar chart "1st letter Data Distribution", data ownership in % per shard: a-c → 19, e-h → 32, m-p → 27, q-t → 15, u-x → 5, y-z → 2]
  • 24. Data Sharding Example of fixed sharding algo Murmur3 •  user data with sharding key == user_id or whatever key 😎 @doanduyhai24 [bar chart "Murmur3 Data Distribution", data ownership in % per shard: 0-19 → 19, 20-39 → 23, 40-59 → 18, 60-79 → 19, 80-99 → 21]
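The fixed-algorithm behaviour can be simulated with a standard-library hash (MD5 here; Murmur3 itself is not in Python's standard library, and the helper name is made up):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Hash-based sharding: a key's shard does not depend on its natural
    ordering, so the distribution stays close to uniform."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

counts = [0] * 5
for i in range(10_000):
    counts[shard_for(f"user_{i}", 5)] += 1
print(counts)   # each of the 5 shards owns roughly 20% of the keys
```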
  • 25. @doanduyhai25 "With Murmur3 we are guaranteed to have even data distribution" ✔︎ ✘︎
  • 26. @doanduyhai26 "With Murmur3 we are guaranteed to have even data distribution" ✔︎ ✘︎
  • 31. Data Sharding Trade-off Logical sharding (with ordering) •  can lead to hotspots & imbalance in data distribution •  but allows range queries •  WHERE sharding_key >= xxx AND sharding_key <= yyy Hash-based sharding •  guarantees uniform distribution (with sufficient distinct shard key values) •  range queries (WHERE sharding_key >= xxx AND sharding_key <= yyy) not possible, only point queries •  WHERE sharding_key == zzz @doanduyhai31
  • 32. Data Sharding and Rebalancing For some categories of NoSQL solutions •  range queries are mandatory → hotspots are not avoidable !!! •  mainly K/V databases, some wide column databases too Rebalancing is necessary •  sometimes an automated process •  sometimes a manual admin process 😭 •  resource-intensive operation (CPU, disk I/O + network) → impacts live production traffic @doanduyhai32
  • 33. Data Replication How ? By having multiple copies Type of replicas •  symmetric: no role, each replica is similar to the others •  asymmetric: "master/slave" style. All operations (read/write) should go through a single server Replica definition •  symmetric: 1 replica == 1 copy. 3 replicas == 3 copies in total •  asymmetric: 1 replica == 1 slave copy. Total copies = master + replica(s) @doanduyhai33
  • 34. Data Replication @doanduyhai34 Client Replica1 Replica2 Replica3 Symmetric replicas, write operations Parallel dispatch
  • 35. Data Replication @doanduyhai35 Client Replica1 Replica2 Replica3 Symmetric replicas, read operations
  • 36. Data Replication @doanduyhai36 Master Replica1 Replica2 Replica3 Client Asymmetric replicas, write operations
  • 37. Data Replication @doanduyhai37 Master Replica1 Replica2 Replica3 Client Asymmetric replicas, read operations
  • 38. Data Replication @doanduyhai38 Master Replica1 Replica2 Replica3 Client Asymmetric replicas, read operations BOTTLENECK !!!
  • 39. Data Replication @doanduyhai39 Master Replica1 Replica2 Replica3 Client Asymmetric replicas, read operations from slaves ✘
  • 40. Data Replication @doanduyhai40 Master Replica Asymmetric replicas, common write failure scenarios ✘ Message lost (network) → Master never receives ack → KO Master Replica ✘ Write dropped (overload) → Master never receives ack → KO Master Replica ✘ Replica crashed right away → Master never receives ack → KO
  • 41. Data Replication @doanduyhai41 Master Replica Asymmetric replicas, tricky write failure scenarios ✘ Ack lost (network) → Master never receives ack → KO !!!! Master Replica ✘ Replica crashes AFTER sending ACK but before flushing data to disk → Master receives ack → OK ?
  • 43. CAP theorem @doanduyhai43 Conjecture by Brewer, formalized later in a paper (2002): The CAP theorem states that any networked shared-data system can have at most two of three desirable properties •  consistency (C): equivalent to having a single up-to-date copy of the data •  high availability (A): of that data (for updates) •  and tolerance to network partitions (P)
  • 45. CAP theorem revised (2012) @doanduyhai45 You cannot choose not to be partition-tolerant Choice is not that binary: •  in the absence of partition, you can tend toward CA •  when a partition occurs, choose your side (C or A) ☞ tunable consistency
  • 46. What is Consistency ? @doanduyhai46 Meaning is different from the C of ACID Read Uncommitted Read Committed Cursor Stability Repeatable Read Eventual Consistency Read Your Write Pipelined RAM Causal Snapshot Isolation Linearizability Serializability Without coordination Requires coordination
  • 47. Consistency with some CP (supposedly) system @doanduyhai47 Some DB
  • 48. Consistency with some AP system @doanduyhai48 Cassandra tunable consistency Read Uncommitted Read Committed Cursor Stability Repeatable Read Eventual Consistency Read Your Write Pipelined RAM Causal Snapshot Isolation Linearizability Serializability Without coordination Requires coordination Consistency Level ONE
  • 49. Consistency with some AP system @doanduyhai49 Cassandra tunable consistency Read Uncommitted Read Committed Cursor Stability Repeatable Read Eventual Consistency Read Your Write Pipelined RAM Causal Snapshot Isolation Linearizability Serializability Without coordination Requires coordination Consistency Level QUORUM
  • 50. Consistency with some AP system @doanduyhai50 Cassandra tunable consistency Read Uncommitted Read Committed Cursor Stability Repeatable Read Eventual Consistency Read Your Write Pipelined RAM Causal Snapshot Isolation Linearizability Serializability Without coordination Requires coordination LightWeight Transaction Single partition writes are linearizable
  • 51. What is availability ? @doanduyhai51 Ability to: •  Read in the case of failure ? •  Write in the case of failure ? Brewer definition: high availability of the data (for updates)
  • 52. Real world example @doanduyhai52 Cassandra claims to be highly available, is it true ? Some marketing slides even claim continuous availability (100% uptime), is it true ?
  • 53. Network partition scenario with Cassandra @doanduyhai53 C* C* C* C* C*C* C* C* C* C* C* C* C* Read/Write at Consistency level ONE ✔︎
  • 54. Network partition scenario with Cassandra @doanduyhai54 C* C* C* C* C*C* C* C* C* C* C* C* C* Read/Write at Consistency level ONE ✘︎
  • 55. So how can it be highly available ??? @doanduyhai55 C* C* C* C* C*C* C* C* C* C* C* C* C* Read/Write at Consistency level ONE C* C* C* C* C*C* C* C* C* C* C* C* C* US DataCenter EU DataCenter ✘ Datacenter-aware load balancing strategy at driver level
  • 57. Pure master/slave architecture @doanduyhai57 A single server handles all writes; reads can be done on the master or any slave Advantages •  operations can be serialized •  easy to reason about •  pre-aggregation is possible Drawbacks •  cannot scale on writes (reads can be scaled) •  single point of failure (SPOF)
  • 59. Multi-master/slave layout @doanduyhai59 Write request MASTER1 SLAVE11 SLAVE12 SLAVE13 Shard1 MASTER2 SLAVE21 SLAVE22 SLAVE23 Shard2 … Proxy layer
  • 60. @doanduyhai60 "Failure of a shard-master is not a problem because it takes less than 10ms to elect a slave into a master" Wrong Objection Rhetoric
  • 61. The wrong objection rhetoric @doanduyhai61 How long does it take to detect that a shard-master has failed ? •  heart-beat is not used because it is too simple •  so usually after a timeout, after some successive retries Timeout is usually in the tens of seconds •  you cannot write during this time period
  • 62. Multi-master/slave architecture @doanduyhai62 Distribute data between shards. One master per shard Advantages •  operations can still be serialized in a single shard •  easy to reason about in a single shard •  no more big SPOF Drawbacks •  consistent only in a single shard (unless global lock) •  multiple small points of failure (SPOF inside a shard) •  global pre-aggregation is no longer possible
  • 63. Fake masterless/shared-nothing architecture @doanduyhai63 In reality, multi-master architecture … … but branded as shared-nothing/masterless architecture
  • 66. As of May 2017 Official doc @doanduyhai66 Censored
  • 67. As of May 2017 Technical overview doc @doanduyhai67
  • 68. As of May 2017 Technical overview doc @doanduyhai68 Remember this ?
  • 70. Masterless architecture @doanduyhai70 No master, every node has equal role ☞ how to manage consistency then if there is no master ? ☞ which replica has the right value of my data ? Some data-structures to the rescue: •  vector clock •  CRDT (Convergent Replicated Data Type)
  • 71. Masterless architecture @doanduyhai71 C* C* C* C* C*C* C* C* C* C* Client sending request C* C* C* Notion of coordinator •  just a network proxy ! •  what if the coordinator dies ??? coordinator replica replica replica
  • 72. Masterless architecture @doanduyhai72 C* C* C* C* C*C* C* C* C* C* Client sending request C* C* C* Anyone can be coordinator !!! new coordinator replica replica replica
  • 73. CRDT @doanduyhai73 Riak •  Registers •  Counters •  Sets •  Maps •  … Cassandra only proposes LWW-register (Last Write Win) •  based on write timestamp
  • 74. Timestamp, again … @doanduyhai74 But didn’t we say that timestamps are not really reliable ? Why not implement pure CRDTs ? Why choose the LWW-register ? •  because last-write-win is still the most "intuitive" •  because conflict resolution with other CRDTs is the user’s responsibility •  because one should not be required to have a PhD in CS to use Cassandra
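A minimal sketch of an LWW-register (illustrative only; Cassandra's real cell reconciliation also handles deletes and TTLs, and this class is not its actual implementation):

```python
class LWWRegister:
    """Last-Write-Wins register: the highest write timestamp wins;
    a timestamp tie is broken by comparing the values themselves."""

    def __init__(self):
        self.value, self.timestamp = "", 0

    def write(self, value, timestamp):
        # tuple comparison: timestamp first, value as tie-breaker
        if (timestamp, value) > (self.timestamp, self.value):
            self.value, self.timestamp = value, timestamp

r = LWWRegister()
r.write("age=32", 1050)   # client A, local clock running ahead
r.write("age=33", 1020)   # client B's write arrives later but is stamped older
print(r.value)   # age=32: wall-clock skew decided the "winner"
```

This reproduces the conflict on the next slides: the update stamped 10:00:01.020 loses to the one stamped 10:00:01.050 even though it was issued afterwards.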
  • 75. Example of write conflict with Cassandra @doanduyhai75 C* C* C* C* C*C* C* C* C* C* UPDATE users SET age=32 WHERE id=1 C* C* C* Local time 10:00:01.050 age=32 @ 10:00:01.050 age=32 @ 10:00:01.050 age=32 @ 10:00:01.050
  • 76. Example of write conflict with Cassandra @doanduyhai76 C* C* C* C* C*C* C* C* C* C* UPDATE users SET age=33 WHERE id=1 C* C* C* Local time 10:00:01.020 age=32 @ 10:00:01.050 age=33 @ 10:00:01.020 age=32 @ 10:00:01.050 age=33 @ 10:00:01.020 age=32 @ 10:00:01.050 age=33 @ 10:00:01.020
  • 77. Example of write conflict with Cassandra @doanduyhai77 C* C* C* C* C*C* C* C* C* C* UPDATE users SET age=33 WHERE id=1 C* C* C* Local time 10:00:01.020 age=32 @ 10:00:01.050 age=33 @ 10:00:01.020 age=32 @ 10:00:01.050 age=33 @ 10:00:01.020 age=32 @ 10:00:01.050 age=33 @ 10:00:01.020
  • 78. Example of write conflict @doanduyhai78 How can we cope with this ? •  It’s functionally rare to have an update on the same column by different clients at almost the same time (a few millisecs apart) •  can also force the timestamp at client-side (but need to synchronize clients now …) •  can always use LightWeight Transaction to guarantee linearizability UPDATE user SET age = 33 WHERE id = 1 IF age = 32
  • 79. Masterless architecture @doanduyhai79 Advantages •  no SPOF •  no failover procedure •  can achieve 0 downtime with correct tuning Drawbacks •  hard to reason about •  requires some knowledge about distributed systems •  pre-aggregation not possible