2. A collection of computers that appears to its users as a single computer.
Characteristics
The computers operate concurrently.
The computers fail independently.
The computers do not share a global clock.
Examples
Amazon.com
Cassandra Database
Is a multi-core processor a distributed system?
Is a single-core computer with peripherals (Wi-Fi, printer, multiple displays, etc.) a distributed
system?
4. Single-master storage
A single powerful machine; scale up as load grows (vertical scaling).
Type of loads
Read-heavy load
Write-heavy load
Mixed read-write load
Scaling Strategies / Data distribution
Read Replication (scaling out)
Sharding (scaling out)
5. Master node – All updates must pass through the master node.
Follower nodes – Receive asynchronous replication (data propagation) from the master
node.
Read requests can be fulfilled by follower nodes, which increases the overall read I/O of the system.
Problems with this design:
Increased complexity of replication.
No guarantee of strong consistency.
Read-after-write scenarios do not guarantee the latest value.
The master node can become a bottleneck for write requests.
This model is suitable for read-heavy workloads.
Example: Google search engine, relational databases with clustering.
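The master/follower flow above can be sketched with two hypothetical classes (an in-memory dict stands in for real storage; in practice replication is asynchronous, so followers can lag):

```python
import random

class Follower:
    """Serves reads; may return stale data until replication arrives."""
    def __init__(self):
        self.data = {}

    def replicate(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

class Master:
    """Accepts all writes and propagates them to followers."""
    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        # Sketch only: done synchronously here, asynchronously in real systems.
        for f in self.followers:
            f.replicate(key, value)

followers = [Follower() for _ in range(2)]
master = Master(followers)
master.write("user:1", "Alice")
# Reads are spread across followers, increasing overall read I/O.
print(random.choice(followers).read("user:1"))  # Alice (once replicated)
```

The read-after-write problem appears exactly at the commented line: if replication were truly asynchronous, a read issued immediately after the write could still return `None`.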
7. Data distribution techniques
Sharding
Used in relational databases and distributed databases.
Ranges from manual to completely automated, depending on the scheme.
Consistent hashing
Used in distributed databases.
It is automated.
8. Used to partition data across multiple nodes based on some key or attribute.
Techniques range from manual to automated sharding.
Functional partitioning (puts the burden on the client)
Example: store all user data on one node and all transaction data on another node.
Horizontal partitioning (popular)
Ranges
Hashes
Directory
Vertical partitioning (less popular)
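The range and hash strategies above can be sketched as routing functions (node names and range boundaries are hypothetical):

```python
import hashlib

NODES = ["node0", "node1", "node2"]  # hypothetical node names

def route_by_range(user_id):
    """Range partitioning: simple, but uneven if IDs cluster in one range."""
    if user_id < 1000:
        return "node0"
    elif user_id < 2000:
        return "node1"
    return "node2"

def route_by_hash(key):
    """Hash partitioning: even spread if the hash function is good, but
    changing len(NODES) remaps most keys (motivating consistent hashing)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

print(route_by_range(1500))      # node1
print(route_by_hash("user:42"))  # one of node0..node2
```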
9. Data belonging to the same set/table/relation is distributed across nodes.
12. Shard routing layer to distribute writes/reads
More complexity
Routing layer, awareness of network topology, handling a dynamic cluster.
Limited data model
Every data model must have a key, which is used for routing.
Limited data access patterns
All read/write/update/delete queries must include the key.
Redundant data for additional access patterns.
To access data by more than one key, the data must be stored under each of those keys, so multiple copies or de-normalized data are needed.
Data access patterns need to be considered before designing the models.
Too much scatter-gather for aggregations; reads might slow down the system.
OLTP will slow down the system.
The number of shards needs to be decided early in the system design.
13. With the hash function of an in-memory map, when the load factor crosses a certain
threshold we have to re-hash all keys. So a modular hash function will not work when
the number of buckets of the map changes dynamically.
Consistent hashing is a technique used to limit the reshuffling of keys when a hash
table structure is rebalanced (e.g., when the number of buckets changes dynamically).
The hash space is shared by the key hash space and the virtual node hash space.
Keys and virtual nodes hash to the same values regardless of the number of physical nodes;
the only difference is that they are stored on different physical nodes.
Advantages
Avoids re-hashing of all keys when nodes leave and join.
Example: For Cassandra there are 128 virtual nodes per physical node.
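A minimal consistent-hash ring with virtual nodes can be sketched as follows (MD5 used only for illustration; real systems such as Cassandra and Riak use their own partitioner hashes):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of a consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=8):
        # Sorted list of (hash, physical node); each physical node
        # appears `vnodes` times under different virtual-node labels.
        self.ring = []
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#vnode{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        """A key is owned by the first virtual node clockwise from its hash."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ""))
        if idx == len(self.ring):
            idx = 0  # wrap around the ring
        return self.ring[idx][1]

ring = ConsistentHashRing(["A", "B", "C", "D"])
print(ring.lookup("John"))  # one of A, B, C, D, stable across calls
```

When a node joins or leaves, only the keys in the segments adjacent to its virtual nodes move; all other keys keep their owner.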
14. Say we have a hash function that gives a 6-bit hash, so the hash space is 2^6 = 64.
hash(John) = 111100 (60), hash(Jane) = 011000 (24). Each hash is assigned to the node
segment ahead of it in the clockwise direction (CWD).

A, B are physical nodes mapped to 8 virtual nodes:

Node  Hash segments
A     (0-8), (16-24), (32-40), (48-56)
B     (8-16), (24-32), (40-48), (56-64)

A, B, C, D are physical nodes mapped to 8 virtual nodes:

Node  Hash segments
A     (0-8), (32-40)
B     (8-16), (40-48)
C     (16-24), (48-56)
D     (24-32), (56-64)

Remove nodes C, D: 50% of keys are affected.
Add nodes C, D: 50% of keys are affected.
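The "50% of keys are affected" claim can be checked directly from the segment tables (segment boundaries copied from the tables; each segment spans 8 hash values):

```python
# Segment ownership before and after adding nodes C and D.
before = {(0, 8): "A", (16, 24): "A", (32, 40): "A", (48, 56): "A",
          (8, 16): "B", (24, 32): "B", (40, 48): "B", (56, 64): "B"}
after = {(0, 8): "A", (32, 40): "A",
         (8, 16): "B", (40, 48): "B",
         (16, 24): "C", (48, 56): "C",
         (24, 32): "D", (56, 64): "D"}

# Count hash values whose owning node changed (8 values per segment).
moved = sum(8 for seg in before if before[seg] != after[seg])
print(f"{moved / 64:.0%} of the hash space changes owner")  # 50%
```

Four of the eight segments change owner, so exactly half of the 64-value hash space (and therefore, on average, half of the keys) moves.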
15. Eventual consistency
Consistency tuning
R + W > N
R – number of replicas that must respond for a successful read
W – number of replicas that must respond for a successful write/update
N – total number of replicas
Failures
Node offline
Network latency
A GC-like process making a node unresponsive
Hinted handoffs, read repairs
Huge impact on design (write-then-read scenarios)
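The R + W > N rule can be expressed as a one-line check: if the read quorum and write quorum together exceed the replica count, they must overlap in at least one replica, so every read sees the latest write.

```python
def is_strongly_consistent(r, w, n):
    """R + W > N guarantees the read and write quorums overlap,
    so every successful read touches at least one replica
    that acknowledged the latest successful write."""
    return r + w > n

N = 3
print(is_strongly_consistent(r=2, w=2, n=N))  # True: quorums overlap
print(is_strongly_consistent(r=1, w=1, n=N))  # False: a read may miss the write
```

With N = 3, typical tunings are R = W = 2 (quorum reads and writes) for consistency, or R = W = 1 for lowest latency at the cost of possibly stale reads.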
16. CAP
Consistency – every read gets the most recent value (after a write/update/delete).
Availability – every request receives a response without error (with no guarantee that it
is the most recent value).
Partition tolerance – the system continues to respond despite an arbitrary number of
node failures (nodes unable to communicate with other nodes temporarily or
permanently due to network partition, congestion, communication delays, or
GC pauses in the case of the JVM).
Ground reality
Partition tolerance is a must; nobody wants data loss.
The practical choice is always between consistency and availability.
Example:
Amazon S3 chooses availability over consistency, so it is an AP system.
18. In relational databases
Two-phase commit in distributed relational databases (suffers in throughput).
ACID properties of transactions in relational databases:
A – Atomicity: a transaction (a bundle of statements) completes all or nothing.
C – Consistency: the database is kept in a valid state before and after the transaction.
I – Isolation: each transaction acts as if it is working on the data alone (serializable, repeatable
reads, read committed, read uncommitted (dirty reads, phantom reads)).
D – Durability: once a transaction is committed, the changes are permanent.
A transaction can be rolled back as if it never happened.
Options in distributed storage systems
Lighter transactions are supported, such as "update if present".
Write-off (no money back, no guarantee of delivery)
Retry (with exponentially growing time intervals)
Compensating actions (e.g., revert a credit card payment)
Distributed transactions (2PC) (slow you down)
The main reason to sacrifice transactions is availability.
This impacts the design of applications using distributed storage.
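The retry option can be sketched as exponential backoff with jitter (a hypothetical helper, not a library API; the jitter avoids many clients retrying in lockstep):

```python
import random
import time

def retry(op, attempts=5, base_delay=0.1):
    """Retry a flaky operation, doubling the delay each attempt
    and adding random jitter."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: caller must compensate or write off
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Simulated flaky call: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("node unreachable")
    return "ok"

print(retry(flaky, base_delay=0.01))  # ok
```

If all attempts fail, the exception propagates and one of the other options applies: a compensating action (e.g., revert the payment) or a write-off.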
19. Aspects to consider
Scale
Transactional needs
High availability
Designing for failures
20. Storage options and scenarios
Relational databases
Strong transactional requirements (OLTP systems).
NoSQL
A giant distributed hash table (one that cannot fit on a single machine) with
nested keys.
Key-value stores: Map<K,V>
Document databases: Map<K, {k1:v1, k2:v2, ...}>; the value is generally JSON or some
kind of serializable/de-serializable format or a binary file.
Columnar databases: SortedMap<<K1,K2,K3,...>, V>
Graph databases: AdjacencyMap<K, [K1,K2,K3,...]>; lots of small relations or links.
Search engines: lots of indexes based on search requirements, Map<K1,K>, Map<K2,K>,
Map<K3,K>, ...; actual raw documents stored as Map<K,V>.
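The map shapes listed above can be made concrete with toy Python dicts (all keys and values here are hypothetical examples, not real schemas):

```python
# Map<K, V>: opaque value per key.
key_value = {"user:1": b"raw bytes"}

# Map<K, {k1:v1, ...}>: the value is a structured document (e.g., JSON).
document = {"user:1": {"name": "Jane", "age": 30}}

# SortedMap<(K1, K2, K3), V>: compound keys, sorted for range scans.
columnar = {("user:1", "profile", "name"): "Jane"}

# AdjacencyMap<K, [K1, K2, ...]>: each key lists its linked keys.
graph = {"user:1": ["user:2", "user:7"]}

# Inverted indexes Map<term, [K]> plus raw document storage Map<K, V>.
search_index = {"jane": ["user:1"], "30": ["user:1"]}

print(document["user:1"]["name"])  # Jane
```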
The computers operate concurrently – they share nothing (no common CPU, GPU, memory, etc.).
The computers fail independently – computers might fail due to hardware failures, power outages, etc.
A rough overview of all of these comes in the next session.
Messaging is a key part of the observer/pub-sub design pattern.
Paxos – a protocol to resolve conflicts over values between multiple machines.
Let's start with our own e-commerce site with a small customer base.
By range
Easy, but no even distribution; it depends on the data characteristics.
By hash
Better distribution if the hash function is good, but it needs re-hashing of all keys and transfer of most keys when the node count changes, which increases internal network traffic.
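The "re-hashing of all keys" cost of modular hashing can be measured directly: count how many keys map to a different bucket when the bucket count changes (MD5 is used only as an example hash):

```python
import hashlib

def bucket(key, n):
    """Modular hash placement: bucket = hash(key) mod n."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"key{i}" for i in range(10000)]
# Grow the cluster from 4 to 5 buckets and count relocated keys.
moved = sum(1 for k in keys if bucket(k, 4) != bucket(k, 5))
print(f"{moved / len(keys):.0%} of keys move when buckets go 4 -> 5")
```

Most keys change bucket after adding just one node; with consistent hashing, only roughly 1/n of the keys would move.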
By directory
Create a virtual directory mapping, say, objects to servers; but managing it becomes challenging as data grows.
Used for tabular data or relational databases.
Less common, as relations grow in row count, not in column count.
Binary-object (blob) columns might be vertically partitioned.
Example: Riak uses the SHA-1 160-bit hash function, so the hash space is 0 to 2^160 - 1.
Companies avoid two-phase commit for speed and throughput.
XA transactions are an implementation of distributed transactions.
There is a coordinator node and participant nodes.
Commit request (voting/prepare phase)
Send the query to all participants; each executes the query but does not commit, then
votes yes (or no).
Commit phase
If all participants say yes, commit; otherwise abort.
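The two phases above can be sketched in a few lines (hypothetical classes; a real coordinator must also handle timeouts, crashes, and recovery logging):

```python
class Participant:
    """2PC participant: does the work in phase 1 but holds it uncommitted."""

    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "idle"

    def prepare(self):
        self.state = "prepared"  # work executed, locks held, nothing committed
        return self.will_vote_yes

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1 (voting/prepare): ask every participant to prepare and vote.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit): commit only if all voted yes; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("payments"), Participant("inventory")]))  # committed
print(two_phase_commit([Participant("payments"),
                        Participant("inventory", will_vote_yes=False)]))      # aborted
```

The throughput cost is visible in `prepare`: every participant holds its locks across both network round trips, blocking other transactions for the whole protocol.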
Transaction
Receive order
Process payment
Enqueue order
Process order
Deliver the order
These steps may run in parallel, across separate systems.
Failures
Payment failure
Out of stock
Hardware failures
Service failures
Global locks
A hold on a critical resource affects the whole chain.
Lightweight transactions – e.g., update if exists.
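A lightweight "update if exists" can be sketched as a conditional write against a plain dict (a sketch of the idea, not any database's actual API; real stores run this check-and-set atomically on the server):

```python
def update_if_exists(store, key, new_value):
    """Conditional update: apply the write only if the key is present.
    Returns True if the update was applied, False otherwise."""
    if key in store:
        store[key] = new_value
        return True
    return False

store = {"order:1": "PLACED"}
print(update_if_exists(store, "order:1", "PAID"))  # True
print(update_if_exists(store, "order:9", "PAID"))  # False
```

Because the condition and the write are a single atomic operation on the server, this gives a small slice of transactional behavior without the cost of full distributed transactions.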