2. Agenda
• Introduction
• Why distributed systems: what problem do they solve?
• Types of distributed systems
• Common strategies and patterns in distributed systems
• Conclusion
• Questions
3. What is a distributed system?
• A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. (Wikipedia)
• A distributed system is an ecosystem: a set of systems working together to provide a service, functionality or behavior for clients.
• The behavior is uniform: it appears to come from a single source, but in fact it comes from a set of systems interacting to produce that behavior.
• The components (systems) know of their peers and work together, passing messages between each other in order to:
  • service user requests;
  • detect and respond to failures;
  • adapt to changing conditions.
4. Vertical Scaling: Problems
• What problems do distributed systems solve? Why not build bigger and bigger machines to address increasing demand?
• Single point of failure: the bigger they are, the harder they fall.
  • When the big system goes down, everything it contains goes down.
  • The NOC builds disaster recovery, failover strategies and constant monitoring.
  • Ops becomes failure sensitive, vigilant and risk averse.
• Elastic demand: how do you size system resources for elastic demand?
  • At peak times (Thanksgiving, Christmas, Valentine's Day, etc.) demand increases. Hordes of consumers descend upon eCommerce sites simultaneously, causing system meltdown.
  • Off-season, usage is bursty: sometimes steady, sometimes slow and sometimes relatively idle.
• Business impacts
  • Increased expenditure
  • Failure results in loss of current and future business
  • Loss of customer confidence
  • Negative brand impact
• Competitive edge: newer software features take time to be installed. Development is fast, Ops is slow.
5. Solution: Horizontal Scalability - Adaptive Systems
• Big systems are made of many smaller systems working together.
• An individual system has a single capability. To service a request it delegates to a peer or peers for the capabilities it does not have; responses from its peers are processed and presented to the user.
• Horizontal scalability by itself is not an adaptive system.
• So what is an adaptive system? One built around:
  • Message-based communication
  • Network dependence
  • Failure isolation
  • Optimized deployment
  • Elastic, on-demand service addition and removal
  • Parallel development
  • High failure rates
6. Solution: Horizontal Scalability - Adaptive Systems
Diagram: an Admin service monitors the nodes; when a node fails, a replacement instance is spun up, and nodes are added or removed as demand changes.
• Embrace failure
  • Self healing: make the system "self-aware". If one component fails (which it will), "spin up" another instance (a minimal supervisor sketch follows at the end of this slide).
• Respond to demand
  • Increase and decrease capacity to meet changes in demand.
• However, the system is still not fault tolerant:
  • The Admin Service is a single point of failure.
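To make the "spin up another instance" idea concrete, here is a minimal, self-contained sketch of an admin/supervisor loop. The Node class, the heartbeat check, spawn_node and desired_count are hypothetical placeholders rather than any specific platform's API; note that this loop itself is a single point of failure, which is exactly the problem the slide points out.

    import time

    class Node:
        """Hypothetical worker node; a real one would be a process, container or VM."""
        def __init__(self, node_id):
            self.node_id = node_id
            self.alive = True

        def heartbeat(self):
            # In a real system this would be a network health check.
            return self.alive

    def admin_loop(nodes, spawn_node, desired_count, poll_seconds=5):
        """Keep the cluster at the desired number of healthy nodes (the Admin Service role)."""
        while True:
            # Self healing: drop dead nodes and replace them.
            nodes = [n for n in nodes if n.heartbeat()]
            # Respond to demand: desired_count() may change between iterations.
            while len(nodes) < desired_count():
                nodes.append(spawn_node())
            time.sleep(poll_seconds)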
7. Solution: ZooKeeper
Diagram: an ensemble of three ZooKeeper servers (one leader, two followers) exchanging broadcast messages, with clients connected to individual servers and all writes routed to the leader. The data store holds configuration (host: IP and port) and client data.
• Anatomy of a client: the ZK client contains a list of the ZK servers in the cluster.
• Clients connect to a single server.
• All client requests are served from the in-memory data store on that server.
• All writes go to the leader: servers send their data to the leader, and the leader stores the data in its data store.
• A server responds to its client only after the leader has stored the data.
• If a leader fails, a new leader is elected.
• Clients reconnect to the next available server from their list of available ZooKeeper servers.
• The data for each client is loaded into each server that services that client.
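As a rough illustration of the client side described above, here is a sketch using the third-party kazoo Python client; the host names, znode paths and payload are invented for the example. The client is handed the whole server list, connects to one server, and reconnects to another if that server goes away.

    from kazoo.client import KazooClient

    # The client is configured with the list of ZK servers in the ensemble.
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()  # connects to one server; reconnects to another on failure

    # Writes are forwarded to the leader and acknowledged once stored.
    zk.ensure_path("/app1")
    zk.create("/app1/p_1", b"host=10.0.0.5;port=8080", ephemeral=True)

    # Reads are served from the in-memory data store of the connected server.
    data, stat = zk.get("/app1/p_1")
    print(data, stat.version)

    zk.stop()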
8. Patterns
• Leader and followers
  • Continuous communication between servers (awareness of the presence or absence of a peer).
  • Leader election: dynamically elect a leader on startup and on failure conditions (an election sketch follows at the end of this slide).
  • The leader manages the common data store (which is the source of truth).
• Common data store: a single source of data (or state) which is distributed to all servers in the cluster or ensemble.
• Expectation of failure
  • Programming model, storage model, messaging model: all have failure recognition and failure recovery methodologies built in.
Diagrams: the initial cluster/ensemble (a leader and two followers kept in sync via broadcast messages, each holding the znode tree / -> /app1 -> /app1/p_1, /app1/p_2, /app1/p_3) and the restructured ensemble after a leader failure.
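One common way to implement the leader-election pattern on ZooKeeper is the ephemeral-sequential-znode recipe, which the kazoo library packages as an Election recipe. The sketch below is illustrative only; the hosts, election path and identifier are arbitrary.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    def lead():
        # Runs only while this process holds leadership; manage the
        # common data store (the source of truth) from here.
        print("I am the leader now")

    # Each candidate creates an ephemeral sequential znode under the election
    # path; the lowest sequence number wins. If the leader dies, its ephemeral
    # node vanishes and the next candidate is elected automatically.
    election = zk.Election("/app1/election", identifier="server-1")
    election.run(lead)  # blocks; calls lead() when this candidate wins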
9. Pattern: Stateless Applications
Discovery Service, Load Balancing
Diagrams: two cluster configurations built on ZooKeeper, one after initial deployment and one after a failure condition. The client holds a list of all services in each group (initially Blue: 1, 2, 3; Light Orange: 1, 2, 3; Green: 1, 2, 3; after the failure, Blue: 2, 3; Light Orange: 1, 3; Green: 1, 2, 3) and uses an internal load balancer to round-robin requests to each service.
• ZK async notification: all services that are part of a "group" receive asynchronous notifications when any member of that group goes down.
• ZK leader election: when the leader of a group goes down, ZooKeeper will elect a new leader.
• A discovery service built on ZooKeeper notifies the client of the new cluster configuration (a discovery/round-robin sketch follows at the end of this slide).
• Shared data: all members of a group will receive data (configuration, events) published by any other member of the group.
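A bare-bones version of the discovery-plus-load-balancing idea, again sketched with kazoo: group members register ephemeral znodes under a group path, the client watches the children of that path, and a round-robin index cycles through whichever members are currently alive. The group path "/services/blue" and the host list are invented for the example.

    import itertools
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()
    zk.ensure_path("/services/blue")

    members = []  # current live members of the "blue" group

    @zk.ChildrenWatch("/services/blue")
    def on_members_changed(children):
        # Async notification: fires whenever a member joins or goes down.
        global members
        members = sorted(children)

    counter = itertools.count()

    def next_service():
        # Internal load balancer: round-robin over the live members.
        if not members:
            raise RuntimeError("no live services in the blue group")
        return members[next(counter) % len(members)]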
10. Snapshot data - Problem
Diagram: a ZooKeeper ensemble (leader and followers), each server holding the znode tree / -> /app1 -> /app1/p_1, /app1/p_2, /app1/p_3; a client holds its own copy of /app1 and receives periodic updates / snapshots.
CAP Theorem:
• Consistency: all nodes see the same data at the same time.
• Availability: a guarantee that every request receives a response about whether it was successful or failed.
• Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.
Consistency: to synchronize the data, the system will have to be unavailable for a period of time even though it is fully operational.
Availability: if the system is always available and operates in spite of message loss and component failure, then the data will be inconsistent at any given point in time.
Partition tolerance: if the system continues to function when parts of it fail, then it can be available, but the data within it cannot be consistent.
So if availability and partition tolerance are favored, how can a client get accurate or viable data?
11. Pattern: Snapshot data - Quorum Management
Diagram: the same ensemble and client as on the previous slide, with a quorum manager sitting between the client and the servers; the client still receives periodic updates / snapshots of /app1.
Quorum manager
• A quorum manager issues a request to a number of systems, takes the results, compares the timestamps (or vector clocks) and returns the most up-to-date data to the client (a minimal sketch follows at the end of this slide).
• A quorum manager can exist in the cluster, in each component, or external to the system as a service.
According to Wikipedia, a quorum is the minimum number of members of a deliberative body necessary to conduct the business of that group. Ordinarily, this is a majority of the people expected to be there, although many bodies may have a lower or higher quorum.
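A minimal, self-contained sketch of the quorum-manager read path: ask several replicas, require a quorum of answers, and return the freshest value. The Versioned record, the replica.read() call and the timestamp field are hypothetical stand-ins for whatever the real cluster exposes.

    from dataclasses import dataclass

    @dataclass
    class Versioned:
        value: str
        timestamp: float  # a vector clock could be compared here instead

    def quorum_read(replicas, key, quorum):
        """Ask every reachable replica, require `quorum` answers, return the newest."""
        answers = []
        for replica in replicas:
            try:
                answers.append(replica.read(key))  # hypothetical replica API
            except ConnectionError:
                continue  # expectation of failure: skip unreachable replicas
        if len(answers) < quorum:
            raise RuntimeError("quorum not reached")
        return max(answers, key=lambda v: v.timestamp)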
12. Pattern: Data Lookup and Replication: HDFS
Diagram: a client reads or writes file data by first contacting the NameNode, which maps each file to the DataNodes (DataNode 1-5) holding its replicated blocks (see http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html).
NameNode metadata:
  /user/my-company/file-part-0, r:3, 1, 3
  /user/my-company/file-part-1, r:3, 2, 4
  /user/my-company/file-part-2, r:3, 5, 6
• Example: WebHDFS first contacts the NameNode to find out which DataNodes to write to, or which DataNodes to read from (a minimal sketch follows below).
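The WebHDFS lookup flow can be seen with two plain HTTP calls: the first request goes to the NameNode, which answers with a redirect naming the DataNode that actually serves the data. The host names, port and file path below are placeholders, and authentication parameters are omitted.

    import requests

    namenode = "http://namenode.example.com:9870"
    path = "/user/my-company/file-part-0"

    # Step 1: ask the NameNode where the data lives (op=OPEN for reads).
    resp = requests.get(f"{namenode}/webhdfs/v1{path}",
                        params={"op": "OPEN"}, allow_redirects=False)

    # Step 2: the NameNode redirects to a DataNode; fetch the bytes from there.
    datanode_url = resp.headers["Location"]
    data = requests.get(datanode_url).content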
13. Consistent Hashing - Replicated Data: Cassandra
Diagram: nodes A, B and C placed on a consistent-hash ring; keys are hashed, based on namespace and key, onto positions on the ring.
• Find the node on the ring whose range of keys contains the current key, and write the data to that node (a ring sketch follows at the end of this slide).
• There are two write modes:
  • Quorum write: blocks until quorum is reached.
  • Async write: sends the request to any node; that node will push the data to the appropriate nodes but returns to the client immediately.
• If the target node is down, write to another node with a hint saying where the data should be written. A harvester runs every 15 minutes, finds the hints and moves the data to the appropriate node.
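A small self-contained consistent-hash ring, to make "find the node whose key range contains the key" concrete. The hash function, node names and replica count are arbitrary choices for the example, not Cassandra's actual implementation.

    import bisect
    import hashlib

    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, nodes):
            self._ring = sorted((_hash(n), n) for n in nodes)

        def add_node(self, node):
            bisect.insort(self._ring, (_hash(node), node))

        def remove_node(self, node):
            self._ring.remove((_hash(node), node))

        def node_for(self, key, replicas=1):
            # Walk clockwise from the key's position; the first node owns the
            # key and the next replicas-1 nodes hold copies.
            if not self._ring:
                raise RuntimeError("empty ring")
            i = bisect.bisect(self._ring, (_hash(key), ""))
            return [self._ring[(i + k) % len(self._ring)][1] for k in range(replicas)]

    ring = HashRing(["A", "B", "C"])
    print(ring.node_for("namespace:row-42", replicas=2))  # e.g. ['B', 'C']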
14. Consistent Hashing - Replicated Data: Cassandra
Diagram: three states of the ring, each with nodes A, B and C and replicated data.
• If the node that was hosting B's data goes down, the node next to it on the ring takes over B's data from B's replicas and becomes the host for B's data.
• If a node is added to a partition, it will share some of the data that exists in that partition. The data it is responsible for is based on its hashed position in the ring. This results in a division of the keys between the two nodes. Interestingly, it promotes load balancing as well, since the load is now shared between two data nodes (see the sketch at the end of this slide).
• Initial state of the cluster: note that all data (A, B, C) is replicated. If it were not, a node's failure would result in data loss.
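Continuing the ring sketch from the previous slide, adding or removing a node only changes ownership of the keys adjacent to that node's position on the ring; everything else stays where it was, which is the key-splitting and load-balancing behavior described above. This snippet assumes the HashRing class defined earlier.

    ring = HashRing(["A", "B", "C"])
    before = {k: ring.node_for(k)[0] for k in ("k1", "k2", "k3", "k4", "k5")}

    ring.add_node("D")           # the new node takes over part of one key range
    after = {k: ring.node_for(k)[0] for k in before}

    moved = [k for k in before if before[k] != after[k]]
    print("keys that moved:", moved)  # typically only a fraction of the keys

    ring.remove_node("B")        # B's keys fall to the next node on the ring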
15. Conclusion
• Vertical scaling is expensive and error prone.
• Horizontal scaling is elastic, responsive, fault tolerant and self-healing.
• Distributed systems affect all aspects of software development:
  • Programming models
  • Testing
  • Deployment
  • Maintenance
• There are best practices and patterns for designing your distributed system.
• Many existing systems (Cassandra, Hadoop, Solr, Riak, the Netflix platform) are implementations of these patterns. Look under the hood. Use the patterns to "roll your own".