3. Distributed Middleware Reliability and Fault Tolerance Support in System S
A fault-tolerance technique for implementing operations in a large-scale distributed system, ensuring that all components will eventually have a consistent view of the system even in the presence of component failures.
4. How do we develop a reliable large-scale distributed system?
How do we ensure that, in a large-scale distributed system, all components have a consistent view of the system even when a component fails?
5. Multiple components are employed in a large-scale distributed system.
Failure in any single component can have
system-wide effects.
6. A single operation can trigger a chain of activities across several tiers of distributed components.
Example: an online purchase can involve
- a web front-end component
- a database system component
- a credit card clearinghouse component
7. A failure in one or more components requires that all state changes related to the current operation be rolled back across the components.
This approach is cumbersome and may be impossible in cases where components do not have the ability to roll back.
8. Break the distributed operation into a series of smaller local operations, each confined to a single component, which are linked together.
The effect of a component failure and restart in the middle of a multi-component operation is then limited to that component and its immediate neighbors.
9. Never roll back once the first local operation has completed.
If a local operation fails, only that operation is retried until it completes.
Ensure that communication between components is tolerant to failure and that the communication protocol implements a retry policy.
10. Ensure that each component persists enough data so that, when restarted after a failure, it continues pending requests where its predecessor left off.
If the state of the system has changed, the operation is adjusted as appropriate.
Remote Procedure Calls (RPCs) between the component-local operations are stored as work items in a queue, and the queue itself is saved as part of the local transaction.
11. System S comprises a middleware runtime system and an application development framework.
The System S middleware runtime architecture separates the logical system view from the physical system view.
The runtime contains two kinds of components:
1. Centralized components
2. Distributed management components
13. Streams Application Manager (SAM)
Centralized gatekeeper for logical system information related to the applications running on System S.
It is the system entry point for job management tasks.
14. Streams Resource Manager (SRM)
Centralized gatekeeper for physical system information related to the software and hardware components that make up a System S instance.
It is also the middleware bootstrapper, performing system initialization upon administrator request.
15. Scheduler (SCH)
Responsible for computing placement decisions for applications to be deployed on the runtime system.
16. Name Service (NS)
Centralized component responsible for storing service references, which enable inter-component communication.
17. Authentication and Authorization Service (AAS)
Centralized component that provides user authentication as well as inter-component cross-authentication.
18. Host Controller (HC)
Component running on every application host, responsible for carrying out all local job management tasks (starting, stopping, and monitoring processing elements) on behalf of requests made by SAM.
20. How is system-wide reliability achieved in System S?
Two fundamental building blocks are required:
1. Reliable inter-component communication
2. Reliable data storage
21. The underlying inter-component communication infrastructure must be reliable.
How is this achieved? By ensuring that
◦ Remote Procedure Calls are correctly carried out
◦ Failures are conveyed back to the caller
These requirements are largely satisfied by existing technologies and protocols.
System S uses CORBA as its basic RPC mechanism.
22. The data storage mechanism must be reliable.
System S uses IBM DB2 as its data store.
23. A distributed operation is converted into component-local transactions, connected by the communication protocol and retried until they succeed.
24. Failures can happen due to
◦ Component failure
◦ Communication failure
Operations are always retried in the case of failures.
Retries are processed until
1. the user cancels the operation
2. the system shuts down
3. a logical error occurs
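Below is a minimal sketch of such a retry loop (illustrative, not the System S code; the remote_call, cancelled, and shutting_down callables are assumed to be supplied by the caller):

import time

class LogicalError(Exception):
    """Error that no amount of retrying can fix (e.g., bad arguments)."""

def call_with_retries(remote_call, cancelled, shutting_down, delay=1.0):
    while True:
        if cancelled() or shutting_down():
            return None                  # stop retrying on cancel or shutdown
        try:
            return remote_call()         # success: return the result
        except LogicalError:
            raise                        # logical errors are never retried
        except Exception:
            time.sleep(delay)            # transient failure: wait and retry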
25. Remote operations are always eventually executed.
Failures are seen as transient in nature
(i.e., a failed component is restarted quickly and primed with the state it held before the failure).
Clients have the ability to transparently retry or back out of pending remote operations.
26. 1. The reliability architecture is devised to be deployable as part of the component design, rather than baked into a particular framework such as CORBA.
This is a challenging task because
◦ Distributed systems grow organically
◦ Different components may choose to present their remote interfaces through several communication mechanisms
◦ Component writers can pick different reliability levels for different components
◦ Components may be built on different infrastructure
27. 2. Management of the component's internal state.
A component's state comprises
• the component's static state (the information the component needs to maintain for its operation)
• asynchronous work items (used to carry out requests to external components)
This information is persisted, and restored in the case of failure so that the component can recover.
28. For every component that maintains internal state to be restored after a failure, the following information must be stored in the durable data store:
1. The component's in-core management data structures
2. The serialized asynchronous processing requests (the work items in the component's work queue)
3. The repository of completed remote operations and their associated results
29. Persisting a component's in-core data structures needs to be engineered so that it is not tied to a particular durable storage solution.
System S uses a paradigm made popular by Hibernate:
• The top layer presents an object/relational interface, wrapping traditional data structures such as associative maps and red-black trees
• The lower layer hooks up the data storage, converting map entries into database records
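The following is a minimal sketch of the two-layer idea under stated assumptions: sqlite3 stands in for the durable store actually used by System S, and the class, table, and key names are illustrative, not taken from the paper.

import json
import sqlite3

class PersistentMap:
    # Top layer: behaves like an associative map.
    # Lower layer: each entry is converted into a database record.
    def __init__(self, db_path, table):
        self.conn = sqlite3.connect(db_path)
        self.table = table
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS %s (key TEXT PRIMARY KEY, value TEXT)" % table)

    def __setitem__(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO %s VALUES (?, ?)" % self.table,
            (key, json.dumps(value)))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM %s WHERE key = ?" % self.table, (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])

# Re-opening the same file after a crash restores the map contents.
jobs = PersistentMap("sam_state.db", "job_table")
jobs["job-1"] = {"state": "RUNNING"}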
30. Persisting asynchronous work items is achieved by
◦ serializing the work items while maintaining their order of submission.
◦ Thus, when retrieving them from the data store after a crash, the work items are scheduled in the same sequence as before.
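A minimal sketch of such a durable, order-preserving work queue, assuming sqlite3 as the durable store and illustrative table and column names (the real implementation persists the queue through the same storage layer as the rest of the component state):

import json
import sqlite3

class DurableWorkQueue:
    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS work_items "
            "(seq INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)")

    def submit(self, item):
        # Persist the work item before acting on it.
        self.conn.execute("INSERT INTO work_items (item) VALUES (?)",
                          (json.dumps(item),))
        self.conn.commit()

    def complete(self, seq):
        self.conn.execute("DELETE FROM work_items WHERE seq = ?", (seq,))
        self.conn.commit()

    def pending(self):
        # After a crash, pending items come back in submission order.
        return [(seq, json.loads(item)) for seq, item in self.conn.execute(
            "SELECT seq, item FROM work_items ORDER BY seq")]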
31. System S requires some remote operations to be executed at most once.
This means that when the same request is made multiple times, the reliable middleware should handle the repetitions so that they are harmless, or the re-issue is flagged and correctly dealt with.
32. To handle these situations, each external operation is classified as either
Idempotent:
Multiple invocations do not change the remote component's internal state,
although they might return different results.
(E.g., an operation querying the internal state of a component.)
Non-idempotent:
An invocation of the operation yields an internal state change in the remote component.
33. Idempotent operations are safe to retry, since they cause no state change.
The main concern is non-idempotent operations.
For each non-idempotent operation,
◦ an Operation Transaction Identifier (OTID) field is attached to the arguments of the interface
◦ this makes it possible to detect when an operation is being repeated.
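A minimal sketch of the client side of this scheme (illustrative names; sam_stub stands in for whatever remote proxy the caller holds, and the OTID is simply passed as an extra argument):

import uuid

def submit_job_reliably(sam_stub, job_description):
    otid = str(uuid.uuid4())        # generated once per logical operation
    while True:
        try:
            # The same OTID is reused on every retry of this request,
            # so the server can recognise a repeated invocation.
            return sam_stub.submitJob(otid, job_description)
        except ConnectionError:
            continue                # transient failure: retry with the same OTID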
34. [Sequence diagram: a client calls submitJob on SAM through the reliability wrapper. SAM looks up the request's OTID in its repository of completed operations. If the OTID is already marked complete, the saved output parameters (the job ID) are returned directly; otherwise the request is processed, the RPC results are saved against the OTID, and the results are then returned to the client.]
35. Consider a non-idempotent operation that changes the internal state of the component, but does not
◦ initiate requests to external components
◦ carry out asynchronous processing to complete the request.
The non-idempotent code is wrapped within a database transaction.
First consider this simple handling of non-idempotent code…
36. 1. Begin Network Service(oTid)
2. Non-idempotent code
3. Log service request result(oTid,results)
4. End Network Service
37. 1. Begin Network Service(oTid)
2. DB Transaction Begin
3. Non-idempotent code
4. Log service request result(oTid,results)
5. DB Transaction End(Commit)
6. End Network Service
38. 1. Begin Network Service(oTid)
2. DB Transaction Begin
3. Non-idempotent code
4. Log service request result(oTid,results)
5. DB Transaction End(Commit)
6. End Network Service
Case 1: if the system crashes before step 5
• The state changes are not committed to durable storage, so the component keeps a consistent state.
• The client requesting the remote operation will keep retrying the request until it completes.
Case 2: if the system crashes after step 5, but before the result is sent to the client
• The framework has already committed the log of the service request, which contains the oTid and the response to be sent back to the client.
• The reliable protocol layer simply looks up the log and replies with the original result.
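A minimal sketch of this pattern, with sqlite3 standing in for DB2 and illustrative table and function names; the state change and the result log are committed in one transaction keyed by the oTid:

import json
import sqlite3

conn = sqlite3.connect("component_state.db")
conn.execute("CREATE TABLE IF NOT EXISTS op_log (otid TEXT PRIMARY KEY, result TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS jobs (job_id INTEGER PRIMARY KEY, descr TEXT)")

def handle_request(otid, job_descr):
    # Case 2: the oTid is already logged, so the operation completed before
    # a crash; just replay the saved result instead of re-executing.
    row = conn.execute("SELECT result FROM op_log WHERE otid = ?", (otid,)).fetchone()
    if row:
        return json.loads(row[0])
    with conn:                                     # one DB transaction
        cur = conn.execute("INSERT INTO jobs (descr) VALUES (?)", (job_descr,))
        result = {"job_id": cur.lastrowid}         # the non-idempotent state change
        conn.execute("INSERT INTO op_log VALUES (?, ?)", (otid, json.dumps(result)))
    return result   # Case 1: a crash before the commit leaves no partial state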
39. The middleware sometimes performs additional operations that involve other components.
E.g., launching PEs:
◦ Validation of preconditions and security checks are performed synchronously.
◦ Dispatching the PEs can be carried out asynchronously.
The System S approach:
◦ A task is processed only after the database transaction under which the task was created has been committed to the durable repository.
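A minimal sketch of this rule (illustrative names; the worker thread and the in-memory ready queue are assumptions, not details from the paper):

import json
import queue
import sqlite3
import threading

conn = sqlite3.connect("sam_state.db")
conn.execute("CREATE TABLE IF NOT EXISTS work_items (item TEXT)")
ready = queue.Queue()                       # feeds the asynchronous worker

def worker():
    while True:
        job = ready.get()
        print("dispatching PEs for", job)   # the asynchronous part of the request

threading.Thread(target=worker, daemon=True).start()

def submit_job(job):
    with conn:                              # transaction that creates the task
        conn.execute("INSERT INTO work_items (item) VALUES (?)",
                     (json.dumps(job),))
    # The task is handed to the worker only after the commit above, so a
    # crash can never leave an executed task that was never made durable.
    ready.put(job)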
40. The System S approach handles these problems as follows:
◦ In principle, the execution of a new unit of work on each thread would have to follow the full reliability approach, but that is quite complicated to implement.
◦ The complexity is reduced by one assumption: a work unit can be scheduled only after the original request has committed.
◦ This guarantees that work units are executed once.
41. Components interacting with each other is very important, and the framework should handle these interactions.
Interactions can be
1. user initiated
2. system initiated
42. The System S job submission process consists of 6 steps.
1. Accept the job description from the user.
2. Check the permissions
(a query to AAS; no change in the AAS local state).
3. Determine PE placement and check node availability
(a query to SRM; no state change).
43. 4. Update the local state
◦ Insert the job into SAM's local tables.
5. Register the job with AAS (the registerJob operation; changes the AAS local state).
6. Deploy the PEs (changes the state of the system).
The HCs do not keep this state persistently; on restart they recover it. But this is not a problem.
44. Consider the registerJob operation (SAM → AAS).
What happens if AAS crashes?
The call appears to have failed, but there are two possibilities:
◦ 1. AAS completed the registration:
retrying gives an error, because the job is already in the system.
◦ 2. AAS did not complete the registration:
SAM must still register the job, so it can simply retry.
45. What happens if SAM crashes?
◦ It may leave the distributed system in an inconsistent state,
in the case where
◦ the job no longer exists in SAM, but
◦ the AAS registration might have succeeded.
◦ On restart, SAM retries the submit operation
◦ (the client keeps retrying the submission while SAM is down).
◦ But there is a problem if the job ends up being registered again.
46. 1. PREPARATION PHASE
1. Accept the job description from the user.
2. Check the permissions.
3. Determine PE placement.
4. Update the local state
◦ Insert the job into SAM's local tables.
5. Generate an oTid for the AAS registerJob call and queue a registration work item with that id.
Commit the current state (SAM's internal tables and work queue) to the database.
(Registering the job with AAS and deploying the PEs are deferred to the next phase.)
47. 2. REGISTER AND LAUNCH PHASE
1. Register the job with AAS using already
generated oTid
2. Start a local database Transaction
3. Deploy PEs
4. Commit current state to the database.
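A minimal sketch of the two phases described above (the aas, srm, hcs, and sam_state objects are illustrative stand-ins for the real components, not interfaces from the paper):

import uuid

def prepare(job_descr, aas, srm, sam_state, work_queue):
    # 1. PREPARATION PHASE: queries only, no changes to other components.
    aas.check_permission(job_descr)                 # step 2
    placement = srm.query_placement(job_descr)      # step 3
    sam_state.setdefault("jobs", []).append(job_descr)   # step 4
    otid = str(uuid.uuid4())                        # step 5: oTid generated once
    work_queue.append(("registerJob", otid, job_descr, placement))
    # ...commit sam_state and work_queue to the database here...

def register_and_launch(aas, hcs, work_item):
    # 2. REGISTER AND LAUNCH PHASE: safe to retry from the beginning,
    # because the pre-generated oTid makes the AAS registration repeatable.
    _, otid, job_descr, placement = work_item
    aas.register_job(otid, job_descr)               # step 1
    for host in placement:                          # step 3: deploy PEs
        hcs[host].start_pe(job_descr)
    # ...commit the current state to the database here (step 4)...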
48. With this approach,
◦ the preparation phase
contains no calls that change another component's state,
so repeating it is harmless;
◦ the register and launch phase
can be repeated many times.
There is no problem if SAM fails,
since register and launch is retried from the beginning,
and since the same oTid is used for the same call, there is no danger of registering the job twice.
49. 1. Registering PEs
◦ for failed PEs
2. Generalizing
◦ the approach of the preceding sections
50. Retry Policy
The retry controller supports
I. Bounded retries
II. Unbounded retries
51. During normal operation of the System S middleware, once failures are detected, the recovery process is automatically kick-started.
In System S, failure detection is the responsibility of the SRM component.
Failures are detected in two different ways.
Central components are periodically contacted by SRM to ensure their liveness. This is done using an application-level ping operation that is built into all the components as part of the framework.
Moreover, all distributed components communicate their liveness to SRM via a scalable heartbeat mechanism.
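A minimal sketch of the two detection paths, assuming illustrative interval and timeout values and a restart callback that are not specified in the paper:

import time

PING_INTERVAL = 5        # seconds between pings of central components (assumed)
HEARTBEAT_TIMEOUT = 15   # seconds of silence before declaring a failure (assumed)

last_heartbeat = {}      # component name -> time of last heartbeat received

def on_heartbeat(component):
    # Distributed components report their own liveness via heartbeats.
    last_heartbeat[component] = time.time()

def detection_cycle(central_components, restart):
    # Active path: SRM pings every central component.
    for comp in central_components:
        try:
            comp.ping()
        except Exception:
            restart(comp)
    # Passive path: check heartbeats from distributed components.
    now = time.time()
    for name, seen in last_heartbeat.items():
        if now - seen > HEARTBEAT_TIMEOUT:
            restart(name)
    time.sleep(PING_INTERVAL)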
52. The recovery process is simple and involves only the restart of the failed component or components.
Once a failed component is restarted, its
state is rebuilt from information in durable
storage before it starts processing any
new or pending operations.
First, the component in-core structures
are read from storage.
53. Next, the list of completed operations is
retrieved, followed by re-populating the
work queue with any pending asynchronous
operations.
Once all the state is populated, the
component starts accepting new external
requests and the pending requests start
being processed.
Any components trying to contact the
restarted component will be able to receive
responses and the system will resume normal
operation.
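A minimal sketch of this restart sequence (the store and component objects and their method names are illustrative, not taken from the paper):

def recover(store, component):
    # 1. Read the component's in-core structures from durable storage.
    component.tables = store.load_in_core_structures()
    # 2. Retrieve the list of completed operations and their results.
    component.completed_ops = store.load_completed_operations()
    # 3. Re-populate the work queue with pending asynchronous operations,
    #    preserving their original submission order.
    for item in store.load_pending_work_items():
        component.work_queue.append(item)
    # Only now start accepting new requests and processing pending ones.
    component.accepting_requests = True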
54. Able to handle multiple component failures at the
same time without any additional work or
coordination.
Failed components can be restarted in any order
and will begin processing requests as and when
they are restarted.
NB: completion of a pending distributed
operation depends on the availability of all
components needed to service that operation
The failure of a component after it has completed
its part of the distributed operation does not
affect the completion of the operation.
56. The effect of failures is measured in three different mocked-up component-graph configurations.
All experiments were conducted with System S running on up to five Linux hosts.
Each host contains 2 Intel Xeon 3.4 GHz CPUs with 16 GB RAM, with an IBM DB2 database as durable storage running on a separate dedicated host.
59. Source-Relay-Sink (SRS)
Market Data Processing (MDP)
60. Inspiration
Berkeley's Recovery Oriented Computing paradigm:
Bug-free software is impossible.
Lower MTTR (Mean Time To Recover) rather than increasing MTTF (Mean Time To Failure).
Fault Tolerance in 3 Tier Applications – Vaysburd, 1999.
61. Inspiration..
Fault Tolerance in 3 Tier Applications –
Vaysburd, 1999.
Client tier should tag requests
Server tier should offload state to a database
Database tier alone should be concerned with
reliability.
62. 1). Replica and consistency management
How to physically set up replicas?
How to switch to a different one?
How to maintain consistency?
Disadvantages:
Overhead of having replicas.
Difficulty of ensuring consistency in the presence of non-idempotent operations.
63. Replica and consistency management ..
1). FT CORBA – OMG, 1998.
First standardization effort on fault tolerant
middleware support.
Handles distributed non-idempotent requests through service replication and consistency.
64. Replica and consistency management ..
1). An Architecture for Object Replication in Distributed Systems – Beedubail et al., 1997.
Hot replicas (multiple copies of a service exist in standby).
A fault-tolerance layer in the middleware relays state changes from the primary replica to the secondary ones to maintain consistency.
65. Replica and consistency management ..
2). Exactly once end to end semantics in
CORBA Invocation across Heterogeneous
Fault Tolerance ORBs – Vaysburd & Yajnik, 1999.
Similar to the TID approach; however, the assumption is that in case of failure a replica will pick up the request, and a multicast mechanism is used to notify all replicas of state changes.
66. Replica and consistency management ..
3). DOORS by Bell Labs – 2000.
Uses interception to capture inter-component
interactions.
FT mainly supported through replication.
67. Replica and consistency management ..
4). Chubby (Lock Service for loosely coupled
distributed systems – 2006) & Zookeeper
(Wait free coordination for Internet scale
systems -2010).
Useful for group services (where a set of
nodes vote to elect a master)
Replicate servers and databases to provide
high availability.
68. 2). Flexible consistency models
Failures are dealt with by relaxing ACID and allowing a temporarily inconsistent state.
It has been shown that many applications can
actually work under such relaxed
assumptions.
69. Flexible consistency models..
1). Cluster-Based Scalable Network Services – Fox et al., 1997.
BASE (Basically Available, Soft State, Eventual Consistency) model.
Doesn’t handle situations where non-
idempotent requests are carried out.
70. Flexible consistency models..
2). Neptune – Shen et al., 2003.
Middleware for clustering support and
replication management of network services.
Flexible replication consistency support.
71. 3). Distributed transaction support
Allow a distributed transaction to roll back in
case of failures.
Done at the expense of central coordination
and a global roll back mechanism.
72. The paper gives a mechanism for achieving reliability and fault tolerance in large-scale distributed systems.
Used in real world middleware – IBM
Infosphere Streams.
This approach avoids complex rollbacks and
the overhead of maintaining active replicas of
components.
Can be implemented as an extension to existing low-level distributed computing technologies (CORBA, DCOM).
73. Support for both stateful and stateless
components allowing the system to grow
organically while providing different levels of
reliability for components (global state
consistency).
Low MTTR.
Can incorporate other low-cost alternatives for ensuring durability (e.g., journaling file systems).
Can tolerate or recover from one or more
concurrent failures.
74. Future plan is to experiment with alternate
durable storage mechanisms and use this
mechanism in other distributed middleware.
75. Good mechanism for implementing FT in a
distributed system, by using middleware.
Unlike traditional FT mechanisms, this
approach focuses on converting a distributed
operation into component local operations
and implementing FT in the communication
protocol (reliable RPC).
Test results demonstrate reliable fault tolerance.
This mechanism is used in IBM's Infosphere Streams enterprise platform, which supports large-scale distribution and can handle petabytes of data.
Editor's Notes
3 tier – client/server/database. The focus of this work is analyzing FT when applications make use of commercial DBs.
Failed request taken over by replica.
OMG – Object Management Group. OMG has been an international, open-membership, not-for-profit computer industry consortium since 1989.
ACID – Atomicity, Consistency, Isolation, Durability (e.g., Internet-based content providers)
DCOM – Microsoft. Infosphere Streams – real-time Big Data analysis platform for enterprises.
A journaling file system keeps a log that tracks changes before committing them to the main file system. In the event of a system crash or power failure, such file systems are quicker to bring back online and less likely to become corrupted.