3. Distributed Middleware Reliability and Fault Tolerance Support in System S
A fault-tolerance technique for implementing operations in a large-scale distributed system, ensuring that all components will eventually have a consistent view of the system even in the presence of component failures.
4. How do we develop a reliable large-scale distributed system?
How do we ensure that, in a large-scale distributed system, all components have a consistent view of the system even when a component fails?
5. Multiple components are employed in a large-scale distributed system.
Failure in any single component can have
system-wide effects.
6. A single operation can trigger a chain of activities across several tiers of distributed components.
Example: an online purchase can involve
- a web front-end component
- a database system component
- a credit card clearinghouse component
7. A failure in one or more components requires that all state changes related to the current operation be rolled back across the components.
This approach is cumbersome and may be impossible in cases where components do not have the ability to roll back.
8. Break the distributed operation into a series of smaller local operations, each confined to a single component, which are linked together.
The effect of a component failure and restart in the middle of a multi-component operation is then limited to that component and its immediate neighbors.
9. Never roll back once the first local operation has completed.
If a local operation fails, only that operation is retried until it completes.
Ensure that communication between components is tolerant to failure and that the communication protocol implements a retry policy.
10. Ensure that each component persists enough data so that, when restarted after a failure, it continues pending requests where its predecessor left off.
If the state of the system has changed, the operation is adjusted as appropriate.
Remote Procedure Calls (RPCs) between the component-local operations are stored as work items in a queue, and the queue itself is saved as part of the local transaction.
11. System S comprises a middleware runtime system and an application development framework.
The System S middleware runtime architecture separates the logical system view from the physical system view.
The runtime contains two kinds of components:
1. Centralized components
2. Distributed management components
13. Streams Application Manager (SAM)
Centralized gatekeeper for logical system information related to the applications running on System S.
It is the system entry point for job management tasks.
14. Streams Resource Manager (SRM)
Centralized gatekeeper for physical system information related to the software and hardware components that make up a System S instance.
It is also the middleware bootstrapper, performing system initialization upon administrator request.
15. Scheduler (SCH)
Responsible for computing placement decisions for applications to be deployed on the runtime system.
16. Name Service (NS)
Centralized component responsible for storing service references, which enable inter-component communication.
17. Authentication and Authorization Service (AAS)
Centralized component that provides user authentication as well as inter-component cross-authentication.
18. Host Controller (HC)
Component running on every application host, responsible for carrying out all local job management tasks (starting, stopping, and monitoring processing elements) on behalf of requests made by SAM.
20. How is system-wide reliability achieved in System S?
Two fundamental building blocks are required:
1. Reliable inter-component communication
2. Reliable data storage
21. The underlying inter-component communication infrastructure must be reliable.
How is this achieved? By ensuring that
◦ Remote Procedure Calls are correctly carried out
◦ Failures are conveyed back to the caller
These requirements are largely satisfied by existing technologies and protocols.
System S uses CORBA as its basic RPC mechanism.
22. The data storage mechanism must be reliable.
System S uses IBM DB2 as its data store.
23. A distributed operation is converted into component-local transactions, connected by the communication protocol and retried until they succeed.
24. Failures can happen due to
◦ Component failure
◦ Communication failure
Operations are always retried in the case of failures.
Retries are processed until
1. the user cancels the operation
2. the system shuts down
3. a logical error occurs
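Below is a minimal sketch of such a retry loop (illustrative, not the System S code; the remote_call, cancelled, and shutting_down callables are assumed to be supplied by the caller):

import time

class LogicalError(Exception):
    """Error that no amount of retrying can fix (e.g., bad arguments)."""

def call_with_retries(remote_call, cancelled, shutting_down, delay=1.0):
    while True:
        if cancelled() or shutting_down():
            return None                  # stop retrying on cancel or shutdown
        try:
            return remote_call()         # success: return the result
        except LogicalError:
            raise                        # logical errors are never retried
        except Exception:
            time.sleep(delay)            # transient failure: wait and retry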
25. Remote operations are always eventually executed.
Failures are seen as transient in nature
(i.e., a failed component is restarted quickly and primed with the state it held before the failure).
Clients have the ability to transparently retry or back out of pending remote operations.
26. 1. The reliability architecture is devised to be deployable as part of the component design, rather than baked into a particular framework such as CORBA.
This is a challenging task because
◦ Distributed systems grow organically
◦ Different components may choose to present their remote interfaces through several communication mechanisms
◦ Component writers can pick different reliability levels for different components
◦ Components may be built on different infrastructure
27. 2. Management of the component's internal state.
A component's state comprises
• the component's static state (the information the component needs to maintain for its operation)
• asynchronous work items (used to carry out requests to external components)
This information is persisted, and restored in the case of failure so that the component can recover.
28. For every component that maintains internal state to be restored after a failure, the following information must be stored in the durable data store:
1. The component's in-core management data structures
2. The serialized asynchronous processing requests (the work items in the component's work queue)
3. The repository of completed remote operations and their associated results
29. Persisting a component's in-core data structures needs to be engineered so that it is not tied to a particular durable storage solution.
System S uses a paradigm made popular by Hibernate:
• The top layer presents an object/relational interface, wrapping traditional data structures such as associative maps and red-black trees
• The lower layer hooks up the data storage, converting map entries into database records
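The following is a minimal sketch of the two-layer idea under stated assumptions: sqlite3 stands in for the durable store actually used by System S, and the class, table, and key names are illustrative, not taken from the paper.

import json
import sqlite3

class PersistentMap:
    # Top layer: behaves like an associative map.
    # Lower layer: each entry is converted into a database record.
    def __init__(self, db_path, table):
        self.conn = sqlite3.connect(db_path)
        self.table = table
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS %s (key TEXT PRIMARY KEY, value TEXT)" % table)

    def __setitem__(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO %s VALUES (?, ?)" % self.table,
            (key, json.dumps(value)))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM %s WHERE key = ?" % self.table, (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])

# Re-opening the same file after a crash restores the map contents.
jobs = PersistentMap("sam_state.db", "job_table")
jobs["job-1"] = {"state": "RUNNING"}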
30. Persisting asynchronous work items is achieved by
◦ serializing the work items while maintaining their order of submission.
◦ Thus, when retrieving them from the data store after a crash, the work items are scheduled in the same sequence as before.
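A minimal sketch of such a durable, order-preserving work queue, assuming sqlite3 as the durable store and illustrative table and column names (the real implementation persists the queue through the same storage layer as the rest of the component state):

import json
import sqlite3

class DurableWorkQueue:
    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS work_items "
            "(seq INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)")

    def submit(self, item):
        # Persist the work item before acting on it.
        self.conn.execute("INSERT INTO work_items (item) VALUES (?)",
                          (json.dumps(item),))
        self.conn.commit()

    def complete(self, seq):
        self.conn.execute("DELETE FROM work_items WHERE seq = ?", (seq,))
        self.conn.commit()

    def pending(self):
        # After a crash, pending items come back in submission order.
        return [(seq, json.loads(item)) for seq, item in self.conn.execute(
            "SELECT seq, item FROM work_items ORDER BY seq")]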
31. System S requires some remote operations to be executed at most once.
This means that when the same request is made multiple times, the reliable middleware should handle the repetitions so that they are harmless, or the re-issue is flagged and correctly dealt with.
32. To handle these situations, each external operation is classified as either
Idempotent:
Multiple invocations do not change the remote component's internal state,
although they might return different results.
(E.g., an operation querying the internal state of a component.)
Non-idempotent:
An invocation of the operation yields an internal state change in the remote component.
33. Idempotent operations are safe to retry, since they cause no state change.
The main concern is non-idempotent operations.
For each non-idempotent operation,
◦ an Operation Transaction Identifier (OTID) field is attached to the arguments of the interface
◦ this makes it possible to detect when an operation is being repeated.
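A minimal sketch of the client side of this scheme (illustrative names; sam_stub stands in for whatever remote proxy the caller holds, and the OTID is simply passed as an extra argument):

import uuid

def submit_job_reliably(sam_stub, job_description):
    otid = str(uuid.uuid4())        # generated once per logical operation
    while True:
        try:
            # The same OTID is reused on every retry of this request,
            # so the server can recognise a repeated invocation.
            return sam_stub.submitJob(otid, job_description)
        except ConnectionError:
            continue                # transient failure: retry with the same OTID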
34. [Sequence diagram: a client calls submitJob on SAM through the reliability wrapper. SAM looks up the request's OTID in its repository of completed operations. If the OTID is already marked complete, the saved output parameters (the job ID) are returned directly; otherwise the request is processed, the RPC results are saved against the OTID, and the results are then returned to the client.]
35. Consider a non-idempotent operation that changes the internal state of the component, but does not
◦ initiate requests to external components
◦ carry out asynchronous processing to complete the request.
The non-idempotent code is wrapped within a database transaction.
First consider this simple handling of non-idempotent code…
36. 1. Begin Network Service(oTid)
2. Non-idempotent code
3. Log service request result(oTid,results)
4. End Network Service
37. 1. Begin Network Service(oTid)
2. DB Transaction Begin
3. Non-idempotent code
4. Log service request result(oTid,results)
5. DB Transaction End(Commit)
6. End Network Service
38. 1. Begin Network Service(oTid)
2. DB Transaction Begin
3. Non-idempotent code
4. Log service request result(oTid,results)
5. DB Transaction End(Commit)
6. End Network Service
Case 1: if the system crashes before step 5
• The state changes are not committed to durable storage, so the component keeps a consistent state.
• The client requesting the remote operation will keep retrying the request until it completes.
Case 2: if the system crashes after step 5, but before the result is sent to the client
• The framework has already committed the log of the service request, which contains the oTid and the response to be sent back to the client.
• The reliable protocol layer simply looks up the log and replies with the original result.
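A minimal sketch of this pattern, with sqlite3 standing in for DB2 and illustrative table and function names; the state change and the result log are committed in one transaction keyed by the oTid:

import json
import sqlite3

conn = sqlite3.connect("component_state.db")
conn.execute("CREATE TABLE IF NOT EXISTS op_log (otid TEXT PRIMARY KEY, result TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS jobs (job_id INTEGER PRIMARY KEY, descr TEXT)")

def handle_request(otid, job_descr):
    # Case 2: the oTid is already logged, so the operation completed before
    # a crash; just replay the saved result instead of re-executing.
    row = conn.execute("SELECT result FROM op_log WHERE otid = ?", (otid,)).fetchone()
    if row:
        return json.loads(row[0])
    with conn:                                     # one DB transaction
        cur = conn.execute("INSERT INTO jobs (descr) VALUES (?)", (job_descr,))
        result = {"job_id": cur.lastrowid}         # the non-idempotent state change
        conn.execute("INSERT INTO op_log VALUES (?, ?)", (otid, json.dumps(result)))
    return result   # Case 1: a crash before the commit leaves no partial state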
39. The middleware sometimes performs additional operations that involve other components.
E.g., launching PEs:
◦ Validation of preconditions and security checks are performed synchronously.
◦ Dispatching the PEs can be carried out asynchronously.
The System S approach:
◦ A task is processed only after the database transaction under which the task was created has been committed to the durable repository.
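A minimal sketch of this rule (illustrative names; the worker thread and the in-memory ready queue are assumptions, not details from the paper):

import json
import queue
import sqlite3
import threading

conn = sqlite3.connect("sam_state.db")
conn.execute("CREATE TABLE IF NOT EXISTS work_items (item TEXT)")
ready = queue.Queue()                       # feeds the asynchronous worker

def worker():
    while True:
        job = ready.get()
        print("dispatching PEs for", job)   # the asynchronous part of the request

threading.Thread(target=worker, daemon=True).start()

def submit_job(job):
    with conn:                              # transaction that creates the task
        conn.execute("INSERT INTO work_items (item) VALUES (?)",
                     (json.dumps(job),))
    # The task is handed to the worker only after the commit above, so a
    # crash can never leave an executed task that was never made durable.
    ready.put(job)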
40. The System S approach handles these problems as follows:
◦ In principle, the execution of a new unit of work on each thread would have to follow the full reliability approach, but that is quite complicated to implement.
◦ The complexity is reduced by one assumption: a work unit can be scheduled only after the original request has committed.
◦ This guarantees that work units are executed once.
41. Components interacting with each other is very important, and the framework should handle these interactions.
Interactions can be
1. user initiated
2. system initiated
42. The System S job submission process consists of 6 steps.
1. Accept the job description from the user.
2. Check the permissions
(a query to AAS; no change in the AAS local state).
3. Determine PE placement and check node availability
(a query to SRM; no state change).
43. 4. Update the local state
◦ Insert the job into SAM's local tables.
5. Register the job with AAS (the registerJob operation; changes the AAS local state).
6. Deploy the PEs (changes the state of the system).
The HCs do not keep this state persistently; on restart they recover it. But this is not a problem.
44. Consider the registerJob operation (SAM → AAS).
What happens if AAS crashes?
The call appears to have failed, but there are two possibilities:
◦ 1. AAS completed the registration:
retrying gives an error, because the job is already in the system.
◦ 2. AAS did not complete the registration:
SAM must still register the job, so it can simply retry.
45. What happens if SAM crashes?
◦ It may leave the distributed system in an inconsistent state,
in the case where
◦ the job no longer exists in SAM, but
◦ the AAS registration might have succeeded.
◦ On restart, SAM retries the submit operation
◦ (the client keeps retrying the submission while SAM is down).
◦ But there is a problem if the job ends up being registered again.
46. 1. PREPARATION PHASE
1. Accept the job description from the user.
2. Check the permissions.
3. Determine PE placement.
4. Update the local state
◦ Insert the job into SAM's local tables.
5. Generate an oTid for the AAS registerJob call and queue a registration work item with that id.
Commit the current state (SAM's internal tables and work queue) to the database.
(Registering the job with AAS and deploying the PEs are deferred to the next phase.)
47. 2. REGISTER AND LAUNCH PHASE
1. Register the job with AAS using already
generated oTid
2. Start a local database Transaction
3. Deploy PEs
4. Commit current state to the database.
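A minimal sketch of the two phases described above (the aas, srm, hcs, and sam_state objects are illustrative stand-ins for the real components, not interfaces from the paper):

import uuid

def prepare(job_descr, aas, srm, sam_state, work_queue):
    # 1. PREPARATION PHASE: queries only, no changes to other components.
    aas.check_permission(job_descr)                 # step 2
    placement = srm.query_placement(job_descr)      # step 3
    sam_state.setdefault("jobs", []).append(job_descr)   # step 4
    otid = str(uuid.uuid4())                        # step 5: oTid generated once
    work_queue.append(("registerJob", otid, job_descr, placement))
    # ...commit sam_state and work_queue to the database here...

def register_and_launch(aas, hcs, work_item):
    # 2. REGISTER AND LAUNCH PHASE: safe to retry from the beginning,
    # because the pre-generated oTid makes the AAS registration repeatable.
    _, otid, job_descr, placement = work_item
    aas.register_job(otid, job_descr)               # step 1
    for host in placement:                          # step 3: deploy PEs
        hcs[host].start_pe(job_descr)
    # ...commit the current state to the database here (step 4)...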
48. With this approach,
◦ the preparation phase
contains no calls that change another component's state,
so repeating it is harmless;
◦ the register and launch phase
can be repeated many times.
There is no problem if SAM fails,
since register and launch is retried from the beginning,
and since the same oTid is used for the same call, there is no danger of registering the job twice.
49. 1. Registering PEs
◦ for failed PEs
2. Generalizing
◦ the approach of the preceding sections
50. Retry Policy
The retry controller supports
I. Bounded retries
II. Unbounded retries
51. During normal operation of the System S middleware, once failures are detected, the recovery process is automatically kick-started.
In System S, failure detection is the responsibility of the SRM component.
Failures are detected in two different ways.
Central components are periodically contacted by SRM to ensure their liveness. This is done using an application-level ping operation that is built into all the components as part of the framework.
Moreover, all distributed components communicate their liveness to SRM via a scalable heartbeat mechanism.
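A minimal sketch of the two detection paths, assuming illustrative interval and timeout values and a restart callback that are not specified in the paper:

import time

PING_INTERVAL = 5        # seconds between pings of central components (assumed)
HEARTBEAT_TIMEOUT = 15   # seconds of silence before declaring a failure (assumed)

last_heartbeat = {}      # component name -> time of last heartbeat received

def on_heartbeat(component):
    # Distributed components report their own liveness via heartbeats.
    last_heartbeat[component] = time.time()

def detection_cycle(central_components, restart):
    # Active path: SRM pings every central component.
    for comp in central_components:
        try:
            comp.ping()
        except Exception:
            restart(comp)
    # Passive path: check heartbeats from distributed components.
    now = time.time()
    for name, seen in last_heartbeat.items():
        if now - seen > HEARTBEAT_TIMEOUT:
            restart(name)
    time.sleep(PING_INTERVAL)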
52. The recovery process is simple and involves only the restart of the failed component or components.
Once a failed component is restarted, its
state is rebuilt from information in durable
storage before it starts processing any
new or pending operations.
First, the component in-core structures
are read from storage.
53. Next, the list of completed operations is
retrieved, followed by re-populating the
work queue with any pending asynchronous
operations.
Once all the state is populated, the
component starts accepting new external
requests and the pending requests start
being processed.
Any components trying to contact the
restarted component will be able to receive
responses and the system will resume normal
operation.
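A minimal sketch of this restart sequence (the store and component objects and their method names are illustrative, not taken from the paper):

def recover(store, component):
    # 1. Read the component's in-core structures from durable storage.
    component.tables = store.load_in_core_structures()
    # 2. Retrieve the list of completed operations and their results.
    component.completed_ops = store.load_completed_operations()
    # 3. Re-populate the work queue with pending asynchronous operations,
    #    preserving their original submission order.
    for item in store.load_pending_work_items():
        component.work_queue.append(item)
    # Only now start accepting new requests and processing pending ones.
    component.accepting_requests = True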
54. Able to handle multiple component failures at the
same time without any additional work or
coordination.
Failed components can be restarted in any order
and will begin processing requests as and when
they are restarted.
NB: completion of a pending distributed
operation depends on the availability of all
components needed to service that operation
The failure of a component after it has completed
its part of the distributed operation does not
affect the completion of the operation.
56. The effect of failures is measured in three different mocked-up component-graph configurations.
All experiments were conducted with System S running on up to five Linux hosts.
Each host contains 2 Intel Xeon 3.4 GHz CPUs with 16 GB RAM, with an IBM DB2 database as durable storage running on a separate dedicated host.
59. Source-Relay-Sink (SRS)
Market Data Processing (MDP)
60. Inspiration
Berkeley's Recovery Oriented Computing paradigm:
Bug-free software is impossible.
Lower MTTR (Mean Time To Recover) rather than increasing MTTF (Mean Time To Failure).
Fault Tolerance in 3 Tier Applications – Vaysburd, 1999.
61. Inspiration..
Fault Tolerance in 3 Tier Applications –
Vaysburd, 1999.
Client tier should tag requests
Server tier should offload state to a database
Database tier alone should be concerned with
reliability.
62. 1). Replica and consistency management
How to physically set up replicas?
How to switch to a different one?
How to maintain consistency?
Disadvantages:
Overhead of having replicas.
Difficulty of ensuring consistency in the presence of non-idempotent operations.
63. Replica and consistency management ..
1). FT CORBA – OMG, 1998.
First standardization effort on fault tolerant
middleware support.
Handles distributed non-idempotent requests through service replication and consistency.
64. Replica and consistency management ..
1). An Architecture for Object Replication in Distributed Systems – Beedubail et al., 1997.
Hot replicas (multiple copies of a service exist in standby).
A fault-tolerance layer in the middleware relays state changes from the primary replica to the secondary ones to maintain consistency.
65. Replica and consistency management ..
2). Exactly once end to end semantics in
CORBA Invocation across Heterogeneous
Fault Tolerance ORBs – Vaysburd & Yajnik, 1999.
Similar to the TID approach; however, the assumption is that in case of failure a replica will pick up the request, and a multicast mechanism is used to notify all replicas of state changes.
66. Replica and consistency management ..
3). DOORS by Bell Labs – 2000.
Uses interception to capture inter-component
interactions.
FT mainly supported through replication.
67. Replica and consistency management ..
4). Chubby (Lock Service for loosely coupled
distributed systems – 2006) & Zookeeper
(Wait free coordination for Internet scale
systems -2010).
Useful for group services (where a set of
nodes vote to elect a master)
Replicate servers and databases to provide
high availability.
68. 2). Flexible consistency models
Failures are dealt with by relaxing ACID and allowing a temporarily inconsistent state.
It has been shown that many applications can
actually work under such relaxed
assumptions.
69. Flexible consistency models..
1). Cluster-Based Scalable Network Services – Fox et al., 1997.
BASE (Basically Available, Soft State, Eventual Consistency) model.
Doesn’t handle situations where non-
idempotent requests are carried out.
70. Flexible consistency models..
2). Neptune – Shen et al., 2003.
Middleware for clustering support and
replication management of network services.
Flexible replication consistency support.
71. 3). Distributed transaction support
Allow a distributed transaction to roll back in
case of failures.
Done at the expense of central coordination
and a global roll back mechanism.
72. The paper gives a mechanism for achieving reliability and fault tolerance in large-scale distributed systems.
Used in real world middleware – IBM
Infosphere Streams.
This approach avoids complex rollbacks and
the overhead of maintaining active replicas of
components.
Can be implemented as an extension to existing low-level distributed computing technologies (CORBA, DCOM).
73. Support for both stateful and stateless
components allowing the system to grow
organically while providing different levels of
reliability for components (global state
consistency).
Low MTTR.
Can incorporate other low-cost alternatives for ensuring durability (e.g., journaling file systems).
Can tolerate or recover from one or more
concurrent failures.
74. Future plan is to experiment with alternate
durable storage mechanisms and use this
mechanism in other distributed middleware.
75. Good mechanism for implementing FT in a
distributed system, by using middleware.
Unlike traditional FT mechanisms, this
approach focuses on converting a distributed
operation into component local operations
and implementing FT in the communication
protocol (reliable RPC).
Test results demonstrate reliable fault tolerance.
This mechanism is used in IBM's Infosphere Streams enterprise platform, which supports large-scale distribution and can handle petabytes of data.
Editor's Notes
3 tier – client/server/database. The focus of this work is analyzing FT when applications make use of commercial DBs.
Failed request taken over by replica.
OMG – Object Management Group. OMG has been an international, open-membership, not-for-profit computer industry consortium since 1989.
ACID – Atomicity, Consistency, Isolation, Durability (e.g., Internet-based content providers)
DCOM – Microsoft. Infosphere Streams – real-time Big Data analysis platform for enterprises.
A journaling file system keeps a log that tracks changes before committing them to the main file system. In the event of a system crash or power failure, such file systems are quicker to bring back online and less likely to become corrupted.