3. Complexities
In distributed systems:
Absence of shared memory
Inter-node communication delays can be considerable
Global system state cannot be observed by the constituent machines, due to communication delays, component failures, and the absence of shared memory
Many more modes of failure exist, yet failing soft is a goal
CS 704D Advanced OS 3
4. Some Considerations
Policies/strategies developed for a distributed system can be made applicable in the uniprocessor case
However, policies/strategies developed for the uniprocessor case cannot be extended directly to the distributed case
The same can be simulated by adding a central resource allocator
This increases traffic to the central allocator, and the system fails when the allocator fails
Election of a successor would then be needed
5. Required Assumptions
Messages exchanged by a pair of communicating processes need to be received in the same order as they were generated (the pipelining property)
Every message is received without errors and without duplicates
The underlying network keeps all nodes fully connected: any node can communicate with every other node
6. Desirable Properties
of Algorithms
All nodes should have an equal amount of information
Each node makes decisions on the basis of local information; the algorithm should ensure that nodes make consistent & coherent decisions
All nodes reach decisions with roughly equal effort
Failure of a node should not cause a complete breakdown: the ability to reach a decision and to access resources should not be affected
7. Time & Ordering of Events
Happened Before Relationship
The logical clock needs to ensure:
If a and b are events in the same process and a comes before b, then a->b
If event a is the sending of a message and b is the receipt of that message in another process, then a->b
It is a transitive relationship; that is, if a->b and b->c then a->c
If a and b have no happened-before relationship, then a and b are said to be concurrent
8. Time & Ordering of Events
Logical Clock Properties
If a->b then C(a) < C(b)
The clock condition is satisfied if:
If a and b are events in a process Pi and a comes before b, then Ci(a) < Ci(b)
If a is the event of sending message m by process Pi and b is the receipt of that message by process Pj, then Ci(a) < Cj(b)
9. Time & Ordering of Events
Logical Clock Implementation
Process Pi increments its clock Ci between successive events
Message m is time stamped so that T(m) = Ci(a)
The receiving process Pj adjusts its clock to max(Cj + 1, T(m))
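The two rules above can be sketched as a small clock class (a Python sketch; the receive rule is written as max(Cj, Tm) + 1, a common variant that guarantees the receive event is stamped strictly after the send):

```python
class LamportClock:
    """Minimal Lamport logical clock (sketch, illustrative names)."""

    def __init__(self):
        self.c = 0

    def tick(self):
        # Rule 1: increment the clock between successive local events.
        self.c += 1
        return self.c

    def send(self):
        # A send is an event: tick, then stamp the message with T(m) = Ci(a).
        return self.tick()

    def receive(self, tm):
        # Rule 2: advance past both the local clock and the message stamp,
        # so that C(send) < C(receive) always holds.
        self.c = max(self.c, tm) + 1
        return self.c


p1, p2 = LamportClock(), LamportClock()
tm = p1.send()           # p1 sends a message stamped tm
t_recv = p2.receive(tm)  # p2's receive is stamped after the send
assert tm < t_recv
```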
10. Total Ordering
a => b only when
Ci(a) "less than" Cj(b), or
Ci(a) = Cj(b) and Pi "less than" Pj
A simple way to implement the "less than" relation between processes is to assign a unique number to each process and define Pi "less than" Pj as i < j.
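Since the pair (Ci(a), i) is unique per event, lexicographic comparison of (timestamp, process number) pairs gives the total order directly; Python tuples already compare this way:

```python
# Break ties between equal Lamport timestamps with the process number:
# event a at (Ci(a), i) precedes event b at (Cj(b), j) iff its pair is smaller.
events = [(3, 2), (3, 1), (2, 5), (4, 1)]  # (timestamp, process id)
total_order = sorted(events)
assert total_order == [(2, 5), (3, 1), (3, 2), (4, 1)]
```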
11. Lamport’s Algorithm
Initiator i: Process Pi requires exclusive access to a resource. It sends a time stamped request (Ti, i), where Ti = Ci, to all the other processes.
Other processes (j ≠ i): When Pj receives the request, it places the request on its own queue and sends a reply with time stamp (Tj, j) to Pi
Pi is allowed access only when
Pi's request is at the front of its queue, and
All replies are time stamped later than Pi's time stamp
Pi releases the resource by sending a suitably time stamped release message
Pj then removes Pi's request from its request queue
Cost: 3(N-1) messages; works best on a bus based system where broadcast costs are minimal
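The request/reply/release flow above can be sketched as a toy simulation; message delivery is replaced by direct method calls, and all names (Process, on_request, and so on) are illustrative, not from any real library:

```python
import heapq

class Process:
    """Sketch of Lamport's mutual-exclusion algorithm for one resource.
    Timestamps are (clock, pid) pairs, totally ordered as tuples."""

    def __init__(self, pid, peers):
        self.pid = pid
        self.peers = peers      # ids of all processes, including self
        self.clock = 0
        self.queue = []         # heap of pending (timestamp, pid) requests
        self.replies = {}       # request -> set of pids that replied

    def tick(self, t=0):
        self.clock = max(self.clock, t) + 1
        return self.clock

    def request(self, net):
        ts = (self.tick(), self.pid)
        heapq.heappush(self.queue, ts)        # queue own request
        self.replies[ts] = set()
        for j in self.peers:
            if j != self.pid:
                net[j].on_request(ts, net)    # "send" request to Pj
        return ts

    def on_request(self, ts, net):
        self.tick(ts[0])
        heapq.heappush(self.queue, ts)        # queue Pi's request
        net[ts[1]].on_reply(self.tick(), self.pid)  # reply, stamped later

    def on_reply(self, t, sender):
        self.tick(t)
        for s in self.replies.values():       # simplification: a reply
            s.add(sender)                     # covers all pending requests

    def may_enter(self, ts):
        # Front of the queue, and replies from all other processes
        # (replies are necessarily stamped later than the request).
        return bool(self.queue) and self.queue[0] == ts and \
            self.replies[ts] == set(self.peers) - {self.pid}

    def release(self, ts, net):
        heapq.heappop(self.queue)             # drop own request
        for j in self.peers:
            if j != self.pid:
                net[j].on_release(ts)         # "send" release message

    def on_release(self, ts):
        self.queue.remove(ts)
        heapq.heapify(self.queue)


net = {i: Process(i, [0, 1, 2]) for i in range(3)}
ts = net[0].request(net)
assert net[0].may_enter(ts)   # front of queue + all replies arrived
net[0].release(ts, net)
assert not net[0].queue
```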
12. Ricart-Agrawala Algorithm
Initiator i: Process Pi requires exclusive access to a resource. It sends a time stamped request (Ti, i), where Ti = Ci, to all the other processes.
Other processes (j ≠ i): When Pj receives the request, it reacts as follows:
If Pj is not requesting the resource, it sends a time stamped reply
If Pj also needs the resource and its own time stamp precedes Pi's, Pi's request is retained (deferred); otherwise a time stamped reply is returned
Pi is allowed access once it has received replies from all the other processes
Pi releases the resource by sending the deferred replies for each pending request
Cost: 2(N-1) messages
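The receiver's decision rule, the part that distinguishes this algorithm from Lamport's, can be sketched as a single function (names and the state encoding are illustrative):

```python
def on_request(my_state, my_ts, req_ts):
    """Ricart-Agrawala receive rule (sketch).
    my_state: 'idle', 'wanted', or 'held'; timestamps are (clock, pid)
    pairs under the total order. Returns 'reply' to grant at once,
    or 'defer' to queue the request until release."""
    if my_state == 'idle':
        return 'reply'                       # not competing: grant at once
    if my_state == 'held':
        return 'defer'                       # using the resource: queue it
    # Both want the resource: the earlier totally-ordered timestamp wins.
    return 'reply' if req_ts < my_ts else 'defer'


assert on_request('idle', None, (5, 2)) == 'reply'
assert on_request('wanted', (3, 1), (5, 2)) == 'defer'  # my request is earlier
assert on_request('wanted', (5, 2), (3, 1)) == 'reply'  # theirs is earlier
```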
13. Distributed Shared Memory
A software abstraction over loosely coupled systems
Provides a shared-memory style of operation over the underlying IPC/RPC mechanisms
Can be implemented in the OS kernel or in a runtime system
Also known as a Distributed Shared Virtual Memory (DSVM) system
The shared space exists only virtually
15. DSM Architecture
Unlike tightly coupled systems, this shared memory is entirely virtual
Partitioned into blocks
Local memory is treated as a large local cache
If the requested data is not available locally, a block fault is generated
The OS, through a message, requests the node holding the block and has it migrated to the node where the fault occurred
Data may be replicated locally
Configurations vary depending on what kind of replication and migration policies are used
16. Design issues
Granularity (block size): smaller blocks mean more faults and more traffic; larger blocks suit jobs with higher locality
Structure: layout of the data; depends on the application
Coherence & access synchronization: like the cache coherence situation in a shared-memory multiprocessor
Data location & access: which data to replicate, and where to locate it
Replacement strategy
Thrashing
Heterogeneity
18. Block Size Selection Factors
Large block sizes are favored, as the overheads of transferring a small block and a larger one are not too different
Paging overhead: paging overheads also favor larger block sizes; applications should thus have a larger locality of reference
Directory size: smaller blocks mean a larger directory, hence larger management overhead
Thrashing: thrashing is likely to increase with larger block sizes
False sharing: larger block sizes increase its probability; the consequence is higher thrashing
19. Page Size as Block Size
The page size is preferred as the DSM block size
Advantages are
Existing page fault hardware can be used as the block fault mechanism; memory coherence can be handled in the page fault handlers
Access control can be managed with existing memory mapping systems
If the page size does not exceed the packet size, there is no extra transfer overhead
The page size has proved, over time, to be the right unit as far as memory contention is concerned
21. Structure of Shared Memory
Space
Approaches to structuring
No structure: a linear array of memory; easy to design
By data type: granularity of a variable or object; complex to handle
As a database: a tuple space with associative memory; primitives need to be added to languages, making access to shared data non-transparent
24. Strict Consistency Model
The value read from a memory address is the same as the latest write to that address
Writes become visible to all nodes
Needs an absolute ordering of memory read/write operations; a global time is required (to define "most recent")
Nearly impossible to implement
25. Sequential Consistency Model
All processes should see the same ordering of reads and writes
The exact interleaving does not matter
No memory operation is started until all earlier operations have completed
Acceptable in most applications
26. Causal Consistency Model
Operations are seen in the same (correct) order when they are causally related
If w2 follows w1 and they are causally related, then w1, w2 is the order every process should see
They may be seen in different orders when not causally related
27. Pipelined RAM Consistency Model
All writes of a single process are seen in the same order by the other processes (as in a pipeline)
However, writes by different processes may appear in different orders
(w11, w12) and (w21, w22) can be seen as (w11, w12) followed by (w21, w22), or as (w21, w22) followed by (w11, w12)
Simple to implement
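The interleaving freedom described above can be checked mechanically: an observed sequence is PRAM-consistent as long as each process's own writes appear in program order. A small sketch (function names are illustrative):

```python
def is_subsequence(seq, observed):
    # True if seq appears, in order, within observed (gaps allowed).
    it = iter(observed)
    return all(x in it for x in seq)

def pram_consistent(observed, writes_by_process):
    """PRAM check (sketch): every process's writes must keep program
    order in the observed sequence; the interleaving itself is free."""
    return all(is_subsequence(w, observed)
               for w in writes_by_process.values())


w = {1: ['w11', 'w12'], 2: ['w21', 'w22']}
assert pram_consistent(['w11', 'w21', 'w12', 'w22'], w)      # interleaved: OK
assert not pram_consistent(['w12', 'w11', 'w21', 'w22'], w)  # P1's order broken
```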
28. Processor Consistency Model
Adds memory coherence to the PRAM model
That is, if the writes are to a particular memory location, then all processes should see those writes in the same order, so that memory coherence is maintained
29. Weak Consistency Model
Changes to memory can be made visible after a set of changes has completed (for example, a critical section)
Isolated accesses to a variable are rare; usually there will be several accesses and then none at all
The difficulty is that the system would not know when to show the changes
Application programmers can take care of this through a synchronization variable
Necessarily:
All accesses to the sync variable must follow the strongest consistency (sequential)
All pending writes must complete before an access to the sync variable is allowed
All previous accesses to the sync variable must complete before another access is allowed
30. Release Consistency Model
The weak consistency model requires that
All changes made by a process are propagated to all nodes
All changes made at other nodes are propagated to the process's node
Acquire and release variables are used for synchronization, so that only one of the two operations above needs to be done at a time
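A minimal sketch of the acquire/release idea, with a dictionary standing in for the shared and local copies (all names are illustrative; a real DSM propagates pages, not dictionary entries):

```python
class ReleaseConsistentNode:
    """Release consistency sketch: writes are buffered locally and only
    propagated at release; acquire pulls in changes published by
    earlier releases at other nodes."""

    def __init__(self, shared):
        self.shared = shared    # the "home" copy, updated only on release
        self.local = {}         # this node's working copy
        self.dirty = {}         # writes buffered since the last acquire

    def acquire(self):
        self.local = dict(self.shared)   # pull changes made at other nodes
        self.dirty = {}

    def write(self, k, v):
        self.local[k] = v
        self.dirty[k] = v                # not yet visible elsewhere

    def release(self):
        self.shared.update(self.dirty)   # push buffered changes out
        self.dirty = {}


shared = {'x': 0}
a, b = ReleaseConsistentNode(shared), ReleaseConsistentNode(shared)
a.acquire(); a.write('x', 1)
b.acquire()
assert b.local['x'] == 0        # a has not released yet
a.release(); b.acquire()
assert b.local['x'] == 1        # visible after release + acquire
```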
31. Discussion of Models
The strict consistency model is difficult to implement and is almost never implemented
The sequential consistency model is the most commonly used
Causal, PRAM, processor, weak and release consistency are implemented in many DSM systems; programmers need to intervene
Weak and release consistency provide explicit sync variables to help with consistency
32. Implementing Sequential
Consistency Model
Implementing sequential consistency depends on what replication/migration is allowed
Migration/replication strategies
Non-replicated, non-migrating blocks (NRNMBs)
Non-replicated, migrating blocks (NRMBs)
Replicated, migrating blocks (RMBs)
Replicated, non-migrating blocks (RNMBs)
33. NRNMB
All requests for the block are routed, through the OS and MMU, to the single copy, which is neither replicated nor moved
Can cause
A bottleneck, because memory accesses are serialized
No possibility of parallelism
34. NRMB
No copies; if required, the entire block is moved to the node that needs it
Advantages
No communication costs; all accesses are local
Applications can take advantage of locality; applications with high locality of reference will perform better
Disadvantages
Prone to thrashing
No advantage from parallelism
35. Data Locating in NRMB
Broadcast
On a fault, a request is broadcast and the current owner sends the block
Broadcasts cause communication overheads
Centralized server
The request is sent to the server; the server asks the node holding the block to send it to the requesting node and updates the location information
Fixed distributed server
The fault handler finds the mapping of the block to a specific server, sends the request, and gets the block
Dynamic distributed server
A fault causes a local lookup of the probable owner; the request goes to that node, which either forwards it to another probable owner or holds the block; the requester gets the block and updates its ownership information
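The dynamic distributed server scheme above is often realized as a "probable owner" chain with path compression; a sketch (illustrative names, assuming the true owner points at itself):

```python
def find_owner(start, probable_owner):
    """Dynamic distributed server (sketch): each node keeps a guess at
    the block's owner; a fault follows the chain of probable owners,
    then short-cuts the visited nodes to point at the real owner."""
    path, node = [], start
    while probable_owner[node] != node:   # the true owner points at itself
        path.append(node)
        node = probable_owner[node]
    for visited in path:                  # path compression for next time
        probable_owner[visited] = node
    return node


owners = {0: 1, 1: 3, 2: 2, 3: 3}   # node 3 currently owns the block
assert find_owner(0, owners) == 3
assert owners[0] == 3               # stale hint updated along the way
```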
36. RMB
Replication is used to increase parallelism
Reads can be done locally; writes have overheads
Systems with a high read/write ratio can amortize the write overhead over many reads
Maintaining coherence across the replicated blocks is an issue
The two basic protocols used are
Write-invalidate
Write-update
37. Coherence Protocols
Write-invalidate
On a write fault, the fault handler copies the block from one of the nodes to its own
It invalidates all the other copies, then writes the data
If another node needs the block afterwards, the updated block is replicated to it
Write-update
On a write fault, copy the block to the local node and update the data
Send the address & new data to all the replicas
Operation resumes after all the writes are done
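The write-invalidate steps above can be sketched for a single block (illustrative class and field names; a real implementation works on pages and runs inside the fault handlers):

```python
class WriteInvalidateDSM:
    """Write-invalidate sketch: one block, per-node cached copies.
    A write invalidates every other copy before the data changes."""

    def __init__(self, nodes):
        self.copies = {n: None for n in nodes}  # node -> cached value
        self.owner = None
        self.value = 0

    def read(self, node):
        if self.copies[node] is None:           # read fault: replicate
            self.copies[node] = self.value
        return self.copies[node]

    def write(self, node, value):
        for other in self.copies:               # invalidate other copies
            if other != node:
                self.copies[other] = None
        self.value = value
        self.copies[node] = value
        self.owner = node                       # writer becomes the owner


dsm = WriteInvalidateDSM([0, 1, 2])
assert dsm.read(1) == 0
dsm.write(0, 42)
assert dsm.copies[1] is None      # node 1's copy was invalidated
assert dsm.read(1) == 42          # re-replicated with the new value
```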
38. Comparison
Write-update typically needs a global sequencer to make sure all nodes see the writes in the same sequence
Also, the operations are full writes
Together, these add a significant communication overhead
Write-invalidate does not need all that, just an invalidation signal
Write-invalidate is thus the more often used method
39. Data Locating in RMB Strategy
The owner of a block needs to be located: the node that most recently had write access
The nodes holding valid copies also need to be tracked
Use one of the following
Broadcasting
Centralized server algorithm
Fixed distributed server algorithm
Dynamic distributed server algorithm
40. RNMB
Replicas are maintained, but blocks do not migrate
Consistency is maintained by updating all the replicas through a write-update-like process
41. Data Locating in RNMB Strategy
Replica locations do not change
Replicas are kept consistent
Read requests can go to any node that has the data block
Writes go through a global sequencer
42. Munin: A Release Consistent DSM
System
Structure: a collection of shared variables
Each shared variable goes into a separate memory page
acquireLock and releaseLock primitives are used
A different consistency protocol is applied for each type of shared variable used in the system
Read-only, migratory, write-shared, producer-consumer, result, reduction and conventional
44. Replacement Strategy
Shared memory blocks are replicated and/or migrated, so two decisions need to be made
Which block to replace
Where the replaced block should go
45. Blocks to Replace
Usage based vs. non-usage based
Fixed space vs. variable space
Blocks are classified for replacement as
Unused
Nil
Read-only
Read-owned
Writable
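One way to act on this classification is to rank the states by replacement cost, evicting unused or nil blocks first and writable ones last (a sketch; the ranking follows the list above):

```python
# Replacement priority (sketch): lower rank = cheaper to evict.
PRIORITY = {'unused': 0, 'nil': 1, 'read-only': 2,
            'read-owned': 3, 'writable': 4}

def pick_victim(blocks):
    """Pick the block whose state is cheapest to replace: unused and nil
    blocks cost nothing, read-only copies exist elsewhere, read-owned
    and writable blocks must be handed off or written back first."""
    return min(blocks, key=lambda b: PRIORITY[blocks[b]])


blocks = {'b1': 'writable', 'b2': 'read-only', 'b3': 'nil'}
assert pick_victim(blocks) == 'b3'
```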
46. Place for Replacement Block
Using secondary storage locally
Using the memory space of other nodes: store the block in free memory space at some other node; free memory status needs to be exchanged, piggybacked on normal communication messages
48. Thrashing Situations
DSM allows migration, so migration back and forth can lead to thrashing
Data blocks keep migrating between nodes due to interleaved accesses by processes
Read-only blocks are repeatedly invalidated soon after replication
49. Thrashing Reduction Strategies
Application-controlled locks
Pinning a block to a node for a time t; deciding t could be a very difficult issue
Tuning the coherence strategy to the usage pattern; the transparency of the memory system is compromised
51. Approaches
Data caching managed by the OS
Data caching managed by MMUs
Data caching managed by the language runtime system
53. Features of Heterogeneous DSM
Data conversion
Structuring the DSM as a collection of source-language objects
Allowing only one type of data in a block (has complications)
Memory fragmentation
Compilation issues
The entire page is converted, but only a small part may be used before transfer
Not transparent; user-provided conversion routines may be required
55. Advantages
Simpler abstraction
Better portability of distributed applications
Better performance for some systems
Flexible communications environment
Ease of process migration