2. What does “parallel” mean?
According to Webster, parallel is "an arrangement or state that permits several operations or tasks to be performed simultaneously rather than consecutively."
3. What is a parallel computer?
"A large collection of processing elements that can communicate and cooperate to solve large problems fast."
4. Parallelism
• Parallel computing is a form of computation in which many calculations are carried out simultaneously.
• Parallel computers can be roughly classified according to the level at which the hardware supports parallelism.
• Multi-core and multi-processor computers have multiple processing elements within a single machine; clusters and grids use multiple computers to work on the same task. Specialized parallel architectures are used alongside traditional processors to accelerate specific tasks, e.g. GPUs.
5. Flynn’s taxonomy
SISD(Single Instruction Single Data)
SIMD(Single Instruction Multiple Data)-
available on CPU enables single op on multiple
data at once.
MISD(Multiple Instruction Single Data)
MIMD(Multiple Instruction Multiple Data)-
several cores on a single die
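To make SIMD concrete, here is a minimal C sketch using x86 SSE intrinsics (the array contents and size are illustrative): one `_mm_add_ps` instruction adds four floats at once, which is exactly the "single instruction, multiple data" idea.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* x86 SSE intrinsics: a 128-bit register holds 4 floats */

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    /* SIMD: one instruction adds four float pairs per iteration,
       instead of processing one element at a time (SISD). */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);   /* prints: 9 9 9 9 9 9 9 9 */
    printf("\n");
    return 0;
}
```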
6. Parallelism-How?
• Task parallelism
• Data parallelism (both contrasted in the sketch after this list)
• Recent CPUs employ several parallelisation techniques: branch prediction, out-of-order execution, superscalar execution.
• These increase complexity, limiting the number of CPUs on a single chip.
• In a GPU each processing unit is simple, but a large number fit on a single chip.
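A minimal OpenMP sketch contrasting the two forms (array names and sizes are illustrative; compile with `-fopenmp`): data parallelism splits one operation's iterations across threads, while task parallelism runs different operations concurrently.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double x[N], y[N], z[N];

int main(void) {
    /* Data parallelism: the same operation applied to different
       elements, with iterations divided among threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = 2.0 * i;

    /* Task parallelism: two different operations run concurrently
       in separate sections. */
    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 0; i < N; i++) y[i] = x[i] + 1.0;   /* task A */
        #pragma omp section
        for (int i = 0; i < N; i++) z[i] = x[i] * x[i];  /* task B */
    }

    printf("y[10] = %.0f, z[10] = %.0f\n", y[10], z[10]);
    return 0;
}
```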
7. Parallel Architectures
Three popular architectures:
1. Shared memory (uniform memory access and symmetric multiprocessing)
2. Distributed memory (clusters and networks of workstations)
3. Shared-distributed memory (non-uniform memory access)
8. Difference With Distributed Computing
In parallel computing, different processors/computers work on a single common goal.
E.g., ten men pulling a rope to lift one rock. Supercomputers implement parallel computing.
In distributed computing, several different computers work separately on a multi-faceted computing workload.
E.g., ten men pulling ten ropes to lift ten different rocks, or employees in an office each doing their own work.
9. Difference With Cluster Computing
A computer cluster is a group of linked computers working together so closely that in many respects they form a single computer.
E.g., in an office of 50 employees, a group of 15 does some work, 25 some other work, and the remaining 10 something else.
Similarly, in a network of 20 computers, 16 work on a common goal, whereas 4 work on some other common goal.
Cluster computing is a specific case of parallel computing.
10. Difference With Grid Computing
Grid computing makes use of computers communicating over the Internet to work on a given problem.
E.g., three people, one from the USA, one from Japan, and one from Norway, working together online on a common project.
Websites like Wikipedia, Yahoo! Answers, YouTube, and Flickr, or open-source OSes like Linux, are cited as examples of grid computing.
Again, this is an example of parallel computing.
11. Cluster Computing
• A loosely connected network of nodes (computers) linked via a high-speed LAN
• Orchestrated by "clustering middleware"
• Relies on a centralized management approach that makes the nodes available as orchestrated shared servers
12. GPU-Graphics Processing Unit
• The dominant, massively parallel architecture available to the masses
• Simple yet energy-efficient computational cores
• Thousands of simultaneously active fine-grained threads
13. Where are GPUs used?
Designed for a particular class of applications
with the following characteristics:
Computational requirements are large.
Parallelism is substantial.
Throughput is more important than latency.
14. Fixed-function GPUs
• A lengthy, feed-forward GPU pipeline with many stages, each typically accelerated by special-purpose parallel hardware
• Each stage’s hardware is customized for its given task
• The hardware in any given stage can exploit data parallelism within that stage, processing multiple elements at the same time
• Advantage: high throughput
• Disadvantage: load balancing
15. GPU evolution
Six years ago:
• A fixed-function processor built around the graphics pipeline
• Best described as additions of programmability to a fixed-function pipeline
Today:
• A full-fledged parallel programmable processor
• Both application programming interfaces (APIs) and hardware increasingly focus on the programmable aspects of the GPU: vertex programs and fragment programs
16. Remote Sensing Processing
• On-the-fly processing: part by part
• Most algorithms do not consider the neighbourhood of each pixel
• The development of languages like CUDA and OpenCL motivated programmers to move to heterogeneous processing platforms
17. Challenges for parallel-computing chips
1. Power supply voltage scaling is diminishing
2. Memory bandwidth improvements are slowing down
3. Programmability
– Memory model
– Degree of parallelism
– Heterogeneity
4. Research is still going strong in parallel computing
18. Cluster memory
• Increased CPU utilisation requires limiting the number of parallel processes.
• However, as problem size increases, page faults occur.
20.
Memory fragmentation
• Total memory is distributed into discrete chunks
• Uneven and inefficient utilisation
Paging overhead
• Disk paging in heavily loaded nodes: high cost
• Hard disks are much slower
21. NETWORK RAM
• Applications can allocate more memory than what is locally available.
• Idle memory of other machines is used via a fast interconnecting network (sketched below).
• No page faults.
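A minimal C sketch of the idea, assuming a hypothetical page table where an entry may point at local RAM or at idle memory on a remote node; `net_fetch` is a stub standing in for a round trip over the fast interconnect.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

enum page_loc { LOCAL, REMOTE };

struct page_entry {
    enum page_loc loc;
    void    *local_addr;   /* valid when loc == LOCAL  */
    int      remote_node;  /* valid when loc == REMOTE */
    uint64_t remote_key;   /* remote page identifier   */
};

/* Stub: a real implementation would fetch the page from the remote
   node's memory over the LAN; here we zero-fill to stay runnable. */
static int net_fetch(int node, uint64_t key, void *buf, size_t len) {
    (void)node; (void)key;
    memset(buf, 0, len);
    return 0;
}

/* Reading a page: a remote hit costs a network round trip
   (microseconds) instead of a disk fault (milliseconds). */
int read_page(const struct page_entry *pe, void *buf) {
    if (pe->loc == LOCAL) {
        memcpy(buf, pe->local_addr, PAGE_SIZE);
        return 0;
    }
    return net_fetch(pe->remote_node, pe->remote_key, buf, PAGE_SIZE);
}

int main(void) {
    char buf[PAGE_SIZE];
    struct page_entry pe = { REMOTE, NULL, 3, 42 };  /* page lives on node 3 */
    printf("read_page -> %d\n", read_page(&pe, buf));
    return 0;
}
```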
23. Disadvantages of existing NRAM
• A parallel job divides into processes which need to be synchronised regularly.
• Nodes seek NRAM independently, so uneven amounts may be granted and processes run at different speeds.
• The whole job is limited by the speed of the slowest process.
24. Diagram of Parallel Network-RAM: Application 2 is assigned to nodes P3, P4, and P5, but utilizes the available memory space in other nodes, such as P2, P6, and P7.
25. Generic Description
• All nodes host PNR servants; a servant acts as both client and server.
• Managers (some servants) coordinate client requests.
• If a server has more unallocated memory than a threshold, it grants the NRAM request and allocates memory to the manager (see the sketch below).
• Read and write requests come directly from the clients.
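A sketch of the grant rule just described, assuming a simple servant record (the field names and threshold value are assumptions, not from the paper):

```c
#include <stdbool.h>
#include <stdio.h>

struct servant {
    size_t total_mem;      /* bytes of RAM on this node       */
    size_t allocated_mem;  /* bytes already granted or in use */
    size_t threshold;      /* keep at least this much free    */
};

/* Grant an NRAM request only if unallocated memory stays above the
   threshold; on success the memory is reserved for the manager. */
bool grant_nram(struct servant *s, size_t request) {
    size_t free_mem = s->total_mem - s->allocated_mem;
    if (free_mem < request || free_mem - request < s->threshold)
        return false;               /* deny: would over-commit */
    s->allocated_mem += request;    /* grant */
    return true;
}

int main(void) {
    struct servant s = { 32u << 20, 8u << 20, 4u << 20 };  /* 32/8/4 MB */
    printf("grant 16 MB: %d\n", grant_nram(&s, 16u << 20)); /* 1: ok   */
    printf("grant 16 MB: %d\n", grant_nram(&s, 16u << 20)); /* 0: deny */
    return 0;
}
```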
26. Generic Description
1. The client attempts to allocate and de-allocate NRAM on behalf of its hosting node.
2. Once memory is allocated, the client is informed which nodes are the server hosts and the amount of memory allocated.
3. The client then sends pages to the servers for storage and later retrieval.
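The three steps above, reduced to a hypothetical message vocabulary (the names and reply layout are illustrative, not the paper's protocol):

```c
#include <stddef.h>

enum pnr_msg {
    PNR_ALLOC,   /* step 1: client requests NRAM on behalf of its node */
    PNR_GRANT,   /* step 2: reply naming server hosts and amounts      */
    PNR_PUT,     /* step 3: client stores a page on a server...        */
    PNR_GET,     /*         ...and later retrieves it                  */
    PNR_FREE     /* client de-allocates on behalf of its node          */
};

#define MAX_HOSTS 8

/* Reply to PNR_ALLOC: which server nodes host the granted memory
   and how much each one granted. */
struct pnr_grant {
    int    n_hosts;
    int    host[MAX_HOSTS];    /* server node IDs    */
    size_t bytes[MAX_HOSTS];   /* bytes granted each */
};
```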
28. CEN Strategy
• Only one manager coordinates all client requests.
• All servants know this manager.
• Advantage: no broadcast of memory load information.
• Disadvantage: the network connection leading to the manager node becomes a bottleneck.
29. CLI Strategy
• Each client is a manager and sends allocation requests directly.
• Advantage: no synchronisation overhead; allocates NRAM quickly.
• Disadvantage: some clients receive large amounts of NRAM while some may not, worsening overall performance.
30. MAN Strategy
• When a job starts or stops, one client volunteers as the manager.
• Each servant must agree on the selected manager node.
• Drawback: broadcasting memory load information causes congestion.
31. BB Strategy
• A subset of servants act as managers.
• All clients associated with a job must agree on which manager to contact.
• More scalable than the centralized solution, since load is shared among many servants and fewer messages are used for synchronization.
The four strategies are contrasted in the dispatch sketch below.
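A compact way to contrast the four strategies is to ask who answers a client's allocation request. This dispatch sketch is purely illustrative; the volunteer and backbone selection rules are assumptions.

```c
/* Which servant acts as manager for a client's allocation request? */
enum pnr_strategy { CEN, CLI, MAN, BB };

int pick_manager(enum pnr_strategy s, int client, int job_id,
                 int n_nodes, int n_backbone) {
    switch (s) {
    case CEN: return 0;                   /* single global manager            */
    case CLI: return client;              /* every client manages itself      */
    case MAN: return job_id % n_nodes;    /* one volunteer per job (assumed)  */
    case BB:  return job_id % n_backbone; /* one of a manager subset (assumed)*/
    }
    return -1;
}
```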
32. Models
• Each node: 33 MHz CPU, 32 MB local RAM, hard disk with 9 ms seek time and 50 MB/s transfer rate
• 100 Mbps Ethernet in a star topology
• Each link has 50 ns latency; the central switch has an 80 microsecond processing delay
• No collisions
• System tasks are handled by separate dedicated processors
• One centralised scheduler for the system
• A cache hit ratio of 50% and a memory access every 4 clock cycles are assumed
These parameters make a disk fault far costlier than a remote-memory fetch, as the sketch below shows.
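Plugging these parameters into a back-of-the-envelope comparison (the 4 KB page size is an assumption; the slides do not state one):

```c
#include <stdio.h>

int main(void) {
    double page = 4096.0;                        /* bytes (assumed) */

    /* Disk fault: 9 ms seek + transfer at 50 MB/s. */
    double disk_s = 9e-3 + page / 50e6;

    /* Remote fetch: two 50 ns links, one 80 us switch delay,
       transfer at 100 Mbps (bits, hence the *8). */
    double net_s = 2 * 50e-9 + 80e-6 + page * 8 / 100e6;

    printf("disk fault  : %.3f ms\n", disk_s * 1e3);  /* ~9.08 ms */
    printf("remote fetch: %.3f ms\n", net_s * 1e3);   /* ~0.41 ms */
    printf("speedup     : ~%.0fx\n", disk_s / net_s); /* ~22x     */
    return 0;
}
```

Even ignoring queueing, one disk fault costs roughly twenty remote fetches under this model, which is the gap PNR exploits.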
33. Metrics
• To directly compare DP ("disk paging," a system without PNR) to the various PNR designs, we create another metric based on average response time (R): the optimization ratio (OR), defined as shown below.
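The formula itself did not survive extraction. A plausible reconstruction, consistent with the values reported later (OR around 12% under light load, above 90% under heavy load, gains "as high as 100 percent"), is the relative response-time improvement over disk paging:

$$\mathrm{OR} = \frac{R_{\mathrm{DP}} - R_{\mathrm{PNR}}}{R_{\mathrm{DP}}} \times 100\%$$

where $R_{\mathrm{DP}}$ and $R_{\mathrm{PNR}}$ are the average response times under disk paging and under the PNR design being evaluated. Treat this as a reconstruction, not the paper's verbatim definition.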
34. Experimental set up
• We evaluate the performance of PNR under the following situations:
1. Varying memory loads
2. Varying network speeds
3. Different network topologies
4. Different scheduling strategies
35.
Varying memory
• Vary RAM at each node
• Memory demands of jobs held constant
Varying network performance
• Link bandwidth and processing delay
Schedulers
• Gang scheduler
• Space-sharing scheduler
Paging methods
• Base method is disk paging
• Four PNR methods
Topologies
• Bus
• Star
• Fully connected network
36. Results-1
• As memory load increases, PNR and DP response times tend to infinity.
• As memory load decreases, the response time converges to a constant.
• Adding PNR to systems loaded within some bounds (and with adequate communication links) leads to a performance benefit.
37. Results-2
• PNR is very sensitive to network performance.
• PNR response time tends to infinity as network service time increases, and converges to a constant as service time decreases.
• DP does not follow this model.
• PNR is not viable with low bandwidth or communication bottlenecks.
38. Results-3
• In a space-sharing system, only one process is allowed on a node at a time.
• In the low-load case, CLI is the best choice.
• Under heavy load, NRAM allocation coordination is a limiting factor.
• In gang scheduling, network performance is crucial.
• Under lighter load, OR is about 12%; under heavier load, OR > 90%.
39. Future work
• For some experiments, PNR memory usage was even more non-uniform than DP's.
• More work is needed to ensure that PNR itself does not create more overloaded nodes.
• Coordination of memory resource allocation and the associated communication overhead needs to be addressed.
40. CONCLUSION
• Using a coordinating PNR method under heavier loads
is essential for good performance.
• Coordinating PNR methods offer the best performance enhancement under moderate load.
• Performance gains can be as high as 100 percent.
• CLI can provide acceptable or superior results under
light load only.
• All PNR methods offer little benefit under very heavy
or very light loads.
• Good network performance is crucial for good PNR
performance.