5. Performance Issues.
Cache performance depends on:
Behavior of uniprocessor cache miss traffic.
Traffic caused by communication.
Factors affecting the two components of miss rate:
Processor count.
Cache size.
Block size.
6. Coherence Misses.
The misses that arise from interprocessor communication, often called coherence misses, can be broken into two separate sources:
True Sharing Misses.
False Sharing Misses [3].
7. Synchronization issues.
Synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions.
For smaller multiprocessors or low-contention situations, the key hardware capability is an uninterruptible instruction or instruction sequence capable of atomically retrieving and changing a value.
In larger-scale multiprocessors or high-contention
situations, synchronization can become a performance
bottleneck [4].
8. Types of Synchronization.
Mutual exclusion.
Synchronize entry into critical sections.
Normally done with locks.
Point-to-point synchronization.
Tell a set of processors (normally set cardinality is one) that
they can proceed.
Normally done with flags.
Global synchronization.
Bring every processor to sync.
Wait at a point until everyone is there.
Normally done with barriers [4].
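The flag-based point-to-point case above can be sketched with C11 atomics; a producer writes a value, then raises a flag that the consumer spins on. This is an illustrative sketch, not code from [4] — the names producer, consumer, data, flag and the value 42 are ours.

```c
#include <stdatomic.h>

int data;              /* payload written by the producer */
atomic_int flag = 0;   /* 0 = not ready, 1 = ready */

/* Producer: write the data, then raise the flag. The (seq_cst) atomic
   store ensures the consumer sees the data once it sees the flag. */
void producer(void) {
    data = 42;
    atomic_store(&flag, 1);
}

/* Consumer: spin on the flag, then read the data. */
int consumer(void) {
    while (atomic_load(&flag) == 0)
        ;  /* spin until the producer signals */
    return data;
}
```

In a real program the two functions would run on different threads; the flag is the only synchronization between them.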
9. Basic Hardware Primitives.
Atomic Exchange.
addi register, r0, 0x1 /* r0 is hardwired to 0 */
Lock: xchg register, lock_addr /* An atomic load and store */
bnez register, Lock
Unlock remains unchanged
Various processors support this type of instruction.
Intel x86 has xchg; Sun UltraSPARC has ldstub (load-store-unsigned byte) and swap.
Normally easy to implement for bus-based systems: whoever wins
the bus for xchg can lock the bus.
Difficult to support in distributed memory systems [4].
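The xchg spin-lock loop above can be sketched portably with C11 atomics, where atomic_exchange plays the role of the atomic exchange instruction. A sketch only: spin_lock_t and the function names are ours, not from [4].

```c
#include <stdatomic.h>

/* A spin lock built on atomic exchange. locked == 0 means free. */
typedef struct { atomic_int locked; } spin_lock_t;

void spin_lock_init(spin_lock_t *l) {
    atomic_init(&l->locked, 0);
}

void spin_lock(spin_lock_t *l) {
    /* atomic_exchange is the xchg of the slide: swap in 1 and retry
       while the old value was already 1 (someone else holds the lock). */
    while (atomic_exchange(&l->locked, 1) != 0)
        ;  /* spin */
}

void spin_unlock(spin_lock_t *l) {
    /* Unlock remains an ordinary store of 0, as on the slide. */
    atomic_store(&l->locked, 0);
}
```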
10. Test and Set
Test-and-set tests a value and sets it if the value passes the test.
For example, we could define an operation that tests for 0 and sets the value to 1, which can be used in a fashion similar to how we used atomic exchange [4].
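C11 exposes this primitive directly as atomic_flag_test_and_set, which atomically sets a flag and returns its previous value, so an acquire attempt succeeds only when the old value was clear. A sketch; try_acquire and release are our names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Try to acquire a lock built on test-and-set: atomically test the flag
   and set it; we succeed only if the old value was clear (0). */
bool try_acquire(atomic_flag *f) {
    return !atomic_flag_test_and_set(f);
}

/* Release simply clears the flag back to 0. */
void release(atomic_flag *f) {
    atomic_flag_clear(f);
}
```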
12. To implement a mutual exclusion lock, we define the operation
FetchAndIncrement, which is equivalent to FetchAndAdd with an increment of 1.
With this operation, a mutual exclusion lock can be implemented using
the ticket lock algorithm.
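A ticket lock built on FetchAndIncrement might look like the following C11 sketch, where atomic_fetch_add with 1 stands in for FetchAndIncrement; ticket_lock_t and its field names are ours.

```c
#include <stdatomic.h>

/* Each arriving thread takes a ticket; the lock serves tickets in order. */
typedef struct {
    atomic_uint next_ticket;  /* fetch-and-incremented by arriving threads */
    atomic_uint now_serving;  /* advanced by each release */
} ticket_lock_t;

void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

void ticket_lock_acquire(ticket_lock_t *l) {
    /* FetchAndIncrement: atomically grab our ticket number. */
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my)
        ;  /* spin until our ticket is served */
}

void ticket_lock_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);  /* serve the next ticket */
}
```

Unlike the plain exchange lock, waiting threads spin reading now_serving rather than repeatedly writing the lock word, and tickets make the acquisition order FIFO.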
13. An alternative primitive is a pair of instructions: a special load called
a load linked or load locked and a special store called
a store conditional.
These instructions are used in sequence: If the contents of the
memory location specified by the load linked are changed
before the store conditional to the same address occurs, then
the store conditional fails.
The store conditional is defined to return 1 if it was successful
and 0 otherwise [4].
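C11 has no load linked/store conditional, but compare-and-swap can mimic the pattern for illustration: remember the loaded value, then store only if the location still holds it. This is our sketch, not real LL/SC — unlike a true store conditional, the CAS emulation cannot detect an A-to-B-to-A change to the location.

```c
#include <stdatomic.h>

/* "Load linked": an ordinary atomic load whose result we remember. */
int ll(atomic_int *loc) {
    return atomic_load(loc);
}

/* "Store conditional": store new_value only if *loc still holds the
   linked value; return 1 on success and 0 on failure, as in [4]. */
int sc(atomic_int *loc, int linked, int new_value) {
    return atomic_compare_exchange_strong(loc, &linked, new_value) ? 1 : 0;
}

/* Atomic increment built from the ll/sc retry pattern. */
void atomic_inc(atomic_int *loc) {
    int old;
    do {
        old = ll(loc);
    } while (!sc(loc, old, old + 1));
}
```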
14. References
1. David E. Ott, "Optimizing Software Applications for NUMA," Internet: http://www.drdobbs.com/go-parallel/article/print?articleId=218401502, July 10, 2009 [Jan. 29, 2015].
2. Prof. H. P. Oscer, "Technical Design Issues," Internet: http://www.oser.org/~hp/ds/node15.html, June 08, 2001 [Jan. 29, 2015].
3. John L. Hennessy and David A. Patterson, "Multiprocessors and Thread-Level Parallelism," in Computer Architecture: A Quantitative Approach, 4th ed., San Francisco: Morgan Kaufmann, 2007, pp. 218-219.
4. Prof. Rajat Moona, Dr. Mainak Chaudhuri, and Prof. Sanjeev K. Aggarwal, "Program Optimization for Multi-core Architectures," Internet: http://nptel.ac.in/courses/106104025/13 [Jan. 29, 2015].
Editor's Notes
UMA gets its name from the fact that each processor must use the same shared bus to access memory, resulting in a memory access time that is uniform across all processors. Note that access time is also independent of data location within memory. That is, access time remains the same regardless of which shared memory module contains the data to be retrieved.
In the NUMA shared memory architecture, each processor has its own local memory module that it can access directly and with a distinctive performance advantage. At the same time, it can also access any memory module belonging to another processor using a shared bus (or some other type of interconnect) as seen in the diagram below:
What gives NUMA its name is that memory access time varies with the location of the data to be accessed. If data resides in local memory, access is fast. If data resides in remote memory, access is slower. The advantage of the NUMA architecture as a hierarchical shared memory scheme is its potential to improve average case access time through the introduction of fast, local memory.
In computer architecture, distributed shared memory (DSM) is a form of memory architecture where the (physically separate) memories can be addressed as one (logically shared) address space. Here, the term shared does not mean that there is a single centralized memory; shared essentially means that the address space is shared (the same physical address on two processors refers to the same location in memory).[1] Distributed Global Address Space (DGAS) is a similar term for a wide class of software and hardware implementations, in which each node of a cluster has access to shared memory in addition to each node's non-shared private memory.
The first source is the so-called true sharing misses that arise from the communication of data through the cache coherence mechanism. They directly arise from the sharing of data among processors.
The second effect, called false sharing, arises from the use of an invalidation-based coherence algorithm with a single valid bit per cache block.
Synchronization mechanisms are typically built with user-level software routines that rely on hardware supplied synchronization instructions.
For smaller multiprocessors or low-contention situations, the key hardware capability is an uninterruptible instruction or instruction sequence capable of atomically retrieving and changing a value.
In larger-scale multiprocessors or high-contention situations, synchronization can become a performance bottleneck because contention introduces additional delays and because latency is potentially greater in such a multiprocessor.
The key ability we require to implement synchronization in a multiprocessor is a set of hardware primitives with the ability to atomically read and modify a memory location.
There are a number of alternative formulations of the basic hardware primitives, all of which provide the ability to atomically read and modify a location, together with some way to tell if the read and write were performed atomically.
These hardware primitives are the basic building blocks that are used to build a wide variety of user-level synchronization operations, including things such as locks and barriers.