Summary of Simultaneous Multithreading: Maximizing On-Chip Parallelism

Computer Architecture: A brief summary of Simultaneous Multithreading: Maximizing On-Chip Parallelism

  1. COMPUTER ARCHITECTURE BATCH 2012 Assignment title: “Summary of Paper” BY FARWA ABDUL HANNAN (12-CS-13) & ZAINAB KHALID (12-CS-33) Date of Submission: Wednesday, 11 May 2016 NFC – INSTITUTE OF ENGINEERING AND FERTILIZER RESEARCH, FSD
  2. Simultaneous Multithreading: Maximizing On-Chip Parallelism. Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195 ______________________________________________________________________________ 1. Introduction The paper examines simultaneous multithreading, a technique that allows several independent threads to issue instructions to multiple functional units in each cycle. The objective of simultaneous multithreading is to increase processor utilization in the face of both long memory latencies and limited available parallelism per thread. The study evaluates the potential improvement of various simultaneous multithreading models relative to wide superscalar architectures and conventional multithreaded architectures. The results show the limits of superscalar execution and traditional multithreading for increasing instruction throughput in future processors. 2. Methodology The main goal is to evaluate several architectural alternatives in order to examine simultaneous multithreading. For this, a simulation environment has been developed that defines the implementation of the simultaneous multithreaded architecture, an extension of next-generation wide superscalar processors. 2.1 Simulation Environment The simulator uses emulation-based instruction-level simulation that caches partially decoded instructions for fast emulated execution. It models the pipeline execution, memory hierarchy, and branch prediction logic of wide superscalar processors. The simulator is based on the Alpha 21164, but unlike the Alpha this model supports increased single-stream parallelism. The simulated configuration consists of 10 functional units of four types (four integer, two floating point, three load/store, and one branch), and the issue rate is at most 8 instructions per cycle. All functional units are assumed to be completely pipelined.
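The issue constraints described above can be sketched with a toy one-cycle model. This is a hypothetical illustration, not the authors' simulator; the instruction mix and the in-order slot filling are invented for the example.

```python
# Hypothetical sketch (not the authors' simulator): issue instructions for
# one cycle under the summarized functional-unit mix (four integer, two
# floating point, three load/store, one branch) and an 8-wide issue cap.
from collections import Counter

FU_LIMITS = {"int": 4, "fp": 2, "ls": 3, "br": 1}
MAX_ISSUE = 8

def issue_one_cycle(ready):
    """Return the instructions issued this cycle from a list of ready
    instruction types, respecting per-FU limits and the issue-width cap."""
    used = Counter()
    issued = []
    for instr in ready:
        if len(issued) == MAX_ISSUE:
            break                        # all 8 issue slots filled
        if used[instr] < FU_LIMITS[instr]:
            used[instr] += 1             # claim one functional unit
            issued.append(instr)
    return issued

# Ten ready instructions compete for eight slots and limited FUs:
print(issue_one_cycle(["int"] * 5 + ["fp"] * 3 + ["ls"] * 2))
# → ['int', 'int', 'int', 'int', 'fp', 'fp', 'ls', 'ls']
```

The fifth integer instruction and the third floating-point instruction are left for the next cycle, mirroring the priority rule described above for instructions not scheduled due to functional unit availability.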
The first- and second-level on-chip caches are assumed to be considerably larger than on the Alpha, for two reasons: first, multithreading puts a larger strain on the cache subsystem, and second, larger on-chip caches are expected to be common in the same time frame in which simultaneous multithreading becomes viable. Simulations with caches closer to those of current processors were also run; they are discussed in the experiments where appropriate, but their results are not shown. An instruction cache access occurs whenever the program counter crosses a 32-byte boundary; otherwise, the instruction is taken from the already-fetched buffer. Dependence-free instructions are issued in order to an eight-instruction-per-thread scheduling window. From there, instructions can be scheduled onto functional units, depending on functional unit availability. Instructions that are not scheduled due to functional unit availability have priority in the next cycle. This straightforward issue is
  3. complemented with the use of state-of-the-art static scheduling by the Multiflow trace scheduling compiler. This reduces the benefits that might be gained by full dynamic execution while eliminating a great deal of complexity (e.g., there is no need for register renaming unless precise exceptions are required, and a simple 1-bit-per-register scoreboarding scheme can be used) in the replicated register sets and fetch/decode pipes. 2.2 Workload The workload consists of the SPEC92 benchmark suite: twenty public-domain, non-trivial programs that are widely used to measure the performance of computer systems, particularly those in the UNIX workstation market. These benchmarks were expressly chosen to represent real-world applications and were intended to be large enough to stress the computational and memory system resources of current-generation machines. To gauge the raw instruction throughput achievable by multithreaded superscalar processors, uniprocessor applications are used, with a distinct program assigned to each thread. This models a parallel workload achieved by multiprogramming rather than parallel processing, so the throughput results are not affected by synchronization delays, inefficient parallelization, and similar effects. Each program is compiled with the Multiflow trace scheduling compiler, modified to produce Alpha code scheduled for the target machine. The applications were each compiled with several different compiler options. 3. Superscalar Bottlenecks: Where Have All the Cycles Gone? This section provides the motivation for SM. Using the base single-hardware-context machine, issue utilization is measured, i.e., the percentage of issue slots that are filled in each cycle, for most of the SPEC benchmarks, and the cause of each empty issue slot is recorded. The results demonstrate that the functional units of the proposed wide superscalar processor are highly underutilized.
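The accounting of empty issue slots can be sketched as follows. This is a hypothetical illustration of the paper's waste taxonomy, with made-up per-cycle issue counts: a completely idle cycle contributes vertical waste, and unused slots in a partially filled cycle contribute horizontal waste.

```python
# Classify wasted issue slots on an 8-wide machine from a (made-up) trace
# of how many instructions issued each cycle.
ISSUE_WIDTH = 8

def classify_waste(issued_per_cycle):
    vertical = horizontal = 0
    for n in issued_per_cycle:
        if n == 0:
            vertical += ISSUE_WIDTH        # whole cycle lost
        else:
            horizontal += ISSUE_WIDTH - n  # leftover slots in a busy cycle
    return vertical, horizontal

# Five cycles: two idle, three partially filled.
v, h = classify_waste([3, 0, 5, 0, 2])
print(v, h)   # → 16 14  (of 40 total issue slots, only 10 were used)
```

Conventional multithreading can attack only the vertical component (idle cycles), while simultaneous multithreading can recover both kinds of waste.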
These results also indicate that there is no single dominant source of wasted issue bandwidth. Simultaneous multithreading has the potential to recover issue slots lost to both horizontal and vertical waste; the next section shows how effectively it does so. 4. Simultaneous Multithreading This section discusses the performance results for simultaneous multithreaded processors. Several machine models for simultaneous multithreading are defined, and it is shown that simultaneous multithreading provides a significant performance improvement over both single-threaded superscalar and fine-grain multithreaded processors. 4.1 The Machine Models The Fine-Grain Multithreading, SM:Full Simultaneous Issue, SM:Single Issue, SM:Dual Issue, SM:Four Issue, and SM:Limited Connection models reflect several possible design choices for a
  4. combined multithreaded and superscalar processor:  Fine-Grain Multithreading  SM:Full Simultaneous Issue  SM:Single Issue  SM:Dual Issue  SM:Four Issue  SM:Limited Connection 4.2 The Performance of Simultaneous Multithreading The performance of the simultaneous multithreading models is then presented. The fine-grain multithreaded architecture offers only a limited maximum speedup, while the simultaneous multithreading models achieve much larger speedups over a single thread; the largest speedups come from full simultaneous issue. With simultaneous multithreading, no single thread needs to use all of the processor's resources to reach maximum performance, and even the four-issue model approaches full simultaneous issue as the ratio of threads to issue slots increases. The experiments also show the possibility of trading the number of hardware contexts against complexity in other areas. The increase in processor utilization comes from threads sharing processor resources that would otherwise sit idle much of the time, but sharing also has negative effects: resources shared among threads reduce the performance of any single thread, so single-thread execution is somewhat slower and the benefit comes from running multiple threads. The main effect is cache sharing; it was found that increasing the sharing of data brings the wasted cycles down to 1%. Larger caches are not necessary to obtain the speedups: smaller caches affect the 1-thread and 8-thread results similarly, so the overall speedups remain roughly constant across a wide range of cache sizes. As a result, it is shown that simultaneous multithreading exceeds the performance possible through either single-thread execution or fine-grain multithreading, when run on a wide superscalar.
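The saturation behavior described above can be sketched with a toy throughput model. The per-thread ILP figure here is an invented assumption, not data from the paper; the sketch only shows why filling the 8 issue slots from several threads raises utilization until the machine width is reached.

```python
# Hypothetical sketch: an SM:Full Simultaneous Issue machine fills its 8
# slots from any ready thread, so slots one thread cannot use are taken
# by another. PER_THREAD_ILP is an assumed, illustrative number.
ISSUE_WIDTH = 8
PER_THREAD_ILP = 2   # assumption: each thread issues at most 2 instrs/cycle

def ipc(num_threads):
    """Issue slots filled per cycle with num_threads sharing the machine."""
    return min(ISSUE_WIDTH, num_threads * PER_THREAD_ILP)

for t in (1, 2, 4, 8):
    print(t, ipc(t), ipc(t) / ipc(1))   # threads, IPC, speedup over 1 thread
```

With these assumed numbers, throughput grows linearly until four threads saturate the 8-wide machine, after which extra threads add no raw throughput; in the real machine they still help by hiding latencies.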
It is also noted that basic implementations of SM with incomplete per-thread abilities can still achieve high instruction throughput, and no architectural change is required for this. 5. Cache Design for a Simultaneous Multithreaded Processor The cache design problem is examined next. The focus is on the organization of the first-level (L1) caches, comparing private per-thread caches to shared caches for both instructions and data. The study uses the 4-issue model with up to 8 threads. When fewer than eight threads are running, not all of the private caches are used. Among the alternatives for multithreaded caches, shared caches adapt well to a small number of threads, while private caches perform well with a large number of threads. The instruction and data caches, however, give opposite results because their trade-offs are not the same: a shared data cache outperforms a private data cache at all numbers of threads, whereas the instruction caches can benefit from private caches at 8 threads. The reason is the difference in access patterns between data and instructions.
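The capacity argument behind the private-versus-shared comparison can be sketched with a small calculation. The cache sizes here are illustrative assumptions, not the paper's configuration: with private caches a fixed total capacity is split statically eight ways, so slices sit idle when fewer than eight threads run, while a shared cache gives the running threads everything.

```python
# Hypothetical capacity sketch: static private partitioning vs. one shared
# cache. TOTAL_KB and CONTEXTS are illustrative assumptions.
TOTAL_KB = 64
CONTEXTS = 8

def usable_capacity(running_threads, private):
    """KB of L1 actually usable by the running threads."""
    if private:
        # each thread owns a fixed slice; idle threads' slices are wasted
        return (TOTAL_KB // CONTEXTS) * running_threads
    return TOTAL_KB   # shared: all capacity available regardless of count

for t in (1, 4, 8):
    print(t, usable_capacity(t, private=True), usable_capacity(t, private=False))
```

At one thread the private scheme wastes seven eighths of the capacity, matching the observation above that shared caches adapt better to small numbers of threads; private caches instead avoid inter-thread interference, which matters most at 8 threads.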
  5. 6. Simultaneous Multithreading versus Single-Chip Multiprocessing The performance of simultaneous multithreading is compared to small-scale, single-chip multiprocessing (MP). The two scenarios are similar in that both have multiple register sets, multiple functional units, and high issue bandwidth on a single chip; the basic difference is how these resources are partitioned and organized. Scheduling is clearly more complex for an SM processor. The functional unit configuration is frequently optimized for the multiprocessor and represents a less useful configuration for simultaneous multithreading. The MP configurations are evaluated with 1, 2, and 4 issues per cycle on every processor, and the SM processors with 4 and 8 issues per cycle; the 4-issue model is used for all SM values, which reduces the differences in complexity between the SM and MP architectures. The experiments show that the SM results are conservative in two respects: the amount of time required to schedule instructions onto functional units, and the shared cache access time. The distance between the data or instruction cache and the load/store units can have a large influence on cache access time: the multiprocessor, with private caches and private load/store units, can minimize these distances, but the SM processor cannot, even with private caches, because the load/store units are shared. Two alternative structures could remove this difference. There are further advantages of SM over MP that are not shown by the experiments. The first is performance with fewer threads: the results display only the performance at maximum utilization, and the advantage of SM over MP grows as some of the processors become unutilized. The second is granularity and flexibility of design: the configuration options are much richer with SM.
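The partitioning difference described above can be sketched with a toy comparison. The thread demands and machine shapes here are invented to show the effect, not measurements from the paper: an MP splits its issue slots into fixed-width processors, while an SM processor lets all threads draw from one wide pool.

```python
# Hypothetical sketch of static vs. dynamic partitioning of issue bandwidth.
def mp_throughput(demands, cores=2, width=4):
    """Each thread is pinned to one core and capped at that core's width."""
    return sum(min(d, width) for d in demands[:cores])

def sm_throughput(demands, width=8):
    """Threads share one issue pool; the total is capped at machine width."""
    return min(sum(demands), width)

demands = [6, 1]   # invented: one high-ILP thread, one low-ILP thread
print(mp_throughput(demands), sm_throughput(demands))   # → 5 7
```

With these assumed demands, the MP strands two slots on the busy core and three on the idle one, while the SM machine gives the high-ILP thread the slots its partner cannot use, which is the flexibility argument made above.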
With a multiprocessor, by contrast, computing resources must be added in units of whole processors; the evaluations did not even take advantage of this flexibility. As the performance and complexity results show, when component densities allow multiple hardware contexts and wide issue bandwidth to be placed on a single chip, simultaneous multithreading represents the most efficient organization of those resources.