Intel® Hyper-Threading
Technology
Mohammad Radpour
Amirali Sharifian
Outline
• Introduction
• Traditional Approaches
• Hyper-Threading Overview
• Hyper-Threading Implementation
• Front-End Execution
• Out-of-Order Execution
• Performance Results
• OS Support
• Conclusion
Introduction
• Hyper-Threading technology makes a single processor appear as
two logical processors.
• It was first implemented in the Prestonia version of the Pentium®
4 Xeon processor, released on February 25, 2002.
Traditional Approaches (I)
• High performance demands from the Internet and
telecommunications industries
• The gains traditional techniques provide are unsatisfactory
compared with the cost they incur
• Well-known techniques:
• Super Pipelining
• Branch Prediction
• Super-scalar Execution
• Out-of-order Execution
• Fast memories (Caches)
Traditional Approaches (II)
• Super Pipelining:
• Finer pipeline granularity executes far more instructions per
second (higher clock frequencies)
• Cache misses, interrupts and branch mispredictions become harder to handle
• Instruction Level Parallelism (ILP)
• Mainly targets increasing the number of instructions executed per cycle
• Super-scalar processors with multiple parallel execution units
• Results must be verified when executing out of order
• Fast Memory (Caches)
• Hierarchical cache units reduce memory latencies, but are only a
partial solution
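The benefit of a cache hierarchy can be quantified with the classic average memory access time (AMAT) formula. A minimal sketch; the latency and miss-rate numbers below are illustrative assumptions, not figures from the slides:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative two-level hierarchy: the L2 miss penalty is main-memory latency.
l2 = amat(hit_time=10, miss_rate=0.05, miss_penalty=200)  # cycles
l1 = amat(hit_time=1, miss_rate=0.10, miss_penalty=l2)
```

Even with these hypothetical numbers, the hierarchy cuts the average access far below raw memory latency, yet each miss still stalls the pipeline, which is why caches alone are "not an exact solution".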
Traditional Approaches (III)
• In the same silicon technology, normalized against the Intel486™
microarchitecture:
• Integer performance has improved five- or six-fold
• Die size has gone up fifteen-fold, a three-times-higher rate
• Power has increased almost eighteen-fold over the same period
Thread-Level Parallelism
• Chip Multi-Processing (CMP)
• Put 2 processors on a single die
• The processors may share only the on-chip cache
• Cost is still high
• Single Processor Multi-Threading;
• Time-sliced multi-threading
• Switch-on-event multi-threading (works well for server applications)
• Simultaneous multi-threading
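The difference between time-sliced and simultaneous multi-threading can be shown with a toy issue-slot model. This is a simplified sketch of my own (one issue slot per cycle, threads encoded as strings of compute 'C' and memory-stall 'M' cycles), not a description of any real pipeline:

```python
def utilization_time_sliced(t0, t1):
    # Alternate threads each cycle; a cycle is wasted if the scheduled
    # thread is stalled ('M'), even when the other thread could run.
    used = 0
    for cycle, (a, b) in enumerate(zip(t0, t1)):
        scheduled = a if cycle % 2 == 0 else b
        used += scheduled == 'C'
    return used / len(t0)

def utilization_smt(t0, t1):
    # Each cycle the slot issues from whichever thread is not stalled.
    used = sum(1 for a, b in zip(t0, t1) if a == 'C' or b == 'C')
    return used / len(t0)

t0 = "CCMMCCMM"   # thread 0 stalls on cycles 2-3 and 6-7
t1 = "MMCCMMCC"   # thread 1 stalls on the opposite cycles
```

With these complementary stall patterns, time-slicing wastes half the cycles while SMT keeps the slot fully busy, which is the core motivation for Hyper-Threading.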
Hyper-Threading Technology
Hyper-Threading (HT)
Technology
• Provides a more satisfactory solution
• A single physical processor is shared as
two logical processors
• Each logical processor has its own
architecture state
• A single set of execution units is shared
between the logical processors
• N logical PUs are supported
• Delivers its performance gain with only a ~5% die-
size penalty.
• HT allows a single processor to fetch and
execute two separate code streams
simultaneously.
First Implementation on the Intel Xeon
Processor Family
• Several goals were at the heart of the microarchitecture:
• Minimize the die area cost of implementing HT
• When one logical processor is stalled, the other logical processor
can continue to make forward progress
• Allow a processor running only one active software thread
to run at the same speed as without HT
HT Resource Types
• Replicated Resources
• Flags, Registers, Time-Stamp Counter, APIC
• Shared Resources
• Memory, Range Registers, Data Bus
• Shared | Partitioned Resources
• Caches & Queues
HT Pipeline (I)
Execution Pipeline
• Queues are partitioned between the major pipestages of the pipeline
Partitioned Queue Example
• Partitioning a resource ensures fairness and
forward progress for both logical processors!
HT Pipeline (II)
HT Pipeline (III)
The Execution Trace Cache
• A primary, advanced form of L1 instruction cache.
• Delivers 3 μops/clock to the out-of-order execution logic.
• Most instructions are fetched and decoded from the Trace Cache.
• The μops of previously decoded instructions are cached here, so the
instruction decode stage is bypassed
• Recovery from a mispredicted branch is much shorter than
re-decoding the IA-32 instructions
Execution Trace Cache (TC) (I)
• Stores decoded instructions called “micro-
operations” or “uops”
• Access to the TC is arbitrated using two instruction pointers
• If both PUs request access, a switch occurs on the next
cycle.
• Otherwise, the requesting PU takes full access
• Stalls (stemming from misses) lead to a switch
• Entries are tagged with the owning thread's ID
• 8-way set associative, Least Recently Used (LRU)
replacement
• Usage between processors can be unbalanced (due to the shared
nature of the TC)
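A set-associative LRU cache with owner-tagged entries, as described for the Trace Cache, can be modeled in a few lines. This toy sketch (sizes shrunk for illustration; the class name is mine) shows both the thread tagging and the unbalanced-usage effect, since eviction ignores the owner:

```python
from collections import OrderedDict

class SetAssociativeLRU:
    """Toy set-associative cache with LRU replacement; each entry is
    tagged with the logical processor that owns it, as in the TC."""
    def __init__(self, sets=4, ways=8):
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(sets)]

    def access(self, lp, addr):
        s = self.sets[addr % len(self.sets)]
        key = (lp, addr)               # owner tag is part of the lookup
        if key in s:
            s.move_to_end(key)         # mark most recently used
            return True                # hit
        if len(s) >= self.ways:
            s.popitem(last=False)      # evict LRU, regardless of owner
        s[key] = True
        return False                   # miss

tc = SetAssociativeLRU(sets=1, ways=2)
tc.access(0, 0); tc.access(1, 0)       # both threads cache address 0
hit = tc.access(0, 0)                  # thread 0 still hits its own entry
```

Because eviction is owner-blind, a hot thread can push the other thread's traces out, which is the unbalanced usage the slide mentions.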
Execution Trace Cache (TC) (II)
Microcode Store ROM (MSROM) (I)
• Complex IA-32 instructions decode into more than
4 uops
• The TC sends a microcode-instruction pointer to the MSROM
• Shared by the logical processors
• Independent flow for each processor (two microcode
instruction pointers)
• Access to the MSROM alternates between logical processors, as in
the TC
Microcode Store ROM (MSROM) (II)
The Microcode ROM controller
then fetches the uops needed
and returns control to the TC
Translation Lookaside Buffer
(TLB) (I)
• Processors work not with physical memory
addresses but with virtual addresses
• Advantages:
• More memory can be allocated than is physically present
• Only the necessary data is kept in memory
• Disadvantages:
• Virtual addresses need to be translated to physical addresses
• The translation table grows so large that it cannot be stored on chip (it is paged)
• A translation step is needed for each memory access, which is far too slow
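The translation the TLB accelerates can be sketched directly: split the virtual address into a page number and offset, and cache recent page-to-frame mappings so most accesses skip the slow table walk. A simplified model (4K pages, crude FIFO-style eviction, names are mine):

```python
PAGE_SIZE = 4096

class TLB:
    """Tiny fully-associative TLB in front of a (slow) page-table walk."""
    def __init__(self, entries=16):
        self.entries = entries
        self.map = {}                  # virtual page number -> physical frame
        self.hits = self.misses = 0

    def translate(self, vaddr, page_table):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.map:
            self.hits += 1             # fast path: no table walk
        else:
            self.misses += 1
            if len(self.map) >= self.entries:
                self.map.pop(next(iter(self.map)))  # evict oldest entry
            self.map[vpn] = page_table[vpn]         # slow page-table walk
        return self.map[vpn] * PAGE_SIZE + offset

page_table = {0: 7, 1: 3}              # virtual page -> physical frame
tlb = TLB()
pa = tlb.translate(4100, page_table)   # virtual page 1, offset 4
```

A second access to the same page hits in the TLB and avoids the walk entirely, which is why even 16 entries help so much.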
Translation Lookaside Buffer
(TLB) (II)
• A small cache memory directly on the processor that stores
the translations for a few recently accessed addresses.
• Until Core 2, a two-level TLB was used:
• Level 1 TLB: very fast but small (16 entries, for loads only)
• Level 2 TLB: handles load misses (256 entries)
Translation Lookaside Buffer
(TLB) (III)
• Nehalem:
• First level - data TLB:
• Stores 64 entries (small 4K pages) or 32 entries (large 2M/4M pages)
• First level - instruction TLB:
• Stores 128 entries (small pages) and 7 (large pages)
• Second level (unified):
• Shared between data and instructions
• Stores up to 512 entries (small pages only)
Branch Predictors
• A branch breaks the parallelism
• Branch prediction determines whether or not a branch will be
taken and, if it is, quickly determines the target address for
continuing execution
• Complicated techniques are needed:
• an array of branch targets, the Branch Target Buffer (BTB)
• and an algorithm for determining the result of the next branch;
Intel has not provided details on the algorithm used for its
predictors
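Since Intel's actual algorithm is unpublished, a textbook stand-in illustrates the idea: a 2-bit saturating counter per branch decides taken/not-taken, and a BTB supplies the predicted target. This is a generic sketch, not Intel's predictor:

```python
class TwoBitPredictor:
    """Toy branch predictor: a 2-bit saturating counter per branch PC plus
    a Branch Target Buffer mapping branch PC -> predicted target."""
    def __init__(self):
        self.counters = {}             # pc -> 0..3 (>= 2 means predict taken)
        self.btb = {}                  # pc -> target address

    def predict(self, pc):
        taken = self.counters.get(pc, 1) >= 2
        return taken, self.btb.get(pc)

    def update(self, pc, taken, target=None):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)
        if taken:
            self.btb[pc] = target

bp = TwoBitPredictor()
for _ in range(3):                     # a loop branch, taken repeatedly
    bp.update(pc=0x40, taken=True, target=0x10)
taken, target = bp.predict(0x40)
```

The two-bit hysteresis means a single not-taken iteration (a loop exit) does not immediately flip the prediction for the next run of the loop.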
ITLB and Branch Prediction (I)
• On a TC miss, bytes need to be loaded from the L2
cache and decoded into the TC
• The ITLB receives the “instruction deliver” request
• The ITLB translates the next-instruction pointer address to a
physical address
• ITLBs are duplicated, one for each logical processor
• The L2 cache arbitrates on a first-come first-served basis while
always reserving at least one slot for each processor
ITLB and Branch Prediction (II)
• Branch prediction structures are either duplicated or shared
• If shared, entries include owner tags
• The return stack buffer is duplicated
• Very small structure
• call/return pairs are better predicted for software threads
independently
• The branch history buffer is tracked independently for each logical
processor
• The large global history array is a shared structure, tagged with a
logical processor ID
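The duplicated return stack buffer is simple enough to sketch whole. A toy model (depth and names are mine) showing why one private stack per logical processor keeps call/return prediction correct for each thread:

```python
class ReturnStackBuffer:
    """Small stack predicting return addresses for call/return pairs;
    HT duplicates it so the two threads cannot corrupt each other."""
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # oldest prediction is lost
        self.stack.append(return_addr)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

# One private RSB per logical processor, as on the slide.
rsb = [ReturnStackBuffer(), ReturnStackBuffer()]
rsb[0].on_call(0x100); rsb[1].on_call(0x900); rsb[0].on_call(0x200)
pred = rsb[0].predict_return()         # thread 0's innermost call
```

Had the stack been shared without tags, thread 1's call at 0x900 would sit between thread 0's two entries and mispredict its return.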
ITLB and Branch Prediction (III)
Uop Queue
• Decouples the front end from the out-of-order
execution unit
OUT-OF-ORDER EXECUTION ENGINE
Allocator
• The out-of-order execution engine performs:
• Re-ordering
• Tracing
• Sequencing
• The allocator allocates many of the key machine buffers:
• 126 re-order buffer entries
• 128 integer and 128 floating-point registers
• 48 load and 24 store buffer entries
• Allocator logic takes uops from the queue
• Resources are shared equally (partitioned) between processors
• Limiting each processor's key resource usage enforces fairness and
prevents deadlocks over the architecture
• Every clock cycle, the allocator switches between the uop queues
• If there is a “stall” or “HALT”, there is no need to alternate
between processors
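The allocator's alternation policy, including the exception for a stalled or HALTed processor, can be sketched as a short loop. An illustrative model of my own, not Intel's actual logic:

```python
def allocate(queues, cycles):
    """Alternate between the two logical processors' uop queues each clock
    cycle, but keep serving one if the other's queue is empty (stalled or
    HALTed), so no allocation bandwidth is wasted."""
    out, turn = [], 0
    for _ in range(cycles):
        if not queues[turn] and queues[1 - turn]:
            turn = 1 - turn            # don't waste the cycle on an empty queue
        if queues[turn]:
            out.append((turn, queues[turn].pop(0)))
        turn = 1 - turn                # switch every clock cycle
    return out

qs = [["a1", "a2"], ["b1"]]            # thread 0 has two uops, thread 1 has one
trace = allocate(qs, cycles=4)
```

The trace alternates 0, 1, 0 and then idles only when both queues are empty, matching the fairness-with-no-waste behavior described above.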
Register Rename
• The register rename logic renames the architectural IA-32
registers onto the machine's physical registers
• The 8 general-purpose IA-32 registers expand to 128 available
physical registers
• Involves mapping shared register names for each
processor
• Each processor has its own Register Alias Table (RAT)
• Register renaming runs in parallel with the
allocator logic
• Uops are stored in two different queues:
• Memory Instruction Queue (loads/stores)
• General Instruction Queue (the rest)
• Queues are partitioned among the PUs
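Per-thread RATs over a shared physical register pool can be sketched compactly. A toy model (allocation policy and names are mine; real renaming also frees registers at retirement, which is omitted here):

```python
class Renamer:
    """Each logical processor has its own Register Alias Table (RAT)
    mapping architectural registers onto a shared pool of physical
    registers."""
    def __init__(self, num_physical=128):
        self.free = list(range(num_physical))   # shared physical registers
        self.rat = [{}, {}]                     # one RAT per logical processor

    def rename_dest(self, lp, arch_reg):
        phys = self.free.pop(0)                 # allocate a fresh physical reg
        self.rat[lp][arch_reg] = phys
        return phys

    def lookup_src(self, lp, arch_reg):
        return self.rat[lp][arch_reg]

r = Renamer()
p0 = r.rename_dest(0, "eax")    # thread 0 writes EAX
p1 = r.rename_dest(1, "eax")    # thread 1's EAX lands in a different register
```

Both threads use the architectural name EAX, yet their values never collide, because each RAT maps it to a distinct physical register.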
Instruction Scheduling
• Schedulers are at the heart of the out-of-order execution
engine
• There are five schedulers, with queues of 8-12 entries
• Collectively, they can dispatch up to six uops each clock cycle
• A scheduler is oblivious to threads when getting and dispatching uops
• It ignores the owner of the uops
• It considers only dependent inputs and the availability of execution
resources
• It can get uops from different PUs at the same time
• To provide fairness and prevent deadlock, there is a limit on the
number of active entries each PU may have in a queue
Execution Units & Retirement
• Execution units are oblivious to threads when getting and
executing uops
• Since source and destination registers were renamed earlier, it is
enough to access the physical registers during/after execution
• After execution, the uops are placed in the re-order
buffer, which decouples the execution stage from the
retirement stage
• The re-order buffer is partitioned between the PUs
• Uop retirement commits the architecture state in
program order
• Once stores have retired, the store data is written into the
L1 data cache immediately
Memory Subsystem
• Totally oblivious to logical processors
• Schedulers can send load or store uops without regard to PUs; the
memory subsystem handles them as they come
• Memory structures:
• DTLB:
• Translates virtual addresses to physical addresses
• 64 fully associative entries; each entry can map either a 4K or a 4MB page
• Shared between PUs (entries tagged with ID)
• L1, L2 and L3 caches
• The L1 data cache is virtually addressed and physically tagged
• Cache conflicts might degrade performance (threads share the cache)
• Sharing the same data might increase performance (more cache
hits), which is common in server application code
System Modes (I)
• Two modes of operation:
• single-task (ST)
• When there is one software thread to execute
• multi-task (MT)
• When there is more than one software thread to execute
• ST0 or ST1, where the number indicates the active PU
• The HALT instruction was introduced; partitioned resources are recombined
after the call
• The reason is better utilization of resources
System Modes (II)
• HALT transitions the processor from MT-mode to ST0- or ST1-
mode
• In ST0- or ST1-modes, an interrupt sent to the HALTed
processor would cause a transition to MT-mode.
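The ST0/ST1/MT transitions form a small state machine, sketched below. The event names (`halt_lp0`, `interrupt_lp1`, etc.) are my own labels for the HALT and interrupt events the slides describe:

```python
def next_mode(mode, event):
    """ST0/ST1/MT transitions: HALT on one logical processor drops the chip
    to single-task mode (resources recombined); an interrupt delivered to
    the HALTed processor restores MT mode (resources repartitioned)."""
    transitions = {
        ("MT", "halt_lp1"): "ST0",       # LP1 HALTs, LP0 stays active
        ("MT", "halt_lp0"): "ST1",       # LP0 HALTs, LP1 stays active
        ("ST0", "interrupt_lp1"): "MT",  # wake the HALTed LP1
        ("ST1", "interrupt_lp0"): "MT",  # wake the HALTed LP0
    }
    return transitions.get((mode, event), mode)  # other events: no change

mode = next_mode("MT", "halt_lp1")       # resources recombined for LP0
mode = next_mode(mode, "interrupt_lp1")  # back to MT, resources partitioned
```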
Operating system and
Applications
• Hyper-Threading appears to the operating system and application
software as twice the number of processors
• OS optimizations:
1. Use the HALT instruction if one logical processor is active and
the other is not.
• Not using HALT leaves the idle processor spinning in an idle loop!
2. Schedule software threads to logical processors intelligently
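The second optimization means spreading threads across physical packages before doubling up on one package's logical processors. A scheduling-policy sketch under an assumed topology (the mapping and function name are hypothetical):

```python
def assign_threads(num_threads, physical_to_logical):
    """HT-aware scheduling sketch: place software threads on distinct
    physical packages first, using each package's second logical
    processor only when all packages already have one thread."""
    primaries = [lps[0] for lps in physical_to_logical.values()]
    siblings = [lps[1] for lps in physical_to_logical.values()]
    order = primaries + siblings        # preference order for placement
    return order[:num_threads]

# Hypothetical topology: two physical packages, two logical processors each.
topo = {0: [0, 1], 1: [2, 3]}
placement = assign_threads(2, topo)     # one thread per physical package
```

A naive scheduler that filled logical processors 0 and 1 would leave one package idle while two threads compete for the other's shared execution units.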
Performance
• Measured performance increases of 21% and 28% on the two
benchmarked configurations (figure)
OS Support for HT
• Native HT Support
• Windows XP Pro Edition
• Windows XP Home Edition
• Linux v 2.4.x (and higher)
• Compatible with HT
• Windows 2000 (all versions)
• Windows NT 4.0 (limited driver support)
• No HT Support
• Windows ME
• Windows 98 (and previous versions)
Conclusion
• Measured performance (Xeon) showed performance gains of
up to 30% on common server applications.
• HT is expected to become viable and a market standard from mobile
to server processors.
Questions ?

Mais conteúdo relacionado

Mais procurados

Hyper threading technology
Hyper threading technologyHyper threading technology
Hyper threading technologydeepakmarndi
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit pptSandeep Singh
 
Multicore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash PrajapatiMulticore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash PrajapatiAnkit Raj
 
Report on hyperthreading
Report on hyperthreadingReport on hyperthreading
Report on hyperthreadingdeepakmarndi
 
Core i3,i5,i7 and i9 processors
Core i3,i5,i7 and i9 processorsCore i3,i5,i7 and i9 processors
Core i3,i5,i7 and i9 processorshajra azam
 
Superscalar and VLIW architectures
Superscalar and VLIW architecturesSuperscalar and VLIW architectures
Superscalar and VLIW architecturesAmit Kumar Rathi
 
KERNAL ARCHITECTURE
KERNAL ARCHITECTUREKERNAL ARCHITECTURE
KERNAL ARCHITECTURElakshmipanat
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel ProgrammingUday Sharma
 
Lecture 1 introduction to parallel and distributed computing
Lecture 1   introduction to parallel and distributed computingLecture 1   introduction to parallel and distributed computing
Lecture 1 introduction to parallel and distributed computingVajira Thambawita
 
Rtos concepts
Rtos conceptsRtos concepts
Rtos conceptsanishgoel
 

Mais procurados (20)

Hyper threading technology
Hyper threading technologyHyper threading technology
Hyper threading technology
 
Parallel Computing
Parallel ComputingParallel Computing
Parallel Computing
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit ppt
 
intel core i7
intel core i7 intel core i7
intel core i7
 
Ceph on arm64 upload
Ceph on arm64   uploadCeph on arm64   upload
Ceph on arm64 upload
 
Multicore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash PrajapatiMulticore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash Prajapati
 
Report on hyperthreading
Report on hyperthreadingReport on hyperthreading
Report on hyperthreading
 
Core i3,i5,i7 and i9 processors
Core i3,i5,i7 and i9 processorsCore i3,i5,i7 and i9 processors
Core i3,i5,i7 and i9 processors
 
Virtual memory
Virtual memoryVirtual memory
Virtual memory
 
Superscalar and VLIW architectures
Superscalar and VLIW architecturesSuperscalar and VLIW architectures
Superscalar and VLIW architectures
 
KERNAL ARCHITECTURE
KERNAL ARCHITECTUREKERNAL ARCHITECTURE
KERNAL ARCHITECTURE
 
UNIT 3.docx
UNIT 3.docxUNIT 3.docx
UNIT 3.docx
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
Intel Core i7 Processors
Intel Core i7 ProcessorsIntel Core i7 Processors
Intel Core i7 Processors
 
Memory management
Memory managementMemory management
Memory management
 
VIRTUAL MEMORY
VIRTUAL MEMORYVIRTUAL MEMORY
VIRTUAL MEMORY
 
Lecture 1 introduction to parallel and distributed computing
Lecture 1   introduction to parallel and distributed computingLecture 1   introduction to parallel and distributed computing
Lecture 1 introduction to parallel and distributed computing
 
Rtos concepts
Rtos conceptsRtos concepts
Rtos concepts
 
CPU vs GPU Comparison
CPU  vs GPU ComparisonCPU  vs GPU Comparison
CPU vs GPU Comparison
 
Multicore Processor Technology
Multicore Processor TechnologyMulticore Processor Technology
Multicore Processor Technology
 

Destaque

General Director Of Procurement -updated-MODIFIED COVERING LETTER
General Director Of Procurement -updated-MODIFIED COVERING LETTERGeneral Director Of Procurement -updated-MODIFIED COVERING LETTER
General Director Of Procurement -updated-MODIFIED COVERING LETTERmohsen hussain
 
Task and Data Parallelism: Real-World Examples
Task and Data Parallelism: Real-World ExamplesTask and Data Parallelism: Real-World Examples
Task and Data Parallelism: Real-World ExamplesSasha Goldshtein
 
Instruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar ProcessorsInstruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar ProcessorsSyed Zaid Irshad
 
Siemens computex smart grid
Siemens computex smart gridSiemens computex smart grid
Siemens computex smart gridCOMPUTEX TAIPEI
 
Adding Intelligence To Your Mobile Apps
Adding Intelligence To Your Mobile AppsAdding Intelligence To Your Mobile Apps
Adding Intelligence To Your Mobile AppsMayur Tendulkar
 
Sixth Sence Technology
Sixth Sence TechnologySixth Sence Technology
Sixth Sence TechnologyBeat Boyz
 
Collaborative Mapping with Google Wave
Collaborative Mapping with Google WaveCollaborative Mapping with Google Wave
Collaborative Mapping with Google WavePamela Fox
 
Waves of Innovation: Using Google Wave in the ESL Classroom
Waves of Innovation: Using Google Wave in the ESL ClassroomWaves of Innovation: Using Google Wave in the ESL Classroom
Waves of Innovation: Using Google Wave in the ESL ClassroomDavid Bartsch
 
Hyper Transport Technology
Hyper Transport TechnologyHyper Transport Technology
Hyper Transport Technologynayakslideshare
 
GSM 2.5G Migration
GSM 2.5G MigrationGSM 2.5G Migration
GSM 2.5G Migrationmaddiv
 
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...Ahmed kasim
 
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Dr.K. Thirunadana Sikamani
 

Destaque (20)

Hyper threading
Hyper threadingHyper threading
Hyper threading
 
Fahad surahio
Fahad surahioFahad surahio
Fahad surahio
 
H T T1
H T T1H T T1
H T T1
 
Blue Brain
Blue Brain Blue Brain
Blue Brain
 
General Director Of Procurement -updated-MODIFIED COVERING LETTER
General Director Of Procurement -updated-MODIFIED COVERING LETTERGeneral Director Of Procurement -updated-MODIFIED COVERING LETTER
General Director Of Procurement -updated-MODIFIED COVERING LETTER
 
Task and Data Parallelism
Task and Data ParallelismTask and Data Parallelism
Task and Data Parallelism
 
Task and Data Parallelism: Real-World Examples
Task and Data Parallelism: Real-World ExamplesTask and Data Parallelism: Real-World Examples
Task and Data Parallelism: Real-World Examples
 
Concurrency basics
Concurrency basicsConcurrency basics
Concurrency basics
 
Instruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar ProcessorsInstruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar Processors
 
Siemens computex smart grid
Siemens computex smart gridSiemens computex smart grid
Siemens computex smart grid
 
GIFI
GIFI GIFI
GIFI
 
Adding Intelligence To Your Mobile Apps
Adding Intelligence To Your Mobile AppsAdding Intelligence To Your Mobile Apps
Adding Intelligence To Your Mobile Apps
 
Sixth Sence Technology
Sixth Sence TechnologySixth Sence Technology
Sixth Sence Technology
 
Collaborative Mapping with Google Wave
Collaborative Mapping with Google WaveCollaborative Mapping with Google Wave
Collaborative Mapping with Google Wave
 
Hawk eye Technology
Hawk eye TechnologyHawk eye Technology
Hawk eye Technology
 
Waves of Innovation: Using Google Wave in the ESL Classroom
Waves of Innovation: Using Google Wave in the ESL ClassroomWaves of Innovation: Using Google Wave in the ESL Classroom
Waves of Innovation: Using Google Wave in the ESL Classroom
 
Hyper Transport Technology
Hyper Transport TechnologyHyper Transport Technology
Hyper Transport Technology
 
GSM 2.5G Migration
GSM 2.5G MigrationGSM 2.5G Migration
GSM 2.5G Migration
 
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
 
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
 

Semelhante a Intel® hyper threading technology

Motivation for multithreaded architectures
Motivation for multithreaded architecturesMotivation for multithreaded architectures
Motivation for multithreaded architecturesYoung Alista
 
Computer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerComputer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerAmrutaMehata
 
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CanSecWest
 
Chapter1 Computer System Overview Part-1.ppt
Chapter1 Computer System Overview Part-1.pptChapter1 Computer System Overview Part-1.ppt
Chapter1 Computer System Overview Part-1.pptShikhaManrai1
 
Chapter1 Computer System Overview.ppt
Chapter1 Computer System Overview.pptChapter1 Computer System Overview.ppt
Chapter1 Computer System Overview.pptShikhaManrai1
 
IT209 Cpu Structure Report
IT209 Cpu Structure ReportIT209 Cpu Structure Report
IT209 Cpu Structure ReportBis Aquino
 
MK Sistem Operasi.pdf
MK Sistem Operasi.pdfMK Sistem Operasi.pdf
MK Sistem Operasi.pdfwisard1
 
Computer organization & architecture chapter-1
Computer organization & architecture chapter-1Computer organization & architecture chapter-1
Computer organization & architecture chapter-1Shah Rukh Rayaz
 
Multithreaded processors ppt
Multithreaded processors pptMultithreaded processors ppt
Multithreaded processors pptSiddhartha Anand
 
Computer_Organization and architecture _unit 1.pptx
Computer_Organization and architecture _unit 1.pptxComputer_Organization and architecture _unit 1.pptx
Computer_Organization and architecture _unit 1.pptxManimegalaM3
 

Semelhante a Intel® hyper threading technology (20)

Motivation for multithreaded architectures
Motivation for multithreaded architecturesMotivation for multithreaded architectures
Motivation for multithreaded architectures
 
Computer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerComputer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and Microcontroller
 
UNIT 2.pptx
UNIT 2.pptxUNIT 2.pptx
UNIT 2.pptx
 
13 superscalar
13 superscalar13 superscalar
13 superscalar
 
13_Superscalar.ppt
13_Superscalar.ppt13_Superscalar.ppt
13_Superscalar.ppt
 
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
 
Chapter1 Computer System Overview Part-1.ppt
Chapter1 Computer System Overview Part-1.pptChapter1 Computer System Overview Part-1.ppt
Chapter1 Computer System Overview Part-1.ppt
 
Control unit
Control unitControl unit
Control unit
 
Chapter1 Computer System Overview.ppt
Chapter1 Computer System Overview.pptChapter1 Computer System Overview.ppt
Chapter1 Computer System Overview.ppt
 
IT209 Cpu Structure Report
IT209 Cpu Structure ReportIT209 Cpu Structure Report
IT209 Cpu Structure Report
 
Memory Management.pdf
Memory Management.pdfMemory Management.pdf
Memory Management.pdf
 
MK Sistem Operasi.pdf
MK Sistem Operasi.pdfMK Sistem Operasi.pdf
MK Sistem Operasi.pdf
 
Chapter01 (1).ppt
Chapter01 (1).pptChapter01 (1).ppt
Chapter01 (1).ppt
 
Computer organization & architecture chapter-1
Computer organization & architecture chapter-1Computer organization & architecture chapter-1
Computer organization & architecture chapter-1
 
Cs intro-ca
Cs intro-caCs intro-ca
Cs intro-ca
 
Ch8 main memory
Ch8   main memoryCh8   main memory
Ch8 main memory
 
Multithreaded processors ppt
Multithreaded processors pptMultithreaded processors ppt
Multithreaded processors ppt
 
Architecture of high end processors
Architecture of high end processorsArchitecture of high end processors
Architecture of high end processors
 
Computer_Organization and architecture _unit 1.pptx
Computer_Organization and architecture _unit 1.pptxComputer_Organization and architecture _unit 1.pptx
Computer_Organization and architecture _unit 1.pptx
 
cs-procstruc.ppt
cs-procstruc.pptcs-procstruc.ppt
cs-procstruc.ppt
 

Último

4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 

Último (20)

4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
Hyper-Threading (HT) Technology
• Provides a more satisfactory solution
• A single physical processor is shared as two logical processors
• Each logical processor has its own architecture state
• A single set of execution units is shared between the logical processors
• N logical processors per package are supported
• Delivers its performance gain with only a ~5% die-size penalty
• HT allows a single processor to fetch and execute two separate code streams simultaneously
First Implementation on the Intel Xeon Processor Family
• Several goals were at the heart of the microarchitecture:
• Minimize the die-area cost of the implementation
• If one logical processor is stalled, the other logical processor can continue to make forward progress
• Allow a processor running only one active software thread to run at the same speed as a processor without Hyper-Threading
HT Resource Types
• Replicated resources
• Flags, registers, Time-Stamp Counter, APIC
• Shared resources
• Memory, range registers, data bus
• Shared | partitioned resources
• Caches & queues
Execution Pipeline
• Partition the queues between the major pipestages of the pipeline
Partitioned Queue Example
• Partitioning a resource ensures fairness and forward progress for both logical processors!
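The partitioning policy above can be sketched in a few lines of C. This is an illustrative model, not Intel's implementation; the names, the queue size, and the per-processor limit are all made up for clarity. The key property is that one stalled logical processor can never consume the other's half of the entries.

```c
#include <stdbool.h>

/* Hypothetical sketch of a partitioned queue: each of the two logical
 * processors may occupy at most half of the entries, so a stalled
 * thread can never starve the other one. */
#define QUEUE_SIZE   8
#define PER_LP_LIMIT (QUEUE_SIZE / 2)

static int occupied[2];             /* entries held by logical processor 0 and 1 */

bool queue_alloc(int lp)            /* try to allocate one entry for processor lp */
{
    if (occupied[lp] >= PER_LP_LIMIT)
        return false;               /* this logical processor has used its half */
    occupied[lp]++;
    return true;
}

void queue_free(int lp)             /* retire one entry for processor lp */
{
    if (occupied[lp] > 0)
        occupied[lp]--;
}
```

Even if logical processor 0 fills its half and then stalls, allocations for logical processor 1 still succeed, which is exactly the fairness guarantee the slide describes.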
The Execution Trace Cache
• A primary, or advanced, form of L1 instruction cache
• Delivers 3 μops/clock to the out-of-order execution logic
• Most instructions are fetched and decoded from the Trace Cache
• Caches the μops of previously decoded instructions, bypassing the instruction decoder
• Recovery time for a mispredicted branch is much shorter compared with re-decoding the IA-32 instructions
Execution Trace Cache (TC) (I)
• Stores decoded instructions called “micro-operations” or “uops”
• Access to the TC is arbitrated using two instruction pointers
• If both logical processors ask for access, the switch occurs in the next cycle
• Otherwise, access is taken by the requesting logical processor
• Stalls (stemming from misses) lead to a switch
• Entries are tagged with the owning thread's ID
• 8-way set associative, Least Recently Used (LRU) replacement
• Usage between processors can be unbalanced (shared nature of the TC)
Microcode Store ROM (MSROM) (I)
• Complex IA-32 instructions are decoded into more than 4 uops
• The TC sends a microcode-instruction pointer to the MSROM
• Shared by the logical processors
• Independent flow for each processor (two microcode instruction pointers)
• Access to the MSROM alternates between logical processors, as in the TC
Microcode Store ROM (MSROM) (II)
• The Microcode ROM controller fetches the uops needed and returns control to the TC
TLB
• Processors work not with physical memory addresses but with virtual addresses
• Advantages:
• More memory can be allocated than physically exists
• Only the necessary data is kept in physical memory
• Disadvantages:
• Virtual addresses need to be translated to physical addresses
• The translation table gets so large that it can't be stored on-chip (it may even be paged)
• A translation step is needed for each memory access, which is far too slow without caching
Translation Lookaside Buffer (TLB) (II)
• A small cache directly on the processor that stores the translations for a few recently accessed addresses
• Until Core 2, a two-level TLB was used:
• Level 1 TLB: very small (16 entries) but very fast, for loads only
• Level 2 TLB: handles load misses (256 entries)
Translation Lookaside Buffer (TLB) (III)
• Nehalem:
• First level - data TLB:
• Stores 64 entries (small 4K pages) or 32 entries (large 2M/4M pages)
• First level - instruction TLB:
• Stores 128 entries (small pages) and 7 (large pages)
• Second level (unified):
• Shared between data and instructions
• Stores up to 512 entries (small pages only)
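The lookup-then-walk behavior described above can be modeled with a toy TLB in C. Everything here is illustrative: the entry count, the round-robin replacement, and the stand-in page-table walk are assumptions for the demo, not the hardware's actual policy.

```c
/* Illustrative sketch of TLB lookup: a tiny fully associative cache of
 * recent virtual-page -> physical-frame translations, consulted before
 * the (slow) page-table walk. Sizes and names are invented for clarity. */
#define TLB_ENTRIES 4
#define PAGE_SHIFT  12            /* 4 KB pages */
#define PAGE_MASK   0xFFFUL

typedef struct { unsigned long vpn, pfn; int valid; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static int next_victim;           /* simple round-robin replacement */

/* Stand-in for the real page-table walk: a toy mapping for the demo. */
static unsigned long page_table_walk(unsigned long vpn) { return vpn + 100; }

unsigned long translate(unsigned long vaddr, int *hit)
{
    unsigned long vpn = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *hit = 1;             /* fast path: translation already cached */
            return (tlb[i].pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
        }
    *hit = 0;                     /* TLB miss: walk the table and refill */
    unsigned long pfn = page_table_walk(vpn);
    tlb[next_victim] = (tlb_entry_t){ vpn, pfn, 1 };
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    return (pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
}
```

The first access to a page misses and pays for the walk; subsequent accesses to the same page hit in the TLB, which is why keeping the structure on-chip matters so much.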
Branch Predictors
• A branch breaks the parallelism
• Branch prediction determines whether or not a branch will be taken and, if it is, quickly determines the target address for continuing execution
• Complicated techniques are needed:
• An array of branch targets, the Branch Target Buffer (BTB)
• An algorithm for determining the result of the next branch; Intel hasn't provided details on the algorithm used for their new predictors
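Since Intel does not disclose its prediction algorithm, the classic textbook scheme below illustrates the idea instead: a table of 2-bit saturating counters indexed by branch address. The table size and function names are invented for this sketch.

```c
/* Minimal sketch of dynamic branch prediction using 2-bit saturating
 * counters. A counter value of 2 or 3 predicts "taken"; the two-bit
 * hysteresis keeps one atypical outcome from flipping the prediction. */
#define PRED_ENTRIES 16

static unsigned char counters[PRED_ENTRIES];   /* each counter in 0..3 */

int predict(unsigned long branch_addr)         /* 1 = predict taken */
{
    return counters[branch_addr % PRED_ENTRIES] >= 2;
}

void train(unsigned long branch_addr, int taken)  /* update with outcome */
{
    unsigned char *c = &counters[branch_addr % PRED_ENTRIES];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}
```

A predictor like this captures loop branches well: after a few taken iterations the counter saturates, and the single not-taken exit at the end of the loop does not flip the prediction for the next run of the loop.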
ITLB and Branch Prediction (I)
• On a TC miss, bytes need to be loaded from the L2 cache and decoded into the TC
• The ITLB receives the “instruction deliver” request
• The ITLB translates the next-instruction pointer address to a physical address
• The ITLBs are duplicated, one per logical processor
• The L2 cache arbitrates on a first-come, first-served basis while always reserving at least one request slot for each processor
ITLB and Branch Prediction (II)
• Branch prediction structures are either duplicated or shared
• If shared, entries include owner tags
• The return stack buffer is duplicated:
• It is a very small structure
• Call/return pairs are better predicted for software threads independently
• The branch history buffer is tracked independently for each logical processor
• The large global history array is a shared structure tagged with a logical processor ID
Uop Queue
• Decouples the front end from the out-of-order execution unit
Allocator
• The out-of-order execution engine performs:
• Re-ordering
• Tracing
• Sequencing
• The allocator allocates many of the key machine buffers:
• 126 re-order buffer entries
• 128 integer and 128 floating-point registers
• 48 load and 24 store buffer entries
• The allocator logic takes uops from the uop queue
• Resources are shared equally (partitioned) between the processors
• Limiting key resource usage enforces fairness and prevents deadlocks over the architecture
• Every clock cycle, the allocator switches between the uop queues
• If there is a “stall” or “HALT”, there is no need to alternate between processors
Register Rename
• The register rename logic renames the architectural IA-32 registers onto the machine's physical registers
• The 8 general-use IA-32 registers expand to 128 available physical registers
• Involves mapping the shared register names for each processor
• Each processor has its own Register Alias Table (RAT)
• The register renaming process is done in parallel with the allocator logic
• Uops are stored in two different queues:
• Memory Instruction Queue (loads/stores)
• General Instruction Queue (the rest)
• The queues are partitioned between the logical processors
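The RAT mechanism can be sketched as a small C model. This is a deliberately simplified illustration (no free-list recycling at retirement, invented function names), but it shows the core idea: every write to an architectural register claims a fresh physical register, and readers consult the RAT for the latest mapping.

```c
/* Hypothetical sketch of register renaming: a Register Alias Table (RAT)
 * maps each of the 8 architectural registers onto one of 128 physical
 * registers, drawing a fresh physical register on every write. */
#define ARCH_REGS 8
#define PHYS_REGS 128

static int rat[ARCH_REGS];         /* latest physical reg for each arch reg */
static int next_free;              /* next unused physical register */

void rename_init(void)
{
    for (int a = 0; a < ARCH_REGS; a++)
        rat[a] = a;                /* initial identity mapping */
    next_free = ARCH_REGS;
}

/* Writing an architectural register allocates a new physical register,
 * so older in-flight uops still see the previous mapping. */
int rename_dest(int arch_reg)
{
    rat[arch_reg] = next_free++;
    return rat[arch_reg];
}

/* Reading consults the RAT for the latest mapping. */
int rename_src(int arch_reg)
{
    return rat[arch_reg];
}
```

Because writes never reuse the previous physical register, false dependencies between unrelated uses of the same architectural register disappear, which is what lets the out-of-order engine reorder freely.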
Instruction Scheduling
• The schedulers are at the heart of the out-of-order execution engine
• There are five schedulers, with queues of size 8-12
• Collectively, they can dispatch up to six uops each clock cycle
• The schedulers are oblivious to threads when getting and dispatching uops:
• They ignore the owner of the uops, considering only dependent inputs and the availability of execution resources
• They can get uops from different logical processors at the same time
• To provide fairness and prevent deadlock, there is a limit on the number of active entries each logical processor may have in each scheduler's queue
Execution Units & Retirement
• The execution units are oblivious to threads when getting and executing uops
• Since source and destination registers were renamed earlier, it is enough to access the physical registers during/after execution
• After execution, the uops are placed in the re-order buffer, which decouples the execution stage from the retirement stage
• The re-order buffer is partitioned between the logical processors
• Uop retirement commits the architecture state in program order
• Once stores have retired, the store data is written into the L1 data cache immediately
Memory Subsystem
• Totally oblivious to logical processors
• The schedulers can send load or store uops without regard to logical processors, and the memory subsystem handles them as they come
• DTLB:
• Translates addresses to physical addresses
• 64 fully associative entries; each entry can map either a 4K or a 4MB page
• Shared between logical processors (entries tagged with ID)
• L1, L2 and L3 caches:
• The L1 data cache is virtually addressed and physically tagged
• Cache conflicts might degrade performance (competing for space in the shared cache)
• Sharing the same data might increase performance (more memory hits), which is common in server application code
System Modes (I)
• Two modes of operation:
• Single-task (ST): when there is one software thread to execute
• Multi-task (MT): when there is more than one software thread to execute
• ST0 or ST1, where the number shows the active logical processor
• The HALT command transitions to ST-mode, where resources are recombined after the call
• The reason is better utilization of resources
System Modes (II)
• HALT transitions the processor from MT-mode to ST0- or ST1-mode
• In ST0- or ST1-mode, an interrupt sent to the HALTed processor causes a transition back to MT-mode
Operating System and Applications
• Hyper-Threading appears to the operating system and application software as twice the number of processors
• OS optimizations:
1. Use the HALT instruction if one logical processor is active and the other is not
• Not using HALT leaves the idle logical processor spinning in an idle loop, wasting execution resources!
2. Schedule software threads to logical processors sensibly
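Before applying either optimization, system software first has to know whether the package supports Hyper-Threading at all. On x86 this is reported by the CPUID instruction: leaf 1 sets the HTT flag in bit 28 of EDX. The sketch below uses GCC's `<cpuid.h>` helper; the function name is our own.

```c
#include <cpuid.h>

/* Query CPUID leaf 1 for the HTT feature flag (EDX bit 28), which
 * indicates that the physical package supports Hyper-Threading.
 * Requires an x86 target and GCC/Clang's <cpuid.h>. */
int cpu_supports_htt(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 1 not available */
    return (edx >> 28) & 1;       /* HTT flag */
}
```

Note that the flag only says the package is HT-capable; enumerating how many logical processors are actually enabled requires further CPUID leaves and OS cooperation.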
Performance
• 21% performance increase
• 28% performance increase
OS Support for HT
• Native HT support:
• Windows XP Professional Edition
• Windows XP Home Edition
• Linux v2.4.x (and higher)
• Compatible with HT:
• Windows 2000 (all versions)
• Windows NT 4.0 (limited driver support)
• No HT support:
• Windows ME
• Windows 98 (and previous versions)
Conclusion
• Measured performance (Xeon) showed gains of up to 30% on common server applications.
• HT is expected to be viable and a market standard from mobile to server processors.

Editor's Notes

1. Since the logical processors share the vast majority of microarchitecture resources and only a few small structures were replicated, the die-area cost of the first implementation was less than 5% of the total die area. A logical processor may be temporarily stalled for a variety of reasons, including servicing cache misses, handling branch mispredictions, or waiting for the results of previous instructions.
2. When a complex instruction is encountered, the TC sends a microcode-instruction pointer to the Microcode ROM. The Microcode ROM controller then fetches the uops needed and returns control to the TC. Two microcode instruction pointers are used to control the flows independently if both logical processors are executing complex IA-32 instructions.
3. Only the data necessary at a given moment is kept in actual physical memory, with the rest remaining on the hard disk. This means that for each memory access a virtual address has to be translated into a physical address, and an enormous table is in charge of keeping track of the correspondences. The problem is that this table gets so large that it can't be stored on-chip; it is placed in main memory, and can even be paged (part of the table can be absent from memory and itself kept on the hard disk).
4. The branch prediction structures are either duplicated or shared. The return stack buffer, which predicts the target of return instructions, is duplicated because it is a very small structure and the call/return pairs are better predicted for software threads independently.
5. The allocator logic takes uops from the uop queue and allocates many of the key machine buffers needed to execute each uop. Some of these key buffers are partitioned such that each logical processor can use at most half the entries.
6. The register rename logic renames the architectural IA-32 registers onto the machine's physical registers. This allows the 8 general-use IA-32 integer registers to be dynamically expanded to use the 128 available physical registers. The renaming logic uses a Register Alias Table (RAT) to track the latest version of each architectural register so the next instruction(s) know where to get their input operands.
7. The schedulers determine when uops are ready to execute based on the readiness of their dependent input register operands and the availability of the execution unit resources.
8. The DTLB translates addresses to physical addresses. Each logical processor also has a reservation register to ensure fairness and forward progress in processing DTLB misses.
9. An operating system that does not use the HALT optimization would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do.
10. OLTP = online transaction processing