Mais conteúdo relacionado
Semelhante a 01 intel processor architecture core (20)
01 intel processor architecture core
- 2. Intel® Software College
Objectives
After completion of this module you will be able to describe
• Components of an IA processor
• Working flow of the instruction pipeline
• Notable features of the architecture
Intel® Processor Micro-architecture - Core® microarchitecture
2
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 3. Intel® Software College
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Coding considerations
Intel® Processor Micro-architecture - Core® microarchitecture
3
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 4. Intel® Software College
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Coding considerations
Intel® Processor Micro-architecture - Core® microarchitecture
4
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 5. Industrial Recognition Intel® Software College
PC Format May 2006
“Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon
64s into burger meat is the game..“
Intel's Next Generation Microarchitecture Unveiled
Real World Tech
“Just as important as the technical innovations in Core MPUs, this
microarchitecture will have a profound impact on the industry. “
Intel Dishes the Knockout Punch to AMD with Conroe, GD Hardware.com
“…the results were far more than we could hope for and it'll be
amusing to see AMD's response to this beat-down session
Intel Regains Performance Crown, Anandtech
“… At 2.8 or 3.0GHz, a Conroe EE would offer even stronger performance
than what we’ve seen here.”
Intel Reveals Conroe Architecture, Extremetech
“… And not only was the Intel system running at 2.66GHz— a slower
clock rate than the top Pentium 4—it was outpacing an overclocked
Athlon 64 FX-60. Wrap your brain around that idea for a bit…”
Conroe Benchmarks - Intel Showing Big Strength Hot Hardware.com
Intel® Processor Micro-architecture - Core® microarchitecture
“… Intel is poised to change the face of the desktop computing landscape…”
5
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 6. Intel® Software College
Performance Summary
Intel® Core™ Microarchitecture dramatically boosts Intel
platform performance
• Conroe & Woodcrest drive clear Desktop/Server performance
leadership
• Merom extends Intel Mobile performance leadership
Intel® Core™ Microarchitecture-based platforms set the
bar in Performance and Energy Efficiency for the Multi-
Core era
• Intel’s 3rd generation dual-core (while competition stuck on 1st
generation)
• New Intel high-performance ‘engine’: Wider, Smarter, Faster, More
Efficient
Best Processor on the Planet: Energy-Efficient Performance 1
Energy-
The “Core™ Effect”: Intel® Core™ Microarchitecture
20% (Merom), broad roadmap accelerationsPerformance Boosts1 !
ramp fuels 40% (Conroe), 80% (Woodcrest)
Intel® Processor Micro-architecture - Core® microarchitecture
6 1 Based on SPECint*_rate_base2000
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 7. Intel® Software College
Agenda
Introduction
Knowledge preparation
• Architecture VS Microarchitecture
• CISC VS RISC
• Performance Measurements
• Pipeline Design
• Power and Energy
• Chip Multi-Processing
Notable features
Micro-architecture tour
Coding considerations
Intel® Processor Micro-architecture - Core® microarchitecture
7
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 8. Intel® Software College
Architecture and Micro-architecture
What is Computer Architecture?
• Architecture is the set of features which are externally visible:
• Instruction set
• Registers
• Addressing modes
• Bus protocols
Intel Architectures (IA)
• IA32/X86 (8-bit, 16-bit and 32-bit Integer architecture)
• X87 (Floating Point extension)
• MMX (Multi-Media extension)
• SSE, SSE2, SSE3 (SIMD Streaming Extension)
• Intel® 64/EM64T (64-bit Integer extension of IA32) ? Go to detail!
• IA64 (Intel new 64-bit architecture)
• Itanium/Itainium2 processor family
Intel® Processor Micro-architecture - Core® microarchitecture
8
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 9. Intel® Software College
Architecture and Micro-architecture (cont.)
What is Micro-architecture?
• Same as m–Architecture or u-Architecture
• “Invisible” features that provide meaningful value to the end
user (whatever makes you buy a new compatible PC)
• Programs run faster Improved Performance
• Reduced Power consumption Extended Battery life
• H/W fits into Smaller Form Factor
Intel® Processor Micro-architecture - Core® microarchitecture
9
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 10. Intel® Software College
Intel® Architecture History
* IXA – Intel Internet Exchange Architecture/ EPIC – Explicitly Parallel Instruction Computing
Examples:
Architecture:
Instruction set definition EPIC* (Itanium®) IA-32 IXA* (XScale)
and compatibility
Microarchitecture:
Hardware implementation Examples:
maintaining instruction set
compatibility with high-level P5 P6 Intel NetBurst® Banias
architecture
Processors:
Productized
implementation of
Microarchitecture Examples:
Pentium® 4
Pentium® Pro
Pentium® Pentium® D Pentium® M
Pentium® II/III
Xeon®
Intel® Processor Micro-architecture - Core® microarchitecture
10
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 11. Intel® Software College
Intel® Core™ Microarchitecture Processors
Intel® NetBurst®
+ New Innovations
Mobile
Microarchitecture
Intel® Core™ 2 Duo/Quad/Extreme processors
Intel® Processor Micro-architecture - Core® microarchitecture
11
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 12. Intel® Software College
RISC Approach to CPU design
(RISC = Reduced Instruction Set Computers)
Optimize H/W for common basic operations
• Fixed instruction length
• Shorter Execution Pipeline
• Ease of Instruction Level Parallelism
• Large number of registers
• Less memory accesses
• ‘Load/Store’ architecture
• Shorter Execution Pipeline
• Ease of advancing Loads
• Branch Hints
• Reduce pipeline flush events
• ‘Exotic’ stuff to be implemented in S/W with minimal H/W support
• No ‘complex’ H/W instructions
• Handle exceptional conditions in S/W
Examples: MIPS, IBM Power and PowerPC, Sun Sparc
Achieve Maximum performance by
right partitioning between H/W and S/W Intel® Processor Micro-architecture - Core® microarchitecture
12
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 13. Intel® Software College
CISC Approach to CPU design
(CISC = Complex Instruction Set Computers)
Rich architecture
• Variable length instructions.
• Complex addressing modes.
On-chip HW / SW partitioning required
• H/W keeps executing ‘simple’ stuff
• Complex instructions are ‘emulated’ using u-code routines
from ROM
• More instructions treated as ‘simple’ as more H/W is available
COMPATIBILITY has some major advantages:
• Large (and forever increasing) software base
• Code development tools
• Expertise
• H/W - S/W spiral
Example: Intel IA32, Motorola 680X0
Maximize information passed to the HW
Intel® Processor Micro-architecture - Core® microarchitecture
13
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 14. Intel® Software College
Performance Measurement
Performance is the reciprocal of the “Time of execution”:
1 1
Performance ≈ =
Were: Time _ of _ Execution L * CPI * TC
L = Code Length (# of machine instructions)
CPI = Clock cycles Per Instruction
Tc = Clock period (nSecs)
Substitute:
IPC = Instructions Per Cycle = 1/CPI
F = Frequency = 1/Tc
Improve ILP Improve Timing
IPC * F
Performance ≈
L
Arch Enhancements
Intel® Processor Micro-architecture - Core® microarchitecture
14
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 15. Intel® Software College
Performance Measurement (cont.)
Benchmarks examples
Performance considerations: • Industry Standard
• Which Code/Application to run? • Spec (ISPEC, FSPEC)
• Which OS? • TPC
• Commercial
• Which other components in the • SysMark
platform? • MobileMark
• Under which thermal conditions? • PCMark
• Multithreading? Multiprocessing? • Sandra
• ScienceMark
• Applications
• Video (Windows Media encoder, DivX)
• Audio (Lame MP3)
• Compression (RAR)
• Content creation (3DSM, Photoshop, Premiere)
• Latest Games (Doom III, FarCry, but changes
fast)
• Specific industries use specific benchmarks
• Linux compilation, POVRay, LinPack, lmbench
Intel® Processor Micro-architecture - Core® microarchitecture
15
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 16. Intel® Software College
Design Considerations for Different
Market Segments
Constrains:
• Thermally, area constrained Desktop
• Unconstrained Extreme
• Very area constrained Value
• Thermally, Energy and Area constrained Mobile
• Thermally, Energy Servers
Micro-architecture is the Art of Tradeoffs between:
• Schedule
• Requirements / Standards
• Performance
• Features
• Power / Energy
• Area / Cost
Intel® Processor Micro-architecture - Core® microarchitecture
16
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 17. Intel® Software College
Design Metrics
IPC = Instructions per Cycle
• The more the better
Latency – same as Response Time
• The time interval between
• when any request for data is made and
• when the data transfer completes
• The less the better
Throughput
• The amount of work completed by the system per unit of time.
• The more the better
• ops/sec
Intel® Processor Micro-architecture - Core® microarchitecture
17
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 18. Intel® Software College
CPU Pipeline
Break the work to smaller pieces
• Four basic stages of instruction life
• Fetch - bring instruction to core
• Decode - read operands from register
• Execute - perform the operation
• Writeback - save result to register
• Execution timing of simple instructions
(legend: “op src1,src2 dst”)
add eax, ebx eax F D E W
sub ecx, edx ecx F D E W
Increased throughput
• increased number of completed instructions per cycle
Intel® Processor Micro-architecture - Core® microarchitecture
18
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 19. Intel® Software College
Pipeline Design - Explore Parallelism
New instruction not always depends on previous one
• Can start new instruction before previous one is finished
• ...if different stages use different H/W resources
Run instructions in parallel (pipeline)
Add eax, ebx eax F D E W
Sub ecx, edx ecx F D E W
Or edi, esi edi F D E W
Need to balance pipe stages
• Each stage should take same time for best throughput and utilization
Clock cycle is determined
by the longest path!
Fetch Decode Exec WB
Fetch Decode Exec WB
Fetch Decode Exec WB
Fetch Decode Exec WB
Intel® Processor Micro-architecture - Core® microarchitecture
19
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 20. Intel® Software College
Pipeline Design – Fighting Stalls
Data flow dependency (instructions output/input)
• Solved by bypasses, renaming etc
Control flow dependencies
• Solved by branch prediction
Others (Cache misses, long latency instructions)
• Solved by other dynamic scheduling techniques
? Go to detail!
Intel® Processor Micro-architecture - Core® microarchitecture
20
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 21. Intel® Software College
Race of CISC vs. RISC
In modern CPUs Advanced µ-Architecture Techniques minimize the
advantages of RISC over CISC
• Branch Prediction
• Reduces the effect of extra pipeline stages
• Register Renaming
• Effectively Increase the Number of Registers
• Out Of Order
• Reduce Number of stalls caused by shortage of registers
• Speculative Execution
• Further Reduce Number of stalls
• Power saving features
• Reduce the overhead when not needed.
Intel® Processor Micro-architecture - Core® microarchitecture
21
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 22. Intel® Software College
op – Intel’s Take of the CICS/RISC Race
(CISC) Instructions are translated into one or more (RISC)
uop(micro-operation)s
• Fixed format
• Wide and simple
• Temp registers
Usually one uop per instruction
Complex instruction can be thousands of uops
Stores divided into two uops (STA and STD)
Fusion play games here
Intel® Processor Micro-architecture - Core® microarchitecture
22
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 23. Intel® Software College
Power and Energy
Maximum power (TDP):
• Cooling requirements
• Cooling solution
• Computer form factor and acoustic noise
Average power
• Battery life
• Electricity bill
General calculation:
• P = frequency * voltage^2 * activity factor * capacitance + leakage
Reducing TDP
• Less transistors and wires
• Smaller transistors and wires
• Power features less activity
• Low leakage transistors
Reducing average power
• Energy efficiency
• Power states
• Lower leakage
Intel® Processor Micro-architecture - Core® microarchitecture
23
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 24. Intel® Software College
Dual/Multi Core and SMT
Put more than one core per package
Architectural change:
• Software must be multi-threaded or multi-process
• …but backward compatible with multiprocessor systems (MP)
Several ways of implementing it
• All of them being used
I/O I/O
I/O I/O
LLC
LLC LLC LLC LLC
Core Core Core Core Core Core
SMT: Run two (or more) threads on the same core, simultaneously
Intel® Processor Micro-architecture - Core® microarchitecture
24
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 25. Intel® Software College
Intel Approach
?
Intel®
Intel®
XQ6700*
Intel®
Intel®
Core 2 Duo®
Duo®
Intel®
Intel®
Pentium® D
Pentium®
Processor 80 Threads
Intel®
Intel®
Pentium®
Pentium®
With HT
Intel®
Intel® 4 Threads
Pentium®
Pentium®
2 Threads
State
2 Threads Execution Units
Cache
Bus
2 Threads
1 Threads
Q4 2000 Q2 2003 Q2 2005 Q3 2006 Q4 2006
While single core performance has increased due to clock speed,
While single core performance has increased due to clock speed,
increased cache and improved ILP the biggest performance increases
increased cache and improved ILP the biggest performance increases
have come from the thread level parallelism.
have come from the thread level parallelism.
Intel® Processor Micro-architecture - Core® microarchitecture
25
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 26. Intel® Software College
A “Acronym Cheat Sheet” of Parallel
Computing
CMP: Chip Multi Processor (two or more cores per package)
• Dual Core: two cores in same package
• Quad Core: four cores in same package
DP: Dual Processor (two packages)
MP: Multi Processor (four or more packages)
SMT: Symmetric Multi Threading (virtual multi core: HyperThreading)
Intel® Processor Micro-architecture - Core® microarchitecture
26
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 27. Intel® Software College
Agenda
Introduction
Knowledge preparation
Notable features
• Wide Dynamic Execution
• Smart Memory Access
• Advanced Smart Cache
• Advanced Digital Media Boost
• Intelligent Power Capability
Micro-architecture tour
Coding considerations
Intel® Processor Micro-architecture - Core® microarchitecture
27
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 28. Intel® Software College
Intel® Core® Micro-architecture Notable
Features Instruction Fetch
Intel® Wide Dynamic Execution and PreDecode
• 14-stage efficient pipeline
Instruction Queue 2M/4M
• Wider execution path 5 shared L2
• Advanced branch prediction uCode
ROM
Decode Cache
• Macro-fusion 4
• Roughly ~15% of all instructions are
conditional branches up to
• Macro-fusion fuses a comparison Rename/Alloc
and jump to reduce micro-ops
10.4 Gb/s
running down the pipeline FSB
• Micro-fusion Retirement Unit
4
• Merges the load and operation (ReOrder Buffer)
micro-ops into one macro-op
• 64-Bit Support Schedulers
ALU ALU ALU
• Merom, Conroe, and Woodcrest Branch FAdd FMul
support EM64T MMX/SSE MMX/SSE MMX/SSE Load Store
FPmove FPmove FPmove
L1 D-Cache and D-TLB
Intel® Processor Micro-architecture - Core® microarchitecture
28
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 29. Intel® Software College
Intel® Core® Micro-architecture Notable
Features (cont.)
Intel® Advanced Memory Access
• Improved prefetching
• Memory disambiguation
• Advance load before a possible data dependency (pointer conflict)
• Earlier loads hide memory latencies
Intel® Processor Micro-architecture - Core® microarchitecture
29
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 30. Intel® Software College
Intel® Core® Micro-architecture Notable
Features (cont.)
Intel® Advanced Smart Cache
• Multi-core optimization
• Shared between the two cores
• Advanced Transfer Cache architecture
• Reduced bus traffic
• Both cores have full access to the entire cache
• Dynamic Cache sizing
Intel® Processor Micro-architecture - Core® microarchitecture
30
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 31. Intel® Software College
Intel® Core® Micro-architecture Notable
Features (cont.)
Advantages of Shared Cache
Memory
Front Side Bus (FSB)
Shipping L2 Cache Line
~Half access to memory
Cache Line
CPU1 CPU2
Intel® Processor Micro-architecture - Core® microarchitecture
31
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 32. Intel® Software College
Intel® Core® Micro-architecture Notable
Features (cont.)
Advantages of Shared Cache (cont.)
Memory
Front Side Bus (FSB)
L2 is shared:
No need to ship cache
line
Cache Line
CPU1 CPU2
Intel® Processor Micro-architecture - Core® microarchitecture
32
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 33. Intel® Software College
Intel® Core® Micro-architecture Notable
Features (cont.)
Intel® Advanced Digital Media Boost SIMD Operation
(SSE/SSE2/SSE3/SSSE)
• Single Cycle SIMD Operation
SOURCE 127 0
• 8 Single Precision Flops/cycle X4 X3 X2 X1
• 4 Double Precision Flops/cycle SSE/2/3 OP
• Wide Operations Y4 Y3 Y2 Y1
• 128-bit packed Add DEST
• 128-bit packed Multiply
Core™ µarch
• 128-bit packed Load
CLOCK
X4opY4 X3opY3 X2opY2 X1opY1
• 128-bit packed Store CYCLE 1
• Support for Intel® EM64T Previous CLOCK
X2opY2 X1opY1
CYCLE 1
instructions
CLOCK X4opY4 X3opY3
CYCLE 2
Intel® Processor Micro-architecture - Core® microarchitecture
33
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 34. Intel® Software College
Intel® Core® Micro-architecture Notable
Features
Intel® Advanced Digital Media Boost
• Additional Media Instructions - Supplemental Streaming SIMD
Extensions 3 (SSSE3)
• 16 new packed integer instructions
• Targeting video encode/decode
• Significantly improved strings
• REP MOVS and REP STOS
• ~8 bytes / cycle throughput
• mileage may vary
Intel® Processor Micro-architecture - Core® microarchitecture
34
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 35. Intel® Software College
Intel® Core® Micro-architecture Notable
Features
Intel® Advanced Digital Media Boost
• Supplemental SSE-3 (SSSE-3)
Horizontal Addition/Subtraction
PHADDW, PHADDSW, PHADDD,
PHSUBW, PHSUBSW, PHSUBD
Packed Absolute Values
PABSB, PABSW, PABSD
Multiply and Add Packed
Signed/Unsigned bytes
PMADDUBSW
Packed multiply High with
Round and Scale PMULHRSW
Packed Shuffle Bytes
PSHUFB
Packed SIGN PSIGNB/W/D
Packed Align Right
PALIGNR
Intel® Processor Micro-architecture - Core® microarchitecture
35
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 36. Intel® Software College
Intel® Core® Micro-architecture Notable
Features (cont.)
Intelligent Power Capability
• Advanced power gating & Dynamic power coordination
• Multi-point demand-based switching
• Voltage-Frequency switching separation
• Supports transitions to deeper sleep modes
• Event blocking
• Clock partitioning and recovery
• Dynamic Bus Parking
• During periods of high performance execution, many parts of the
chip core can be shut off
Intel® Processor Micro-architecture - Core® microarchitecture
36
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 37. Intel® Software College
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations
Intel® Processor Micro-architecture - Core® microarchitecture
37
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 38. Intel® Software College
Intel® Core® Micro-architecture Drill-down
page miss handler store
icache
branch address integer
prediction
predecode unit
data memory FP
load SIMD
cache order
instruction unit buffer store
(3x)
queue data
instruction register Reservation
decode alias table Station
MS ALLOC Re-Order Buffer
Intel® Processor Micro-architecture - Core® microarchitecture
38
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 39. Intel® Software College
Agenda
Introduction
Knowledge refreshment
Notable features
Micro-architecture tour
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations
Intel® Processor Micro-architecture - Core® microarchitecture
39
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 40. Intel® Software College
Core® Micro-architecture Front End
Instruction preparation before executed icache
branch
• Instruction Fetch Unit prediction
predecode unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit instruction
queue
instruction
decode
MS
Intel® Processor Micro-architecture - Core® microarchitecture
40
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 41. Intel® Software College
Intel® Core™ Microarchitecture – Front End
Instruction Queue
Buffer between instruction pre-decode unit and decoder
• up to six predecoded instructions written per cycle
• 18 Instructions contained in IQ
• up to 5 Instructions read from IQ
Potential Loop cache
Loop Stream Detector (LSD) support
• Re-use of decoded instruction
• Potential power saving
Intel® Processor Micro-architecture - Core® microarchitecture
41
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 42. Intel® Software College
Intel® Core™ Microarchitecture – Front End
Macro - Fusion
Scheduler
Roughly ~15% of all instructions are
cmpjae eax, [mem], label
conditional branches.
Macro-fusion merges two instructions
into a single micro-op, as if the two
instructions were a single long
instruction. Execution
Enhanced Arithmetic Logic Unit (ALU)
for macro-fusion. Each macro-fused
instruction executes with a single
dispatch. Branch
Eval
Not supported in EM64T long mode
flags and target to Write back
Intel® Processor Micro-architecture - Core® microarchitecture
42
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 43. Intel® Software College
Intel® Core™ Microarchitecture – Front End
Macro-Fusion Absent Instruction Queue
addps xmm0, [EAX+16]
Read four instructions from
mulps xmm0, xmm0
Instruction Queue
Each instruction gets decoded movps [EAX+240], xmm0
into separate uops
cmp eax, 100000
Enabling Example
jge label
for (int i=0; i<100000; i++) {
… addps xmm0, [EAX+16] dec0
Cycle 1
} mulps xmm0, xmm0 dec1
movps [EAX+240], xmm0 dec2
cmp eax, 100000 dec3
Cycle 2 jge label dec0
Intel® Processor Micro-architecture - Core® microarchitecture
43
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 44. Intel® Software College
Intel® Core™ Microarchitecture – Front End
Macro-Fusion Presented Instruction Queue
addps xmm0, [EAX+16]
Read five Instructions from
Instruction Queue mulps xmm0, xmm0
Send fusable pair to single movps [EAX+240], xmm0
decoder
cmp eax, 100000
Single uop represents two
instructions jae label
Enabling Example
for (unsigned int i=0; Cycle 1 addps xmm0, [EAX+16] dec0
i<100000; i++) {
mulps xmm0, xmm0 dec1
… movps [EAX+240], xmm0 dec2
} cmpjae eax, 100000, label dec3
Intel® Processor Micro-architecture - Core® microarchitecture
44
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 45. Intel® Software College
Intel® Core™ Microarchitecture – Front End
Instruction Decode / Micro-Op Fusion
Frequent pairs of micro-operations derived from the same
Macro Instruction can be fused into a single micro-operation
Micro-op fusion effectively widens the pipeline
Intel® Processor Micro-architecture - Core® microarchitecture
45
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 46. Intel® Software College
Intel® Core™ Microarchitecture – Front End
Instruction Decode / Micro-Fusion (cont.)
u-ops of a Store “movps [EAX+240], xmm0”
sta eax+240
st xmm0, [eax+240]
std xmm0, [eax+240]
Intel® Processor Micro-architecture - Core® microarchitecture
46
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 47. Intel® Software College
Intel® Core™ Microarchitecture – Front End
Branch Prediction Improvements
Intel® Pentium® 4 Processor branch prediction
PLUS the following two improvements:
Indirect Branch Predictor Loop Detector
Branch miss-predictions reduced by >20%
Intel® Processor Micro-architecture - Core® microarchitecture
47
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 48. Intel® Software College
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations
Intel® Processor Micro-architecture - Core® microarchitecture
48
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 49. Intel® Software College
Core® Micro-architecture Execution Core
store
Accepted decoded u-ops, assign resources, address integer
execute and retire u-ops FP
load
• Renamer SIMD
store
data
(3x)
• Reservation station (RS)
register Reservation
• Issue ports
alias table Station
• Execution Unit ALLOC Re-Order Buffer
Intel® Processor Micro-architecture - Core® microarchitecture
49
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 50. Intel® Software College
Intel® Core™ Microarchitecture – Execution Core
Execution Core Building Blocks
Renamer Ports (number)
RS
0,1,5 0,1,5
SIMD/Integer 0,1,5
SIMD Floating
MUL Integer
ROB Integer Point
Execution Unit
2 Load
3,4 Store
Memory Sub-system
Intel® Processor Micro-architecture - Core® microarchitecture
50
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 51. Intel® Software College
Intel® Core™ Microarchitecture – Execution Core
Issue Ports and Execution Units
6 dispatch ports from RS
• 3 execution ports
• (shared for integer / fp / simd)
• load
• store (address)
• store (data)
128-bit SSE implementation
• Port 0 has packed multiply (4 cycles SP 5 DP pipelined)
• Port 1 has packed add (3 cycles all precisions)
Intel® Processor Micro-architecture - Core® microarchitecture
51
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 52. Intel® Software College
Intel® Core™ Microarchitecture – Execution Core
Retirement Unit
ReOrder Buffer (ROB)
• Holds micro-ops in various stages of completion
• Buffers completed micro-ops
• updates the architectural state in order
• manages ordering of exceptions
register Reservation
alias table Station
ALLOC Re-Order Buffer
Intel® Processor Micro-architecture - Core® microarchitecture
52
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 53. Intel® Software College
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations
Intel® Processor Micro-architecture - Core® microarchitecture
53
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 54. Intel® Software College
Core® Micro-architecture Memory Sub-
System
Memory Ordering Buffer
• Store Address Buffer
• Stores the address of each store not actually performed
• Loads compare address to any store older than itself
• If it find a hole…
• Store Data Buffer
• Stores data of each store not actually performed
• If load hit on the SAB, it forward the data from here
• Load Buffer
• Stores address of non-retired loads
• For snoops and re-dispatch
• One 128-bit load and one 128-bit store per cycle to different
memory locations
• Out of order Memory operations
Intel® Processor Micro-architecture - Core® microarchitecture
54
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 55. Intel® Software College
Intel® Core™ Microarchitecture – Memory Sub-system
Core® Micro-architecture Memory Sub-
System (cont.)
32k D-Cache (8-way, 64 byte line size)
Shared second level (L2) 2MB 8-way or 4MB 16-way instruction and data cache
Cache to cache transfer
• improves producer / consumer style MP
Wider interface to L2
• reduced interference
• processor line fill is 2 cycles
Core1 Core2
Higher bandwidth from the L2 cache to the core
• ~14 clock latency and 2 clock throughput
Load & Store Access order
Bus
1. L1 cache of immediate core
2. L1 cache of the other core 2 MB L2 Cache
3. L2 cache
4. Memory
Intel® Processor Micro-architecture - Core® microarchitecture
55
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 56. Intel® Software College
Intel® Core™ Microarchitecture – Memory Sub-system
Advanced Memory Access / Enhanced Data
Pre-fetch Logic
Speculates the next needed data and loads it into cache by HW
and/or SW
Door Valet Parking Area Main Parking Lot
(L1 Cache) (L2 Cache) (External Memory)
Intel® Processor Micro-architecture - Core® microarchitecture
56
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 57. Intel® Software College
Intel® Core™ Microarchitecture – Memory Sub-system
Advanced Memory Access / Enhanced Data
Pre-fetch Logic (cont.)
• L1D cache prefetching
• Data Cache Unit Prefetcher
• Known as the streaming prefetcher
• Recognizes ascending access patterns in recently loaded data
• Prefetches the next line into the processors cache
• Instruction Based Stride Prefetcher
• Prefetches based upon a load having a regular stride
• Can prefetch forward or backward 2 Kbytes
• 1/2 default page size
• L2 cache prefetching: Data Prefetch Logic (DPL)
• Prefetches data to the 2nd level cache before the DCU requests
the data
• Maintains 2 tables for tracking loads
• Upstream – 16 entries
• Downstream – 4 entries
• Every load is either found in the DPL or generates a new entry
• Upon recognition of the 2nd load of a “stream” the DPL will
prefetch the next load
Intel® Processor Micro-architecture - Core® microarchitecture
57
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 58. Intel® Software College
Intel® Core™ Microarchitecture – Memory Sub-system
Advanced Memory Access / Memory
Disambiguation
Memory Disambiguation predictor
• Loads that are predicted NOT to forward from preceding store
are allowed to schedule as early as possible
• increasing the performance of OOO memory pipelines
Disambiguated loads checked at retirement
• Extension to existing coherency mechanism
• Invisible to software and system
Intel® Processor Micro-architecture - Core® microarchitecture
58
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 59. Intel® Software College
Intel® Core™ Microarchitecture – Memory Sub-system
Advanced Memory Access / Memory
Disambiguation Absent
Load4 must WAIT until previous stores complete
Memory
Data W
Store1 Y
Load2 Y
Data Z
Store3 W
Load4 X
Data Y
Data X
Intel® Processor Micro-architecture - Core® microarchitecture
59
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 60. Intel® Software College
Intel® Core™ Microarchitecture – Memory Sub-system
Advanced Memory Access / Memory
Disambiguation Presented
Loads can decouple from stores
Load4 can get its data WITHOUT waiting for stores
Memory
Data W
Load4 X
Store1 Y
Load2 Y Data Z
Store3 W
Data Y
Data X
Intel® Processor Micro-architecture - Core® microarchitecture
60
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 61. Intel® Software College
Intel® Core™ Microarchitecture – Memory Sub-system
Advanced Memory Access / Stores
Forwarding
If a load follows a store and reloads the data that the store
writes to memory, the micro-architecture can forward the data
directly from the store to the load
Memory
Store1 Y
Internal
Load2 Y Buffers
Data Y
Intel® Processor Micro-architecture - Core® microarchitecture
61
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 62. Intel® Software College
Advanced Memory Access / Stores
Forwarding: Aligned Store Cases
store 16 store 32 bit store 64 bit
load 16 load 32 bit load 64 bit
ld 8 ld 8 load 16 load 16 load 32 bit load 32 bit
ld 8 ld 8 ld 8 ld 8 load 16 load 16 load 16 load 16
ld 8 ld 8 ld 8 ld 8 ld 8 ld 8 ld 8 ld 8
store 128 bit
load 128 bit
load 64 bit load 64 bit
load 32 bit load 32 bit load 32 bit load 32 bit
load 16 load 16 load 16 load 16 load 16 load 16 load 16 load 16
ld 8 ld 8 ld 8 ld 8 ld 8 Intel® Processorld 8 ld 8 ld -8 ld 8 ld 8 ld 8 ld 8 ld 8 ld 8
ld 8 ld 8 Micro-architecture Core® microarchitecture
62
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 63. Intel® Software College
Advanced Memory Access / Stores
Forwarding: Unaligned Cases
Note that unaligned store forward does not occur when the load
crosses a cache line boundary
store 16 store 32 bit store 64 bit
load 16‡ load 32 bit‡ load 64 bit
ld 8 ld 8 load 16‡ load 16 load 32 bit‡ load 32 bit
ld 8 ld 8 ld 8 ld 8 load 16‡ load 16 load 16 load 16
ld 8 ld 8 ld 8 ld 8 ld 8 ld 8 ld 8 ld 8
ld 8 Store forwarded to load
Note: Unaligned 128-bit stores
ld 8 No forwarding are issued as two 64-bit stores.
‡:
This provides two alignments for
No forwarding if the load store forwarding
crosses a cache line boundary
Intel® Processor Micro-architecture - Core® microarchitecture
63
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 64. Intel® Software College
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Coding considerations
Intel® Processor Micro-architecture - Core® microarchitecture
64
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 65. Intel® Software College
Optimizing for
Instruction Fetch and PreDecode
Avoid “Length Changing Prefixes” (LCPs)
• Affects instructions with immediate data or offset
• Operand Size Override (66H)
• Address Size Override (67H) [obsolete]
• LCPs change the length decoding algorithm – increasing the
processing time from one cycle to six cycles (or eleven cycles
when the instruction spans a 16-byte boundary)
• The REX (EM64T) prefix (4xH) is not an LCP
• The REX prefix does lengthen the instruction by one byte, so use
of the first eight general registers in EM64T is preferred
Intel® Processor Micro-architecture - Core® microarchitecture
65
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 66. Intel® Software College
Optimizing for
Instruction Queue
Includes a “Loop Stream Detector” (LSD)
• Potentially very high bandwidth instruction streaming
• A number of requirements to make use of the LSD
• Maximum of 18 instructions in up to four 16-byte packets
• No RET instructions (hence, little practical use for CALLs)
• Up to four taken branches allowed
• Most effective at 70+ iterations
• LSD is after PreDecode so there is no added cost for LCPs
• Trade-off LSD with conventional loop unrolling
Intel® Processor Micro-architecture - Core® microarchitecture
66
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
- 67. Intel® Software College
Optimizing for
Decode
Decoder issues up to 4 uOps for renaming/ allocation per clock
• This creates a trade off between more complex instruction
uOps versus multiple simple instruction uOps
• For example, a single four uOp instruction is all that can be
renamed/allocated in a single clock
• In some cases, multiple simple instructions may be a better
choice than a single complex instruction
• Single uOp instructions allow more decoder flexibility
• For example, 4-1-1-1 can be decoded in one clock
• However, 2-2-2-1 takes three clocks to decode
Intel® Processor Micro-architecture - Core® microarchitecture
67
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.