4. Moore’s Law – GHz to Multi-Core
“Concurrency is the next major revolution in how we write software”
- Herb Sutter, Dr Dobb’s Journal, March 2005
Intel MC Assistance
• Threading
• Multi-tasking
• Training
• Tools
[Figure: performance curve with a transition at 2006 from "Performance Through frequency" to "Performance Through Multi-Core Performance"]
4 Document# 408075
Intel Confidential
5. Multi-core is Mainstream
Is Your Software Ready?
Multiple execution cores ramping
across Intel platforms
7. Simultaneous Multi-Threading (SMT)
• SMT
– Run 2 threads at the same time per core
• Take advantage of 4-wide execution engine
– Keep it fed with multiple threads
– Hide latency of a single thread
• Most power efficient performance feature
– Very low die area cost
– Can provide significant performance benefit depending on application
– Much more efficient than adding an entire core
• Nehalem/Westmere advantages
– Larger caches
– Massive memory BW
[Figure: execution-unit occupancy over time (proc. cycles), w/o SMT vs. SMT. Note: Each box represents a processor execution unit]
Simultaneous multi-threading enhances performance and energy efficiency
8. Enhanced Cache Subsystem
• 3-level cache hierarchy
– First Level Cache (FLC)
  – 32 KB Instruction & 32 KB Data per core
  – Equivalent to L1 Cache in Intel® Core™ microarchitecture
– Mid Level Cache (MLC)
  – 256 KB per core
– Last Level Cache
  – Up to 4MB shared across both cores
  – Inclusive cache policy – minimize snoop traffic
  – Equivalent to L2 Cache in Intel® Core™2 Duo microarchitecture
[Figure: Processor Cache Subsystem. Core 0 and Core 1 each have a 32KB FLC Instruction, a 32KB FLC Data and a 256KB MLC; both cores share a ≤ 4MB Last Level Cache]
9. All New 2010 Intel® Core™ Performance-Based Technology Overview
Core 2010 Features
Intel® Hyper-Threading Technology
• Smart multitasking by doubling the number of processor threads per core with Intel® Hyper-Threading Technology
Intel® Turbo Boost Technology1
• Intelligently and seamlessly delivers improved CPU performance to match your workload when thermal and power headroom exist
Intel® HD Graphics with Dynamic Frequency (available on mobile only)
• Delivers graphics performance boost to graphics intensive applications provided thermal and power headroom exist
[Figure: block diagram of CPU cores (two CPU threads each) and a GFX core, labeled "Intel® Turbo Boost and Hyper-Threading Technologies" and "Intel HD Graphics with Dynamic Frequency (Mobile Only)"]
New Intel® Core processors with Intel® Turbo Boost Technology and Dynamic Frequency to maximize performance of CPU and graphics intensive tasks
Note1: See Intel® Turbo Boost Technology disclaimer in the back-up
10. Intel® Turbo Boost Technology
[Figure: Turbo Boost, Previous Generation vs. Current Platform. Panel labels: Single Core CPU Turbo, Dual Core CPU Turbo, Intel® Intelligent Power Sharing (dynamically trade TDP budget; Scenario 1: CPU Intensive Load; Scenario 2: GFX Intensive Load; GFX Turbo). Bar labels: +1 Speed Bin, +Multiple Speed Bins, C3 State or lower]
Note: CPU and GFX can turbo simultaneously
Strategy: Maximize CPU and GFX performance while staying within the processor TDP and Tjmax
Note: Some features may be available only on certain SKUs
11. Intel® Turbo Boost Technology
[Figure: performance chart comparing Processor w/Turbo and Processor w/out Turbo]
Intel® Turbo Boost Technology is targeted to deliver additional performance gains on Platform
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any
difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or
components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect
actual performance.
Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system
hardware or software design or configuration may affect actual performance.
12. Intel® Advanced Digital Media Boost
Single Cycle SSE in Each Core
[Figure: SSE Operation (SSE/SSE2/SSE3). A 128-bit SOURCE (bits 127 to 0: X4 X3 X2 X1) and a DEST (Y4 Y3 Y2 Y1) are combined by an SSE/2/3 OP after DECODE and EXECUTE.
Previous: CLOCK CYCLE 1 produces X2opY2 and X1opY1; CLOCK CYCLE 2 produces X4opY4 and X3opY3.
Intel® Core™ Microarchitecture: CLOCK CYCLE 1 produces X4opY4, X3opY3, X2opY2 and X1opY1.]
128 bit Single Cycle in each core
13. Single Instruction Multiple Data (SIMD)
• Anything that fits into 16 byte…
• and all conversions!
4x floats
2x doubles
16x bytes
8x words
4x dwords
2x qwords
1x dqword
14. Intel® Advanced Vector Extensions (Intel® AVX)
• Features:
– New 256-bit Instruction Set Architecture (ISA)
– Built on legacy 128-bit SIMD (SSEx) and 64-bit SIMD (MMX) ISA extensions
– Enhancements to 128-bit SIMD instructions
– Support for 3- and 4-operand syntax
• Expected Intel® AVX benefits:
– Image, video and audio processing
– CNC* & PLC compute performance
– High performance Digital Signal & Image Processing (DSIP) within small Size, Weight and total Power (SW&P)
• Targeted segments:
– Military/Aerospace/Government
– Medical Imaging
– Comms, Industrial Controllers & Digital Signage
Source: http://software.intel.com/en-us/avx/
Performance Improvements for Floating Point Intensive Applications
16. Simplified Threaded Development with Intel® Tools
Architectural Analysis (Analyzers)
• Find the code that can benefit from threading
• Find hotspots that limit performance
Introduce Threads (Compilers and Libraries)
Compilers
• Built-in optimization
• OpenMP
Libraries
• Multimedia & data processing
• Math Processing
• Threading
Confidence/Correctness (Checkers)
• Find deadlocks and race conditions
Optimize / Tune (Analyzers)
• Tune for performance and scalability
• Visualize efficiency of threaded code
17. Intel® Integrated Performance Primitives
(Intel® IPP) — Overview and Benefits
Application Source Code
Intel IPP Usage Code Samples (Free Code Samples): Rapid Application Development
• Sample video/audio/speech codecs
• Image processing and JPEG
• Signal processing
• Data compression
• .NET and Java integration
API calls
Intel IPP Library C/C++ API (Cross-platform API): Compatibility and Code Re-Use
• Cryptography
• Image processing
• Data Compression
• Data Integrity
• Image color conversion
• JPEG / JPEG2000
• Signal processing
• Matrix mathematics
• Computer Vision
• Video coding
• Vector mathematics
• String processing
• Audio coding
• Speech coding
• Speech recognition
Static/Dynamic Link
Intel IPP Processor-Optimized Binaries (Processor-Optimized Implementation): Outstanding Performance
• Intel® Atom™ Processors
• Intel® Core™ i7 Processors
• Intel® Core™ 2 Duo and Core™ Extreme Processors
• Intel® Core™ Duo and Core™ Solo Processors
• Intel® Pentium® D Dual-Core Processors
• Intel® Xeon® 64-bit Dual-Core Processors
• Intel® Pentium® M and Pentium® 4 Processors
• Intel® Itanium® 64-bit Processor Family
• Intel® Xeon® DP and MP Processors
18. Intel® IPP Function Library
• Over 11,000 functions in 15 domains
• Threaded application support
– all functions are fully thread-safe
– many functions internally threaded
• Multiple data type support
– Fixed and floating point data type support
– 8, 16, 32 and 64-bit
• Supports both static and dynamic linking
– Maximize performance while balancing application size
19. Intel® Integrated Performance Primitives
(IPP)
Intel IPP vs. C on single processor
• 200% faster (average over all domains)
• Optimized C performance normalized to 1
System configuration: Intel® Xeon® 4 Processor, 2.8GHz, 2GB
using Windows* XP
22. Intel® IPP Code Samples:
Multithreaded H.264 Video Decode
Measured using a Dell* Inspiron* 9400 PC with an Intel® Core™ Duo Processor 2.2GHz, 512MB RAM using Microsoft Windows* XP SP2. Codec samples compiled using
Intel® C++ Compiler 9.1 using compilation options $(ICL_OMPLIB_OPT) /Qwd9,171,188,593,810,981,1125,1418 -D_OMP_KARABAS -D_OPENMP -Qopenmp
23. Intel® Threading Building Blocks
Extend C++ for parallelism
Highlights
• A C++ runtime library that does thread management, letting
developers focus on proven parallel patterns
• Appropriately scales to the number of HW threads available
• Supports nested parallelism
• The thread library API is portable across Linux, Windows,
and Mac OS* platforms. Open Source community extended
support to FreeBSD*, IA Solaris* and XBox* 360
• Run-time library provides optimal size thread pool, task
granularity and performance oriented scheduling
• Automatic load balancing through task stealing
• Cache efficiency and memory reuse
• Committed to:
• compiler independence
• processor independence
• OS independence
Both GPL and commercial licenses are available.
http://threadingbuildingblocks.org
*Other names and brands may be claimed as the property of others
24. Check Intel® TBB online
www.threadingbuildingblocks.org
Active user forums, FAQs, technical
blogs, latest documentation
Open Source Package License information.
Several very important contributions
were made by the OS community
allowing TBB 2.1 to build and work on:
XBox* 360, Sun Solaris*, AIX*
TBB news column and introductory videos
26. Data Race
• Suppose a=1, b=2
Thread1: x = a + b
Thread2: b = 42
What is the value of x if:
– Thread1 runs before Thread2? x = 3
– Thread2 runs before Thread1? x = 43
Execution order is not guaranteed
31. Load Imbalance
• Unequal work loads lead to idle threads and wasted time
[Figure: timeline of Thread 0 to Thread 3 between "Start threads" and "Join threads", showing Busy and Idle periods]
32. Synchronization
• By definition, synchronization serializes execution
• Lock contention means more idle time for threads
[Figure: timeline of Thread 0 to Thread 3 over Time, showing Busy, Idle and In Critical periods]
33. Real example: Before fix (thread profiler)
[Figure: thread profiler timeline with regions labeled Serial, Switching Overhead and Parallel]
34. Real example: After fix
2 X Speed Up
Serial
Parallel
35. Summary
• If the hardware doesn’t win outright (unlikely), then it is the SW’s fault
– And we can fix the SW
• Parallelization is imperative
• Intel offers a set of tools, world-wide experience and online support
• Questions to be asked:
– Have we enabled SMT?
– Have we investigated the capabilities of SSE?
– Did we license Intel SW tools? (IPP/TBB/Thread Checker…)
– Where can I find an Intel acronym dictionary?