4. Processor/Node Architecture
• Intel Xeon E5-2600 processor: Sandy Bridge microarchitecture
• Released: March 2012
• Up to 8 cores (16 threads), up to 3.8 GHz (turbo boost)
• DDR3-1600 memory at 51 GB/s
• 64 KB L1 (3 cycles), 256 KB L2 (8 cycles), 20 MB L3
• Core-memory: ring-topology interconnect
• CPU-CPU: QPI interconnect
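The figures above imply a useful back-of-the-envelope machine balance. A minimal sketch, assuming 8 double-precision flops per cycle per core (256-bit AVX, one add plus one multiply per cycle) -- the flops/cycle figure is an assumption, not stated on the slide:

```python
# Peak compute and machine balance for the Xeon E5-2600 figures above.
cores = 8
clock_ghz = 3.8                 # turbo clock, from the slide
flops_per_cycle = 8             # assumed: 256-bit AVX add + multiply per cycle

peak_gflops = cores * clock_ghz * flops_per_cycle
mem_bw_gbs = 51                 # DDR3-1600 bandwidth, from the slide

# Machine balance: bytes of memory bandwidth per flop of peak compute.
bytes_per_flop = mem_bw_gbs / peak_gflops

print(f"peak = {peak_gflops:.1f} GFLOPS, balance = {bytes_per_flop:.2f} B/flop")
```

The roughly 0.2 B/flop result is why the slide's later emphasis on limited memory bandwidth matters: many kernels need far more than that.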
5. Processor/Node Architecture
• Intel Knights Corner: Many Integrated Core (MIC) architecture
• Early version demonstrated: Nov 2011
• Over 50 cores, each operating at 1.2 GHz, with 512-bit vector processing units, 8 MB of cache, and 4 threads per core
• 1 TFLOPS
• It can be coupled with up to 2 GB of GDDR5 memory. The chip is manufactured using a 22 nm process with 3D tri-gate transistors.
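The 1 TFLOPS claim is consistent with the listed core count and clock. A quick sanity check, assuming each 512-bit vector unit (8 double-precision lanes) executes one fused multiply-add per cycle, i.e. 16 DP flops/cycle per core -- these per-cycle figures are assumptions consistent with the slide, not taken from it:

```python
# Sanity check of the ~1 TFLOPS figure for the early Knights Corner part.
cores = 50                      # "over 50 cores", from the slide
clock_ghz = 1.2
dp_lanes = 512 // 64            # 8 double-precision lanes per 512-bit vector
flops_per_cycle = dp_lanes * 2  # assumed: FMA counts as 2 flops per lane

peak_tflops = cores * clock_ghz * flops_per_cycle / 1000
print(f"peak = {peak_tflops:.2f} TFLOPS")   # just under 1 TFLOPS
```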
14. Heterogeneous Platforms: Tianhe-1A
• 14,336 Intel Xeon X5670 processors and 7,168 NVIDIA Tesla M2050 general-purpose GPUs
• Theoretical peak performance of 4.701 PFLOPS
• 2 PB disk and 262 TB RAM
• The Arch interconnect links the server nodes together using optical-electric cables in a hybrid fat-tree configuration
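The 4.701 PFLOPS peak can be reconstructed from the component counts. A sketch, assuming the standard published per-device peaks (Xeon X5670: 6 cores x 2.93 GHz x 4 DP flops/cycle; Tesla M2050: 515.2 GFLOPS double precision) -- these per-device figures are not on the slide:

```python
# Reconstructing Tianhe-1A's theoretical peak from its component counts.
n_cpus = 14336
cpu_gflops = 6 * 2.93 * 4       # X5670: 6 cores x 2.93 GHz x 4 DP flops/cycle
n_gpus = 7168
gpu_gflops = 515.2              # M2050 double-precision peak

peak_pflops = (n_cpus * cpu_gflops + n_gpus * gpu_gflops) / 1e6
print(f"peak = {peak_pflops:.3f} PFLOPS")
```

Note that the GPUs contribute roughly three quarters of the peak, which is what makes the platform "heterogeneous" in the slide's sense.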
16. From 10 to 1000 PFLOPS
Several critical issues must be addressed:
• Power (GFLOPS/W)
• Fault tolerance (MTBF and high component count)
• Node performance (especially in view of limited memory)
• I/O (especially in view of limited I/O bandwidth)
• Heterogeneity (regarding application composition)
• (and many more to come)
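The GFLOPS/W metric at the top of the list translates directly into a hard target. A minimal sketch, assuming the commonly cited ~20 MW facility power budget for an exaflop system (the 20 MW figure is an assumption, not from the slide):

```python
# What power efficiency an exaflop machine needs under a fixed power budget.
target_flops = 1e18            # 1 EFLOPS
power_budget_w = 20e6          # assumed: ~20 MW facility budget

required_gflops_per_w = target_flops / power_budget_w / 1e9
print(f"required efficiency = {required_gflops_per_w:.0f} GFLOPS/W")
```

Fifty GFLOPS/W end to end, including memory, interconnect, and cooling, is orders of magnitude beyond the systems of the petascale era, which is why power heads the list.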
18. Architectures Considered
• Evolutionary Strawmen
– “Heavyweight” Strawman based on commodity-derived microprocessors
– “Lightweight” Strawman based on custom microprocessors
• Aggressive Strawmen
– “Clean Sheet of Paper” CMOS Silicon
19. Evolutionary Scaling Assumptions
• Applications will demand the same DRAM/flops ratio as today
• Ignore any changes needed in disk capacity
• Processor die size will remain constant
• Continued reduction in device area => multi-core chips
• Vdd and max power dissipation will flatten as forecast
– Thus clock rates are limited, as before
• On a per-core basis, micro-architecture will improve from 2 flops/cycle to 4 in 2008, and 8 in 2015
• Max # of sockets per board will double roughly every 5 years
• Max # of boards per rack will increase once, by 33%
• Max power per rack will double every 3 years
• Allow growth in system configuration by 50 racks each year
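The bookkeeping above can be sketched as a toy projection. Only the growth rules come from the slide; the baseline numbers (cores per socket, clock, sockets per board, boards per rack, starting rack count) are illustrative assumptions, and the one-time 33% board increase is ignored for simplicity:

```python
# Toy projection of system peak under the evolutionary scaling assumptions.
def projected_peak_pflops(year, base_year=2008):
    dt = year - base_year
    flops_per_cycle = 4 if year < 2015 else 8   # per core, from the slide
    cores_per_socket = 4 * 2 ** (dt // 2)       # assumed: doubles every 2 yrs
    clock_ghz = 2.5                             # flat clocks, per the slide
    sockets_per_board = 4 * 2 ** (dt // 5)      # doubles every 5 years
    boards_per_rack = 24                        # assumed baseline
    racks = 100 + 50 * dt                       # +50 racks per year
    flops = (flops_per_cycle * cores_per_socket * clock_ghz * 1e9
             * sockets_per_board * boards_per_rack * racks)
    return flops / 1e15

print(f"2015 projection = {projected_peak_pflops(2015):.1f} PFLOPS")
```

Even with every knob turned, this kind of compounding lands in the tens of petaflops range by mid-decade, which foreshadows the "NOT FEASIBLE" verdict later in the deck.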
20. The Power Models
• Simplistic: a highly optimistic model
– Max power per die grows as per the ITRS roadmap
– Power for memory grows only linearly with the # of chips
• Power per memory chip remains constant
– Power for routers and common logic remains constant
• Regardless of the obvious need to increase bandwidth
– True if the energy per bit moved/accessed decreases as fast as "flops per second" increase
• Fully Scaled: a pessimistic model
– Same as Simplistic, except memory & router power grow with peak flops per chip
– True if the energy per bit moved/accessed remains constant
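The difference between the two models can be made concrete. A sketch in which all absolute power numbers are illustrative assumptions, chosen only to show the shapes of the two curves:

```python
# Contrasting the Simplistic and Fully Scaled power models: under Simplistic,
# memory/router power stays flat as peak flops grow; under Fully Scaled it
# grows in proportion to peak flops.
def system_power_mw(peak_pflops, model, base_pflops=1.0,
                    cpu_mw=1.0, mem_router_mw=0.5):
    scale = peak_pflops / base_pflops
    cpu = cpu_mw * scale                   # compute power grows with flops
    if model == "simplistic":
        other = mem_router_mw              # memory/router power held flat
    elif model == "fully_scaled":
        other = mem_router_mw * scale      # grows with peak flops per chip
    else:
        raise ValueError(model)
    return cpu + other

for model in ("simplistic", "fully_scaled"):
    print(f"{model}: {system_power_mw(1000, model):.1f} MW at 1 EFLOPS")
```

At a 1000x scale-up the two models diverge by the entire memory-and-router term, which is why the report treats them as optimistic and pessimistic bounds.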
23. Architectures Considered
• Evolutionary Strawmen: NOT FEASIBLE
– “Heavyweight” Strawman based on commodity-derived microprocessors
– “Lightweight” Strawman based on custom microprocessors
• Aggressive Strawmen
– “Clean Sheet of Paper” CMOS Silicon
31. My View (based on DARPA report)
• Power is a major consideration
• Faults and fault tolerance are major issues
• Constraints on power density constrain processor speed –
thus emphasizing concurrency
• Levels of concurrency needed to reach exascale are projected to be over 10^9 cores
• For these reasons, the evolutionary path to an exaflop is unlikely to succeed by 2018, and at best around 2020
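The ~10^9-core projection follows from dividing an exaflop target by a plausible per-core rate. A sketch in which the per-core sustained rates are illustrative assumptions:

```python
# Where the ~10^9-core concurrency estimate comes from.
target_flops = 1e18                  # 1 EFLOPS

for core_gflops in (1, 4, 16):       # assumed sustained GFLOPS per core
    cores_needed = target_flops / (core_gflops * 1e9)
    print(f"{core_gflops:>2} GFLOPS/core -> {cores_needed:.0e} cores")
```

Even at an optimistic 16 sustained GFLOPS per core, over 60 million cores are needed; at 1 GFLOPS per core the count reaches a billion, which is the slide's point about concurrency.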
33. NVIDIA Echelon Project: Extreme-scale Computer
Hierarchies with Efficient Locality-Optimized Nodes
• 64 NoC (network-on-chip) tiles, each with 4 SMs; each SM has 8 SM lanes
• 8 LOCs (latency-optimized cores)
• 2.5 GHz clock
• 10 nm process
[Figures: chip floorplan; node and system organization]
Objectives:
• 16 TFLOPS (double precision) per chip in 2018 at best
• 100x better application energy efficiency than today's CPU systems
• Improved programmer productivity
• Strong scaling for many applications
• High application mean time to interrupt (AMTTI)
• Machines resilient to attack
35. DOE’s points on Exascale System
• Voltage scaling to reduce power and energy
– Explodes parallelism
– Cost of communication vs. computation: a critical balance
• It's not about the FLOPS; it's about data movement.
– Algorithms should be designed to perform more work per unit of data movement.
– Programming systems should further optimize this data movement.
– Architectures should facilitate this by providing an exposed hierarchy and efficient communication.
• System software to orchestrate all of the above
– Self-aware operating systems
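The "it's about data movement" point can be quantified with rough per-operation energies. The picojoule figures below are commonly cited ballpark assumptions from the exascale-era reports, not values from the slide:

```python
# Comparing the energy cost of computation vs. data movement.
E_FLOP_PJ = 20          # assumed: one double-precision flop, ~20 pJ
E_DRAM_PJ = 2000        # assumed: moving one 64-bit word from off-chip DRAM

# How many flops' worth of energy one off-chip memory access costs:
flops_per_dram_access = E_DRAM_PJ / E_FLOP_PJ
print(f"1 DRAM word costs the energy of ~{flops_per_dram_access:.0f} flops")
```

Under these numbers, an algorithm that does fewer than ~100 flops per off-chip word is paying more energy for movement than for arithmetic, which is exactly the balance the bullets above ask algorithms, programming systems, and architectures to improve.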