2. Agenda
• National Supercomputer Center in Tianjin (NSCC-TJ)
• TH-1A system
• Hardware sub-system
• Software sub-system
• Applications
3. NSCC-TJ
• National Supercomputer Center in Tianjin
• Sponsored by
  • Chinese Ministry of Science and Technology
  • Tianjin Binhai New Area
• Public information infrastructure
  • To accelerate the economy, education and industry of Northern China
  • To provide high-performance computing services to the whole of China
• Open platform for research and education
4. NSCC-TJ
[Figure: site layout showing the main building, office, computer room, and transformer station & air conditioner]
Total area: 2400 m²
5. NSCC-TJ
First floor of the central computing room: 1200 m²
6. NSCC-TJ
Second floor of the central computing room: visualization environment, 1200 m²
11. TH-1A system
• Enhanced system based on the TH-1 system (Sep. 2009)
• Installed at NSCC-TJ, Aug. 2010
• Debugging and performance testing, Sept.-Oct. 2010
• In service after Nov. 2010
Items         Configuration
Processors    14,336 Intel CPUs + 7,168 NVIDIA GPUs + 2,048 FT CPUs
Memory        262 TB in total
Interconnect  Proprietary high-speed interconnection network
Storage       2 PB
Cabinets      120 compute/service cabinets
              14 storage cabinets
              6 communication cabinets
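The processor and memory totals in the configuration table can be cross-checked against the per-node figures quoted on the compute and service sub-system slides (7,168 compute nodes with 2 CPUs, 1 GPU and 32 GB each; 1,024 service nodes with 2 FT CPUs and 32 GB each). A minimal Python sketch of that check; the names are illustrative, and reading "262 TB" as 262,144 GB expressed in thousands of GB is an assumption:

```python
# Per-node figures as quoted on the compute and service sub-system slides.
COMPUTE_NODES = 7168
SERVICE_NODES = 1024
CPUS_PER_COMPUTE_NODE = 2
GPUS_PER_COMPUTE_NODE = 1
CPUS_PER_SERVICE_NODE = 2
MEMORY_GB_PER_NODE = 32  # same figure for both node types


def system_totals():
    """Derive system-wide totals from the per-node configuration."""
    return {
        "intel_cpus": COMPUTE_NODES * CPUS_PER_COMPUTE_NODE,
        "nvidia_gpus": COMPUTE_NODES * GPUS_PER_COMPUTE_NODE,
        "ft_cpus": SERVICE_NODES * CPUS_PER_SERVICE_NODE,
        "memory_gb": (COMPUTE_NODES + SERVICE_NODES) * MEMORY_GB_PER_NODE,
    }


if __name__ == "__main__":
    for name, value in system_totals().items():
        print(f"{name}: {value}")
```

Under these figures the derived counts match the table: 14,336 Intel CPUs, 7,168 GPUs, 2,048 FT CPUs, and 262,144 GB of memory.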
12. TH-1A system
• TH-1A system architecture
  • Hybrid MPP structure: CPU & GPU
  • Proprietary compute nodes
  • Connected by a proprietary high-speed interconnect network
  • Global shared parallel storage system
  • Custom software stack
13. TH-1A hardware sub-system
[Figure: hardware block diagram]
• Compute sub-system: CPU + GPU nodes
• Service sub-system: operation nodes
• Communication sub-system
• Storage sub-system: MDS and OSS servers
• Monitor and diagnosis sub-system
14. Compute sub-system
• 7,168 compute nodes
• 2 six-core CPUs and 1 GPU per node
• CPU
  • Xeon X5670 (Westmere)
  • Processor speed: 2.93 GHz
• GPU
  • NVIDIA Tesla M2050
  • Connected to the CPUs by PCI-E
• 32 GB memory per node
• 2U height
• Peak performance
  • 4,701,061 Gflops
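The peak figure can be reproduced from the node configuration. The sketch below is a worked calculation, not something taken from the slides: it assumes 4 double-precision flops per cycle per Westmere core and NVIDIA's published double-precision peak of 515.2 Gflops for the Tesla M2050. Both values are assumptions brought in from outside the deck.

```python
NODES = 7168
CPUS_PER_NODE = 2
CORES_PER_CPU = 6
CPU_CLOCK_GHZ = 2.93

# Assumption (not on the slide): each Westmere core retires
# 4 double-precision flops per cycle (128-bit SSE add + multiply).
CPU_FLOPS_PER_CYCLE = 4

# Assumption (not on the slide): published double-precision peak
# of one Tesla M2050, in Gflops.
GPU_PEAK_GFLOPS = 515.2


def node_peak_gflops():
    """Peak of one compute node: two CPUs plus one GPU."""
    cpu = CPUS_PER_NODE * CORES_PER_CPU * CPU_CLOCK_GHZ * CPU_FLOPS_PER_CYCLE
    return cpu + GPU_PEAK_GFLOPS


def system_peak_gflops():
    """Peak of the whole compute sub-system."""
    return NODES * node_peak_gflops()


if __name__ == "__main__":
    print(f"per node: {node_peak_gflops():.2f} Gflops")
    print(f"system:   {system_peak_gflops():,.0f} Gflops")
```

With these assumptions each node peaks at about 655.84 Gflops, and 7,168 nodes give roughly 4,701,061 Gflops, which agrees with the slide. The service nodes are not included in this figure.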
15. Service sub-system
• 1,024 service nodes
• 2 eight-core domestic CPUs per node
• CPU: FT-1000
  • SoC
  • 1.0 GHz
  • Eight cores, eight threads per core
  • Peak performance: 8 Gflops
• 32 GB memory per node
• For login, compilation, and applications that need throughput computing
16. Proprietary interconnection network
• Interconnection signal speed: 10 Gbps
• Bi-directional bandwidth: 160 Gbps
• Hierarchical fat-tree structure
  • First stage: 16 nodes connected by a 16-port switching board
  • Second stage: all parts connected to eleven 384-port switches
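One way to read the two bandwidth figures together: if the 160 Gbps figure counts both directions and every serial lane runs at the quoted 10 Gbps signal speed, the implied link width is 8 lanes per direction. The lane count is inferred here, not stated on the slide, and the sketch ignores any line-coding overhead.

```python
SIGNAL_GBPS = 10          # per-lane signal speed quoted on the slide
BIDIRECTIONAL_GBPS = 160  # per-link bi-directional bandwidth quoted on the slide


def lanes_per_direction(bidirectional_gbps, lane_gbps):
    """Implied lane count, assuming the bandwidth is split evenly by direction."""
    per_direction_gbps = bidirectional_gbps / 2
    return per_direction_gbps / lane_gbps


if __name__ == "__main__":
    print(lanes_per_direction(BIDIRECTIONAL_GBPS, SIGNAL_GBPS))
```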
17. Proprietary interconnection network
• High-radix router ASIC: NRC
  • Feature size: 90 nm
  • Die size: 17.16 mm x 17.16 mm
  • Package: FC-PBGA
  • 2577 pins
  • Throughput of a single NRC: 2.56 Tbps
• Network interface ASIC: NIC
  • Same feature size and package as NRC
  • Die size: 10.76 mm x 10.76 mm
  • 675 pins
18. Proprietary interconnection network
[Figures]
• 16-port switch board in cabinet
• Leaf switch blade and root switch blade of the 384-port switch
• Back plane of the 384-port switch, about 700 mm x 600 mm
19. Proprietary interconnection network
• Switching board and high-radix switch
  • Based on the network interface ASIC and router ASIC
  • Reduced user communication protocol
  • Throughput: 61.44 Tbps
[Figure: front and back views of two 384-port high-radix switches]
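The throughput figures quoted for the router ASIC (2.56 Tbps) and for the 384-port switch (61.44 Tbps) are both consistent with multiplying a port count by the 160 Gbps bi-directional bandwidth per port. The sketch below assumes the NRC has 16 ports, which matches the 16-port switching board but is not stated explicitly on the slides.

```python
PORT_GBPS = 160  # bi-directional bandwidth per port, from the interconnect slide


def aggregate_tbps(ports, port_gbps=PORT_GBPS):
    """Aggregate throughput in Tbps for a device with the given port count."""
    return ports * port_gbps / 1000


if __name__ == "__main__":
    # Assumption: 16 ports per NRC router ASIC.
    print(f"NRC (16 ports):      {aggregate_tbps(16):.2f} Tbps")
    print(f"384-port switch:     {aggregate_tbps(384):.2f} Tbps")
```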
20. Storage sub-system
• Capacity: 2 PB
• Connected by the proprietary interconnection network
• Lustre-based parallel file system
21. Monitor and diagnosis sub-system
• Rich monitoring and control functions
• Real-time monitoring of hardware parameters
• Precise fault location
• Alarms and immediate action in emergencies
• Self-feedback cooling adjustment based on environment status
• I2C & JTAG diagnosis mechanism
• Large-scale console
• Remote monitoring and management
24. Operating system
• Kylin Linux
• Compute node kernel
• Provides virtual running environments
  • Isolated running environments for different users
  • Custom software package installation
• QoS support
• Power-aware computing
25. Compiler system
• C, C++, Fortran, Java
• OpenMP, MPI, OpenMP/MPI
• CUDA, OpenCL
• Heterogeneous programming framework
  • Accelerates large-scale, complex applications, especially applications still under development or whose full source code is not available
  • Uses the computing power of both CPUs and GPUs while hiding GPU programming from users
  • Inter-node homogeneous parallel programming (users)
  • Intra-node heterogeneous parallel computing (computer experts)
26. Compiler system
• Heterogeneous programming framework
• Inter-node homogeneous parallel programming (JASMIN)
  • Patch-based object data structures
  • MPI communication, dynamic load balancing support
  • Zero-copy optimization in the communication library
27. Compiler system
• Heterogeneous programming framework
• Intra-node heterogeneous parallel computing
  • Compiler-optimized / hand-tuned threaded code
  • Optimizations include
    • Adaptive partitioning: balance the workloads between CPUs and GPU
    • Asynchronous data transfer / computing: overlap CPU operations with GPU operations
    • Software pipelining: overlap GPU computing with data transfer between host and GPU device memory
    • ……
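The benefit of software pipelining can be illustrated with a toy timing model. It is a sketch under stated assumptions, not the framework's actual scheduler: each patch is assumed to go through three stages (host-to-device transfer, GPU compute, device-to-host transfer), each stage is assumed to have a fixed cost per patch, and the three stages are assumed to be able to run concurrently on different patches. The function names and the timings are hypothetical.

```python
def serial_time(n_patches, t_in, t_compute, t_out):
    """Total time when every patch finishes all three stages before the next starts."""
    return n_patches * (t_in + t_compute + t_out)


def pipelined_time(n_patches, t_in, t_compute, t_out):
    """Total time for an ideal three-stage pipeline.

    The first patch pays the full latency; after that, one patch completes
    every `max(stage times)` because the slowest stage sets the pace.
    """
    if n_patches == 0:
        return 0.0
    latency = t_in + t_compute + t_out
    bottleneck = max(t_in, t_compute, t_out)
    return latency + (n_patches - 1) * bottleneck


if __name__ == "__main__":
    # Hypothetical per-patch costs in milliseconds.
    n, t_in, t_compute, t_out = 8, 1.0, 3.0, 1.0
    print("serial:   ", serial_time(n, t_in, t_compute, t_out))
    print("pipelined:", pipelined_time(n, t_in, t_compute, t_out))
```

In this model the transfers are fully hidden behind compute whenever compute is the slowest stage, so the total time approaches the pure compute time as the number of patches grows.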
28. Compiler system
• Heterogeneous programming framework
• An example: 3-D short-range molecular simulations
  • For each time step
    • Split the workload (force calculation) between CPU and GPU
    • For each patch allocated to the GPU
      • Start asynchronous operations: transfer the patch data to the GPU, compute the patch, get the results from the GPU
    • For each patch allocated to the CPU
      • Launch threads on CPU cores to compute the patch
    • CPU waits for the GPU completion event
    • Adjust the split value according to the CPU/GPU performance (patches per second + empirical)
    • Other workload (velocity, position) computed on the CPU
• Performance: one NVIDIA M2050 GPU is 3 times faster than one Intel X5670 CPU
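The adaptive split in this example can be sketched as a small simulation. This is an illustration under assumptions, not the framework's code: the GPU and CPU work is replaced by fixed throughputs, the "empirical" adjustment is modelled as a simple damping factor, and all names are hypothetical. Throughput is expressed relative to one X5670 CPU, so the GPU runs at 3.0 and the node's two CPUs together at 2.0.

```python
def run_step(n_patches, gpu_share, gpu_rate, cpu_rate):
    """Simulate one time step; return patch counts and elapsed time per side."""
    gpu_patches = round(n_patches * gpu_share)
    cpu_patches = n_patches - gpu_patches
    gpu_time = gpu_patches / gpu_rate  # stands in for async transfer + compute
    cpu_time = cpu_patches / cpu_rate  # stands in for threads on the CPU cores
    return gpu_patches, cpu_patches, gpu_time, cpu_time


def adjust_share(gpu_share, gpu_pps, cpu_pps, damping=0.5):
    """Move the GPU share toward the ratio of measured patches per second.

    `damping` is an assumed stand-in for the empirical term on the slide.
    """
    target = gpu_pps / (gpu_pps + cpu_pps)
    return gpu_share + damping * (target - gpu_share)


def simulate(n_steps=20, n_patches=1000, gpu_rate=3.0, cpu_rate=2.0):
    """Run the adaptive loop; return the final share and per-step times."""
    share = 0.5  # start with an even split
    gpu_pps, cpu_pps = 1.0, 1.0  # initial guesses before any measurement
    step_times = []
    for _ in range(n_steps):
        gpu_n, cpu_n, gpu_t, cpu_t = run_step(n_patches, share, gpu_rate, cpu_rate)
        # The CPU waits for the GPU completion event, so the slower side
        # determines when the step ends.
        step_times.append(max(gpu_t, cpu_t))
        # Measure patches per second on each side, keeping the old value
        # if a side received no work in this step.
        if gpu_n > 0:
            gpu_pps = gpu_n / gpu_t
        if cpu_n > 0:
            cpu_pps = cpu_n / cpu_t
        share = adjust_share(share, gpu_pps, cpu_pps)
    return share, step_times


if __name__ == "__main__":
    final_share, times = simulate()
    print(f"final GPU share: {final_share:.4f}")
    print(f"first step: {times[0]:.1f}, last step: {times[-1]:.1f}")
```

With a GPU three times faster than one CPU and two CPUs per node, the balanced split in this model gives the GPU 3/5 of the patches, and the step time falls from 250 to 200 time units as the split converges.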
29. Programming environment
• Virtual running environments
  • Provide services on demand
• Parallel toolkits
  • Based on Eclipse
  • Integrate all kinds of tools
  • Editor, debugger, profiler
• Workflow support
  • Supports QoS negotiation
  • Reserves resources for future requirements
30. Visualization system
• Application areas
  • Numerical weather forecasting
  • Computational fluid dynamics
  • Oil exploration
  • Other large-scale data
• Computing platform
  • Tianhe-1A
• Render server
  • 128 CPUs + 64 GPUs
• Display device
  • 3x6 multi-channel display wall
31. Applications
• Oil exploration
• High-end equipment development
• Bio-medical research
• Animation design
• New energy research
• New material research
• Weather and climate forecasting
• Engineering design, simulation and analysis
• Remote sensing data processing
• Financial risk analysis