Introduction to the National Supercomputer Center in Tianjin TH-1A Supercomputer

Agenda

• National Supercomputer Center in Tianjin (NSCC-TJ)
• TH-1A system
  • Hardware sub-system
  • Software sub-system
• Applications

NSCC-TJ

• National Supercomputer Center in Tianjin
• Sponsored by
  • Chinese Ministry of Science and Technology
  • Tianjin Binhai New Area
• Public information infrastructure
  • To accelerate the economy, education and industry of Northern China
  • To provide high performance computing services to the whole of China
  • Open platform for research and education

NSCC-TJ

[Photo: NSCC-TJ site. Labels: main building; office; computer room; transformer station & air conditioner. Total area: 2400m2]

NSCC-TJ

The first floor of central computing room: 1200m2

NSCC-TJ

The second floor of central computing room:
Visualization environment, 1200m2

NSCC-TJ

Electric transformer station

NSCC-TJ

Cooling water station

2011-6-28 TH-1

NSCC-TJ
• Layout of computing room

TH-1A system
• Enhanced system based on the TH-1 system (Sep. 2009)
• Installed in NSCC-TJ, Aug. 2010
• Debugging and performance testing, Sept.~Oct. 2010
• In service after Nov. 2010

Items          Configuration
Processors     14336 Intel CPUs + 7168 NVIDIA GPUs + 2048 FT CPUs
Memory         262TB in total
Interconnect   Proprietary high-speed interconnecting network
Storage        2PB
Cabinets       120 compute / service cabinets
               14 storage cabinets
               6 communication cabinets

TH-1A system

• TH-1A System Architecture
  • Hybrid MPP structure: CPU & GPU
  • Proprietary compute nodes
  • Connected by proprietary high-speed interconnect network
  • Global shared parallel storage system
  • Custom software stack

TH-1A hardware sub-system

[Diagram: TH-1A hardware sub-systems. Compute sub-system (CPU + GPU nodes); service sub-system (operation nodes); communication sub-system; storage sub-system (MDS, OSS); monitor and diagnosis sub-system]

Compute sub-system
• 7,168 compute nodes
• 2 six-core CPUs and 1 GPU per node
  • CPU
    • Xeon X5670 (Westmere)
    • Processor speed: 2.93GHz
  • GPU
    • NVIDIA Tesla M2050
    • Connected with CPU by PCI-E
• 32GB memory per node
• 2U height
• Peak performance
  • 4,701,061 Gflops
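
The peak figure above can be cross-checked with simple arithmetic. The sketch below assumes 4 double-precision flops per cycle per Westmere core and a double-precision peak of 515.2 Gflops per Tesla M2050. Neither value is stated on the slide, so both are assumptions, and the constant and function names are illustrative only.

```python
# Cross-check of the compute sub-system peak, using slide values plus two assumptions.
NODES = 7168                 # from the slide
CPUS_PER_NODE = 2            # from the slide
CORES_PER_CPU = 6            # from the slide
CPU_GHZ = 2.93               # from the slide
CPU_FLOPS_PER_CYCLE = 4      # assumption: double-precision flops per cycle per core
GPU_PEAK_GFLOPS = 515.2      # assumption: double-precision peak of one Tesla M2050


def compute_peak_gflops():
    """Return the combined CPU + GPU peak of all compute nodes, in Gflops."""
    cpu_peak = CPU_GHZ * CORES_PER_CPU * CPU_FLOPS_PER_CYCLE
    per_node = CPUS_PER_NODE * cpu_peak + GPU_PEAK_GFLOPS
    return NODES * per_node


print(f"{compute_peak_gflops():,.0f} Gflops")  # prints "4,701,061 Gflops"
```

Under these assumptions the result matches the quoted peak, which suggests the figure counts the Intel CPUs and NVIDIA GPUs of the compute nodes only.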

Service sub-system
• 1,024 service nodes
• 2 eight-core domestic CPUs per node
• CPU: FT-1000
  • SoC
  • 1.0GHz
  • Eight cores, eight threads per core
  • Peak performance 8Gflops
• 32GB memory per node
• For login, compiling, and applications that need throughput computing
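
The processor and memory totals in the configuration table follow from the per-node figures on the compute and service sub-system slides. The sketch below redoes that arithmetic. The names are illustrative, and reading "262TB" as 262,144 GB is an interpretation rather than something the slides state.

```python
# Totals derived from per-node figures on the compute and service sub-system slides.
COMPUTE_NODES = 7168
SERVICE_NODES = 1024
CPUS_PER_NODE = 2        # both node types carry two CPUs
MEM_PER_NODE_GB = 32     # both node types carry 32GB


def system_totals():
    """Return (Intel CPUs, FT CPUs, total memory in GB)."""
    intel_cpus = COMPUTE_NODES * CPUS_PER_NODE
    ft_cpus = SERVICE_NODES * CPUS_PER_NODE
    memory_gb = (COMPUTE_NODES + SERVICE_NODES) * MEM_PER_NODE_GB
    return intel_cpus, ft_cpus, memory_gb


intel, ft, mem_gb = system_totals()
print(intel, ft, mem_gb)  # prints "14336 2048 262144"
```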

Proprietary interconnection network
• Interconnection signal speed: 10Gbps
• Bi-directional bandwidth: 160Gbps
• Hierarchical fat-tree structure
  • First stage: 16 nodes connected by a 16-port switching board
  • Second stage: all parts connected to eleven 384-port switches
• High-radix router ASIC: NRC
  • Feature size: 90nm
  • Die size: 17.16mm x 17.16mm
  • Package: FC-PBGA
  • 2577 pins
  • Throughput of single NRC: 2.56Tbps
• Network interface ASIC: NIC
  • Same feature size and package as NRC
  • Die size: 10.76mm x 10.76mm
  • 675 pins

[Photos: 16-port switch board in cabinet; leaf switch blade and root switch blade of the 384-port switch; back plane of the 384-port switch, about 700mm x 600mm]

Proprietary interconnecting network
• Switching board and high-radix switch
  • Based on network interface ASIC and router ASIC
  • Reduced user communication protocol
  • Throughput: 61.44Tbps
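
The bandwidth figures on the interconnect slides are consistent with each other if throughput is read as port count times the 160Gbps bi-directional link bandwidth. The sketch below checks that reading. The slides do not state the derivation, so the 16-port count for a single NRC and the lane count are inferences, and all names are illustrative.

```python
# Consistency check of the interconnect figures (the derivation is an inference).
SIGNAL_GBPS = 10          # per-lane signal speed, from the slides
LINK_BIDIR_GBPS = 160     # bi-directional bandwidth per port, from the slides
NRC_PORTS = 16            # assumption: one NRC serves 16 ports
SWITCH_PORTS = 384        # ports of one high-radix switch, from the slides


def lanes_per_direction():
    """Lanes per direction implied by the link bandwidth and signal speed."""
    return LINK_BIDIR_GBPS // 2 // SIGNAL_GBPS


def aggregate_tbps(ports):
    """Aggregate throughput in Tbps for a device with the given port count."""
    return ports * LINK_BIDIR_GBPS / 1000


print(lanes_per_direction())         # prints "8"
print(aggregate_tbps(NRC_PORTS))     # prints "2.56"
print(aggregate_tbps(SWITCH_PORTS))  # prints "61.44"
```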

[Photos: two 384-port high-radix switches, front and back]

Storage sub-system
• Capacity: 2 PB
• Connected by proprietary interconnection network
• Lustre-based parallel file system

Monitor and diagnosis sub-system
• Rich monitor & control functions
  • Real-time monitoring of hardware parameters
  • Precise fault location
  • Alarm and immediate action in emergencies
  • Self-feedback cooling adjustment according to environment status
  • I2C & JTAG diagnosis mechanism
• Large scale console
• Remote monitoring and management

Computing cabinet
• Node: 2 CPUs and 1 GPU
• Blade: 2 nodes
• Frame
  • 8 computing blades
  • 16-port switching board
  • 1 monitor and diagnosis board
• Cabinet
  • 4 frames, 64 nodes
  • 128 CPUs, 64 GPUs
  • Close-coupled chilled water cooling
  • 56KW cooling capacity in a cabinet
• Footprint
  • 700m2
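
The packaging hierarchy above multiplies out to the per-cabinet totals on the slide. The sketch below reproduces that arithmetic and, as a further step not stated on the slides, estimates how many such cabinets the 7,168 compute nodes would fill. Names are illustrative.

```python
# Packaging arithmetic for one computing cabinet, from the slide's hierarchy.
NODES_PER_BLADE = 2
BLADES_PER_FRAME = 8
FRAMES_PER_CABINET = 4
CPUS_PER_NODE = 2
GPUS_PER_NODE = 1
COMPUTE_NODES = 7168      # from the compute sub-system slide


def cabinet_contents():
    """Return (nodes, CPUs, GPUs) housed in one computing cabinet."""
    nodes = NODES_PER_BLADE * BLADES_PER_FRAME * FRAMES_PER_CABINET
    return nodes, nodes * CPUS_PER_NODE, nodes * GPUS_PER_NODE


def cabinets_needed(total_nodes):
    """Cabinets required for total_nodes, rounding up (an estimate, not a slide figure)."""
    nodes_per_cabinet, _, _ = cabinet_contents()
    return -(-total_nodes // nodes_per_cabinet)


print(cabinet_contents())              # prints "(64, 128, 64)"
print(cabinets_needed(COMPUTE_NODES))  # prints "112"
```

Each frame holds 8 blades of 2 nodes, that is 16 nodes, which matches the 16-port switching board in the frame.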

TH-1A software sub-system
• Software stack

Operating system

• Kylin Linux
• Compute node kernel
• Provides virtual running environments
  • Isolated running environments for different users
  • Custom software package installation
• QoS support
• Power-aware computing

Compiler system
• C, C++, Fortran, Java
• OpenMP, MPI, OpenMP/MPI
• CUDA, OpenCL
• Heterogeneous programming framework
  • Accelerates large-scale, complex applications, especially applications still under development or whose full source code is not available
  • Uses the computing power of CPUs and GPUs while hiding GPU programming from users
  • Inter-node homogeneous parallel programming (users)
  • Intra-node heterogeneous parallel computing (computer experts)

Compiler system
• Inter-node homogeneous parallel programming (JASMIN)
  • Patch-based object data structures
  • MPI communication, dynamic load balancing support
  • Zero-copy optimization in communication library

Compiler system
• Intra-node heterogeneous parallel computing
  • Compiler-optimized / hand-tuned threaded code
  • Optimizations include
    • Adaptive partitioning: balance the workload between CPUs and GPU
    • Asynchronous data transfer / computing: overlap CPU operations with GPU operations
    • Software pipelining: overlap GPU computing with data transfer between host and GPU device memory
    • ……
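
To illustrate why software pipelining helps, the sketch below compares an idealized timing model of processing data chunks strictly one after another with one where host-to-device transfer, GPU computing, and device-to-host transfer of different chunks overlap. It is a textbook pipeline model with constant stage times, not a measurement of TH-1A, and the function and parameter names are hypothetical.

```python
# Idealized timing model for software pipelining (constant stage times assumed).

def serial_time(n_chunks, t_in, t_compute, t_out):
    """Total time when each chunk is transferred in, computed, and transferred out in turn."""
    return n_chunks * (t_in + t_compute + t_out)


def pipelined_time(n_chunks, t_in, t_compute, t_out):
    """Total time when the three stages of different chunks overlap perfectly.

    The first chunk passes through all stages; each further chunk adds only
    the duration of the slowest stage.
    """
    if n_chunks == 0:
        return 0
    stages = (t_in, t_compute, t_out)
    return sum(stages) + (n_chunks - 1) * max(stages)


# Four chunks, 1 time unit per transfer and 2 units of GPU computing per chunk.
print(serial_time(4, 1, 2, 1), pipelined_time(4, 1, 2, 1))  # prints "16 10"
```

In this model the pipelined run approaches the cost of the slowest stage alone as the number of chunks grows.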

Compiler system
• An example: 3-D short-range molecular simulations
  • For each time step
    • Split workload (force calculation) between CPU and GPU
    • For each patch allocated to GPU
      • Start asynchronous operations: transfer the patch data to GPU, compute the patch, get results from GPU
    • For each patch allocated to CPU
      • Launch threads on CPU cores to compute the patch
    • CPU waits for GPU completion event
    • Adjust the split value according to the CPU/GPU performance (patches per second + empirical)
    • Other workload (velocity, position) computed on CPU
  • Performance: one NVIDIA M2050 GPU is 3 times faster than one Intel X5670 CPU
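
The split adjustment in the example above can be sketched as a small feedback loop. The model below is a simplified simulation, not the TH-1A framework: it assumes the CPU and GPU sides run concurrently, measures each side's throughput in patches per second, and moves the GPU share toward the throughput ratio. The `damping` factor stands in for the unspecified empirical term, and every name here is hypothetical.

```python
# Simplified simulation of adaptive CPU/GPU workload partitioning.

def rebalance(gpu_share, gpu_rate, cpu_rate, damping=0.5):
    """Move the GPU share of patches toward the ratio of measured throughputs."""
    target = gpu_rate / (gpu_rate + cpu_rate)
    return gpu_share + damping * (target - gpu_share)


def simulate(n_patches=120, gpu_speed=3.0, cpu_speed=1.0, steps=20):
    """Run `steps` time steps; return the final GPU share and the per-step history."""
    gpu_share = 0.5  # initial split: half of the patches go to the GPU
    history = []
    for _ in range(steps):
        n_gpu = round(n_patches * gpu_share)
        n_cpu = n_patches - n_gpu
        # Both sides work concurrently, so the step takes as long as the slower side.
        t_gpu = n_gpu / gpu_speed
        t_cpu = n_cpu / cpu_speed
        history.append((n_gpu, n_cpu, max(t_gpu, t_cpu)))
        # Measured throughput in patches per second (fall back if a side was idle).
        gpu_rate = n_gpu / t_gpu if t_gpu > 0 else gpu_speed
        cpu_rate = n_cpu / t_cpu if t_cpu > 0 else cpu_speed
        gpu_share = rebalance(gpu_share, gpu_rate, cpu_rate)
    return gpu_share, history


share, history = simulate()
print(history[0], history[-1])
```

With the GPU modeled as 3 times faster than the CPU side, the share settles near 0.75 and the modeled step time drops from 60 to 30 time units.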

Programming environment
• Virtual running environments
  • Provide services on demand
• Parallel toolkits
  • Based on Eclipse
  • Integrates all kinds of tools
  • Editor, debugger, profiler
• Workflow support
  • Supports QoS negotiation
  • Reserves resources for future requirements

Visualization system
• Application area
  • Numerical weather forecast
  • Computational fluid dynamics
  • Oil exploration
  • Other large-scale data
• Computing platform
  • Tianhe-1A
• Render server
  • 128 CPU + 64 GPU
• Display device
  • 3x6 multi-channel display wall

Applications
• Oil exploration
• High-end equipment development
• Bio-medical research
• Animation design
• New energy research
• New material research
• Weather and climate forecasting
• Engineering design, simulation and analysis
• Remote sensing data processing
• Financial risk analysis
