PROMETHEUS
ENERGY EFFICIENT
SUPERCOMPUTING
Marek Magryś
ACC Cyfronet AGH-UST
established in 1973
part of AGH University of Science and Technology
in Krakow, Poland
member of PIONIER consortium
operator of Krakow MAN
centre of competence in HPC and Grid Computing
provides free computing resources for scientific
institutions
home for supercomputers
Prometheus
Liquid Cooling
Water: up to 1000x more efficient heat exchange
than air
Less energy needed to move the coolant
Hardware (CPUs, DIMMs) can handle ~80C
CPU/GPU vendors show TDPs up to 300W
Challenge: cool 100% of HW with liquid
network switches
PSUs
MTBF
The less movement the better
fewer pumps
fewer fans
fewer HDDs
Example
pump MTBF: 50 000 hrs
fan MTBF: 50 000 hrs
1800 node system MTBF: 7 hrs
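The example above can be reproduced with a simple series-system estimate: if any single component failure counts as a system failure and failures are independent with constant rates, the system MTBF is the component MTBF divided by the number of components. The sketch below is illustrative only. The figure of 4 moving parts per node is an assumption chosen to show how a result near 7 hrs can arise; the slide does not state how many pumps and fans each node has.

```python
def system_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """MTBF of a series system of identical components.

    Assumes independent failures with constant rates, so failure rates add up
    and the system MTBF is the component MTBF divided by the component count.
    """
    if component_count <= 0:
        raise ValueError("component_count must be positive")
    return component_mtbf_hours / component_count


NODES = 1800                   # from the slide
COMPONENT_MTBF_HOURS = 50_000  # pump or fan MTBF from the slide
MOVING_PARTS_PER_NODE = 4      # assumption, not stated on the slide

one_part_per_node = system_mtbf_hours(COMPONENT_MTBF_HOURS, NODES)
four_parts_per_node = system_mtbf_hours(
    COMPONENT_MTBF_HOURS, NODES * MOVING_PARTS_PER_NODE
)

print(f"1 moving part per node:  {one_part_per_node:.1f} hrs")
print(f"4 moving parts per node: {four_parts_per_node:.1f} hrs")
```

Under these assumptions, even one moving part per node brings the system MTBF down to roughly a day, which is the point of the slide: every pump, fan or disk removed helps.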
Prometheus
HP Apollo 8000
16 m², 20 racks (4 CDU, 16 compute)
2.4 PFLOPS
PUE <1.05, 800 kW peak power
2232 nodes:
2160 CPU nodes: 2x Intel Haswell E5-2680v3, 128 GB
RAM, IB FDR 56 Gb/s, Ethernet 1 Gb/s
72 GPU nodes: +2 NVIDIA Tesla K40d
53568 cores, up to 13824 per island
279 TB DDR4 RAM
CentOS 7 + SLURM
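The totals on this slide follow from the per-node figures. The short check below is a sketch that recomputes them. It assumes 24 cores per node (two 12-core CPUs, as on the node slide), a largest island of 576 nodes (as on the IB network slide), and that the RAM total is expressed in binary units (1 TB = 1024 GB), which is the reading that reproduces 279 TB.

```python
CPU_NODES = 2160
GPU_NODES = 72
CORES_PER_NODE = 2 * 12      # two 12-core Xeon E5-2680 v3 CPUs per node
RAM_PER_NODE_GB = 128
LARGEST_ISLAND_NODES = 576   # node count of the largest compute island

total_nodes = CPU_NODES + GPU_NODES
total_cores = total_nodes * CORES_PER_NODE
island_cores = LARGEST_ISLAND_NODES * CORES_PER_NODE
# Assumption: the slide's "TB" is binary (1024 GB per TB).
total_ram_tb = total_nodes * RAM_PER_NODE_GB / 1024

print(f"nodes: {total_nodes}")
print(f"cores: {total_cores} (up to {island_cores} per island)")
print(f"RAM:   {total_ram_tb:.0f} TB")
```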
Prometheus storage
Diskless compute nodes
Separate project for storage
DDN SFA12kx hardware
Lustre-based
2 file systems:
Scratch: 120 GB/s, 5 PB usable space
Archive: 60 GB/s, 5 PB usable space
HSM-ready
NFS for $HOME and software
Why Apollo 8000?
Most energy efficient
The only solution with
100% warm water cooling
Highest density
Lowest TCO
Even more Apollo
Also focuses on the ‘1’ in PUE!
Power distribution
Fewer fans
Detailed monitoring
‘energy to solution’
Dry node maintenance
Fewer cables
Prefabricated piping
Simplified management
Deployment timeline
Day 0 - Contract signed (20.10.2014)
Day 23 - Installation of the primary loop starts
Day 35 - First delivery (service island)
Day 56 - Apollo piping arrives
Day 98 - 1st and 2nd island delivered
Day 101 - 3rd island delivered
Day 111 - basic acceptance ends
Official launch event on 27.04.2015
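The day offsets above can be converted to calendar dates with a few lines of Python. This is only a convenience sketch: the milestone labels are shortened from the slide, and the day number of the launch event is derived here rather than taken from the slide.

```python
from datetime import date, timedelta

CONTRACT_SIGNED = date(2014, 10, 20)  # Day 0 on the slide

# Day offset -> milestone, as listed on the slide (labels shortened).
MILESTONES = {
    0: "Contract signed",
    23: "Installation of the primary loop starts",
    35: "First delivery (service island)",
    56: "Apollo piping arrives",
    98: "1st and 2nd island delivered",
    101: "3rd island delivered",
    111: "Basic acceptance ends",
}

milestone_dates = {
    day: CONTRACT_SIGNED + timedelta(days=day) for day in MILESTONES
}

for day, label in MILESTONES.items():
    print(f"Day {day:>3} ({milestone_dates[day]:%d.%m.%Y}): {label}")

# Derived value: how many days after signing the official launch took place.
launch_day = (date(2015, 4, 27) - CONTRACT_SIGNED).days
print(f"Official launch: day {launch_day}")
```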
Facility preparation
Primary loop installation took 5 weeks
Secondary (prefabricated) just 1 week
Upgrade of the raised floor done "just in case"
Additional pipes for leakage/condensation drain
Water dam with emergency drain
A lot of space needed for the hardware deliveries (over
100 pallets)
Facility monitoring
Secondary loop
Prometheus - node
HP XL730f/XL750f Gen9
• 2x Intel Xeon E5-2680 v3 (Haswell)
• 24 cores, 2100-3300 MHz
• 30 MB cache, 128 GB RAM DDR4
• Mellanox ConnectX-3 IB FDR 56Gb/s
Prometheus - rack
HP Apollo 8000:
• 8 cells – 9 trays each – 18 CPU or 9 GPU nodes
• 8 IB FDR 36p 56 Gb/s switches
• Dry-disconnect and HEX water cooling
• HVDC 480V
HP Apollo 8000 CDU:
• Heat exchanger
• Vacuum pump
• Cooling controller
• IB FDR 36p 56 Gb/s (18+3) dist+core switches
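The rack figures can be cross-checked against the node counts of the whole system. The sketch below reads "18 CPU or 9 GPU nodes" as a per-cell figure, so a tray holds either 2 CPU nodes or 1 GPU node; that reading is an interpretation of the slide, not something it states explicitly.

```python
COMPUTE_RACKS = 16       # 20 racks in total, 4 of them CDUs
CELLS_PER_RACK = 8
TRAYS_PER_CELL = 9
CPU_NODES_PER_TRAY = 2   # interpretation: 18 CPU nodes per 9-tray cell
GPU_NODES_PER_TRAY = 1   # interpretation: 9 GPU nodes per 9-tray cell

CPU_NODES = 2160
GPU_NODES = 72

trays_available = COMPUTE_RACKS * CELLS_PER_RACK * TRAYS_PER_CELL
trays_needed = (
    CPU_NODES // CPU_NODES_PER_TRAY + GPU_NODES // GPU_NODES_PER_TRAY
)
max_cpu_nodes_per_rack = CELLS_PER_RACK * TRAYS_PER_CELL * CPU_NODES_PER_TRAY

print(f"trays available: {trays_available}")
print(f"trays needed:    {trays_needed}")
print(f"max CPU nodes per rack: {max_cpu_nodes_per_rack}")
```

Under this reading the 16 compute racks are exactly full, which is consistent with the density argument made for the Apollo 8000.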
Prometheus – compute island
Prometheus – IB network
Service island: service nodes, I/O nodes
IB core network
3 compute islands: 576 CPU nodes each
1 compute island: 432 CPU nodes + 72 GPU nodes
Over 250 Tb/s aggregate throughput
• 30 km of cables
• 217 switches
• >10 000 ports
Monitoring
SLURM node states
IB network traffic
Monitoring of:
• CPU frequencies and temperatures
• Memory usage
• NFS and Lustre bandwidth/IOPS/MDOPS
• Power and cooling
Linpack power draw
Linpack water temperatures
Top500 and Green500
3rd-level submission (Nov 2015)
#72, 2068 Mflops/W
#1 petascale x86 system in Europe
submission after expansion
(Nov 2015)
#38, 1.67 PFLOPS Rmax
GPUs not used for the run
Application & software
Academic workload
Lots of small/medium jobs
Few big jobs
330 projects
750 users
Main fields:
Chemistry
Biochemistry (pharmaceuticals)
Astrophysics
thatmpi code
Institute of Nuclear Physics PAS in Krakow
Study of non-relativistic shock waves hosted by
supernova remnants that are believed to generate
most of the Galactic cosmic-rays
Thatmpi: Two-and-a-Half-Dimensional Astroparticle
Stanford code with MPI (PIC)
2.5-dimensional particle dynamics
fully relativistic with electro-magnetism
colliding plasma jets with perpendicular B-field
large simulations: up to 10k cores/run
Applications
thatmpi: left leptons jet animation
A. Dorobisz, M. Kotwica
thatmpi: joint development
low-level:
vectorization
flow control refactoring
register and cache-use optimization
high-level:
modernization from FORTRAN77 to Fortran2013
new particle sorting method
portable data dumping with HDF5
communication buffering
total time reduction: over 20%
>1.2 MWh less energy per run (350 nodes, 60h)
Future: 3D, domain partitioning, code refactoring
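The energy figure above can be sanity-checked by asking what average node power it implies. The slide does not say whether 60 h is the run time before or after the optimization, so the sketch below evaluates both readings. It assumes a time reduction of exactly 20% and savings of exactly 1.2 MWh, although the slide gives both only as lower bounds.

```python
def implied_node_power_kw(
    energy_saved_kwh: float, nodes: int, hours_saved: float
) -> float:
    """Average power per node implied by an energy saving over saved node-hours."""
    return energy_saved_kwh / (nodes * hours_saved)


NODES = 350
RUN_HOURS = 60.0
TIME_REDUCTION = 0.20      # slide says "over 20%"; taken as exactly 20% here
ENERGY_SAVED_KWH = 1200.0  # slide says ">1.2 MWh"; taken as exactly 1.2 MWh

# Reading A (assumption): 60 h is the run time before optimization.
hours_saved_a = RUN_HOURS * TIME_REDUCTION
# Reading B (assumption): 60 h is the run time after optimization.
hours_saved_b = RUN_HOURS / (1 - TIME_REDUCTION) - RUN_HOURS

power_a_kw = implied_node_power_kw(ENERGY_SAVED_KWH, NODES, hours_saved_a)
power_b_kw = implied_node_power_kw(ENERGY_SAVED_KWH, NODES, hours_saved_b)

print(f"Reading A: {hours_saved_a:.0f} h saved, {power_a_kw * 1000:.0f} W per node")
print(f"Reading B: {hours_saved_b:.0f} h saved, {power_b_kw * 1000:.0f} W per node")
```

Both readings give a few hundred watts per node, below the roughly 360 W per node that 800 kW peak power spread over 2232 nodes would suggest, so the claimed saving is plausible.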
Lessons learned
There will be leaks!
A dry cooler seems simpler than a chiller, but the
whole infrastructure is not
Sysadmins need to get a degree in plumbing
Traditional facilities people don’t understand HPC
SCADA systems are dumb and insecure
Monitor everything, anytime, keep historic data
Keep data easy to correlate
Avoid SPOFs
Never settle for anything less than full load testing
Know your costs, calculate TCO
Look at hardware, middleware and software
Thank you!

GenAI Risks & Security Meetup 01052024.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

HPC DAY 2017 | Prometheus - energy efficient supercomputing

  • 2. ACC Cyfronet AGH-UST: established in 1973; part of AGH University of Science and Technology in Krakow, Poland; member of the PIONIER consortium; operator of the Krakow MAN; centre of competence in HPC and Grid Computing; provides free computing resources for scientific institutions; home for supercomputers
  • 4. Liquid Cooling. Water: up to 1000x more efficient heat exchange than air; less energy needed to move the coolant; hardware (CPUs, DIMMs) can handle ~80 °C; CPU/GPU vendors show TDPs up to 300 W. Challenge: cool 100% of HW with liquid (network switches, PSUs)
  • 5. MTBF. The less movement the better: fewer pumps, fewer fans, fewer HDDs. Example: pump MTBF 50 000 hrs; fan MTBF 50 000 hrs; 1800-node system MTBF 7 hrs
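The 7-hour figure on the MTBF slide is consistent with the usual series-system approximation: if the system fails whenever any one of its independent components fails, and failure rates are constant, the system MTBF is the component MTBF divided by the number of components. A minimal sketch, assuming four moving parts (pumps or fans) per node; that count is not on the slide and is chosen here only because it reproduces the quoted number:

```python
def series_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """MTBF of a system that fails when any one of its identical,
    independent components fails (constant failure rates assumed)."""
    if component_count <= 0:
        raise ValueError("component_count must be positive")
    return component_mtbf_hours / component_count


NODES = 1800                 # from the slide
COMPONENT_MTBF_H = 50_000    # pump or fan MTBF, from the slide
MOVING_PARTS_PER_NODE = 4    # assumption, not stated on the slide

system_mtbf = series_mtbf_hours(COMPONENT_MTBF_H, NODES * MOVING_PARTS_PER_NODE)
print(f"{system_mtbf:.1f} h")  # roughly 7 hours
```

Halving the number of moving parts doubles the estimate, which is the argument behind "the less movement the better".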
  • 6. Prometheus: HP Apollo 8000; 16 m², 20 racks (4 CDU, 16 compute); 2.4 PFLOPS; PUE <1.05, 800 kW peak power; 2232 nodes: 2160 CPU nodes (2x Intel Haswell E5-2680v3, 128 GB RAM, IB FDR 56 Gb/s, Ethernet 1 Gb/s) and 72 GPU nodes (+2 NVIDIA Tesla K40d); 53568 cores, up to 13824 per island; 279 TB DDR4 RAM; CentOS 7 + SLURM
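The headline totals on this slide can be cross-checked from the per-node figures. A small sketch, assuming every node (CPU and GPU alike) has the same two 12-core CPUs and 128 GB of RAM, and that the quoted "TB" is a binary unit (1 TB = 1024 GB); both assumptions are inferred from the numbers rather than stated on the slide:

```python
CPU_NODES = 2160
GPU_NODES = 72
CORES_PER_NODE = 2 * 12      # two 12-core E5-2680 v3 CPUs (24 cores, slide 14)
RAM_PER_NODE_GB = 128
ISLAND_NODES = 576           # largest compute island (slide 17)

total_nodes = CPU_NODES + GPU_NODES
total_cores = total_nodes * CORES_PER_NODE
island_cores = ISLAND_NODES * CORES_PER_NODE
total_ram_tb = total_nodes * RAM_PER_NODE_GB / 1024  # binary TB assumed

print(total_nodes, total_cores, island_cores, total_ram_tb)
```

Under these assumptions the results match the slide: 2232 nodes, 53568 cores, 13824 cores per island and 279 TB of RAM.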
  • 7. Prometheus storage: diskless compute nodes; separate project for storage; DDN SFA12kx hardware; Lustre-based; 2 file systems: Scratch (120 GB/s, 5 PB usable space) and Archive (60 GB/s, 5 PB usable space); HSM-ready; NFS for $HOME and software
  • 8. Why Apollo 8000? Most energy efficient; the only solution with 100% warm water cooling; highest density; lowest TCO
  • 9. Even more Apollo: focuses also on the ‘1’ in PUE! Power distribution; fewer fans; detailed monitoring; ‘energy to solution’; dry node maintenance; fewer cables; prefabricated piping; simplified management
  • 10. Deployment timeline: Day 0 - contract signed (20.10.2014); Day 23 - installation of the primary loop starts; Day 35 - first delivery (service island); Day 56 - Apollo piping arrives; Day 98 - 1st and 2nd island delivered; Day 101 - 3rd island delivered; Day 111 - basic acceptance ends. Official launch event on 27.04.2015
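The day offsets on the timeline can be turned into calendar dates. A short sketch, assuming the offsets are calendar days counted from the contract date; the slide does not say whether working days were meant:

```python
from datetime import date, timedelta

CONTRACT_SIGNED = date(2014, 10, 20)   # Day 0 on the slide
LAUNCH_EVENT = date(2015, 4, 27)       # official launch event

milestones = {
    0: "contract signed",
    23: "installation of the primary loop starts",
    35: "first delivery (service island)",
    56: "Apollo piping arrives",
    98: "1st and 2nd island delivered",
    101: "3rd island delivered",
    111: "basic acceptance ends",
}

# Map each day offset to its calendar date (calendar days assumed).
schedule = {day: CONTRACT_SIGNED + timedelta(days=day) for day in milestones}

for day, label in milestones.items():
    print(f"Day {day:>3}  {schedule[day]:%d.%m.%Y}  {label}")

launch_day = (LAUNCH_EVENT - CONTRACT_SIGNED).days
print(f"Launch event on day {launch_day}")
```

On that reading, basic acceptance ended on 08.02.2015 and the launch event took place on day 189.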
  • 11. Facility preparation: primary loop installation took 5 weeks; secondary (prefabricated) just 1 week; upgrade of the raised floor done "just in case"; additional pipes for leakage/condensation drain; water dam with emergency drain; a lot of space needed for the hardware deliveries (over 100 pallets)
  • 14. Prometheus - node: HP XL730f/XL750f Gen9; 2x Intel Xeon E5-2680 v3 (Haswell); 24 cores, 2100-3300 MHz; 30 MB cache, 128 GB RAM DDR4; Mellanox ConnectX-3 IB FDR 56 Gb/s
  • 15. Prometheus - rack. HP Apollo 8000: 8 cells, 9 trays each, 18 CPU or 9 GPU nodes; 8 IB FDR 36p 56 Gb/s switches; dry-disconnect and HEX water cooling; HVDC 480V. HP Apollo 8000 CDU: heat exchanger; vacuum pump; cooling controller; IB FDR 36p 56 Gb/s (18+3) dist+core switches
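The rack layout can be checked against the 16 compute racks quoted on slide 6. A rough sketch, assuming "18 CPU or 9 GPU nodes" is the capacity of one cell (9 trays holding two CPU nodes or one GPU node each) and that CPU and GPU nodes are not mixed within a rack; neither detail is spelled out on the slide:

```python
import math

CELLS_PER_RACK = 8
CPU_NODES_PER_CELL = 18   # assumed: 9 trays x 2 CPU nodes
GPU_NODES_PER_CELL = 9    # assumed: 9 trays x 1 GPU node

CPU_NODES = 2160          # from slide 6
GPU_NODES = 72            # from slide 6

cpu_nodes_per_rack = CELLS_PER_RACK * CPU_NODES_PER_CELL
gpu_nodes_per_rack = CELLS_PER_RACK * GPU_NODES_PER_CELL

cpu_racks = math.ceil(CPU_NODES / cpu_nodes_per_rack)
gpu_racks = math.ceil(GPU_NODES / gpu_nodes_per_rack)

print(cpu_racks, gpu_racks, cpu_racks + gpu_racks)
```

With these assumptions the node counts fill 15 CPU racks and 1 GPU rack, which agrees with the 16 compute racks listed for the system.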
  • 17. Prometheus - IB network: service nodes (service island), I/O nodes, the IB core network and four compute islands: three with 576 CPU nodes each and one with 432 CPU nodes and 72 GPU nodes. Over 250 Tb/s aggregate throughput; 30 km of cables; 217 switches; >10 000 ports
  • 18. Monitoring: SLURM node states; IB network traffic; monitoring of CPU frequencies and temperatures, memory usage, NFS and Lustre bandwidth/IOPS/MDOPS, power and cooling
  • 21. Top500 and Green500: 3rd-level submission (Nov 2015): #72, 2068 Mflops/W; #1 petascale x86 system in Europe; submission after expansion (Nov 2015): #38, 1.67 PFLOPS Rmax; GPUs not used for the run
  • 22. Application & software: academic workload; lots of small/medium jobs; few big jobs; 330 projects; 750 users. Main fields: chemistry, biochemistry (pharmaceuticals), astrophysics
  • 23. Applications: thatmpi code. Institute of Nuclear Physics PAS in Krakow. Study of non-relativistic shock waves hosted by supernova remnants, which are believed to generate most of the Galactic cosmic rays. Thatmpi: Two-and-a-Half-Dimensional Astroparticle Stanford code with MPI (PIC); 2.5-dimensional particle dynamics; fully relativistic with electromagnetism; colliding plasma jets with perpendicular B-field; large simulations: up to 10k cores/run
  • 24. thatmpi: left leptons jet animation A. Dorobisz, M. Kotwica
  • 25. thatmpi: joint development. Low-level: vectorization, flow control refactoring, register and cache-use optimization. High-level: modernization from FORTRAN77 to Fortran2013, new particle sorting method, portable data dumping with HDF5, communication buffering. Total time reduction: over 20%; >1.2 MWh less energy per run (350 nodes, 60h). Future: 3D, domain partitioning, code refactoring
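The energy saving quoted for thatmpi can be related to the runtime reduction with a back-of-the-envelope calculation. A sketch under stated assumptions: the 60 h is taken to be the run length before optimization, and the lower bounds (20%, 1.2 MWh) are used as if they were exact; the slide gives no per-node power figure, so the value below is only what those numbers imply:

```python
NODES = 350               # from the slide
RUN_HOURS = 60            # from the slide; assumed to be the pre-optimization run length
TIME_REDUCTION = 0.20     # "over 20%", lower bound
ENERGY_SAVED_KWH = 1200   # ">1.2 MWh", lower bound

# Node-hours no longer needed after the optimization.
node_hours_saved = NODES * RUN_HOURS * TIME_REDUCTION

# Average per-node power implied by the quoted energy saving.
implied_node_power_w = ENERGY_SAVED_KWH / node_hours_saved * 1000

print(f"{node_hours_saved:.0f} node-hours saved")
print(f"{implied_node_power_w:.0f} W per node implied")
```

This gives about 4200 node-hours and roughly 286 W per node, the same order as the system-wide average of about 358 W per node obtained from 800 kW peak power over 2232 nodes.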
  • 26. Lessons learned: There will be leaks! A drycooler seems simpler than a chiller, but the whole infrastructure is not. Sysadmins need to get a degree in plumbing. Traditional facilities people don’t understand HPC. SCADA systems are dumb and insecure. Monitor everything, anytime; keep historic data. Keep data easy to correlate. Avoid SPOFs. Never settle for anything less than full load testing. Know your costs, calculate TCO. Look at hardware, middleware and software