NICS, Adaptive Computing, and Intel: Leadership in HPC

Troy Baer
Senior HPC System Administrator
NICS
Overview
• Introduction to NICS
• NICS and Adaptive Computing
• NICS and Intel
  – SC12 Green 500 effort
• Going Forward
National Institute for Computational Sciences:
   A University of Tennessee / ORNL Partnership

• NICS is an NSF-funded HPC center
 – Founded in 2007
 – Operated by the University of Tennessee, located
   at ORNL
 – XSEDE Partner and Service Provider
• XSEDE Systems
 – Kraken (Cray XT5, 112,984 Opteron cores)
 – Nautilus (SGI UV, 1,152 Nehalem cores + 16 M2070
   GPUs)
 – Keeneland final system (HP GPU cluster, 4,224
   Sandy Bridge cores + 792 M2090 GPUs) in
   conjunction with Georgia Tech
Other Systems and Projects at NICS

• Non-XSEDE Systems
 – Keeneland initial delivery system (HP GPU cluster)
   in conjunction with Georgia Tech
 – Ares (Cray XE/XK6)
 – Beacon (Appro/Cray cluster; more on this later...)
 – Darter (Cray XC30; more on this later...)
• Associated Centers and Projects
 – Application Acceleration Center of Excellence
   (AACE)
   • Parent project for Beacon
 – Remote Data Analysis and Visualization (RDAV)
   project
   • Parent project for Nautilus
NICS and Adaptive Computing
• NICS and Adaptive have been working together
  literally since the founding of the center
• Achievements
 – Kraken: 90-95% utilization on a petaflop-class system
   for 3 years and counting!
   • Over 3 billion core-hours delivered in total, 965 million
     delivered in CY2012
   • Delivering ~65% of all XSEDE computing cycles until very
     recently
   • Bi-modal scheduling for capability vs. capacity
 – Athena (Cray XT4): Dedicated access for COLA
   climate modeling group for ~6 months
 – Kraken/Athena: Annual OU CAPS Spring Experiment
   (storm forecasting)
 – Nautilus: NUMA+GPU scheduling
 – KIDS and KFS: GPU scheduling test bed
NICS and Intel

• AACE was born of conversations between NICS,
  ORNL, and Intel in early 2011
• Beacon project
 – Application readiness for Intel Xeon Phi
 – NSF STCI award provided people funding and initial
   hardware
   • 8 funded science teams
   • Open call for more science teams just ended
 – Second phase of hardware funded by the University of
   Tennessee system and the state of Tennessee
   • Data-intensive computing
   • Power efficiency research
BEACON                        Phase 1                  Phase 2
Compute Nodes                 16 Appro Grizzly Pass    48 Appro GreenBlade GBN814N
Node Processor                2x 8-core Sandy Bridge   2x 8-core Sandy Bridge
Memory/Node (GB)              64                       256
SSD/Node (GB)                 160                      960
Xeon Phis/Node                2                        4
Interconnect                  QDR InfiniBand           FDR InfiniBand
Bandwidth to Storage (GB/s)   ~2.5                     ~15
OS                            CentOS 6.2               CentOS 6.2
Installation                  NFS-root                 Diskful
Batch Environment             TORQUE/Moab              TORQUE/Moab
SC'12 Green 500 Effort

• In the run-up to the Supercomputing 2012
  conference, NICS, Intel, and Appro (now Cray)
  decided to take a shot at #1 on the Green 500
  list
• People worked on the system literally around
  the clock in Tennessee, California, India, and
  Germany for a month to make this happen!
• Result: New record of 112.2 TF/s @ 44.89 kW
  (i.e. 2.499 GF/W)
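
The efficiency figure follows directly from the two measured numbers above. A minimal sketch of the unit conversion (the function name is illustrative, not from the talk):

```python
def gflops_per_watt(tflops: float, kilowatts: float) -> float:
    """Convert a sustained rate in TF/s and a power draw in kW to GF/W."""
    gflops = tflops * 1_000.0   # 1 TF/s = 1,000 GF/s
    watts = kilowatts * 1_000.0  # 1 kW = 1,000 W
    return gflops / watts


# Figures quoted on the slide: 112.2 TF/s at 44.89 kW.
efficiency = gflops_per_watt(112.2, 44.89)
print(f"{efficiency:.3f} GF/W")  # prints "2.499 GF/W"
```

Because both scale factors are 1,000, TF/s divided by kW gives GF/W with no further conversion.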
Stupid Phi Tricks
• Xeon Phis have a number of programming models
  –   Offload (like GPUs)
  –   Reverse offload (i.e. Phis offloading to the host)
  –   Native mode (i.e. running MPI ranks on Phis)
  –   Various hybrids thereof

• Xeon Phis are basically embedded x86_64 Linux
  boxes, complete with SSH, NFS, etc... which allows
  you to do all sorts of clever and/or hilarious things in
  job prologues and epilogues
  – NFS-export Lustre and/or local scratch from host to Phis
       • The Phis' BusyBox NFS client currently doesn't support NFS v3
         locking – Intel is working on this
  – Provision the job owner's uid (and only the job owner's uid)
    on MICs at job start
  – Reboot Phis between jobs
       • A bit slower than one might like – Intel is working on this as well
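
The slides do not say how the job owner's uid was provisioned on the Phis. As one hedged illustration of the idea, a prologue helper could look up the owner on the host and emit a single passwd-format line for just that user, to be installed on the card by whatever mechanism the site uses. Everything below (function names, the example user) is hypothetical:

```python
import pwd
from types import SimpleNamespace


def passwd_line(entry) -> str:
    """Format one account as a passwd(5)-style line.

    `entry` is anything with the pw_* attributes of pwd.struct_passwd.
    The password field is written as "x" so no hash is copied to the card.
    """
    fields = (
        entry.pw_name,
        "x",
        str(entry.pw_uid),
        str(entry.pw_gid),
        entry.pw_gecos,
        entry.pw_dir,
        entry.pw_shell,
    )
    return ":".join(fields)


def owner_passwd_line(job_owner: str) -> str:
    """Look up the job owner on the host and format only that account."""
    return passwd_line(pwd.getpwnam(job_owner))


# Demonstration with a made-up account, so the sketch runs on any host.
demo = SimpleNamespace(
    pw_name="alice", pw_uid=12345, pw_gid=100,
    pw_gecos="Hypothetical User", pw_dir="/home/alice", pw_shell="/bin/sh",
)
print(passwd_line(demo))
```

Emitting exactly one line per job keeps every other user's account off the card, which is the point of the "only the job owner's uid" bullet.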
Going Forward
• New systems
     – Beacon Phase 2 (just accepted)
     – Darter (Cray XC30, just received and accepted)
     – Hopefully more in the future...

• New architectures make for interesting
    challenges WRT allocations and accounting
     – With GPUs and MICs becoming more
         commonplace, the notion of a “CPU-hour” or
         “core-hour” is even less meaningful than it was
         before.
     – Should the new accounting unit be the “node-hour”?
• Growing gap between capability/hero users and
    capacity/canned-code users needs to be
    addressed somehow
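
To make the accounting question concrete, here is a small sketch comparing the two charging units for the same job. The job shape is an invented example; the 16 cores per node matches the 2x 8-core Sandy Bridge Beacon nodes described earlier, and the class and function names are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    nodes: int
    cores_per_node: int
    wall_hours: float


def core_hours(job: Job) -> float:
    """Traditional charge: counts host CPU cores only."""
    return job.nodes * job.cores_per_node * job.wall_hours


def node_hours(job: Job) -> float:
    """Node-based charge: counts whole nodes, whatever is inside them."""
    return job.nodes * job.wall_hours


# Hypothetical job: 4 nodes with 16 host cores each, for 2.5 hours.
job = Job(nodes=4, cores_per_node=16, wall_hours=2.5)
print(core_hours(job))  # 160.0
print(node_hours(job))  # 10.0
```

The core-hour charge is the same whether or not the job touches the accelerators in each node, which is why it says less and less about what was actually consumed. The node-hour charge sidesteps that by billing for the whole node.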
