HPC Resource Accounting: Progress Against Allocation — Lessons Learned
Ken Schumacher
LISA 2014 - Seattle, WA
12 November 2014
My Background in Batch System Accounting
In other words, "Why should I listen to this guy?"
• I've been at Fermilab since Jan 1997, nearly 18 years
- I started supporting a few specific experiments. My group managed Fermilab's central Unix cluster. I later moved to batch farms using LSF as the batch scheduler.
- I was also a developer on the team that built the first prototype for the Gratia Grid Accounting System (used by OSG, the Open Science Grid).
• For the last 5 years I have been part of the HPPC Group administering several InfiniBand-based HPC clusters.
- I generate weekly resource accounting reports
- I work with Principal Investigators (PIs) to manage allocations
- I monitor our compliance with SLOs and Fermilab policies
Why call it Resource Accounting? 
• First: Resources
- Wikipedia: "a source or supply from which benefit is produced"
- Compute clusters offering unique resources, designed around the needs of a particular group of researchers
• LQCD Cluster - actually 4 CPU and 2 GPU sub-clusters
• Cosmology Cluster
• Accelerator Modeling Cluster
- Also offering shared on-line storage in our Lustre clusters
- Access to offline storage service from the DMS (Data Movement and Storage) department
- And the ever-present staff as a resource. But accounting for staff is outside the scope of my presentation.
Why do I call it Resource Accounting? 
• Next: Accounting
- Noun: a report or description of an event or experience; a detailed account of what has been achieved
• The stakeholders that oversee and fund these collaborations and their research need to know several things
• Not just how their money was spent, but what it accomplished, in the form of:
• Availability/uptime of the computers, storage, and services
• Usage by projects within the collaboration of the resources offered
• Papers and reports of progress on the research being conducted
• The usage reports allow for budgeting and planning for future projects, including new hardware acquisition
Current State of the Reporting System 
• The reporting tools that we use today are a work in progress
- Over the last four years there has been a great improvement in the workflow of generating (automating) the weekly report
- The scope of the reporting has been revised as the requirements have expanded
- There is a significant list of changes and improvements still needed
• I am here to share those things that became important (and useful) as the scope of our reports expanded
- We now include additional types of resources (GPU and storage) as part of the allocation
- We added more detailed reporting of the usage by projects so we can adjust both quotas and batch submission priorities
Who are my customers?
• The HPPC department supports several massively parallel compute clusters used by different groups
- Theoretical particle physicists associated with Lattice Quantum Chromodynamics (LQCD)
• The users within this collaboration are from all over the world
• The collaboration has compute resources at several institutions
- Astrophysicists at Fermilab using our Cosmology Cluster
- Fermilab scientists, engineers, and software developers doing Accelerator Modeling on the AMR cluster
[Photos: Ds Cluster at FNAL; 10q Cluster at JLab]
Disclaimer: I am not a Theoretical Particle Physicist 
Why we need High Performance Computing (HPC) clusters 
Discovered in the early 1970s, the theory of quantum chromodynamics (QCD) consists of equations that describe the strong force that causes quarks to clump together to form protons and other constituents of matter. For a long time, solving these equations was a struggle. But in the last decade, using powerful supercomputers, theorists have finally been able to solve the equations of QCD with high precision.
Lattice Quantum Chromodynamics Computing at FNAL 
• Fermilab's LQCD Computing cluster is made up of a few sub-clusters with similar configurations
- Sub-clusters of conventional CPU-based nodes
• Jpsi cluster - decommissioned May 2014; has been our standard for normalized core-hours since 2010. 856 nodes, dual-socket quad-core Opteron 2352 (2.1 GHz) on DDR InfiniBand fabric
• Ds cluster - 420 nodes with quad-socket eight-core Opteron 6128 (2.0 GHz) on QDR InfiniBand fabric. This is 13,440 cores.
• Bc cluster - 224 nodes with quad-socket eight-core Opteron 6320 (2.8 GHz) on QDR InfiniBand fabric. This is 7,168 cores.
• Pi0 cluster - 214 nodes with dual-socket eight-core Intel E5-2650v2 "Ivy Bridge" (2.6 GHz) on QDR InfiniBand fabric. This is 3,424 cores.
(continued...)
Lattice Quantum Chromodynamics Computing 
• Fermilab's LQCD Computing cluster (...continued)
- Sub-clusters of nodes enhanced with GPU processors
• Dsg cluster - 76 nodes with quad-socket eight-core Opteron 6128 (2.0 GHz) and two NVIDIA Tesla M2050 GPUs on QDR InfiniBand fabric
• Pi0g cluster - 32 nodes with dual-socket eight-core Intel E5-2650v2 "Ivy Bridge" (2.6 GHz) and four NVIDIA Tesla K40m GPUs on QDR fabric
- On-line disk-based storage in a Lustre cluster
• The LQCD Lustre cluster has 964 TB of on-line storage after our most recent expansion
• The Cosmology Lustre cluster has 129 TB of on-line storage
- Tape-based storage in our SL8500 robotic tape libraries
• 1,617 LTO4 tapes (1,293.6 TB)
• 331 10KC tapes (1,655.0 TB)
Other Compute Resources within USQCD 
• From the PY 2014-15 Call for Proposals
- Compute resources dedicated to Lattice QCD
• 71 M BG/Q core-hours at BNL
• 397 M Jpsi core-hours on clusters at FNAL and JLab
• 8.9 M GPU-hours on GPU clusters at FNAL and JLab
- Compute resource awards to USQCD from the DOE's INCITE program
• 100 M Cray XK7 core-hours at Oak Ridge OLCF
• 240 M BG/Q core-hours at Argonne ALCF
Allocations at FNAL/QCD 
• The USQCD Allocation Committee allocates time on the clusters in units of "normalized core-hours"
• The program year runs July 1 through June 30
• Three classes of allocations
- A - large allocations to "support calculations of benefit for the whole USQCD Collaboration and/or addressing critical scientific needs"
- B - medium allocations (<2.5M c-h) "intended to support calculations in an early stage of development which address, or have the potential to address, scientific needs of the collaboration"
- C - small and/or short-term allocations, to explore / test / benchmark calculations with the potential to address scientific needs of the collaboration
Lesson 1 - We needed a normalized "currency" 
• The allocation is like a budget.
• We base allocations on normalized core-hours.
• Normalized core-hours are basically our currency.
• CPU performance and GPU performance are like apples and oranges.
• GPUs are designed for vector math and floating-point calculations.
• Some simulations rely more heavily on floating point.
• Code compiled for GPUs cannot run on CPUs.
• We had to develop new benchmarks for use with GPUs.
• So projects get separate allocations for CPU and GPU processing.
Normalizing the USQCD Clusters at Fermilab 
• Our existing HPC clusters for LQCD
» Ds cluster: factor of 1.33 nC-H, 13,440 cores, 17,875 nCores
» Bc cluster: factor of 1.48 nC-H, 7,168 cores, 10,609 nCores
» Pi0 cluster: factor of 3.14 nC-H, 3,424 cores, 10,751 nCores
» Ds GPU cluster: factor of 1.0 nC-H, 152 GPUs, 152 nGPUs
» Pi0 GPU cluster: factor of 2.2 nC-H, 128 GPUs, 280 nGPUs
• Storage: tape at 3K nC-H per TB, disk at 30K nC-H per TB
• PY July 1, 2014 through June 30, 2015 allocation
» 270,375,000 CPU nC-H
» 2,815,000 GPU nC-H
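The factors above convert raw hours on each sub-cluster into Jpsi-equivalent normalized core-hours, and storage is charged against the same currency at flat rates. A minimal Python sketch of that arithmetic, using the figures from this slide (the function and variable names are illustrative, not our actual report scripts):

```python
# Minimal sketch of the normalization arithmetic; factors and storage
# rates are taken from the slide above, names are illustrative.

CLUSTER_FACTORS = {        # normalized hours per raw core-hour
    "ds":   1.33,
    "bc":   1.48,
    "pi0":  3.14,
    "dsg":  1.0,           # GPU clusters: factor per raw GPU-hour
    "pi0g": 2.2,
}

TAPE_NCH_PER_TB = 3000     # tape charged at 3K nC-H per TB
DISK_NCH_PER_TB = 30000    # disk charged at 30K nC-H per TB

def normalized_hours(cluster: str, raw_hours: float) -> float:
    """Convert raw core-hours (or GPU-hours) to Jpsi-normalized hours."""
    return raw_hours * CLUSTER_FACTORS[cluster]

def storage_debit(tape_tb: float, disk_tb: float) -> float:
    """Charge storage against the allocation in normalized core-hours."""
    return tape_tb * TAPE_NCH_PER_TB + disk_tb * DISK_NCH_PER_TB

# Example: a 64-core job running 24 hours on the Pi0 cluster
print(round(normalized_hours("pi0", 64 * 24), 2))   # 4823.04 nC-H
# Example: debit for 251 TB of tape and 50 TB of disk
print(storage_debit(251, 50))                       # 2253000 nC-H
```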
Sample Allocation Notification Letter 
Hello Professor,
I am setting the allocations and configuring accounts for the USQCD Scientific Program Committee allocations at Fermilab for the program year 2014-2015. I have you listed as the PI contact for the following allocation.
Flavor Physics from B, D, and K Mesons on the 2+1+1-Flavor HISQ Ensembles
49.28 M CPU hours, 825 K GPU hours, 50 TB disk and 251 TB tape
If this does not match your information, please let us know. We need two things from you, please:
1) Your choice of a project name
2) A list of the users allowed to submit jobs to this project name
FNAL Progress Against Allocation Reports 
• We use a series of homegrown Perl and Python scripts to generate our weekly progress-against-allocation report
- A summary of who used each sub-cluster
- A listing of specific credits and debits for the week
- A YTD summary of CPU cluster usage
- A YTD summary of GPU cluster usage
• We can include debits/credits against the allocation for several reasons (explained later):
- Credits for reduced performance during load-shed events
- Debits for storage: long-term (tape) and on-line (disk)
- Credits for failed jobs (due to a systems failure)
- Debits for dedicated nodes set aside for a project
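The slides do not show the scripts themselves, but the heart of the weekly report is an aggregation like the following hedged sketch: sum each project's normalized usage across sub-clusters. The record fields and the factor table are assumptions for illustration, not the actual accounting-log format.

```python
# Minimal sketch of the weekly aggregation step, assuming job records
# have already been parsed out of the scheduler's accounting logs.
from collections import defaultdict

CLUSTER_FACTORS = {"ds": 1.33, "bc": 1.48, "pi0": 3.14}

def weekly_usage(job_records):
    """Sum normalized core-hours per project across all sub-clusters.

    Each record is an illustrative dict like:
      {"project": "charmonium", "cluster": "ds",
       "cores": 128, "wall_hours": 6.5}
    """
    totals = defaultdict(float)
    for job in job_records:
        raw = job["cores"] * job["wall_hours"]
        totals[job["project"]] += raw * CLUSTER_FACTORS[job["cluster"]]
    return {project: round(nch, 2) for project, nch in totals.items()}

jobs = [
    {"project": "charmonium", "cluster": "ds", "cores": 128, "wall_hours": 6.5},
    {"project": "hisq", "cluster": "pi0", "cores": 64, "wall_hours": 12.0},
]
print(weekly_usage(jobs))
# {'charmonium': 1106.56, 'hisq': 2411.52}
```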
Sample of a CPU Sub-cluster Weekly Detail Report 
Sample of a GPU Sub-cluster Weekly Detail Report
Sample of the Debits / Credits - Quarterly Tape Usage 
Part I - The Report Header 
• The header describes where we are in the program year.
• It also provides some explanation of the numbers in the report itself.
Part II - Allocation and Usage by Project (Weekly) 
• The left side of the weekly summary report has the summary usage for just this week, but across all the sub-clusters.
Part II - Allocation and Usage by Project (Allocation) 
• The middle part of the weekly summary report has the 
allocation granted and used PYTD by project. 
Part II - Allocation and Usage by Project (PYTD) 
• The right side of the weekly summary report has the summary usage for the program YTD across all the sub-clusters.
Lesson 2 - Adjustments to Batch Priorities (lower) 
• Project charmonium has just crossed over its allocation.
- I will go into the configuration files of the batch scheduler and change the priority for this one project
- The new priority is set to a negative number
- This makes any jobs that this project puts into the queue wait until the jobs of projects that still have allocation remaining are allowed to run
• A sub-cluster that is not currently billable is configured so that all projects, regardless of their allocation, run at equal priority.
• Opportunistic running
- If there are nodes available, we allow these over-allocation projects, or a project that simply does not have an allocation, to run on what would otherwise be idle nodes.
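A hedged sketch of that decision in Python, rather than the actual scheduler configuration syntax: a project over (or without) an allocation gets a negative priority so its jobs run only opportunistically behind funded work. The priority values are illustrative.

```python
# Sketch of the priority-adjustment rule described above; the concrete
# priority numbers and function name are assumptions for illustration.

DEFAULT_PRIORITY = 0
OPPORTUNISTIC_PRIORITY = -100   # illustrative negative priority

def batch_priority(used_nch: float, allocated_nch: float) -> int:
    """Return the priority to write into the scheduler config."""
    if allocated_nch <= 0 or used_nch > allocated_nch:
        # Over allocation, or un-allocated: wait behind funded projects
        # and run only on otherwise-idle nodes.
        return OPPORTUNISTIC_PRIORITY
    return DEFAULT_PRIORITY

# charmonium has used 2.6M of its 2.5M nC-H allocation:
print(batch_priority(used_nch=2_600_000, allocated_nch=2_500_000))  # -100
```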
Lesson 2 - Adjustments to Batch Priorities (increase) 
• There are occasions where we may increase the priority for one or more projects.
- To meet a deadline for a paper
- To generate simulation data needed as input to upcoming simulations
• In some cases we may dedicate some number of nodes to a specific project.
Part III - Totals and Progress Against Allocation (Weekly) 
• The left side of the weekly summary report has the summary usage for just this week, but across all the sub-clusters.
Part III - Totals and Progress Against Allocation (Allocation) 
• The middle part of the weekly summary report has the 
allocation granted and used PYTD by project. 
Part III - Totals and Progress Against Allocation (PYTD) 
• The right side of the weekly summary report has the summary usage for the program YTD across all the sub-clusters.
Lesson 3 - Credits for failed jobs 
• Occasionally a job will fail "softly".
• It does not crash, and it reports a successful completion, so it is billed against the allocation.
• When a soft failure is discovered, we manually calculate a credit to the project to reimburse the previous charges.
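Since the job was billed normally, the credit is simply the reverse of the normalized charge for the affected jobs. A minimal sketch of that manual calculation (names illustrative):

```python
# Sketch of the credit for a "soft" failure: reverse the normalized
# charge the job incurred. Function name is illustrative.

def soft_failure_credit(cores: int, wall_hours: float, factor: float) -> float:
    """Normalized core-hours to credit back to the project."""
    return cores * wall_hours * factor

# A failed 128-core, 10-hour job on the Bc cluster (factor 1.48):
print(round(soft_failure_credit(128, 10.0, 1.48), 1))  # 1894.4 nC-H credited
```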
Lesson 4 - Credits for reduced performance 
• We have had occasions during the summer where our cooling equipment could not keep up with the heat being generated
• Our Facilities group will notify us of a potential load-shed event
• During a load-shed event, some number of the nodes in our clusters are simply turned off
• The nodes that remain on-line have their clock speeds reduced, run at a decreased load, and have their wall-time limits increased
• All jobs during a load shed get a credit for the extra time used

• Sorry, no sample. We have been able to avoid load-shed events since July 1.
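The slides do not give the exact credit formula. One plausible sketch, under the assumption that the extra time is estimated from the clock-speed ratio of the throttled nodes, would be:

```python
# Hedged sketch only: assumes the load-shed credit is estimated from
# the clock-speed ratio. This is NOT a formula given in the talk.

def load_shed_credit(billed_nch: float,
                     reduced_ghz: float, nominal_ghz: float) -> float:
    """Credit back the fraction of the charge caused by the slowdown."""
    slowdown = nominal_ghz / reduced_ghz          # e.g. 2.6 / 2.0 = 1.3
    return billed_nch * (1.0 - 1.0 / slowdown)    # extra time that was billed

# Job billed 5,000 nC-H while Pi0 nodes ran at 2.0 GHz instead of 2.6:
print(round(load_shed_credit(5000, 2.0, 2.6), 1))  # ~1153.8 nC-H
```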
Lesson 5 - Usefulness of burn rates 
• Burn rates allow us to notify a PI when a project is using its allocation at a rate that is too high or too low.
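The talk does not define the thresholds; a minimal sketch of such a check, assuming the July 1 through June 30 program year described earlier and illustrative 25% tolerance bands, could look like this:

```python
# Sketch of a burn-rate check: compare the fraction of the allocation
# used against the fraction of the program year elapsed. The 25%
# tolerance bands are an assumption for illustration.
from datetime import date

def burn_rate_status(used_nch, allocated_nch, today, py_start):
    """Flag projects burning their allocation too fast or too slow."""
    elapsed = (today - py_start).days / 365.0
    used = used_nch / allocated_nch
    if used > elapsed * 1.25:
        return "burning too fast - may exhaust allocation early"
    if used < elapsed * 0.75:
        return "burning too slow - allocation may go unused"
    return "on track"

# 1.2M of 2.5M nC-H used as of the date of this talk:
print(burn_rate_status(1_200_000, 2_500_000,
                       date(2014, 11, 12), date(2014, 7, 1)))
```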
Summary of Lessons Learned 
• Normalized core-hours
- Using a standard unit across different facilities
- CPU vs GPU: these truly are "apples vs oranges"
- Charges for storage
• Adjustments to batch priorities for fair share
- Reduced priority for over-allocation or un-allocated projects
- Increased priority or dedicated nodes where needed
• Charges for dedicated nodes
• Credits for failed jobs and for load-shed events
- Failed jobs that are not the user's fault
- Load-shed events that are driven by "mother nature"
• Usefulness of burn rates
Time for your questions. 
And I appreciate your patience with my hearing loss. Please step to a microphone.
Feel free to find me in the Hallway Track. I am here all week.