glideinWMS for users


                    Monitoring and
                   troubleshooting
                 a glideinWMS-based
                    HTCondor pool
                     by Igor Sfiligoi (UCSD)



CERN, Dec 2012           glideinWMS monitoring   1
Scope of this talk


                 This talk describes what information is available
                 when troubleshooting a glideinWMS-based HTCondor pool,
                 and what tools you can use to mine it.


                 The reader is expected to already have a basic understanding of HTCondor and glideinWMS.



HTCondor Architecture
 ●   As a reminder
     [Figure: architecture diagram showing the VO Frontend (VO FE), two glidein
     factories (G.F.), and the Grid, next to the HTCondor pool: submit nodes
     (Schedd), central manager (Negotiator), and execute nodes. The labels +3
     and +1 sit between the VO FE and the two G.F. boxes.]

Typical user questions addressed in this talk

 ●   Where is/was my job running?
 ●   Why are my jobs not starting?
 ●   Why do my jobs take forever to finish?

Where is/was my job running?

Job progress monitoring
 ●   HTCondor provides two basic means to monitor
     job progress
      ●   Querying the system for current status
           –     Using the cmdline condor_q/condor_history
      ●   Parsing the job event log
           –     Either plain text or XML formatted
           –     Starting with 7.9.1, condor_history can be used
                 to extract the last known state
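
As a minimal sketch of the event-log route, the Python below pulls the header line of each event out of a plain-text job event log. It assumes the older "MM/DD HH:MM:SS" header layout that appears in the sample log later in this talk; the JobEvent type, the function name, and the submit-host address in the sample are made up for illustration.

```python
import re
from typing import Iterator, NamedTuple


class JobEvent(NamedTuple):
    code: int        # event type number, e.g. 1 = job executing
    cluster: int
    proc: int
    timestamp: str   # left as text: "MM/DD HH:MM:SS" carries no year
    message: str


# Header line of one event, e.g.
# 001 (20327.002.000) 12/03 00:46:33 Job executing on host: <...>
_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)


def parse_event_headers(text: str) -> Iterator[JobEvent]:
    """Yield one JobEvent per header line; all other lines are skipped."""
    for line in text.splitlines():
        m = _HEADER.match(line.strip())
        if m:
            yield JobEvent(int(m.group(1)), int(m.group(2)),
                           int(m.group(3)), m.group(5), m.group(6))


# Sample log text; the 000 line uses a made-up submit host address.
SAMPLE_LOG = """\
000 (20327.002.000) 12/03 00:40:10 Job submitted from host: <10.0.0.1:9618>
...
001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
...
"""

for event in parse_event_headers(SAMPLE_LOG):
    print(event.code, f"{event.cluster}.{event.proc}",
          event.timestamp, event.message)
```

Only header lines are recognized here; the detail lines that follow each header, and the XML log format, would need their own handling.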




Job status
 ●   Each Job has a status associated with it
      ●   An integer attribute called
          JobStatus
           –     But has well known semantics
                 associated with each value
 ●   Jobs start in the Idle state
      ●   Become Running if everything works fine
      ●   Completed when they terminate
 ●   If anything goes wrong, a Job will go into Hold
 ●   If removed before completion, will be Removed
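
The integer-to-name mapping can be sketched as a small lookup. The five values below are the ones HTCondor documents for the states named on this slide; newer releases define a few additional codes not covered in this talk, so the helper (a hypothetical name) falls back to a generic label.

```python
# JobStatus integer values and their conventional names.
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
}


def job_status_name(code: int) -> str:
    """Translate a JobStatus integer into a readable name."""
    return JOB_STATUS_NAMES.get(code, f"Unknown ({code})")


for code in (1, 2, 5):
    print(code, job_status_name(code))
```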
Monitoring the Job Status
 ●   Idle/Running/Held jobs can be polled with
     condor_q
      ●   Will query the Schedd daemon
 ●   Once they terminate, or are removed,
     they leave the Schedd queue
      ●   Are put into a file on disk
      ●   Can use condor_history
          to retrieve the last ClassAd
      ●   One exception: if a job was running when it was removed,
          but the execute node does not confirm the job was killed
          remotely, the job will be kept in the Schedd.
 ●   The job event log
     has all the state transitions
     (of course)

So, where is the job running?
 ●   Easy to get the machine name and/or IP
      ●   Standard HTCondor attribute
          RemoteHost & StartdIpAddr
 ●   But it may not necessarily make sense
      ●   Do you recognize all network domains?
      ●   And they could be on a private network!




Getting glidein attributes
 ●   Glideins have many more attributes that
     describe them
      ●   e.g. symbolic site name
          GLIDEIN_CMSSite
 ●   However, by default, you
     do not get this info in the Job ClassAd
 ●   But it is easy to add
      ●   <my attr> = $$(<glidein attr>:Unknown)
           –     Will get the info in MATCH_EXP_<my attr>
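
As an illustration of the pattern above, the sketch below generates the submit-file line for one attribute and the name under which the expanded value shows up after the match. The leading "+" is the condor_submit syntax for a custom job attribute; the two function names are hypothetical helpers, not part of any tool.

```python
def match_attr_line(job_attr: str, glidein_attr: str,
                    default: str = "Unknown") -> str:
    """Return a submit-file line copying a glidein attribute into the job ad."""
    # "$$(attr:default)" is expanded at match time with the glidein's value.
    return f'+{job_attr} = "$$({glidein_attr}:{default})"'


def expanded_attr_name(job_attr: str) -> str:
    """Name of the job attribute that holds the value after the match."""
    return f"MATCH_EXP_{job_attr}"


print(match_attr_line("JOB_Site", "GLIDEIN_Site"))
print(expanded_attr_name("JOB_Site"))
```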


Standard attributes
●   Standard glideinWMS attributes
    ●   JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
    ●   JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
    ●   JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
    ●   JOB_GLIDEIN_Schedd = "$$(GLIDEIN_Schedd:Unknown)"
    ●   JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:...)"    (useful for in-depth debugging)
    ●   JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
    ●   JOB_GLIDEIN_Site = "$$(GLIDEIN_Site:Unknown)"
●   Standard CMS glideinWMS attribute
    ●   JOB_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"

                             Configured by the HTCondor admin,
                             no need for the user to do anything
                             SUBMIT_EXPRS = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, ...


Getting them in the event log
 ●   You (or the admins) can also propagate
     the attributes into the event log
     job_ad_information_attrs = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, …
 ●   As a result you get “Job Ad” events
        ...
        001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
        ...
        028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
        TriggerEventTypeNumber = 1
        Cluster = 20327
        EventTypeNumber = 28
        ExecuteHost = "<193.48.85.94:38749>"
        JOB_CMSSite = "T2_FR_IPHC"
        EventTime = "2012-12-03T00:46:33"
        TriggerEventTypeName = "ULOG_EXECUTE"
        Proc = 2
        Subproc = 0
        CurrentTime = time()
        MyType = "ExecuteEvent"
        ...
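
A "Job Ad" event is easy to turn into a dictionary. The sketch below is one possible way to do it, assuming the "Name = value" layout shown in the sample; the function name is hypothetical, and anything that is neither a quoted string nor an integer (such as time()) is kept as plain text rather than evaluated.

```python
def parse_job_ad_event(lines):
    """Turn the 'Name = value' lines of a Job Ad event into a dict."""
    ad = {}
    for line in lines:
        name, sep, value = line.strip().partition(" = ")
        if not sep:
            continue  # header line, "..." separators, blank lines
        if len(value) >= 2 and value[0] == value[-1] == '"':
            ad[name] = value[1:-1]       # quoted string
        elif value.lstrip("-").isdigit():
            ad[name] = int(value)        # integer
        else:
            ad[name] = value             # expression, kept as text
    return ad


# A shortened copy of the event from the sample log.
SAMPLE_EVENT = """\
028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
TriggerEventTypeNumber = 1
Cluster = 20327
ExecuteHost = "<193.48.85.94:38749>"
JOB_CMSSite = "T2_FR_IPHC"
Proc = 2
CurrentTime = time()
...
""".splitlines()

ad = parse_job_ad_event(SAMPLE_EVENT)
print(ad["Cluster"], ad["Proc"], ad["JOB_CMSSite"])
```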


Why is my job not starting?

Troubleshooting process
 ●   First question
      ●   Do my jobs match any (logical) resource?
 ●   Once you are sure of that
      ●   Are there jobs from higher priority users?
      ●   Are desired sites just too busy?
      ●   Are there problems at desired site(s)?
 ●   If nothing gives a satisfying answer
      ●   It may be a glideinWMS misconfiguration,
          seek help from the VO FE admins

How do I know if my jobs match?
 ●   Good question!
      ●   Unfortunately, the answer is not trivial
 ●   The FE matching policy is not “public”
      ●   And, of course, there are no tools to probe it
 ●   You will have to rely on the FE admins to
     “explain” the policy
      ●   Hopefully in a human readable format
      ●   Hopefully without conversion errors!


An example FE policy
 ●   See the CMS FE talk for an actual
     high level view
 ●   The actual FE policy is a python expression
           A simple example – could be much more complex
            (glidein["attrs"]["GLIDEIN_CMSSite"]
               in job["DESIRED_Sites"].split(",")) and
            ((glidein["attrs"].get("GLIDEIN_Is_HTPC")=="True")
             == (job.get("DESIRES_HTPC")==1))

 ●   And then there is the matching HTCondor one
  (stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",")=?=True) &&
  ((GLIDEIN_Is_HTPC=?=True)=?=(DESIRES_HTPC=?=True))
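
To see how the Python form of the policy behaves, the sketch below wraps the same expression in a function and applies it to a few hand-written dictionaries. The function name and the sample jobs and glidein are made up for illustration; only the expression itself comes from the slide.

```python
def fe_match(job: dict, glidein: dict) -> bool:
    """Evaluate the example FE policy for one job and one glidein."""
    site_ok = (glidein["attrs"]["GLIDEIN_CMSSite"]
               in job["DESIRED_Sites"].split(","))
    glidein_is_htpc = glidein["attrs"].get("GLIDEIN_Is_HTPC") == "True"
    job_wants_htpc = job.get("DESIRES_HTPC") == 1
    # Both sides must agree: HTPC jobs on HTPC glideins, others elsewhere.
    return site_ok and (glidein_is_htpc == job_wants_htpc)


# Made-up sample data.
glidein = {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC"}}
plain_job = {"DESIRED_Sites": "T2_FR_IPHC,T2_US_UCSD"}
htpc_job = {"DESIRED_Sites": "T2_FR_IPHC", "DESIRES_HTPC": 1}
elsewhere_job = {"DESIRED_Sites": "T2_US_UCSD"}

print(fe_match(plain_job, glidein))      # site listed, neither side is HTPC
print(fe_match(htpc_job, glidein))       # job wants HTPC, glidein is not
print(fe_match(elsewhere_job, glidein))  # site not in the job's list
```

Note that split(",") does not strip spaces, so a DESIRED_Sites value written with a space after each comma would not match in this form of the policy.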


A word about HTCondor matching
 ●   Once glideins start, you can probe their policy
     condor_status -format '%s\n' START
            $ condor_status -format '%s\n' START
            ( ( true ) && ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,
            DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIR
            ES_HTPC =?= true ) ) ) ) && ( ( GLIDEIN_ToRetire =?= undefined ) || ( Cur
            rentTime < GLIDEIN_ToRetire ) ) )
            ( ( true ) && ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,
            DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIR
            ES_HTPC =?= true ) ) ) ) && ( ( GLIDEIN_ToRetire =?= undefined ) || ( Cur
            rentTime < GLIDEIN_ToRetire ) ) )
            ...

 ●   But there are no tools to help you understand the matchmaking (M.M.)
      ●   The closest is
          condor_q -analyze
           – But it only looks at the Job requirements
           – So it does not really help when all/most of the policy is in the glideins


User priorities
 ●   So, jobs should be matching, but are not starting
      ●   And there are plenty of matching glideins in the system
 ●   Likely there are other higher-priority jobs
     in the system
      ●   Possibly from a different user            Warning: Slow!
          condor_userprio
      ●   Possibly on a different schedd
          condor_status -submitters
 ●   No tools to give you the easy answer
      ●   If you need the answer, you will have to investigate
Unclaimed glideins
 ●   If you see plenty of Unclaimed glideins,
     but no matching jobs from other users
      ●   You have either reached the schedd limit
          MAX_JOBS_RUNNING
      ●   Or something bad is going on!
 ●   You can only ask your FE admin for help
      ●   But first double check that your jobs should
          indeed be matching, at least on paper



Supported Sites
●    What should you do if there are
     no (new) glideins coming from an expected site?
●    First off, see if the site is even supported by the
     glideinWMS instance!
●    Each Entry has a ClassAd
     condor_status -any -const 'MyType=="glideresource"'
      ●   Look for the attributes your FE is matching on
          e.g. GLIDEIN_CMSSite
      ●   Site not there? Notify your FE admin!

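
When scripting the query, passing the arguments as a list avoids the shell quoting around the constraint. The sketch below is one way to do it under stated assumptions: the helper names are hypothetical, it uses the long option name -constraint, and it needs the HTCondor command-line tools and access to the pool to actually run.

```python
import subprocess


def glideresource_query(attribute: str) -> list:
    """Build the condor_status command line for one glideresource attribute."""
    return [
        "condor_status", "-any",
        "-constraint", 'MyType=="glideresource"',
        "-format", r"%s\n", attribute,
    ]


def list_values(attribute: str) -> list:
    """Run the query and return the distinct values, sorted."""
    result = subprocess.run(glideresource_query(attribute), check=True,
                            capture_output=True, text=True)
    return sorted(set(result.stdout.split()))


print(" ".join(glideresource_query("GLIDEIN_CMSSite")))
```

On a submit node, list_values("GLIDEIN_CMSSite") would then give the list of sites the glideinWMS instance currently advertises.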
Is the FE even asking for them?
 ●   You are sure that your jobs should be
     matching?
      ●   But what if you are wrong?
 ●   Check it out
     … -format '%i\n' GlideFactoryMonitorRequestedIdle
      ●   But remember it is not just your jobs.

Maybe the site is just busy?
 ●   Glideins have to compete with other Grid jobs
     at most sites
      ●   Maybe the site is just busy?
 ●   Check if glideinWMS has put any glideins
     in the Grid queues
     … -format '%i\n' GlideFactoryMonitorStatusPending
      ●   If you find zeros, notify your FE admin!

Site problems?
 ●   The glideins will validate the worker node
     before talking to the C.M.
      ●   If the test fails, the glidein will “waste” 20 mins on
          the node to prevent other jobs from failing on it
 ●   You can check if there are “Running”
     glideins in glideinWMS, even though
     you see none (or few) in the C.M.
     … -format '%i\n' GlideFactoryMonitorStatusRunning
      ●   If you find a discrepancy, notify your FE admin!

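
The three counters from the last slides (GlideFactoryMonitorRequestedIdle, GlideFactoryMonitorStatusPending, GlideFactoryMonitorStatusRunning) can be combined into a rough first-pass triage. The sketch below is only one possible reading of the advice given here: the function name, the order of the checks, and the strict comparison against the number of glideins seen in the pool are assumptions, and the counters cover all users, not just your jobs.

```python
def triage(requested_idle: int, pending: int,
           running_factory: int, running_pool: int) -> str:
    """Suggest a next step from the glideresource counters for one entry.

    running_pool is the number of glideins from that entry that you
    actually see in the central manager.
    """
    if requested_idle == 0:
        return "FE is not requesting glideins: re-check that your jobs match"
    if pending == 0 and running_factory == 0:
        return "nothing in the Grid queues: notify your FE admin"
    if running_factory > running_pool:
        return "running glideins missing from the pool: notify your FE admin"
    if pending > 0:
        return "glideins are waiting in the Grid queues: site may be busy"
    return "no obvious glidein-side problem"


print(triage(requested_idle=10, pending=5, running_factory=20, running_pool=2))
```

In practice small differences between the factory and the pool counts are normal while glideins start up, so a real check would likely allow some tolerance.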
Still no clue?

                 ●   If all your detective work fails
                     ●   Notify your VO FE admin
                 ●   They have access to information
                     you don't




Why do my jobs take forever to finish?

My jobs are running, but...
 ●   Great, your jobs are happily running
      ●   But you are getting no results back!
      ●   i.e. the jobs are not finishing in the expected time
 ●   Two main likely reasons
      ●   They are being restarted
      ●   You miscalculated the needed time




Jobs re-starting
 ●   HTCondor tries to be user friendly
      ●   If a job gets preempted, for almost any reason,
          it will try to re-start it with the hope it will finish
          on the next try
      ●   And will not ever give up! (by default)
 ●   You can easily check how many times it started
     condor_q -format '%i\n' NumJobStarts
 ●   You may want to cap the number with
     periodic_hold/remove
           http://research.cs.wisc.edu/htcondor/manual/v7.8/condor_submit.html#condor-submit-periodic-remove
           http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#param:SystemPeriodicRemove
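
As a sketch of how to spot jobs that keep restarting, the code below reads the output of a query such as condor_q -format '%d.' ClusterId -format '%d ' ProcId -format '%i\n' NumJobStarts and lists the jobs above a chosen cap. The function names, the sample output, and the cap of 5 are made up for illustration. It assumes a job that has never started may print no count at all, so an id with no number after it is skipped.

```python
def restart_counts(text: str) -> dict:
    """Map 'Cluster.Proc' to NumJobStarts from the condor_q output."""
    counts = {}
    job_id = None
    for token in text.split():
        if "." in token:
            job_id = token               # a "Cluster.Proc" id
        elif job_id is not None:
            counts[job_id] = int(token)  # the count for the last id seen
            job_id = None
    return counts


def over_cap(counts: dict, cap: int) -> list:
    """Return the ids of jobs that started more than cap times."""
    return sorted(job for job, starts in counts.items() if starts > cap)


# Made-up output; job 20327.2 has no count because it never started.
SAMPLE_OUTPUT = "20327.0 1\n20327.1 7\n20327.2 20327.3 2\n"

counts = restart_counts(SAMPLE_OUTPUT)
print(over_cap(counts, cap=5))
```

The same cap can be enforced by HTCondor itself with a submit-file line such as periodic_remove = NumJobStarts > 5, where the threshold is again only an example.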


Why is it restarting?
 ●   OK, I now know it is restarting... but why?
 ●   Most likely, the glidein was killed
      ●   Was it due to your job “misbehaving”?
 ●   Most Grid sites have limits on resource use
      ●   Including CPU, memory and disk
      ●   If you exceed them, the glidein (and you) will be killed
 ●   Glideins should be configured to detect and
     hold/remove your job if you “misbehave”
      ●   Thus you would not be re-started
      ●   If you see many restarts, notify your FE admin
          (likely there is a policy rule missing)

What is my job doing?
 ●   What if it is not restarting... just running forever
     (or until hitting the time limit)
 ●   HTCondor allows for peeking at a running job
      ●   A cmdline tool called
          condor_ssh_to_job




      ●   Unfortunately, needs implicit permission from site
            –    And about half of the sites don't allow it
The End




Pointers
 ●   glideinWMS Home Page
     http://tinyurl.com/glideinWMS
 ●   HTCondor Home Page
     http://research.cs.wisc.edu/htcondor/
 ●   HTCondor support
     htcondor-users@cs.wisc.edu
     htcondor-admin@cs.wisc.edu
 ●   glideinWMS support
     glideinwms-support@fnal.gov

Acknowledgments
 ●   The creation of this document was sponsored
     by grants from the US NSF and US DOE,
     and by the University of California system




CERN, Dec 2012       glideinWMS monitoring         32

Mais conteúdo relacionado

Semelhante a Monitoring and troubleshooting a glideinWMS-based HTCondor pool

JBoss Drools - Open-Source Business Logic Platform
JBoss Drools - Open-Source Business Logic PlatformJBoss Drools - Open-Source Business Logic Platform
JBoss Drools - Open-Source Business Logic Platform
elliando dias
 
Docker and Your Path to a Better Staging Environment - webinar by Gil Tayar
Docker and Your Path to a Better Staging Environment - webinar by Gil TayarDocker and Your Path to a Better Staging Environment - webinar by Gil Tayar
Docker and Your Path to a Better Staging Environment - webinar by Gil Tayar
Applitools
 
Buytaert kris tools
Buytaert kris toolsBuytaert kris tools
Buytaert kris tools
kuchinskaya
 

Semelhante a Monitoring and troubleshooting a glideinWMS-based HTCondor pool (20)

Matchmaking in glideinWMS in CMS
Matchmaking in glideinWMS in CMSMatchmaking in glideinWMS in CMS
Matchmaking in glideinWMS in CMS
 
glideinWMS validation scirpts - glideinWMS Training Jan 2012
glideinWMS validation scirpts - glideinWMS Training Jan 2012glideinWMS validation scirpts - glideinWMS Training Jan 2012
glideinWMS validation scirpts - glideinWMS Training Jan 2012
 
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
 
Understanding priorities in HTCondor
Understanding priorities in HTCondorUnderstanding priorities in HTCondor
Understanding priorities in HTCondor
 
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
 
glideinWMS Frontend Internals - glideinWMS Training Jan 2012
glideinWMS Frontend Internals - glideinWMS Training Jan 2012glideinWMS Frontend Internals - glideinWMS Training Jan 2012
glideinWMS Frontend Internals - glideinWMS Training Jan 2012
 
Glidein internals
Glidein internalsGlidein internals
Glidein internals
 
SWT Tech Sharing: Node.js + Redis
SWT Tech Sharing: Node.js + RedisSWT Tech Sharing: Node.js + Redis
SWT Tech Sharing: Node.js + Redis
 
Zurg part 1
Zurg part 1Zurg part 1
Zurg part 1
 
glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
 glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM... glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
 
Condor overview - glideinWMS Training Jan 2012
Condor overview - glideinWMS Training Jan 2012Condor overview - glideinWMS Training Jan 2012
Condor overview - glideinWMS Training Jan 2012
 
glideinWMS Architecture - glideinWMS Training Jan 2012
glideinWMS Architecture - glideinWMS Training Jan 2012glideinWMS Architecture - glideinWMS Training Jan 2012
glideinWMS Architecture - glideinWMS Training Jan 2012
 
Automating MySQL operations with Puppet
Automating MySQL operations with PuppetAutomating MySQL operations with Puppet
Automating MySQL operations with Puppet
 
Gdb basics for my sql db as (percona live europe 2019)
Gdb basics for my sql db as (percona live europe 2019)Gdb basics for my sql db as (percona live europe 2019)
Gdb basics for my sql db as (percona live europe 2019)
 
JBoss Drools - Open-Source Business Logic Platform
JBoss Drools - Open-Source Business Logic PlatformJBoss Drools - Open-Source Business Logic Platform
JBoss Drools - Open-Source Business Logic Platform
 
Ganeti Hands-on Walk-thru (part 2) -- LinuxCon 2012
Ganeti Hands-on Walk-thru (part 2) -- LinuxCon 2012Ganeti Hands-on Walk-thru (part 2) -- LinuxCon 2012
Ganeti Hands-on Walk-thru (part 2) -- LinuxCon 2012
 
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
 
Docker and Your Path to a Better Staging Environment - webinar by Gil Tayar
Docker and Your Path to a Better Staging Environment - webinar by Gil TayarDocker and Your Path to a Better Staging Environment - webinar by Gil Tayar
Docker and Your Path to a Better Staging Environment - webinar by Gil Tayar
 
7 tools for your devops stack
7 tools for your devops stack7 tools for your devops stack
7 tools for your devops stack
 
Buytaert kris tools
Buytaert kris toolsBuytaert kris tools
Buytaert kris tools
 

Mais de Igor Sfiligoi

Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...
Igor Sfiligoi
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accounting
Igor Sfiligoi
 

Mais de Igor Sfiligoi (20)

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYRO
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accounting
 
Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resources
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rate
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance compute
 
Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
 
Using A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputUsing A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific Output
 
Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobs
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYRO
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with Admiralty
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACC
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public Clouds
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud links
 

Último

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Último (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men

Monitoring and troubleshooting a glideinWMS-based HTCondor pool

  • 1. glideinWMS for users Monitoring and troubleshooting a glideinWMS-based HTCondor pool by Igor Sfiligoi (UCSD) CERN, Dec 2012 glideinWMS monitoring 1
  • 2. Scope of this talk
      This talk describes what information is available when troubleshooting a glideinWMS-based HTCondor pool, and what tools you can use to mine it.
      The reader is expected to already have a basic understanding of HTCondor and glideinWMS.
  • 3. HTCondor Architecture
      ● As a reminder
      [Diagram: the VO FE, the glidein factories (G.F.), the Grid execute nodes, and the Condor pool with its submit nodes (Schedd) and central manager (Negotiator).]
  • 4. Typical user questions addressed in this talk
      ● Where is/was my job running?
      ● Why are my jobs not starting?
      ● Why do my jobs take forever to finish?
  • 5. Where is/was my job running?
  • 6. Job progress monitoring
      ● HTCondor provides two basic means to monitor job progress
          ● Querying the system for current status
              – Using the command-line tools condor_q/condor_history
          ● Parsing the job event log
              – Either plain text or XML formatted
              – Starting with 7.9.1, condor_history can be used to extract the last known state
  • 7. Job status
      ● Each job has a status associated with it
          ● An integer attribute called JobStatus
              – But each value has well-known semantics associated with it
      ● Jobs start in the Idle state
      ● They become Running if everything works fine
      ● They are Completed when they terminate
      ● If anything goes wrong, a job will go into Hold
      ● If removed before completion, it will be Removed
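The integer values behind the states named on this slide can be spelled out as a small lookup table. The sketch below uses the JobStatus codes commonly documented for HTCondor (1 to 5); the helper name `job_status_name` is hypothetical and only serves the illustration.

```python
# JobStatus codes as commonly documented for HTCondor.
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
}


def job_status_name(code: int) -> str:
    """Translate a JobStatus integer into its symbolic name (hypothetical helper)."""
    return JOB_STATUS_NAMES.get(code, f"Unknown({code})")


if __name__ == "__main__":
    for code in (1, 2, 5):
        print(code, job_status_name(code))
```

A table like this is handy when post-processing condor_q or condor_history output that prints JobStatus as a bare number.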
  • 8. Monitoring the Job Status
      ● Idle/Running/Held jobs can be polled with condor_q
          ● This queries the Schedd daemon
      ● Once they terminate, or are removed, they leave the Schedd queue
          ● They are put into a file on disk
          ● You can use condor_history to retrieve the last ClassAd
          ● One exception: if a job was running when it was removed, but the execute node does not confirm the job was killed remotely, the job will be kept in the Schedd.
      ● The job event log has all the state transitions (of course)
  • 9. So, where is the job running?
      ● It is easy to get the machine name and/or IP
          ● Standard HTCondor attributes RemoteHost and StartdIpAddr
      ● But they may not necessarily make sense
          ● Do you recognize all network domains?
          ● And they could be on a private network!
  • 10. Getting glidein attributes
      ● Glideins have many more attributes that describe them
          ● e.g. the symbolic site name GLIDEIN_CMSSite
      ● However, by default, you do not get this info in the job ClassAd
      ● But it is easy to add
          ● <my attr> = $$(<glidein attr>:Unknown)
              – You will get the info in MATCH_EXP_<my attr>
  • 11. Standard attributes
      ● Standard glideinWMS attributes
          ● JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
          ● JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
          ● JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
          ● JOB_GLIDEIN_Schedd = "$$(GLIDEIN_Schedd:Unknown)"
          ● JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:...)"
          ● JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
          ● JOB_GLIDEIN_Site = "$$(GLIDEIN_Site:Unknown)"
          Callout: useful for in-depth debugging
      ● Standard CMS glideinWMS attribute
          ● JOB_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"
      ● Configured by the HTCondor admin, no need for the user to do anything
            SUBMIT_EXPRS = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, ...
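As an illustration of the pattern on this slide, the following sketch generates the attribute definitions and the SUBMIT_EXPRS line from a mapping of job attribute name to glidein attribute name. The function name `submit_exprs_config` is hypothetical, and the comma-separated SUBMIT_EXPRS layout simply mirrors the slide; check it against your own HTCondor configuration before using it.

```python
def submit_exprs_config(attr_map, default="Unknown"):
    """Build config lines in the style shown on the slide (hypothetical helper).

    attr_map maps the job-side attribute name to the glidein attribute
    whose value should be substituted at match time via $$().
    """
    lines = [
        f'{job_attr} = "$$({glidein_attr}:{default})"'
        for job_attr, glidein_attr in attr_map.items()
    ]
    # Same comma-separated layout as on the slide.
    lines.append("SUBMIT_EXPRS = " + ", ".join(attr_map))
    return "\n".join(lines)


if __name__ == "__main__":
    print(submit_exprs_config({
        "JOB_GLIDEIN_Entry_Name": "GLIDEIN_Entry_Name",
        "JOB_CMSSite": "GLIDEIN_CMSSite",
    }))
```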
  • 12. Getting them in the event log
      ● You (or the admins) can also propagate the attributes into the event log
            job_ad_information_attrs = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, …
      ● As a result you get “Job Ad” events
            ...
            001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
            ...
            028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
            TriggerEventTypeNumber = 1
            Cluster = 20327
            EventTypeNumber = 28
            ExecuteHost = "<193.48.85.94:38749>"
            JOB_CMSSite = "T2_FR_IPHC"
            EventTime = "2012-12-03T00:46:33"
            TriggerEventTypeName = "ULOG_EXECUTE"
            Proc = 2
            Subproc = 0
            CurrentTime = time()
            MyType = "ExecuteEvent"
            ...
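A "Job Ad" event like the one on this slide is a block of `Name = Value` lines, so it can be mined with a few lines of code. The sketch below is a minimal parser for one such block, assuming the plain-text log format shown on the slide; `parse_job_ad_event` and `SAMPLE_EVENT` are hypothetical names, and expressions such as `time()` are kept as plain strings rather than evaluated.

```python
SAMPLE_EVENT = """\
028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
TriggerEventTypeNumber = 1
Cluster = 20327
EventTypeNumber = 28
ExecuteHost = "<193.48.85.94:38749>"
JOB_CMSSite = "T2_FR_IPHC"
EventTime = "2012-12-03T00:46:33"
TriggerEventTypeName = "ULOG_EXECUTE"
Proc = 2
Subproc = 0
CurrentTime = time()
MyType = "ExecuteEvent"
...
"""


def parse_job_ad_event(block: str) -> dict:
    """Turn the 'Name = Value' lines of one event block into a dict."""
    attrs = {}
    for line in block.splitlines():
        key, sep, value = line.partition(" = ")
        if not sep:
            continue  # event header or "..." separator
        value = value.strip()
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]  # quoted string
        elif value.isdigit():
            value = int(value)  # plain non-negative integer
        attrs[key.strip()] = value
    return attrs


if __name__ == "__main__":
    event = parse_job_ad_event(SAMPLE_EVENT)
    print(event["Cluster"], event["Proc"], event["JOB_CMSSite"])
```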
  • 13. Why is my job not starting?
  • 14. Troubleshooting process
      ● First question
          ● Do my jobs match any (logical) resource?
      ● Once you are sure of that
          ● Are there jobs from higher-priority users?
          ● Are the desired sites just too busy?
          ● Are there problems at the desired site(s)?
      ● If nothing gives a satisfying answer
          ● It may be a glideinWMS misconfiguration; seek help from the VO FE admins
  • 15. How do I know if my jobs match?
      ● Good question!
      ● Unfortunately, the answer is not trivial
          ● The FE matching policy is not “public”
          ● And, of course, there are no tools to probe for it
      ● You will have to rely on the FE admins to “explain” the policy
          ● Hopefully in a human-readable format
          ● Hopefully without conversion errors!
  • 16. An example FE policy
      ● See the CMS FE talk for an actual high-level view
      ● The actual FE policy is a Python expression (a simple example; it could be much more complex)
            (glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(",")) and
            ((glidein["attrs"].get("GLIDEIN_Is_HTPC")=="True") == (job.get("DESIRES_HTPC")==1))
      ● And then there is the matching HTCondor one
            (stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",")=?=True) &&
            ((GLIDEIN_Is_HTPC=?=True)=?=(DESIRES_HTPC=?=True))
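To see how the example Python policy on this slide behaves, it can be wrapped in a function and tried against hand-made inputs. This is only a sketch: the function name `fe_match` and the sample `job` and `glidein` dictionaries are illustrative assumptions, not the way the Frontend actually invokes the expression.

```python
def fe_match(job: dict, glidein: dict) -> bool:
    """Evaluate the example FE policy from the slide for one job/glidein pair."""
    site_ok = glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(",")
    # An HTPC glidein only matches jobs that ask for HTPC, and vice versa.
    htpc_ok = (glidein["attrs"].get("GLIDEIN_Is_HTPC") == "True") == (
        job.get("DESIRES_HTPC") == 1
    )
    return site_ok and htpc_ok


if __name__ == "__main__":
    # Illustrative inputs only.
    glidein = {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC"}}
    job = {"DESIRED_Sites": "T2_FR_IPHC,T2_US_UCSD"}
    print(fe_match(job, glidein))
    print(fe_match({**job, "DESIRES_HTPC": 1}, glidein))
```

Note that the site list is split on a bare comma, so a DESIRED_Sites value with spaces after the commas would not match under this expression.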
  • 17. A word about HTCondor matching
      ● Once glideins start, you can probe their policy
            $ condor_status -format '%s\n' START
            ( ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) ) && ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) )
            ...
      ● But there are no tools to help you understand the matchmaking (M.M.)
          ● The closest is condor_q -analyze
              – But it only looks at the job requirements
              – So it does not really help when all/most of the policy is in the glideins
  • 18. User priorities
      ● So, jobs should be matching, but are not starting
          ● And there are plenty of matching glideins in the system
      ● Likely there are other higher-priority jobs in the system
          ● Possibly from a different user: condor_userprio (warning: slow!)
          ● Possibly on a different schedd: condor_status -submitters
      ● No tools will give you the easy answer
          ● If you need the answer, you will have to investigate
  • 19. Unclaimed glideins
      ● If you see plenty of Unclaimed glideins, but no matching jobs from other users
          ● You have either reached the schedd limit MAX_JOBS_RUNNING
          ● Or something bad is going on!
      ● You can only ask your FE admin for help
          ● But first double-check that your jobs should indeed be matching, at least on paper
  • 20. Supported Sites
      ● What should you do if there are no (new) glideins coming from an expected site?
      ● First off, see if the site is even supported by the glideinWMS instance!
          ● Each Entry has a ClassAd
                condor_status -any -const 'MyType=="glideresource"'
          ● Look for the attributes your FE is matching on, e.g. GLIDEIN_CMSSite
      ● Site not there? Notify your FE admin!
  • 21. Is the FE even asking for them?
      ● You are sure that your jobs should be matching?
          ● But what if you are wrong?
      ● Check it out
            … -format '%i\n' GlideFactoryMonitorRequestedIdle
      ● But remember it is not just your jobs.
  • 22. Maybe the site is just busy?
      ● Glideins have to compete with other Grid jobs at most sites
          ● Maybe the site is just busy?
      ● Check if glideinWMS has put any glideins in the Grid queues
            … -format '%i\n' GlideFactoryMonitorStatusPending
      ● If you find zeros, notify your FE admin!
  • 23. Site problems?
      ● The glideins will validate the worker node before talking to the C.M.
          ● If the test fails, the glidein will “waste” 20 mins on the node to prevent other jobs from failing on it again
      ● You can check if there are “Running” glideins in glideinWMS, even though you see none (or few) in the C.M.
            … -format '%i\n' GlideFactoryMonitorStatusRunning
      ● If you find a discrepancy, notify your FE admin!
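The checks from the last three slides can be strung together into a rough triage. The sketch below takes the three GlideFactoryMonitor counters for one Entry plus the number of glideins from that site visible in the central manager; the function name `triage`, the returned labels, and the `min_fraction` threshold used to call something a "discrepancy" are all assumptions made for illustration.

```python
def triage(requested_idle, pending, running_factory, running_pool, min_fraction=0.5):
    """Rough triage of one Entry, following the order of the slides.

    requested_idle   -- GlideFactoryMonitorRequestedIdle
    pending          -- GlideFactoryMonitorStatusPending
    running_factory  -- GlideFactoryMonitorStatusRunning
    running_pool     -- glideins from this site seen in the central manager
    min_fraction     -- arbitrary threshold (assumption) for a discrepancy
    """
    if requested_idle == 0:
        # The FE is not asking for glideins: re-check that the jobs match.
        return "not_requested"
    if pending == 0 and running_factory == 0:
        # Requested, but nothing in the Grid queues: notify the FE admin.
        return "nothing_submitted"
    if running_pool < min_fraction * running_factory:
        # Glideins run at the site but do not join the pool: validation may be failing.
        return "validation_suspect"
    if running_factory == 0:
        # Glideins are queued but none has started: the site may just be busy.
        return "site_busy"
    return "glideins_running"


if __name__ == "__main__":
    print(triage(requested_idle=10, pending=5, running_factory=20, running_pool=2))
```

The counters describe all jobs served by the Entry, not just yours, so the result is a hint about where to look rather than a verdict.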
  • 24. Still no clue?
      ● If all your detective work fails
          ● Notify your VO FE admin
          ● They have access to information you don't
  • 25. Why do my jobs take forever to finish?
  • 26. My jobs are running, but...
      ● Great, your jobs are happily running
      ● But you are getting no results back!
          ● i.e. the jobs are not finishing in the expected time
      ● Two main likely reasons
          ● They are being restarted
          ● You miscalculated the needed time
  • 27. Jobs re-starting
      ● HTCondor tries to be user-friendly
          ● If a job gets preempted, for almost any reason, it will try to re-start it with the hope it will finish on the next try
          ● And it will not ever give up! (by default)
      ● You can easily check how many times it started
            condor_q -format '%i\n' NumJobStarts
      ● You may want to cap the number with periodic_hold/remove
            http://research.cs.wisc.edu/htcondor/manual/v7.8/condor_submit.html#condor-submit-periodic-remove
            http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#param:SystemPeriodicRemove
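When many jobs are in the queue, the restart counts are easier to read if the job id is printed next to each count, for example with `condor_q -format '%d.' ClusterId -format '%d ' ProcId -format '%d\n' NumJobStarts`. The sketch below post-processes output in that assumed "id count" layout and lists the jobs that started more often than a chosen limit; `frequent_restarters` and the default limit of 3 are hypothetical choices.

```python
def frequent_restarters(condor_q_output: str, max_starts: int = 3) -> dict:
    """Return {job_id: starts} for jobs that started more than max_starts times.

    Expects one 'Cluster.Proc NumJobStarts' pair per line; anything else
    (blank or malformed lines) is skipped.
    """
    flagged = {}
    for line in condor_q_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        job_id, starts = parts
        if starts.isdigit() and int(starts) > max_starts:
            flagged[job_id] = int(starts)
    return flagged


if __name__ == "__main__":
    sample = "20327.0 1\n20327.1 7\n20327.2 4\n"
    print(frequent_restarters(sample))
```

The same limit could then be enforced at submit time with a periodic_hold or periodic_remove expression built on NumJobStarts, as described in the manual pages linked on the slide.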
  • 28. Why is it restarting?
      ● OK, I now know it is restarting... but why?
          ● Most likely, the glidein was killed
      ● Was it due to your job “misbehaving”?
          ● Most Grid sites have limits on resource use
          ● Including CPU, memory and disk
          ● If you exceed them, the glidein (and you) will be killed
      ● Glideins should be configured to detect and hold/remove your job if you “misbehave”
          ● Thus you would not be re-started
      ● If you see many restarts, notify your FE admin
          ● Likely there is a policy rule missing
  • 29. What is my job doing?
      ● What if it is not restarting... just running forever (or until hitting the time limit)?
      ● HTCondor allows for peeking at a running job
          ● A command-line tool called condor_ssh_to_job
          ● Unfortunately, it needs implicit permission from the site
              – And about half of the sites don't allow it
  • 30. The End
  • 31. Pointers
      ● glideinWMS Home Page: http://tinyurl.com/glideinWMS
      ● HTCondor Home Page: http://research.cs.wisc.edu/htcondor/
      ● HTCondor support: htcondor-users@cs.wisc.edu, htcondor-admin@cs.wisc.edu
      ● glideinWMS support: glideinwms-support@fnal.gov
  • 32. Acknowledgments
      ● The creation of this document was sponsored by grants from the US NSF and US DOE, and by the University of California system