glideinWMS training



          Solving Grid problems
        through glidein monitoring
            i.e. The Grid debugging part of G.Factory operations

                           by Igor Sfiligoi (UCSD)




glideinWMS training              Grid debugging                    1
Glidein Factory Operations
●    Factory node operations
●    Serving VO Frontend Admin requests
●    Keeping up with changes in the Grid
●    Debugging Grid problems
      ●   The most time-consuming part
      ●   Effectively we help solve Grid problems,
          through glidein monitoring

Reminder - Glideins
 ●   A glidein is a properly configured Condor startd
     daemon submitted as a Grid job


     [Diagram: Frontend, Factory, Submit node, Central manager, CE and Worker node (glidein, Startd, Job); arrows labeled "Monitor Condor", "Match", "Request glideins" and "Submit glideins"]

What can go wrong in the Grid?
 ●   Many places where things can go wrong
      ●   Essentially at any of the arrows below


     [Diagram: Factory, Submit node, Central manager, CE and Worker node (glidein, Startd, Job), connected by arrows]



What can go wrong in the Grid?
 ●   In particular
      ●   CE may refuse to accept glideins



     [Diagram: Factory, Submit node, Central manager, CE and Worker node (glidein, Startd, Job)]



What can go wrong in the Grid?
 ●   In particular
      ●   CE may not start glideins
      ●   Or fail to tell us what
          the status of the job is

     [Diagram: Factory, Submit node, Central manager, CE and Worker node (glidein, Startd, Job)]



What can go wrong in the Grid?
 ●   In particular
      ●   The worker node may be broken/misconfigured
            –   Thus validation will fail
      ●   Many reasons

     [Diagram: Factory, Submit node, Central manager, CE and Worker node (glidein, Startd, Job)]



What can go wrong in the Grid?
 ●   In particular
      ●   The WAN networking may not work properly
      ●   The CM never hears from the Startd
      ●   Or Schedd cannot talk to Startd
      ●   Can be selective

     [Diagram: Factory, Submit node, Central manager, CE and Worker node (glidein, Startd, Job)]



What can go wrong in the Grid?
 ●   In particular
      ●   Or the security infrastructure could be broken
            –   CAs missing
            –   Time discrepancies
            –   Etc.

     [Diagram: Factory, Submit node, Central manager, CE and Worker node (glidein, Startd, Job)]



What can go wrong in the Grid?
 ●   In particular
      ●   The site may refuse to start the user job
            –   e.g. glexec


     [Diagram: Factory, Submit node, Central manager, CE and Worker node (glidein, Startd, Job)]



What can go wrong with glideins?
 ●   And there are also non-Grid problems
      ●   Jobs not matching
 ●   But that's beyond the scope of this document

     [Diagram: Factory, Submit node, Central manager, CE and Worker node (glidein, Startd, Job)]



Problem classification
 ●   Most often we see WN problems (typically easy to diagnose)
      ●   Followed by CEs refusing glideins
 ●   Then there are misbehaving CEs
      ●   Very hard to diagnose!
 ●   Everything else quite rare
      ●   But usually hard to diagnose as well




Grid debugging




                      Validation problems
                       i.e. Problems on Worker Nodes




WN problems
 ●   The glidein startup script runs
     a list of validation scripts
      ●   If any of them fails, the WN is considered broken
      ●   This way user jobs never get to broken WNs
 ●   Two sources of tests
      ●   Glidein Factory
      ●   VO Frontend
 ●   Of course, if the validation script cannot be fetched
     from either Web server, it is considered a failure
     as well
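
The fail-fast behavior described above can be sketched roughly as follows. This is only an illustration, not the actual glidein startup script: the function name `run_validation`, the helper `fetch_failure`, and the idea of modeling each validation script as a named Python callable are assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

# Each test is a (name, check) pair; check() returns None on success or an
# error message on failure. These are hypothetical stand-ins for the real
# Factory- and Frontend-provided validation scripts.
Test = Tuple[str, Callable[[], Optional[str]]]


def run_validation(factory_tests: List[Test],
                   frontend_tests: List[Test]) -> Tuple[bool, str]:
    """Run all tests in order and stop at the first failure."""
    for name, check in factory_tests + frontend_tests:
        try:
            error = check()
        except Exception as exc:  # e.g. the script could not be fetched
            error = f"could not run test: {exc}"
        if error is not None:
            # One failed test is enough to consider the WN broken.
            return False, f"{name}: {error}"
    return True, "all tests passed"


def fetch_failure() -> Optional[str]:
    # Simulates a validation script that cannot be downloaded.
    raise OSError("web server unreachable")


ok, msg = run_validation(
    [("disk_check", lambda: None)],
    [("vo_software", lambda: "VO software directory not found")],
)
print(ok, msg)
```

Treating a fetch error the same way as a failed test mirrors the rule that an unreachable Web server also counts as a validation failure.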
Types of tests
 ●   The glideinWMS SW comes with a set of
     standard tests (provided by the factory):
      ●   Grid environment present (e.g. CAs)
      ●   Some free disk on $PWD and on /tmp
      ●   Enough FE-provided proxy lifetime remaining
      ●   gLExec related tests
      ●   OS type
 ●   Each VO may have its own needs, e.g.:
      ●   Is VO SW pre-installed and accessible?
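
As an illustration of what one such test might look like, here is a minimal free-disk check. It is a sketch under assumptions, not the test shipped with glideinWMS: the function name `check_free_disk` and the 1 GiB threshold are hypothetical (the slide only says "some free disk").

```python
import os
import shutil
import tempfile
from typing import Optional

MIN_FREE_BYTES = 1024 ** 3  # hypothetical threshold (1 GiB)


def check_free_disk(path: str, min_free_bytes: int) -> Optional[str]:
    """Return None if `path` has enough free space, else an error message."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        return f"only {free} bytes free on {path}, need {min_free_bytes}"
    return None


# Check the working directory and the temp directory (usually /tmp on Linux).
for path in (os.getcwd(), tempfile.gettempdir()):
    error = check_free_disk(path, MIN_FREE_BYTES)
    print(path, "OK" if error is None else error)
```

Returning a clear message rather than a bare exit code is what makes the failure easy to diagnose once it reaches the factory.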

Discovering the problems
 ●   Any error message printed out by the validation
     script will be delivered back to the factory
      ●   After the glidein terminates
 ●   Most validation scripts provide clear indication
     what went wrong
      ●   And we strive to get all to do it!
 ●   New machine-readable format being introduced
      ●   With v2_6_2


Typical ops
 ●   Noticing that a large fraction of glideins for a
     site are failing is easy
      ●   Just look at the monitoring
      ●   And we are getting a daily email as well
 ●   Discovering what exactly is broken is not too
     difficult either
      ●   Just parse the logs (with appropriate tools)
      ●   Will get even easier when all scripts
          return machine-readable information
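
Once the logs are parsed, spotting the affected sites and nodes is simple bookkeeping. The sketch below assumes the parsing has already produced `(site, node, passed)` records; that record layout, the function name `summarize_failures`, and the 50% threshold are all hypothetical choices for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (site, worker node, validation passed?): a hypothetical record layout;
# real data would come from parsing the glidein logs.
Record = Tuple[str, str, bool]


def summarize_failures(records: Iterable[Record],
                       threshold: float = 0.5
                       ) -> Dict[str, Tuple[float, List[str]]]:
    """Map each site whose failure fraction is >= threshold to
    (failure fraction, sorted list of nodes with at least one failure)."""
    total: Dict[str, int] = defaultdict(int)
    failed: Dict[str, int] = defaultdict(int)
    bad_nodes: Dict[str, Set[str]] = defaultdict(set)
    for site, node, passed in records:
        total[site] += 1
        if not passed:
            failed[site] += 1
            bad_nodes[site].add(node)
    return {
        site: (failed[site] / total[site], sorted(bad_nodes[site]))
        for site in total
        if failed[site] / total[site] >= threshold
    }


records = [
    ("SiteA", "node1", False),
    ("SiteA", "node1", False),
    ("SiteA", "node2", True),
    ("SiteB", "node3", True),
]
print(summarize_failures(records))
```

The per-site node list is the kind of detail a site admin needs in a ticket, since it is rare for a whole site to be broken.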

Action items
 ●   Not much we can do directly (unless this is the
     result of a misconfiguration on our part)
 ●   Typically, we open a ticket with the site
      ●   Provide the list of nodes where it happens
          (rare to have the whole site broken)
      ●   A concise but complete error report
          essential for a speedy resolution
 ●   In a minority of cases we have to contact the
     VO FE admin, e.g.
      ●   Unclear error messages
      ●   Non-WN specific validation errors

Black hole nodes
 ●   There is one further WN problem
      ●   Black hole WNs
      ●   WNs that accept glidein jobs, but don't execute them
 ●   glidein_startup never has the chance
     to log anything
      ●   Not even the node it is running on
      ●   Thus, empty log files!
 ●   We can infer we have a black hole node at a site
     by looking at job timing (in Condor-G logs)
      ●   Good jobs run for at least 20 mins
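
The timing-based inference could look roughly like this. The sketch assumes job runtimes have already been extracted from the Condor-G logs as `(site, runtime in seconds)` pairs; the function name and the "at least 3 short jobs" cutoff are hypothetical, while the 20-minute limit comes from the slide.

```python
from collections import Counter
from typing import Iterable, List, Tuple

MIN_GOOD_RUNTIME_S = 20 * 60  # good jobs run for at least 20 minutes


def suspect_black_hole_sites(jobs: Iterable[Tuple[str, float]],
                             min_short_jobs: int = 3) -> List[str]:
    """Return sites with at least `min_short_jobs` suspiciously short jobs.

    Only the site is known here: a black hole node leaves an empty log,
    so the node name itself cannot be recovered from the glidein output.
    """
    short = Counter(site for site, runtime_s in jobs
                    if runtime_s < MIN_GOOD_RUNTIME_S)
    return sorted(site for site, n in short.items() if n >= min_short_jobs)


jobs = [("SiteA", 30), ("SiteA", 45), ("SiteA", 10), ("SiteA", 5000),
        ("SiteB", 60), ("SiteB", 7200)]
print(suspect_black_hole_sites(jobs))
```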

Grid debugging




                CE refusing the glideins




CE Refusing the glideins
 ●   CE admin has the right to refuse anyone
      ●   But usually does not change his mind overnight
      ●   Accessing a site for the first time is an issue of its own
            –   Not covered here
 ●   When things go wrong, the typical reason is
      ●   CE service down,
      ●   Problems in the Security/Auth infrastructure,
      ●   CE seriously misconfigured/broken


Expected vs Unexpected
 ●   Some “problems” are expected
      ●   e.g. the CE is down for scheduled maintenance
      ●   Nothing to do in this case!
            –   Just a monitoring issue
      ●   So, checking the maintenance DB is important!
 ●   If not, we have to notify the site
      ●   The VO FEs are not getting the CPU slots
          they are asking for
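
Separating expected from unexpected refusals boils down to a time-window lookup. The sketch below assumes the scheduled downtimes for a CE have already been fetched from the maintenance DB as `(start, end)` pairs; how they are fetched is not covered here, and the function name is hypothetical.

```python
from datetime import datetime
from typing import Iterable, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of a scheduled downtime


def in_scheduled_maintenance(when: datetime,
                             windows: Iterable[Window]) -> bool:
    """True if `when` falls inside any announced maintenance window."""
    return any(start <= when <= end for start, end in windows)


windows = [(datetime(2012, 1, 10, 8, 0), datetime(2012, 1, 10, 18, 0))]
print(in_scheduled_maintenance(datetime(2012, 1, 10, 12, 0), windows))
print(in_scheduled_maintenance(datetime(2012, 1, 11, 12, 0), windows))
```

A refusal inside a window needs no action; one outside it is a candidate for a ticket to the site.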



Discovering the problem
 ●   Condor-G reacts in two different ways
      ●   Does nothing – We still have monitoring showing
          the job did not progress from Waiting→Pending
      ●   Puts the job on Hold
 ●   The G.Factory will react on Held jobs
      ●   Releasing them a few times → Condor-G retries
      ●   Removing them after a while
            –   Just to be replaced with identical glideins
            –   For most non-trivial problems, the problem does not
                resolve by itself

Action items
                            (for unexpected problems)



 ●   Most of the time, not much we can do directly
      ●   Will just open ticket with site
      ●   If any useful info in the HoldReason, we pass it on
      ●   DN of the proxy the most valuable info
 ●   But it could be our problem, too
      ●   Found many Condor-G problems in the past
      ●   Comparing the behavior of many G.Factory
          instances can confirm or exclude this
      ●   Ad-hoc solutions are needed if this is the case


Grid debugging



                        CE not properly
                      handling the glideins



Problematic CE
 ●   Three basic types of problems:
      ●   Glideins not starting
      ●   Improper monitoring information
      ●   Output files not being delivered to client
 ●   And there are two more
      ●   Unexpected policies that kill glideins
      ●   Preemption




Glideins not starting
 ●   The CE scheduling policy is not available to us
      ●   So often not obvious if we are just low priority or
          something else is going on
      ●   GF/Condor-G does not see it as an error condition
 ●   We usually don't act on it, unless
      ●   The VO FE admin complains, or
      ●   We have been given explicit guidance of the
          expected startup rates
 ●   Not much for us to investigate
      ●   Just tell the site admin “Jobs are not starting”
Glideins being killed by the site
 ●   Ideally, our glideins should fit within
     the policies of the site (but getting this info
     is not trivial, remember?)
      ●   But sometimes they don't
      ●   So they get killed hard
 ●   Discovering this from our side very hard
      ●   We often just notice empty log files
      ●   Not an error for Condor-G
      ●   Often learn of this because the VO complains
 ●   If and when we understand the problem,
     we can deal with it ourselves
      ●   i.e. we configure the glideins to stay within the limits
Preemption
 ●   Some sites will preempt our glideins
     if higher priority jobs get into the queue
      ●   Effectively killing our glideins
 ●   Not an actual error
      ●   Sites have the right to do it!
 ●   But it can mess up our monitoring/ops
      ●   We may see killed glideins, or
      ●   We may see glideins that seem to run for
          a very long time (when automatically rescheduled on the CE)
 ●   We have to efficiently filter these events out
Improper monitoring info from CE
 ●   A CE may not provide reliable information
 ●   Each VO FE provides us with monitoring
     information about its central manager
      ●   By comparing what it tells us, with what
          the CE tells us, we can infer if there are problems
 ●   A large, consistent discrepancy typically signals
     problems in the CE monitoring
 ●   Very difficult to figure out what is going on
      ●   We have no direct detailed data to act upon
      ●   Mostly ad-hoc detective work, prodding the black box
      ●   Often inconclusive
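
The comparison between the two views can be sketched as below. It assumes we periodically sample two counts for a site: the glideins the CE reports as running and the glideins the VO central manager actually sees. The function names and both thresholds (30% relative difference, 80% of samples) are hypothetical and would need tuning.

```python
from typing import Iterable, Tuple


def relative_discrepancy(ce_running: int, vo_running: int) -> float:
    """Relative difference between the CE view and the VO view."""
    biggest = max(ce_running, vo_running)
    if biggest == 0:
        return 0.0
    return abs(ce_running - vo_running) / biggest


def is_consistently_off(samples: Iterable[Tuple[int, int]],
                        threshold: float = 0.3,
                        min_fraction: float = 0.8) -> bool:
    """True if most (ce_running, vo_running) samples disagree strongly."""
    samples = list(samples)
    if not samples:
        return False
    off = sum(1 for ce, vo in samples
              if relative_discrepancy(ce, vo) > threshold)
    return off / len(samples) >= min_fraction


history = [(100, 40)] * 9 + [(100, 95)]
print(is_consistently_off(history))
```

Requiring the discrepancy to persist across most samples filters out short-lived differences, such as glideins that are still starting up, and keeps only the large, consistent ones.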
Lack of output files
●    The glidein output files contain
      ●   Accounting information
      ●   Detailed logging
●    Without other problems, mostly an annoyance
●    But much more often paired with glideins failing
      ●   Making failure diagnostics close to impossible
●    Extremely hard to diagnose the root cause
      ●   Sometimes we may infer it (black holes, killed glideins, ...)
      ●   For actual CE problems it requires help from many
          parties, including us, the site admins and SW developers
Grid debugging




                      Networking problems




Glideins are network heavy
 ●   Each glidein opens several
     long‑lived TCP connections (in CCB mode)
      ●   Can overwhelm networking gear
            –   e.g. NATs can run out of spare ports
 ●   Problems can have non-linear behavior
      ●   Will work fine on small scale
      ●   Will degrade after a while
            –   Not necessarily a step function, though
 ●   Although outright denials due to firewalls
     are also a problem
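
A back-of-the-envelope estimate shows why this works at small scale and breaks at large scale. The sketch uses the simplest possible model, in which every long-lived outbound TCP connection holds one port on the NAT's public address; real NAT devices may behave differently. The function names and all numbers in the example are hypothetical.

```python
def nat_ports_needed(glideins: int, connections_per_glidein: int) -> int:
    """Ports held on the NAT if each connection pins one port."""
    return glideins * connections_per_glidein


def nat_headroom(glideins: int, connections_per_glidein: int,
                 usable_ports: int) -> int:
    """Spare ports left on the NAT; negative means it is oversubscribed."""
    return usable_ports - nat_ports_needed(glideins, connections_per_glidein)


# Hypothetical site: 5 connections per glidein, 64000 usable NAT ports.
for n_glideins in (1000, 10000, 20000):
    print(n_glideins, nat_headroom(n_glideins, 5, 64000))
```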

Diagnostics and action items
 ●   Not trivial to detect
      ●   Errors often in the glidein logs
      ●   But difficult to interpret
      ●   And we are lacking tools for automatically detecting this
 ●   Not much we can do directly
      ●   A problem between the VO services and the site
            –   So we notify both
 ●   However
      ●   we usually end up assisting as experts


Grid debugging




               Authentication problems




Security is delicate stuff
 ●   Grid security mechanisms paranoid by design
      ●   “Availability” is the last to be considered
      ●   The main focus is keeping the “bad guys” out
 ●   So they are extremely delicate
      ●   If any piece of the chain breaks, everything breaks
 ●   Things that can go wrong (non-exhaustive list):
      ●   Missing CA(s)
      ●   Expired CRLs
      ●   Expired glidein proxy
      ●   Wrong system time (clock skew)
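
Two of the items above are purely about time and are easy to express as checks. The sketch below is an illustration only: it assumes the proxy expiration time and a trusted reference time are already available, and the function names and the 5-minute tolerance are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def proxy_time_left(not_after: datetime, now: datetime) -> timedelta:
    """Remaining proxy lifetime; negative if the proxy already expired."""
    return not_after - now


def clock_skew_ok(local_now: datetime, reference_now: datetime,
                  tolerance: timedelta = timedelta(minutes=5)) -> bool:
    """True if the local clock is within `tolerance` of the reference."""
    return abs(local_now - reference_now) <= tolerance


now = datetime(2012, 1, 1, 12, 0, tzinfo=timezone.utc)
expired = proxy_time_left(now - timedelta(hours=1), now) < timedelta(0)
print("proxy expired:", expired)
print("clock ok:", clock_skew_ok(now + timedelta(minutes=20), now))
```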
Diagnostics and action items
 ●   Finding the root cause usually hard
      ●   Errors are in the glidein logs
      ●   But usually do not provide enough info
          (to avoid giving away too much info to a hypothetical attacker)
      ●   And we are lacking tools for automatically detecting this
 ●   Have to distinguish between
     site problems and VO problems, too
      ●   Only obvious if only a fraction fails (→ WN problem)
      ●   Else, may need to get both sides involved to
          properly diagnose the root cause


Grid debugging




                      Job startup problems




gLExec              (1)



 ●   The biggest source of problems, by far,
     is gLExec refusing to accept a user proxy
      ●   Resulting in jobs not starting
      ●   BTW, Condor is not good at handling gLExec denials
 ●   We can only partially test gLExec
     during validation
      ●   May behave differently based on the proxy used
      ●   Its behavior can change in time
 ●   And final users may be the source of the problem
      ●   e.g. by letting the proxy expire (Condor could catch
          these, and hopefully soon will)
gLExec               (2)



 ●   Non-trivial to detect
      ●   Errors are in the glidein logs
      ●   But we lack the tools to extract them
 ●   Finding the root cause impossible without
     site admin help
      ●   gLExec policies are a site secret
      ●   We thus just notify the site,
          providing the failing user DN



Configuration problems
 ●   Condor can be configured to run a wrapper
     around the user job
      ●   To customize the user environment
      ●   Usually provided by the VO FE
 ●   If that fails, the user job fails with it
 ●   Luckily, failures are rare
      ●   If we notice them, we notify the VO FE admins
      ●   However, they often notice before we do


Other job startup problems
 ●   By default, we validate the node
     only at glidein startup
      ●   WN conditions may change by the time a job
          is scheduled to run
            –   e.g. the disk fills up
      ●   We should do better. Condor supports periodic validation
          tests; we just don't use them right now.
 ●   The errors are usually only
     seen by the final users
      ●   So we hardly ever notice
          these kinds of problems


Summary
●   The Grid world is a good approximation of
    a chaotic system
     ●   There are thus many failure modes
●   The pilot paradigm hides most of the failures
    from the final users
     ●   But the failures are still there
     ●   Resulting in wasted/underused CPU cycles
●   The G.Factory operators are in the best position to
    diagnose the root cause of the failures
     ●   By having a global view
     ●   However, they cannot solve the problems by themselves
Acknowledgments
 ●   This document was sponsored by grants from
     the US NSF and US DOE,
     and by the UC system




Managing Cloud networking costs for data-intensive applications by provisioni...
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
 
Using A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputUsing A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific Output
 
Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobs
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYRO
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with Admiralty
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACC
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public Clouds
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud links
 

Último

Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"DianaGray10
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
UiPath Studio Web workshop series - Day 5
UiPath Studio Web workshop series - Day 5UiPath Studio Web workshop series - Day 5
UiPath Studio Web workshop series - Day 5DianaGray10
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideIEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideHironori Washizaki
 

Último (20)

Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
UiPath Studio Web workshop series - Day 5
UiPath Studio Web workshop series - Day 5UiPath Studio Web workshop series - Day 5
UiPath Studio Web workshop series - Day 5
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideIEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
 

Solving Grid problems through glidein monitoring

  • 1. glideinWMS training Solving Grid problems through glidein monitoring i.e. The Grid debugging part of G.Factory operations by Igor Sfiligoi (UCSD) glideinWMS training Grid debugging 1
  • 2. Glidein Factory Operations ● Factory node operations ● Serving VO Frontend Admin requests ● Keeping up with changes in the Grid ● Debugging Grid problems ● The most time-consuming part ● Effectively we help solve Grid problems, through glidein monitoring
  • 3. Reminder - Glideins ● A glidein is a properly configured Condor startd daemon submitted as a Grid job [Diagram components: Frontend, Factory, Submit node, Central manager, CE, Worker node (glidein Startd, Job); arrows labeled: Monitor Condor, Match, Request glideins, Submit glideins]
  • 4. What can go wrong in the Grid? ● Many places where things can go wrong ● Essentially at any of the arrows below [Diagram components: Factory, Submit node, Central manager, CE, Worker node (glidein Startd, Job)]
  • 5. What can go wrong in the Grid? ● In particular ● CE may refuse to accept glideins
  • 6. What can go wrong in the Grid? ● In particular ● CE may not start glideins ● Or fail to tell us what the status of the job is
  • 7. What can go wrong in the Grid? ● In particular ● The worker node may be broken/misconfigured – Thus validation will fail ● Many reasons
  • 8. What can go wrong in the Grid? ● In particular ● The WAN networking may not work properly ● The CM never hears from the Startd ● Or Schedd cannot talk to Startd ● Can be selective
  • 9. What can go wrong in the Grid? ● In particular ● Or the security infrastructure could be broken – CAs missing – Time discrepancies – Etc.
  • 10. What can go wrong in the Grid? ● In particular ● The site may refuse to start the user job – e.g. gLExec
  • 11. What can go wrong with glideins? ● And there are also non-Grid problems ● Jobs not matching ● But that's beyond the scope of this document
  • 12. Problem classification ● Most often we see WN problems ● Typically easy to diagnose ● Followed by CEs refusing glideins ● Then there are misbehaving CEs ● Very hard to diagnose! ● Everything else quite rare ● But usually hard to diagnose as well
  • 13. Grid debugging: Validation problems, i.e. problems on Worker Nodes
  • 14. WN problems ● The glidein startup script runs a list of validation scripts ● If any of them fails, the WN is considered broken ● This way user jobs never get to broken WNs ● Two sources of tests ● Glidein Factory ● VO Frontend ● Of course, if a validation script cannot be fetched from either Web server, it is considered a failure as well
  • 15. Types of tests ● The glideinWMS SW comes with a set of standard tests (provided by the factory): ● Grid environment present (e.g. CAs) ● Some free disk on $PWD and on /tmp ● Enough FE-provided proxy lifetime remaining ● gLExec related tests ● OS type ● Each VO may have its own needs, e.g.: ● Is VO SW pre-installed and accessible?
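To make the "some free disk on $PWD and on /tmp" test on slide 15 concrete, here is a minimal Python sketch of such a check. It is not the actual glideinWMS validation script; the function name `check_free_disk` and the 1 GiB threshold are assumptions made for illustration, since the slides do not state how much free space is required.

```python
import os
import shutil

# Hypothetical threshold: the slides do not say how much free disk is required.
MIN_FREE_BYTES = 1 * 1024**3


def check_free_disk(paths, min_free_bytes=MIN_FREE_BYTES):
    """Return one error message per path that has too little free space."""
    errors = []
    for path in paths:
        free = shutil.disk_usage(path).free
        if free < min_free_bytes:
            errors.append(
                f"{path}: only {free} bytes free, need {min_free_bytes}"
            )
    return errors
```

A validation wrapper could call `check_free_disk([os.getcwd(), "/tmp"])`, print any returned messages so they are delivered back to the factory, and exit non-zero when the list is not empty.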
  • 16. Discovering the problems ● Any error message printed out by the validation script will be delivered back to the factory ● After the glidein terminates ● Most validation scripts provide a clear indication of what went wrong ● And we strive to get all to do it! ● New machine-readable format being introduced ● With v2_6_2
  • 17. Typical ops ● Noticing that a large fraction of glideins for a site are failing is easy ● Just look at the monitoring ● And we are getting a daily email as well ● Discovering what exactly is broken is not too difficult either ● Just parse the logs (with appropriate tools) ● Will get even easier when all scripts return machine-readable information
  • 18. Action items ● Not much we can do directly (unless this is the result of a misconfiguration on our part) ● Typically, we open a ticket with the site ● Provide the list of nodes where it happens (rare to have the whole site broken) ● A concise but complete error report is essential for a speedy resolution ● In a minority of cases we have to contact the VO FE admin, e.g. ● Unclear error messages ● Non-WN specific validation errors
  • 19. Black hole nodes ● There is one further WN problem ● Black hole WNs ● WNs that accept glidein jobs, but don't execute them ● glidein_startup never has the chance to log anything ● Not even the node it is running on ● Thus, empty log files! ● We can infer we have a black hole node at a site by looking at job timing (in Condor-G logs) ● Good jobs run for at least 20 mins
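The timing-based inference on slide 19 can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a list of `(site, start, end)` tuples with times in seconds, already extracted from the Condor-G logs by some other tool, and the names `short_job_fraction`, `suspect_sites` and the 50% cutoff are hypothetical. Only the 20-minute figure comes from the slide.

```python
from collections import defaultdict

# From the slide: good jobs run for at least 20 minutes.
MIN_GOOD_RUNTIME_S = 20 * 60


def short_job_fraction(jobs):
    """Map each site to the fraction of its glideins that ended too early.

    jobs: iterable of (site, start_seconds, end_seconds) tuples.
    """
    total = defaultdict(int)
    short = defaultdict(int)
    for site, start, end in jobs:
        total[site] += 1
        if end - start < MIN_GOOD_RUNTIME_S:
            short[site] += 1
    return {site: short[site] / total[site] for site in total}


def suspect_sites(jobs, max_short_fraction=0.5):
    """Return sites where more than max_short_fraction of jobs were short.

    The 0.5 default is an arbitrary illustrative cutoff.
    """
    fractions = short_job_fraction(jobs)
    return sorted(s for s, f in fractions.items() if f > max_short_fraction)
```

Because a black hole node leaves empty log files, the sketch works per site rather than per node; narrowing it down to individual nodes would still need help from the site.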
  • 20. Grid debugging: CE refusing the glideins
  • 21. CE Refusing the glideins ● CE admin has the right to refuse anyone ● But usually does not change his mind overnight ● First time accessing a site is an issue on its own – Not covered here ● When things go wrong, the typical reason is ● CE service down ● Problems in the Security/Auth infrastructure ● CE seriously misconfigured/broken
  • 22. Expected vs Unexpected ● Some “problems” are expected ● e.g. the CE is down for scheduled maintenance ● Nothing to do in this case! – Just a monitoring issue ● So, checking the maintenance DB is important! ● If not, we have to notify the site ● The VO FEs are not getting the CPU slots they are asking for
  • 23. Discovering the problem ● Condor-G reacts in two different ways ● Does nothing – We still have monitoring showing the job did not progress from Waiting→Pending ● Puts the job on Hold ● The G.Factory will react on Held jobs ● Releasing them a few times → Condor-G retries ● Removing them after a while – Just to be replaced with identical glideins ● Most non-trivial problems do not resolve by themselves
  • 24. Action items (for unexpected problems) ● Most of the time, not much we can do directly ● Will just open a ticket with the site ● If there is any useful info in the HoldReason, we pass it on ● DN of the proxy is the most valuable info ● But it could be our problem, too ● Found many Condor-G problems in the past ● Comparing the behavior of many G.Factory instances can confirm or exclude this ● Ad hoc solutions are needed if this is the case
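The Held-job handling on slide 23 and the HoldReason reporting on slide 24 can be pictured with a small Python sketch. It does not reproduce the G.Factory code: the function names and the limit of three releases are assumptions (the slide only says "a few times"), and the hold reasons are assumed to be plain strings already read from the job ClassAds.

```python
from collections import Counter

# Hypothetical limit: the slide only says Held jobs are released "a few times".
MAX_RELEASES = 3


def held_job_action(release_count, max_releases=MAX_RELEASES):
    """Decide what to do with a Held glidein: release it again or remove it."""
    return "release" if release_count < max_releases else "remove"


def summarize_hold_reasons(hold_reasons):
    """Tally hold-reason strings, most frequent first, for a site ticket."""
    return Counter(hold_reasons).most_common()
```

The summary is meant as raw material for the ticket opened with the site; the removed glideins would simply be replaced by identical ones, so the tally matters more than any single job.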
  • 25. Grid debugging: CE not properly handling the glideins
  • 26. Problematic CE ● Three basic types of problems: ● Glideins not starting ● Improper monitoring information ● Output files not being delivered to client ● And there are two more ● Unexpected policies that kill glideins
  • 27. Glideins not starting ● The CE scheduling policy is not available to us ● So often not obvious if we are just low priority or something else is going on ● GF/Condor-G does not see it as an error condition ● We usually don't act on it, unless ● The VO FE admin complains, or ● We have been given explicit guidance on the expected startup rates ● Not much for us to investigate ● Just tell the site admin “Jobs are not starting”
  • 28. Glideins being killed by the site ● Ideally, our glideins should fit within the policies of the site (but getting this info is not trivial, remember?) ● But sometimes they don't ● So they get killed hard ● Discovering this from our side is very hard ● We often just notice empty log files ● Not an error for Condor-G ● Often learn of this because the VO complains ● If and when we understand the problem, we can deal with it ourselves ● i.e. we configure the glideins to stay within the limits
  • 29. Preemption ● Some sites will preempt our glideins if higher priority jobs get into the queue ● Effectively killing our glideins ● Not an actual error ● Sites have the right to do it! ● But it can interfere with our monitoring/ops ● We may see killed glideins, or ● We may see glideins that seem to run for a very long time (when automatically rescheduled on the CE) ● We have to efficiently filter these events out
  • 30. Improper monitoring info from CE ● A CE may not provide reliable information ● Each VO FE provides us with monitoring information about its central manager ● By comparing what it tells us with what the CE tells us, we can infer if there are problems ● A large, consistent discrepancy typically signals problems in the CE monitoring ● Very difficult to figure out what is going on ● We have no direct detailed data to act upon ● Mostly ad hoc detective work, prodding the black box ● Often inconclusive
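The comparison described on slide 30 ("a large, consistent discrepancy") can be sketched in Python as below. The inputs are assumed to be time-ordered pairs of running-glidein counts, one as reported by the CE and one as seen by the VO FE's central manager. The 30% threshold and the three-sample streak are illustrative assumptions, not values from the slides.

```python
def relative_discrepancy(ce_running, fe_running):
    """Relative difference between the two counts, in the range 0.0 to 1.0."""
    largest = max(ce_running, fe_running)
    if largest == 0:
        return 0.0
    return abs(ce_running - fe_running) / largest


def consistent_discrepancy(samples, threshold=0.3, min_consecutive=3):
    """True if the counts disagree by more than threshold for
    min_consecutive samples in a row.

    samples: time-ordered list of (ce_running, fe_running) pairs.
    Both defaults are hypothetical values chosen for illustration.
    """
    streak = 0
    for ce_running, fe_running in samples:
        if relative_discrepancy(ce_running, fe_running) > threshold:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0
    return False
```

Requiring a streak rather than a single bad sample reflects the word "consistent" on the slide: a momentary mismatch between the two views is expected, since glideins take time to start and to register.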
  • 31. Lack of output files ● The glidein output files contain ● Accounting information ● Detailed logging ● Without other problems, mostly an annoyance ● But much more often paired with glideins failing ● Making failure diagnostics close to impossible ● Extremely hard to diagnose the root cause ● Sometimes we may infer it (black holes, killed glideins, ...) ● For actual CE problems it requires help from many parties, including us, the site admins and SW developers
  • 32. Grid debugging: Networking problems
  • 33. Glideins are network heavy ● Each glidein opens several long‑lived TCP connections (in CCB mode) ● Can overwhelm networking gear – e.g. NATs can run out of spare ports ● Problems can have non-linear behavior ● Will work fine on small scale ● Will degrade after a while – Not necessarily a step function, though ● Outright denials due to firewalls are also a problem
  • 34. Diagnostics and action items ● Not trivial to detect ● Errors often in the glidein logs ● But difficult to interpret ● And we are lacking tools for automatically detecting this ● Not much we can do directly ● A problem between the VO services and the site – So we notify both ● However, we usually end up assisting as experts
  • 35. Grid debugging: Authentication problems
  • 36. Security is delicate stuff ● Grid security mechanisms paranoid by design ● “Availability” is the last to be considered ● The main focus is keeping the “bad guys” out ● So they are extremely delicate ● If any piece of the chain breaks, everything breaks ● Things that can go wrong (non-exhaustive list): ● Missing CA(s) ● Expired CRLs ● Expired glidein proxy ● Wrong system time (clock skew)
  • 37. Diagnostics and action items ● Finding the root cause is usually hard ● Errors are in the glidein logs ● But usually do not provide enough info (to avoid giving up too much info to a hypothetical attacker) ● And we are lacking tools for automatically detecting this ● Have to distinguish between site problems and VO problems, too ● Only obvious if only a fraction fails (→ WN problem) ● Else, may need to get both sides involved to properly diagnose the root cause
  • 38. Grid debugging: Job startup problems
  • 39. gLExec (1) ● The biggest source of problems, by far, is gLExec refusing to accept a user proxy ● Resulting in jobs not starting ● BTW, Condor is not good at handling gLExec denials ● We can only partially test gLExec during validation ● May behave differently based on the proxy used ● Its behavior can change in time ● And final users may be the source of the problem ● e.g. by letting the proxy expire (Condor could catch these, and hopefully soon will)
  • 40. gLExec (2) ● Non-trivial to detect ● Errors are in the glidein logs ● But we lack the tools to extract them ● Finding the root cause is impossible without site admin help ● gLExec policies are a site secret ● We thus just notify the site, providing the failing user DN
  • 41. Configuration problems ● Condor can be configured to run a wrapper around the user job ● To customize the user environment ● Usually provided by the VO FE ● If that fails, the user job fails with it ● Luckily, failures are rare ● If we notice them, we notify the VO FE admins ● However, they often notice before we do
  • 42. Other job startup problems ● By default, we validate the node only at glidein startup ● WN conditions may change by the time a job is scheduled to run – e.g. the disk fills up ● The errors are usually only seen by the final users ● So we hardly ever notice these kinds of problems ● We should do better: Condor supports periodic validation tests, we just don't use them right now
  • 43. Summary ● The Grid world is a good approximation of a chaotic system ● There are thus many failure modes ● The pilot paradigm hides most of the failures from the final users ● But the failures are still there ● Resulting in wasted/underused CPU cycles ● The G.Factory operators are in the best position to diagnose the root cause of the failures ● By having a global view ● However, they cannot solve the problems by themselves
  • 44. Acknowledgments ● This document was sponsored by grants from the US NSF and US DOE, and by the UC system