This document provides an overview of common problems that can occur in a Grid and how they are diagnosed and addressed through glidein monitoring. It discusses issues that may happen at various points, such as compute elements refusing glideins, validation failures on worker nodes, authentication problems, and job startup failures due to issues like gLExec configuration. The document aims to explain the Grid debugging process and the key role glidein monitoring plays in solving Grid problems.
Solving Grid problems through glidein monitoring
1. glideinWMS training
Solving Grid problems through glidein monitoring
i.e. The Grid debugging part of G.Factory operations
by Igor Sfiligoi (UCSD)
2. Glidein Factory Operations
● Factory node operations
● Serving VO Frontend admin requests
● Keeping up with changes in the Grid
● Debugging Grid problems
– The most time-consuming part
● Effectively, we help solve Grid problems through glidein monitoring
3. Reminder - Glideins
● A glidein is a properly configured Condor startd daemon submitted as a Grid job
[Architecture diagram: the Frontend monitors the Submit node and the Condor Central manager, matches, and requests glideins; the Factory submits glideins through the CE; on the Worker node the glidein runs a Startd, which then runs the user Job]
4. What can go wrong in the Grid?
● Many places where things can go wrong
● Essentially at any of the links in the diagram above: Factory → CE → glidein/Startd on the Worker node, and the Startd's connections back to the Submit node and Central manager
5. What can go wrong in the Grid?
● In particular
● CE may refuse to accept glideins
6. What can go wrong in the Grid?
● In particular
● CE may not start glideins
● Or fail to tell us what the status of the job is
7. What can go wrong in the Grid?
● In particular
● The worker node may be broken/misconfigured
– Thus validation will fail
– Many reasons
8. What can go wrong in the Grid?
● In particular
● The WAN networking may not work properly
● The CM never hears from the Startd
● Or the Schedd cannot talk to the Startd
● Can be selective
9. What can go wrong in the Grid?
● In particular
● Or the security infrastructure could be broken
– CAs missing
– Time discrepancies
– Etc.
10. What can go wrong in the Grid?
● In particular
● The site may refuse to start the user job
– e.g. gLExec
11. What can go wrong with glideins?
● And there are also non-Grid problems
● Jobs not matching
● But that's beyond the scope of this document
12. Problem classification
● Most often we see WN problems
– Typically easy to diagnose
● Followed by CEs refusing glideins
● Then there are misbehaving CEs
– Very hard to diagnose!
● Everything else quite rare
– But usually hard to diagnose as well
13. Grid debugging
Validation problems
i.e. Problems on Worker Nodes
14. WN problems
● The glidein startup script runs a list of validation scripts (the flow is sketched below)
● If any of them fails, the WN is considered broken
● This way user jobs never get to broken WNs
● Two sources of tests
– Glidein Factory
– VO Frontend
● Of course, if a validation script cannot be fetched from either Web server, that is considered a failure as well
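To make that flow concrete, here is a minimal sketch of the logic. The real glidein_startup is a shell script, so this is an illustration only; the function name, URLs and paths are assumptions, not the actual implementation.

    # Illustrative sketch only - not the actual glidein_startup implementation.
    import subprocess
    import urllib.request

    def run_validation(script_urls, workdir="."):
        """Fetch and run each validation script; stop at the first failure."""
        for url in script_urls:
            local = workdir + "/" + url.rsplit("/", 1)[-1]
            try:
                urllib.request.urlretrieve(url, local)  # fetch failure == WN failure
            except OSError as err:
                return False, "cannot fetch %s: %s" % (url, err)
            res = subprocess.run(["/bin/sh", local], capture_output=True, text=True)
            if res.returncode != 0:
                # This error text is what eventually travels back to the Factory logs
                return False, "%s failed: %s" % (url, res.stderr.strip())
        return True, "all validation scripts passed"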
15. Types of tests
● The glideinWMS SW comes with a set of standard tests (provided by the factory):
– Grid environment present (e.g. CAs)
– Some free disk on $PWD and on /tmp (one possible version is sketched below)
– Enough FE-provided proxy lifetime remaining
– gLExec related tests
– OS type
● Each VO may have its own needs, e.g.:
– Is the VO SW pre-installed and accessible?
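As an example of what one such test might look like, here is a hedged sketch of a free-disk check on $PWD and /tmp; the 1 GB threshold is an arbitrary illustration, not the factory's actual limit.

    # Illustrative free-disk check; the threshold is an assumption, not the real value.
    import os
    import shutil
    import sys

    MIN_FREE_MB = 1024  # assumed minimum free space, for illustration only

    for path in (os.getcwd(), "/tmp"):
        free_mb = shutil.disk_usage(path).free // (1024 * 1024)
        if free_mb < MIN_FREE_MB:
            print("VALIDATION ERROR: only %d MB free on %s" % (free_mb, path))
            sys.exit(1)  # non-zero exit marks the WN as broken
    sys.exit(0)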
16. Discovering the problems
● Any error message printed out by a validation script will be delivered back to the factory
– After the glidein terminates
● Most validation scripts provide a clear indication of what went wrong
– And we strive to get them all to do it!
● New machine-readable format being introduced
– With v2_6_2
17. Typical ops
● Noticing that a large fraction of glideins for a site are failing is easy
– Just look at the monitoring
– And we are getting a daily email as well
● Discovering what exactly is broken is not too difficult either
– Just parse the logs (with appropriate tools; see the sketch below)
– Will get even easier when all scripts return machine-readable information
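A minimal sketch of the kind of tooling involved: aggregate glidein exit records per factory entry and flag the ones with a high failure fraction. The record fields and the threshold are illustrative, not the real log or monitoring format.

    # Illustrative aggregation; "entry" and "failed" are assumed field names.
    from collections import defaultdict

    def failure_fractions(records):
        total = defaultdict(int)
        failed = defaultdict(int)
        for rec in records:
            total[rec["entry"]] += 1
            if rec["failed"]:
                failed[rec["entry"]] += 1
        return {entry: failed[entry] / total[entry] for entry in total}

    # Example: flag entries where more than half of the recent glideins failed
    records = [{"entry": "SiteA", "failed": True}, {"entry": "SiteA", "failed": False},
               {"entry": "SiteB", "failed": True}, {"entry": "SiteB", "failed": True}]
    for entry, frac in failure_fractions(records).items():
        if frac > 0.5:
            print("check %s: %.0f%% of glideins failing" % (entry, 100 * frac))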
18. Action items
● Not much we can do directly (unless this is the result of a misconfiguration on our part)
● Typically, we open a ticket with the site
– Provide the list of nodes where it happens (rare to have the whole site broken)
– A concise but complete error report is essential for a speedy resolution
● In a minority of cases we have to contact the VO FE admin, e.g.
– Unclear error messages
– Non-WN-specific validation errors
19. Black hole nodes
● There is one further WN problem
– Black hole WNs
– WNs that accept glidein jobs, but don't execute them
● glidein_startup never has the chance to log anything
– Not even the node it is running on
– Thus, empty log files!
● We can infer we have a black hole node at a site by looking at job timing in the Condor-G logs (see the sketch below)
– Good jobs run for at least 20 mins
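A hedged sketch of that timing heuristic: many glideins at one site that terminate well under 20 minutes and leave empty logs point at a black hole node somewhere at the site. The record fields and the count threshold are illustrative, not the actual Condor-G log format.

    # Illustrative heuristic; field names and threshold are assumptions.
    MIN_GOOD_RUNTIME = 20 * 60  # "good jobs run for at least 20 mins"

    def black_hole_suspect_sites(jobs, threshold=10):
        short_and_silent = {}
        for job in jobs:
            runtime = job["end_time"] - job["start_time"]
            if runtime < MIN_GOOD_RUNTIME and job["log_bytes"] == 0:
                short_and_silent[job["site"]] = short_and_silent.get(job["site"], 0) + 1
        # Many short, silent jobs at one site suggest a black hole worker node there
        return [site for site, n in short_and_silent.items() if n >= threshold]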
20. Grid debugging
CE refusing the glideins
21. CE Refusing the glideins
● The CE admin has the right to refuse anyone
– But usually does not change his mind overnight
● Accessing a site for the first time is an issue on its own
– Not covered here
● When things go wrong, the typical reasons are
– CE service down
– Problems in the security/auth infrastructure
– CE seriously misconfigured/broken
22. Expected vs Unexpected
● Some “problems” are expected
– e.g. the CE is down for scheduled maintenance
● Nothing to do in this case!
– Just a monitoring issue
● So, checking the maintenance DB is important!
● If not expected, we have to notify the site
– The VO FEs are not getting the CPU slots they are asking for
23. Discovering the problem
● Condor-G reacts in two different ways
– Does nothing – we still have monitoring showing the job did not progress from Waiting→Pending
– Puts the job on Hold
● The G.Factory will react on Held jobs (see the sketch below)
– Releasing them a few times → Condor-G retries
– Removing them after a while, just to be replaced with identical glideins
● For most non-trivial problems, the problem does not solve itself
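A sketch of that reaction logic, assuming an illustrative release limit (the factory's configured values differ):

    # Illustrative policy for Held glideins; the release limit is an assumption.
    MAX_RELEASES = 3

    def react_to_held_glidein(job):
        """Decide what to do with a glidein that Condor-G has put on Hold."""
        if job["releases"] < MAX_RELEASES:
            job["releases"] += 1
            return "condor_release"  # let Condor-G retry the submission
        return "condor_rm"           # give up; a fresh, identical glidein replaces it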
24. Action items (for unexpected problems)
● Most of the time, not much we can do directly
– Will just open a ticket with the site
– If there is any useful info in the HoldReason, we pass it on
– The DN of the proxy is the most valuable info
● But it could be our problem, too
– Found many Condor-G problems in the past
– Comparing the behavior of many G.Factory instances can confirm or exclude this
– Ad-hoc solutions needed if this is the case
25. Grid debugging
CE not properly handling the glideins
26. Problematic CE
● Three basic types of problems:
– Glideins not starting
– Improper monitoring information
– Output files not being delivered to the client
● And there are two more:
– Unexpected site policies that kill glideins
– Preemption
27. Glideins not starting
● The CE scheduling policy is not available to us
– So it is often not obvious if we are just low priority or something else is going on
– GF/Condor-G does not see it as an error condition
● We usually don't act on it, unless
– The VO FE admin complains, or
– We have been given explicit guidance on the expected startup rates
● Not much for us to investigate
– Just tell the site admin “Jobs are not starting”
28. Glideins being killed by the site
● Ideally, our glideins should fit within the policies of the site
– But getting this info is not trivial, remember?
● But sometimes they don't
– So they get killed hard
● Discovering this from our side is very hard
– We often just notice empty log files
– Not an error for Condor-G
– Often we learn of this because the VO complains
● If and when we understand the problem, we can deal with it ourselves
– i.e. we configure the glideins to stay within the limits
29. Preemption
● Some sites will preempt our glideins if higher-priority jobs get into the queue
– Effectively killing our glideins
● Not an actual error
– Sites have the right to do it!
● But it can mess with our monitoring/ops
– We may see killed glideins, or
– We may see glideins that seem to run for a very long time (when automatically rescheduled on the CE)
● We have to efficiently filter these events out
30. Improper monitoring info from CE
● A CE may not provide reliable information
● Each VO FE provides us with monitoring information about its central manager
– By comparing what it tells us with what the CE tells us, we can infer if there are problems (see the sketch below)
– A large, consistent discrepancy typically signals problems in the CE monitoring
● Very difficult to figure out what is going on
– We have no direct detailed data to act upon
– Mostly ad-hoc detective work, prodding the black box
– Often inconclusive
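A sketch of that cross-check, with made-up numbers and a made-up tolerance: compare the running-glidein count the CE reports with the number of startds the VO Frontend's collector actually sees.

    # Illustrative comparison; the tolerance and example counts are assumptions.
    def monitoring_discrepancy(ce_running, collector_running, tolerance=0.2):
        if ce_running == 0:
            return collector_running > 0
        gap = abs(ce_running - collector_running) / float(ce_running)
        return gap > tolerance

    # Example: the CE claims 500 running glideins, the collector only sees 120 startds
    if monitoring_discrepancy(ce_running=500, collector_running=120):
        print("CE monitoring looks unreliable for this entry")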
31. Lack of output files
● The glidein output files contain
– Accounting information
– Detailed logging
● Without other problems, mostly an annoyance
● But much more often paired with glideins failing
– Making failure diagnostics close to impossible
● Extremely hard to diagnose the root cause
– Sometimes we may infer it (black holes, killed glideins, ...)
– For actual CE problems, it requires help from many parties, including us, the site admins and the SW developers
32. Grid debugging
Networking problems
33. Glideins are network heavy
● Each glidein opens several long-lived TCP connections (in CCB mode)
● Can overwhelm networking gear
– e.g. NATs can run out of spare ports (see the estimate below)
– Although outright denials due to firewalls are also a problem
● Problems can have non-linear behavior
– Will work fine at small scale
– Will degrade after a while
– Not necessarily a step function, though
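A back-of-the-envelope sketch of why NATs run out of ports; the per-glidein connection count is an assumed illustration, since the real number depends on the Condor/CCB configuration.

    # Illustrative arithmetic only; CONNS_PER_GLIDEIN is an assumption.
    CONNS_PER_GLIDEIN = 5        # assumed long-lived TCP connections per glidein
    NAT_PORT_BUDGET = 65536      # theoretical upper bound for a single NAT address

    def ports_needed(n_glideins, conns=CONNS_PER_GLIDEIN):
        return n_glideins * conns

    for n in (1000, 5000, 20000):
        used = ports_needed(n)
        print("%5d glideins -> ~%6d ports (%.0f%% of one NAT address)"
              % (n, used, 100.0 * used / NAT_PORT_BUDGET))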
34. Diagnostics and action items
● Not trivial to detect
– Errors often in the glidein logs
– But difficult to interpret
– And we are lacking tools for automatically detecting this
● Not much we can do directly
– A problem between the VO services and the site
– So we notify both
● However, we usually end up assisting as experts
35. Grid debugging
Authentication problems
36. Security is delicate stuff
● Grid security mechanisms are paranoid by design
– “Availability” is the last thing to be considered
– The main focus is keeping the “bad guys” out
● So they are extremely delicate
– If any piece of the chain breaks, everything breaks
● Things that can go wrong (non-exhaustive list; two quick checks are sketched below):
– Missing CA(s)
– Expired CRLs
– Expired glidein proxy
– Wrong system time (clock skew)
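Two of these failure modes can be checked cheaply from the worker node side. A hedged sketch follows; the proxy path, thresholds and time source are illustrative assumptions.

    # Illustrative sanity checks; thresholds and the reference time source are assumptions.
    import subprocess
    import time

    def proxy_still_valid(proxy_path, min_seconds=3600):
        # openssl exits non-zero if the certificate expires within min_seconds
        result = subprocess.run(
            ["openssl", "x509", "-checkend", str(min_seconds), "-noout", "-in", proxy_path],
            capture_output=True)
        return result.returncode == 0

    def clock_skew_ok(reference_epoch, max_skew=300):
        # reference_epoch would come from a trusted source (e.g. an HTTP Date header)
        return abs(time.time() - reference_epoch) <= max_skew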
37. Diagnostics and action items
● Finding the root cause is usually hard
– Errors are in the glidein logs
– But they usually do not provide enough info (to avoid giving up too much info to a hypothetical attacker)
– And we are lacking tools for automatically detecting this
● Have to distinguish between site problems and VO problems, too
– Only obvious if just a fraction fails (→ WN problem)
– Else, may need to get both sides involved to properly diagnose the root cause
39. gLExec (1)
● The biggest source of problems, by far, is gLExec refusing to accept a user proxy
– Resulting in jobs not starting
– BTW, Condor is not good at handling gLExec denials
● We can only partially test gLExec during validation
– It may behave differently based on the proxy used
– Its behavior can change over time
● And final users may be the source of the problem
– e.g. by letting the proxy expire
– Condor could catch these, and hopefully soon will
40. gLExec (2)
● Non-trivial to detect
– Errors are in the glidein logs
– But we lack the tools to extract them
● Finding the root cause is impossible without site admin help
– gLExec policies are a site secret
● We thus just notify the site, providing the failing user DN
41. Configuration problems
● Condor can be configured to run a wrapper around the user job
– To customize the user environment
– Usually provided by the VO FE
● If that fails, the user job fails with it
● Luckily, failures are rare
– If we notice them, we notify the VO FE admins
– However, they often notice before we do
42. Other job startup problems
● By default, we validate the node only at glidein startup
● WN conditions may change by the time a job is scheduled to run
– e.g. the disk fills up
– We should do better: Condor supports periodic validation tests, we just don't use them right now
● The errors are usually only seen by the final users
– So we hardly ever notice these kinds of problems
43. Summary
● The Grid world is a good approximation of a chaotic system
– There are thus many failure modes
● The pilot paradigm hides most of the failures from the final users
– But the failures are still there
– Resulting in wasted/underused CPU cycles
● The G.Factory operators are in the best position to diagnose the root cause of the failures
– By having a global view
– However, they cannot solve the problems by themselves
44. Acknowledgments
● This document was sponsored by grants from the US NSF and US DOE, and by the UC system