This document provides an overview of common problems that can occur in a Grid and how they are diagnosed and addressed through glidein monitoring. It discusses issues that may happen at various points, such as compute elements refusing glideins, validation failures on worker nodes, authentication problems, and job startup failures due to issues like gLExec configuration. The document aims to explain the Grid debugging process and the key role glidein monitoring plays in solving Grid problems.
Solving Grid problems through glidein monitoring
1. glideinWMS training
Solving Grid problems through glidein monitoring
i.e. The Grid debugging part of G.Factory operations
by Igor Sfiligoi (UCSD)
2. Glidein Factory Operations
● Factory node operations
● Serving VO Frontend admin requests
● Keeping up with changes in the Grid
● Debugging Grid problems
– The most time-consuming part
● Effectively, we help solve Grid problems through glidein monitoring
3. Reminder - Glideins
● A glidein is a properly configured Condor startd daemon submitted as a Grid job
[Architecture diagram: the Frontend monitors the Submit node and the Condor Central manager, matches, and requests glideins; the Factory submits glideins through the CE; on the Worker node the glidein runs a Startd, which then runs the user Job]
4. What can go wrong in the Grid?
● Many places where things can go wrong
● Essentially at any of the links in the diagram above: Factory → CE → glidein/Startd on the Worker node, and the Startd's connections back to the Submit node and Central manager
5. What can go wrong in the Grid?
● In particular
● CE may refuse to accept glideins
6. What can go wrong in the Grid?
● In particular
● CE may not start glideins
● Or fail to tell us what the status of the job is
7. What can go wrong in the Grid?
● In particular
● The worker node may be broken/misconfigured
– Thus validation will fail
– Many reasons
8. What can go wrong in the Grid?
● In particular
● The WAN networking may not work properly
● The CM never hears from the Startd
● Or the Schedd cannot talk to the Startd
● Can be selective
9. What can go wrong in the Grid?
● In particular
● Or the security infrastructure could be broken
– CAs missing
– Time discrepancies
– Etc.
10. What can go wrong in the Grid?
● In particular
● The site may refuse to start the user job
– e.g. gLExec
11. What can go wrong with glideins?
● And there are also non-Grid problems
● Jobs not matching
● But that's beyond the scope of this document
12. Problem classification
● Most often we see WN problems
– Typically easy to diagnose
● Followed by CEs refusing glideins
● Then there are misbehaving CEs
– Very hard to diagnose!
● Everything else quite rare
– But usually hard to diagnose as well
13. Grid debugging
Validation problems
i.e. Problems on Worker Nodes
14. WN problems
● The glidein startup script runs a list of validation scripts (the flow is sketched below)
● If any of them fails, the WN is considered broken
● This way user jobs never get to broken WNs
● Two sources of tests
– Glidein Factory
– VO Frontend
● Of course, if a validation script cannot be fetched from either Web server, that is considered a failure as well
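To make that flow concrete, here is a minimal sketch of the logic. The real glidein_startup is a shell script, so this is an illustration only; the function name, URLs and paths are assumptions, not the actual implementation.

    # Illustrative sketch only - not the actual glidein_startup implementation.
    import subprocess
    import urllib.request

    def run_validation(script_urls, workdir="."):
        """Fetch and run each validation script; stop at the first failure."""
        for url in script_urls:
            local = workdir + "/" + url.rsplit("/", 1)[-1]
            try:
                urllib.request.urlretrieve(url, local)  # fetch failure == WN failure
            except OSError as err:
                return False, "cannot fetch %s: %s" % (url, err)
            res = subprocess.run(["/bin/sh", local], capture_output=True, text=True)
            if res.returncode != 0:
                # This error text is what eventually travels back to the Factory logs
                return False, "%s failed: %s" % (url, res.stderr.strip())
        return True, "all validation scripts passed"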
15. Types of tests
● The glideinWMS SW comes with a set of standard tests (provided by the factory):
– Grid environment present (e.g. CAs)
– Some free disk on $PWD and on /tmp (one possible version is sketched below)
– Enough FE-provided proxy lifetime remaining
– gLExec related tests
– OS type
● Each VO may have its own needs, e.g.:
– Is the VO SW pre-installed and accessible?
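As an example of what one such test might look like, here is a hedged sketch of a free-disk check on $PWD and /tmp; the 1 GB threshold is an arbitrary illustration, not the factory's actual limit.

    # Illustrative free-disk check; the threshold is an assumption, not the real value.
    import os
    import shutil
    import sys

    MIN_FREE_MB = 1024  # assumed minimum free space, for illustration only

    for path in (os.getcwd(), "/tmp"):
        free_mb = shutil.disk_usage(path).free // (1024 * 1024)
        if free_mb < MIN_FREE_MB:
            print("VALIDATION ERROR: only %d MB free on %s" % (free_mb, path))
            sys.exit(1)  # non-zero exit marks the WN as broken
    sys.exit(0)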
16. Discovering the problems
● Any error message printed out by a validation script will be delivered back to the factory
– After the glidein terminates
● Most validation scripts provide a clear indication of what went wrong
– And we strive to get them all to do it!
● New machine-readable format being introduced
– With v2_6_2
17. Typical ops
● Noticing that a large fraction of glideins for a site are failing is easy
– Just look at the monitoring
– And we are getting a daily email as well
● Discovering what exactly is broken is not too difficult either
– Just parse the logs (with appropriate tools; see the sketch below)
– Will get even easier when all scripts return machine-readable information
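A minimal sketch of the kind of tooling involved: aggregate glidein exit records per factory entry and flag the ones with a high failure fraction. The record fields and the threshold are illustrative, not the real log or monitoring format.

    # Illustrative aggregation; "entry" and "failed" are assumed field names.
    from collections import defaultdict

    def failure_fractions(records):
        total = defaultdict(int)
        failed = defaultdict(int)
        for rec in records:
            total[rec["entry"]] += 1
            if rec["failed"]:
                failed[rec["entry"]] += 1
        return {entry: failed[entry] / total[entry] for entry in total}

    # Example: flag entries where more than half of the recent glideins failed
    records = [{"entry": "SiteA", "failed": True}, {"entry": "SiteA", "failed": False},
               {"entry": "SiteB", "failed": True}, {"entry": "SiteB", "failed": True}]
    for entry, frac in failure_fractions(records).items():
        if frac > 0.5:
            print("check %s: %.0f%% of glideins failing" % (entry, 100 * frac))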
18. Action items
● Not much we can do directly (unless this is the result of a misconfiguration on our part)
● Typically, we open a ticket with the site
– Provide the list of nodes where it happens (rare to have the whole site broken)
– A concise but complete error report is essential for a speedy resolution
● In a minority of cases we have to contact the VO FE admin, e.g.
– Unclear error messages
– Non-WN-specific validation errors
19. Black hole nodes
● There is one further WN problem
– Black hole WNs
– WNs that accept glidein jobs, but don't execute them
● glidein_startup never has the chance to log anything
– Not even the node it is running on
– Thus, empty log files!
● We can infer we have a black hole node at a site by looking at job timing in the Condor-G logs (see the sketch below)
– Good jobs run for at least 20 mins
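A hedged sketch of that timing heuristic: many glideins at one site that terminate well under 20 minutes and leave empty logs point at a black hole node somewhere at the site. The record fields and the count threshold are illustrative, not the actual Condor-G log format.

    # Illustrative heuristic; field names and threshold are assumptions.
    MIN_GOOD_RUNTIME = 20 * 60  # "good jobs run for at least 20 mins"

    def black_hole_suspect_sites(jobs, threshold=10):
        short_and_silent = {}
        for job in jobs:
            runtime = job["end_time"] - job["start_time"]
            if runtime < MIN_GOOD_RUNTIME and job["log_bytes"] == 0:
                short_and_silent[job["site"]] = short_and_silent.get(job["site"], 0) + 1
        # Many short, silent jobs at one site suggest a black hole worker node there
        return [site for site, n in short_and_silent.items() if n >= threshold]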
20. Grid debugging
CE refusing the glideins
21. CE Refusing the glideins
● The CE admin has the right to refuse anyone
– But usually does not change his mind overnight
● Accessing a site for the first time is an issue on its own
– Not covered here
● When things go wrong, the typical reasons are
– CE service down
– Problems in the security/auth infrastructure
– CE seriously misconfigured/broken
22. Expected vs Unexpected
● Some “problems” are expected
– e.g. the CE is down for scheduled maintenance
● Nothing to do in this case!
– Just a monitoring issue
● So, checking the maintenance DB is important!
● If not expected, we have to notify the site
– The VO FEs are not getting the CPU slots they are asking for
23. Discovering the problem
● Condor-G reacts in two different ways
– Does nothing – we still have monitoring showing the job did not progress from Waiting→Pending
– Puts the job on Hold
● The G.Factory will react on Held jobs (see the sketch below)
– Releasing them a few times → Condor-G retries
– Removing them after a while, just to be replaced with identical glideins
● For most non-trivial problems, the problem does not solve itself
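A sketch of that reaction logic, assuming an illustrative release limit (the factory's configured values differ):

    # Illustrative policy for Held glideins; the release limit is an assumption.
    MAX_RELEASES = 3

    def react_to_held_glidein(job):
        """Decide what to do with a glidein that Condor-G has put on Hold."""
        if job["releases"] < MAX_RELEASES:
            job["releases"] += 1
            return "condor_release"  # let Condor-G retry the submission
        return "condor_rm"           # give up; a fresh, identical glidein replaces it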
24. Action items (for unexpected problems)
● Most of the time, not much we can do directly
– Will just open a ticket with the site
– If there is any useful info in the HoldReason, we pass it on
– The DN of the proxy is the most valuable info
● But it could be our problem, too
– Found many Condor-G problems in the past
– Comparing the behavior of many G.Factory instances can confirm or exclude this
– Ad-hoc solutions needed if this is the case
25. Grid debugging
CE not properly handling the glideins
26. Problematic CE
● Three basic types of problems:
– Glideins not starting
– Improper monitoring information
– Output files not being delivered to the client
● And there are two more:
– Unexpected site policies that kill glideins
– Preemption
27. Glideins not starting
● The CE scheduling policy is not available to us
– So it is often not obvious if we are just low priority or something else is going on
– GF/Condor-G does not see it as an error condition
● We usually don't act on it, unless
– The VO FE admin complains, or
– We have been given explicit guidance on the expected startup rates
● Not much for us to investigate
– Just tell the site admin “Jobs are not starting”
28. Glideins being killed by the site
● Ideally, our glideins should fit within the policies of the site
– But getting this info is not trivial, remember?
● But sometimes they don't
– So they get killed hard
● Discovering this from our side is very hard
– We often just notice empty log files
– Not an error for Condor-G
– Often we learn of this because the VO complains
● If and when we understand the problem, we can deal with it ourselves
– i.e. we configure the glideins to stay within the limits
29. Preemption
● Some sites will preempt our glideins if higher-priority jobs get into the queue
– Effectively killing our glideins
● Not an actual error
– Sites have the right to do it!
● But it can mess with our monitoring/ops
– We may see killed glideins, or
– We may see glideins that seem to run for a very long time (when automatically rescheduled on the CE)
● We have to efficiently filter these events out
30. Improper monitoring info from CE
● A CE may not provide reliable information
● Each VO FE provides us with monitoring information about its central manager
– By comparing what it tells us with what the CE tells us, we can infer if there are problems (see the sketch below)
– A large, consistent discrepancy typically signals problems in the CE monitoring
● Very difficult to figure out what is going on
– We have no direct detailed data to act upon
– Mostly ad-hoc detective work, prodding the black box
– Often inconclusive
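A sketch of that cross-check, with made-up numbers and a made-up tolerance: compare the running-glidein count the CE reports with the number of startds the VO Frontend's collector actually sees.

    # Illustrative comparison; the tolerance and example counts are assumptions.
    def monitoring_discrepancy(ce_running, collector_running, tolerance=0.2):
        if ce_running == 0:
            return collector_running > 0
        gap = abs(ce_running - collector_running) / float(ce_running)
        return gap > tolerance

    # Example: the CE claims 500 running glideins, the collector only sees 120 startds
    if monitoring_discrepancy(ce_running=500, collector_running=120):
        print("CE monitoring looks unreliable for this entry")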
31. Lack of output files
● The glidein output files contain
– Accounting information
– Detailed logging
● Without other problems, mostly an annoyance
● But much more often paired with glideins failing
– Making failure diagnostics close to impossible
● Extremely hard to diagnose the root cause
– Sometimes we may infer it (black holes, killed glideins, ...)
– For actual CE problems, it requires help from many parties, including us, the site admins and the SW developers
32. Grid debugging
Networking problems
33. Glideins are network heavy
● Each glidein opens several long-lived TCP connections (in CCB mode)
● Can overwhelm networking gear
– e.g. NATs can run out of spare ports (see the estimate below)
– Although outright denials due to firewalls are also a problem
● Problems can have non-linear behavior
– Will work fine at small scale
– Will degrade after a while
– Not necessarily a step function, though
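A back-of-the-envelope sketch of why NATs run out of ports; the per-glidein connection count is an assumed illustration, since the real number depends on the Condor/CCB configuration.

    # Illustrative arithmetic only; CONNS_PER_GLIDEIN is an assumption.
    CONNS_PER_GLIDEIN = 5        # assumed long-lived TCP connections per glidein
    NAT_PORT_BUDGET = 65536      # theoretical upper bound for a single NAT address

    def ports_needed(n_glideins, conns=CONNS_PER_GLIDEIN):
        return n_glideins * conns

    for n in (1000, 5000, 20000):
        used = ports_needed(n)
        print("%5d glideins -> ~%6d ports (%.0f%% of one NAT address)"
              % (n, used, 100.0 * used / NAT_PORT_BUDGET))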
34. Diagnostics and action items
● Not trivial to detect
– Errors often in the glidein logs
– But difficult to interpret
– And we are lacking tools for automatically detecting this
● Not much we can do directly
– A problem between the VO services and the site
– So we notify both
● However, we usually end up assisting as experts
35. Grid debugging
Authentication problems
36. Security is delicate stuff
● Grid security mechanisms are paranoid by design
– “Availability” is the last thing to be considered
– The main focus is keeping the “bad guys” out
● So they are extremely delicate
– If any piece of the chain breaks, everything breaks
● Things that can go wrong (non-exhaustive list; two quick checks are sketched below):
– Missing CA(s)
– Expired CRLs
– Expired glidein proxy
– Wrong system time (clock skew)
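Two of these failure modes can be checked cheaply from the worker node side. A hedged sketch follows; the proxy path, thresholds and time source are illustrative assumptions.

    # Illustrative sanity checks; thresholds and the reference time source are assumptions.
    import subprocess
    import time

    def proxy_still_valid(proxy_path, min_seconds=3600):
        # openssl exits non-zero if the certificate expires within min_seconds
        result = subprocess.run(
            ["openssl", "x509", "-checkend", str(min_seconds), "-noout", "-in", proxy_path],
            capture_output=True)
        return result.returncode == 0

    def clock_skew_ok(reference_epoch, max_skew=300):
        # reference_epoch would come from a trusted source (e.g. an HTTP Date header)
        return abs(time.time() - reference_epoch) <= max_skew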
37. Diagnostics and action items
● Finding the root cause is usually hard
– Errors are in the glidein logs
– But they usually do not provide enough info (to avoid giving up too much info to a hypothetical attacker)
– And we are lacking tools for automatically detecting this
● Have to distinguish between site problems and VO problems, too
– Only obvious if just a fraction fails (→ WN problem)
– Else, may need to get both sides involved to properly diagnose the root cause
39. gLExec (1)
● The biggest source of problems, by far, is gLExec refusing to accept a user proxy
– Resulting in jobs not starting
– BTW, Condor is not good at handling gLExec denials
● We can only partially test gLExec during validation
– It may behave differently based on the proxy used
– Its behavior can change over time
● And final users may be the source of the problem
– e.g. by letting the proxy expire
– Condor could catch these, and hopefully soon will
40. gLExec (2)
● Non-trivial to detect
– Errors are in the glidein logs
– But we lack the tools to extract them
● Finding the root cause is impossible without site admin help
– gLExec policies are a site secret
● We thus just notify the site, providing the failing user DN
41. Configuration problems
● Condor can be configured to run a wrapper around the user job
– To customize the user environment
– Usually provided by the VO FE
● If that fails, the user job fails with it
● Luckily, failures are rare
– If we notice them, we notify the VO FE admins
– However, they often notice before we do
42. Other job startup problems
● By default, we validate the node only at glidein startup
● WN conditions may change by the time a job is scheduled to run
– e.g. the disk fills up
– We should do better: Condor supports periodic validation tests, we just don't use them right now
● The errors are usually only seen by the final users
– So we hardly ever notice these kinds of problems
43. Summary
● The Grid world is a good approximation of a chaotic system
– There are thus many failure modes
● The pilot paradigm hides most of the failures from the final users
– But the failures are still there
– Resulting in wasted/underused CPU cycles
● The G.Factory operators are in the best position to diagnose the root cause of the failures
– By having a global view
– However, they cannot solve the problems by themselves
44. Acknowledgments
● This document was sponsored by grants from the US NSF and US DOE, and by the UC system