Description of how to write custom validation scripts in glideinWMS, with an emphasis on the VO Frontend operations.
Part of the glideinWMS Training session held in Jan 2012 at UCSD.
glideinWMS validation scripts - glideinWMS Training Jan 2012
1. glideinWMS Training @ UCSD
GlideinWMS
Validation scripts
by Igor Sfiligoi (UCSD)
UCSD Jan 18th 2012 Validation Scripts 1
2. Overview
● Why validation scripts
● Anatomy of validation scripts
● Types of validation scripts
3. Reminder - Glideins
● A glidein is just a properly configured Condor
execution node submitted as a Grid job
● glideinWMS provides automation
[Diagram labels: Central manager (Collector, Negotiator), Submit node (Schedd), glidein Execution node (Startd), Job, Globus, CREAM, glideinWMS]
4. Reminder – Glidein script
● The glidein startup script is just an empty shell that:
● Downloads scripts, parameters and Condor bins
● Runs the scripts in order
● Does the final cleanup
● Two types of script:
● Node validation
● Condor configuration and startup
● If any of these fail, Condor will never be started
● Once Condor starts, glideinWMS is out of the way
5. As a consequence
If a validation script finds a bad worker node (WN)
Condor will not be started
No user jobs will ever fail there
6. Is validating at glidein startup
a good idea?
● Advantages:
● User jobs never land on “broken” nodes, so users are happy
● Failures are logged, so Factory admins can act on this info, notifying sites (who can fix the problem)
● Limitations:
● Tested only at glidein startup
– If a node “goes bad” after Condor startup, user jobs will still be fetched and will fail
– Condor provides cron-like capabilities for this
● Problems:
● Failed validation → wasted CPU
– Some jobs may still succeed, even if validation failed
– Can be solved by passing the test and setting attributes, but this will hide the problem from the Factory
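The cron-like capability mentioned above is Condor's startd cron mechanism, which runs a script periodically and merges the attributes it prints into the machine ClassAd. What follows is only a minimal sketch: the attribute name NODE_IS_HEALTHY, the job name NODECHECK, the script path and the writable-directory test are all assumptions made for illustration, and the configuration knobs shown in the comments should be checked against the Condor manual for the version in use.

```shell
#!/bin/bash
# Periodic node check meant to be run by Condor's startd cron.
# Possible configuration (illustrative values, not from the slides):
#   STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) NODECHECK
#   STARTD_CRON_NODECHECK_EXECUTABLE = /path/to/node_check.sh
#   STARTD_CRON_NODECHECK_PERIOD = 10m
#   START = $(START) && (NODE_IS_HEALTHY =!= False)

# Print one ClassAd attribute describing the state of the given directory.
# NODE_IS_HEALTHY is a hypothetical attribute name.
print_health_ad() {
    dir=$1
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "NODE_IS_HEALTHY = True"
    else
        echo "NODE_IS_HEALTHY = False"
    fi
}

# A real cron script would end with something like:
#   print_health_ad "${TMPDIR:-/tmp}"
```

With a START expression like the one in the comment, a node that goes bad after Condor startup stops accepting new jobs without the glidein having to exit.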
7. Anatomy of
a validation script
8. Validation scripts 101
● Any executable will do!
● There are no restrictions
● Can be compiled binary or a shell script
● Exit code checked
● ==0 - Success
● !=0 - Failure
● And, to the first approximation, this is all
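As an illustration of the exit-code contract above, here is a minimal sketch of a check written as a shell function. The function name and the choice of test (a writable directory) are assumptions made for this example and are not taken from glideinWMS.

```shell
#!/bin/bash
# Hypothetical check: succeed only if the given directory exists and is writable.
check_writable_dir() {
    dir=$1
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        return 0    # success: the glidein may proceed
    fi
    echo "Directory '$dir' is missing or not writable" 1>&2
    return 1        # failure: Condor must not be started
}

# In a real validation script the result becomes the exit code, e.g.:
#   check_writable_dir "${TMPDIR:-/tmp}" || exit 1
#   exit 0
```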
9. Validation scripts - I/O
● You may want to:
● Get some input
● Have some output
● Both handled through a dashboard file
● Filename passed as the only argument
to the validation scripts
10. Dashboard file
● Simple list of (key, value) pairs
● One per line (newline not allowed in either key or value)
● Space separated (space not allowed in the key)
● Hash (#) can be used for comments
GLIDEIN_Factory UCSD
GLIDEIN_Name Production_v4_2
GLIDEIN_Entry_Name CMS_T2_US_UCSD_gw2
GLIDECLIENT_Name UCSD-v5_3.main
GLIDEIN_WORK_DIR /data10/condor_local/execute/dir_22668/glide_B22745/main
GLIDEIN_Glexec_Use OPTIONAL
X509_CERT_DIR /wn-client/globus/TRUSTED_CA
GLIDEIN_Site UCSD
# This was calculated on the fly
CCB_ADDRESS glidein-collector.t2.ucsd.edu:9822
http://tinyurl.com/glideinWMS/doc.prd/factory/custom_scripts.html#glidein_config
11. Reading input
● Dashboard file as the first argument
● Then just look for the key and split on space
# here is my dashboard file
glidein_config=$1
# I expect only one key and no space in the value
glexec_bin=`awk '/^GLEXEC_BIN /{print $2}' "$glidein_config"`
if [ -z "$glexec_bin" ]; then
    # GLEXEC_BIN is not defined: report it and fail the validation
    echo "GLEXEC_BIN not found in $glidein_config" 1>&2
    exit 1
fi
…
exit 0
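The awk one-liner above only returns the first word of the value. Since the dashboard format forbids spaces in the key but not in the value, a lookup that keeps the whole value can be useful. The sketch below is one possible way to do that; get_config_value is a hypothetical helper name and is not provided by glideinWMS.

```shell
#!/bin/bash
# Print the value stored under a key in the dashboard file.
# Usage: get_config_value KEY FILE
get_config_value() {
    key=$1
    file=$2
    # A line defines the key only if it starts with "KEY ".
    # Everything after that first space is the value; the last definition wins.
    awk -v key="$key" '
        index($0, key " ") == 1 { val = substr($0, length(key) + 2); found = 1 }
        END { if (found) print val }
    ' "$file"
}

# Example use inside a validation script:
#   glexec_bin=$(get_config_value GLEXEC_BIN "$glidein_config")
```

Comment lines never match because they start with a hash, and a missing key simply yields empty output, which the caller can test with `[ -z ... ]`.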
12. Writing output
● You can just append to the file
● Just make sure it is properly formatted
# here is my dashboard file
glidein_config=$1
…
# tell condor to use glexec
echo 'GLEXEC_JOB True' >> "$glidein_config"
exit 0
● You should also make sure
the same key is not already defined
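One way to honor the "no duplicate keys" rule by hand is to drop any earlier definition before appending the new one. The function below is a sketch under that assumption; set_config_value is a made-up name, and this is not the implementation of the helper that glideinWMS itself ships.

```shell
#!/bin/bash
# Define KEY as VALUE in the dashboard file, replacing any earlier definition.
# Usage: set_config_value KEY VALUE FILE
set_config_value() {
    key=$1
    value=$2
    file=$3
    tmp="$file.tmp.$$"
    # Copy every line except existing definitions of this key ...
    awk -v key="$key" 'index($0, key " ") != 1' "$file" > "$tmp"
    # ... then append the new definition and move the result into place.
    printf '%s %s\n' "$key" "$value" >> "$tmp"
    mv "$tmp" "$file"
}

# Example use inside a validation script:
#   set_config_value GLEXEC_JOB True "$glidein_config"
```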
13. Helper function
● glideinWMS provides a helper BASH function to
avoid duplicate keys
● External SH file, referenced as
ADD_CONFIG_LINE_SOURCE
● The function name inside is
add_config_line
# here is my dashboard file (MUST be called glidein_config)
glidein_config=$1
# get helper function
add_config_line_source=`awk '/^ADD_CONFIG_LINE_SOURCE /{print $2}' "$glidein_config"`
source "$add_config_line_source"
…
# tell condor to use glexec
add_config_line 'GLEXEC_JOB' 'True'
14. Influencing Condor behavior
● By default, keys in the dashboard file are ignored by the Condor startup/configuration script
● Anything you write into it is just for your own consumption (e.g. for other scripts of yours)
● A special whitelist file lists the keys that should be passed to Condor
● Referenced as CONDOR_VARS_FILE
● Helper function available: add_condor_vars_line (again, source ADD_CONFIG_LINE_SOURCE)
15. Condor Vars file
● Each line describes one key
● Seven fields, space (or tab) separated
● Key
● Type - I = Integer, S = String, C = Expr.
● Default value - “-” for no default
● Condor Name - “+” = Key name
● Is it required? - Y|N (useful when others have to define it)
● Should be exported to ClassAd? - Y|N
● Should be exported to job environment? - “-” no, “+” Key name, “@” Condor Name
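To make the seven-field layout concrete, the sketch below checks that a candidate line has the expected shape. It assumes that no field contains whitespace, which the slides do not state, and check_vars_line is a hypothetical name used only for this illustration.

```shell
#!/bin/bash
# Return 0 if the argument looks like a valid Condor vars line, 1 otherwise.
# Fields: key, type, default, condor name, required, to ClassAd, to job env.
check_vars_line() {
    printf '%s\n' "$1" | awk '
        NF != 7                                { exit 1 }  # exactly seven fields
        $2 !~ /^[ISC]$/                        { exit 1 }  # type: I, S or C
        $5 !~ /^[YN]$/                         { exit 1 }  # required?
        $6 !~ /^[YN]$/                         { exit 1 }  # export to ClassAd?
        $7 != "-" && $7 != "+" && $7 != "@"    { exit 1 }  # export to job env?
    '
}

# Example: an expression exported to the ClassAd under its own name
#   check_vars_line 'GLEXEC_JOB C True + Y Y -' && echo "line looks fine"
```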
http://tinyurl.com/glideinWMS/doc.prd/factory/custom_scripts.html#condor_vars
16. Example
# here is my dashboard file (MUST be called glidein_config)
glidein_config=$1
# extract where to find the vars file
# (MUST be called condor_vars_file)
condor_vars_file=`awk '/^CONDOR_VARS_FILE /{print $2}' "$glidein_config"`
# get helper function
add_config_line_source=`awk '/^ADD_CONFIG_LINE_SOURCE /{print $2}' "$glidein_config"`
source "$add_config_line_source"
…
# This should already have been set
add_condor_vars_line "GLEXEC_BIN" "C" "-" "GLEXEC" "Y" "N" "-"
# tell condor to use glexec
add_config_line 'GLEXEC_JOB' 'True'
add_condor_vars_line "GLEXEC_JOB" "C" "True" "+" "Y" "Y" "-"
# tell user where is the TMPDIR
add_config_line 'GLEXEC_TMP' "$TMPDIR"
add_condor_vars_line "GLEXEC_TMP" "S" "-" "+" "Y" "Y" "+"
17. Error messages
● Your script found a problem
● Now what?
● You definitely want to exit with a non-zero exit code
● But, please, also print an error message!
● With enough information to understand
why the script failed
● Will allow the Factory admins to act on it
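A small reporting function keeps error messages consistent across checks. The sketch below is only an illustration: report_failure is a hypothetical name, and including the script name and the node name in the message is a suggestion rather than a glideinWMS requirement.

```shell
#!/bin/bash
# Print a self-explanatory error message on stderr and return non-zero.
report_failure() {
    # Say which script failed, on which node, and why.
    echo "$(basename "$0"): validation failed on $(uname -n): $*" 1>&2
    return 1
}

# Typical use inside a validation script:
#   if [ ! -x "$glexec_bin" ]; then
#       report_failure "glexec not executable at '$glexec_bin'"
#       exit 1
#   fi
```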
18. Planned improvements
(still speculation at this point)
● Current error codes and messages arbitrary
● Mostly good enough for manual debugging
● But cannot really automatically act on them
● Want to add some more structure
● Based on OSG Common Output Format proposal
https://twiki.grid.iu.edu/bin/view/SoftwareTools/CommonTestFormat#Alain_s_proposal_Version_4_evolu
● In addition to the exit code, scripts are expected to write a status file
● If the file is not present, the caller will assume “Error unknown”
● The status file will be read and interpreted by the caller and propagated to the Factory
● Now we can start thinking about automatically acting on errors!
19. Standardized error reasons
(preliminary - still speculation at this point)
● To allow for automated feedback, need
standardized error reasons
● This is what I currently envision:
● Config - e.g. Impossible combinations
● Corruption - e.g. SHA1 check failed
● WN Resource - e.g. Disk full or glexec not found
● Network - e.g. Cannot talk to VO Collector
● VO Proxy - e.g. Proxy too short
● VO Data - e.g. VO SW not installed
20. Examples
(preliminary - still speculation at this point)
<?xml version="1.0"?>
<OSGTestResult id="glideinWMS.check_disk" version="7.5.4">
  <result>
    <status>OK</status>
    <metric name="diskspace" ts="2012-01-12T15:02:20"
            uri="local">/tmp/glidein_15432/</metric>
  </result>
  <detail>Enough disk space found.</detail>
</OSGTestResult>

<?xml version="1.0"?>
<OSGTestResult id="glideinWMS.check_proxy" version="7.5.4">
  <result>
    <status>FAILED</status>
    <metric name="failure" ts="..." uri="local">VO Proxy</metric>
    <metric name="proxy" ts="2012-01-12T15:02:21"
            uri="local">/tmp/glidein_15432/proxy/a.proxy</metric>
  </result>
  <detail>Proxy had less than 12h left.</detail>
</OSGTestResult>
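A validation script could produce documents of this shape with a few lines of shell. The sketch below simply mirrors the preliminary examples above and should not be read as a final format: write_test_result is a hypothetical name, the version string is copied from the slide, the output goes to stdout because the status file name is not specified here, and the arguments are assumed to contain no characters that need XML escaping.

```shell
#!/bin/bash
# Print a test result document modelled on the preliminary examples.
# Usage: write_test_result ID STATUS DETAIL [FAILURE_REASON]
write_test_result() {
    id=$1
    status=$2
    detail=$3
    reason=${4:-}
    ts=$(date +%Y-%m-%dT%H:%M:%S)
    printf '%s\n' '<?xml version="1.0"?>'
    printf '%s\n' "<OSGTestResult id=\"$id\" version=\"7.5.4\">"
    printf '%s\n' "  <result>"
    printf '%s\n' "    <status>$status</status>"
    # The failure metric carries one of the standardized error reasons.
    if [ -n "$reason" ]; then
        printf '%s\n' "    <metric name=\"failure\" ts=\"$ts\" uri=\"local\">$reason</metric>"
    fi
    printf '%s\n' "  </result>"
    printf '%s\n' "  <detail>$detail</detail>"
    printf '%s\n' "</OSGTestResult>"
}

# Example:
#   write_test_result glideinWMS.check_proxy FAILED \
#       "Proxy had less than 12h left." "VO Proxy"
```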
22. Why should you use VS?
● Of course: What we discussed until now
● Check for obviously broken nodes
● But also:
● To discover and advertise dynamic information
● Non-trivial configuration
● Site-specific customizations
23. Dynamic information
● Some information dynamic by nature
● E.g. location of VO software
● You want to discover at run-time where
it is located
● And fail, if you cannot find it!
● Makes life easier for the users
● Once discovered, good practice to advertise it
● In the ClassAd, the job environment, or both
24. Example
# check if CMSSW installed locally
if [ -f "$CMSSW" ]; then
    source "$CMSSW"
    if [ -z "$CMSSW_LIST" -o -z "$CMSSW_LOC" ]; then
        echo "Corrupted CMSSW at $CMSSW!" 1>&2
        exit 1
    fi
else
    echo "CMSSW not found!" 1>&2
    exit 1
fi
# publish to user job env
add_config_line "CMSSW_LOC" "$CMSSW_LOC"
add_condor_vars_line "CMSSW_LOC" "S" "-" "+" "Y" "N" "+"
# publish to Condor
add_config_line "CMSSW_LIST" "$CMSSW_LIST"
add_condor_vars_line "CMSSW_LIST" "S" "-" "+" "Y" "Y" "-"
exit 0
25. Non-trivial configuration
(Not really a “validation” script)
● You may want to generate some data on the fly
● e.g. a random seed
let s=$RANDOM%123+17
add_config_line "MY_SEED" "$s"
add_condor_vars_line "MY_SEED" "I" "-" "+" "Y" "N" "+"
● And sometimes it is just inconvenient to specify
some values in the frontend XML file
● e.g. a long list
l="1"
for ((i=2; i<100; i++)); do
    l="$l:$i"
done
add_config_line "MY_LIST" "$l"
add_condor_vars_line "MY_LIST" "S" "-" "+" "Y" "N" "+"
26. Site specific customization
● Currently, the frontend XML file does not allow site-specific customizations
● Unless you want to have a group per site! Limiting, since there is only one level of groups
● And there is the option for you to arrange for the Factory to provide it for you, but maintenance will be a mess
● You can code the per-site config into a “validation script”
● Still not ideal, but may be better than the alternative, especially if you can apply a rule with few exceptions
27. Example
glidein_config=$1
site=`awk '/^GLIDEIN_CMSSITE /{print $2}' "$glidein_config"`
country=`echo $site| awk '{print substr($1,8,2)}'`
if [ "$country" == "US" ]; then
    myvar="OSG"
elif [ "$country" == "IT" -o "$country" == "FR" ]; then
    myvar="EGI"
else
    echo "Cannot run in $country" 1>&2
    exit 1
fi
add_config_line "MY_VAR" "$myvar"
# MY_VAR holds a string, so its type is S
add_condor_vars_line "MY_VAR" "S" "-" "+" "Y" "N" "+"
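The "rule with few exceptions" idea can be kept readable by listing the exceptions first and falling back to the general rule. The sketch below is one way to arrange that; grid_for_site is a hypothetical function, the exception entry is made up, and site names are assumed to look like CMS_T2_US_UCSD so that the country code sits at characters 8 and 9, as in the substr call above.

```shell
#!/bin/bash
# Print the value to advertise for a site, or fail if the site is not supported.
grid_for_site() {
    site=$1
    # Exceptions first: individual sites that do not follow the general rule.
    case "$site" in
        CMS_T2_IT_Example) echo "OSG"; return 0 ;;   # made-up exception
    esac
    # General rule: derive the value from the country code.
    country=$(echo "$site" | awk '{print substr($1,8,2)}')
    case "$country" in
        US)    echo "OSG" ;;
        IT|FR) echo "EGI" ;;
        *)     echo "Cannot run in $country" 1>&2; return 1 ;;
    esac
}

# Example use inside a validation script:
#   myvar=$(grid_for_site "$site") || exit 1
```

Adding a new special case is then a one-line change, while the general rule stays in a single place.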
29. Limits of validation scripts
● Whatever is discovered on the WN is
● Used by the script for its own testing
● At best, propagated to glidein ClassAd or job env
● The discovered info cannot be used for
Frontend matchmaking purposes!
● At best, for Negotiator matchmaking
● As a result, you may be requesting glideins that will never run any user jobs (if the condition is common to all WNs)
● They either fail validation or do not match
30. What can you do?
● How do you notice it?
● If validation errors
– The Factory admins will likely contact you
● If Negotiator not matching jobs
– You will need to discover it yourself
● What to do afterwards?
● Tune the script (maybe you were just too aggressive?)
● Manually blacklist a site in your frontend XML (pretty much a hack!)
● Or convince the Factory admins to advertise VO specific info (can be hard to maintain long term)
32. Pointers
● The official glideinWMS project Web page is
http://tinyurl.com/glideinWMS
● glideinWMS development team is reachable at
glideinwms-support@fnal.gov
● The OSG glidein factory is reachable at
osg-gfactory-support@physics.ucsd.edu
33. Acknowledgments
● The glideinWMS is a CMS-led project
developed mostly at FNAL, with contributions
from UCSD and ISI
● The glideinWMS factory operations at UCSD are
sponsored by OSG
● The funding comes from NSF, DOE and the
UC system