Pilot Factory uses schedd glideins to submit pilot jobs locally from a remote resource. A pilot generator program communicates with a database and periodically submits pilots with desired configurations to matchmake and run various job types. This allows bypassing Condor-G and its overhead for large job submissions while taking advantage of local scheduling on the remote resource. Future work includes integrating pilots more directly with Condor startds for additional functionality.
2. Problem to Solve (1)
n Pilot
¨ Probe the resource (HTTP, environment, interpreter, other executables, etc.)
¨ Pull jobs from a remote server (e.g. Panda server)
¨ Matchmaking
n Group jobs in different categories
E.g. production jobs, analysis jobs (CHARMM, …), test jobs, …
n Other criteria: number of CPUs, RAM, etc.
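The matchmaking step above can be sketched as follows. This is a minimal illustration, not the actual pilot or Panda server logic: the `Job` and `PilotSlot` classes, their field names, and the first-fit rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """A queued user job (hypothetical fields)."""
    job_id: int
    category: str      # e.g. "production", "analysis", "test"
    min_cpus: int
    min_ram_mb: int


@dataclass
class PilotSlot:
    """What a pilot found when probing its resource (hypothetical fields)."""
    category: str      # job category this pilot was submitted to serve
    cpus: int
    ram_mb: int


def match_job(slot: PilotSlot, queue: List[Job]) -> Optional[Job]:
    """Return the first queued job whose category and needs fit the slot."""
    for job in queue:
        if (job.category == slot.category
                and job.min_cpus <= slot.cpus
                and job.min_ram_mb <= slot.ram_mb):
            return job
    return None


queue = [
    Job(1, "production", min_cpus=4, min_ram_mb=8000),
    Job(2, "analysis", min_cpus=1, min_ram_mb=2000),
    Job(3, "analysis", min_cpus=1, min_ram_mb=1000),
]
slot = PilotSlot(category="analysis", cpus=2, ram_mb=1500)
matched = match_job(slot, queue)
print(matched.job_id if matched else None)  # prints 3
```

Job 2 is skipped because it asks for more RAM than the slot offers, so the analysis pilot picks up job 3.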
3. Problem to Solve (2)
n Current approach of pilot submissions
¨ Local pool : Vanilla
¨ Remote pool: Condor-G
n Large amounts of user jobs (production + analysis) ~ large amounts of Condor-G pilot jobs ~ computational overhead on gatekeepers (e.g. large memory consumption)
4. Solution
n Is there any way to bypass GRAM to
submit jobs to remote machines?
n Local submissions, but how?
¨ We need something that continuously submits local pilot jobs on the gatekeeper
¨ Solution: Pilot Factory
5. Pilot Factory Overview
n Pilot Factory is an application that combines
the following ideas:
¨ schedd glidein
¨ pilot submission program (or pilot generator)
n What is glidein?
¨ Mini-Condor pool on a remote machine
n A complete Condor pool has at least 5 components:
i.e. master, startd, schedd, collector, negotiator
n Glidein: {master, startd}, {master, schedd}, … etc
¨ Properly configured Condor daemons submitted as a batch job
8. Schedd Glidein
n Logic based on the startd glidein (two Condor-G jobs to set up)
n Usage: by running a glidein schedd on the gatekeeper, the schedd serves as a gateway between the submit host and grid sites
n Mini Condor pool with schedd functionalities:
¨ Submit host
¨ Maintain persistent queue of jobs
¨ Communicate with native batch system and forward user jobs
n Condor, PBS, LSF, …etc
¨ Manipulate job queues through the following commands:
n condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio
¨ Security Features (GSI)
n Who is authorized to set up Pilot Factory?
9. Schedd Glidein Example (1)
n Command: // schedd glidein #1
condor_glidein -count 1 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk01.racf.bnl.gov/jobmanager-fork -type schedd -forcesetup
(Use the fork jobmanager, since we want the schedd to run on the gatekeeper.)
n Command: // schedd glidein #2
condor_glidein -count 1 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk02.racf.bnl.gov/jobmanager-fork -type schedd -forcesetup
n Command: // schedd glidein #3, #4, #5
condor_glidein -count 3 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork nostos.cs.wisc.edu/jobmanager-fork -type schedd -forcesetup
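The three commands differ only in gatekeeper host and count, so they could be generated by a small helper. The sketch below only builds and prints the command lines, copying the flags verbatim from this slide. The function name `glidein_command` is hypothetical, and the flags are not checked against any particular `condor_glidein` release.

```python
import shlex
from typing import List


def glidein_command(gatekeeper: str, count: int = 1,
                    arch: str = "6.8.1-i686-pc-Linux-2.4") -> List[str]:
    """Build the argv for one schedd glidein request (flags as on the slide)."""
    contact = f"{gatekeeper}/jobmanager-fork"  # fork: run on the gatekeeper
    return [
        "condor_glidein",
        "-count", str(count),
        "-arch", arch,
        "-setup_jobmanager=jobmanager-fork",
        contact,
        "-type", "schedd",
        "-forcesetup",
    ]


requests = [
    ("gridgk01.racf.bnl.gov", 1),
    ("gridgk02.racf.bnl.gov", 1),
    ("nostos.cs.wisc.edu", 3),
]
for host, count in requests:
    # Print instead of executing; pass the list to subprocess.run to launch.
    print(" ".join(shlex.quote(arg) for arg in glidein_command(host, count)))
```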
11. Pilot Submission Program (Generator)
n Communicates with a DB server that maintains information about pilot jobs
¨ E.g. pilot_type, pilot_queue
n Pulls the desired pilot script from an external server
n Periodically submits pilot jobs (with the pilot script as executable)
¨ condor_submit
¨ qsub? No, not necessary, since …
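A minimal sketch of such a generator loop is shown below, assuming the DB query returns dictionaries with `script`, `pilot_type`, `pilot_queue`, and `count` keys. The function names, the `--type` / `--queue` pilot arguments, and the file names are hypothetical; only `condor_submit` and the submit-description keywords are standard Condor. The submit step is injectable so the loop can be tried without a Condor installation.

```python
import subprocess
import tempfile
import time
from typing import Callable, Dict, Iterable


def build_submit_description(req: Dict) -> str:
    """Return a Condor submit description that queues req['count'] pilots."""
    return "\n".join([
        "universe = vanilla",
        f"executable = {req['script']}",
        # Hypothetical pilot options; a real pilot script defines its own.
        f"arguments = --type {req['pilot_type']} --queue {req['pilot_queue']}",
        "output = pilot.$(Cluster).$(Process).out",
        "error = pilot.$(Cluster).$(Process).err",
        "log = pilot.log",
        f"queue {req['count']}",
    ]) + "\n"


def condor_submit(description: str) -> None:
    """Write the description to a temporary file and run condor_submit on it."""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(description)
    subprocess.run(["condor_submit", f.name], check=True)


def run_generator(fetch_requests: Callable[[], Iterable[Dict]],
                  submit: Callable[[str], None] = condor_submit,
                  cycles: int = 1, interval_s: float = 60.0) -> None:
    """Each cycle, ask the DB what pilots are wanted and submit them."""
    for cycle in range(cycles):
        for req in fetch_requests():
            submit(build_submit_description(req))
        if cycle + 1 < cycles:
            time.sleep(interval_s)


def fake_db() -> Iterable[Dict]:
    """Stand-in for the DB server query (hypothetical data)."""
    return [{"script": "pilot.sh", "pilot_type": "analysis",
             "pilot_queue": "example_queue", "count": 2}]


# Dry run: print the submit descriptions instead of calling condor_submit.
run_generator(fake_db, submit=print, cycles=2, interval_s=0.0)
```

Because the generator submits to the glidein schedd with `condor_submit`, it never has to call the native batch system's own submit command.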
12. Build Pilot Factory with Glidein
n Schedd glidein installed and executed on the gatekeeper
n User submits a Condor-C job with the pilot generator as the executable
¨ Generator runs on the gatekeeper as a local universe job supervised by the glidein schedd
n Generator submits pilots
¨ Types, frequency adjustable by users
¨ Depending on the native batch system, pilots can be submitted as grid universe jobs
¨ Along with GAHP and related binaries, the schedd has the ability to communicate with different batch systems
[Diagram labels: Grid Resource, JobManager, LSF, PBS, master, schedd, Pilot generator]
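How the submit description might vary with the native batch system can be sketched as below. This is an assumption-laden illustration: the function name is hypothetical, the choice of the vanilla universe for a native Condor pool is inferred from the "Local pool: Vanilla" slide, and the `grid_resource = pbs` / `grid_resource = lsf` form follows Condor manuals of that era (newer releases spell it differently), so check it against the installed version.

```python
def pilot_submit_description(batch_system: str, pilot_script: str,
                             count: int = 1) -> str:
    """Return a submit description suited to the site's native batch system."""
    system = batch_system.lower()
    if system == "condor":
        # Native Condor pool: pilots go in as ordinary vanilla jobs.
        lines = ["universe = vanilla"]
    elif system in ("pbs", "lsf"):
        # Other batch systems: grid universe, forwarded through the GAHP.
        lines = ["universe = grid", f"grid_resource = {system}"]
    else:
        raise ValueError(f"unsupported batch system: {batch_system}")
    lines += [
        f"executable = {pilot_script}",
        "log = pilot.log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


print(pilot_submit_description("PBS", "pilot.sh", count=5))
```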
13. Pilot Factory
[Diagram: Submit Node (Collector, Master, Negotiator, Schedd); Gatekeeper with {Globus, Condor|PBS|…}; Pilot Factory (master, schedd); Cluster Worker Nodes. Arrow labels: "Glidein request", "Connected to Collector", "Submit Pilots".]
14. Future Work
n Integrate the pilot with the Condor startd to implement a startd-based pilot
¨ The startd-based pilot retrieves the payload of a user job in the same way as the generic pilot does but, in addition, inherits the functionality of the Condor startd.
¨ The original intention was to run PFs with the startd-pilots on worker nodes (too greedy, unacceptable?)
¨ Using the Condor startd makes it easier to integrate with gLexec
n Transform Generic PF (GPF) to Startd PF (SPF)