[gram-user] SEG not responding to PBS/Torque
Norman Ives
norman.ives at gmail.com
Fri Jun 13 10:54:39 CDT 2008
Hiya list
I'm struggling to get ws-gram to work with PBS. The service was
installed as part of the osg-0.8.0 CE kit. PBS was installed partly
by OSCAR and partly by fiddling.
The fork job manager is working correctly.
The gatekeeper service is running on a node called
osg.phys.wits.ac.za and the pbs_server is running on
container.phys.wits.ac.za.
Submitting directly to PBS via qsub is successful, from osg.phys or
container.phys.
Submissions via the pre-ws gatekeeper also appear to be working,
using /jobmanager-pbs.
When I use the web service, the following happens:
[norm@osg ~]$ globusrun-ws -submit -s -F osg.phys.wits.ac.za:9443 -Ft PBS -c /bin/cp /home/norm/x /home/norm/y
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:1e39aa56-3958-11dd-8d38-5906e67bdc5d
Termination time: 06/14/2008 14:50 GMT
Current job state: Unsubmitted
After the "Termination time" line, a few minutes pass before the
"Current job state: Unsubmitted" line appears. At this point I cancel
the job, because it is clear that PBS has happily executed the job and
moved on with its life.
I followed the (excellent) documentation at
http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/developer-index.html#s-wsgram-developer-troubleshooting
Here are some notes (on the node osg.phys, where the gatekeeper runs):
>> cat $GLOBUS_LOCATION/etc/globus-pbs.conf
log_path=/var/spool/pbs/server_logs/
>> ls /var/spool/pbs/server_logs/
pbs_server.log
>> submit a job /bin/sleep 30
>> grep 'Received local job ID' container-real.log
Received local job ID 127.container.phys.wits.ac.za
>> qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
127.container             STDIN            norm                   0 R workq
>> grep '127.container' /var/spool/pbs/server_logs/pbs_server.log
06/13/2008 17:35:56;0008;PBS_Server;Job;127.container.phys.wits.ac.za;Job Queued at request of norm@osg.phys.wits.ac.za, owner = norm@osg.phys.wits.ac.za, job name = STDIN, queue = workq
06/13/2008 17:35:57;0008;PBS_Server;Job;127.container.phys.wits.ac.za;Job Modified at request of root@container.phys.wits.ac.za
06/13/2008 17:35:57;0008;PBS_Server;Job;127.container.phys.wits.ac.za;Job Run at request of root@container.phys.wits.ac.za
06/13/2008 17:35:57;0008;PBS_Server;Job;127.container.phys.wits.ac.za;Job Modified at request of root@container.phys.wits.ac.za
06/13/2008 17:36:27;0010;PBS_Server;Job;127.container.phys.wits.ac.za;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=4652kb resources_used.vmem=30520kb resources_used.walltime=00:00:30
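For anyone who wants to repeat the grep step on a longer log, here is a rough Python sketch that splits each record on semicolons and keeps the ones for a single job. The six-field layout (timestamp, event code, server, object type, object id, message) is inferred from the excerpt above, and the names PbsLogRecord, parse_pbs_log_line and records_for_job are made up for illustration; none of this is part of GRAM or PBS.

```python
from typing import Iterable, List, NamedTuple, Optional


class PbsLogRecord(NamedTuple):
    """One server log record; field names are illustrative, not official."""
    timestamp: str
    event_code: str
    server: str
    object_type: str
    object_id: str
    message: str


def parse_pbs_log_line(line: str) -> Optional[PbsLogRecord]:
    """Split a record into six fields, or return None if it does not fit.

    Assumption: the first five fields never contain ';', so the
    message is everything after the fifth separator.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return PbsLogRecord(*parts)


def records_for_job(lines: Iterable[str], job_id: str) -> List[PbsLogRecord]:
    """Return the records whose object is the given job, in file order."""
    matches = []
    for line in lines:
        record = parse_pbs_log_line(line)
        if record and record.object_type == "Job" and record.object_id == job_id:
            matches.append(record)
    return matches


# Two records taken from the log excerpt above.
SAMPLE = [
    "06/13/2008 17:35:57;0008;PBS_Server;Job;127.container.phys.wits.ac.za;"
    "Job Run at request of root@container.phys.wits.ac.za",
    "06/13/2008 17:36:27;0010;PBS_Server;Job;127.container.phys.wits.ac.za;"
    "Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=4652kb "
    "resources_used.vmem=30520kb resources_used.walltime=00:00:30",
]
```

Calling records_for_job(SAMPLE, "127.container.phys.wits.ac.za") returns both records, the last of which carries the Exit_status=0 message.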
If I try something like
>> $GLOBUS_LOCATION/libexec/globus-scheduler-event-generator -s pbs -t 1202300131
I get no output. The script seems to complain if I set an invalid
log_path. It does not complain if I point log_path to an empty
directory. Also, if log_path is set to some empty location, I don't
see any complaints in container-real.log.
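To see how the -t value relates to the job times in the log, here is a small Python sketch. It assumes the -t argument is a Unix epoch timestamp in seconds, and it assumes the PBS log timestamps can be read as MM/DD/YYYY HH:MM:SS in a timezone you supply; the helper names and the utc_offset_hours parameter are my own inventions.

```python
import calendar
import time
from datetime import datetime, timezone


def epoch_to_utc(seconds: int) -> datetime:
    """Interpret an epoch value (assumed: seconds since 1970) as UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def pbs_time_to_epoch(stamp: str, utc_offset_hours: float = 0.0) -> int:
    """Convert a log timestamp such as '06/13/2008 17:35:56' to epoch seconds.

    utc_offset_hours is the offset of the log's local time from UTC.
    It defaults to 0 because the log excerpt does not state a timezone.
    """
    parsed = time.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    return calendar.timegm(parsed) - int(utc_offset_hours * 3600)


if __name__ == "__main__":
    print(epoch_to_utc(1202300131).isoformat())
    print(pbs_time_to_epoch("06/13/2008 17:35:56"))
```

Under these assumptions the value 1202300131 corresponds to 6 February 2008 (UTC), well before the job queued on 06/13/2008, so a replay starting from that timestamp would be expected to cover the job if the log is being read at all.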
So my next guess is that gram is not looking at the pbs_server.log
file. I'd be very happy if someone could shed some light.
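One way to narrow this guess down is to check what the configured log_path actually exposes to the account running the container. The sketch below is only a sanity check under my own assumptions: it treats globus-pbs.conf as plain key=value lines (skipping '#' comments is a guess about the format), and it says nothing about which file names the event generator really opens. The function names are hypothetical.

```python
import os
from typing import Dict, List


def read_key_value_conf(path: str) -> Dict[str, str]:
    """Read simple key=value lines; blank and '#' lines are skipped (assumed)."""
    settings = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def readable_files(directory: str) -> List[str]:
    """List regular files in directory that the current user can read."""
    found = []
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        if os.path.isfile(full) and os.access(full, os.R_OK):
            found.append(name)
    return found


# Example use, run as the same user as the container:
#   conf = read_key_value_conf(
#       os.path.join(os.environ["GLOBUS_LOCATION"], "etc", "globus-pbs.conf"))
#   print(conf.get("log_path"), readable_files(conf["log_path"]))
```

If the list comes back empty, or does not include the file that holds the job records, that would at least be consistent with the event generator staying silent.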
Just for completeness - /var/spool/pbs is mounted read-only, over nfs
(since the pbs_server is running on another node).
Regards
Norm