[gram-user] SEG not responding to PBS/Torque

Norman Ives norman.ives at gmail.com
Fri Jun 13 10:54:39 CDT 2008


Hiya list

I'm struggling to get ws-gram to work with PBS. The service was  
installed as part of the osg-0.8.0 CE kit. PBS was installed partly  
by OSCAR and partly by fiddling.

The fork job manager is working correctly.

The gatekeeper service is running on a node called  
osg.phys.wits.ac.za and the pbs_server is running on  
container.phys.wits.ac.za.

Submitting directly to PBS via qsub is successful, from osg.phys or  
container.phys.

Submissions via the pre-ws gatekeeper also appear to be working,  
using /jobmanager-pbs.

When I use the web service, the following happens:

[norm@osg ~]$ globusrun-ws -submit -s -F osg.phys.wits.ac.za:9443 -Ft PBS -c /bin/cp /home/norm/x /home/norm/y
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:1e39aa56-3958-11dd-8d38-5906e67bdc5d
Termination time: 06/14/2008 14:50 GMT
Current job state: Unsubmitted

After the second-to-last line is printed, a few minutes pass before  
the last line appears. At this point I cancel the job, because it is  
clear that PBS has happily executed the job and moved on with its life.

I followed the (excellent) documentation on
http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/developer-index.html#s-wsgram-developer-troubleshooting

Here are some notes (on the node osg.phys, where the gatekeeper runs):

 >> cat $GLOBUS_LOCATION/etc/globus-pbs.conf
log_path=/var/spool/pbs/server_logs/

 >> ls /var/spool/pbs/server_logs/
pbs_server.log

 >> submit a job /bin/sleep 30
 >>  grep 'Received local job ID' container-real.log
Received local job ID 127.container.phys.wits.ac.za
 >> qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
127.container             STDIN            norm                   0 R workq

 >>  grep '127.container' /var/spool/pbs/server_logs/pbs_server.log
06/13/2008 17:35:56;0008;PBS_Server;Job;127.container.phys.wits.ac.za;Job Queued at request of norm@osg.phys.wits.ac.za, owner = norm@osg.phys.wits.ac.za, job name = STDIN, queue = workq
06/13/2008 17:35:57;0008;PBS_Server;Job;127.container.phys.wits.ac.za;Job Modified at request of root@container.phys.wits.ac.za
06/13/2008 17:35:57;0008;PBS_Server;Job;127.container.phys.wits.ac.za;Job Run at request of root@container.phys.wits.ac.za
06/13/2008 17:35:57;0008;PBS_Server;Job;127.container.phys.wits.ac.za;Job Modified at request of root@container.phys.wits.ac.za
06/13/2008 17:36:27;0010;PBS_Server;Job;127.container.phys.wits.ac.za;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=4652kb resources_used.vmem=30520kb resources_used.walltime=00:00:30
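
The grep above can be narrowed to the fields that matter when following a single job. This is only a minimal sketch: it assumes each record sits on one line in the actual log file (the excerpts in this mail may be wrapped) and that the fields are separated by semicolons in the order timestamp, event code, server name, object type, object ID, message, as the excerpt suggests. The function name pbs_job_events is hypothetical and is not part of PBS or Globus.

```shell
# Sketch: print "timestamp | job id | message" for every record of one job.
# Assumed record layout: time;code;server;type;id;message
pbs_job_events() {
    # $1 = job ID prefix (e.g. 127.container), $2 = server log file
    awk -F';' -v job="$1" '
        # Keep only Job records whose ID starts with the requested prefix
        $4 == "Job" && index($5, job) == 1 {
            # Rejoin the message in case it contains semicolons itself
            msg = $6
            for (i = 7; i <= NF; i++) msg = msg ";" $i
            print $1 " | " $5 " | " msg
        }
    ' "$2"
}
```

For example, `pbs_job_events 127.container /var/spool/pbs/server_logs/pbs_server.log` should list the queue, run and exit records for the job shown above, one per line.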

If I try something like
 >>  $GLOBUS_LOCATION/libexec/globus-scheduler-event-generator -s pbs -t 1202300131

I get no output. The script seems to complain if I set an invalid  
log_path. It does not complain if I point log_path to an empty  
directory. Also, if log_path is set to some empty location, I don't  
see any complaints in container-real.log.
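
Since an empty directory produces no complaint, it may help to check by hand what a reader of log_path would actually find there. The following is a rough sketch, not a description of what the scheduler event generator itself does: it assumes the config file holds a single line of the form log_path=/some/dir, as in the cat output above, and the function names read_log_path and check_log_dir are made up for illustration. It is probably most informative when run as the account the container runs under.

```shell
# Sketch: report what is visible under the configured log_path.
read_log_path() {
    # $1 = config file containing a line like log_path=/some/dir
    sed -n 's/^log_path=//p' "$1" | head -n 1
}

check_log_dir() {
    # $1 = directory to inspect
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -r "$dir" ] || [ ! -x "$dir" ]; then
        echo "unreadable: $dir"
        return 1
    fi
    count=0
    for f in "$dir"/*; do
        # Skip subdirectories and the unexpanded pattern of an empty directory
        [ -f "$f" ] || continue
        count=$((count + 1))
        if [ -r "$f" ]; then
            echo "readable: $f"
        else
            echo "not readable: $f"
        fi
    done
    if [ "$count" -eq 0 ]; then
        echo "empty: $dir"
        return 1
    fi
}
```

A possible invocation is `check_log_dir "$(read_log_path $GLOBUS_LOCATION/etc/globus-pbs.conf)"`. The sketch only lists files and tests permissions; it says nothing about which file names the PBS module looks for.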

So my next guess is that gram is not looking at the pbs_server.log  
file. I'd be very happy if someone could shed some light.

Just for completeness: /var/spool/pbs is mounted read-only over NFS  
(since the pbs_server is running on another node).

Regards
Norm



