[gt-user] How does the PBS jobmanager handle job dir clean-ups?

Jens-Soenke Voeckler voeckler at ISI.EDU
Sun Jun 25 20:17:50 CDT 2006


Hi,

I have a tricky problem with the PBS jobmanager. Every Sat->Sun  
midnight, my submit host loses all connectivity to its remote jobs. I  
am using Condor-G and its grid manager. The PBS job manager creates a  
directory $HOME/.globus/job/<gatekeeper>/<pid>.<utc>/ where it keeps  
files that PBS needs, e.g. PBS submit file etc. According to the  
complaints email from the PBS Mom, see below, these directories  
"disappear". I know that I don't have a cron job to do so, and I know  
that HPC does not, either. This leaves Globus or Condor-G, and Jaime  
is already looking into his part.
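
Since the directory name carries its own creation time, it can be decoded to check when an affected job directory was made. A minimal sketch, assuming the second field of <pid>.<utc> is seconds since the Unix epoch; the helper name parse_job_dir is hypothetical and not part of Globus:

```python
from datetime import datetime, timezone

def parse_job_dir(name):
    """Split '<pid>.<utc>' into (pid, creation time as an aware UTC datetime).

    Assumption: the second field is seconds since the Unix epoch.
    """
    pid, stamp = name.split(".", 1)
    return int(pid), datetime.fromtimestamp(int(stamp), tz=timezone.utc)

# Directory name taken from the PBS error mail quoted below.
pid, created = parse_job_dir("26969.1151206053")
print(pid, created.isoformat())  # 26969 2006-06-25T03:27:33+00:00
```

Under that assumption the directory was created at 03:27:33 UTC on June 25, which is still Saturday evening in Pacific time.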

As Garrick suspects, could some component of the PBS job manager be  
calculating dates without using UTC timestamps, and thus be falling  
into the Sat==6 -> Sun==0 trap?
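
To make the suspected trap concrete, here is a purely illustrative sketch. Nothing in it is taken from the Globus or Condor-G sources; the function names and the seven-day limit are assumptions. It contrasts an age check that compares day-of-week numbers (Sunday == 0, Saturday == 6) with one that compares epoch timestamps:

```python
from datetime import datetime, timezone

def wday(ts):
    """Day of week in UTC with Sunday == 0 ... Saturday == 6 (C tm_wday style)."""
    # Python's weekday() has Monday == 0, so shift by one and wrap.
    return (datetime.fromtimestamp(ts, tz=timezone.utc).weekday() + 1) % 7

def looks_stale_buggy(created_ts, now_ts):
    # Hypothetical bug: compares weekday numbers, which wrap from 6 to 0.
    return wday(now_ts) < wday(created_ts)

def looks_stale_safe(created_ts, now_ts, max_age_seconds):
    # Compares elapsed seconds, so the week boundary is irrelevant.
    return now_ts - created_ts > max_age_seconds

created = 1151179200  # Sat 2006-06-24 20:00:00 UTC
now = 1151200800      # Sun 2006-06-25 02:00:00 UTC, six hours later

print(looks_stale_buggy(created, now))               # True
print(looks_stale_safe(created, now, 7 * 24 * 3600)) # False
```

A check of the first kind would declare every Saturday directory stale as soon as Sunday begins, which would match the timing I am seeing.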

As far as I can tell, the server is running Globus 4.0.1.

Begin forwarded message:

> Subject: Re: Are there special clean-up cron jobs for Globus dirs
>
> On Sun, Jun 25, 2006 at 09:35:16AM -0700, Jens-Soenke Voeckler  
> alleged:
>> Hi,
>>
>> I've collected various pieces of information on the problem, and put
>> it at
>>
>> http://www.isi.edu/~voeckler/PBS/
>>
>> The obvious symptom appears to be that the job's directory in
>> .globus/job/<host> disappears.
>
> There is nothing we have setup that would delete things from people's
> home directories.  I think the users would have us taken out and shot!
>
> I'd assume that something in globus/condor-g is removing it.
>
> Could this be a date calculation bug?  It suddenly finds day 0 (Sunday)
> to be less than the execution day 6 (Saturday)?
>
>
>> Jens.
>>
>> On Jun 25, 2006, at 24:47 , Jens-Soenke Voeckler wrote:
>>
>>> Hi Brian,
>>>
>>> to return to last Monday's discussion, the first "casualties" come
>>> flocking in, see below. These could still be real jobs
>>> terminating, but I also noticed that Condor told me that one of my
>>> scoring jobs finished, while it is still active on PBS!
>>>
>>> Submitted at:        Sat Jun 24 22:15:56 2006
>>> Completed at:        Sun Jun 25 00:07:38 2006
>>> Real Time:             0 01:51:42
>>>
>>> I know it's still active, because it's still producing files. Now,
>>> if there was just some simple way to get its PBS job id... It will
>>> be done shortly.
>>>
>>> Actually, I cannot even find the job in the PBS queue! The only job
>>> with a similar wall-time requirement in the Q is not the one I am
>>> looking at. If it is, it will send the same email, because its
>>> job/... directory is gone from the Globus cache. I know the
>>> "missing job" is still alive (and not running on a login node).
>>>
>>> I see other jobs in my Condor-Q that were running/active quite
>>> well, but suddenly are marked as idle/unsubmitted again. All my
>>> workflows, which were happily churning a couple of minutes ago, are
>>> in various states between "running/stage-out" and "idle/
>>> unsubmitted" instead of mostly "running/active" on the submit side.
>>> Condor-G sees only 128 jobs, but PBS knows about 137 jobs.
>>>
>>> As I mentioned earlier, this behavior has consistently been
>>> observed the last three weeks shortly after midnight Sat -> Sun.
>>> Thus, I strongly suspect a (weekly) cron job as the culprit.
>>>
>>> Here are some messages PBS sent.  These may be truly done jobs, or
>>> half-life ones, I don't know:
>>>
>>>
>>> From: hpcc at hpc-master.usc.edu
>>> Date: June 25, 2006 12:14:29 AM PDT
>>> To: SCRAMBLED at SCRAMBLED
>>> Subject: PBS JOB 1206999.hpc-pbs.usc.edu
>>>
>>> PBS Job Id: 1206999.hpc-pbs.usc.edu
>>> Job Name:   STDIN
>>> An error has occurred processing your job, see below.
>>> Post job file processing error; job 1206999.hpc-pbs.usc.edu on host
>>> hpc0095/0
>>>
>>> Unable to copy file /var/spool/torque/spool/1206999.hpc.OU to /home/
>>> rcf-11/voeckler/.globus/job/hpc-master.usc.edu/26969.1151206053/ 
>>> stdout
>>> error from copy
>>> /bin/cp: cannot create regular file `/home/rcf-11/voeckler/.globus/
>>> job/hpc-master.usc.edu/26969.1151206053/stdout': No such file or
>>> directory
>>> end error output
>>> Output retained on that host in: /var/spool/torque/undelivered/
>>> 1206999.hpc.OU
>>>
>>> Unable to copy file /var/spool/torque/spool/1206999.hpc.ER to /home/
>>> rcf-11/voeckler/.globus/job/hpc-master.usc.edu/26969.1151206053/ 
>>> stderr
>>> error from copy
>>> /bin/cp: cannot create regular file `/home/rcf-11/voeckler/.globus/
>>> job/hpc-master.usc.edu/26969.1151206053/stderr': No such file or
>>> directory
>>> end error output
>>> Output retained on that host in: /var/spool/torque/undelivered/
>>> 1206999.hpc.ER

>>> From: hpcc at hpc-master.usc.edu
>>> Date: June 25, 2006 12:14:14 AM PDT
>>> To: SCRAMBLED at SCRAMBLED
>>> Subject: PBS JOB 1206998.hpc-pbs.usc.edu
>>>
>>> PBS Job Id: 1206998.hpc-pbs.usc.edu
>>> Job Name:   STDIN
>>> An error has occurred processing your job, see below.
>>> Post job file processing error; job 1206998.hpc-pbs.usc.edu on host
>>> hpc0095/0
>>>
>>> Unable to copy file /var/spool/torque/spool/1206998.hpc.OU to /home/
>>> rcf-11/voeckler/.globus/job/hpc-master.usc.edu/26970.1151206053/ 
>>> stdout
>>> error from copy
>>> /bin/cp: cannot create regular file `/home/rcf-11/voeckler/.globus/
>>> job/hpc-master.usc.edu/26970.1151206053/stdout': No such file or
>>> directory
>>> end error output
>>> Output retained on that host in: /var/spool/torque/undelivered/
>>> 1206998.hpc.OU
>>>
>>> Unable to copy file /var/spool/torque/spool/1206998.hpc.ER to /home/
>>> rcf-11/voeckler/.globus/job/hpc-master.usc.edu/26970.1151206053/ 
>>> stderr
>>> error from copy
>>> /bin/cp: cannot create regular file `/home/rcf-11/voeckler/.globus/
>>> job/hpc-master.usc.edu/26970.1151206053/stderr': No such file or
>>> directory
>>> end error output
>>> Output retained on that host in: /var/spool/torque/undelivered/
>>> 1206998.hpc.ER

>>> From: hpcc at hpc-master.usc.edu
>>> Date: June 25, 2006 12:14:00 AM PDT
>>> To: SCRAMBLED at SCRAMBLED
>>> Subject: PBS JOB 1206997.hpc-pbs.usc.edu
>>>
>>> PBS Job Id: 1206997.hpc-pbs.usc.edu
>>> Job Name:   STDIN
>>> An error has occurred processing your job, see below.
>>> Post job file processing error; job 1206997.hpc-pbs.usc.edu on host
>>> hpc0095/0
>>>
>>> Unable to copy file /var/spool/torque/spool/1206997.hpc.OU to /home/
>>> rcf-11/voeckler/.globus/job/hpc-master.usc.edu/26911.1151206044/ 
>>> stdout
>>> error from copy
>>> /bin/cp: cannot create regular file `/home/rcf-11/voeckler/.globus/
>>> job/hpc-master.usc.edu/26911.1151206044/stdout': No such file or
>>> directory
>>> end error output
>>> Output retained on that host in: /var/spool/torque/undelivered/
>>> 1206997.hpc.OU
>>>
>>> Unable to copy file /var/spool/torque/spool/1206997.hpc.ER to /home/
>>> rcf-11/voeckler/.globus/job/hpc-master.usc.edu/26911.1151206044/ 
>>> stderr
>>> error from copy
>>> /bin/cp: cannot create regular file `/home/rcf-11/voeckler/.globus/
>>> job/hpc-master.usc.edu/26911.1151206044/stderr': No such file or
>>> directory
>>> end error output
>>> Output retained on that host in: /var/spool/torque/undelivered/
>>> 1206997.hpc.ER
>>
>> Aloha,
>> Dipl.-Ing. Jens-S. Vöckler   voeckler at isi dot edu
>> University of Southern California Viterbi School of Engineering
>> Information Sciences Institute; 4676 Admiralty Way Ste 1001
>> Marina Del Rey, CA 90292-6611; USA; +1 310 448 8427
>> * You can rely on any shared filesystems for only one thing -  
>> don't! *
>>
>>
>>
>
> -- 
> Garrick Staples, Linux/HPCC Administrator
> University of Southern California

Aloha,
Dipl.-Ing. Jens-S. Vöckler   voeckler at isi dot edu
University of Southern California Viterbi School of Engineering
Information Sciences Institute; 4676 Admiralty Way Ste 1001
Marina Del Rey, CA 90292-6611; USA; +1 310 448 8427
* You can rely on any shared filesystems for only one thing - don't! *



