[gram-dev] Fwd: Increased globus errors 17 and 43 since upgrade to osg 1.0

Steven Timm timm at fnal.gov
Thu Aug 7 10:22:09 CDT 2008


Full logs are attached from one globus error 17 and one globus error 43.
These two errors occurred simultaneously.  3 jobs of a cluster of 200
were held; the others completed normally.  (The held jobs: one globus
error 10, one globus error 17, one globus error 43.)

Steve Timm



On Thu, 7 Aug 2008, Charles Bacon wrote:

> On Aug 6, 2008, at 2:50 PM, Steven Timm wrote:
>
>> We were using GT 4.0.5 before.
>> Pre-WS GRAM in both cases.
>> The errors we see now occur just as the
>> jobmanager is trying to either fetch or submit the executable.
>> 
>> All jobs are being submitted from Condor-G's condor_gridmanager and
>> gahp_server, and it appears that the globus-job-manager process
>> running on the gatekeeper is trying to call back to Condor-G's
>> gahp_server on the client at the time of the failure, to fetch
>> the executable.
>
> Given this, and that the jobs work if they are released, it all suggests
> timing problems to me, probably with an NFS-mounted GASS cache.  Why it
> would show up more in OSG 1.0 than OSG 0.8, I don't know.  If you diff your
> condor.pm from one installation to the other, do you find that any local
> modifications have changed?  It seems like you might want to try adding:
>   $self->nfssync( $description->executable);
>
> into the section of nfssync() calls right before the line reading:
>   $self->log("About to submit condor job");
>
> And/or add a sync-up sleep of a couple of seconds there if the file doesn't
> exist yet.
>
> So that covers error 17.  Error 43 is "stage out failed" - do you have any
> logs from jobs where that happened?  It's odd that both of these are
> communication-with-the-client failures - do the Condor-G client logs have
> anything interesting to say when these failures happen?
>
>
> Charles
>
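
For reference, the "sync-up sleep" idea quoted above could be sketched
roughly as follows.  This is only an illustration, not code from condor.pm:
the helper name wait_for_file and its timeout parameters are hypothetical,
and it assumes that a plain existence check is enough to see the file once
NFS has caught up.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of condor.pm): poll until $path exists,
# sleeping $interval seconds between checks, for at most $max_wait seconds.
# Returns 1 if the file is visible, 0 if the wait timed out.
sub wait_for_file {
    my ($path, $max_wait, $interval) = @_;
    $max_wait = 5 unless defined $max_wait;   # assumed default, in seconds
    $interval = 1 unless defined $interval;   # assumed default, in seconds

    my $waited = 0;
    until (-e $path) {
        return 0 if $waited >= $max_wait;     # give up after the time budget
        sleep $interval;
        $waited += $interval;
    }
    return 1;
}

# Possible use next to the nfssync() calls, before "About to submit condor job":
#   wait_for_file($description->executable, 5, 1)
#       or $self->log("executable still not visible after waiting");
```

The wait is bounded, so a file that never arrives still fails the job
rather than hanging the jobmanager.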

-- 
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm at fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gram_job_mgr_9983.log
Type: application/octet-stream
Size: 21373 bytes
Desc: 
URL: <http://lists.globus.org/pipermail/gram-dev/attachments/20080807/2de5d0f2/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gram_job_mgr_10029.log
Type: application/octet-stream
Size: 19280 bytes
Desc: 
URL: <http://lists.globus.org/pipermail/gram-dev/attachments/20080807/2de5d0f2/attachment-0001.obj>

