[gram-dev] Fwd: Increased globus errors 17 and 43 since upgrade to osg 1.0
Steven Timm
timm at fnal.gov
Thu Aug 7 10:22:09 CDT 2008
Full logs are attached for one globus error 17 and one globus error 43.
These two errors occurred simultaneously. Three of a cluster of 200
jobs were held; the others completed normally. (The held jobs: one globus
error 10, one globus error 17, one globus error 43.)
Steve Timm
On Thu, 7 Aug 2008, Charles Bacon wrote:
> On Aug 6, 2008, at 2:50 PM, Steven Timm wrote:
>
>> We were using GT 4.0.5 before.
>> Pre-ws gram in both cases.
>> These errors that are happening now are happening just as the
>> jobmanager is trying to either fetch or submit the executable.
>>
>> All jobs are being submitted from Condor-G's condor_gridmanager and
>> gahp_server and it appears that the globus-job-manager process
>> running on the gatekeeper is trying to call back to condor-g's
>> gahp_server on the client at the time of the failure, to fetch
>> the executable.
>
> Given this, and that they work if the job is released, it all suggests timing
> problems to me, probably of an NFS-mounted GASS cache. Why it would show up
> more in OSG1.0 than OSG0.8, I don't know. If you diff your condor.pm from
> one installation to the other, do you find anything has changed in terms of
> local modifications? It seems like you might want to try adding:
> $self->nfssync( $description->executable);
>
> into the section of nfssync() calls right before the line reading:
> $self->log("About to submit condor job");
>
> And/or give it a sync-up sleep of a couple of seconds there if the file
> doesn't exist yet.
>
> So that's all for error 17. Error 43 is "stage out failed" - do you have any logs from jobs
> where that happened? It's odd that both of these are
> communication-with-the-client failures - do the condor-G client logs have
> anything interesting to say when these failures happen?
>
>
> Charles
>
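For reference, the "sync-up sleep if the file doesn't exist yet" idea
quoted above could look roughly like the sketch below. This is only an
illustration under stated assumptions: wait_for_file is a hypothetical
helper, not something that exists in condor.pm or in Globus, and the
5 second timeout and 0.5 second poll interval are arbitrary defaults.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper (NOT part of condor.pm or Globus): poll until $path
# exists, or until $timeout seconds have passed.
# Returns 1 if the file showed up, 0 if the wait timed out.
sub wait_for_file {
    my ($path, $timeout, $interval) = @_;
    $timeout  = 5   unless defined $timeout;    # assumed default, seconds
    $interval = 0.5 unless defined $interval;   # assumed default, seconds

    my $deadline = time() + $timeout;
    while (1) {
        return 1 if -e $path;                   # file is visible now
        return 0 if time() >= $deadline;        # give up after the timeout
        sleep($interval);                       # short pause before retrying
    }
}
```

In condor.pm, a call such as wait_for_file($description->executable)
would presumably go next to the suggested nfssync() call, just before
the "About to submit condor job" log line. Whether a missing file
should then fail the job or only be logged is a local decision.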
--
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
timm at fnal.gov http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gram_job_mgr_9983.log
Type: application/octet-stream
Size: 21373 bytes
Desc:
URL: <http://lists.globus.org/pipermail/gram-dev/attachments/20080807/2de5d0f2/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gram_job_mgr_10029.log
Type: application/octet-stream
Size: 19280 bytes
Desc:
URL: <http://lists.globus.org/pipermail/gram-dev/attachments/20080807/2de5d0f2/attachment-0001.obj>