[gram-dev] Fwd: Increased globus errors 17 and 43 since upgrade to osg 1.0

Steven Timm timm at fnal.gov
Fri Aug 8 14:22:46 CDT 2008


I've added an extra print statement to a different file to try to
capture a second copy of the error message that Charles is
referring to.  That said, after looking at the JobManager.pm
code and the section of the log where the error is, it appears
to me that we are getting the error message just fine; it is
just wrapped onto the next few lines of the log.

At Jaime Frey's suggestion, I also added a section to
condor.pm that saves a copy of the scheduler_condor_submit_script
after a globus error 17; the script would otherwise be deleted.  This
will tell us whether we are dealing with a malformed
condor submit script that is simply asking for the wrong executable.
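
For anyone who wants to try the same thing, here is a minimal sketch of
that kind of hook.  It is not the actual change made to condor.pm: the
helper name preserve_submit_script, the save directory, and the naming
of the saved copy are all assumptions for illustration.  Only core Perl
modules (File::Copy, File::Path, File::Basename) are used.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Path qw(make_path);
use File::Basename qw(basename);

# Hypothetical helper, not part of condor.pm as shipped: copy the
# generated submit script aside before the job manager removes it.
# Returns the path of the saved copy, or nothing on failure.
sub preserve_submit_script {
    my ($script_path, $save_dir) = @_;

    # Nothing to save if the script is missing or already cleaned up.
    return unless defined $script_path && -f $script_path;

    make_path($save_dir) unless -d $save_dir;

    # Timestamp and PID keep copies from concurrent job managers apart.
    my $dest = sprintf("%s/%s.%d.%d",
                       $save_dir, basename($script_path), time(), $$);

    copy($script_path, $dest) or return;
    return $dest;
}
```

It would be called at the point where the submit fails, for example
preserve_submit_script($script_filename, '/tmp/gram-debug'), where
$script_filename stands in for whatever variable holds the submit
script path in your copy of condor.pm and the directory is a placeholder.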

Steve Timm


On Thu, 7 Aug 2008, Charles Bacon wrote:

> Okay, here's something to look at for the error 43.  This is mainly being 
> handled by lib/perl/Globus/GRAM/JobManager.pm's "sub stage_in".  The error is 
> supposed to show up in the logs, but I only see the empty line in:
> 8/7 09:05:50 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE =
>
> I think the error might be getting swallowed up, so can you add some extra 
> lines to the section of JobManager.pm that looks like:
>       if ($rc != 0) {
>           $self->log("executable staging failed with $stderr");
>
>           $self->respond( {
>               'GT3_FAILURE_TYPE' => 'executable',
>               'GT3_FAILURE_MESSAGE' => $stderr,
>               'GT3_FAILURE_SOURCE' => $description->executable()
>           });
>
>           return Globus::GRAM::Error::STAGING_EXECUTABLE;
>       }
>
> to save out that $stderr value somewhere it can be inspected?
>
> Sadly, I think the nfssync() won't work to solve the error 17, because I see 
> that the stage_in subroutine is already trying:
>       $self->nfssync($local, 0);
> after the successful executable stage-in.
>
>
> Charles
>
> On Aug 7, 2008, at 10:22 AM, Steven Timm wrote:
>
>> Full logs attached from a globus error 17 and a globus error 43.
>> These two errors were simultaneous.  3 of a cluster of 200
>> jobs were held; the others completed normally.  (One globus error 10,
>> one globus error 17, one globus error 43.)
>> 
>> Steve Timm
>> 
>> 
>> 
>> On Thu, 7 Aug 2008, Charles Bacon wrote:
>> 
>>> On Aug 6, 2008, at 2:50 PM, Steven Timm wrote:
>>> 
>>>> We were using GT 4.0.5 before.
>>>> Pre-ws gram in both cases.
>>>> These errors that are happening now are happening just as the
>>>> jobmanager is trying to either fetch or submit the executable.
>>>> All jobs are being submitted from Condor-G's condor_gridmanager and 
>>>> gahp_server and it appears that the globus-job-manager process
>>>> running on the gatekeeper is trying to call back to condor-g's
>>>> gahp_server on the client at the time of the failure, to fetch
>>>> the executable.
>>> 
>>> Given this, and that they work if the job is released, it all suggests 
>>> timing problems to me, probably of an NFS-mounted GASS cache.  Why it 
>>> would show up more in OSG1.0 than OSG0.8, I don't know.  If you diff your 
>>> condor.pm from one installation to the other, do you find anything has 
>>> changed in terms of local modifications?  It seems like you might want to 
>>> try adding:
>>> $self->nfssync( $description->executable);
>>> 
>>> into the section of nfssync() calls right before the line reading:
>>> $self->log("About to submit condor job");
>>> 
>>> And/or give it a couple of seconds of sync-up sleep there if the file
>>> doesn't exist yet.
>>> 
>>> So that's all 17.  43 is "executable staging failed" - do you have any logs from 
>>> jobs where that happened?  It's odd that both of these are 
>>> communication-with-the-client failures - do the condor-G client logs have 
>>> anything interesting to say when these failures happen?
>>> 
>>> 
>>> Charles
>>> 
>> 
>> -- 
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D  (630) 840-8525
>> timm at fnal.gov  http://home.fnal.gov/~timm/
>> Fermilab Computing Division, Scientific Computing Facilities,
>> Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
>> <gram_job_mgr_9983.log><gram_job_mgr_10029.log>
>

-- 
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm at fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.



