[gram-dev] Fwd: Increased globus errors 17 and 43 since upgrade to osg 1.0
feller at mcs.anl.gov
Wed Aug 6 14:45:26 CDT 2008
Hm, hard to tell what that might be.
Which GT version's GRAM were they running before the upgrade?
Maybe a code comparison of these versions might provide a hint.
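It might also help to quantify the "more likely when a couple hundred jobs are submitted at once" impression from the forwarded message. Below is a rough sketch, not a tested tool: it tallies the `GRAM_SCRIPT_ERROR = N` lines from a gram_job_mgr log per error code and per minute. The line format is assumed from the single excerpt quoted below, and the function name `tally_errors` and the log path passed on the command line are made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumed line shape, taken from the one excerpt in the forwarded message:
#   8/4 06:41:18 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
# Leading "> " quote markers are tolerated so pasted mail text also parses.
ERROR_LINE = re.compile(
    r"^[>\s]*(?P<date>\d{1,2}/\d{1,2})\s+(?P<time>\d{2}:\d{2}):\d{2}\s+"
    r".*GRAM_SCRIPT_ERROR\s*=\s*(?P<code>\d+)\s*$"
)


def tally_errors(lines):
    """Return (count per error code, count per (date, HH:MM, code))."""
    by_code = Counter()
    by_minute = Counter()
    for line in lines:
        match = ERROR_LINE.match(line)
        if match is None:
            continue  # not a GRAM_SCRIPT_ERROR line
        code = int(match.group("code"))
        by_code[code] += 1
        by_minute[(match.group("date"), match.group("time"), code)] += 1
    return by_code, by_minute


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is a hypothetical path to a gram_job_mgr log file.
    with open(sys.argv[1], errors="replace") as handle:
        codes, minutes = tally_errors(handle)
    for code, count in sorted(codes.items()):
        print(f"error {code}: {count}")
    for (date, minute, code), count in sorted(minutes.items()):
        print(f"{date} {minute} error {code}: {count}")
```

If the per-minute counts cluster around the times of the large submissions, that would support the load theory; an even spread would point elsewhere.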
Charles Bacon wrote:
> Any idea why these failures would have increased with an upgrade to 4.0.7?
> Begin forwarded message:
>> From: Steven Timm <timm at fnal.gov>
>> Date: August 4, 2008 9:24:53 AM CDT
>> To: osg-sites at opensciencegrid.org
>> Subject: Increased globus errors 17 and 43 since upgrade to osg 1.0
>> Since upgrading most of FermiGrid to osg 1.0 we have seen an increased
>> number of globus error 17 (the job failed when the jobmanager
>> attempted to run it) and globus error 43 (failed to stage executable).
>> They seem more likely to happen when a couple hundred jobs are submitted
>> to the system at once but I can't prove that. The failures are
>> about at the 1% level and have been observed from a variety of different
>> submission clients, a variety of different gatekeepers, and a
>> variety of different users.
>> I have access to the logs of the condor_gridmanager (which is running
>> in D_FULLDEBUG) on the submission
>> node, as well as the ability to capture the gram_job_mgr log. This is the
>> typical error we see there.
>> If we condor_release the job which is held with globus error 17 or
>> 43 it has a very good chance of finishing correctly the second time.
>> I am wondering if other people are starting to see more of these errors
>> as well.
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: About to submit condor job
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: I am the parent
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: submission failed!!!
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: Sent NFS sync for
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: Error file is not empty, and
>> submission fail
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: Error text is
>> ERROR: Executable file
>> srv1.fnal.gov:61445/farm/theory_stage01/skands/6.4.18ps/tun067.x does
>> not exist
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: Writing extended error information
>> to stderr
>> 8/4 06:41:18 JM: GT3 extended error message:
>> RROR: Executable file
>> rv1.fnal.gov:61445/farm/theory_stage01/skands/6.4.18ps/tun067.x does
>> not exist\n
>> 8/4 06:41:18 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE =
>> Executable file
>> al.gov:61445/farm/theory_stage01/skands/6.4.18ps/tun067.x does not
>> 8/4 06:41:18 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
>> 8/4 06:41:18 Job Manager State Machine (entering): GLOBUS_GRAM_JO
>> Past experience leads me to check for one end or the other running
>> out of globus or condor ports but in this case we are monitoring
>> them and the number of ports looks fine. In any case, all
>> the requests for syncing of files go back to the condor gahp_server
>> which only uses two ports on the client. There is nothing in the
>> logs of this server to indicate things are being overwhelmed,
>> nor are there any problems showing in the system logs at either
>> end of the transaction.
>> Steve Timm
>> Steven C. Timm, Ph.D (630) 840-8525
>> timm at fnal.gov http://home.fnal.gov/~timm/
>> Fermilab Computing Division, Scientific Computing Facilities,
>> Grid Facilities Department, FermiGrid Services Group, Assistant Group