[gram-dev] Fwd: Increased globus errors 17 and 43 since upgrade to osg 1.0
Martin Feller
feller at mcs.anl.gov
Wed Aug 6 14:45:26 CDT 2008
Hm, hard to tell what that might be.
Which GT version's GRAM were they running before the upgrade?
A code comparison of the two versions might provide a hint.
Martin
Charles Bacon wrote:
> Any idea why these failures would have increased with an upgrade to 4.0.7?
>
>
> Charles
>
> Begin forwarded message:
>
>> From: Steven Timm <timm at fnal.gov>
>> Date: August 4, 2008 9:24:53 AM CDT
>> To: osg-sites at opensciencegrid.org
>> Subject: Increased globus errors 17 and 43 since upgrade to osg 1.0
>>
>> Since upgrading most of FermiGrid to osg 1.0 we have seen an increased
>> number of globus error 17 (the job failed when the jobmanager
>> attempted to run it) and globus error 43 (failed to stage executable).
>> They seem more likely to happen when a couple hundred jobs are submitted
>> to the system at once but I can't prove that. The failures are
>> about at the 1% level and have been observed from a variety of different
>> submission clients, a variety of different gatekeepers, and a
>> variety of different users.
>>
>> I have access to the logs of the condor_gridmanager (which is running
>> in D_FULLDEBUG) on the submission node, as well as the ability to
>> capture the gram_job_mgr log. The excerpt below shows the typical
>> error we see there.
>>
>> If we condor_release the job which is held with globus error 17 or
>> 43 it has a very good chance of finishing correctly the second time
>> around.
>>
>> I am wondering if other people are starting to see more of these errors
>> as well.
>>
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: About to submit condor job
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: I am the parent
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: submission failed!!!
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: Sent NFS sync for /grid/home/fnal_thy/.globus/job/fnpcosg1.fnal.gov/11758.1217850069/scheduler_condor_submit_stderr
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: Error file is not empty, and submission failed
>>
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: Error text is
>> ERROR: Executable file /grid/home/fnal_thy//gram_scratch_NMGtQl66MN/https://fnpcsrv1.fnal.gov:61445/farm/theory_stage01/skands/6.4.18ps/tun067.x does not exist
>>
>> Mon Aug 4 06:41:18 2008 JM_SCRIPT: Writing extended error information to stderr
>> 8/4 06:41:18 JM: GT3 extended error message: GRAM_SCRIPT_GT3_FAILURE_MESSAGE:\nERROR: Executable file /grid/home/fnal_thy//gram_scratch_NMGtQl66MN/https://fnpcsrv1.fnal.gov:61445/farm/theory_stage01/skands/6.4.18ps/tun067.x does not exist\n
>> 8/4 06:41:18 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE = \nERROR: Executable file /grid/home/fnal_thy//gram_scratch_NMGtQl66MN/https://fnpcsrv1.fnal.gov:61445/farm/theory_stage01/skands/6.4.18ps/tun067.x does not exist\n
>> 8/4 06:41:18 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
>> 8/4 06:41:18 Job Manager State Machine (entering): GLOBUS_GRAM_JO
>>
>> ----------
>> Past experience leads me to check for one end or the other running
>> out of globus or condor ports but in this case we are monitoring
>> them and the number of ports looks fine. In any case, all
>> the requests for syncing of files go back to the condor gahp_server
>> which only uses two ports on the client. There is nothing in the
>> logs of this server to indicate things are being overwhelmed,
>> nor are there any problems showing in the system logs at either
>> end of the transaction.
>>
>> Steve Timm
>>
>>
>>
>>
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D (630) 840-8525
>> timm at fnal.gov http://home.fnal.gov/~timm/
>> Fermilab Computing Division, Scientific Computing Facilities,
>> Grid Facilities Department, FermiGrid Services Group, Assistant Group
>> Leader.
>
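To quantify the failure rate before and after the upgrade, it may help to tally the error codes in the captured gram_job_mgr logs. The following is only a minimal sketch, assuming the codes appear on lines of the form "GRAM_SCRIPT_ERROR = 17" as in the excerpt quoted above; the function name count_gram_errors is hypothetical and not part of Globus or Condor.

```python
import re
from collections import Counter

# Assumed line format, taken from the quoted log excerpt:
#   8/4 06:41:18 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
ERROR_RE = re.compile(r"GRAM_SCRIPT_ERROR\s*=\s*(\d+)")


def count_gram_errors(lines):
    """Return a Counter mapping GRAM error code -> number of occurrences."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Passing an open log file (for example, count_gram_errors(open(path)), with path pointing at one captured log) gives per-code totals that can be compared between the old and new installations.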