[gram-dev] Fwd: Increased globus errors 17 and 43 since upgrade to osg 1.0
bacon at mcs.anl.gov
Tue Aug 19 08:34:11 CDT 2008
The latest from Fermi. See both the "Here's the latest from this"
entry and the one starting with "So I would like to see if we can
focus this ticket at all:".
> Here's the latest from this:
> One more followup to this ticket. The debugging has now captured
> the first return code from the globus-gass-cache -add command
> which is failing and causing the error 43.
> The error is 65280.
> That's 0xFF00 in hex. Does that tell you anything?
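For reference, 65280 is consistent with a raw POSIX wait status of the kind Perl's system() returns: the child's exit code lives in bits 8-15, so 0xFF00 decodes to exit code 255 with no terminating signal. A minimal sketch of the decoding (variable names are illustrative, not from the failing script):

```python
status = 65280  # 0xFF00, as captured from the globus-gass-cache -add call

# POSIX packs a normally-exited child's code into bits 8-15 of the status
# (what the WEXITSTATUS macro extracts); a nonzero low 7 bits would mean
# the child died on a signal instead.
exit_code = (status >> 8) & 0xFF
signal_num = status & 0x7F

print(exit_code, signal_num)  # 255 0
```

Exit code 255 is often what a child reports when it could not even exec or when a script exits with -1, which would fit a wrapper failing before the cache operation runs — but that reading is a guess, not confirmed by the log.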
> Steven C. Timm, Ph.D (630) 840-8525
> timm at fnal.gov http://home.fnal.gov/~timm/
> Fermilab Computing Division, Scientific Computing Facilities,
> Grid Facilities Department, FermiGrid Services Group, Assistant
> Group Leader.
> On Mon, 18 Aug 2008, Steven Timm wrote:
>> On Fri, 15 Aug 2008, condor-support response tracking system wrote:
>>>> One of my biggest servers is now having the condor_gridmanager of
>>>> one of my biggest users crash every five minutes with the error:
>>>> 8/1 11:52:37  GAHP <- 'GRAM_JOB_CALLBACK_REGISTER 19
>>>> 8/1 11:52:37  GAHP -> EOF
>>>> 8/1 11:52:37  ERROR "Bad GRAM_JOB_CALLBACK_REGISTER Request"
>>>> at line 2055 in file gahp-client.C
>>>> Any idea what might be causing this?
>>>> Background is that the user is trying to come back after a
>>>> situation where 1122 condor-g jobs were held due to an expired
>>>> It is not the same port that it is going after every time.
>>>> The 10 seconds of log in D_FULLDEBUG mode are attached.
>>>> This is output that came from the heavily debug-instrumented
>>>> gahp_server that Jaime sent us to try to debug the globus error 10.
>>>> Recently we have not seen many globus error 10s, but we are
>>>> seeing a lot of globus error 17s and 43s now when the jobs try to
>>>> start at the remote side and call back to the GAHP_SERVER to fetch
>>>> their executable. There is also a definite increase in hung
>>>> globus-gass-cache-util executables, something which happened
>>>> under osg 0.6.0, was very rare under osg 0.8.0, but now in osg 1.0
>>>> is back with a vengeance, 5 instances in the last week, on
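One way to keep track of the hung globus-gass-cache-util processes described above is to scan ps output for long-running instances. A sketch, assuming a Linux host with a procps-style ps; the helper names and the one-hour threshold are illustrative:

```python
import subprocess

def parse_ps(text, name, min_age_s):
    """Parse 'ps -eo pid,etimes,comm' output and return (pid, seconds)
    pairs for processes matching name that have run at least min_age_s."""
    stuck = []
    for line in text.splitlines()[1:]:          # skip the header row
        pid, etimes, comm = line.split(None, 2)
        if name in comm and int(etimes) >= min_age_s:
            stuck.append((int(pid), int(etimes)))
    return stuck

def find_stuck(name="globus-gass-cac", min_age_s=3600):
    # ps truncates the comm field to 15 characters, hence the short name.
    # The 'etimes' (elapsed seconds) specifier needs a reasonably recent ps.
    out = subprocess.run(["ps", "-eo", "pid,etimes,comm"],
                         capture_output=True, text=True, check=True).stdout
    return parse_ps(out, name, min_age_s)
```

Run from cron, this would at least give timestamps for when instances start hanging, which could be correlated with the error 43 bursts.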
>>> The gahp server is dying in the log snippet quoted above.
>>> In the longer gridmanager log at the end of your email,
>>> authentication is failing repeatedly. Unfortunately, GRAM doesn't
>>> give any details on why authentication failed.
>>> One thing you can try is using the gahp_server process from OSG
>>> It should work fine with the rest of Condor in OSG 1.0.0. I'd be
>>> interested to know if that dramatically reduces the number of
>>> errors.
>> Hi Jaime--the point is that we had upgraded to condor 7.0.3
>> (and the newer gahp_server that comes with it) considerably before
>> the upgrade to OSG 1.0.0. OSG 0.8.0 shipped with condor 6.8.8 but we
>> were already on condor 6.9 by that time. In any case the gahp_server
>> is the same version between condor 6.8.8, condor 6.8.5, condor 6.9.5,
>> and condor 7.0.3.
>> snowball.timm:/usr/local/osgclient-1.8.1/condor/sbin> ident
>> $GahpVersion: 1.0.15 Sep 13 2007 UW\ Gahp $
>> I am now running 1.0.16 since you gave me the debug version but
>> the errors are the same.
>> So I would like to see if we can focus this ticket at all:
>> 1) The globus error 17s that I've been reporting are coming
>> from a malformed scheduler_condor_submit_script. It cannot
>> find the executable because it is just looking in the wrong
>> place. We need to understand why and how this can happen, and why
>> it hits only 2 or 3 jobs of a cluster of a couple of hundred.
>> 2) We still have no more idea why the globus error 43 happens
>> than we ever did, although we are pretty sure it is happening
>> in the transfer executable section of Jobmanager.pm. We are not
>> getting any meaningful standard error, and the extra print
>> statements we've tried to add to date have told us nothing.
>> (Said section is trying to do a globus-gass-cache -add when it
>> 3) We haven't been able to interpret the globus error 10 any better
>> than we did before even with the extra debugging in gahp_server.
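Since the globus-gass-cache -add failure in item 2 yields no useful standard error, one low-effort approach is to route the command through a shim that records stderr alongside the decoded exit code. A sketch (the wrapper name and the commented-out invocation are illustrative; the real call lives in Perl inside Jobmanager.pm):

```python
import subprocess

def run_and_capture(argv):
    """Run a command and return (exit_code, stderr_text).
    subprocess already unpacks the raw wait status, so a Perl-style
    status of 65280 shows up here as exit code 255."""
    p = subprocess.run(argv, capture_output=True, text=True)
    return p.returncode, p.stderr

# Hypothetical invocation mirroring the failing call:
# code, err = run_and_capture(["globus-gass-cache", "-add", url])
```

The equivalent in the Perl job manager would be redirecting the command's stderr to a scratch file and logging it together with $? >> 8 whenever the status is nonzero.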
>> What is the next step? We need to focus the effort and bring it
>> to a satisfactory solution pretty soon.
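For item 1, since error 17 traces back to a submit script looking for the executable in the wrong place, a periodic scan of the generated scripts could catch the malformed ones while they still exist on disk. A sketch; the commented-out glob pattern is a guess at the gatekeeper's scratch layout, not the real path:

```python
import glob
import os
import re

def missing_executables(pattern):
    """Return {script_path: executable_path} for condor submit scripts
    whose 'executable = ...' line points at a nonexistent file."""
    bad = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            m = re.search(r'^\s*executable\s*=\s*(\S+)', f.read(),
                          re.MULTILINE | re.IGNORECASE)
        if m and not os.path.exists(m.group(1)):
            bad[path] = m.group(1)
    return bad

# Hypothetical layout -- substitute the real scratch directory:
# missing_executables("/path/to/scratch/*/scheduler_condor_submit_script*")
```

Diffing a bad script against a good one from the same cluster would then show whether only the executable path is mangled or the whole script is malformed.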
>> Steve Timm
>>> Thanks and regards,
>>> Jaime Frey
>>> UW-Madison Condor Team
>>> MESSAGE INFORMATION
>>> * From: Jaime Frey <jfrey at cs.wisc.edu>
>>> * Ticket Email List: timm at fnal.gov,
>>> osg-troubleshooting at opensciencegrid.org
> MESSAGE INFORMATION
> * From: Steven Timm <timm at fnal.gov>
> * Ticket Email List: timm at fnal.gov, osg-troubleshooting at opensciencegrid.org