[gram-dev] Fwd: Increased globus errors 17 and 43 since upgrade to osg 1.0
Charles Bacon
bacon at mcs.anl.gov
Wed Aug 27 16:27:40 CDT 2008
There have been two commits to the gass/ directory since the 4.0.5
release. One commit (http://tinyurl.com/gass1) was to fix bug 5706
(http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5706). The other
commit (http://preview.tinyurl.com/gasstwo) was to fix bug 5771
(http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5771).
I don't have the experience to evaluate whether either of those two
patches would affect this behavior, but perhaps that's enough of a
clue for you to see if it's relevant.
Charles
On Aug 27, 2008, at 2:33 PM, Jaime Frey wrote:
> The gahp_server statically links in the gram client and gass
> server-ez libraries (and their dependencies) from Globus 4.0.5. The
> code hasn't been changed since Condor 6.8.6, when we switched from
> linking against Globus 4.0.3.
>
> So a difference in behavior between OSG 0.8.0 and 1.0.0 can't be
> explained by changes in the GAHP server. Is it possible that in the
> version of Globus included in OSG 1.0.0, the tools trying to talk to
> the gahp_server are less patient if the gahp is very slow to respond?
>
> There is no documentation of the gahp's internals beyond the source
> code. The gahp does include modifications of the gass server-ez code
> from CERN that allow it to authenticate 20 connections in parallel
> instead of just one. Other than that, it's the stock Globus code.
>
> -- Jaime
>
> On Aug 19, 2008, at 9:00 PM, Steven Timm wrote:
>
>> All fingers seem to be pointing to gahp_server at the moment, but
>> it is a black box and a fairly small black box at that, which
>> doesn't appear to require any globus libraries. Is there
>> anything short of attaching to the gahp_server with gdb, or stracing
>> it, that we can do? Any documentation available short of
>> downloading the source code? We are looking for a needle in a
>> haystack here, with only 2-3 errors happening per day. I have put
>> almost all the debugging I can into the perl script. The
>> condor_gridmanager log is giving us nothing.
>>
>> Steve Timm
>>
>>
>>
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D (630) 840-8525
>> timm at fnal.gov http://home.fnal.gov/~timm/
>> Fermilab Computing Division, Scientific Computing Facilities,
>> Grid Facilities Department, FermiGrid Services Group, Assistant
>> Group Leader.
>>
>> On Tue, 19 Aug 2008, Charles Bacon wrote:
>>
>>> The latest from fermi. See both the "Here's the latest from this"
>>> entry and the one starting with "So I would like to see if we can
>>> focus this ticket at all:".
>>>
>>>
>>> Charles
>>>
>>>> Here's the latest from this:
>>>> One more followup to this ticket. The debugging has now captured
>>>> the first return code from the globus-gass-cache -add command,
>>>> which is failing and causing the error 43.
>>>> The return code is 65280.
>>>> That's FF00 in hex. Does that tell you anything?
>>>> Steve
>>>> ------------------------------------------------------------------
>>>> Steven C. Timm, Ph.D (630) 840-8525
>>>> timm at fnal.gov http://home.fnal.gov/~timm/
>>>> Fermilab Computing Division, Scientific Computing Facilities,
>>>> Grid Facilities Department, FermiGrid Services Group, Assistant
>>>> Group Leader.
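
A note on the 65280 (0xFF00) value reported above: if that number is
the raw wait status, as returned by Perl's system() or found in $?, then
the exit code sits in the high byte and the low byte carries signal
information. The sketch below decodes a status under that assumption.
The function name decode_wait_status is hypothetical and is not part of
Globus, Condor, or Jobmanager.pm.

```python
def decode_wait_status(status: int) -> dict:
    """Split a 16-bit POSIX-style wait status into its conventional fields.

    Assumption: `status` is the raw value from Perl's system() or $?,
    not an exit code that has already been shifted.
    """
    return {
        "exit_code": (status >> 8) & 0xFF,   # high byte: value passed to exit()
        "signal": status & 0x7F,             # low 7 bits: terminating signal, 0 if none
        "core_dumped": bool(status & 0x80),  # bit 7: core dump flag
    }


if __name__ == "__main__":
    print(decode_wait_status(65280))
```

Under that reading, 65280 decodes to exit code 255 with no terminating
signal. An exit code of 255 is also what a process reports after calling
exit(-1), so the value may be a generic failure return rather than a
specific error number.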
>>>> On Mon, 18 Aug 2008, Steven Timm wrote:
>>>>> On Fri, 15 Aug 2008, condor-support response tracking system
>>>>> wrote:
>>>>>>> One of my biggest servers is now having the condor_gridmanager of
>>>>>>> one of my biggest users crash every five minutes with the error:
>>>>>>> 8/1 11:52:37 [25123] GAHP[26126] <- 'GRAM_JOB_CALLBACK_REGISTER 19 https://fermigridosg1.fnal.gov:40020/23358/1217559488/ https://fnpcsrv1.fnal.gov:61442/'
>>>>>>> 8/1 11:52:37 [25123] GAHP[26126] -> EOF
>>>>>>> 8/1 11:52:37 [25123] ERROR "Bad GRAM_JOB_CALLBACK_REGISTER Request" at line 2055 in file gahp-client.C
>>>>>>> Any idea what might be causing this?
>>>>>>> Background is that the user is trying to come back after a
>>>>>>> situation where 1122 condor-g jobs were held due to an expired
>>>>>>> proxy. It is not the same port that it is going after every time.
>>>>>>> The 10 seconds of log in D_FULLDEBUG mode is attached.
>>>>>>> This is output that came from the heavily debug-instrumented
>>>>>>> gahp_server that Jaime sent us to try to debug the globus
>>>>>>> error 10.
>>>>>>> Recently we have not seen many globus error 10 but we are seeing
>>>>>>> a lot of globus error 17 and 43 now when the jobs try to start
>>>>>>> at the remote side and call back to the GAHP_SERVER to fetch
>>>>>>> their executable. There is also a definite increase in hung
>>>>>>> globus-gass-cache-util executables, something which happened
>>>>>>> occasionally under osg 0.6.0 and was very rare under osg 0.8.0,
>>>>>>> but is now back with a vengeance in osg 1.0.0: 5 instances in
>>>>>>> the last week, on different machines.
>>>>>> The gahp server is dying in the log snippet quoted above.
>>>>>> In the longer gridmanager log at the end of your email,
>>>>>> authentication is failing repeatedly. Unfortunately, GRAM doesn't
>>>>>> give any details on why authentication failed.
>>>>>> One thing you can try is using the gahp_server process from OSG
>>>>>> 0.8.0. It should work fine with the rest of Condor in OSG 1.0.0.
>>>>>> I'd be very interested to know if that dramatically reduces the
>>>>>> number of errors.
>>>>> Hi Jaime--the point is that we had upgraded to condor 7.0.3
>>>>> (and the newer gahp_server that comes with it) considerably before
>>>>> the upgrade to OSG 1.0.0. OSG 0.8.0 shipped with condor 6.8.8
>>>>> but we were already on condor 6.9 by that time. In any case the
>>>>> gahp_server is the same version between condor 6.8.8, condor 6.8.5,
>>>>> condor 6.9.5, and condor 7.0.3.
>>>>> snowball.timm:/usr/local/osgclient-1.8.1/condor/sbin> ident gahp_server
>>>>> gahp_server:
>>>>> $GahpVersion: 1.0.15 Sep 13 2007 UW\ Gahp $
>>>>> I am now running 1.0.16 since you gave me the debug version but
>>>>> the errors are the same.
>>>>> So I would like to see if we can focus this ticket at all:
>>>>> 1) The Globus error 17's that I've been reporting are coming
>>>>> from a malformed scheduler_condor_submit_script. It cannot
>>>>> find the executable because it is just looking in the wrong
>>>>> place. We need to understand why and how this can happen, and why
>>>>> only in 2 or 3 jobs of a cluster of a couple of hundred.
>>>>> 2) We still have no more idea why the globus error 43 happens
>>>>> than we ever did. Although we are pretty sure it is happening
>>>>> in the transfer executable section of Jobmanager.pm, we are not
>>>>> getting any meaningful standard error and the extra print
>>>>> statements we've tried to add to date have told us nothing.
>>>>> (Said section is trying to do a globus-gass-cache -add when it
>>>>> fails.)
>>>>> 3) We haven't been able to interpret the globus error 10 any
>>>>> better than we did before, even with the extra debugging in
>>>>> gahp_server.
>>>>> What is the next step? We need to focus the effort and bring it
>>>>> to a satisfactory solution pretty soon.
>
>
> Thanks and regards,
> Jaime Frey
> UW-Madison Condor Team
>
>
>