[gram-dev] Fwd: Increased globus errors 17 and 43 since upgrade to osg 1.0

Charles Bacon bacon at mcs.anl.gov
Wed Aug 27 16:27:40 CDT 2008


There have been two commits to the gass/ directory since the 4.0.5
release.  One commit (http://tinyurl.com/gass1) was to fix bug 5706
(http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5706).  The other
commit (http://preview.tinyurl.com/gasstwo) was to fix bug 5771
(http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5771).

I don't have the experience to evaluate whether either of those two  
patches would affect this behavior, but perhaps that's enough of a  
clue for you to see if it's relevant.


Charles

On Aug 27, 2008, at 2:33 PM, Jaime Frey wrote:

> The gahp_server statically links in the gram client and gass  
> server-ez libraries (and their dependencies) from Globus 4.0.5. The  
> code hasn't been changed since Condor 6.8.6, when we switched from  
> linking against Globus 4.0.3.
>
> So a difference in behavior between OSG 0.8.0 and 1.0.0 can't be  
> explained by changes in the GAHP server. Is it possible that in the  
> version of Globus included in OSG 1.0.0, the tools trying to talk to  
> the gahp_server are less patient if the gahp is very slow to respond?
>
> There is no documentation of the gahp's internals beyond the source  
> code. The gahp does include modifications of the gass server-ez code  
> from CERN that allow it to authenticate 20 connections in parallel  
> instead of just one. Other than that, it's the stock Globus code.
>
> -- Jaime
>
> On Aug 19, 2008, at 9:00 PM, Steven Timm wrote:
>
>> All fingers seem to be pointing to gahp_server at the moment, but
>> it is a black box and a fairly small black box at that, which  
>> doesn't appear to require any globus libraries.  Is there
>> anything short of attaching to the gahp_server with gdb, or stracing
>> it, that we can do?  Any documentation available short of  
>> downloading the source code?  We are looking for a needle in a  
>> haystack here,
>> with only 2-3 errors happening per day.  I have put almost all the
>> debugging I can into the perl script. The condor_gridmanager
>> log is giving us nothing.
>>
>> Steve Timm
>>
>>
>>
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D  (630) 840-8525
>> timm at fnal.gov  http://home.fnal.gov/~timm/
>> Fermilab Computing Division, Scientific Computing Facilities,
>> Grid Facilities Department, FermiGrid Services Group, Assistant  
>> Group Leader.
>>
>> On Tue, 19 Aug 2008, Charles Bacon wrote:
>>
>>> The latest from fermi.  See both the "Here's the latest from this"  
>>> entry and the one starting with "So I would like to see if we can  
>>> focus this ticket at all:".
>>>
>>>
>>> Charles
>>>
>>>> Here's the latest from this:
>>>> One more followup to this ticket.  The debugging has now captured
>>>> the first return code from the globus-gass-cache -add command
>>>> which is failing and causing the error 43.
>>>> The return code is 65280,
>>>> which is FF00 in hex.  Does that tell you anything?
>>>> Steve
>>>> On Mon, 18 Aug 2008, Steven Timm wrote:
>>>>> On Fri, 15 Aug 2008, condor-support response tracking system  
>>>>> wrote:
>>>>>>> One of my biggest servers is now having the condor_gridmanager of
>>>>>>> one of my biggest users crash every five minutes with the error:
>>>>>>> 8/1 11:52:37 [25123] GAHP[26126] <- 'GRAM_JOB_CALLBACK_REGISTER 19 https://fermigridosg1.fnal.gov:40020/23358/1217559488/ https://fnpcsrv1.fnal.gov:61442/'
>>>>>>> 8/1 11:52:37 [25123] GAHP[26126] -> EOF
>>>>>>> 8/1 11:52:37 [25123] ERROR "Bad GRAM_JOB_CALLBACK_REGISTER Request" at line 2055 in file gahp-client.C
>>>>>>> Any idea what might be causing this?
>>>>>>> Background is that the user is trying to come back after a
>>>>>>> situation where 1122 condor-g jobs were held due to an expired proxy.
>>>>>>> It is not the same port that it is going after every time.
>>>>>>> The 10 seconds of log in D_FULLDEBUG mode is attached.
>>>>>>> This is output that came from the heavily debug-instrumented
>>>>>>> gahp_server that Jaime sent us to try to debug the globus error 10.
>>>>>>> Recently we have not seen many globus error 10s, but we are now seeing
>>>>>>> a lot of globus errors 17 and 43 when the jobs try to start
>>>>>>> at the remote side and call back to the GAHP_SERVER to fetch their
>>>>>>> executable.  There is also a definite increase in hung
>>>>>>> globus-gass-cache-util executables, something which happened occasionally
>>>>>>> under osg 0.6.0 and was very rare under osg 0.8.0, but now in osg 1.0.0
>>>>>>> is back with a vengeance: 5 instances in the last week, on different
>>>>>>> machines.
>>>>>> The gahp server is dying in the log snippet quoted above.
>>>>>> In the longer gridmanager log at the end of your email, authentication
>>>>>> is failing repeatedly. Unfortunately, GRAM doesn't give any details on
>>>>>> why authentication failed.
>>>>>> One thing you can try is using the gahp_server process from OSG 0.8.0.
>>>>>> It should work fine with the rest of Condor in OSG 1.0.0. I'd be very
>>>>>> interested to know if that dramatically reduces the number of errors.
>>>>> Hi Jaime--the point is that we had upgraded to condor 7.0.3
>>>>> (and the newer gahp_server that comes with it) considerably before
>>>>> the upgrade to OSG 1.0.0.  OSG 0.8.0 shipped with condor 6.8.8 but we
>>>>> were already on condor 6.9 by that time.  In any case the gahp_server
>>>>> is the same version between condor 6.8.8, condor 6.8.5, condor 6.9.5,
>>>>> and condor 7.0.3.
>>>>> snowball.timm:/usr/local/osgclient-1.8.1/condor/sbin> ident gahp_server
>>>>> gahp_server:
>>>>> $GahpVersion: 1.0.15 Sep 13 2007 UW\ Gahp $
>>>>> I am now running 1.0.16 since you gave me the debug version but
>>>>> the errors are the same.
>>>>> So I would like to see if we can focus this ticket at all:
>>>>> 1) The globus error 17s that I've been reporting are coming
>>>>> from a malformed scheduler_condor_submit_script. It cannot
>>>>> find the executable because it is just looking in the wrong
>>>>> place.  We need to understand why and how this can happen, and why
>>>>> only in 2 or 3 jobs of a cluster of a couple of hundred.
>>>>> 2) We still have no more idea why the globus error 43 happens
>>>>> than we ever did. Although we are pretty sure it is happening
>>>>> in the transfer executable section of Jobmanager.pm, we are not
>>>>> getting any meaningful standard error and the extra print
>>>>> statements we've tried to add to date have told us nothing.
>>>>> (Said section is trying to do a globus-gass-cache -add when it fails.)
>>>>> 3) We haven't been able to interpret the globus error 10 any better
>>>>> than we did before even with the extra debugging in gahp_server.
>>>>> What is the next step?  We need to focus the effort and bring it
>>>>> to a satisfactory solution pretty soon.
>
>
> Thanks and regards,
> Jaime Frey
> UW-Madison Condor Team
>
>
>
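
On the 65280 (FF00 hex) return code quoted in the thread: if that number
was captured as a raw wait status (for example Perl's $? after system()
or backticks, which is an assumption here since the thread does not say
how it was captured), then the exit code sits in the high byte and the
low byte carries the signal number and core-dump flag. A minimal sketch
of that decoding follows; decode_wait_status is a hypothetical helper
name, not part of Globus or Condor.

```python
def decode_wait_status(status):
    """Split a 16-bit wait status (the layout used by Perl's $?)
    into (exit_code, signal, core_dumped)."""
    exit_code = (status >> 8) & 0xFF    # high byte: the child's exit code
    signal = status & 0x7F              # low 7 bits: terminating signal, if any
    core_dumped = bool(status & 0x80)   # bit 7: core dump flag
    return exit_code, signal, core_dumped


# The value reported in the thread for globus-gass-cache -add.
exit_code, signal, core_dumped = decode_wait_status(65280)
print(hex(65280), exit_code, signal, core_dumped)
```

Under that assumption, 65280 would mean the command exited on its own
with code 255 rather than being killed by a signal. That narrows the
failure down somewhat but does not by itself say why the command failed.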



