[xio-user] RLS crash in globus_io call while running on sparc solaris 10

Robert Schuler schuler at isi.edu
Wed May 28 16:03:51 CDT 2008


Hello, XIO,

We got a crash in an RLS server running on a SPARC Solaris 10 box. To me
this looks like a bug in globus_io (globus_io_xio_compat.c).

http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=6085

Here's the gdb backtrace of the core dump:
"
------- Comment #3 From Scott Koranda 2008-05-22 16:09 [reply] ------- 
It was/is a Solaris 10 box:

[grid at ldas-cit skoranda]$ uname -a
SunOS ldas-cit 5.10 Generic_127111-11 sun4u sparc SUNW,Sun-Fire-880

The backtrace from the core file is 

(gdb) backtrace
#0  0xfe945b84 in _lwp_kill () from /lib/libc.so.1
#1  0xfe8e4bbc in raise () from /lib/libc.so.1
#2  0xfe8c10c0 in abort () from /lib/libc.so.1
#3  0xfef33078 in globus_silent_fatal () at globus_print.c:57
#4  0xfef33124 in globus_fatal (
    msg=0xfef4a248 "%s %s\n%s unknown error number: %d\n") at globus_print.c:88
#5  0xfef39644 in globus_i_thread_report_bad_rc (rc=59,
    message=0xfef4a620 "GLOBUSTHREAD: pthread_mutex_lock() failed\n")
    at globus_thread_common.c:138
#6  0xfef3b21c in globus_mutex_lock (mut=0x27455b70)
    at globus_thread_pthreads.c:823
#7  0xff294928 in globus_io_register_writev (handle=0x2599c7b0,
    iov=0xfc5fb220, iovcnt=2, writev_callback=0xff2feaa8 <writevcb>,
    callback_arg=0xfc5fb160) at globus_io_xio_compat.c:3236
#8  0xff2fe36c in rrpc_writev (h=0x2599c7b0, iov=0xfc5fb220, iovcnt=2,
    nbw=0xfc5fb230, errmsg=0x259a07d4 "L-R-894811392-32.gwf,0") at rpc.c:317
#9  0x0002ba7c in rrpc_error (c=0x2599c7b0, rc=12, fmt=0x3b6d0 "%s")
    at server.c:1376
#10 0x0002e14c in lrc_exists (c=0x2599c7b0, dbh=0xfc5fbf4c, arglist=0xfc5fbf58)
    at server.c:1950
#11 0x0002a65c in procreq (a=0x0) at server.c:1054
#12 0xfe944998 in _lwp_start () from /lib/libc.so.1
#13 0xfe944998 in _lwp_start () from /lib/libc.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
"

The RLS code (replica/client/library/rpc.c) for the rrpc_writev(...)
call above is:
"
int
rrpc_writev(globus_io_handle_t *h, struct iovec *iov, globus_size_t iovcnt,
        globus_size_t *nbw, char *errmsg)

{
  globus_result_t   r;
  IOMON         mon;
  struct timespec   ts;

  globus_mutex_init(&mon.mtx, GLOBUS_NULL);
  globus_cond_init(&mon.cond, GLOBUS_NULL);
  mon.done = GLOBUS_FALSE;
  mon.nb = 0;
  mon.rc = GLOBUS_RLS_SUCCESS;
  mon.errmsg = errmsg;
  mon.errmsglen = MAXERRMSG;
  r = globus_io_register_writev(h, iov, iovcnt, writevcb, &mon);
  if (r != GLOBUS_SUCCESS) {
    mon.done = GLOBUS_TRUE;
    mon.rc = rrpc_globuserr(errmsg, MAXERRMSG, r);
  }
...<snip>
"

It simply calls globus_io_register_writev(...) on a handle that was
accepted from a listening socket and then read from immediately before
this write.

The io code (io/compat/globus_io_xio_compat.c) for the
globus_io_register_writev(...) call is below, starting at line 3236 of the
file, which is where the mutex lock fails:

"
    globus_mutex_lock(&ihandle->pending_lock);   <<<==== line 3236
    {
        result = globus_xio_register_writev(
            ihandle->xio_handle,
            iov,
            iovcnt,
            nbytes,
            GLOBUS_NULL,
            globus_l_io_bounce_iovec_cb,
            bounce_info);
        if(result != GLOBUS_SUCCESS)
        {
            globus_mutex_unlock(&ihandle->pending_lock);
            goto error_register;
        }

        globus_l_io_cancel_insert(bounce_info);
    }
    globus_mutex_unlock(&ihandle->pending_lock);
"

Let me know if user code could put the handle's pending_lock mutex into an
inconsistent state, but it looks to me like an internal io/xio issue.

Speak now or silently wait for a bugzilla record. :-)

Rob



