[xio-user] RLS crash in globus_io call while running on sparc solaris 10
bresnaha at mcs.anl.gov
Wed May 28 16:26:12 CDT 2008
Can you get a Valgrind report or run the server under Electric Fence? Similar stack traces have turned out to be caused by memory corruption.
-----Original Message-----
From: Robert Schuler <schuler at isi.edu>
Sent: May 28, 2008 4:03 PM
To: xio-user at globus.org
Cc: Robert Schuler <schuler at isi.edu>; Scott Koranda <skoranda at gravity.phys.uwm.edu>
Subject: [xio-user] RLS crash in globus_io call while running on sparc solaris 10
Hello, XIO,
Got a crash in an RLS server running on a SPARC Solaris 10 box. This bug
looks to me like a globus_io (globus_io_xio_compat.c) bug.
http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=6085
Here's the gdb backtrace of the core dump:
"
------- Comment #3 From Scott Koranda 2008-05-22 16:09 [reply] -------
It was/is a Solaris 10 box:
[grid at ldas-cit skoranda]$ uname -a
SunOS ldas-cit 5.10 Generic_127111-11 sun4u sparc SUNW,Sun-Fire-880
The backtrace from the core file is
(gdb) backtrace
#0 0xfe945b84 in _lwp_kill () from /lib/libc.so.1
#1 0xfe8e4bbc in raise () from /lib/libc.so.1
#2 0xfe8c10c0 in abort () from /lib/libc.so.1
#3 0xfef33078 in globus_silent_fatal () at globus_print.c:57
#4 0xfef33124 in globus_fatal (
    msg=0xfef4a248 "%s %s\n%s unknown error number: %d\n")
    at globus_print.c:88
#5 0xfef39644 in globus_i_thread_report_bad_rc (rc=59,
    message=0xfef4a620 "GLOBUSTHREAD: pthread_mutex_lock() failed\n")
    at globus_thread_common.c:138
#6 0xfef3b21c in globus_mutex_lock (mut=0x27455b70)
    at globus_thread_pthreads.c:823
#7 0xff294928 in globus_io_register_writev (handle=0x2599c7b0,
    iov=0xfc5fb220, iovcnt=2, writev_callback=0xff2feaa8 <writevcb>,
    callback_arg=0xfc5fb160) at globus_io_xio_compat.c:3236
#8 0xff2fe36c in rrpc_writev (h=0x2599c7b0, iov=0xfc5fb220, iovcnt=2,
    nbw=0xfc5fb230, errmsg=0x259a07d4 "L-R-894811392-32.gwf,0")
    at rpc.c:317
#9 0x0002ba7c in rrpc_error (c=0x2599c7b0, rc=12, fmt=0x3b6d0 "%s")
    at server.c:1376
#10 0x0002e14c in lrc_exists (c=0x2599c7b0, dbh=0xfc5fbf4c,
    arglist=0xfc5fbf58) at server.c:1950
#11 0x0002a65c in procreq (a=0x0) at server.c:1054
#12 0xfe944998 in _lwp_start () from /lib/libc.so.1
#13 0xfe944998 in _lwp_start () from /lib/libc.so.1
---Type <return> to continue, or q <return> to quit---
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
"
The RLS code (replica/client/library/rpc.c) for the rrpc_writev(...)
call above is:
"
int
rrpc_writev(globus_io_handle_t *h, struct iovec *iov, globus_size_t iovcnt,
            globus_size_t *nbw, char *errmsg)
{
    globus_result_t r;
    IOMON mon;
    struct timespec ts;

    globus_mutex_init(&mon.mtx, GLOBUS_NULL);
    globus_cond_init(&mon.cond, GLOBUS_NULL);
    mon.done = GLOBUS_FALSE;
    mon.nb = 0;
    mon.rc = GLOBUS_RLS_SUCCESS;
    mon.errmsg = errmsg;
    mon.errmsglen = MAXERRMSG;
    r = globus_io_register_writev(h, iov, iovcnt, writevcb, &mon);
    if (r != GLOBUS_SUCCESS) {
        mon.done = GLOBUS_TRUE;
        mon.rc = rrpc_globuserr(errmsg, MAXERRMSG, r);
    }
    ...<snip>
"
It's basically just making a globus_io_register_writev(...) call on a
handle that was accepted from a listening socket shortly beforehand and
was read from just before this write.
The io code (io/compat/globus_io_xio_compat.c) for the
globus_io_register_writev(...) call follows, starting at line 3236 of
the file, which corresponds to the failed mutex lock:
"
    globus_mutex_lock(&ihandle->pending_lock); <<<==== line 3236
    {
        result = globus_xio_register_writev(
            ihandle->xio_handle,
            iov,
            iovcnt,
            nbytes,
            GLOBUS_NULL,
            globus_l_io_bounce_iovec_cb,
            bounce_info);
        if(result != GLOBUS_SUCCESS)
        {
            globus_mutex_unlock(&ihandle->pending_lock);
            goto error_register;
        }
        globus_l_io_cancel_insert(bounce_info);
    }
    globus_mutex_unlock(&ihandle->pending_lock);
"
Please tell me whether user code can cause an inconsistent state for the
handle's pending_lock mutex. To me, though, this looks like an internal
io/xio issue.
Speak now or silently wait for a bugzilla record. :-)
Rob