[gridway-user] Failure detection in GridWay
Raül Sirvent
Raul.Sirvent at bsc.es
Fri Sep 21 11:33:23 CDT 2007
Dear all,
I was trying to simulate a job failure in GridWay by killing the running
binary on the remote machine. Strangely, when I did this the job got
stuck in the EPILOG state while its XFER time kept increasing. The job
first retries the stage-out of a result file, which was never generated
because of the failure, and after that nothing new appears in the job's
log. The last messages are:
Fri Sep 21 18:29:29 2007 [DM][I]: New state is EPILOG.
Fri Sep 21 18:29:29 2007 [TM][I]: Staging output files:
Fri Sep 21 18:29:29 2007 [TM][I]: Source: gsiftp://bscgrid02.bsc.es/~/.gw_rsirvent_217/.
Fri Sep 21 18:29:29 2007 [TM][I]: Copying file file:///home/rsirvent/mcarlo-worker/destGen.0.
Fri Sep 21 18:29:29 2007 [TM][I]: Copying file stdout.execution.
Fri Sep 21 18:29:29 2007 [TM][I]: Copying file stderr.execution.
Fri Sep 21 18:29:31 2007 [TM][I]: Retrying copy of file file:///home/rsirvent/mcarlo-worker/destGen.0 in ~5 seconds.
Fri Sep 21 18:29:32 2007 [TM][I]: File stdout.execution copied.
Fri Sep 21 18:29:33 2007 [TM][I]: File stderr.execution copied.
Fri Sep 21 18:29:39 2007 [TM][I]: Retrying copy of file file:///home/rsirvent/mcarlo-worker/destGen.0 in ~10 seconds.
Fri Sep 21 18:29:47 2007 [TM][E]: Copy of file file:///home/rsirvent/mcarlo-worker/destGen.0 failed.
Fri Sep 21 18:29:47 2007 [TM][W]: Some output files were not copied, will NOT remove remote directory.
What could be happening? Keep in mind that I use a special version of
GridWay that allows me to specifically clean up that result after the
execution.
Thanks in advance. Best regards,
Raul.