[gridway-user] Failure detection in GridWay

Raül Sirvent Raul.Sirvent at bsc.es
Fri Sep 21 11:33:23 CDT 2007


Dear all,

I was trying to simulate a job failure in GridWay by killing the running 
binary on the remote machine. The strange thing is that when I did this, 
the job got stuck in the EPILOG state while its XFER time kept increasing. 
The job first retries the stage-out of a result file, which was never 
generated because of the failure, and after that nothing new appears in 
the job's log. The last messages are:

Fri Sep 21 18:29:29 2007 [DM][I]: New state is EPILOG.
Fri Sep 21 18:29:29 2007 [TM][I]: Staging output files:
Fri Sep 21 18:29:29 2007 [TM][I]:       Source: gsiftp://bscgrid02.bsc.es/~/.gw_rsirvent_217/.
Fri Sep 21 18:29:29 2007 [TM][I]:       Copying file file:///home/rsirvent/mcarlo-worker/destGen.0.
Fri Sep 21 18:29:29 2007 [TM][I]:       Copying file stdout.execution.
Fri Sep 21 18:29:29 2007 [TM][I]:       Copying file stderr.execution.
Fri Sep 21 18:29:31 2007 [TM][I]:       Retrying copy of file file:///home/rsirvent/mcarlo-worker/destGen.0 in ~5 seconds.
Fri Sep 21 18:29:32 2007 [TM][I]:       File stdout.execution copied.
Fri Sep 21 18:29:33 2007 [TM][I]:       File stderr.execution copied.
Fri Sep 21 18:29:39 2007 [TM][I]:       Retrying copy of file file:///home/rsirvent/mcarlo-worker/destGen.0 in ~10 seconds.
Fri Sep 21 18:29:47 2007 [TM][E]:       Copy of file file:///home/rsirvent/mcarlo-worker/destGen.0 failed.
Fri Sep 21 18:29:47 2007 [TM][W]: Some output files were not copied, will NOT remove remote directory.
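
To check a job log like this one without reading it by hand, something 
along the following lines might help. It is only a minimal Python sketch 
that assumes the log format shown above (timestamp, [module][level], 
message). The names parse_log and failed_copies are hypothetical helpers, 
not part of GridWay.

```python
import re

# Start of a log entry as it appears in the excerpt above, e.g.
# "Fri Sep 21 18:29:47 2007 [TM][E]:       Copy of file ..."
ENTRY = re.compile(
    r"^(?P<ts>\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) "
    r"\[(?P<module>\w+)\]\[(?P<level>\w)\]:\s*(?P<msg>.*)$"
)

# Message text of a failed transfer, as seen in the excerpt.
COPY_FAILED = re.compile(r"^Copy of file (\S+) failed\.$")


def parse_log(text):
    """Return a list of (timestamp, module, level, message) tuples.

    A line that does not start with a timestamp is treated as the
    wrapped continuation of the previous entry (mail clients often
    break long log lines).
    """
    entries = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        m = ENTRY.match(line)
        if m:
            entries.append([m["ts"], m["module"], m["level"], m["msg"].strip()])
        elif entries:
            # Continuation line: glue it onto the previous message.
            entries[-1][3] = (entries[-1][3] + " " + line.strip()).strip()
    return [tuple(e) for e in entries]


def failed_copies(entries):
    """Return the files whose copy was reported as failed by the TM."""
    files = []
    for _ts, module, level, msg in entries:
        m = COPY_FAILED.match(msg)
        if module == "TM" and level == "E" and m:
            files.append(m.group(1))
    return files
```

Feeding the job log text to parse_log and passing the result to 
failed_copies should list the files that never made it back, which in the 
excerpt above is only destGen.0.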

What could be happening? Keep in mind that I use a special version of 
GridWay that allows me to explicitly clean up that result after the 
execution.

Thanks in advance. Best regards,

Raul.