[gridway-user] Rescheduling doesn't work

manuelsobreira at portugalmail.pt manuelsobreira at portugalmail.pt
Thu Aug 31 06:25:24 CDT 2006


 
Hi,

Thanks for your response. But in my test the process cannot have executed
correctly, because I killed it with the 'kill' command on the remote machine
where it was running (WRAPPER state). Because I killed it, I expected GridWay
to reschedule the task. But GridWay considers that the job executed
successfully!
I don't know if my test is valid. If I'm wrong, can you tell me in which
cases GridWay reschedules tasks?

Best Regards,

Manuel.


 Hi, 
 	Well, if you look at what happened to your job, it has not "totally" failed.
 It was "successfully" executed and its exit code was set (note that at the end
 of the EPILOG_STD state the exit code is parsed to find out whether there was
 an error). Then your job entered the EPILOG state to stage out the output
 files, and it failed there because of "temp${TASK_ID}.mco". At this point
 GridWay does not know what caused the failure (you could have misspelled the
 filename, the program could have seg. faulted, the remote GridFTP server could
 be down...).
 
 	However, as the job was successfully executed, we decided not to
 automatically re-schedule these failures. So this is how we expect
 GridWay to behave. Of course, failures in PROLOG/WRAPPER/EPILOG_STD
 will cause your job to be re-scheduled.
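
 The rule described above can be sketched in a few lines of Python. This is
 only an illustration of the behaviour as stated in this message, not GridWay
 code: the function name `should_reschedule` and the idea that
 RESCHEDULE_ON_FAILURE gates the decision are assumptions.

```python
# States in which a failure leads to rescheduling, as listed in the reply above.
# A failure in EPILOG (staging out the job's output files) is not in this set.
RESCHEDULABLE_STATES = {"PROLOG", "WRAPPER", "EPILOG_STD"}


def should_reschedule(failed_state: str, reschedule_on_failure: bool) -> bool:
    """Hypothetical helper: would a failure in this state be rescheduled?

    Assumes that rescheduling only applies when RESCHEDULE_ON_FAILURE is set
    in the job template.
    """
    return reschedule_on_failure and failed_state in RESCHEDULABLE_STATES


print(should_reschedule("WRAPPER", True))  # True
print(should_reschedule("EPILOG", True))   # False
```

 For the job in this thread the failing state is EPILOG, so under this rule
 the job ends up FAILED instead of being rescheduled.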
 
 	Note: The copy of a file is retried a number of times; we are currently
 developing a backoff mechanism to improve its reliability.
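
 As a rough idea of what retrying a copy with a backoff could look like, here
 is a minimal Python sketch. It is not the mechanism under development in
 GridWay: the function `copy_with_backoff`, its parameters, the exponential
 delay schedule and the use of OSError to signal a failed copy are all
 assumptions made for illustration.

```python
import time


def copy_with_backoff(copy_file, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call copy_file() until it succeeds or `retries` retries are used up.

    Hypothetical sketch: the wait doubles after each failed attempt
    (base_delay, 2 * base_delay, 4 * base_delay, ...). The last error is
    re-raised once no retries are left.
    """
    for attempt in range(retries + 1):
        try:
            return copy_file()
        except OSError:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

 The `sleep` argument is there only so that the waiting can be replaced when
 trying the function out; a caller would normally leave it at its default.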
 
 Regards
 
 Ruben
 
 On Wednesday 30 August 2006 19:57, manuelsobreira at portugalmail.pt wrote:
 > Hi,
 >
 > I'm using GridWay 5.0, and I have a problem with rescheduling. During
 > my test a task failed, and in my job template I set
 > RESCHEDULE_ON_FAILURE=yes.
 >
 > But the task is not rescheduled, and I don't know why.
 >
 > My job template is (in /usr/local/gw5.0/var/3/job.template):
 >
 > EXECUTABLE=mcml
 > ARGUMENTS=input${TASK_ID}.mci
 > INPUT_FILES=input${TASK_ID}.mci
 > OUTPUT_FILES=temp${TASK_ID}.mco output${TASK_ID}.mco
 > STDIN_FILE=
 > STDOUT_FILE=stdout_file.${TASK_ID}
 > STDERR_FILE=stderr_file.${TASK_ID}
 > REQUIREMENTS=
 > RANK=CPU_MHZ
 > RESCHEDULING_INTERVAL=0
 > RESCHEDULING_THRESHOLD=300
 > SUSPENSION_TIMEOUT=900
 > CPULOAD_THRESHOLD=50
 > RESCHEDULE_ON_FAILURE=1
 > NUMBER_OF_RETRIES=2
 > CHECKPOINT_INTERVAL=600
 > CHECKPOINT_URL=
 > WRAPPER=/usr/local/gw5.0/scripts/wrapper.sh
 > MONITOR=
 > PRE_WRAPPER=
 > PRE_WRAPPER_ARGUMENTS=
 >
 >
 > and the contents of job.log are:
 >
 > Wed Aug 30 19:36:28 2006 [DM][I]: ----------- Job configuration file (mcml.jt) values -----------
 > Wed Aug 30 19:36:28 2006 [DM][I]:       EXECUTABLE             : mcml
 > Wed Aug 30 19:36:28 2006 [DM][I]:       ARGUMENTS              : input${TASK_ID}.mci
 > Wed Aug 30 19:36:28 2006 [DM][I]:       INPUT_FILES  (Total 1):
 > Wed Aug 30 19:36:28 2006 [DM][I]:               (0) Local: input${TASK_ID}.mci - Remote:
 > Wed Aug 30 19:36:28 2006 [DM][I]:       OUTPUT_FILES  (Total 1):
 > Wed Aug 30 19:36:28 2006 [DM][I]:               (0) Local: temp${TASK_ID}.mco - Remote: output${TASK_ID}.mco
 > Wed Aug 30 19:36:28 2006 [DM][I]:       RESTART_FILES (Total 0):
 > Wed Aug 30 19:36:28 2006 [DM][I]:       STDIN_FILE             :
 > Wed Aug 30 19:36:28 2006 [DM][I]:       STDOUT_FILE            : stdout_file.${TASK_ID}
 > Wed Aug 30 19:36:28 2006 [DM][I]:       STDERR_FILE            : stderr_file.${TASK_ID}
 > Wed Aug 30 19:36:28 2006 [DM][I]:       REQUIREMENTS           :
 > Wed Aug 30 19:36:28 2006 [DM][I]:       RANK                   : CPU_MHZ
 > Wed Aug 30 19:36:28 2006 [DM][I]:       RESCHEDULING_INTERVAL  : 0
 > Wed Aug 30 19:36:28 2006 [DM][I]:       RESCHEDULING_THRESHOLD : 300
 > Wed Aug 30 19:36:28 2006 [DM][I]:       SUSPENSION_TIMEOUT     : 900
 > Wed Aug 30 19:36:28 2006 [DM][I]:       CPULOAD_THRESHOLD      : 50
 > Wed Aug 30 19:36:28 2006 [DM][I]:       RESCHEDULE_ON_FAILURE  : 1
 > Wed Aug 30 19:36:28 2006 [DM][I]:       NUMBER_OF_RETRIES      : 2
 > Wed Aug 30 19:36:28 2006 [DM][I]:       CHECKPOINT_INTERVAL    : 600
 > Wed Aug 30 19:36:28 2006 [DM][I]:       CHECKPOINT_URL         :
 > Wed Aug 30 19:36:28 2006 [DM][I]:       WRAPPER                : /usr/local/gw5.0/scripts/wrapper.sh
 > Wed Aug 30 19:36:28 2006 [DM][I]:       MONITOR                :
 > Wed Aug 30 19:36:28 2006 [DM][I]:       PRE_WRAPPER            :
 > Wed Aug 30 19:36:28 2006 [DM][I]:       PRE_WRAPPER_ARGUMENTS  :
 > Wed Aug 30 19:36:28 2006 [DM][I]: ----------------------------------------------------------
 > Wed Aug 30 19:36:28 2006 [DM][I]: New state is PENDING.
 > Wed Aug 30 19:36:51 2006 [DM][I]: New state is PROLOG.
 > Wed Aug 30 19:36:51 2006 [TM][I]: Creating remote job working directory:
 > Wed Aug 30 19:36:51 2006 [TM][I]:       Target url: gsiftp://itppc220/~/.gw_gwadmin_3/.
 > Wed Aug 30 19:36:51 2006 [TM][I]:       Remote job directory created.
 > Wed Aug 30 19:36:51 2006 [TM][I]: Staging input files:
 > Wed Aug 30 19:36:51 2006 [TM][I]:       Source: /usr/local/gw5.0/examples/mcml2.
 > Wed Aug 30 19:36:51 2006 [TM][I]:       Copying file input${TASK_ID}.mci.
 > Wed Aug 30 19:36:51 2006 [TM][I]:       Copying file file:///usr/local/gw5.0/var/3/job.env.
 > Wed Aug 30 19:36:51 2006 [TM][I]:       Copying file mcml.
 > Wed Aug 30 19:36:51 2006 [TM][I]:       Copying file file:///usr/local/gw5.0/scripts/wrapper.sh.
 > Wed Aug 30 19:36:52 2006 [TM][I]:       File input${TASK_ID}.mci copied.
 > Wed Aug 30 19:36:53 2006 [TM][I]:       File file:///usr/local/gw5.0/var/3/job.env copied.
 > Wed Aug 30 19:36:54 2006 [TM][I]:       File mcml copied.
 > Wed Aug 30 19:36:54 2006 [TM][I]:       File file:///usr/local/gw5.0/scripts/wrapper.sh copied.
 > Wed Aug 30 19:36:54 2006 [TM][I]: All input files copied.
 > Wed Aug 30 19:36:54 2006 [DM][I]: Prolog done:
 > Wed Aug 30 19:36:54 2006 [DM][I]:       Total time      : 3
 > Wed Aug 30 19:36:54 2006 [DM][I]: New state is WRAPPER.
 > Wed Aug 30 19:36:54 2006 [EM][I]: Submitting wrapper to itppc220/Fork, RSL used is in /usr/local/gw5.0/var/3/job.rsl.0.
 > Wed Aug 30 19:37:29 2006 [EM][I]: New execution state is PENDING.
 > Wed Aug 30 19:38:52 2006 [EM][I]: New execution state is DONE.
 > Wed Aug 30 19:38:52 2006 [DM][I]: Wrapper done:
 > Wed Aug 30 19:38:52 2006 [DM][I]:       Active time     : 0
 > Wed Aug 30 19:38:52 2006 [DM][I]:       Suspension time : 118
 > Wed Aug 30 19:38:52 2006 [DM][I]:       Total time      : 118
 > Wed Aug 30 19:38:52 2006 [DM][I]: New state is EPILOG_STD.
 > Wed Aug 30 19:38:52 2006 [TM][I]: Staging output files:
 > Wed Aug 30 19:38:52 2006 [TM][I]:       Source: gsiftp://itppc220/~/.gw_gwadmin_3/.
 > Wed Aug 30 19:38:52 2006 [TM][I]:       Copying file stdout.wrapper.
 > Wed Aug 30 19:38:52 2006 [TM][I]:       Copying file stderr.wrapper.
 > Wed Aug 30 19:38:53 2006 [TM][I]:       File stdout.wrapper copied.
 > Wed Aug 30 19:38:53 2006 [TM][I]:       File stderr.wrapper copied.
 > Wed Aug 30 19:38:53 2006 [TM][I]: All output files copied.
 > Wed Aug 30 19:38:53 2006 [DM][I]: New state is EPILOG.
 > Wed Aug 30 19:38:53 2006 [TM][I]: Staging output files:
 > Wed Aug 30 19:38:53 2006 [TM][I]:       Source: gsiftp://itppc220/~/.gw_gwadmin_3/.
 > Wed Aug 30 19:38:53 2006 [TM][I]:       Copying file temp${TASK_ID}.mco.
 > Wed Aug 30 19:38:53 2006 [TM][I]:       Copying file stdout.execution.
 > Wed Aug 30 19:38:53 2006 [TM][I]:       Copying file stderr.execution.
 > Wed Aug 30 19:38:54 2006 [TM][I]:       Retrying copy of file temp${TASK_ID}.mco.
 > Wed Aug 30 19:38:55 2006 [TM][I]:       File stdout.execution copied.
 > Wed Aug 30 19:38:56 2006 [TM][I]:       File stderr.execution copied.
 > Wed Aug 30 19:38:57 2006 [TM][E]:       Copy of file temp${TASK_ID}.mco failed.
 > Wed Aug 30 19:38:57 2006 [TM][W]: Some output files were not copied, will NOT remove remore directory.
 > Wed Aug 30 19:38:57 2006 [DM][E]: Epilog failed:
 > Wed Aug 30 19:38:57 2006 [DM][E]:       Total time      : 5
 > Wed Aug 30 19:38:57 2006 [DM][I]: New state is FAILED.
 > Wed Aug 30 19:38:57 2006 [DM][I]: Job failed, history:
 > Wed Aug 30 19:38:57 2006 [DM][I]: ----------- Job history record -----------
 > Wed Aug 30 19:38:57 2006 [IM][I]:       -------------- Host info.  --------------
 > Wed Aug 30 19:38:57 2006 [IM][I]:       Name = itppc220
 > Wed Aug 30 19:38:57 2006 [IM][I]:       OS   = Linux 2.6.12-1.1381_FC3smp
 > Wed Aug 30 19:38:57 2006 [IM][I]:       CPU  = x86 (x86) at 3193 MHz
 > Wed Aug 30 19:38:57 2006 [IM][I]:       Mem  = 254 of 946 MB
 > Wed Aug 30 19:38:57 2006 [IM][I]:       Disk = 68775 of 160221 MB
 > Wed Aug 30 19:38:57 2006 [IM][I]:       LRMS = fork (Fork) with 2 nodes
 > Wed Aug 30 19:38:57 2006 [IM][I]:                         NC FNC    MT  MCT  MC MRJ MJQ
 > Wed Aug 30 19:38:57 2006 [IM][I]:       QUEUE= default  (  2   2     0 1000   0 5   0), enabled status, NULL type, 0 priority
 > Wed Aug 30 19:38:57 2006 [IM][I]:       -----------------------------------------
 > Wed Aug 30 19:38:57 2006 [DM][I]:       Host GRAM contact = itppc220/Fork
 > Wed Aug 30 19:38:57 2006 [DM][I]:       Remote job dir    = gsiftp://itppc220/~/.gw_gwadmin_3/
 > Wed Aug 30 19:38:57 2006 [DM][I]:       Host Rank         = 3193
 > Wed Aug 30 19:38:57 2006 [DM][I]:       Submission tries  = 1
 > Wed Aug 30 19:38:57 2006 [DM][I]:       Start time        = 1156963011
 > Wed Aug 30 19:38:57 2006 [DM][I]:       Exit Time         = 1156963137
 > Wed Aug 30 19:38:57 2006 [DM][I]:       Prolog Time       = 3
 > Wed Aug 30 19:38:57 2006 [DM][I]:       Wrapper Time      = 118
 > Wed Aug 30 19:38:57 2006 [DM][I]:       Epilog Time       = 5
 > Wed Aug 30 19:38:57 2006 [DM][I]:       Migration Time    = 0
 > Wed Aug 30 19:38:57 2006 [DM][I]: ------------------------------------------
 >
 >
 > Finally, the contents of job.state are:
 >
 >
 > 1156962988 PENDING
 > 1156963011 PROLOG
 > 1156963014 WRAPPER
 > 1156963132 EPILOG_STD
 > 1156963133 EPILOG
 > 1156963137 FAILED
 >
 > Best Regards.
 >
 > Manuel.
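
 The job.state contents quoted above are enough to see where the job failed
 and how long it spent in each state. A small Python sketch, assuming the file
 is exactly as quoted (one "timestamp STATE" pair per line); the function name
 `state_durations` is made up for this example:

```python
# Contents of job.state as quoted in this thread.
JOB_STATE = """\
1156962988 PENDING
1156963011 PROLOG
1156963014 WRAPPER
1156963132 EPILOG_STD
1156963133 EPILOG
1156963137 FAILED
"""


def state_durations(text):
    """Return (state, seconds) pairs for every state except the final one."""
    entries = [line.split() for line in text.splitlines() if line.strip()]
    records = [(int(ts), state) for ts, state in entries]
    # Time in a state = timestamp of the next state minus its own timestamp.
    return [(state, nxt - ts) for (ts, state), (nxt, _) in zip(records, records[1:])]


durations = state_durations(JOB_STATE)
for state, seconds in durations:
    print(f"{state:<11}{seconds:>4} s")

# The last entry is the state the job was in just before FAILED.
print("failed in:", durations[-1][0])  # failed in: EPILOG
```

 The results agree with the history record in job.log: 3 s of prolog, 118 s of
 wrapper, and 1 s + 4 s = 5 s for EPILOG_STD and EPILOG together.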
 >
 
 -- 
 +-----------------------------------------------------------+
  Dr. Ruben Santiago Montero
  Associate Professor
  Dpto. Arquitectura de Computadores y Automatica
  Facultad de Informatica
  Universidad Complutense      phone  : +34 91 394 75 38
  28040 Madrid                 fax    : +34 91 394 75 27
  Spain                        email  : rubensm at dacya.ucm.es
  http://asds.dacya.ucm.es/
 +-----------------------------------------------------------+
 
 GridWay, The Way to Grid! http://www.gridway.org
 


