[gt-user] questions about SEG, lost SGE jobs, and general stability
Joseph Bester
bester at mcs.anl.gov
Tue Feb 7 15:37:22 CST 2012
On Feb 6, 2012, at 3:53 PM, Brian O'Connor wrote:
> Hi Joseph,
>
> Thanks very much for your email.
>
> We actually just had a failure (reboot) of our Globus box just a
> couple hours ago. So this gets at my question below about how to
> clean up after a failure. When the machine rebooted I now see a ton of
> globus-job-managers running as my "seqware" user (the one that
> originally submitted the globus jobs).
> [seqware@sqwprod ~]$ ps aux | grep globus-job-manager | grep seqware | wc -l
> 1837
>
> So there are 1837 of these daemons running.
>
That's probably Condor-G restarting job managers automatically. :)

> I can no longer submit a cluster job using:
>
> globus-job-run sqwprod/jobmanager-sge /bin/hostname
>
> It just hangs.
>
> I think this is because there is a lock file
> ~/.globus/job/sqwprod.hpc.oicr.on.ca/sge.4572dcea.lock
>
> My questions are 1) what's the proper way to reset here, kill all the
> globus-job-managers, remove the lock, and allow the job manager to
> respawn? 2) Why doesn't globus-job-manager (or the gateway) look at
> sge.4572dcea.pid and realize the previous globus-job-manager is dead?
> Shouldn't it detect this, clean up its state, and launch a single
> replacement?
>
> Thanks for your help. I really appreciate it!
If the home filesystem is shared, perhaps the lock state got mixed up by the
reboot? I thought 5.2.0 would put the lock file in
/var/lib/globus/gram_job_state/$LOGNAME. You might have success adding
-globus-job-dir /var/lib/globus/gram_job_state
to /etc/globus/globus-gram-job-manager.conf to force that.
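
On the question of telling whether the job manager named in the .pid file is
still alive, here is a minimal shell sketch of that check. It assumes the .pid
file holds a single numeric process ID on its first line, which this thread
does not confirm. The function name pid_file_is_stale is hypothetical and is
not part of GRAM.

```shell
#!/bin/sh
# Hypothetical helper, not part of GRAM.
# Assumption: the pid file holds one numeric PID on its first line.
# Returns 0 if no such process exists (pid file is stale),
#         1 if the process exists and we may signal it,
#         2 if the file is unreadable or holds no digits.
pid_file_is_stale() {
    pid_file=$1
    [ -r "$pid_file" ] || return 2
    # Keep only the digits from the first line.
    pid=$(head -n 1 "$pid_file" | tr -cd '0-9')
    [ -n "$pid" ] || return 2
    # kill -0 sends no signal; it only tests that the process can be signalled.
    # It also fails for a live process owned by another user, so run this as
    # the user that owns the job managers.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

For example, run as the job-submitting user:
pid_file_is_stale ~/.globus/job/sqwprod.hpc.oicr.on.ca/sge.4572dcea.pid && echo stale
This only inspects state. Whether it is then safe to remove the lock file is a
separate question.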
Joe