SGE HA
One nice thing about SGE is that it already has built-in support for HA. To set it up:
- Stop SGE on both FE nodes first, if it is running.
- Share the SGE_CELL directory with the whole cluster. SGE uses a file-based mechanism to detect a failed qmaster and fail over to another host. Just copy your /opt/gridengine/default to /share/apps/gridengine/default, assuming that /share/apps is placed on centralized NAS storage.
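The copy step can be sketched as a small helper. This is only an illustration: copy_cell is a hypothetical name, not part of SGE or ROCKS, and it assumes GNU cp (for the -a flag) so that ownership, permissions and timestamps of the cell directory are preserved.

```shell
# Hypothetical helper (not part of SGE): copy a cell directory to shared storage.
# Assumes GNU cp; -a preserves ownership, permissions and timestamps.
copy_cell() {
    src=$1
    dst=$2
    if [ ! -d "$src" ]; then
        echo "no such cell directory: $src" >&2
        return 1
    fi
    # "$src/." copies the directory contents, including hidden files.
    mkdir -p "$dst" && cp -a "$src/." "$dst/"
}

# Example call, using the paths from this article (run as root):
#   copy_cell /opt/gridengine/default /share/apps/gridengine/default
```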
- Now set your desired primary SGE Qmaster host.
echo "fe2.public" > /share/apps/gridengine/default/common/act_qmaster
- Then set your shadow masters. A shadow master is an SGE daemon that actively checks the Qmaster status and starts a Qmaster if the current one has failed.
echo "fe1.public" >> /share/apps/gridengine/default/common/shadow_masters
echo "fe2.public" >> /share/apps/gridengine/default/common/shadow_masters
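Because >> appends on every run, repeating the commands above leaves duplicate lines in shadow_masters. A minimal sketch of an idempotent variant follows; add_shadow_master is a hypothetical helper name, not an SGE command.

```shell
# Hypothetical helper (not part of SGE): add a host to a shadow_masters file
# only if that exact line is not already present.
add_shadow_master() {
    host=$1
    file=$2
    touch "$file" || return 1
    # -x matches whole lines only, -F treats the hostname as a fixed string,
    # so the dots in the hostname are not interpreted as regex wildcards.
    grep -qxF "$host" "$file" || echo "$host" >> "$file"
}

# Example calls, using the path from this article:
#   add_shadow_master fe1.public /share/apps/gridengine/default/common/shadow_masters
#   add_shadow_master fe2.public /share/apps/gridengine/default/common/shadow_masters
```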
- Now modify /etc/init.d/sgemaster on both FEs. You need to change two places.
- By default the sgemaster script only starts the SGE Qmaster on the Qmaster node (by default, the ROCKS frontend). However, SGE sometimes incorrectly sets the qmaster node name to the local name (fe1.local) instead of the public name (fe1.public), which makes the script fail. Modify the CheckIfQmasterHost() function to also compare against only the first part of the hostname to avoid this problem. Locate the function in the sgemaster script and modify it like this:
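CheckIfQmasterHost()
{
   host=$1
   ACT_QMASTER_HOST=`cat $SGE_ROOT/$SGE_CELL/common/act_qmaster`
   # Modified by Somsak Sriprayoonsakul
   # Accept either an exact match or a match on the first label of act_qmaster.
   # The substitution is quoted so an empty act_qmaster does not break the test.
   if [ "$host" = "$ACT_QMASTER_HOST" -o "$host" = "`echo $ACT_QMASTER_HOST | cut -f 1 -d '.'`" ]; then
      echo true
   else
      echo false
   fi
}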
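The check in CheckIfQmasterHost() accepts a host when it equals act_qmaster exactly or equals the first label of act_qmaster. As a hedged illustration of the same idea, the sketch below strips the domain from both names, so that fe1.local and fe1.public count as the same host. This is a slightly broader variant, not what the stock or modified sgemaster script does, and same_short_host is a hypothetical name.

```shell
# Hypothetical helper: compare two hostnames by their first label only.
same_short_host() {
    a=${1%%.*}   # strip everything from the first dot onward
    b=${2%%.*}
    if [ "$a" = "$b" ]; then
        echo true
    else
        echo false
    fi
}

same_short_host fe1.local fe1.public   # prints: true
same_short_host fe1 fe1.public         # prints: true
same_short_host fe2.public fe1.public  # prints: false
```

Comparing short names only is safe as long as no two hosts in the cluster share the same first label.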
- Next, near its last line, the sgemaster script writes its own hostname to the act_qmaster file. Remove or comment out that line of your sgemaster script:
# Commented out to avoid the shadow master host overwriting the qmaster configuration
#/bin/hostname --fqdn > /opt/gridengine/default/common/act_qmaster
- Now start SGE as usual on both nodes
/etc/init.d/sgemaster start
- You will notice that the Qmaster is only started on fe2.public. Run qstat on any node in the cluster to check that SGE is still working.
Testing
- You can test the fail-over by sending SIGKILL to the sge_qmaster process on the primary SGE qmaster node (fe2.public in our case). You need to send SIGKILL, or else shadowd will think that you really want to shut down the server.
killall -9 sge_qmaster
Wait for a while (about 3-5 minutes). Fail-over takes a rather long time, but in the end sge_qmaster will be started on the secondary SGE qmaster (fe1.public).
Job state should still be intact in your cluster.
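Rather than checking by hand during the wait, the act_qmaster file can be polled. The sketch below is an illustration only: wait_for_qmaster is a hypothetical helper, the timeout and interval defaults are arbitrary, and it assumes that the qmaster which takes over writes its own hostname into act_qmaster.

```shell
# Hypothetical helper: poll an act_qmaster file until it names the expected host.
# Arguments: expected host, path to act_qmaster, timeout (s), poll interval (s).
wait_for_qmaster() {
    expected=$1
    file=$2
    timeout=${3:-600}
    interval=${4:-10}
    waited=0
    current=""
    while [ "$waited" -lt "$timeout" ]; do
        current=$(cat "$file" 2>/dev/null)
        if [ "$current" = "$expected" ]; then
            echo "qmaster is now $current after about ${waited}s"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "timed out after ${timeout}s; act_qmaster is '$current'" >&2
    return 1
}

# Example call after killing sge_qmaster on fe2.public:
#   wait_for_qmaster fe1.public /share/apps/gridengine/default/common/act_qmaster
```

Once the function returns successfully, running qstat is a reasonable final check that the new qmaster answers requests.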
If you want to migrate back, run the following command on your primary FE:
/etc/init.d/sgemaster -migrate
- Note that you need to migrate SGE back manually once it has failed over.