Wednesday, October 10, 2007

SL4 Worker Node Migration at RALPP

Since I've now finished migrating my worker nodes to SL4, I thought I should describe the method used.

The basic decision was to try to keep running an SL3 service in parallel with the initial test SL4 service, and then gradually migrate nodes to the new service once it was production quality. I had already split my Torque/Maui services off onto a separate node and wanted to keep that setup with the SL4 service, but did not want to (a) duplicate the torque server or (b) create another 24 queues for all the VOs. To get round this I decided to:
  • Install a new "SL4" CE pointing to the production PBS node. This needed a different site-info.def file, with the new machine named as the CE_HOST and the GlueOperatingSystem settings set for SL4, obviously (see the sketch after this list)
  • Create node properties on the SL3 and SL4 nodes to let the batch system route jobs based on OS
  • Hack the lcgpbs jobmanagers on the two CEs to apply requirements on those node properties as they submit jobs
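
For the SL4 CE the interesting site-info.def entries looked something like the sketch below. The hostnames are made up, and the exact YAIM variables that feed the GlueOperatingSystem attributes (CE_OS and friends) have shifted between YAIM versions, so treat this as illustrative rather than copy-paste:

    # site-info.def for the new SL4 CE (hypothetical hostnames)
    CE_HOST=ce-sl4.example.ac.uk        # the new CE, not the old SL3 one
    BATCH_SERVER=torque.example.ac.uk   # both CEs point at the same torque/maui node

    # these feed the GlueHostOperatingSystem* attributes the CE publishes
    CE_OS="ScientificSL"
    CE_OS_RELEASE="4.5"
    CE_OS_VERSION="SL"
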
Running multiple CEs all pointing to the same torque server is fairly simple to do: there is a "BATCH_SERVER" setting in YAIM (3.1 and later; "TORQUE_SERVER" before that) that you just point at your torque/maui server, and that configures the CE to submit its jobs via that machine. Then there are a couple of other things you have to take care of:
  1. The gridmapdir has to be shared between all the CEs. Otherwise there is a possibility that either the same DN will be mapped to multiple pool accounts or, worse, that different DNs will be mapped to the same pool account by the different CEs.
  2. The worker nodes need the ssh host keys of all the CEs in order to copy the job data back, but YAIM will only set up one. The fix is to edit the NODES line in "/opt/edg/etc/edg-pbs-knownhosts.conf" to add all the CEs and your torque server (see the example after this list)
  3. If the CEs are submitting to the same worker nodes you might also want to mount the VO tag area across all the CEs, so that VOs don't have to publish the same tags to all the CEs
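
For point 2 the NODES line just lists every host whose ssh key the workers should trust; the edg-pbs-knownhosts cron job then keyscans them into the system-wide known_hosts file. Something like this, with hypothetical hostnames:

    # /opt/edg/etc/edg-pbs-knownhosts.conf
    NODES="ce-sl3.example.ac.uk ce-sl4.example.ac.uk torque.example.ac.uk"
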
Node properties are very easy to set: either just edit the torque nodes file to add them, or use qmgr -c "set node $node properties += SL4". I also added "test" and "prod" properties to all the nodes, but more on that below.
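
Either route ends up in the same place: in the torque nodes file the properties simply sit after the np setting. A sketch with made-up node names (the server_priv path varies by install):

    # e.g. /var/spool/pbs/server_priv/nodes
    wn001.example.ac.uk np=2 SL3 prod
    wn050.example.ac.uk np=2 SL4 test

    # or equivalently, at runtime:
    qmgr -c "set node wn050.example.ac.uk properties += SL4"
    qmgr -c "set node wn050.example.ac.uk properties += test"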

Finally I needed to change the jobmanager to require those properties, so that jobs arriving at the two CEs are directed to the matching class of workers. The lcgpbs jobmanager already writes a node requirement into the job script it submits to torque, so it is easy to extend this to add node properties as well. If you look in "/opt/globus/setup/globus/lcgpbs.in" you'll see three places where it writes "#PBS -l nodes=" to set the requirement on the number of CPUs; you need to add :SL4 (or :SL3) to the end of each of those writes.
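
The net effect on the job script is that the nodes request picks up a property constraint, which torque will only satisfy on workers carrying that property. Illustratively (the CPU count is whatever the jobmanager was already writing):

    # before: any worker will do
    #PBS -l nodes=1
    # after, on the SL4 CE: only workers with the SL4 property match
    #PBS -l nodes=1:SL4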

After doing that, installing some SL4 worker nodes was very simple; about the only necessary change to the site-info.def file was to make "GLOBUS_TCP_PORT_RANGE" space separated rather than comma separated.
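
In other words, where the SL3 site-info.def had a comma the SL4 one wanted a space (the port numbers here are only an example):

    # SL3 worker nodes
    GLOBUS_TCP_PORT_RANGE="20000,25000"
    # SL4 worker nodes
    GLOBUS_TCP_PORT_RANGE="20000 25000"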

With the above hacks in place I was able to leave my old CE happily submitting jobs to the SL3 nodes while I was testing the SL4 worker nodes, then gradually move the worker nodes over to SL4. Before moving the final worker nodes over I modified the batch system information provider to report the queues as "Draining" whatever their real status. Once all the worker nodes were migrated to SL4 I could just remove the lcgpbs jobmanager changes and the two CEs became equivalent.
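
I won't reproduce the exact information provider hack, but the idea is just to override the GlueCEStateStatus attribute the dynamic plugin publishes. A minimal sketch, wrapping the plugin (the path and arguments are assumptions, not the real interface):

    #!/bin/sh
    # hypothetical wrapper around the dynamic PBS info provider:
    # pass everything through but force the queue status to Draining
    /opt/lcg/libexec/lcg-info-dynamic-pbs "$@" \
      | sed 's/^GlueCEStateStatus:.*/GlueCEStateStatus: Draining/'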
