Monday, October 22, 2007

dCache Tuning

I've been having a few issues since the start of the CMS CSA07 data challenge: SAM test failures, mostly what seem to be timeouts, against my dCache Storage Element. So I've been looking at improving my setup.

One suggestion was to set up separate queues in dCache for local access (dcap, gsidcap and xrootd) and remote access (GridFTP).

In general this is supposed to help when local farm jobs are reading slowly from lots of files and blocking the queues, preventing the short GridFTP transfers from starting. That isn't currently the case on my Storage Element, but it should also help by limiting the number of concurrent GridFTP transfers, which are very resource hungry, without limiting local access, which is not.

It was a very easy change to make, requiring only changes to the /opt/d-cache/config/dCacheSetup file, not the individual batch files (on all the servers, of course). I uncommented and set the following variables:

poolIoQueue=dcapQ,gftpQ
gsidcapIoQueue=dcapQ
dcapIoQueue=dcapQ
gsiftpIoQueue=gftpQ
remoteGsiftpIoQueue=gftpQ


The first variable sets up the two queues (the first queue listed is also the default if no queue is specified).

Then the rest of the settings specify which queue the different doors use.

Unfortunately, the queue lengths are set per pool in the pool setup file, so I had to edit a file for each pool on all the disk servers to change:

mover set max active NNNN

to:

mover set max active -queue=dcapQ 1000
mover set max active -queue=gftpQ 3
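
Rather than editing each setup file by hand, a small loop run on each disk server can make the change. This is only a sketch, and the path to the pool setup files is an assumption that will vary with your pool layout:

# run on each disk server; /pool*/pool/setup is an assumed location for the pool setup files
for f in /pool*/pool/setup; do
  sed -i 's/^mover set max active [0-9].*/mover set max active -queue=dcapQ 1000\nmover set max active -queue=gftpQ 3/' "$f"
done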

After the changes to the config files I then had to restart all the services to pick up the new configuration. I also took the opportunity to enable read-only xrootd access to the SE by adding:

XROOTD=yes

to /opt/d-cache/etc/node_config on all the nodes

and setting:

xrootdIsReadOnly=true

in the dCacheSetup file.

After the restart, the new queues showed up on the queue info pages and the xrootd doors on all the nodes appeared on the Cell Services page.

I was also able to read files out through the xrootd door using the standard BaBar tools (and was correctly blocked from writing data).
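
For the record, the read test was just an xrdcp of a file out of the dCache namespace, along the lines of the following (the door hostname and file path are placeholders, not my real ones):

xrdcp root://xrootd-door.example.ac.uk//pnfs/example.ac.uk/data/cms/testfile.root /tmp/testfile.root

Trying the copy in the other direction was refused, as expected with xrootdIsReadOnly=true.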

Wednesday, October 10, 2007

SL4 Worker Node Migration at RALPP

Since I've now finished migrating my worker nodes to SL4 I thought I should describe the method used.

The basic decision was to try to keep running an SL3 service in parallel with the initial test SL4 service and then gradually migrate nodes to the new service once it was production quality. I had already split my Torque/Maui services off onto a separate node and wanted to keep that setup with the SL4 service, but did not want to (a) duplicate the torque server or (b) create another 24 queues for all the VOs. To get round this I decided to:
  • Install a new "SL4" CE pointing to the production PBS node; this needed a different site-info.def file with the new CE named as the CE_HOST and, obviously, the GlueOperatingSystem settings set for SL4 (see the sketch after this list)
  • Create node properties on the SL3 and SL4 nodes to let the batch system route jobs based on OS
  • Hack the lcgpbs jobmanagers on the two CEs to apply requirements on the node properties as they submit the jobs
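The site-info.def for the new CE only needed to differ from the production one in a handful of variables, roughly as below. The hostname is a placeholder and the exact YAIM variable names for the operating system attributes may differ between YAIM versions, so treat this as a sketch:

CE_HOST=ce-sl4.example.ac.uk   # the new "SL4" CE
CE_OS="ScientificSL"           # feeds GlueHostOperatingSystemName
CE_OS_RELEASE="4.5"            # feeds GlueHostOperatingSystemRelease
CE_OS_VERSION="SL"             # feeds GlueHostOperatingSystemVersion
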
Running multiple CEs all pointing to the same torque server is fairly simple to do: there is a "BATCH_SERVER" setting in YAIM (3.1 and later, TORQUE_SERVER before that) that you just point at your torque/maui server, and that configures the CE to submit its jobs via that machine (see the sketch after the list below). Then there are a couple of other things you have to take care of:
  1. The gridmapdir has to be shared between all the CEs. Otherwise there is a possibility that either the same DN will be mapped to multiple pool accounts or, worse, that different DNs will be mapped to the same pool account by the different CEs.
  2. The worker nodes need to have the ssh host keys for all the CEs to be able to get the job data back, but YAIM will only set one up. The fix is to edit the NODES line in "/opt/edg/etc/edg-pbs-knownhosts.conf" to add all the CEs and your torque server.
  3. If the CEs are submitting to the same worker nodes you might also want to mount the VO tag area across all the CEs so that VOs don't have to publish the same tags to all the CEs.
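Concretely, on both CEs the batch server is just named in site-info.def, and the known-hosts configuration on the worker nodes then lists every head node; something like the following, where the hostnames are placeholders and the exact syntax of the NODES line should follow whatever your existing edg-pbs-knownhosts.conf already uses:

# site-info.def on both CEs (YAIM 3.1; TORQUE_SERVER in earlier versions)
BATCH_SERVER=torque.example.ac.uk

# /opt/edg/etc/edg-pbs-knownhosts.conf on the worker nodes
NODES = ce-sl3.example.ac.uk ce-sl4.example.ac.uk torque.example.ac.uk
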
Node properties are very easy to add: either edit the torque nodes file directly or use something like qmgr -c "set node $node properties += SL4" (see the sketch below). I also added "test" and "prod" properties to all the nodes, but more on that below.
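
For example, to tag a batch of newly installed SL4 workers (the node names here are just placeholders):

for node in wn100 wn101 wn102; do
  qmgr -c "set node $node properties += SL4"
  qmgr -c "set node $node properties += prod"
done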

Finally I needed to change the jobmanagers to direct jobs arriving at the different CEs to the different classes of workers, based on the above node properties. The lcgpbs jobmanager already writes a node requirement into the job script it submits to torque, so it is easy to extend this to add node properties as well. If you look in "/opt/globus/setup/globus/lcgpbs.in" you'll see three places where it writes "#PBS -l nodes=" to set the requirement on the number of CPUs, and you just need to add :SL4 (or :SL3) to the end of what it writes.
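
The net effect on the batch script the jobmanager generates is just the node property appended to the nodes requirement; for a single-CPU job the stock jobmanager writes:

#PBS -l nodes=1

and after the hack the SL4 CE writes:

#PBS -l nodes=1:SL4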

After doing that, installing some SL4 worker nodes was very simple; about the only necessary change to the site-info.def file was to make the "GLOBUS_TCP_PORT_RANGE" space separated rather than comma separated.
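
That is, in the SL4 site-info.def (the actual port range is whatever your site uses; 20000-25000 is just an example):

GLOBUS_TCP_PORT_RANGE="20000 25000"

instead of the SL3 form:

GLOBUS_TCP_PORT_RANGE="20000,25000"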

With the above hacks in place I was able to leave my old CE happily submitting jobs to the SL3 nodes while I tested the SL4 worker nodes, then gradually move the worker nodes over to SL4. Before moving the final worker nodes over I modified the batch system information provider to report the queues as "Draining" whatever their real status. Once all the worker nodes were migrated to SL4 I could simply remove the lcgpbs jobmanager changes, and both CEs became equivalent.
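
The "Draining" trick is just a matter of forcing the published state. One way to do it, sketched here with an illustrative plugin path and an entirely hypothetical ".real" rename of the original script, is to wrap the dynamic batch-system info provider and rewrite the GlueCEStateStatus attribute on its way out:

#!/bin/sh
# Hypothetical wrapper installed in place of the real dynamic info provider:
# pass everything through but report every queue as Draining.
/opt/lcg/libexec/lcg-info-dynamic-pbs.real "$@" | \
  sed 's/^GlueCEStateStatus: .*/GlueCEStateStatus: Draining/'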