Monday, October 22, 2007

dCache Tuning

Since the start of the CMS CSA07 data challenge I've been seeing SAM test failures, mostly what look like timeouts against my dCache Storage Element, so I've been looking at improving my setup.

One suggestion was to set up separate queues in dCache for local access (dcap, gsidcap and xrootd) and remote access (GridFTP).

In general this is supposed to help when local farm jobs read slowly from lots of files and fill up the queue, preventing the short GridFTP jobs from starting. That is not currently the case on my Storage Element, but separate queues might still help by letting me limit the number of concurrent GridFTP transfers, which are very resource hungry, without limiting local access, which is not.
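To make the blocking effect concrete, here is a toy model in Python. It is only a sketch of a first-come, first-served queue with a fixed number of mover slots, not dCache's actual scheduler, and the job durations and slot counts are made-up numbers.

```python
import heapq

def fifo_start_times(durations, slots):
    """Start time of each job when jobs queued at t=0 are served in order by `slots` movers."""
    free_at = [0.0] * slots          # time at which each mover slot becomes free
    heapq.heapify(free_at)
    starts = []
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest free slot takes the next job
        starts.append(start)
        heapq.heappush(free_at, start + duration)
    return starts

# Hypothetical workload: four slow local reads (1 hour each), then one short GridFTP transfer.
slow_reads = [3600.0] * 4
gridftp = [60.0]

# One shared queue with 2 slots: the GridFTP job waits behind the slow reads.
shared = fifo_start_times(slow_reads + gridftp, slots=2)

# Separate queues: the GridFTP job has its own slot and starts at once.
separate = fifo_start_times(gridftp, slots=1)

print("GridFTP start, shared queue:   ", shared[-1])
print("GridFTP start, separate queues:", separate[0])
```

In this model the GridFTP transfer waits two hours behind the slow reads in the shared queue, but starts immediately when it has its own queue.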

It was a very easy change to make, requiring only changes to the /opt/d-cache/config/dCacheSetup file and not the individual batch files (though on all the servers, of course). I uncommented and set the following variables:

poolIoQueue=dcapQ,gftpQ
gsidcapIoQueue=dcapQ
dcapIoQueue=dcapQ
gsiftpIoQueue=gftpQ
remoteGsiftpIoQueue=gftpQ


The first variable sets up the two queues (the first queue is also the default one if no queue is specified).

The rest of the settings specify which queue each door uses.
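Since the same file has to be edited on every server, a quick sanity check is handy. The Python sketch below assumes the settings are plain key=value lines with `#` comments. The helper names `parse_setup` and `check_queues` are my own, not part of dCache. It checks that every door's queue is actually declared in poolIoQueue.

```python
def parse_setup(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

def check_queues(settings):
    """Return the declared queues and any door variable naming an undeclared queue."""
    queues = [q.strip() for q in settings.get("poolIoQueue", "").split(",") if q.strip()]
    bad = {}
    for key, value in settings.items():
        if key.endswith("IoQueue") and key != "poolIoQueue" and value not in queues:
            bad[key] = value
    return queues, bad

# The settings from this post, inlined so the example is self-contained.
SETUP = """\
poolIoQueue=dcapQ,gftpQ
gsidcapIoQueue=dcapQ
dcapIoQueue=dcapQ
gsiftpIoQueue=gftpQ
remoteGsiftpIoQueue=gftpQ
"""

settings = parse_setup(SETUP)
queues, bad = check_queues(settings)
print("queues:", queues, "(default: %s)" % queues[0])
print("doors pointing at undeclared queues:", bad)
```

To check a real node, read the contents of /opt/d-cache/config/dCacheSetup into `parse_setup` instead of the inline string.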

Unfortunately, the queue lengths are set per pool in the pool setup file, so I had to edit a file for each pool on all the disk servers to change:

mover set max active NNNN

to:

mover set max active -queue=dcapQ 1000
mover set max active -queue=gftpQ 3
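With many pools, this edit is worth scripting. The Python sketch below assumes each pool setup file contains a single unqualified `mover set max active` line. The function name `split_mover_limit` and the sample value 100 are my own, for illustration only. It rewrites only the text, so reading and writing the actual pool files is left to the caller.

```python
import re

# Matches only the old, unqualified form: "mover set max active <number>".
OLD_LIMIT = re.compile(r"^\s*mover set max active\s+\d+\s*$")

def split_mover_limit(text, limits):
    """Replace the global mover limit with one 'mover set max active' line per queue."""
    out = []
    for line in text.splitlines():
        if OLD_LIMIT.match(line):
            for queue, limit in limits.items():
                out.append("mover set max active -queue=%s %d" % (queue, limit))
        else:
            out.append(line)   # every other line is passed through untouched
    return "\n".join(out) + "\n"

# Queue limits used in this post.
LIMITS = {"dcapQ": 1000, "gftpQ": 3}

# Hypothetical pool setup fragment; the original limit of 100 is a made-up value.
sample = "# other pool settings\nmover set max active 100\n"

result = split_mover_limit(sample, LIMITS)
print(result, end="")
```

Lines that already carry a `-queue=` option do not match the pattern, so running it a second time over the same file changes nothing.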

After changing the config files I had to restart all the services to pick up the new config. I also took the opportunity to enable read-only xrootd access to the SE by adding:

XROOTD=yes

to /opt/d-cache/etc/node_config on all the nodes

and setting:

xrootdIsReadOnly=true

in the dCacheSetup file.

After the restart the new queues showed up in the queue info pages and the xrootd doors on all the nodes showed up on the Cell Services page.

I was also able to read files out through the xrootd door using standard BaBar tools (and was correctly blocked from writing data).
