Friday, December 19, 2008

Automount problems on torque server

We've been having a few problems with our torque server randomly failing to automount disks.

Most of the time the mounts succeed, but occasionally they fail with just:

Dec 19 08:05:06 heplnx201 kernel: RPC: error 5 connecting to server nfsserver
Dec 19 08:05:06 heplnx201 automount[23438]: >> mount: nfsserver:/opt/ppd/mount: can't read superblock
Dec 19 08:05:06 heplnx201 automount[23438]: mount(nfs): nfs: mount failure nfsserver:/opt/ppd/mount on /net/mount
Dec 19 08:05:06 heplnx201 automount[23438]: failed to mount /net/mount
Dec 19 08:05:07 heplnx201 kernel: RPC: Can't bind to reserved port (98).
Dec 19 08:05:07 heplnx201 kernel: RPC: can't bind to reserved port.

With the wonders of Google I was able to find out that error 98 is "address in use" (EADDRINUSE), and that what is going on is that the client is unable to find a free port in its port range to initiate the connection to the server.

The culprit seems to be torque, which, when I checked with netstat -a, was using every single port from 600 to 1023, quite neatly overlapping the nfs client port range of 600-1023.
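
For anyone wanting to repeat the check, here is a rough sketch of how the netstat output could be boiled down to a count of distinct local ports in use inside that range. The function name count_reserved_ports is made up for illustration, and it assumes Linux-style netstat -an columns with the local address in the fourth field.

```shell
# Count distinct local ports within a range, reading netstat -an style
# output on stdin. Hypothetical helper; assumes field 4 is "address:port".
count_reserved_ports() {
    lo=${1:-600}
    hi=${2:-1023}
    awk -v lo="$lo" -v hi="$hi" '
        $1 ~ /^(tcp|udp)/ {
            # The port is whatever follows the last colon of the local address.
            n = split($4, parts, ":")
            port = parts[n] + 0
            if (port >= lo && port <= hi) seen[port] = 1
        }
        END { c = 0; for (p in seen) c++; print c }
    '
}

# Usage (not run here): netstat -an | count_reserved_ports 600 1023
```

A result close to the size of the range (424 ports for 600-1023) would suggest the range is effectively exhausted.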

Here Google failed me, and I was unable to find any way to limit the port range used by torque.

So for now I've taken the quick option of extending the nfs client port range down to port 300 with:

echo 300 > /proc/sys/sunrpc/min_resvport
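
A slightly more careful version of the same change might look like the sketch below, which records the old value before overwriting it. The function name set_min_resvport is hypothetical, and the file path is a parameter only so that it can be tried against a scratch file without root.

```shell
# Set the sunrpc minimum reserved port and report the old and new values.
# Hypothetical helper; the second argument defaults to the real proc file.
set_min_resvport() {
    new=$1
    file=${2:-/proc/sys/sunrpc/min_resvport}
    old=$(cat "$file") || return 1
    echo "$new" > "$file" || return 1
    echo "min_resvport: $old -> $(cat "$file")"
}

# Usage (as root, not run here): set_min_resvport 300
```

Writing to /proc does not survive a reboot. To make it stick, the equivalent line in /etc/sysctl.conf should be sunrpc.min_resvport = 300, assuming the sunrpc module is already loaded when sysctl runs.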

I think I'd like to move the nfs client port range out of the privileged port range altogether. I think this should be possible: the RFC says that it SHOULD use a port below 1023 but MAY use a higher port. I'd like to test it a bit before I configure a major server like that, though.
