Monday, April 30, 2007

Multiple failures at Oxford explained

Oxford ran out of disk space on its DPM SE, which caused the rm SAM test to fail. ATLAS had taken up all the available disk space on our SE. We managed to clear some space from dteam, and this allowed us to start passing the tests again. The bigger problem remains: as there is currently no quota mechanism in DPM, we cannot prevent this from happening again. We only have two (1.6TB) pools and both are assigned to all VOs. It is not possible to allocate a pool exclusively to ops, or to keep ATLAS on its own, without completely removing all data and redesigning the pools. This is a non-starter.
When more disk space is added, consideration will be given to allocating dedicated pools for some VOs.

Oxford then started failing other tests. This was caused by multiple worker nodes having a full /home or / partition, and it highlights the necessity of monitoring disk usage with Nagios.
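
As a rough illustration of the kind of check that would have caught the full partitions, here is a minimal sketch of a Nagios-style disk check in Python. It is not the check we run at Oxford: the 80% and 90% thresholds and the function names classify and check_partition are assumptions made up for this example. The only convention it relies on is the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import shutil

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios status (thresholds are assumptions)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_partition(path, warn=80.0, crit=90.0):
    """Return (status, message) for the filesystem that contains `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Missing or unreadable path: report UNKNOWN rather than guess.
        return UNKNOWN, f"DISK UNKNOWN - {path}: {exc}"
    if usage.total == 0:
        # Some pseudo-filesystems report a size of zero.
        return UNKNOWN, f"DISK UNKNOWN - {path}: reported size is zero"
    percent = 100.0 * usage.used / usage.total
    status = classify(percent, warn, crit)
    return status, f"DISK {LABELS[status]} - {path} is {percent:.1f}% full"


def main(paths):
    """Print one line per partition and return the highest status code seen."""
    worst = OK
    for path in paths:
        status, message = check_partition(path)
        print(message)
        # UNKNOWN (3) outranks CRITICAL (2) here, so an unreadable
        # partition is never silently reported as healthy.
        worst = max(worst, status)
    return worst
```

On a worker node, a small wrapper would call main(["/", "/home"]) and pass the result to sys.exit() so that Nagios sees the return code. In practice the stock check_disk plugin shipped with the Nagios plugins does the same job and is probably the better choice.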