Tuesday, December 11, 2007

Oxford Gridpp Site becomes an NGS Affiliate

Not to be outdone by ScotGrid, I should also point out that Oxford became an NGS affiliate at the same meeting (Dec 6th). See https://www.ngs.ac.uk/guide/affiliates/oxford-gridpp/

Oxford has added support for vo.southgrid.ac.uk, gridpp and supernemo.vo.eu-egee.org.

Friday, December 07, 2007

Birmingham HV Network Upgrade

A high voltage network upgrade this weekend means several systems will be off over the weekend.
We hope to keep the core service nodes up and running, but the number of worker nodes will be limited.

The ALICE VO Box was not accessible to users for a day; no problems were found by Yves.
It is now reported as OK.

SouthGrid Update

Bristol:
Had some problems with LHCb users
EDFA-JET:
Upgraded WNs to SL4
Birmingham:
A disk failed on the SE's RAID 5 disk array.
Oxford:
Upgraded the SL3 cluster to update 37. There were some problems with the SE: the DPM pool nodes had not had the latest lcg-vomscerts rpm applied, and the site-info.def file on some of the nodes had an old entry for the ops VO, which meant the grid-map file was not being created correctly.
This was changed to include:

VO_OPS_VOMS_SERVERS="'vomss://lcg-voms.cern.ch:8443/voms/ops?/ops/'
'vomss://voms.cern.ch:8443/voms/ops?/ops/'"
VO_OPS_VOMSES="'ops lcg-voms.cern.ch 15009 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch ops'
'ops voms.cern.ch 15009 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch ops'"


The addition of voms.cern.ch is the important change; lcg-voms.cern.ch alone was the old entry.
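For reference, a minimal sketch of regenerating the grid-map configuration on an affected node after correcting site-info.def (the config_mkgridmap function name and the config path are assumptions, following how we run single YAIM functions elsewhere on this blog):

# Assumed YAIM function name and site-info.def location; adjust to the local setup.
/opt/glite/yaim/scripts/run_function /root/yaim-conf/site-info.def config_mkgridmap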

RALPPD:
The BDII failed on Monday 3rd. A reboot fixed this.

So now that Oxford is up to date we can go ahead and add support for some new VOs:
SouthGrid, gridpp and supernemo.

Wednesday, December 05, 2007

Random rm failures at Oxford

Random SAM test failures for rm, and later complaints from ATLAS, were traced to one of the DPM pool nodes not having had the latest VOMS certs applied.

Monday, October 22, 2007

dCache Tuning

Since the start of the CMS CSA07 data challenge I've been having a few issues with SAM test failures, mostly timeouts, against my dCache Storage Element, so I've been looking at improving my setup.

One suggestion was to set up separate queues in dCache for local access (dcap, gsidcap and xrootd) and remote access (GridFTP).

In general this is supposed to help when local farm jobs are reading slowly from lots of files and blocking the queues, preventing the short GridFTP jobs from starting. That is not the current situation on my Storage Element, but it might also help by limiting the number of concurrent GridFTP transfers, which are very resource hungry, without limiting local access, which is not.

It was a very easy change to make, requiring only changes to the /opt/d-cache/config/dCacheSetup file, not the individual batch files (on all the servers, of course). I uncommented and set the following variables:

poolIoQueue=dcapQ,gftpQ
gsidcapIoQueue=dcapQ
dcapIoQueue=dcapQ
gsiftpIoQueue=gftpQ
remoteGsiftpIoQueue=gftpQ


The first variable sets up the two queues (the first queue is also the default one if no queue is specified).

Then the rest of the settings specify which queue the different doors use.

Unfortunately, the queue lengths are set per pool in the pool setup file so I had to edit a file for each pool on all the disk servers to change:

mover set max active NNNN

to:

mover set max active -queue=dcapQ 1000
mover set max active -queue=gftpQ 3
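As a rough sketch, something like the following could make that edit across the pools on one disk server (the pool setup paths are an assumption; keep a backup of each setup file):

# Pool setup paths are assumptions -- adjust the glob to the local pool base directories.
for setup in /pool*/pool/setup; do
    cp "$setup" "$setup.bak"    # back up before editing
    sed -i -e 's/^mover set max active [0-9].*/mover set max active -queue=dcapQ 1000\nmover set max active -queue=gftpQ 3/' "$setup"
done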

After the changes to the config files I then had to restart all the services to pick up the new configuration. I also took the opportunity to enable read-only xrootd access to the SE by adding:

XROOTD=yes

to /opt/d-cache/etc/node_config on all the nodes

and setting:

xrootdIsReadOnly=true

in the dCacheSetup file.

After the restart the new queues showed up in the queue info pages and the xrootd doors on all the nodes showed up on the Cell Services page.

I was also able to read files out through the xrootd door using the standard BaBar xrootd tools (and was correctly blocked from writing data).
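For illustration, a read test might look something like this (door host, port and file path are made up, not my actual SE):

# Hypothetical xrootd door and file -- substitute a real door host and a file you can read.
xrdcp root://dcache.example.ac.uk:1094//pnfs/example.ac.uk/data/dteam/testfile /tmp/testfile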

Wednesday, October 10, 2007

SL4 Worker Node Migration at RALPP

Since I've now finished migrating my worker nodes to SL4 I thought I should describe the method used.

The basic decision was to keep running an SL3 service in parallel with the initial test SL4 service and then gradually migrate nodes to the new service once it was production quality. I had already split my Torque/Maui services off onto a separate node and wanted to keep that setup with the SL4 service, but did not want to (a) duplicate the Torque server or (b) create another 24 queues for all the VOs. To get round this I decided to:
  • Install a new "SL4" CE pointing to the production PBS node; this needed a different site-info.def file with the new host named as the CE_HOST and, obviously, the GlueOperatingSystem settings set for SL4
  • Create node properties on the SL3 and SL4 nodes to let the batch system route jobs based on OS
  • Hack the lcgpbs jobmanagers on the two CEs to apply requirements on the node properties as they submit jobs
Running multiple CEs all pointing to the same Torque server is fairly simple to do: there is a "BATCH_SERVER" setting in YAIM (3.1 and later; TORQUE_SERVER before that) that you just point at your Torque/Maui server, and that configures the CE to submit its jobs via that machine. Then there are a couple of other things you have to take care of:
  1. The gridmapdir has to be shared between all the CEs. Otherwise there is a possibility that either the same DN will be mapped to multiple pool accounts or, worse, that different DNs will be mapped to the same pool account by the different CEs.
  2. The worker nodes need to have the ssh host keys for all the CEs to be able to get the job data back, but YAIM will only set one up. The fix is to edit the NODES line in "/opt/edg/etc/edg-pbs-knownhosts.conf" to add all the CEs and your Torque server (see the sketch after this list).
  3. If the CEs are submitting to the same worker nodes you might also want to mount the VO tag area across all the CEs so that VOs don't have to publish the same tags to all of them.
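A minimal sketch of that NODES line, with hypothetical hostnames (check the exact syntax against your installed edg-pbs-knownhosts.conf):

# /opt/edg/etc/edg-pbs-knownhosts.conf -- hostnames below are placeholders.
NODES = ce01.example.ac.uk ce02.example.ac.uk torque01.example.ac.uk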
Node properties are very easy: either just edit the Torque nodes file to add them or use qmgr, e.g. qmgr -c "set node $node properties += SL4". I also added "test" and "prod" properties to all the nodes, but more on that below.
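For example, a quick loop over the new workers might look like this (node names are made up):

# Hypothetical node names -- tag each SL4 worker so jobs can be routed to it.
for node in t2wn20 t2wn21 t2wn22; do
    qmgr -c "set node $node properties += SL4"
done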

Finally I needed to change the jobmanager to require the properties, directing jobs from the different CEs to the different classes of workers based on the properties above. The lcgpbs jobmanager already writes a node requirement into the job script it submits to Torque, so it is easy to extend this to add node properties as well. If you look in "/opt/globus/setup/globus/lcgpbs.in" you'll see three places where it writes "#PBS -l nodes=" to set the requirement on the number of CPUs; you need to add :SL4 (or :SL3) to the end of the write.
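The net effect on the generated Torque submit script is a node requirement along these lines (illustrative only; the node count comes from the job's CPU request):

#PBS -l nodes=1:SL4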

After doing that, installing some SL4 worker nodes was very simple; about the only necessary change to the site-info.def file was to change "GLOBUS_TCP_PORT_RANGE" to be space separated rather than comma separated.
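For example (the actual port range is site specific; 20000-25000 is just the common default):

# Old comma-separated form used on SL3:
#GLOBUS_TCP_PORT_RANGE="20000,25000"
# Space-separated form needed for the SL4 worker nodes:
GLOBUS_TCP_PORT_RANGE="20000 25000"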

With the above hacks in place I was able to leave my old CE happily submitting jobs to the SL3 nodes while I was testing the SL4 worker nodes, then gradually move the worker nodes over to SL4. Before moving the final worker nodes over I modified the batch system information provider to report the queues as "Draining" whatever their real status. Once all the worker nodes were migrated to SL4 I could just remove the lcgpbs jobmanager changes and both CEs became equivalent.

Monday, September 24, 2007

Oxford's Tier 2 Upgrade is joining the grid.

The 22 new worker nodes are starting to come online now.
They are running SL4 in 32-bit mode for now and will provide an additional 431 KSI2K (SpecInt2000).

A second CE, t2ce03.physics.ox.ac.uk, has been set up to serve the SL4 WNs. We had some trouble with the BDII being on the original CE, so we have split that function off onto a new node (well, actually a VM).

The upgrade also includes 4 head nodes with dual PSUs and mirrored system disks, which can be used for service functions or as worker nodes. All the head nodes and disk servers are protected by UPS.

The 11 storage servers (9TB usable each) will be brought on line over the next week.

The two new (Viglen supplied) racks are on the right hand side, with the older Dell kit on the left.

Friday, September 14, 2007

Oxford Local Computer Room Goes Live

The local computer room was completed last Friday. All power is ready under each of the 21 rack positions. Each rack position has 4 CAT6 cables connected to the networking rack, which can be seen. Other things completed were: ceiling lights, painting, the smoke detection system, and door fitting.

On Monday 10th two existing compute racks were installed and two empty racks for the cluster upgrade arrived. A rack full of worker nodes for the existing grid cluster can be seen and is up and running.

Today the servers arrived from Viglen and installation has commenced.



Tuesday, August 21, 2007

Oxford Computer Room Update

Progress on wiring for Power and Networking is scheduled to be completed this week.

The Oxford Grid Cluster upgrade has been ordered, and should be delivered in early September, to be installed here.

Wednesday, July 18, 2007

SL4 progress

Cambridge has converted its DPM server to 64-bit SL4 and plans to start migrating WNs next week.

Birmingham
Very easy to deploy: 32-bit SL4, using yum and YAIM. The existing second CE is used to direct jobs to the SL4 WNs. Tests from OPS have passed. The Babar farm has been switched off due to air conditioning problems.


RALPPD
The dCache servers are running SL4.
30 WN CPUs are now running 32-bit SL4 and there is a new CE to direct jobs to these. This will be advertised from next Monday (23rd July).

A SouthGrid shared calendar has been set up in Google to help coordinate holidays and meetings.

Tuesday, July 17, 2007

Oxford site CE swamped by Biomed jobs


At the end of last week the Oxford CE was swamped by hundreds of biomed jobs. The queue was disabled and the CE rebooted, but manual killing and tidying up was required before the CE stabilised.

Oxford DWB Computer room update

The floor is complete




External power boards are ready, and live.


Walls have been painted, the smoke detection system has been installed (red pipes), and the ceiling is being installed this week.

Under floor electrical wiring and network cabling should start tomorrow.

Friday, June 22, 2007

Oxford local Computer room update


Work is progressing on the new local room, which is just as well, as there are delays on the Begbroke room, which will not be ready until late summer/early autumn.
The floor has been sealed with vinyl.

Electrical switching has been connected up.



And the false floor is being installed.

Southgrid Update

Bristol.
Plans under way to make use of the new HPC cluster. Meetings started to work out a strategy and solve technical problems.

Cambridge
The DPM upgrade was a nightmare, but with help from Grieg and Yves, Santanu has now got the SE upgraded to DPM 1.6.4.

Birmingham
Problems publishing APEL data are under investigation

Oxford
Support for ngs.ac.uk has been enabled, and tests by Steven Young from NGS at Oxford are starting. Pete attended the NGS User Forum and training event held in the OERC building in Oxford.

Tuesday, June 12, 2007

Rapid progress on Oxford's local computer room


This was the space allocated on level 1 just after the old offices had been cleared out on April 11th.

Since then the walls have been dry-lined, and the AC units and pipework are in place.


Heavy electrical work is ongoing and the floor is being prepared.

Also the fourth wall has been built.

We are hopeful that the room will be complete by the end of July.
The floor will be sealed this week prior to the false floor being installed. Electrical cabling will then commence.

Friday, May 25, 2007

Nagios Monitoring

Nagios is being set up at Oxford. So far all nodes are tested using ssh to check that they are up and running.
NRPE is being installed to allow checks on disk space to be carried out (see the sketch below).
Further instructions can be found in the talk by Chris Brew at HEPSYSMAN
http://hepwww.rl.ac.uk/sysman/may2007/agenda.html
or at the System management wiki
http://www.sysadmin.hep.ac.uk/wiki/Nagios
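As a minimal sketch, an NRPE disk check on each node could be defined like this (the plugin path and thresholds are assumptions; adjust to the local install):

# In nrpe.cfg on the monitored node -- warn at 10% free, critical at 5% free on /.
command[check_disk_root]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p /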

SouthGrid Dashboard

A SouthGrid dashboard has been set up a la ScotGrid and NorthGrid.
See http://www.gridpp.ac.uk/wiki/Southgrid-Dashboard

Monday, April 30, 2007

Multiple failures at Oxford explained

Oxford ran out of disk space on its DPM SE, which caused the rm SAM test to fail. This was due to ATLAS taking up all the available disk space on our SE. We managed to clear some space from dteam and this allowed us to start passing the tests again. The bigger problem remains: as there is currently no quota mechanism in DPM, we cannot prevent this happening again. We only have two (1.6TB) pools and both are assigned to all VOs. It is not possible to allocate a pool exclusively to ops, or to keep ATLAS on their own, without completely removing all data and redesigning the pools, which is a non-starter.
When more disk space is added consideration will be given to allocate dedicated pools for some VOs.

Oxford then started failing other tests; this was caused by multiple worker nodes having either a full /home or / partition. This highlights the necessity of monitoring disk usage with Nagios.

Tuesday, March 27, 2007

Oxford tries out MonAMI

During the gridpp collaboration meeting I was persuaded to give MonAMI a go.
Installing the rpm from the sourceforge web site was easy enough.
http://monami.sourceforge.net/
Also see the link from the gridpp wiki http://www.gridpp.ac.uk/wiki/MonAMI

As I already use Ganglia, the idea was that I'd run some checks on disk space and DPM and send the output to Ganglia. The first thing we noticed was that in order for some of the features to work you need to be running at least v3 of Ganglia; I was still running v2.5, so a quick upgrade of the gmond rpms and a new gmond.conf were required.
You also require MySQL (for the DPM plugin; more on that later).

The main configuration file is /etc/monami.conf, but this can read further files in /etc/monami.d, so we set about making a basic file to monitor the root file system.

[filesystem]
name=root-fs
location=/

[sample]
interval=1m
read = root-fs.blocks.free
write = ganglia

[snapshot]
name=simple-snapshot
filename=/tmp/monami-simple-snapshot

[ganglia]
multicast_ip_address = 239.2.11.95
multicast_port = 8656
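After editing the configuration, restarting the daemon should pick it up (assuming the RPM installed the usual init script):

# Re-read /etc/monami.conf and /etc/monami.d/ by restarting the MonAMI daemon.
service monami restart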


more coming soon....

Thursday, March 08, 2007

CAMONT jobs successfully running at Oxford

The CAMONT VO has now been working correctly at Oxford since Friday 2nd March.
Karl Harrison of Cambridge has been running jobs from Cambridge.

In another Cambridge collaboration, LHCb software has been installed on a Windows Server 2003 test node at Oxford by Ying Ying Li from Cambridge. They are testing the use of Windows for LHCb analysis code and, having tested it at Cambridge, were looking to prove it could work at other sites. Ideally they would like some more test nodes and 0.5 TB of disk space. This may be harder to find.

Cambridge ran the Atlas DPM ACL fix on Monday 5th when I (PDG) visited Santanu. Now all SouthGrid sites have run the required fix.

I took the opportunity to measure the power consumption of the new Dell 1950s (Intel 5150 CPUs). Idle power consumption is about 200W, rising to 285W under load (4 CPU-intensive jobs).

Thursday, March 01, 2007

Oxford and the ATLAS DPM ACL fix.

I tried to run the ATLAS patch program yesterday to fix the ACLs on the DPM server at Oxford.
This update has been provided as a binary from ATLAS that has to be run as root on the SE. This is potentially dangerous and many sites had delayed running it, objecting to the fact that we don't really know what it is doing. Anyway, the pragmatic approach seemed to be that most other sites had run it by now, so I would too.
The configuration file has to be edited to match the local site's configuration.
I performed a normal file backup using the HFS software, Tivoli Storage Manager:
dsmc incr
Then I dumped the MySQL database:
mysqldump --user=root --password=****** --opt --all-databases | gzip -c > mysql-dump-280207.sql.gz
As our main DPM server was currently set read-only (to cope with the DPM bug of not sharing across pools properly), we decided to set it back to read/write for the update:
dpm-modifyfs --server t2se01.physics.ox.ac.uk --fs /storage --st 0
Then I ran the update program (referred to as a script in some docs):
./UpdateACLForMySQL
Unfortunately I had used the wrong password in the config file, so it failed;
this is where a strange feature of the update program was discovered.
After it runs it removes several entries from the config file (the password and the gid entry), so after several attempts the correct config file was used and the update appears to have been successful.
dpns-getacl /dpm/physics.ox.ac.uk/home/atlas/dq2

shows the ACLs:
# file: /dpm/physics.ox.ac.uk/home/atlas/dq2
# owner: atlas002
# group: atlas
user::rwx
group::rwx #effective:rwx
group:atlas/Role=production:rwx #effective:rwx
mask::rwx
other::r-x
default:user::rwx
default:group::rwx
default:group:atlas/Role=production:rwx
default:mask::rwx
default:other::r-x


I reset the main DPM server back to read only:
dpm-modifyfs --server t2se01.physics.ox.ac.uk --fs /storage --st RDONLY

The process was not simple or clear and I hope not to have to do more for other VOs...



Birmingham suffering from multiple hardware failures

The Babar cluster at Birmingham, which is made up of older kit salvaged from QMUL and Bristol plus the original Birmingham cluster, is suffering from hardware problems.
Seven worker node disks have died, some systems have kernel panics, and the Globus MDS service is playing up. Yves is working hard to fix things, but maybe we are just getting to the end of the useful life of much of this kit?

Tuesday, February 27, 2007

glite UI update fix

The last two gLite updates have a missing dependency on the UI; in particular, the glite-ui-config rpm requires python-fpconst.
You can get this rpm from CERN; see:

http://linuxsoft.cern.ch/repository//python-fpconst.html

Use
wget http://linuxsoft.cern.ch/cern/SLC30X/i386/SL/RPMS/python-fpconst-0.6.0-3.noarch.rpm

to add this to your local repository.
Then yum -y update will work once again.

Monday, February 26, 2007

Another Workernode Hard drive failure at Oxford

The hard drive in t2wn37 failed over the weekend. Dell will replace it.
This follows on from t2wn04 last week, and the PSU in t2lfc01 a few weeks before that. t2lfc01 was one of the GridPP-supplied nodes from Streamline; replacement of the PSU took several weeks.

Tuesday, February 20, 2007

Fusion jobs successfully running at Oxford

FUSION jobs have now run successfully at Oxford.

Birmingham Network reconfiguration

Yves reports:
Our site was down from yesterday morning 10am until today 10am due to a network problem, which IS have linked to a faulty link with a campus switch.
....
IS have temporarily disabled the physics link to the library switch, one of our two links to the network, and this has fixed the connectivity problem from the outside world to our grid box.

They will re-instate the link (for resilience) when they've got to the bottom of the problem (faulty fibre, or whatever).

So, it'll be interesting to see the gridmon results in the current configuration while waiting for IS to understand the problem.

This may be the cause of the 33% UDP packet loss we have been seeing to/from Birmingham.

FUSION VO Problems at Oxford

At Oxford we had reports from Fusion of problems:
"We have checked that FUSION jobs fail at your site with the error "37 the provided RSL 'queue' parameter is invalid". This is because "fusion" is missing at the end of the file /opt/globus/share/globus_gram_job_manager/lcgpbs.rvf in your CE ("fusion" should be included in the list of Values of the attribute "queue"). We also noticed that the FUSION VOMS server certificate ([1]) is not installed at /etc/grid-security/vomsdir/ in your CE."

I downloaded the cert from:
http://swevo.ific.uv.es/vo/files/swevo.ific.uv.es-oct2006.pem

and ran
/opt/glite/yaim/scripts/run_function /root/yaim-conf/site-info.def config_globus
which made the 4 VOs I recently added appear in the files lcgpbs.rvf and pbs.rvf in
/opt/globus/share/globus_gram_job_manager/.
I can only assume that we had errors when we ran YAIM the first time, as the 4 new
VOs had not appeared then.
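For completeness, a minimal sketch of installing the downloaded certificate into the vomsdir on the CE (the target filename is an assumption):

# Fetch the FUSION VOMS server certificate and place it where the CE looks for it.
wget http://swevo.ific.uv.es/vo/files/swevo.ific.uv.es-oct2006.pem \
     -O /etc/grid-security/vomsdir/swevo.ific.uv.es.pem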

Monday, February 19, 2007

Problems with DNS style VO names

We have now discovered that adding the camont VO is not straightforward due to the new DNS-style VO name.
The current YAIM cannot handle the long format for VO names. The new YAIM 3.1, which is not yet released, should help but has not yet been tested.
Yves has had a look at it and it is very different from the current version.
"Hello all,

I got hold of the new version of yaim and there are some non-trivial differences with the production version. I think it would be ill advised to try the new version in production. I think we could enable this new vo style by configuring gip by hand and then perform the correct queue to group mapping for pbs/condor. But instead of all sites doing this (plus potential RB complications?), couldn't we revert to the current vo style (if running jobs is urgent), I do not understand while the new vo style should be implemented on production sites when it is still awaiting certification and has not even been tried on the pre-production service?

Thanks,

Yves
"

Thursday, February 15, 2007

New VOs added at Oxford

Support for MINOS, FUSION, GEANT4 and CAMONT was added yesterday at Oxford.

The new CA rpms were also installed so now we should be green again.

Monday, February 12, 2007

Latest glite update problem on UI

I got the error below on my UI when I tried to update to the latest rpms.
This has already been reported as a GGUS ticket:
https://gus.fzk.de/ws/ticket_info.php?ticket=18358

gronbech@ppslgen:/var/local> ssh root@t2ui02 'yum -y update;pakiti'
Gathering header information file(s) from server(s)
Server: Oxford LCG Extras
Server: gLite packages
Server: gLite updated packages
Server: gLite updated packages
Server: LCG CA packages
Server: SL 3 errata
Server: SL 3 main
Finding updated packages
Downloading needed headers
Resolving dependencies
.....Unable to satisfy dependencies
Package SOAPpy needs python-fpconst >= 0.6.0, this is not available.

Steve Lloyd's ATLAS Test jobs

Work was carried out at Oxford to find out why the ATLAS test jobs were not working.
It turned out there were some old references to pool accounts of the format atlas0100 and upwards, which should have been atlas100 upwards. Once all references to these were removed the jobs started working. The problem affected both the CE and the DPM server.
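As a rough sketch, the stale mappings could be hunted down with something like this (the paths searched are guesses; the usual suspects are the account definitions and the YAIM configuration):

# Look for old-style atlas0100-type account names in likely places.
grep -r "atlas0[0-9][0-9][0-9]" /etc/passwd /etc/group /root/yaim-conf 2>/dev/null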

PDG requested that 12.0.5 be installed at Oxford via the web page:
https://atlas-install.roma1.infn.it/atlas_install/protected/rai.php
but wonders if he should have been using
https://atlas-install.roma1.infn.it/atlas_install/

The installation was complete by Friday 9th Feb.
Results for Oxford were all fine until the problems over the weekend.
http://hepwww.ph.qmul.ac.uk/~lloyd/atlas/atest.php

The problems at Bristol are caused by the worker nodes having very small
home disk partitions. The ATLAS software cannot be loaded as there is insufficient space to expand the tar file.

Oxford instabilities over the weekend

The Oxford site had trouble over the weekend due to the system disk on the CE getting full.
This was mainly due to a large number of old log files; these have been migrated off to part of the software directory for now.
The DPM server was also in a bad state and services had to be restarted.

Meanwhile PDG is in the process of adding support for some new VOs, namely
MINOS, FUSION, GEANT4 and CAMONT,

while also being on TPM duty this week.

RALPPD to get another upgrade

Chris Brew announced on Friday 9th Feb:
RALPPD have been awarded another chunk of money, to be spent by March 31st 2007.
This will allow them to purchase one rack of CPUs and one rack of disks.
The CPUs will be equivalent to 275 KSI2K, bringing the total to about 600 KSI2K, and the new disks will add 78TB, bringing the total to 158TB.
This total includes the 50TB currently on loan to the T1, which will be returned shortly.
The hardware will be identical to the recent T1 purchase.

Cambridge New Systems Arrive

Santanu announced on 19.1.07:

Just to let you know that all the new machines have arrived; we are just waiting for the rack to be delivered and the Dell engineer (that's actually part of the contract) to come and switch it on.

When done, it's gonna give LCG/gLite another 128 CPUs, and if our experiment with CamGrid and Condor succeeds, it will top up another ~500 CPUs. Now we can mount /experiment-software and the LCG middleware area onto any CamGrid machine with any root permissions, and the WN outbound connection is also sorted out. Now we need to think about the stupid "WN pool account"

Intel calls it "Woodcrest". All the nodes are dual-core, dual-CPU, so 4 CPUs under the same roof.
Dell Model : PE1950
Processor : Xeon 5150 (2.66GHz, 4MB cache, 1333MHz FSB)
Memory : 8 x 1GB dual-rank DIMMs