Saturday, November 07, 2009

A week of upgrades for the RAL Tier 2 - Part 1 -The Network

Well it has been a long week at the RAL Tier 2. We've finally had our much-postponed downtime to update our dCache installation (delayed once when one of the disk servers got a corrupt filesystem, then to avoid a CMS analysis test, and finally to avoid an Atlas analysis test). The delays did, however, mean we could also include the long-planned network upgrade in the downtime - this was probably a good thing.

So we had quite a programme of work for a five day downtime:
  1. Replace the PNFS namespace in dCache with Chimera
  2. Update dCache from 1.9.1 to the "Golden Release" 1.9.5
  3. Install a new network switch and set up a 10Gb/s link between the two halves of our farm
Indeed, heading into work on Friday with neither dCache nor the network working, I thought I would be extending the downtime into the next week, but by lunchtime things had improved and we were able to come out of the downtime on time at 5pm - although, despite a full suite of "OK" SAM tests, GridView still had us down until nearly eight o'clock.

Taking the last of the upgrades first: before last week we had the two halves of our farm in two different rooms, each half with its own Nortel 55XX network stack. Most of the storage is in the room known as Lab 8 in the R1 office building, with a 10Gb/s connection to Site Router A, whilst most of the compute nodes are in the Atlas lower machine room, A5Lower, with a 2x1Gb/s connection to Site Router A. That 2x1Gb/s connection between the storage and compute nodes was our main bottleneck - it would regularly run at over 99% capacity for days during Atlas Hammercloud tests.

The plan was to install a Nortel 5650 switch into the stack in A5Lower, then set up a direct 10Gb/s fibre link from there to Lab 8 - cutting out the 2x1Gb/s link and Router A. That sounded fairly trivial, and when I went down with Networking on Thursday afternoon to set it up I expected to be back in an hour to carry on struggling with our, at that time, broken dCache.

Due to cabling issues we had to re-order the switches in the stack, and I also had to swap out a 5510 I had borrowed from the Tier 1 and replace it with a new one. So we broke up the current stack and tried to stack the 5650 with one of the 5510s. According to everything we had read they should have seen each other, the 5650 should have downloaded an updated version of the firmware and software to the older 5510, and then they should have joined together as a single switch. But ours did not talk to each other.

Possibly the version of the software on the 5510s was too old, so we went to each switch in turn, set it up with an IP address, downloaded a new version of the firmware and software, and restarted it.

By the end of Thursday we were more-or-less back where we had started - a stack of 5510s, still without the 5650.

On Friday morning Nick found a setting on the 5650 to allow "hybrid stack mode" and suddenly everything worked.

We soon had all the correct VLANs set up and the two halves of our network were talking over the new fast link.

Tuesday, October 20, 2009

Backing up MySQL databases

Oxford have installed a simple script to back up the DPM MySQL database once a day at 6am.
The script is loosely based on Glasgow's example here.

In order to keep the number of backup files down to just seven, I've opted to use the current day of the week rather than the full date in the file name.
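The rotation scheme can be seen with a quick shell equivalent of the Perl date logic below: the weekday name can only take seven values, so each week's dump simply overwrites the file of the same name from the week before.

```shell
# date +%A prints the full weekday name, e.g. "Tuesday".
# LC_ALL=C pins the English names, matching the Perl script's own array.
# Only seven file names are ever produced, so the backups rotate weekly.
name="mysql-dump-$(LC_ALL=C date +%A).sql.gz"
echo "$name"
```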

[root@t2se01 ~]# cat /root/mysql-dump-pdg.pl
#!/usr/bin/perl
#
# Loosely based on the Glasgow script but simplified.
#
# Select the current day only as we want to have just seven unique file names which will be overwritten
# thus reducing the total backup size.


@weekDays = qw(Sunday Monday Tuesday Wednesday Thursday Friday Saturday);
($second, $minute, $hour, $dayOfMonth, $month, $yearOffset, $dayOfWeek, $dayOfYear, $daylightSavings) = localtime();
$theTime = "$weekDays[$dayOfWeek]";
#print $theTime;

$backup_dir="/var/lib/mysqldumps";
$mysql_user="root";
$mysql_pw_file="/root/mysql-pw";
$keep_days=7;


# Read mysql password
open(PW, $mysql_pw_file) || die "Failed to open password file $mysql_pw_file: $!\n";
$mysql_pw = <PW>;
chomp $mysql_pw;
close PW;

# Dump the db now
chdir $backup_dir or die "Failed to change to backup directory $backup_dir: $!\n";

system "/usr/bin/mysqldump --user=$mysql_user --password=$mysql_pw --opt --all-databases | gzip -c > mysql-dump-$theTime.sql.gz";
die "mysqldump failed with exit code $?\n" if $? != 0;

This is run by /etc/cron.d/mysql-dump:
PATH=/sbin:/bin:/usr/sbin:/usr/bin
0 6 * * * root /root/mysql-dump-pdg.pl

So far it seems to work in testing!
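For completeness, restoring is just the reverse pipeline. A sketch (the file-selection logic and the mysql invocation here are illustrative, not something we have scripted):

```shell
# Pick the most recently written dump in the backup directory used by
# the script above; $latest is empty if no dumps exist yet.
backup_dir=/var/lib/mysqldumps
latest=$(ls -t "$backup_dir"/mysql-dump-*.sql.gz 2>/dev/null | head -n 1)
# The actual restore would then be something like:
#   gunzip -c "$latest" | mysql --user=root --password=...
echo "Most recent dump: ${latest:-none found}"
```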

Monday, October 19, 2009

Oxford Grid now SL5

All but one worker node on the Oxford Grid site has been reinstalled running SL5.
Currently these are served by a single ce, t2ce05, but more ces will be added shortly to offer resilience.

Wednesday, October 14, 2009

Quarterly Report DPM script

Each quarter we need to report on disk usage at our sites.
This can be tricky, but the following script will help at DPM sites:

#!/bin/bash

DAY=$(date +%F)
echo "$DAY"
for zz in $(dpns-ls /dpm/physics.ox.ac.uk/home/); do
    dpns-du -z -s "/dpm/physics.ox.ac.uk/home/$zz" >> "Oxford-SE-Usage-$DAY"
done


You will need to modify it appropriately for your site.
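To turn the raw per-VO listing into a single figure for the quarterly report, an awk one-liner along these lines works - assuming dpns-du emits du-style lines starting with a size in bytes followed by the path (check your version's output format first; the report file name and sizes here are made up for illustration):

```shell
# Sum the first column of a usage report, assuming "<bytes> <path>" lines.
report=Oxford-SE-Usage-example   # hypothetical report file
cat > "$report" <<'EOF'
1099511627776 /dpm/physics.ox.ac.uk/home/atlas
549755813888 /dpm/physics.ox.ac.uk/home/cms
EOF
# 1099511627776 = bytes per TiB
awk '{sum += $1} END {printf "%.2f TB\n", sum/1099511627776}' "$report"
rm -f "$report"
```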
Added 20.10.09: this makes use of the dpns-du command in the gridpp-dpm toolkit, available from:
http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.storage/

Details of the other commands are on the wiki.

Thursday, March 05, 2009

120 new cores for EFDA-JET

30 new Sunfire 2200 M2 servers have been incorporated into the EFDA-JET site. Each has two dual-core Opteron 2218 processors and 8GB RAM, increasing the number of worker node cores by 120, up to 254.
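The core count is simple arithmetic - 30 servers, each with 2 sockets of 2 cores:

```shell
# 30 servers x 2 sockets x 2 cores per socket
new_cores=$((30 * 2 * 2))
total=$((134 + new_cores))   # 134 cores already in place before the upgrade
echo "$new_cores new cores, $total in total"
# -> 120 new cores, 254 in total
```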

Thursday, February 26, 2009

CMS at Oxford

Oxford was failing a CMS ce SAM test with a warning, probably due to some permissions problems in the se.
The following commands illuminated things:

This extract from /var/log/dpm/log:
02/26 10:49:30 3869,24 dpm_srv_proc_put: processing request c75ce541-b2cd-4bdc-bf8f-c86ecb0be6ed from /C=UK/O=eScience/OU=CLRC/L=RAL/CN=chris cms brew
02/26 10:49:30 3869,24 dpm_srv_proc_put: calling Cns_stat
02/26 10:49:30 3869,24 dpm_srv_proc_put: calling Cns_creatx
02/26 10:49:30 3869,24 dpm_srv_proc_put: srm://t2se01.physics.ox.ac.uk:8446/srm/managerv2?SFN=/dpm/physics.ox.ac.uk/home/cms/store/user/test/oneEvt.root: DPM_FAILED (Permission denied)
02/26 10:49:30 3869,24 dpm_srv_proc_put: returns 0, status=DPM_FAILED (Permission denied)

This shows the test file creation failing.

[root@t2se01 dpm]# dpns-ls -l /dpm/physics.ox.ac.uk/home/cms/store/
drwxrwxr-x 1 24135 1399 0 Jan 13 18:55 PhEDEx_Debug
drwxrwxr-x 2 24135 3490 0 Oct 13 12:15 PhEDEx_LoadTest07
drwxrwxr-x 0 24135 1399 0 Feb 26 12:20 brew
drwxrwxr-x 2 24135 1399 0 Jan 27 15:16 mc
drwxrwxr-x 2 24351 3422 0 Feb 06 18:57 unmerged
drwxrwxr-x 1 24352 3406 0 Jan 21 18:21 user
[root@t2se01 dpm]# dpns-listgrpmap |grep 1399
1399 cms
[root@t2se01 dpm]# dpns-listgrpmap |grep 3406
3406 cms/Role=lcgadmin
[root@t2se01 dpm]# dpns-getacl /dpm/physics.ox.ac.uk/home/cms/store/
# file: /dpm/physics.ox.ac.uk/home/cms/store/
# owner: /C=UK/O=eScience/OU=CLRC/L=RAL/CN=chris cms brew
# group: cms/Role=cmst1admin
user::rwx
group::rwx #effective:rwx
group:cms/Role=lcgadmin:rwx #effective:rwx
group:cms/Role=production:rwx #effective:rwx
mask::rwx
other::r-x
default:user::rwx
default:group::rwx
default:group:cms/Role=lcgadmin:rwx
default:group:cms/Role=production:rwx
default:mask::rwx
default:other::r-x
[root@t2se01 dpm]# dpns-getacl /dpm/physics.ox.ac.uk/home/cms/store/brew
# file: /dpm/physics.ox.ac.uk/home/cms/store/brew
# owner: /C=UK/O=eScience/OU=CLRC/L=RAL/CN=chris cms brew
# group: cms
user::rwx
group::rwx #effective:rwx
group:cms/Role=lcgadmin:rwx #effective:rwx
group:cms/Role=production:rwx #effective:rwx
mask::rwx
other::r-x
default:user::rwx
default:group::rwx
default:group:cms/Role=lcgadmin:rwx
default:group:cms/Role=production:rwx
default:mask::rwx
default:other::r-x
[root@t2se01 dpm]# dpns-ls -l /dpm/physics.ox.ac.uk/home/cms/store/
drwxrwxr-x 1 24135 1399 0 Jan 13 18:55 PhEDEx_Debug
drwxrwxr-x 2 24135 3490 0 Oct 13 12:15 PhEDEx_LoadTest07
drwxrwxr-x 0 24135 1399 0 Feb 26 12:20 brew
drwxrwxr-x 2 24135 1399 0 Jan 27 15:16 mc
drwxrwxr-x 2 24351 3422 0 Feb 06 18:57 unmerged
drwxrwxr-x 1 24352 3406 0 Jan 21 18:21 user
[root@t2se01 dpm]# dpns-ls -l /dpm/physics.ox.ac.uk/home/cms/store/user
drwxrwxr-x 1 24352 3406 0 Jan 21 18:21 test
[root@t2se01 dpm]# dpns-ls -l /dpm/physics.ox.ac.uk/home/cms/store/user/test
drwxrwxr-x 1 24352 3406 0 Jan 21 18:21 SAM-t2se01.physics.ox.ac.uk
[root@t2se01 dpm]# dpns-chgrp 1399 /dpm/physics.ox.ac.uk/home/cms/store/user
[root@t2se01 dpm]# dpns-ls -l /dpm/physics.ox.ac.uk/home/cms/store/user
drwxrwxr-x 1 24352 3406 0 Jan 21 18:21 test
[root@t2se01 dpm]# dpns-ls -l /dpm/physics.ox.ac.uk/home/cms/store/
drwxrwxr-x 1 24135 1399 0 Jan 13 18:55 PhEDEx_Debug
drwxrwxr-x 2 24135 3490 0 Oct 13 12:15 PhEDEx_LoadTest07
drwxrwxr-x 1 24135 1399 0 Feb 26 12:33 brew
drwxrwxr-x 2 24135 1399 0 Jan 27 15:16 mc
drwxrwxr-x 2 24351 3422 0 Feb 06 18:57 unmerged
drwxrwxr-x 1 24352 1399 0 Jan 21 18:21 user
[root@t2se01 dpm]# dpns-chgrp 1399 /dpm/physics.ox.ac.uk/home/cms/store/user/test
[root@t2se01 dpm]# dpns-ls -l /dpm/physics.ox.ac.uk/home/cms/store/brew
-rw-rw-r-- 1 24135 1399 4788418 Feb 26 12:34 oneEvt.root
[root@t2se01 dpm]# dpns-ls -l /dpm/physics.ox.ac.uk/home/cms/store/user/test
drwxrwxr-x 1 24352 3406 0 Jan 21 18:21 SAM-t2se01.physics.ox.ac.uk
-rw-rw-r-- 1 24135 1399 4788418 Feb 26 12:36 oneEvt.root
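The root cause is visible in the listings above: store/user and store/user/test were group-owned by gid 3406 (cms/Role=lcgadmin) rather than the base cms gid 1399, so ordinary cms members could not write there despite the permissive ACLs, and dpns-chgrp to 1399 fixed it. A quick way to spot such directories is to filter the dpns-ls output on the group column - a sketch, using the listing from this post as canned sample input:

```shell
# Flag entries whose gid (column 4) differs from the base cms gid, 1399.
# In real use, pipe "dpns-ls -l <path>" straight into the awk filter.
listing='drwxrwxr-x 1 24135 1399 0 Jan 13 18:55 PhEDEx_Debug
drwxrwxr-x 2 24135 3490 0 Oct 13 12:15 PhEDEx_LoadTest07
drwxrwxr-x 0 24135 1399 0 Feb 26 12:20 brew
drwxrwxr-x 2 24135 1399 0 Jan 27 15:16 mc
drwxrwxr-x 2 24351 3422 0 Feb 06 18:57 unmerged
drwxrwxr-x 1 24352 3406 0 Jan 21 18:21 user'
echo "$listing" | awk '$4 != 1399 {print $NF, "(gid "$4")"}'
```

Note this flags unmerged and PhEDEx_LoadTest07 too; those gids may well be intentional, so treat the output as candidates to check rather than directories to chgrp blindly.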