Saturday, November 07, 2009

A week of upgrades for the RAL Tier 2 - Part 1 - The Network

Well, it has been a long week at the RAL Tier 2. We've finally had our much-postponed downtime to update our dCache installation (delayed once when one of the disk servers got a corrupt filesystem, then to avoid a CMS analysis test and finally to avoid an Atlas analysis test). The delays, however, did mean we could also include the long-planned network upgrade in the downtime - this was probably a good thing.

So we had quite a programme of work for a five day downtime:
  1. Replace the PNFS namespace in dCache with Chimera
  2. Update dCache from 1.9.1 to the "Golden Release" 1.9.5
  3. Install a new network switch and set up a 10Gb/s link between the two halves of our farm
Indeed, heading into work on Friday with neither dCache nor the network working, I thought I would be extending the downtime into the next week, but by lunchtime things had improved and we were able to come out of the downtime on time at 5pm - although despite a full suite of "OK" SAM tests GridView still had us down until nearly eight o'clock.

Taking the last of the upgrades first: before last week we had the two halves of our farm in two different rooms. Each half of the farm has its own Nortel 55XX network stack. Most of the storage is in the room known as Lab 8 in the R1 office building with a 10Gb/s connection to Site Router A, whilst most of the compute nodes are in the Atlas lower machine room, A5Lower, with a 2x1Gb/s connection to Site Router A. That 2x1Gb/s connection between the storage and compute nodes was our main bottleneck - it would regularly run at over 99% capacity for days during Atlas Hammercloud tests.

The plan was to install a Nortel 5650 switch into the stack in A5Lower then set up a direct 10Gb/s fibre link from there to Lab 8 - cutting out the 2x1Gb/s link and Router A. That sounded fairly trivial and when I went down with Networking on Thursday afternoon to set it up I expected to be back in an hour to carry on struggling with our, at that time, broken dCache.
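For a rough sense of what the new link buys us, the arithmetic can be sketched as below. This is only a back-of-the-envelope illustration: it assumes decimal units (1 Gb/s = 10^9 bits per second) and ignores protocol overhead, and the function name is made up for the example.

```python
# Rough link-capacity arithmetic. Assumptions: decimal units and no
# protocol overhead, so real-world throughput will be somewhat lower.
GBIT = 1_000_000_000  # bits per second in 1 Gb/s


def capacity_mbytes_per_s(links: int, gbit_per_link: float) -> float:
    """Aggregate raw capacity of `links` parallel links, in MB/s."""
    return links * gbit_per_link * GBIT / 8 / 1_000_000


old_link = capacity_mbytes_per_s(2, 1)   # the 2x1Gb/s path via Router A
new_link = capacity_mbytes_per_s(1, 10)  # the direct 10Gb/s fibre link

print(f"old: {old_link:.0f} MB/s, new: {new_link:.0f} MB/s")
print(f"ratio: {new_link / old_link:.0f}x")
```

So the old path topped out at roughly 250 MB/s shared between all the compute nodes, while the direct link raises the raw ceiling by about a factor of five.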

Due to cabling issues we had to re-order the switches in the stack and I also had to swap out a 5510 I had borrowed from the Tier 1 and replace it with a new one. So we broke up the current stack and tried to stack the 5650 with one of the 5510s. According to everything we had read they should have seen each other, the 5650 should have downloaded an updated version of the firmware and software to the older 5510 and then they should have joined together as a single switch. But ours did not talk to each other.

Well, possibly the version of the software on the 5510s was too old, so we went to each switch in turn, set it up with an IP address, downloaded a new version of the firmware and software, and restarted it.

By the end of Thursday we were more-or-less back where we had started - we had a stack of 5510s (still without the 5650).

On Friday morning Nick found a setting on the 5650 to allow "hybrid stack mode" and suddenly everything worked.

We soon had all the correct VLANs set up and the two halves of our network were talking over the new fast link.
