Thursday, September 09, 2010

Tracing a Grid Job (A recap)

Just in case we should forget how to trace a grid job, I record some steps below.

For example, you discover via a CMS SAM page that you are failing some test (it could equally be any other SAM page, such as LHCb's). You click on the detailed output and see a reference to the job id; in this case the output on t2ce05 contains the string sOFavxScVKU-GbSYaCmx-A.

On t2ce05:
grep sOFavxScVKU-GbSYaCmx-A /opt/edg/var/gatekeeper/grid-jobmap_20100906
This reveals the batch system job id: lrmsID=2998805.t2torque02.physics.ox.ac.uk
On the batch server (t2torque02 in our case), run either:
tracejob 2998805

or

grep 2998805 /var/spool/pbs/server_logs/20100909

The tracejob option is easier!

This will let you know which worker node ran the job. You can then have a look at it to check for full disks, memory faults, segfaults in the log files, and so on.
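For example, a few quick checks on the worker node (illustrative commands only; the exact log file locations will vary by site):

df -h                                  # look for full filesystems
dmesg | grep -iE 'error|mce'           # recent kernel/hardware complaints
grep -i segfault /var/log/messages     # segfaults recorded by the kernel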

Now in reverse

A job is misbehaving on your node and you need to see who is running it.
The special case here is that it is an ATLAS pilot job, which does not have a normal grid job id.

Get the PID from top, then use
pstree -H pid
to highlight the process's parents.
(Use pstree -A -H pid if you are in a PuTTY window on Windows.)

This reveals which PBS job it is,
e.g. 3020508.t2torque02.physics.ox.ac.uk
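As a cross-check (a sketch, assuming a Torque/PBS worker node where the mom puts PBS_JOBID into the job's environment), you can also read the job id straight out of the environment of the misbehaving process:

# replace <pid> with the process id found in top
tr '\0' '\n' < /proc/<pid>/environ | grep PBS_JOBID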

The job can be traced on the Panda monitor, using the search facility in the left-hand toolbar.
This gives the job details, including the user's name. A GGUS ticket could then be raised against ATLAS asking for the user to be informed.

Wednesday, September 01, 2010

APEL on ngsce-test

APEL was failing on ngsce-test with the following error.

java.io.FileNotFoundException: /var/spool/pbs/server_priv/accounting/20090522 (Too many open files)

The solution was to type:
ulimit -n 10240

I've added this to the /opt/glite/bin/apel-pbs-log-parser script.
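For the record, a minimal sketch of the change (the surrounding wrapper lines are illustrative, not the actual contents of the script):

#!/bin/sh
# /opt/glite/bin/apel-pbs-log-parser (wrapper, sketch)
ulimit -n 10240    # raise the open-file limit before the Java parser runs
# ... original invocation of the APEL PBS log parser follows unchanged ...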

A fix is in testing, so a new version of APEL will resolve this properly; see GGUS ticket
https://gus.fzk.de/ws/ticket_info.php?ticket=60674

Friday, August 27, 2010

Argus Server at Oxford

We finally managed to install the Argus server at Oxford, with a messy workaround. Installation and configuration were reasonably OK, and once the policy structure was clear, writing and loading policies was also easy. Details are here: http://www.gridpp.ac.uk/wiki/Oxford.

The main issue was that the host certificate issued by the UK CA contains an "emailAddress" field; this was supposedly deprecated years ago, and most developers assume there is no "emailAddress" in a host certificate. It is still a bug in Argus, though, and hopefully it will be resolved in the next release.
So, the workaround:
By default the pap-admin command uses the host certificate in /etc/grid-security/ when started as root, but since there is a problem with the host certificate, I copied my personal certificate proxy from the UI and started pap-admin using that proxy instead. Then I added an ACE:

pap-admin ace "/C=UK/O=eScience/OU=Oxford/L=OeSC/CN=t2argus02.physics.ox.ac.uk/OID.1.2.840.113549.1.9.1=lcg_manager@physics.ox.ac.uk" ALL

This workaround was suggested by Andrea Ceccanti.

The only issue is that if you want to restart the pap service, you must first remove the ACE using the remove-ace command, restart pap, and then add the ACE again.
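Something like the following sequence, in other words (a sketch; the pap-standalone init script name and the exact remove-ace arguments are assumptions, so check your installation):

# drop the ACE before restarting (arguments assumed to mirror the ace command above)
pap-admin remove-ace "/C=UK/O=eScience/OU=Oxford/L=OeSC/CN=t2argus02.physics.ox.ac.uk/OID.1.2.840.113549.1.9.1=lcg_manager@physics.ox.ac.uk" ALL
# restart the PAP service
/etc/init.d/pap-standalone restart
# re-add the ACE
pap-admin ace "/C=UK/O=eScience/OU=Oxford/L=OeSC/CN=t2argus02.physics.ox.ac.uk/OID.1.2.840.113549.1.9.1=lcg_manager@physics.ox.ac.uk" ALL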

Wednesday, June 23, 2010

Oxford's blanking panels


Having just read Stuart's ScotGrid blog post about cooling in the top of racks, I thought I'd let you know about the panels we use.

We have been specifying that all empty rack slots should be filled with blanking panels since our 2007 purchase. They used to use metal blanking panels.

These days they tend to supply the 1U APC plastic clip-in panels, as can be seen in the right-hand rack in the photo.
These cost £25-£30 per pack of 10, but we managed to get a bulk purchase (200) in 2008 which worked out at about £1.69 each.

http://www.apc.com/resource/include/techspec_index.cfm?base_sku=AR8136BLK200

Tuesday, May 18, 2010

Jobs with analysis role

It started with a ticket from DZero about job failures on the CREAM CE at Oxford. On investigation it was found that these jobs were arriving with /dzero/users/Role=analysis/Capability=NULL, and LCMAPS was failing, as expected, with the error "no entry found for /dzero/users/Role=NULL/Capability=NULL".
But jobs from the same user were running on the lcg-CE, so on further investigation it turned out that the lcmaps-voms plugins were failing on the lcg-CE too; however, as per the LCMAPS policy it runs the lcmaps-poolaccount plugin after the voms plugin fails, and lcmaps-poolaccount uses the individual DN mappings from the grid-mapfile. So the lcg-CE was mapping the jobs correctly to a dzero pool account, but through the wrong procedure.
The CREAM CE doesn't use edg-mkgridmap for creating its grid-mapfile, so no individual mappings are defined there.
The solution was quite easy: we just had to define MAP_WILDCARDS=yes in vo.d/dzero, and rerunning YAIM created a slightly different grid-mapfile and groupmapfile with wildcards:

"/dzero/Role=lcgadmin/Capability=NULL" dzerosgm
"/dzero/Role=lcgadmin" dzerosgm
"/dzero/Role=production/Capability=NULL" dzeroprd
"/dzero/Role=production" dzeroprd
"/dzero/*/Role=*" .dzero
"/dzero/*" .dzero
"/dzero/Role=NULL/Capability=NULL" .dzero
"/dzero" .dzero

So any job arriving with a different Role would be mapped to a normal pool account.
The issue was discussed in this ticket: https://savannah.cern.ch/bugs/index.php?26990
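For reference, the whole fix boils down to one line in the VO definition plus a YAIM rerun; a minimal sketch (the file location and node type are illustrative and depend on the site layout):

# vo.d/dzero  (kept wherever the site stores its YAIM configuration)
MAP_WILDCARDS=yes

# then reconfigure the CE, e.g.:
/opt/glite/yaim/bin/yaim -c -s site-info.def -n creamCE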