Thursday, September 09, 2010

Tracing a Grid Job (A recap)

Just in case we forget how to trace a grid job, I have recorded some steps below.

For example, you discover via a CMS SAM page that you are failing some test (it could equally be any other SAM page, such as LHCb). You click on the detailed output and see a reference to the job id, which
on t2ce05 contains the string: sOFavxScVKU-GbSYaCmx-A
On t2ce05, run:
grep sOFavxScVKU-GbSYaCmx-A /opt/edg/var/gatekeeper/grid-jobmap_20100906
This reveals the batch system job id: lrmsID=2998805.t2torque02.physics.ox.ac.uk
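
The lookup above can be wrapped in a small shell function so that only the batch job id is printed. This is a minimal sketch, not the site's actual tooling: the function name lrms_id_for is made up for illustration, and it assumes each jobmap line carries a field of the form lrmsID=<id>, optionally wrapped in double quotes.

```shell
# Hypothetical helper: print the batch-system (LRMS) job id for a grid job string.
# Assumption: matching lines contain a field of the form lrmsID=<id>,
# terminated by a double quote, a space, or the end of the line.
# Usage: lrms_id_for <grid-job-string> <grid-jobmap-file>
lrms_id_for() {
    # -F treats the job string as fixed text rather than a regular expression
    grep -F -- "$1" "$2" | grep -o 'lrmsID=[^" ]*' | cut -d= -f2
}

# Example call, using the names from the text above:
# lrms_id_for sOFavxScVKU-GbSYaCmx-A /opt/edg/var/gatekeeper/grid-jobmap_20100906
```

The numeric part before the first dot is what gets passed to tracejob.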
Then, on the batch server (t2torque02 in our case), run either:
tracejob 2998805

or

grep 2998805 /var/spool/pbs/server_logs/20100909

The tracejob option is easier!

This will tell you which worker node ran the job. You can then look at that node to check for full disks, memory faults etc., or for segfaults in the log files.
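
To pick the worker node out of a long trace, the output can be filtered. This is a rough sketch that assumes the trace contains an exec_host=<node>/<slot> field, as Torque accounting records usually do; the function name exec_hosts is invented here, and if your output looks different the pattern will need adjusting.

```shell
# Hypothetical helper: read tracejob (or log) text on stdin and print the
# distinct exec_host values found in it.
# Assumption: the node appears as exec_host=<node>/<slot>, ended by a space
# or the end of the line.
exec_hosts() {
    grep -o 'exec_host=[^ ]*' | cut -d= -f2 | sort -u
}

# Example call on the batch server:
# tracejob 2998805 | exec_hosts
```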

Now in reverse

A job is misbehaving on your node and you need to see who is running it.
The special case here is that it is an ATLAS pilot job, which does not have a normal grid job id.

Get the PID from top, then use
pstree -H pid
to highlight the process's parents.
(Use pstree -A -H pid if you are in a PuTTY window on Windows.)

This reveals which PBS job it is,
e.g. 3020508.t2torque02.physics.ox.ac.uk
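
If the pstree picture is hard to read, the same ancestry can be listed as plain text, one process per line, and then searched for the job id. This is only a sketch: the function name ancestors is made up, and it assumes that the PBS job id shows up in the command line of one of the parent processes, as it does in the pstree output above.

```shell
# Hypothetical helper: print a PID and each of its ancestors, one per line,
# as "<pid> <command line>", walking up through the parent PIDs.
ancestors() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 0 ]; do
        printf '%s %s\n' "$pid" "$(ps -o args= -p "$pid")"
        # Stop once init (PID 1) has been printed
        if [ "$pid" -eq 1 ]; then
            break
        fi
        # Move up to the parent; the result is empty if the process has gone away
        pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    done
}

# Example call, with the PID taken from top, filtering for the batch server name:
# ancestors 12345 | grep t2torque02
```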

The job can be traced on the PanDA monitor, using the search facility on the left-hand toolbar.
This gives the job details, including the user's name. A GGUS ticket could then be raised against ATLAS asking for the user to be informed.
