Tuesday, May 13, 2014

Configuring ARC CE and Condor with puppet

ARC CE and condor using puppet

We have started testing Condor and ARC CE with the intention of moving away from Torque.  Almost one third of the cluster has been moved to Condor, and we are quite satisfied with Condor as a batch system.  The Condor setup was fairly easy, but configuring ARC CE was a bit challenging.  I believe the new version of ARC CE has fixed most of the issues I faced.  Andrew Lahiff was of great help in troubleshooting our problems.  Our setup consists of:
1. CE: configured as ARC CE and Condor submit host; runs the Condor SCHEDD process
2. Central manager: Condor server; runs the Condor COLLECTOR and NEGOTIATOR processes
3. WNs: run the Condor STARTD process; the emi-wn and glexec metapackages are also installed
The CE, the central manager and the Condor part of the WNs were configured entirely with puppet.  I had to run yaim on the WNs to configure emi-wn and glexec.
I used the puppet modules from https://github.com/HEP-puppet, which were initially written by Luke Kreczko from Bristol.  We are using Hiera to pass parameters, but most of the puppet modules work without Hiera as well.  I do not intend to go into the details of Condor or ARC CE here, but rather the use of the puppet modules to install and configure them.

Condor :
It was a pleasing experience to configure Condor with puppet.  Clone the module into the module directory on the puppet server:
     git clone https://github.com/HEP-Puppet/htcondor.git
Then add
     include htcondor
on the CE, the central manager and the WNs.  Hiera then tells the module which service has to be configured on a particular machine.
# Condor
- 't2condor01.physics.ox.ac.uk'
- 't2arc01.physics.ox.ac.uk'
- 't2wn*.physics.ox.ac.uk'

htcondor::uid_domain: 'physics.ox.ac.uk'
htcondor::collector_name: 'SOUTHGRID_OX'
htcondor::pool_password: 'puppet:///site_files/grid/condor_pool_password'

This configures a basic Condor cluster.  There are no user accounts at this stage, so a test user account can be created on all three machines to test basic Condor jobs.  See the HTCondor manual for further details.
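
As a quick check of the basic pool, a test job along the following lines can be submitted from the submit host as the test user.  This is only a minimal sketch: the JOB_DIR variable and the hostname.* file names are hypothetical, and it assumes the standard HTCondor command-line tools (condor_submit, condor_q) are installed on the machine.

```shell
#!/bin/sh
# Sketch: write a minimal HTCondor submit description and submit it if possible.
# JOB_DIR and the file names are hypothetical; adjust them for your site.
JOB_DIR="${JOB_DIR:-$(mktemp -d)}"

cat > "$JOB_DIR/hostname.sub" <<'EOF'
# Run /bin/hostname on whichever worker node the job lands on.
universe   = vanilla
executable = /bin/hostname
output     = hostname.out
error      = hostname.err
log        = hostname.log
queue
EOF

# Only submit when the HTCondor client tools are present on this machine.
if command -v condor_submit >/dev/null 2>&1; then
    (cd "$JOB_DIR" && condor_submit hostname.sub && condor_q)
else
    echo "condor_submit not found; submit file written to $JOB_DIR/hostname.sub"
fi
```

If the job runs, hostname.out should contain the name of the WN that executed it, which is a simple way to confirm that the SCHEDD, NEGOTIATOR and STARTD are talking to each other.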

Setting up user accounts :
I used this module to create user accounts only on the central manager and the CE.  Since I have to run yaim on the WNs to set up emi-wn and glexec anyway, I created the user accounts on the WNs through yaim.
This puppet module can either parse a gLite-style users.conf to create user accounts, or be given a range of IDs.

Setting up voms server :
This module is used to set up the VOMS client on the central manager and the CE.  One way to use it is to declare each VO separately, as described in the module's README file:
     class { 'voms::atlas': }
I have used a small wrapper class that takes all the VOs as an array:
     include setup_grid_accounts
Then pass the names of the VOs through Hiera:
setup_grid_accounts::vo_list:
    - 'alice'
    - 'atlas'
    - 'cdf'
    - 'cms'
    - 'dteam'
    - 'dzero'

ARC CE :
Add
     include arc_ce
on the CE and then pass configuration parameters from Hiera.  The module has a very long list of configurable parameters, and most of the default values work fine.  Since most of the values are passed through Hiera, the ARC Hiera file is quite long; here are a few examples:
    targethostname: 'index1.gridpp.rl.ac.uk'
    targetport: '2135'
    targetsuffix: 'Mds-Vo-Name=UK,o=grid'
    regperiod: '120'

       default_memory: '2048'
         - '1cpu:4'
          OSFamily: 'linux'
          OSName: 'ScientificSL'
          OSVersion: '6.5'
          OSVersionName: 'Carbon'
          CPUVendor: 'GenuineIntel'
          CPUClockSpeed: '2334'
          CPUModel: 'xeon'
          NodeMemory: '2048'
          totalcpus: '168'

This almost completes the setup of a Condor cluster with ARC CE.  There are a few bits in the arc and puppet modules that exist as workarounds for things which have already been fixed upstream, so the modules need some testing and cleanup.

The WNs need a few small runtime environment settings specific to ARC.  When a job arrives at a WN, the /etc/arc/runtime/ directory is checked for ENV settings.
Our runtime tree looks like this:
├── APPS
│   └── HEP
│       └── ATLAS-SITE-LCG
└── ENV
    ├── GLITE
    └── PROXY
These can be just empty files.  SAM-Nagios doesn't submit jobs if the ARC CE is not publishing the GLITE env.
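
The runtime tree shown above can be created by hand with a few commands.  The following is a minimal sketch, not part of the puppet modules: the RUNTIME_ROOT variable is a hypothetical name, and it defaults to a scratch directory so the commands can be tried safely before pointing them at /etc/arc/runtime on a real WN.

```shell
#!/bin/sh
# Sketch: create the empty ARC runtime environment files from the tree above.
# RUNTIME_ROOT is a hypothetical variable; it defaults to a scratch directory.
# On a real worker node run with RUNTIME_ROOT=/etc/arc/runtime (as root).
RUNTIME_ROOT="${RUNTIME_ROOT:-$(mktemp -d)}"

# Directories first, then the empty marker files.
mkdir -p "$RUNTIME_ROOT/APPS/HEP" "$RUNTIME_ROOT/ENV"
touch "$RUNTIME_ROOT/APPS/HEP/ATLAS-SITE-LCG" \
      "$RUNTIME_ROOT/ENV/GLITE" \
      "$RUNTIME_ROOT/ENV/PROXY"

# List what was created.
find "$RUNTIME_ROOT" -type f | sort
```

Since the same files are needed on every WN, they could equally be managed as puppet file resources; the shell version is just the quickest way to try it on a single node.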

I may have missed a few things, so please feel free to point them out.


