Resolving "Unable to determine user account for execution"

If you are facing the issue "Unable to determine user account for execution", it is likely caused by the fact that the LSF cluster must be restarted after user authentication is switched from NIS to LDAP.

After the user authentication method is switched from NIS to LDAP, all user jobs are pending with the pending reason: "Unable to determine user account for execution".

If the user authentication mode is changed, you have to restart the LSF daemons on all LSF hosts for jobs to run successfully. On the LSF master host, run:

# lsadmin reconfig
# badmin mbdrestart
# badmin hrestart all
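
Once the daemons are restarted, you can check that the pending reason has cleared for the affected jobs (assuming the jobs are still queued); as a sketch:

# bjobs -u all -p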

References:

  1. LSF cluster must be restarted if user authentication is switched from NIS to LDAP

Packing serial jobs neatly in Platform LSF

Taken from Placing jobs based on available job slots of hosts

Platform LSF allows you to pack or spread jobs as required. Before going further, here are a few terms to define:

  1. Packing means always placing jobs on the hosts with the least available slots first. Packing jobs can make room for bigger parallel jobs.
  2. Spreading tries to spread jobs out and places jobs on the hosts with the most available slots first. Spreading jobs maximizes the performance of individual jobs.

I will deal with one situation where I want to pack all the serial jobs neatly in as few nodes as possible.

Here is the relevant background from the LSF Wiki:

The slots keyword represents available slots on each host and it is a built-in numeric decreasing resource. When a job occupies some of the job slots on the host, the slots resource value is decreased accordingly. For example, if MXJ of an LSF host is defined as 8, the slots value will be 8 when the host is empty. When 6 LSF job slots have been occupied, slots becomes 2. The slots resource can only be used in the select[] and order[] sections of a job resource requirement string. To apply a job packing or spreading policy, you can use the order[] section in the job resource requirement. For example, -R "order[-slots]" will order candidate hosts based on the least available slots, while -R "order[slots]" will order candidate hosts based on the hosts with the most available slots.
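
Besides the queue-level configuration shown in the steps below, the same ordering can also be requested per job at submission time. A minimal sketch, assuming a simple serial job script named ./myjob.sh:

$ bsub -R "order[-slots]" ./myjob.sh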

To use ! in an order[] clause, you must set SCHED_PER_JOB_SORT=Y in lsb.params. To make the parameter take effect, run badmin mbdrestart or badmin reconfig on the master host to reconfigure mbatchd.

The following is an example of using the slots resource:
Step 1: Configure RES_REQ in a Queue section of lsb.queues.

Begin Queue
QUEUE_NAME = myqueue
…
RES_REQ = order[-slots]
…
End Queue

Step 2: Make the configuration take effect

# badmin reconfig

Step 3: Check whether the configuration has taken effect.

# bqueues -l myqueue
QUEUE: myqueue
…
RES_REQ: order[-slots]

You can also configure this at the application level. For more information, please read the Platform LSF Wiki.
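
For instance, a minimal application profile in lsb.applications might look like the following (the profile name packApp is only an illustration; run badmin reconfig afterwards as in Step 2):

Begin Application
NAME         = packApp
RES_REQ      = order[-slots]
DESCRIPTION  = Pack serial jobs onto the fewest hosts
End Application

Jobs can then be submitted with bsub -app packApp ./myjob.sh.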

References:

  1. Placing jobs based on available job slots of hosts

Setting up Secondary Master Host on Platform LSF

Setting up a secondary master host on Platform LSF is very easy.

Step 1: Update the LSF_MASTER_LIST parameter in lsf.conf by adding the secondary master candidate after the current master host

# cd $LSF_ENVDIR
# vim lsf.conf

At line 114

.....
LSF_MASTER_LIST="h00 h01"
.....

If you wish to switch the order of the master hosts for maintenance, that can be done here as well.
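
For example, to make h01 the first master candidate while h00 is taken down for maintenance, the order in the same list is simply reversed (host names are the placeholders used above):

.....
LSF_MASTER_LIST="h01 h00"
.....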

Step 2: Reconfigure the cluster and restart the LSF mbatchd and mbschd processes

# lsadmin reconfig
# badmin mbdrestart
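
To confirm which host is currently acting as the master after the restart, lsid reports it (the output below is abbreviated and illustrative, using the placeholder names from this example):

# lsid
...
My cluster name is yourhpccluster
My master name is h00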

Step 3: Update the master_hosts in lsb.hosts

# cd $LSF_ENVDIR/lsbatch/yourhpccluster/configdir
# vim lsb.hosts
Begin HostGroup
GROUP_NAME    GROUP_MEMBER      #GROUP_ADMIN # Key words
master_hosts      (h00 h01)
End HostGroup
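
Since lsb.hosts is a batch configuration file, reload the batch system once more after saving this change (assuming the edit is made after the restart in Step 2):

# badmin reconfig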

References:

  1. Switch LSF master host to secondary master candidate

Submitting an interactive job on Platform LSF

Using a Pseudo-terminal to launch Interactive Job

Point 1: Submit a batch interactive job using a pseudo-terminal.

$ bsub -Ip vim output.log

Submits a batch interactive job to edit output.log.

Point 2:  Submit a batch interactive job and create a pseudo-terminal with shell mode support.

$ bsub -Is bash

Submits a batch interactive job that starts up bash as an interactive shell.

When you specify the -Is option, bsub submits a batch interactive job and creates a pseudo-terminal with shell mode support when the job starts.
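
The -Is option combines with the usual bsub resource flags. As an illustrative sketch (the queue name interactive is an assumption; substitute a queue that exists in your cluster):

$ bsub -Is -q interactive -n 4 bash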

References:

  1. Submit an interactive job by using a pseudo-terminal

Cleaning up Platform LSF parallel Job Execution Problems – Part 3

This refers to Parallel job abnormal task exit

This article is taken from Cleaning up Platform LSF parallel job execution problems

If some tasks exit abnormally during parallel job execution, LSF takes action to terminate and clean up the entire job. This behaviour can be customized with RTASK_GONE_ACTION in an application profile in lsb.applications or with the LSB_DJOB_RTASK_GONE_ACTION environment variable in the job environment. The LSB_DJOB_RTASK_GONE_ACTION environment variable overrides the setting of RTASK_GONE_ACTION in lsb.applications.

The following values are supported:

[KILLJOB_TASKDONE | KILLJOB_TASKEXIT] [IGNORE_TASKCRASH]

  • KILLJOB_TASKDONE: LSF terminates all tasks in the job when one remote task exits with a zero value.
  • KILLJOB_TASKEXIT: LSF terminates all tasks in the job when one remote task exits with a non-zero value.
  • IGNORE_TASKCRASH: LSF does nothing when a remote task crashes. The job continues to run to completion.

By default, RTASK_GONE_ACTION is not defined, so LSF terminates all tasks and shuts down the entire job when one task crashes.

For example:
  • Define an application profile in lsb.applications:
Begin Application
NAME         = myApp
DJOB_COMMFAIL_ACTION=IGNORE_COMMFAIL
RTASK_GONE_ACTION="IGNORE_TASKCRASH KILLJOB_TASKEXIT"
DESCRIPTION  = Application profile example
End Application
  • Run badmin reconfig as LSF administrator to make the configuration take effect.
  • Submit an MPICH2 job with -app myApp:
$ bsub -app myApp -n 4 -R "span[ptile=2]" mpiexec.hydra ./cpi
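
If you prefer not to touch lsb.applications, the same behaviour can be requested per job through the environment variable mentioned above, since bsub propagates the submission environment to the job. A sketch, reusing the cpi example:

$ export LSB_DJOB_RTASK_GONE_ACTION="IGNORE_TASKCRASH"
$ bsub -n 4 -R "span[ptile=2]" mpiexec.hydra ./cpi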

References:

  1. Cleaning up parallel job execution problems
  2. Cleaning up Platform LSF parallel Job Execution Problems – Part 1
  3. Cleaning up Platform LSF parallel Job Execution Problems – Part 2
  4. Cleaning up Platform LSF parallel Job Execution Problems – Part 3

 

Compiling Intel BLAS95 and LAPACK95 Interface Wrapper Library

BLAS95 and LAPACK95 wrappers to Intel MKL are delivered both as part of Intel MKL and as source code, which can be compiled to build a standalone wrapper library with exactly the same functionality.

The source code and makefiles for the wrappers are found in the …..\interfaces\blas95 and …..\interfaces\lapack95 subdirectories of the Intel MKL directory.

For blas95

# cd $MKLROOT
# cd interfaces/blas95
# make libintel64  INSTALL_DIR=$MKLROOT/lib/intel64

Once compiled, the libraries are kept in $MKLROOT/lib/intel64.

For Lapack95

# cd $MKLROOT
# cd interfaces/lapack95
# make libintel64  INSTALL_DIR=$MKLROOT/lib/intel64

Once compiled, the libraries are kept in $MKLROOT/lib/intel64.
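
To link against the freshly built wrappers, a typical Intel Fortran link line looks roughly like the following. This is only a sketch: test_blas95.f90 is a hypothetical program, the library names assume the default lp64 interface, and the module include path assumes the wrappers were installed under $MKLROOT; adjust both to match your MKL version and INSTALL_DIR.

$ ifort test_blas95.f90 -I$MKLROOT/include/intel64/lp64 \
      -L$MKLROOT/lib/intel64 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 \
      -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm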