Cleaning up Platform LSF parallel Job Execution Problems – Part 3


This refers to Parallel job abnormal task exit

This article is taken from Cleaning up Platform LSF parallel job execution problems

 If some tasks exit abnormally during parallel job execution, LSF takes action to terminate and clean up the entire job. This behaviour can be customized with RTASK_GONE_ACTION in an application profile in lsb.applications or with the LSB_DJOB_RTASK_GONE_ACTION environment variable in the job environment.
The LSB_DJOB_RTASK_GONE_ACTION environment variable overrides the setting of RTASK_GONE_ACTION in lsb.applications.
 The following values are supported:
[KILLJOB_TASKDONE | KILLJOB_TASKEXIT] [IGNORE_TASKCRASH]
KILLJOB_TASKDONE:               LSF terminates all tasks in the job when one remote task exits with a zero value.
KILLJOB_TASKEXIT:               LSF terminates all tasks in the job when one remote task exits with non-zero value.
IGNORE_TASKCRASH:              LSF does nothing when a remote task crashes. The job continues to run to completion.
By default, RTASK_GONE_ACTION is not defined, so LSF terminates all tasks, and shuts down the entire job when one task crashes.
 For example:
  • Define an application profile in lsb.applications:
Begin Application
NAME         = myApp
DJOB_COMMFAIL_ACTION=IGNORE_COMMFAIL
RTASK_GONE_ACTION=”IGNORE_TASKCRASH KILLJOB_TASKEXIT”
DESCRIPTION  = Application profile example
End Application
  • Run badmin reconfig as LSF administrator to make the configuration take effect.
  • Submit an MPICH2 job with –app myApp:
$ bsub –app myApp –n4 –R “span[ptile=2]” mpiexec.hydra ./cpi

References:

  1. Cleaning up parallel job execution problems
  2. Cleaning up Platform LSF parallel Job Execution Problems – Part 1
  3. Cleaning up Platform LSF parallel Job Execution Problems – Part 2
  4. Cleaning up Platform LSF parallel Job Execution Problems – Part 3

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s