Cleaning up Platform LSF parallel Job Execution Problems – Part 1


Taken from IBM Spectrum LSF Wiki – Cleaning up Parallel Job Execution Problems. 

 Job cleanup refers to the following:
  1. Clean up all left-over processes on all execution nodes
  2. Perform post-job cleanup operations on all execution nodes, such as cleaning up cgroups, cleaning up Kerberos credentials, resetting CPU frequencies, etc.
  3. Clean up the job from LSF and mark job Exit status

The LSF default behavior is designed to handle most common recovery for these scenarios. LSF also offers a set of parameters to allow end users to tune LSF behavior for each scenario, especially how fast LSF can detect each failure and what action LSF should take in response.

 There are typically three scenarios requiring job cleanup:
  1. First execution host crashing or hanging
  2. Non-first execution host crashing or hanging
  3. Parallel job tasks exit abnormally
This article describes how to configure LSF to handle these scenarios.
 Scenario 1 – Parallel job first execution host crashing or hanging
When the first execution host crashes or hangs, by default, LSF will mark a running job as UNKNOWN. LSF does not clean up the job until the host comes back and the LSF master confirms that the job is really gone from the system. However, this default behaviour may not always be desirable, since such hung jobs will hold their resource allocation for some time. Define REMOVE_HUNG_JOBS_FOR in lsb.params to change the default LSF behaviour and remove the hung jobs from the system automatically.

In lsb.params

REMOVE_HUNG_JOBS_FOR = runlimit:host_unavail

LSF removes jobs if they run 10 minutes past the job RUN LIMIT or become UNKNOWN for 10 minutes due to the first execution host becoming unavailable. If you want to change the timing

 In lsb.params
REMOVE_HUNG_JOBS_FOR = runlimit[,wait_time=5]:host_unavail[,wait_time=5]

 

Other Information:

For DJOB_HB_INTERVAL, DJOB_RU_INTERVAL (lsb.applications) and LSF_RES_ALIVE_TIMEOUT (lsf.conf)

  1. The default value of LSB_DJOB_HB_INTERVAL is 120 seconds per 1000 nodes
  2. The default value of LSB_DJOB_RU_INTERVAL is 300 seconds per 1000 nodes

In case of large, long running parallel jobs, LSB_DJOB_RU_INTERVAL can be set to a long time or even disabled with a 0 value to prevent too frequent resource usage update, which consumes network bandwidth as well as CPU time for LSF to process large volume of resource usage information. LSB_DJOB_HB_INTERVAL cannot be disabled.

References: 

  1. Cleaning up parallel job execution problems
  2. Cleaning up Platform LSF parallel Job Execution Problems – Part 1
  3. Cleaning up Platform LSF parallel Job Execution Problems – Part 2
  4. Cleaning up Platform LSF parallel Job Execution Problems – Part 3
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s