There is a useful link, Common LSF problems, which I find very helpful in troubleshooting. This post covers the case of a parallel job's non-first execution host crashing or hanging.
Scenario 1: LSB_FANOUT_TIMEOUT_PER_LAYER (lsf.conf)
Before a parallel job executes, LSF needs to do some setup work on each execution host and populate job information to all of these hosts. LSF provides a communication fan-out framework to handle this. If an execution host fails, the framework uses a timeout value to control how quickly LSF treats the communication as failed and rolls back the job dispatching decision. By default, the timeout value is 20 seconds for each communication layer. Define LSB_FANOUT_TIMEOUT_PER_LAYER in lsf.conf to customize the timeout value.
After changing lsf.conf, restart sbatchd on all hosts to apply the new value:
# badmin hrestart all
LSB_FANOUT_TIMEOUT_PER_LAYER can also be defined in the environment before job submission to override the value specified in lsf.conf.
You can set a larger value for large jobs (for example, 60 for jobs spanning over 1K nodes).
- One indicator that this parameter needs tuning is bhist -l showing jobs bouncing back and forth between starting and pending due to job timeout errors. Timeout errors are logged in the sbatchd log.
$ bhist -l 100
Job , User , Project , Command
Mon Oct 21 19:20:43: Submitted from host , to Queue , CWD , 320 Processors Requested, Requested Resources <span[ptile=8]>;
Mon Oct 21 19:20:43: Dispatched to 40 Hosts/Processors <……
Mon Oct 21 19:20:43: Starting (Pid 19137);
Mon Oct 21 19:21:06: Pending: Failed to send fan-out information to other SBDs;
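As a sketch, a cluster that routinely runs jobs like the one above might raise the per-layer timeout from the 20-second default (the value 60 is an illustrative choice, not an official recommendation):

```
# lsf.conf -- raise the fan-out timeout per communication layer
LSB_FANOUT_TIMEOUT_PER_LAYER=60
```

To override it for a single job instead, export LSB_FANOUT_TIMEOUT_PER_LAYER in the environment before running bsub.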
Scenario 2: LSF_DJOB_TASK_REG_WAIT_TIME (lsf.conf)
When a parallel job is started, an LSF component on the first execution host needs to receive a registration message from other components on non-first execution hosts. By default, LSF waits for 300 seconds for those registration messages. After 300 seconds, LSF starts to clean up the job.
Use LSF_DJOB_TASK_REG_WAIT_TIME to customize the time period. The parameter can be defined in lsf.conf or in the job environment at submission time. The parameter in lsf.conf applies to all jobs in the cluster, while the job environment variable only controls the behaviour of the particular job. The job environment variable overrides the value in lsf.conf. The unit is seconds. Set a larger value for large jobs (for example, 600 seconds for jobs across 5000 nodes).
After changing lsf.conf, restart RES on all hosts to apply the new value:
# lsadmin resrestart
You should set this parameter if you see an INFO level message like the following in res.log.first_execution_host:
$ grep "waiting for all tasks to register" res.log.hostA
Oct 20 20:20:29 2013 7866 6 9.1.1 doHouseKeeping4ParallelJobs: job 101 timed out (20) waiting for all tasks to register, registered (315) out of (320)
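A minimal lsf.conf sketch for a cluster that runs jobs across thousands of nodes, using the 600-second example value from the text above:

```
# lsf.conf -- allow up to 10 minutes for all tasks to register
LSF_DJOB_TASK_REG_WAIT_TIME=600
```

As with the cluster-wide setting, exporting LSF_DJOB_TASK_REG_WAIT_TIME before submission changes the wait time for that job only.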
Scenario 3: DJOB_COMMFAIL_ACTION (lsb.applications)
After a job is successfully launched and all tasks have registered, LSF keeps monitoring the connections from the first node to the rest of the execution nodes. If a connection failure is detected, by default LSF begins to shut down the job. Configure DJOB_COMMFAIL_ACTION in an application profile in lsb.applications to customize the behaviour. The parameter takes one of the following values:
IGNORE_COMMFAIL: LSF ignores communication failures between the first node and the rest of the execution nodes, and the job continues to run.
KILL_TASKS: LSF tries to kill all the current tasks of a parallel or distributed job associated with the communication failure.
By default, DJOB_COMMFAIL_ACTION is not defined – LSF terminates all tasks and shuts down the entire job.
You can also set the environment variable LSB_DJOB_COMMFAIL_ACTION before submitting a job to override the value set in the application profile.
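A sketch of such an application profile in lsb.applications (the profile name parjob is a hypothetical example):

```
Begin Application
NAME                 = parjob
# Keep the job running when a non-first execution node loses contact
DJOB_COMMFAIL_ACTION = IGNORE_COMMFAIL
End Application
```

Submit against the profile with bsub -app parjob, and run badmin reconfig after editing lsb.applications so the change takes effect.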
When shutting down a job, LSF will:
- Clean up all left-over processes on all execution nodes
- Perform post-job cleanup operations on all execution nodes, such as cleaning up cgroups, cleaning up Kerberos credentials, resetting CPU frequencies, etc.
- Clean up the job from LSF and mark the job's Exit status
The LSF default behavior is designed to handle the most common recovery for these scenarios. LSF also offers a set of parameters that let end users tune LSF behavior for each scenario, in particular how fast LSF detects each failure and what action LSF takes in response:
- First execution host crashing or hanging
- Non-first execution host crashing or hanging
- Parallel job tasks exit abnormally
REMOVE_HUNG_JOBS_FOR (lsb.queues) handles the first execution host becoming unavailable:
REMOVE_HUNG_JOBS_FOR = runlimit:host_unavail
LSF removes jobs if they run 10 minutes past the job RUN LIMIT, or if they stay in UNKNOWN state for 10 minutes because the first execution host became unavailable. To change the timing, add a wait_time value (in minutes):
REMOVE_HUNG_JOBS_FOR = runlimit[,wait_time=5]:host_unavail[,wait_time=5]
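A sketch of the setting inside a queue definition in lsb.queues (the queue name normal is a hypothetical example):

```
Begin Queue
QUEUE_NAME           = normal
# Remove hung jobs 5 minutes after RUN LIMIT or first-host unavailability
REMOVE_HUNG_JOBS_FOR = runlimit[,wait_time=5]:host_unavail[,wait_time=5]
End Queue
```

Run badmin reconfig after editing lsb.queues.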
DJOB_HB_INTERVAL and DJOB_RU_INTERVAL (lsb.applications), and LSF_RES_ALIVE_TIMEOUT (lsf.conf)
- The default value of LSB_DJOB_HB_INTERVAL is 120 seconds per 1000 nodes
- The default value of LSB_DJOB_RU_INTERVAL is 300 seconds per 1000 nodes
For large, long-running parallel jobs, LSB_DJOB_RU_INTERVAL can be set to a long interval, or even disabled with a value of 0, to prevent overly frequent resource usage updates, which consume network bandwidth as well as CPU time for LSF to process the large volume of resource usage information. LSB_DJOB_HB_INTERVAL cannot be disabled.
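A sketch of an application profile that stretches both intervals for very large jobs (the values are illustrative, not recommendations, and the profile name bigpar is hypothetical):

```
Begin Application
NAME             = bigpar
# Heartbeat less often than the per-1000-node default
DJOB_HB_INTERVAL = 240
# Disable periodic resource usage updates entirely
DJOB_RU_INTERVAL = 0
End Application
```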
This blog is a follow-up to Topology Scheduling on Platform LSF.
Scenario 1: Submit directly to a specific Compute Unit
$ bsub -m "r1" -n 64 ./a.out
This job asks for 64 slots, all of which must be on hosts in the CU r1.
Scenario 2: Requesting a Compute Unit type level (for example, rack)
$ bsub -R "cu[type=rack]" -n 64 ./a.out
Scenario 3: Sequential Job Packing
The following job uses pref=minavail so that LSF prefers CUs with the fewest free slots, packing sequential jobs into already partly used CUs.
$ bsub -R "cu[pref=minavail]" ./a.out
Scenario 4: Parallel Job Packing
The following job uses pref=maxavail to prefer CUs with the most free slots.
$ bsub -R "cu[pref=maxavail]" -n 64 ./a.out
Scenario 5: Limiting the number of CUs a job can span
The following allows a job to span at most 2 CUs of type rack, preferring racks with the most free slots.
$ bsub -R "cu[type=rack:pref=maxavail:maxcus=2]" -n 32 ./a.out
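The cu[] section can be combined with other resource requirement sections in one -R string. For example, the following sketch (not from the original article) confines all 64 slots to a single rack while placing 8 tasks per host:

```
$ bsub -n 64 -R "cu[type=rack:maxcus=1] span[ptile=8]" ./a.out
```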
For a highly parallel job that spans multiple hosts, it is desirable to allocate hosts that are close together according to the network topology, to minimize communication latency.
The article is taken from IBM Platform LSF Wiki “Using Compute units for Topology Scheduling”
Step 1: Define COMPUTE_UNIT_TYPES in lsb.params
COMPUTE_UNIT_TYPES = enclosure! switch rack
- The example specifies 3 CU types. In this parameter, the order of the values corresponds to levels in the network topology: hosts in CU type enclosure are contained in CU type switch, which in turn is contained in CU type rack.
- The exclamation mark (!) following enclosure means that this is the default level to be used for jobs with CU topology requirements. If the exclamation mark is omitted, the first string listed is the default type.
Step 2: Arrange hosts into compute units in lsb.hosts
Begin ComputeUnit
NAME   TYPE       CONDENSE  MEMBER
en1-1  enclosure  Y         (c00 c01 c02)
en1-2  enclosure  Y         (c03 c04 c05)
en1-3  enclosure  Y         (c06 c07 c08 c09 c10)
.....
s1     switch     Y         (en1-1 en1-2)
s2     switch     Y         (en1-3)
.....
r1     rack       Y         (s1 s2)
.....
End ComputeUnit
Update mbatchd by running:
# badmin reconfig
View the CU Configuration
# bmgroup -cu
Step 3: Using bhosts to display information
Since you are using "Y" in the CONDENSE column in lsb.hosts, bhosts displays the condensed compute unit view. If you run bhosts -X, you will see all the individual nodes.
You have to install Platform LSF 10.1 first. Please read Basic Configuration of Platform LSF 10.1
Step 1: Unpack the Platform Application Center
# tar -zxvf pac10.1_standard_linux-x64.tar.Z
# cd pac10.1_standard_linux-x64
The package is in the installation directory, $LSF_INSTALL/lsfshpc10.1-x86_64/pac/pac10.1_standard_linux-x64.
Step 2: Install MySQL with yum
# yum install mysql mysql-server mysql-connector-java
Step 2a: Configure MySQL (see How to Install MySQL on CentOS 6)
Step 3: Edit the pacinstall.sh
export MYSQL_JDBC_DRIVER_JAR="/usr/share/java/mysql-connector-java-5.1.17.jar" (Line 84)
Step 4: Complete the installation
Step 4a: Enable perfmon in your LSF Cluster
Optional. Enable perfmon in your LSF cluster to see the System Services Dashboard in IBM Spectrum LSF Application Center.
# badmin perfmon start
# badmin perfmon view
Step 4b: Set the IBM Spectrum LSF Application Center environment
# cp /opt/pac/profile.platform /etc/profile.d/pac1.sh
# source /etc/profile.d/pac1.sh
Step 4c: Start IBM Spectrum LSF Application Center services.
# perfadmin start all
# pmcadmin start
Step 4d: Check services have started.
# perfadmin list
# pmcadmin list
You should see that the WEBGUI, jobdt, plc, purger, and PNC services have started.
Step 5: Log in to IBM Spectrum LSF Application Center.
Browse to the web server URL and log in to the IBM Spectrum LSF Application Center with the IBM Spectrum LSF administrator name and password.
Step 6: Platform URL
When HTTPS is enabled, the web server URL is: https://host_name:8443/platform
Step 1: Preliminary Steps (Suggestions)
- Set up an NFS shared directory as the final installation destination (/opt/lsf)
- Use an NFS shared directory, for example /usr/local, to hold the tar file so that the installation files remain available for client nodes later (/usr/local/lsf_install)
- Make sure your /etc/hosts is configured correctly and SELinux is disabled
Step 2: Untar the LSF Tar file (lsfshpc10.1-x86_64.tar.gz).
# tar -zxvf lsfshpc10.1-x86_64.tar.gz
You will have a folder called lsfshpc10.1-x86_64.
Step 3: Navigate to lsfshpc10.1-x86_64/lsf.
You should see the following 2 files:
lsf10.1_linux2.6-glibc2.3-x86_64.tar.Z (LSF Distribution Package)
lsf10.1_lsfinstall_linux_x86_64.tar.Z (LSF Installation File)
Step 4: Unpack the LSF Installation File
# tar -zxvf lsf10.1_lsfinstall_linux_x86_64.tar.Z
Step 5: Edit the install.config file.
# vim /usr/local/lsf_install/lsfshpc10.1-x86_64/lsf/lsf10.1_lsfinstall/install.config
Critical fields (suggested values):
LSF_TOP="/opt/lsf" (line 43)
LSF_ADMINS="lsfadmin admin" (line 53)
LSF_CLUSTER_NAME="mycluster" (line 70)
LSF_MASTER_LIST="h00" (line 85)
LSF_TARDIR="/opt/lsf/lsf_distrib/" (line 95 - where you have placed the distribution)
CONFIGURATION_TEMPLATE="PARALLEL" (line 106)
LSF_ADD_CLIENTS="h00 c00" (line 165)
LSF_QUIET_INST="N" (line 193)
ENABLE_EGO="N" (line 290)
Step 6: Install using lsfinstall
# /usr/local/lsf_install/lsfshpc10.1-x86_64/lsf/lsf10.1_lsfinstall/lsfinstall -f install.config
Step 7: Follow the instructions and agree to the terms and conditions
Step 8: Create a file and Source the profile.lsf
# touch /etc/profile.d/lsf.sh
Inside the lsf.sh, put in the following line
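The original does not show the line, but with LSF_TOP set to /opt/lsf as in the install.config above, it is presumably:

```
# /etc/profile.d/lsf.sh -- load the LSF environment at login
source /opt/lsf/conf/profile.lsf
```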
Step 9: Create the user lsfadmin
# useradd -d /home/lsfadmin -g users -m lsfadmin
Step 10: Client Host setup
Copy /etc/profile.d/lsf.sh to the client’s /etc/profile.d/lsf.sh
# scp /etc/profile.d/lsf.sh remote_node:/etc/profile.d/
# cd /usr/local/lsf_install/lsfshpc10.1-x86_64/lsf/lsf10.1_lsfinstall/
# ./hostsetup --top="/opt/lsf" --boot="y"
Step 11: Restart the LSF services on the clients
# service lsf restart
Step 12: Restart the service on the headnode.
# lsadmin reconfig
# badmin mbdrestart
Step 13: Test the cluster with basic LSF Commands.
Run the lsid, lshosts, and bhosts commands and check that they produce output.