Debugging Tools to track run-time errors for mpirun

If you are having unexplained issues with mpirun, there are several methods you can use to troubleshoot.

Information on “--mca orte_base_help_aggregate 0”

If your mpirun dies without any error messages, you may want to read the following entry from the Open MPI FAQ, under “Debugging applications in parallel”, item 7: “My process dies without any output. Why?”

If your application fails due to memory corruption, Open MPI may subsequently fail to output an error message before dying. Specifically, starting with v1.3, Open MPI attempts to aggregate error messages from multiple processes in an attempt to show unique error messages only once (vs. one for each MPI process, which can be unwieldy, especially when running large MPI jobs).

However, this aggregation process requires allocating memory in the MPI process when it displays the error message. If the process’ memory is already corrupted, Open MPI’s attempt to allocate memory may fail and the process will simply die, possibly silently. When Open MPI does not attempt to aggregate error messages, most of its setup work is done during MPI_INIT and no memory is allocated during the “print the error” routine. It therefore almost always successfully outputs error messages in real time, but at the expense that you’ll potentially see the same error message for each MPI process that encountered the error.

Hence, the error message aggregation is usually a good thing, but sometimes it can mask a real error. You can disable Open MPI’s error message aggregation with the orte_base_help_aggregate MCA parameter. For example:

 $ mpirun --mca orte_base_help_aggregate 0 ...
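
If you would rather not add the flag to every command line, the same setting can be supplied through the environment. Open MPI reads MCA parameters from environment variables prefixed with OMPI_MCA_. The sketch below is only an illustration: it assumes an Open MPI version that still uses the orte_base_help_aggregate parameter name, and the application name ./my_mpi_app and the process count are placeholders that do not come from the FAQ.

```shell
# Equivalent to passing "--mca orte_base_help_aggregate 0" on the command line:
# Open MPI picks up MCA parameters from OMPI_MCA_<parameter_name> variables.
export OMPI_MCA_orte_base_help_aggregate=0

# Hypothetical launch (placeholder binary and process count); uncomment to use.
# mpirun -np 4 ./my_mpi_app

# Confirm what the current shell will hand to mpirun.
echo "orte_base_help_aggregate=${OMPI_MCA_orte_base_help_aggregate}"
```

With aggregation disabled you should expect the same error to be printed once per failing process, so it is probably best to unset the variable again once the underlying problem has been found.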
