@@ -16,7 +16,7 @@ This is an extremely terse summary of how to use ULFM:
    shell$ ./configure --with-ft=ulfm [...options...]
    shell$ make [-j N] all install
    shell$ mpicc my-ft-program.c -o my-ft-program
-   shell$ mpiexec -n 4 --with-ft ulfm my-ft-program
+   shell$ mpirun -n 4 --with-ft ulfm my-ft-program

 Features
 --------
@@ -144,14 +144,15 @@ Running your application
 ^^^^^^^^^^^^^^^^^^^^^^^^

 You can launch your application with fault tolerance by simply using
-the normal Open MPI ``mpiexec`` launcher, with the
+the normal Open MPI ``mpirun`` launcher, with the
 ``--with-ft ulfm`` CLI option (or its synonym ``--with-ft mpi``):

 .. code-block::

    shell$ mpirun --with-ft ulfm ...

-.. important:: by default, fault tolerance is not active.
+.. important:: By default, fault tolerance is not active at run time.
+               It must be enabled via ``--with-ft ulfm``.

 Running under a batch scheduler
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -160,14 +161,14 @@ ULFM can operate under a job/batch scheduler, and is tested routinely
 with ALPS, PBS, and Slurm. One difficulty comes from the fact that
 many job schedulers will "clean up" the application as soon as any
 process fails. In order to avoid this problem, it is preferred that
-you use ``mpiexec`` within an allocation (e.g., ``salloc``,
+you use ``mpirun`` within an allocation (e.g., ``salloc``,
 ``sbatch``, ``qsub``) rather than a direct launch (e.g., ``srun``).

 * SLURM is tested and supported with fault tolerance.

   .. important:: Do not use ``srun``, or your application gets killed
                  by the scheduler upon the first failure. Instead,
-                 use ``mpirun`` in an ``salloc/sbatch`` allocation.
+                 use ``mpirun`` in an ``salloc`` or ``sbatch`` allocation.

 * LSF is untested with fault tolerance.

@@ -186,8 +187,8 @@ errmgr_detector_bar <value>`` for PRTE options.

 .. important:: The main control for enabling/disabling fault tolerance
                at runtime is the ``--with-ft ulfm`` (or its synonym
-               ``--with-ft mpi``) ``mpiexec`` CLI option. This option
-               setup multiple subsystems of Open MPI to enable fault
+               ``--with-ft mpi``) ``mpirun`` CLI option. This option
+               sets up multiple subsystems in Open MPI to enable fault
                tolerance. The options described below are best used to
                override the default behavior after the ``--with-ft ulfm``
                option is used.
@@ -197,7 +198,7 @@ PRTE level options

 * ``prrte_enable_ft <true|false> (default: false)`` controls
   automatic cleanup of apps with failed processes within
-  mpirun. This option is automatically set to ``true`` when using
+  ``mpirun``. This option is automatically set to ``true`` when using
   ``--with-ft ulfm``.
 * ``errmgr_detector_priority <int> (default: 1005)`` selects the
   PRRTE-based failure detector. Only available when
@@ -216,17 +217,17 @@ PRTE level options
 Open MPI level options
 ~~~~~~~~~~~~~~~~~~~~~~

-Some default values are applied to some Open MPI parameters when using
-``mpiexec --with-ft ulfm``. These defaults are obtained from the ``ft-mpi``
+Default values are applied to some Open MPI parameters when using
+``mpirun --with-ft ulfm``. These defaults are obtained from the ``ft-mpi``
 aggregate MCA param file
 ``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
-runtime behavior with ULFM by either setting or unsetting variables in
+runtime behavior of ULFM by either setting or unsetting variables in
 this file, or by overriding the variable on the command line (e.g.,
 ``--mca btl ofi,self``).

 .. important:: Note that if fault tolerance is disabled at runtime
-               that is, when not using ``--with-ft ulfm``), the
-               ``ft-mpi`` MCA param file is not loaded, thus
+               (that is, when not using ``--with-ft ulfm``), the
+               ``ft-mpi`` AMCA param file is not loaded, thus
                components that are unsafe for fault tolerance will
                load normally (this may change observed performance
                when comparing with and without fault tolerance).
@@ -260,16 +261,16 @@ this file, or by overriding the variable on the command line (e.g.,
   latency (typically 1us increase).* You may want to **enable this
   option if you experience false positive** processes incorrectly
   reported as failed with the Open MPI failure detector.
-  This option is only relevant when `mpi_ft_detector` is `true`.
+  This option is only relevant when ``mpi_ft_detector`` is ``true``.
 * ``mpi_ft_detector_period <float> (default: 3e0 seconds)`` heartbeat
   period. Recommended value is 1/3 of the timeout. *Values lower than
   100us may impart a noticeable effect on latency (typically a 3us
   increase).*
-  This option is only relevant when `mpi_ft_detector` is `true`.
+  This option is only relevant when ``mpi_ft_detector`` is ``true``.
 * ``mpi_ft_detector_timeout <float> (default: 1e1 seconds)`` heartbeat
   timeout (i.e. failure detection speed). Recommended value is 3 times
   the heartbeat period.
-  This option is only relevant when `mpi_ft_detector` is `true`.
+  This option is only relevant when ``mpi_ft_detector`` is ``true``.

 Known Limitations in ULFM
 -------------------------
@@ -282,24 +283,20 @@ Known Limitations in ULFM
 Modified, Untested and Disabled Components
 ------------------------------------------

-Frameworks and components which are not listed in the following list
-are unmodified and support fault tolerance. Listed frameworks may be
-**modified** (and work after a failure), **untested** (and work before
-a failure, but may malfunction after a failure), or **disabled** (they
-cause unspecified behavior all around when FT is enabled).
+Frameworks and components are listed below and categorized into one of
+three classifications:

-All runtime disabled components are listed in the ``ft-mpi`` aggregate
-MCA param file
-``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
-runtime behavior with ULFM by either setting or unsetting variables in
-this file (or by overriding the variable on the command line (e.g.,
-``--mca btl ofi,self``).
+1. **Modified:** This framework/component has been specifically modified
+   such that it will continue to work after a failure.
+2. **Untested:** This framework/component has not been modified and/or
+   tested with fault tolerance scenarios, and *may* malfunction
+   after a failure.
+3. **Disabled:** This framework/component will cause unspecified behavior when
+   fault tolerance is enabled.

-.. important:: Note that if fault tolerance is disabled at runtime,
-               the ``ft-mpi`` MCA param file is not loaded, thus
-               components that are unsafe for fault tolerance will
-               load normally (this may change observed performance
-               when comparing with and without fault tolerance).
+Any framework or component not listed below is categorized as **Unmodified**,
+meaning that it is unmodified for fault tolerance, but will continue to work
+correctly after a failure.

 * ``pml``: MPI point-to-point management layer

@@ -343,8 +340,7 @@ this file (or by overriding the variable on the command line (e.g.,

 * ``vprotocol``: Checkpoint/Restart components

-  * These components have not been modified to handle faults, and are
-    **untested**.
+  * All ``vprotocol`` components are **untested**.

 * ``threads``, ``wait-sync``: Multithreaded wait-synchronization
   object
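The detector options quoted in this diff recommend a heartbeat period of one third of the timeout. As a rough illustration of keeping the two values in step, the sketch below derives the period from a chosen timeout and prints the matching `--mca` arguments. The helper name `ft_detector_args` is hypothetical (it is not part of Open MPI). It assumes that `mpi_ft_detector` accepts `true` on the command line, as the option descriptions imply, and it only prints arguments rather than launching anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): given a heartbeat timeout in
# seconds, print --mca arguments that set the period to timeout / 3, the
# ratio recommended in the option descriptions.
ft_detector_args() {
    timeout="$1"
    # awk handles fractional timeouts; %g drops trailing zeros
    period=$(awk -v t="$timeout" 'BEGIN { printf "%g", t / 3 }')
    printf '%s\n' "--mca mpi_ft_detector true --mca mpi_ft_detector_timeout $timeout --mca mpi_ft_detector_period $period"
}

# Print the argument string for a 3 second timeout
ft_detector_args 3
```

The printed arguments could then be appended to a launch line, for example `mpirun --with-ft ulfm $(ft_detector_args 3) ./my-ft-program`.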