The sources for SHMEM and OMPI can be found at $HPCX_HOME/sources/.
For build details, refer to $HPCX_HOME/sources/ and the HPC-X README file.
Profiling MPI API Application with IPM
$ export IPM_KEYFILE=$HPCX_IPM_DIR/etc/ipm_key_mpi
$ export IPM_LOG=FULL
$ export LD_PRELOAD=$HPCX_IPM_DIR/lib/libipm.so
$ mpirun -x LD_PRELOAD <...>
$ $HPCX_IPM_DIR/bin/ipm_parse -html outfile.xml
For further details on profiling MPI API, please refer to: http://ipm-hpc.org/
The NVIDIA®-supplied version of IPM contains an additional feature (Barrier before Collective), not found in the standard package, that allows end users to easily determine the extent of application imbalance in applications which use collectives. This feature instruments each collective so that it calls MPI_Barrier() before calling the collective operation itself. Time spent in this MPI_Barrier() is not counted as communication time, so by running an application with and without the Barrier before Collective feature, the extent to which application imbalance is a factor in performance can be assessed.
The instrumentation can be applied on a per-collective basis, and is controlled by the following environment variables:
$ export IPM_ADD_BARRIER_TO_REDUCE=1
$ export IPM_ADD_BARRIER_TO_ALLREDUCE=1
$ export IPM_ADD_BARRIER_TO_GATHER=1
$ export IPM_ADD_BARRIER_TO_ALL_GATHER=1
$ export IPM_ADD_BARRIER_TO_ALLTOALL=1
$ export IPM_ADD_BARRIER_TO_ALLTOALLV=1
$ export IPM_ADD_BARRIER_TO_BROADCAST=1
$ export IPM_ADD_BARRIER_TO_SCATTER=1
$ export IPM_ADD_BARRIER_TO_SCATTERV=1
$ export IPM_ADD_BARRIER_TO_GATHERV=1
$ export IPM_ADD_BARRIER_TO_ALLGATHERV=1
$ export IPM_ADD_BARRIER_TO_REDUCE_SCATTER=1
By default, all values are set to '0'.
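Setting all twelve variables by hand for each comparison run is error-prone. The following is a minimal sketch, not part of IPM or HPC-X: the helper name `ipm_set_barriers` and the list variable `IPM_BARRIER_COLLECTIVES` are hypothetical, and the suffixes are copied from the variables listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with IPM or HPC-X): set every
# Barrier-before-Collective variable listed above to one value.
IPM_BARRIER_COLLECTIVES="REDUCE ALLREDUCE GATHER ALL_GATHER ALLTOALL ALLTOALLV BROADCAST SCATTER SCATTERV GATHERV ALLGATHERV REDUCE_SCATTER"

ipm_set_barriers() {
    local value="${1:-1}"   # 1 enables the barrier, 0 restores the default
    local name
    for name in $IPM_BARRIER_COLLECTIVES; do
        export "IPM_ADD_BARRIER_TO_${name}=${value}"
    done
}
```

Call `ipm_set_barriers 1` before the instrumented run and `ipm_set_barriers 0` before the baseline run. Whether the variables also need to be forwarded to remote ranks (for example with `mpirun -x`) depends on your launch setup.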
Rebuilding Open MPI
Rebuilding Open MPI Using a Helper Script
The $HPCX_HOME/utils/hpcx_rebuild.sh script can rebuild OMPI and UCX from HPC-X using the same sources and configuration. It also takes into account the HPC-X environments: vanilla, MT and CUDA.
For details, run:
$HPCX_HOME/utils/hpcx_rebuild.sh --help
Rebuilding Open MPI from HPC-X Sources
The HPC-X package contains the Open MPI sources, located in the $HPCX_HOME/sources/ folder. Further information can be found in the HPC-X README file.
$ HPCX_HOME=/path/to/extracted/hpcx
$ ./configure --prefix=${HPCX_HOME}/hpcx-ompi \
    --with-hcoll=${HPCX_HOME}/hcoll \
    --with-ucx=${HPCX_HOME}/ucx \
    --with-platform=contrib/platform/mellanox/optimized \
    --with-slurm --with-pmix
$ make -j9 all && make -j9 install
Open MPI and OpenSHMEM are pre-compiled with UCX and HCOLL, and use them by default.
If HPC-X is intended to be used with the SLURM PMIx plugin, Open MPI should be built against external PMIx, Libevent and HWLOC, and the same Libevent and PMIx libraries should be used for both SLURM and Open MPI.
Additional configuration options:
--with-pmix=<path-to-pmix>
--with-libevent=<path-to-libevent>
--with-hwloc=<path-to-hwloc>
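As a rough sketch of how these options combine with the configure options shown earlier, the helper below prints one argument per line and adds an external dependency only when a path is supplied. The function name `ompi_configure_args` and the variables `PMIX_DIR`, `LIBEVENT_DIR` and `HWLOC_DIR` are hypothetical placeholders, not HPC-X settings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the configure arguments for a build against
# external PMIx, Libevent and HWLOC. PMIX_DIR, LIBEVENT_DIR and HWLOC_DIR
# are placeholder variable names chosen for this sketch.
ompi_configure_args() {
    local hpcx_home="$1"
    printf '%s\n' \
        "--prefix=${hpcx_home}/hpcx-ompi" \
        "--with-hcoll=${hpcx_home}/hcoll" \
        "--with-ucx=${hpcx_home}/ucx" \
        "--with-platform=contrib/platform/mellanox/optimized" \
        "--with-slurm"
    # Use the external PMIx path when given, else the bare option.
    if [ -n "${PMIX_DIR:-}" ]; then
        printf '%s\n' "--with-pmix=${PMIX_DIR}"
    else
        printf '%s\n' "--with-pmix"
    fi
    # Add Libevent and HWLOC only when their install paths were provided.
    if [ -n "${LIBEVENT_DIR:-}" ]; then
        printf '%s\n' "--with-libevent=${LIBEVENT_DIR}"
    fi
    if [ -n "${HWLOC_DIR:-}" ]; then
        printf '%s\n' "--with-hwloc=${HWLOC_DIR}"
    fi
}
```

Assuming none of the paths contain whitespace, the output can be passed straight to configure, for example `./configure $(ompi_configure_args "$HPCX_HOME")`.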
Running MPI with HCOLL
HCOLL is disabled by default in HPC-X.
- Running with default HCOLL configuration parameters:
$ mpirun -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx4_0:1 <...>
- Running OSHMEM with HCOLL:
$ oshrun -mca scoll_mpi_enable 1 -mca scoll basic,mpi -mca coll_hcoll_enable 1 <...>
Direct Launch of Open MPI and OpenSHMEM using SLURM 'srun'
By default, HPC-X is not built with SLURM support. To use direct launch with srun, rebuild Open MPI or HPC-X against the SLURM version installed on the system, then launch the application as follows:
- Open MPI:
`env <MPI/OSHMEM-application-env> srun --mpi={pmi2|pmix} <srun-args> <mpi-app-args>`
All Open MPI/OpenSHMEM parameters that are supported by the mpirun/oshrun command line can be provided through environment variables using the following rule:
"-mca <param_name> <param-val>" => "export OMPI_MCA_<param_name>=<param-val>"
For example, an alternative to "-mca coll_hcoll_enable 1" with 'mpirun' is
"export OMPI_MCA_coll_hcoll_enable=1" with 'srun'.
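The rule above is mechanical, so it can be sketched as a small shell function. This is an illustration only: `mca_to_exports` is a hypothetical name, and the sketch assumes parameter values contain no whitespace or shell metacharacters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "-mca <param_name> <param-val>" pairs into the
# equivalent "export OMPI_MCA_<param_name>=<param-val>" lines for srun.
mca_to_exports() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -mca|--mca)
                if [ "$#" -lt 3 ]; then
                    echo "mca_to_exports: $1 needs a name and a value" >&2
                    return 1
                fi
                printf 'export OMPI_MCA_%s=%s\n' "$2" "$3"
                shift 3
                ;;
            *)
                shift   # not an MCA option, so skip it
                ;;
        esac
    done
}
```

For instance, `eval "$(mca_to_exports -mca coll_hcoll_enable 1)"` sets `OMPI_MCA_coll_hcoll_enable=1` in the current shell before calling srun.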
Process and Memory Affinity with Open MPI v5.0
Starting with Open MPI v5.0, the runtime environment transitioned from ORTE to the PMIx Reference RunTime Environment (PRRTE). Processor and memory affinity management is now handled by PRRTE, which operates as an independent submodule.
While mapping and binding still utilize hwloc and the familiar --map-by <object> and --bind-to <object> directives, the default mapping policy has changed from map-by-socket to map-by-core.
Processor affinity must be enabled for memory affinity to function. An unbound process may migrate off its local NUMA memory and lose locality performance benefits.
Default Mapping and Binding Policy: OMPI v4.1.x vs. v5.0.x
| Feature / Behavior | OMPI v4.1.x (ORTE) | OMPI v5.0.x (PRRTE 3.x) |
|---|---|---|
| Default Mapping | | |
| Default Binding | | |
| Placement (N ≪ cores) | Ranks round-robin across NUMA domains; load is spread across sockets. | Ranks placed sequentially on consecutive cores; N ranks all land on socket 0 until that socket fills. |
Practical Impact
For multi-threaded MPI codes, the v5.0.x default packs all ranks on the first socket. As a result, the threads of every rank contend for the same memory subsystem. Conversely, the v4.1.x default spreads one rank per NUMA domain, leaving each rank a full NUMA domain's worth of cores for its threads.
Recovering v4.1.x-Equivalent Behavior Under v5.0.x
To recover the legacy behavior, pass the mapping and binding rules explicitly.
For NUMA-granularity placement:
mpirun --map-by numa --bind-to numa -n N <app>
For socket-granularity placement:
mpirun --map-by socket --bind-to socket -n N <app>
Passing only --bind-to core does not restore spread placement. When a binding is specified without an explicit mapping, v5.0.x automatically sets the mapping to match the binding object (--bind-to core => --map-by core), producing the same packed layout on socket 0.
Always pair --bind-to with an explicit --map-by option if you need a specific rank distribution across your hardware topology.
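One way to avoid forgetting the mapping flag is to build the command through a wrapper that always emits both options. The sketch below is an assumption-laden illustration: `mpirun_placed_cmd` is a hypothetical name, it accepts only the two objects used in the commands above, and it prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: always pass mapping and binding together so the
# v5.0.x "mapping follows binding" default never applies.
mpirun_placed_cmd() {
    local object="$1"   # numa or socket, as in the commands above
    local nranks="$2"
    shift 2
    case "$object" in
        numa|socket) ;;
        *)
            echo "mpirun_placed_cmd: unsupported object '$object'" >&2
            return 1
            ;;
    esac
    # Print the command instead of running it, so it can be inspected first.
    echo "mpirun --map-by $object --bind-to $object -n $nranks $*"
}
```

For example, `mpirun_placed_cmd numa 4 ./app` prints the NUMA-granularity command shown above with N set to 4.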
References
- Open MPI Affinity Documentation: https://docs.open-mpi.org/en/main/tuning-apps/affinity.html
- PMIx and PRRTE Launching: https://docs.open-mpi.org/en/v5.0.x/launching-apps/pmix-and-prrte.html
- Up-to-date PRRTE Syntax: Run prterun --help map-by in your environment.