DOCA Perftest

This guide describes DOCA Perftest, an RDMA benchmarking tool designed for compute clusters that enables fine-tuned evaluation of bandwidth, message rate, and latency across various RDMA operations and complex multi-node scenarios.

Introduction

NVIDIA® doca-perftest is an RDMA benchmarking utility designed to evaluate performance across a wide range of compute and networking environments—from simple client-server tests to complex, distributed cluster scenarios.

It provides fine-grained benchmarking of bandwidth, message rate, and latency, while supporting diverse RDMA operations and configurations.

Key features:

Comprehensive RDMA Benchmarks – Supports bandwidth, message rate, and latency testing.
Unified RDMA Testing Tool – A single executable for all RDMA verbs, with rich configuration options and CUDA/GPUDirect RDMA integration.
Cluster-Wide Benchmarking – Run distributed tests across multiple nodes, initiated from a single host, with aggregated performance results.
Flexible Scenario Definition – Define complex multi-node, multi-test configurations via a JSON input file.
Command-Line Simplicity – Quickly run local or point-to-point benchmarks directly from the CLI.
Synchronized Execution – Ensures all benchmarks begin and end simultaneously for consistent results.

The doca-perftest utility simplifies evaluation and comparison of RDMA performance across applications and environments.

Comparison with Legacy Perftest

Unlike legacy RDMA benchmarking tools (such as ib_write_bw, ib_send_lat, etc.), doca-perftest is a native implementation designed for modern data centers.
It is not a wrapper; it is a standalone product that replaces both the legacy tools and the custom orchestration scripts often required to run them at scale.

Architectural Differences:

Feature	Legacy Perftest	DOCA Perftest
Scope	Point-to-Point (P2P) only	Single-node to Cluster-wide
Orchestration	Manual or third-party wrappers	Built-in (Single-host initiation)
Concurrency	Single-process per execution	Native multi-process/multi-core
Synchronization	Loose (Serial start)	Hardware-aligned (Synchronized start/stop)
Result Handling	Per-process manual extraction	Automatic cluster-wide aggregation

Key Benefits of Migration

Elimination of External Wrappers: Standard RDMA benchmarks require complex external scripts (Ansible, Bash, Python) to manage remote process launching, NUMA pinning, GPU selection and result parsing.
doca-perftest handles these natively via the CLI or JSON scenario files.
True Traffic Synchronization: In large-scale clusters, measuring fabric congestion or incast/outcast scenarios requires all nodes to hit the network simultaneously.
doca-perftest utilizes a centralized sync engine to ensure all processes begin and end traffic in a coordinated window, providing accuracy that is impossible to achieve with asynchronous legacy wrappers.
Native Multi-Processing: While legacy tools require running multiple instances to saturate high-speed links (e.g., 200G/400G+), doca-perftest scales linearly across cores within a single execution using the -N or -C flags.
Comprehensive Reporting: Rather than collecting individual output files from dozens of servers, doca-perftest provides a unified report.
This includes the full scenario definition, all raw results, and calculated aggregations per-device, per-node, and per-test.

Simple Point-To-Point Benchmarks

For simple benchmarks, doca-perftest can be run directly from the command line.

When invoked on the client, the utility automatically launches the corresponding server process (requires passwordless SSH) and selects optimal CPU cores on both systems based on NUMA affinity.

Example command:

Bash

# Run on client
doca_perftest -d mlx5_0 -n <server-host-name>

This is equivalent to running:

# On server
doca_perftest -d mlx5_0 -N 1 -c RC -v write -m bw -s 65536 -D 10

# On client
doca_perftest -d mlx5_0 -N 1 -c RC -v write -m bw -s 65536 -D 10 -n <server-host-name>

Parameter breakdown:

Parameter	Description
`-d mlx5_0`	Uses the device `mlx5_0`.
`-N 1`	Runs one process, automatically selecting an optimal core. (Use `-C <core>` to specify manually.)
`-c RC`	Uses a Reliable Connection (RC) transport.
`-v write`	Selects the Write verb for transmission.
`-m bw`	Measures bandwidth.
`-s 65536`	Sets message size to 65,536 bytes.
`-D 10`	Runs for 10 seconds.
`-n <server-host-name>`	(Client only) Specifies the remote target host.
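The default expansion described above can be sketched in a few lines of Python. This is only an illustration of how the short command maps to the fully spelled-out one; the helper name `expand_defaults` is hypothetical, and the default values are copied from the equivalent commands shown earlier in this section.

```python
import shlex

# Defaults taken from the "equivalent to" commands in this section.
DEFAULTS = [("-N", "1"), ("-c", "RC"), ("-v", "write"),
            ("-m", "bw"), ("-s", "65536"), ("-D", "10")]

def expand_defaults(device, server_host=None):
    """Return the fully spelled-out command line for one side of the test."""
    argv = ["doca_perftest", "-d", device]
    for flag, value in DEFAULTS:
        argv += [flag, value]
    if server_host is not None:  # -n is passed on the client only
        argv += ["-n", server_host]
    return shlex.join(argv)

print(expand_defaults("mlx5_0"))                 # server side
print(expand_defaults("mlx5_0", "server-host"))  # client side
```

When launching both sides manually, generating the two command lines from one place like this is a simple way to keep the parameters matched.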

For a full list of CLI arguments, run doca_perftest -h or man doca_perftest.

If passwordless SSH is not configured, you must manually run doca-perftest on both client and server, ensuring parameters match.

Complex Multi-Node Scenarios

For large-scale or multi-benchmark configurations, doca-perftest accepts a JSON input file defining all participating nodes, benchmarks, and parameters.

Example invocation:

Bash

doca_perftest -f path_to_scenario_file.json

JSON mode advantages:

Can be initiated from any node in the cluster (even non-participating ones).
Synchronizes benchmark start and stop across all nodes.
Aggregates all metrics on the initiating host.
Supports predefined traffic patterns such as ALL_TO_ALL, MANY_TO_ONE, ONE_TO_MANY, and BISECTION.
Fully compatible with all CLI parameters — JSON parameters inherit the same defaults.

Example JSON configuration files are provided under: /usr/share/doc/doca-perftest/examples/. It is recommended to start by copying and modifying an existing example file.

Benchmark Results

Bandwidth

Bandwidth tests measure the aggregate data transfer rate and message-handling efficiency across all participating processes.

Metrics collected:

Message Rate (Mpps): Number of Completion Queue Entries (CQEs) processed per second.
Bandwidth (Gb/s): Total throughput (bandwidth = message_rate × message_size).
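As a worked example of the relation between the two metrics, the following Python sketch converts between message rate and bandwidth. It assumes the message size is given in bytes, that Mpps means 10^6 messages per second, and that Gb/s means 10^9 bits per second; the function names are illustrative only.

```python
def bandwidth_gbps(rate_mpps, size_bytes):
    """Bandwidth in Gb/s from a message rate in Mpps and a message size in bytes."""
    messages_per_second = rate_mpps * 1e6
    bits_per_second = messages_per_second * size_bytes * 8
    return bits_per_second / 1e9

def message_rate_mpps(gbps, size_bytes):
    """Message rate in Mpps needed to reach a given bandwidth at a given size."""
    return gbps * 1e9 / (size_bytes * 8) / 1e6

# 0.5 Mpps of 65,536-byte messages
print(bandwidth_gbps(0.5, 65536))
# Message rate needed to fill 400 Gb/s with 64-byte messages
print(message_rate_mpps(400, 64))
```

The second call shows why small messages are limited by message rate rather than by link bandwidth.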

Measurement notes:

Results are aggregated across all active test processes.
Concurrency is controlled via -C (CLI) or the cores field (JSON).
Test duration is averaged across processes for consistent sampling.

Interpretation tips:

Observation	Possible Cause
High message rate, low bandwidth	Small message sizes
High bandwidth, moderate message rate	Larger messages or fewer CQEs

These results help optimize network saturation, queue depth, and core allocation strategies.

Latency

Latency tests measure the delay between message transmission and acknowledgment. The measured direction depends on the RDMA verb used.

RDMA verb modes:

Verb	Measurement Type
Send/Receive	One-way latency (Client → Server)
Write	Round-trip latency (Client → Server → Client)

Metrics collected:

Minimum latency – Fastest observed transaction
Maximum latency – Longest observed transaction
Mean latency – Average across all iterations
Median latency – Midpoint value (less influenced by outliers)
Standard deviation – Variability indicator
99% tail latency – 99% of messages completed within this time
99.9% tail latency – Outlier detection for extreme cases
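The metrics listed above can be reproduced from raw samples with a short Python sketch. This is not the tool's implementation: the nearest-rank percentile method and the population standard deviation are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import statistics

def percentile_nearest_rank(ordered, pct):
    """Smallest sample such that at least pct% of all samples are <= it."""
    # round() guards against floating-point noise before taking the ceiling
    rank = max(1, math.ceil(round(pct * len(ordered) / 100, 9)))
    return ordered[rank - 1]

def summarize_latency(samples_usec):
    """Summarize a list of latency samples (microseconds)."""
    ordered = sorted(samples_usec)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.pstdev(ordered),
        "p99": percentile_nearest_rank(ordered, 99),
        "p99_9": percentile_nearest_rank(ordered, 99.9),
    }

# 1000 samples: mostly 1 us, a few at 5 us, one outlier at 50 us
samples = [1.0] * 990 + [5.0] * 9 + [50.0]
print(summarize_latency(samples))
```

In this data set the mean and median stay low while the 99.9% tail and the maximum expose the outliers, which is the jitter pattern described in the interpretation tips.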

Measurement notes:

Latency is measured using tight RDMA verb loops.
Timing is collected on the sender side for accuracy.
Results are aggregated across processes for final reporting.

Interpretation tips:

Pattern	Insight
Low mean/median, high max/tail	Indicates jitter or queue buildup
Low standard deviation	Indicates stable and predictable performance
High 99%/99.9% tail	Indicates possible SLA breaches in real-time workloads

doca-perftest provides improved write latency accuracy over legacy perftest tools.

Differences in latency measurement methodologies exist; compare tools carefully when validating results.

Common Arguments and Use-Cases

This section highlights some of the most commonly used parameters and use-cases.

Unidirectional vs Bidirectional Traffic

doca-perftest supports two traffic-flow modes that fundamentally change how data moves between nodes and how resources are allocated.

Unidirectional Traffic (Default)

In unidirectional mode, traffic flows in one direction only.
The client (requestor) initiates operations, and the server (responder) receives them.
This is the default mode and provides clear, predictable performance metrics.

Bidirectional Traffic

In bidirectional mode, traffic flows in both directions simultaneously. Each side acts as both requestor and responder, creating full-duplex communication.

Bidirectional tests use two traffic runners (requestor and responder) that share resources, so the aggregate bandwidth may differ from 2× the unidirectional result.

Run bidirectional traffic from the command line:

# Enable bidirectional traffic
doca_perftest -d mlx5_0 -n <server-name> -b

For JSON mode, use the "trafficDirection" field and set it to "BIDIR" or "UNIDIR".

Traffic Patterns

Traffic patterns provide built-in shortcuts for complex multi-node communication scenarios.

While these configurations were always possible through detailed JSON definitions, traffic patterns dramatically simplify setup for common topologies.

Example JSONs using traffic patterns are available under /usr/share/doc/doca-perftest/examples.

Available patterns:

ONE_TO_ONE
ONE_TO_MANY
MANY_TO_ONE
ALL_TO_ALL
BISECTION

Multicast is not supported. Each connection is point-to-point, synchronized to start simultaneously.

Traffic patterns collapse complex multi-node wiring into a few lines of JSON. Instead of manually listing dozens of connections, you specify a regex-like host list and a pattern (e.g., ALL_TO_ALL), and doca-perftest generates and synchronizes all connections for you.

One-to-One (O2O)

Simple point-to-point between two nodes; useful for baseline performance testing.

JSON

"testNodes": [ {"hostname": "node01", "deviceName": "mlx5_0"},
               {"hostname": "node02", "deviceName": "mlx5_0"} ],
"trafficPattern": "ONE_TO_ONE"

One-to-Many (O2M)

Single sender to multiple receivers; the first node sends to all others.

JSON

"testNodes": [ {"hostname": "sender", "deviceName": "mlx5_0"},
               {"hostname": "receiver[1-10]", "deviceName": "mlx5_0"} ],
"trafficPattern": "ONE_TO_MANY"

This creates 10 connections: sender→receiver1, sender→receiver2, ..., sender→receiver10.

Many-to-One (M2O)

Multiple senders to one receiver; all nodes send to the first node.

JSON

"testNodes": [  {"hostname": "aggregator", "deviceName": "mlx5_0"},
                {"hostname": "client[01-20]", "deviceName": "mlx5_0"}  ],
"trafficPattern": "MANY_TO_ONE"

This creates 20 connections: client01→aggregator, client02→aggregator, ..., client20→aggregator.

All-to-All (A2A)

Full-mesh connectivity; every node connects to every other node.

JSON

"testNodes": [  {"hostname": "compute[01-16]", "deviceName": "mlx5_0"} ],
"trafficPattern": "ALL_TO_ALL",
"trafficDirection": "UNIDIR"

This creates 240 connections (16×15) for unidirectional, or 120 bidirectional pairs.

Bisection (B)

Divides nodes into two equal halves; the first half connects to the second half. Requires an even number of nodes.

JSON

"testNodes": [  {"hostname": "rack1-[01-10]", "deviceName": "mlx5_0"},
                {"hostname": "rack2-[01-10]", "deviceName": "mlx5_0"} ],
"trafficPattern": "BISECTION"

This creates 10 connections: rack1-01↔rack2-01, rack1-02↔rack2-02, ..., rack1-10↔rack2-10.
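The connection counts quoted for each pattern can be checked with a small Python sketch. It is a model of the pattern definitions in this section, not the tool's code: it assumes the node list is already expanded, and it pairs BISECTION nodes by position, as in the rack example. The function name `connections` is hypothetical.

```python
from itertools import permutations

def connections(pattern, nodes):
    """List (sender, receiver) pairs for an already-expanded node list."""
    if pattern == "ONE_TO_ONE":
        return [(nodes[0], nodes[1])]
    if pattern == "ONE_TO_MANY":   # the first node sends to all others
        return [(nodes[0], peer) for peer in nodes[1:]]
    if pattern == "MANY_TO_ONE":   # all other nodes send to the first node
        return [(peer, nodes[0]) for peer in nodes[1:]]
    if pattern == "ALL_TO_ALL":    # every ordered pair of distinct nodes
        return list(permutations(nodes, 2))
    if pattern == "BISECTION":     # i-th node of each half is paired
        if len(nodes) % 2:
            raise ValueError("BISECTION requires an even number of nodes")
        half = len(nodes) // 2
        return list(zip(nodes[:half], nodes[half:]))
    raise ValueError(f"unknown pattern: {pattern}")

compute = [f"compute{i:02d}" for i in range(1, 17)]
print(len(connections("ALL_TO_ALL", compute)))  # unidirectional connection count
```

Because ALL_TO_ALL grows with the square of the node count, estimating the number of connections this way before launching a large scenario can be useful.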

Per-Iteration-Sync Flow (Lock-Step Benchmarking)

The Per-Iteration-Sync flow is designed to mimic AI workloads and synchronous MPI collectives where data transfer occurs in distinct, synchronized steps without parallel computation.

Unlike standard "streaming" tests, every process in the cluster waits for all peers to complete the current iteration before anyone can proceed to the next.

Key Benefits:

Simulate Synchronous Workloads: Good for mirroring the behavior of MPI-like workloads.

Data Validation: Enables verifying data integrity at every single step of the test.

QP Modification: Allows for resetting or modifying QP parameters between iterations to test behavior from a clean state.

Key Requirements:

JSON Mode only: This flow cannot be triggered via CLI.

Iterations-based: The test must be defined by the number of iterations (duration mode is not supported).

Pattern: Requires the ALL_TO_ALL pattern and BIDIR traffic.

Logic and Implementation

The flow is built upon a bidirectional ALL-TO-ALL pattern. Each iteration is composed of three distinct phases:

Data Phase:
- Every process sends a data message to all peers.
- If using multiple QPs, the total msgSize is split across them.
- Each QP writes to a specific offset in the message to ensure the full buffer is utilized correctly.
Sync Phase:
- After the data transfer completes, each process sends a Sync Message to all peers.
- Under the hood: The Sync Message is a zero-length RDMA Write with Immediate Data.
Barrier:
- A process completes the iteration only once it has received confirmation for its own Sync Send and has received Sync Messages from all its peers.

Between Iterations (Post-Iteration Stage):
- This stage occurs after the synchronization is complete but before the next iteration's measurement starts.
- It allows management tasks that should not be included in the performance measurement, such as modifying QPs, checking data validation results, or updating pointers.
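To illustrate the multi-QP split in the Data Phase, the following Python sketch divides one message into per-QP chunks, each with its own offset. How doca-perftest handles a size that does not divide evenly is not stated here, so spreading the remainder over the first QPs is purely an assumption; `split_message` is a hypothetical name.

```python
def split_message(msg_size, num_qps):
    """Return one (offset, length) chunk per QP, together covering msg_size bytes."""
    base, extra = divmod(msg_size, num_qps)
    chunks, offset = [], 0
    for qp in range(num_qps):
        length = base + (1 if qp < extra else 0)  # spread any remainder (assumption)
        chunks.append((offset, length))
        offset += length
    return chunks

# A 65,536-byte message written by 4 QPs
for qp, (offset, length) in enumerate(split_message(65536, 4)):
    print(f"qp{qp}: offset={offset} length={length}")
```

The chunks are contiguous and non-overlapping, so the writes from all QPs together fill the whole buffer exactly once.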

Required Configuration

Set the following fields in your JSON input:

Field	Required Value	Notes
`iterationSync`	"true"	Activates the flow logic.
`trafficPattern`	"ALL_TO_ALL"	Use this even for 1:1 node connections.
`trafficDirection`	"BIDIR"	Flow requires bidirectional exchange.
`iterations`	<val>	Mandatory. You must define the run by the number of iterations. "Duration" mode is not supported for this flow.
`verb`	"write" or "writeImm"	Determines the verb used in the Data Phase.
`metric`	"bw"	Note: Both BW and Latency will be calculated.

Combined with Data Validation

When dataValidation is set to true, the flow performs a bit-exact verification of all received data at the end of every iteration.

This is highly effective for catching transient data corruption in complex A2A patterns with no impact on the performance measured, as the validation occurs during the 'Between Iterations' stage, outside of the timed interval.

Limitations and Constraints

Performance: Due to the heavy synchronization, the measured "streaming" bandwidth will be lower than a standard A2A test.
Connectivity: Currently restricted to ALL_TO_ALL patterns with BIDIR traffic.
Homogeneous Scenarios Only:
- A single scenario file cannot contain a mix of different synchronization types.
- All tests defined within the same JSON config must either be "Iteration-Sync" tests ("iterationSyncType": "write_imm") or standard tests ("iterationSyncType": "none").

Hostname and DeviceName Ranged Selection

doca-perftest supports bracket-based range expansion for hostnames and device names in JSON mode, streamlining configuration for multi-node and multi-device scenarios.

Key Features:

Range expansion: hostname=perf-host[0-3] expands to perf-host0, perf-host1, perf-host2, perf-host3

Comma-separated values: hostname=perf-host[0,2,4] expands to perf-host0, perf-host2, perf-host4

Cartesian product expansion: hostname=perf-host[0-1] with devicename=mlx5_[0-1] creates 4 connections (2 hosts × 2 devices)

Zero-padding preservation: [00-03] maintains formatting as 00, 01, 02, 03

This feature significantly reduces configuration complexity for large-scale JSON scenario files.
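The expansion rules can be modeled with a short Python sketch. It is an approximation written from the rules listed in this section, not the parser used by doca-perftest; the function names are hypothetical, and treating a leading zero in the range start as the padding width is an assumption.

```python
import re
from itertools import product

_BRACKET = re.compile(r"\[([^\]]+)\]")

def expand(pattern):
    """Expand bracket ranges such as host[0-3], host[0,2,4] or host[00-03]."""
    match = _BRACKET.search(pattern)
    if match is None:
        return [pattern]
    values = []
    for item in match.group(1).split(","):
        if "-" in item:
            start, end = item.split("-")
            width = len(start) if start.startswith("0") else 0  # keep zero padding
            values += [str(n).zfill(width) for n in range(int(start), int(end) + 1)]
        else:
            values.append(item)
    head, tail = pattern[:match.start()], pattern[match.end():]
    # Recurse so that a second bracket group in the same name is expanded too.
    return [head + value + rest for value in values for rest in expand(tail)]

def expand_nodes(hostname, device_name):
    """Cartesian product of expanded hostnames and device names."""
    return list(product(expand(hostname), expand(device_name)))

print(expand("perf-host[00-03]"))
print(expand_nodes("perf-host[0-1]", "mlx5_[0-1]"))
```

Running a scenario's host and device strings through a model like this before launch makes it easy to see how many connections a compact definition will produce.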

Multiprocess (Cores)

doca-perftest can run synchronized multi-process tests, ensuring traffic starts simultaneously across all cores.

By default, it runs a single process on one automatically selected core.

Process and core selection:

Option	Description
`-N` / `"num_processes"`	Number of processes; cores auto-selected.
`-C` / `"cores"`	Explicitly specify core IDs or ranges.

Examples:

Bash

# Run on 3 synchronized processes (cores auto-selected)
doca_perftest -d mlx5_0 -n <server> -N 3

# Run on specific cores
doca_perftest -d mlx5_0 -n <server> -C 5
doca_perftest -d mlx5_0 -n <server> -C 5,7
doca_perftest -d mlx5_0 -n <server> -C 5-9
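The three core-list forms shown above (single ID, comma-separated list, range) can be interpreted as in the following Python sketch. It is only a model of the argument format; whether forms may be mixed (for example `0-3,8`) and whether exactly one process runs per listed core are assumptions, and `parse_cores` is a hypothetical name.

```python
def parse_cores(spec):
    """Parse '5', '5,7' or '5-9' into a sorted list of core IDs."""
    cores = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = (int(x) for x in part.split("-"))
            cores.update(range(first, last + 1))
        else:
            cores.add(int(part))
    return sorted(cores)

for spec in ("5", "5,7", "5-9"):
    print(spec, "->", parse_cores(spec))
```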

Working with GPUs – Device Selection

doca-perftest can automatically select the most suitable GPU for each network device based on PCIe topology proximity. The ranking follows NVIDIA's nvidia-smi topo hierarchy: NV > PIX > PXB > PHB > NODE > SYS.

This ensures that the GPU closest to the NIC is chosen, minimizing latency and maximizing throughput.
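The ranking can be expressed as a small Python sketch. It assumes the GPU-to-NIC link labels have already been collected (for example by reading the matrix printed by `nvidia-smi topo -m`), and the tie-break on the lowest GPU index is an assumption; it is not the selection code used by doca-perftest.

```python
# Ranking from the text: NV > PIX > PXB > PHB > NODE > SYS (lower index = closer).
TOPO_ORDER = ["NV", "PIX", "PXB", "PHB", "NODE", "SYS"]

def topo_rank(link):
    """Map a topology label to its rank; 'NV4', 'NV12', ... all count as NV."""
    key = "NV" if link.startswith("NV") else link
    return TOPO_ORDER.index(key)

def closest_gpu(gpu_links):
    """Pick the GPU whose link to the NIC ranks best; ties go to the lowest index."""
    return min(gpu_links, key=lambda gpu: (topo_rank(gpu_links[gpu]), gpu))

# Hypothetical link types between four GPUs and one NIC
links = {0: "SYS", 1: "PXB", 2: "NODE", 3: "PIX"}
print(closest_gpu(links))
```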

Although auto-selection is the default behavior, users can still manually specify a GPU device using the -G argument in CLI mode, or the "cuda_dev" field in JSON mode.

Bash

# Manually choose a specific GPU
doca_perftest -d mlx5_0 -n server-name -G 0

# Automatically select both GPU and memory type (recommended)
doca_perftest -d mlx5_0 -n server-name -M cuda

# Deprecated syntax (still supported, equivalent to cuda_auto_detect)
doca_perftest -d mlx5_0 -n server-name --cuda 0

Working with GPUs – Memory Types

RDMA operations can leverage GPU memory directly, bypassing CPU involvement for maximum throughput and minimal latency.

doca-perftest supports several CUDA memory modes optimized for different hardware and driver configurations.

Auto-Detection Mode (cuda_auto_detect)

Automatically selects the best available CUDA memory type in this order:

Data Direct
DMA-BUF
Peermem

This is the recommended mode for most users.

Examples:

Bash

# Auto-detect best GPU memory type (recommended)
doca_perftest -d mlx5_0 -n server-name -M cuda -G 0

# With custom CUDA library path
doca_perftest -d mlx5_0 -n server-name -M cuda -G 0 --cuda_lib_path /usr/local/cuda-12/lib64

# Deprecated but equivalent syntax
doca_perftest -d mlx5_0 -n server-name --cuda 0

Fallback behavior: With -M cuda_auto_detect, doca_perftest automatically tries cuda_data_direct → cuda_dmabuf → cuda_peermem in this order.

Standard CUDA Memory (cuda_peermem)

Traditional CUDA peer-memory allocation.

Supported on all CUDA-capable systems, though with slightly higher overhead compared to newer methods.

# Explicitly force peermem (bypasses auto-detect)
doca_perftest -d mlx5_0 -n server-name -M cuda_peermem -G 0

# Auto-detect fallback order (when using -M cuda_auto_detect):
#   1) cuda_data_direct (fastest, requires HW/driver support)
#   2) cuda_dmabuf
#   3) cuda_peermem (universal fallback)

DMA-BUF Memory (cuda_dmabuf)

Uses the Linux DMA-BUF framework for zero-copy GPU–NIC transfers. Requires CUDA 11.7+ and kernel support.

Bash

doca_perftest -d mlx5_0 -n server-name -M cuda_dmabuf -G 0

Data Direct Memory (cuda_data_direct)

Most efficient GPU memory access method using direct PCIe mappings. Requires specific hardware and driver support; provides the lowest latency and highest throughput.

Bash

doca_perftest -d mlx5_0 -n server-name -M cuda_data_direct -G 0

Memory Types

Beyond GPU memory types, doca-perftest supports several memory allocation strategies for RDMA operations.

Host Memory (host)

Default mode using standard system RAM.

# Default host memory usage
doca_perftest -d mlx5_0 -n <server-name>

# Explicitly specify host memory
doca_perftest -d mlx5_0 -n <server-name> -M host

Null Memory Region (nullmr)

Does not allocate real memory; useful for ultra-low-latency synthetic tests.

Bash

# Null memory region for bandwidth testing
doca_perftest -d mlx5_0 -n <server-name> -M nullmr

Device Memory (device)

Allocates memory directly on the adapter hardware (limited by on-board capacity).

Bash

# Device memory allocated on the adapter
doca_perftest -d mlx5_0 -n <server-name> -M device

RDMA Drivers

The following RDMA driver backends are supported:

The available drivers depend on your installed packages and hardware.

Driver	Prerequisites	Usage
IBV (libibverbs)	The standard RDMA Verbs delivered as part of DOCA-OFED (and standard inbox drivers). Recommended for general compatibility across all IB/RoCE adapters.	`-r ibv` (default)
DV (doca_verbs)	The specialized DOCA RDMA Verbs backend. This provides a high-performance alternative to standard verbs and is strategically aligned with the DOCA SDK ecosystem.	`-r dv`

Auto-Launching Remote Server

doca-perftest can automatically launch the remote server via SSH (CLI-only).

Requires passwordless SSH and identical versions on both sides.

Bash

# Auto-launch server (default)
doca_perftest -d mlx5_0 -n server-name

# Disable auto-launch
doca_perftest -d mlx5_0 -n server-name --launch_server disable

Server override examples:

Bash

# Server uses different device than client
doca_perftest -d mlx5_0 -n server-name --server_device mlx5_1

# Server uses different memory type
doca_perftest -d mlx5_0 -n server-name -M host --server_mem_type cuda_auto_detect

# Server runs on specific cores
doca_perftest -d mlx5_0 -n server-name -C 0-3 --server_cores 4-7

# Alternate server executable path
doca_perftest -d mlx5_0 -n server-name --server_exe /tmp/other_doca_perftest_version

# Different SSH username, supported by passwordless-ssh
doca_perftest -d mlx5_0 -n server-name --server_username testuser

QP Histogram

The QP histogram provides visibility into how work is distributed across multiple queue pairs during a test. This is useful for identifying load balancing issues, scheduling inefficiencies, or hardware limitations when using multiple QPs.

Enabling QP histogram:

Bash

# Enable QP histogram with multiple queue pairs
doca_perftest -d mlx5_0 -n server-name -q 8 -H

Example output:

Bash

--------------------- QP WORK DISTRIBUTION ---------------------
Qp num 0:  ████████████████████████                     45.23 Gbit/sec  |  Relative deviation: -2.1%
Qp num 1:  █████████████████████████                    46.89 Gbit/sec  |  Relative deviation: 1.5%
Qp num 2:  ████████████████████████                     45.67 Gbit/sec  |  Relative deviation: -1.2%
Qp num 3:  █████████████████████████████                48.21 Gbit/sec  |  Relative deviation: 4.3%

Start PSN (Packet Sequence Number)

Start PSN lets you control the initial packet sequence number for each Queue Pair (QP) at connection bring‑up.
This is useful for reproducibility, interoperability testing, and debugging sequence‑sensitive behavior.
If not provided, the tool generates a random start PSN per QP.

How to use it:

CLI: Provide a comma‑separated list of PSN values with the start‑PSN flag. The number of values must match the number of QPs.
JSON input: Provide a startPsn object in the test config with one entry per QP (qp0, qp1, …). The QP keys must be contiguous and start at qp0.
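Both input forms can be generated from one list of values, as in this Python sketch. The 24-bit range check reflects the PSN width in the InfiniBand transport header; representing the JSON values as integers is an assumption, and `start_psn_config` is a hypothetical helper name.

```python
import json

PSN_MAX = (1 << 24) - 1  # PSNs are 24-bit values

def start_psn_config(psns):
    """Return (cli_value, json_object) for a list of per-QP start PSNs."""
    for psn in psns:
        if not 0 <= psn <= PSN_MAX:
            raise ValueError(f"PSN {psn} does not fit in 24 bits")
    cli_value = ",".join(str(psn) for psn in psns)
    # Keys are contiguous and start at qp0, as required for JSON input.
    json_object = {"startPsn": {f"qp{i}": psn for i, psn in enumerate(psns)}}
    return cli_value, json_object

cli_value, json_object = start_psn_config([100, 200, 300])
print(cli_value)
print(json.dumps(json_object))
```

Generating the keys with `enumerate` guarantees one entry per QP with no gaps in the numbering.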

Data Validation

Data validation verifies that RDMA traffic is not corrupted during bandwidth tests.
When enabled, the requestor generates a deterministic payload for each message, and the responder compares the received data against the expected pattern.
The validation result is reported as a count of invalid samples and a pass/fail summary.

JSON input:

Set the dataValidation field to true in the test configuration.
No other JSON changes are required beyond meeting the constraints listed below.

JSON output:

When validation is enabled, results include a validationResults section.
The key metric is invalidDataSampleCount, which is the total number of messages that failed validation.
If validation is not enabled, the validation section is omitted.

Constraints and limitations:

Bandwidth tests only: Validation is supported only for bandwidth tests, not latency.
Verb restrictions: Validation is supported for send tests, or when running in iteration-sync mode.
No warmup: Warmup time must be disabled when validation is enabled.
Depth requirements: rxDepth must be greater than or equal to txDepth.
ER auto mode is disabled: If enhanced reliability (ER) auto mode is set, it will be automatically disabled when validation is enabled.
Logging cap: Only the first 5,000 invalid samples are logged individually; additional failures are counted but not logged.
Performance impact: Expect lower bandwidth results due to validation work and buffer handling (except in iteration-sync mode).

ECE (Enhanced Connection Establishment)

ECE is an optional RDMA setup step that aligns connection capabilities between client and server before traffic starts.
If selected, doca-perftest exchanges ECE parameters between clients and servers for each connection.
It relies on the hardware-firmware negotiation that occurs when a Queue Pair (QP) transitions from RESET to INIT.

High‑Level Flow

Client discovery (pre‑INIT): The client queries its local ECE capabilities and sends them to the server over the control channel.
Server negotiation trigger: The server applies the client proposal, transitions its QP to INIT, and then queries the final accepted ECE configuration from the device.
Server to client exchange: The server sends the finalized ECE configuration back to the client.
Client commit & verify: The client applies the finalized ECE config, transitions its QP to INIT, and validates the negotiated result.
Traffic setup continues: Standard QP data exchange and RTR/RTS transitions proceed as usual.

How to Use

CLI - Enable ECE with the --use_ece flag.
JSON - Set useEce: true in the test configuration.

Limitations & Constraints

Driver: Supported only with the libibverbs driver (ibv); doca-verbs support is planned for an upcoming release.
Connection type: RC only.

QP Hints

DOCA RDMA Verbs supports setting an opaque object with CC hints to be used by the PCC algorithm.

doca-perftest allows you to provide a path to a hints binary file, file size, vendor ID and format ID, which will be passed to the PCC via the DOCA RDMA Verbs driver.

This option is only available for DOCA RDMA Verbs (dv) and the values can be set by:

CLI: --cc_group_hints <file_path>,<file_size>,<vendor_id>,<format_id> for command line input.

JSON:

"ccGroupHints": {"filePath": <path>,
                  "fileSize": <val>,
                  "vendorId": <val>,
                  "formatId": <val>
                }
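Since the file size has to be supplied alongside the path, it can be convenient to derive both input forms from the file itself. The Python sketch below does this under two assumptions: that the expected size is the on-disk size in bytes, and that the vendor and format IDs are plain values passed through unchanged. `cc_group_hints` is a hypothetical helper name.

```python
import json
import os

def cc_group_hints(file_path, vendor_id, format_id):
    """Build the CLI value and the JSON object for a hints binary file."""
    file_size = os.path.getsize(file_path)  # on-disk size in bytes (assumption)
    cli_value = f"{file_path},{file_size},{vendor_id},{format_id}"
    json_object = {"ccGroupHints": {"filePath": file_path,
                                    "fileSize": file_size,
                                    "vendorId": vendor_id,
                                    "formatId": format_id}}
    return cli_value, json_object

# Usage (the path and IDs are placeholders):
#   cli_value, json_object = cc_group_hints("/path/to/hints.bin", 1, 2)
#   print("--cc_group_hints", cli_value)
#   print(json.dumps(json_object, indent=2))
```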

TPH

PCIe optimization providing hints to CPUs for cache management and reduced memory-access latency.

Requires ConnectX-6 or later hardware and a TPH-enabled kernel.

Parameters:

Option	Meaning
`--ph`	Processing hint: 0 = Bidirectional (default), 1 = Requester, 2 = Completer, 3 = High-priority completer
`--tph_core_id`	Target CPU core for TPH handling
`--tph_mem`	Memory type: `pm` = Persistent, `vm` = Volatile

Examples:

Bash

# Invalid: Core ID without memory type
doca_perftest -d mlx5_0 -n server-name --tph_core_id 0  # ERROR

# Invalid: Memory type without core ID  
doca_perftest -d mlx5_0 -n server-name --tph_mem pm  # ERROR

# Valid: Both or neither
doca_perftest -d mlx5_0 -n server-name --ph 1  # OK (hints only)
doca_perftest -d mlx5_0 -n server-name --ph 1 --tph_core_id 0 --tph_mem pm  # OK (full config)
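The rule behind these examples (core ID and memory type must be given together, or not at all) can be written as a small pre-flight check. The Python sketch below is a model of the constraints stated in this section, with a hypothetical function name; it does not reproduce the tool's actual error messages.

```python
def check_tph_args(ph=None, tph_core_id=None, tph_mem=None):
    """Return a list of problems; an empty list means the combination is valid."""
    problems = []
    if ph is not None and ph not in (0, 1, 2, 3):
        problems.append("--ph must be 0, 1, 2 or 3")
    if (tph_core_id is None) != (tph_mem is None):
        problems.append("--tph_core_id and --tph_mem must be given together")
    if tph_mem is not None and tph_mem not in ("pm", "vm"):
        problems.append("--tph_mem must be 'pm' or 'vm'")
    return problems

print(check_tph_args(tph_core_id=0))                       # invalid
print(check_tph_args(ph=1))                                # valid, hints only
print(check_tph_args(ph=1, tph_core_id=0, tph_mem="pm"))   # valid, full config
```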

SLURM

doca-perftest integrates seamlessly with SLURM job schedulers, leveraging MPI for multi-node orchestration within SLURM allocations.

The following is a basic usage example with salloc:

Allocate nodes via SLURM (e.g., salloc -N8).

Update the JSON to include the allocated nodes. Simple bisection example:

"testNodes": [  {"hostname": "rack1-[00-03]", "deviceName": "mlx5_0"},
                {"hostname": "rack2-[04-07]", "deviceName": "mlx5_0"} ],
"trafficPattern": "BISECTION"

Run doca-perftest with the updated JSON file:

# Run the updated scenario file
doca_perftest -f <updated-json>
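The node-list update step can be scripted. The Python sketch below builds only the `testNodes` and `trafficPattern` fragment from a list of allocated hostnames; the rest of the scenario file is not shown and must come from an existing example. The host names are placeholders, listing one entry per host instead of a bracket range is an assumption about what the tool accepts, and `bisection_nodes` is a hypothetical helper name.

```python
import json

def bisection_nodes(hostnames, device_name="mlx5_0"):
    """Build a testNodes/trafficPattern fragment from an allocated host list."""
    if len(hostnames) % 2:
        raise ValueError("BISECTION requires an even number of nodes")
    return {
        "testNodes": [{"hostname": host, "deviceName": device_name}
                      for host in hostnames],
        "trafficPattern": "BISECTION",
    }

# Placeholder host names; inside a real allocation the list could be obtained
# with: scontrol show hostnames "$SLURM_JOB_NODELIST"
hosts = [f"node{i:02d}" for i in range(1, 9)]
print(json.dumps(bisection_nodes(hosts), indent=2))
```

Checking for an even node count up front avoids submitting a bisection scenario that cannot be split into two equal halves.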

Last updated: January 29, 2026