DOCA SDK Documentation

DOCA Perftest

This guide describes DOCA Perftest, an RDMA benchmarking tool designed for compute clusters that enables fine-tuned evaluation of bandwidth, message rate, and latency across various RDMA operations and complex multi-node scenarios.

Introduction

NVIDIA® DOCA Perftest is an RDMA benchmarking utility designed to evaluate performance across a wide range of compute and networking environments (from simple client-server tests to complex, distributed cluster scenarios).

It provides fine-grained benchmarking of bandwidth, message rate, and latency while supporting diverse RDMA operations and configurations.

Key features:

  • Comprehensive RDMA benchmarks – evaluates bandwidth, message rate, and latency.

  • Unified RDMA testing tool – delivers a single executable for all RDMA verbs, featuring rich configuration options and CUDA/GPUDirect RDMA integration.

  • Cluster-wide benchmarking – executes distributed tests across multiple nodes (initiated from a single host) and aggregates the performance results.

  • Flexible scenario definition – defines complex multi-node, multi-test configurations via a JSON input file.

  • CLI simplicity – runs local or point-to-point benchmarks quickly and directly from the CLI.

  • Synchronized execution – ensures all benchmarks begin and end simultaneously to guarantee consistent results.

The doca-perftest utility simplifies the evaluation and comparison of RDMA performance across applications and environments.

Comparison with Legacy Perftest

Unlike legacy RDMA benchmarking tools (e.g., ib_write_bw, ib_send_lat), doca-perftest is a native implementation designed for modern data centers.
It is not a wrapper around those tools: it is a standalone product that replaces both the legacy tools and the custom orchestration scripts often required to run them at scale.

Architectural differences:

| Feature | Legacy Perftest | DOCA Perftest |
| --- | --- | --- |
| Scope | Point-to-Point (P2P) only | Single-node to Cluster-wide |
| Orchestration | Manual or third-party wrappers | Built-in (Single-host initiation) |
| Concurrency | Single-process per execution | Native multi-process/multi-core |
| Synchronization | Loose (Serial start) | Hardware-aligned (Synchronized start/stop) |
| Result Handling | Per-process manual extraction | Automatic cluster-wide aggregation |

Benefits of migration to doca-perftest:

  • Standard RDMA benchmarks require complex external scripts (Ansible, Bash, Python) to manage remote process launching, NUMA pinning, GPU selection and result parsing.
    doca-perftest handles these natively via the CLI or JSON scenario files.

  • In large-scale clusters, measuring fabric congestion or incast/outcast scenarios requires all nodes to hit the network simultaneously.
    doca-perftest utilizes a centralized sync engine to ensure all processes begin and end traffic in a coordinated window, providing accuracy that is impossible to achieve with asynchronous legacy wrappers.

  • While legacy tools require running multiple instances to saturate high-speed links (e.g., 200G/400G+), doca-perftest scales linearly across cores within a single execution using the -N or -C flags.

  • Rather than collecting individual output files from dozens of servers, doca-perftest provides a unified report.
    This includes the full scenario definition, all raw results, and calculated aggregations per-device, per-node, and per-test.

Setup and Dependencies

doca-perftest is included in the DOCA Networking installation profile. It requires only a standard RDMA stack to run, with optional DOCA SDK components enabling advanced capabilities.

RDMA Core (libibverbs)

The only mandatory dependency is libibverbs (part of the rdma-core package). doca-perftest supports both the version bundled with the DOCA deployment and the upstream open-source release from rdma-core.

MPI (Multi-Node Scenarios)

Multi-node scenarios (JSON mode and --traffic_pattern CLI mode) require OpenMPI for orchestration.

  • Up to version 2.5.9 (January 2026), doca-perftest relied on the OpenMPI package bundled with DOCA.

  • Starting from version 3.0.0 (April 2026), doca-perftest uses upstream OpenMPI.

MPI is not required for simple point-to-point CLI benchmarks.

Optional DOCA SDK Components

Depending on the DOCA installation, additional capabilities are available:

  • DOCA Verbs (-r dv) – The DOCA RDMA Verbs backend provides a high-performance alternative to standard libibverbs. It is loaded on demand; if the package is not installed, doca-perftest operates normally with the default IBV driver.

  • DOCA GPUNetIO (--engine gpunetio) – Offloads the RDMA data-path to GPU CUDA kernels using DOCA GPUNetIO. Requires the doca-gpunetio package, the CUDA Toolkit, and a Volta or newer GPU. See the "GPUNetIO Engine (GPU-initiated RDMA)" section for details.

If an optional component is not installed and a user explicitly requests it (e.g., -r dv without DOCA Verbs), doca-perftest exits with a descriptive error message.

Point-To-Point Benchmarks

For simple benchmarks, doca-perftest can be run directly from the command line.

When invoked on the client, the utility automatically launches the corresponding server process (requires passwordless SSH) and selects optimal CPU cores on both systems based on NUMA affinity.

Example command:

Bash
# Run on client
doca_perftest -d mlx5_0 -n <server-host-name>

This is equivalent to running:

Bash
# On server
doca_perftest -d mlx5_0 -N 1 -c RC -v write -m bw -s 65536 -D 10

# On client
doca_perftest -d mlx5_0 -N 1 -c RC -v write -m bw -s 65536 -D 10 -n <server-host-name>

Parameter breakdown:

| Parameter | Description |
| --- | --- |
| -d mlx5_0 | Uses the device mlx5_0. |
| -N 1 | Runs one process, automatically selecting an optimal core. (Use -C core to specify manually.) |
| -c RC | Uses a Reliable Connection (RC) transport. |
| -v write | Selects the Write verb for transmission. |
| -m bw | Measures bandwidth. |
| -s 65536 | Sets message size to 65,536 bytes. |
| -D 10 | Runs for 10 seconds. |
| -n server-host-name | (Client only) Specifies the remote target host. |

For a full list of CLI arguments, run doca_perftest -h or man doca_perftest.
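
When both sides have to be started by hand, it can help to generate the two command lines from a single parameter set so that they cannot drift apart. The following is a minimal Python sketch that uses only the flags listed in the parameter table above; the helper name build_commands is hypothetical and is not part of doca-perftest.

```python
import shlex

def build_commands(device, server_host, processes=1, transport="RC",
                   verb="write", metric="bw", msg_size=65536, duration=10):
    """Return (server_cmd, client_cmd) argument lists with matching parameters."""
    common = ["doca_perftest", "-d", device, "-N", str(processes),
              "-c", transport, "-v", verb, "-m", metric,
              "-s", str(msg_size), "-D", str(duration)]
    server_cmd = list(common)
    # -n is the only client-specific flag in the documented example.
    client_cmd = common + ["-n", server_host]
    return server_cmd, client_cmd

server_cmd, client_cmd = build_commands("mlx5_0", "server-host-name")
print(shlex.join(server_cmd))   # run this on the server
print(shlex.join(client_cmd))   # run this on the client
```

The lists can be passed directly to subprocess.run on each host, or printed and pasted into a shell as shown.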

If passwordless SSH is not configured, you must manually run doca-perftest on both client and server, ensuring parameters match.

Auto-launching Remote Server

In CLI mode, doca-perftest can automatically launch the remote server process over SSH.

Prerequisites:

  • Available only in CLI mode (not relevant for JSON configuration).

  • Requires passwordless SSH access to the server machine.

  • Both client and server must have the same doca-perftest version.

Bash
# Auto-launch server (default)
doca_perftest -d mlx5_0 -n server-name

# Disable auto-launch
doca_perftest -d mlx5_0 -n server-name --launch_server disable

Running server detection:

The system checks if a server is already running on the target host. If detected, it connects to the existing server instead of launching a new one. This prevents multiple server instances and allows sharing servers between tests.

Server override examples:

Bash
# Server uses different device than client
doca_perftest -d mlx5_0 -n server-name --server_device mlx5_1

# Server uses different memory type
doca_perftest -d mlx5_0 -n server-name -M host --server_mem_type cuda_auto_detect

# Server runs on specific cores
doca_perftest -d mlx5_0 -n server-name -C 0-3 --server_cores 4-7

# Alternate server executable path
doca_perftest -d mlx5_0 -n server-name --server_exe /tmp/other_doca_perftest_version

# Different SSH username (the user must have passwordless SSH access)
doca_perftest -d mlx5_0 -n server-name --server_username testuser

Scenario-based Benchmarks

When a benchmark involves multiple devices, multiple nodes, or multiple test configurations, doca-perftest uses MPI to orchestrate execution across all participants. Tests are launched and synchronized automatically, and results are aggregated on the initiating host.

Examples of scenarios requiring orchestration:

  • Multi-node – Traffic between two or more hosts across the network.

  • Multi-device – Traffic across multiple NICs on the same host or across hosts (e.g., railed configurations using mlx5_0 and mlx5_1).

  • Multi-test – Several benchmark configurations running simultaneously within a single execution (e.g., mixed message sizes for "elephant and mice" testing, latency under background load, or short-haul vs. long-haul comparisons).

Scenario-based benchmarks can be configured using one of two methods:

  • CLI mode (--traffic_pattern) – Defines a single test directly from the command line. No configuration file is required.

  • JSON mode (-f / -j) – Defines one or more tests via a JSON configuration. This is required for multi-test scenarios.

A test represents a single benchmark configuration: one verb, one message size, one traffic pattern, and one set of participating nodes. CLI mode is restricted to running exactly one test, whereas JSON mode can define multiple tests that execute in parallel, synchronized to start and stop together.

Choosing Between CLI and JSON Modes

  • Use CLI mode for single-test scenarios. It requires no configuration file while fully supporting all traffic patterns, partitions, and standard parameters.

  • Use JSON mode when you require multiple tests in a single scenario, per-node parameter overrides, or integration with automation pipelines that generate configurations programmatically (refer to the "DOCA Perftest | Automation and CI Integration" section).

CLI Scenario Mode

Use the --traffic_pattern flag to launch an orchestrated scenario directly from the command line.

This mode builds a scenario configuration from CLI flags and feeds it into the standard MPI pipeline. All results, progress tracking, and aggregation function identically to JSON mode.

Bash
# Basic: all-to-all across 12 nodes on mlx5_0
doca_perftest -d mlx5_0 -n my_node_0[1-12] --traffic_pattern all_to_all

# Advanced: railed all-to-all on two NICs, 10 QPs, 1KB messages, 60-second run
doca_perftest -d mlx5_[0,1] -n my_node_0[1-12] --traffic_pattern a2a --traffic_partition railed -q 10 -s 1024 -D 60

The advanced example runs an all-to-all pattern across 12 nodes using two NICs per node (mlx5_0 and mlx5_1).

  • railed partition: Ensures each NIC only communicates with the identically named NIC on peer nodes (e.g., mlx5_0 ↔ mlx5_0).

  • Test parameters: Utilizes 10 queue pairs (-q 10), 1 KB messages (-s 1024), and runs for 60 seconds (-D 60).

Key flags:

| Flag | Description |
| --- | --- |
| --traffic_pattern | Required. Sets the pattern: all_to_all (a2a), bisection (b), ring (r), one_to_one (o2o), one_to_many (o2m), or many_to_one (m2o). |
| --traffic_partition | Optional. Sets the partition filter: global, railed, local, remote, cross_railed, or remote_cross_railed. |
| -n | Accepts hostlist syntax (e.g., node[001-008]). |
| -d | Accepts device hostlists (e.g., mlx5_[0-1]) for railed configurations. |
| -G | Accepts CUDA device hostlists (e.g., [0-3]). This is converted to an integer in simple CLI mode. |

All standard test parameters (e.g., -v, -m, -s, -D, -q) are fully supported alongside --traffic_pattern.

JSON Scenario Mode

For multi-test scenarios or advanced configurations, provide a JSON scenario definition using one of two methods:

  • File (-f) – Load from a JSON file. Scenario files can be version-controlled, shared across teams, and reused for regression testing.

  • Inline string (-j) – Pass a raw JSON string directly. Useful for automation pipelines where the configuration is generated programmatically, avoiding extra file I/O.

Bash
# From file
doca_perftest -f path_to_scenario_file.json

# From inline JSON string
doca_perftest -j '{"testNodes": [...], "trafficPattern": "ALL_TO_ALL"}'

JSON mode capabilities:

  • Can be initiated from any node in the cluster (even non-participating ones).

  • All benchmarks run simultaneously, with millisecond-level synchronization for benchmark start and stop across all nodes.

  • Supports all traffic patterns and partitions.

  • Fully compatible with all CLI parameters — JSON parameters inherit the same defaults.

Common multi-test scenarios:

  • Mixed message sizes – Combine a large-message bandwidth test with a small-message message-rate test to evaluate how the fabric handles concurrent bulk and signaling traffic ("elephant and mice").

  • Latency under load – Run bandwidth traffic as background noise while measuring latency on a separate set of connections, revealing how congestion affects tail latency.

  • Short-haul vs long-haul – Define one test for intra-rack traffic (low-hop, optimized for latency) and another for inter-rack or cross-fabric traffic (different RDMA parameters tuned for distance).

  • Message size sweep – Run tests at multiple message sizes (e.g., 64B, 4KB, 64KB, 1MB) in a single scenario to profile throughput across the full range.

Example JSON configuration files are provided under: /usr/share/doc/doca-perftest/examples/. It is recommended to start by copying and modifying an existing example file.
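
For automation pipelines that use the inline (-j) form, the scenario string can be generated programmatically. The sketch below is illustrative only: it emits just the fields that appear in this guide (testNodes, hostname, deviceName, trafficPattern, trafficPartition), and the helper name inline_scenario_command is hypothetical. A real scenario will likely need additional test parameters; the shipped example files remain the reference for the full schema.

```python
import json
import shlex

def inline_scenario_command(hosts, device, pattern, partition=None):
    """Build a `doca_perftest -j '<json>'` command line for one node group."""
    scenario = {
        "testNodes": [{"hostname": hosts, "deviceName": device}],
        "trafficPattern": pattern,
    }
    if partition is not None:
        scenario["trafficPartition"] = partition
    payload = json.dumps(scenario, separators=(",", ":"))
    # shlex.quote protects the brackets and quotes from the shell.
    return "doca_perftest -j " + shlex.quote(payload)

cmd = inline_scenario_command("node[01-16]", "mlx5_[0,1]", "ALL_TO_ALL", "RAILED")
print(cmd)
```

Quoting the payload with shlex.quote avoids the shell interpreting the bracket ranges as glob patterns.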

Benchmark Results

By default, doca-perftest prints aggregated results to the terminal at the end of execution. This summary includes the total bandwidth, message rate, or latency statistics across all connections.

For detailed, machine-readable results, use the "outputFile" field (JSON mode) or the -o flag (CLI mode) to produce a structured JSON report. The JSON output provides full visibility into every level of the test hierarchy:

  • Per-connection results: Raw results for each individual connection, broken down by process. This level of detail is critical for identifying anomalies: if a subset of connections underperforms due to congestion, routing issues, or hardware faults, the per-connection data reveals it—even when the aggregated summary appears healthy.

  • Per-device and per-test aggregations: Results rolled up by NIC and by benchmark, making it easy to compare device-level performance across a cluster.

  • Full scenario configuration: The exact parameters used for execution, enabling reproducibility.

  • Threshold pass/fail status: Included if configured.

  • QP histogram data: Included if enabled.

JSON mode configuration:

JSON
"outputFile": "results.json"

CLI scenario mode:


Bash
doca_perftest -d mlx5_0 -n node[01-08] --traffic_pattern a2a -o results.json

The JSON output schema is versioned. Schema version and doca-perftest version are included in the output under the versionInfo block. For programmatic consumption of results via stdout, refer to the "DOCA Perftest | Structured JSON Output to stdout" section.

Bandwidth

Bandwidth tests measure the aggregate data transfer rate and message-handling efficiency across all participating processes.

Metrics collected:

  • Message rate (Mpps): The total number of Completion Queue Entries (CQEs) processed per second across all test processes. This metric indicates how efficiently the system can handle a high volume of small messages.

  • Bandwidth (Gb/s): The total data transfer rate, calculated by multiplying the number of received CQEs by the configured message size (-s or --msg_size) and dividing by the test duration. This metric evaluates the system's ability to sustain high-throughput communication.

Measurement notes:

  • Concurrency handling: Results reflect the sum of CQEs across all concurrent test processes, as specified by the -C command-line argument or the "cores" field in the input JSON file.

  • Test duration: The duration is averaged across all test processes, ensuring consistency in measuring sustained performance over time.

In multi-node scenarios, bandwidth results are aggregated at multiple levels: per-process, per-device, per-node, and per-test. The JSON output includes all aggregation levels for post-processing.
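
The relationship between CQE counts, message size, and the reported metrics can be sketched as follows. This is a simplified illustration of the description above, not the tool's actual code: it assumes 8 bits per byte, decimal units (1 Gb/s = 10^9 bits/s, 1 Mpps = 10^6 messages/s), and a plain arithmetic mean of the per-process durations. The function name aggregate_bandwidth is hypothetical.

```python
def aggregate_bandwidth(cqes_per_process, durations_sec, msg_size_bytes):
    """Return (message rate in Mpps, bandwidth in Gb/s) summed over all processes."""
    total_cqes = sum(cqes_per_process)                       # sum of CQEs across processes
    avg_duration = sum(durations_sec) / len(durations_sec)   # duration averaged across processes
    msg_rate_mpps = total_cqes / avg_duration / 1e6
    bandwidth_gbps = total_cqes * msg_size_bytes * 8 / avg_duration / 1e9
    return msg_rate_mpps, bandwidth_gbps

# Two processes, one million 64 KiB messages each, over a 10-second run.
rate, bw = aggregate_bandwidth([1_000_000, 1_000_000], [10.0, 10.0], 65536)
print(f"{rate:.3f} Mpps, {bw:.4f} Gb/s")
```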

Latency

Latency tests measure the delay between message transmission and acknowledgment. The measured direction depends on the RDMA verb used:

| Verb | Measurement Type |
| --- | --- |
| Send/Receive | One-way latency (Client → Server) |
| Write | Round-trip latency (Client → Server → Client) |

Metrics collected:

  • Minimum latency: Fastest observed transaction.

  • Maximum latency: Longest observed transaction.

  • Mean latency: Average across all iterations.

  • Median latency: Midpoint value (less influenced by outliers).

  • Standard deviation: Variability indicator.

  • 99% tail latency: 99% of messages completed within this time.

  • 99.9% tail latency: Outlier detection for extreme cases.

Measurement notes:

  • Latency is measured using tight RDMA verb loops.

  • Timing is collected on the sender side for maximum accuracy.

  • Results are aggregated across processes for final reporting.

doca-perftest measures write latency more accurately than legacy perftest tools. Because the two tools use different measurement methodologies, take those differences into account when validating results against legacy baselines.
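
To make the listed statistics concrete, the sketch below computes them from a list of per-iteration latency samples. It is an illustration under stated assumptions rather than the tool's implementation: the tail values use the nearest-rank percentile method, and the standard deviation is the population form. The function name latency_summary is hypothetical.

```python
import statistics

def latency_summary(samples_usec):
    """Summarize latency samples (microseconds) with the metrics listed above."""
    ordered = sorted(samples_usec)

    def tail(permille):
        # Nearest-rank percentile using integer math (990 = 99%, 999 = 99.9%).
        rank = (permille * len(ordered) + 999) // 1000
        return ordered[max(rank, 1) - 1]

    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.pstdev(ordered),
        "p99": tail(990),
        "p99.9": tail(999),
    }

summary = latency_summary([float(v) for v in range(1, 1001)])
print(summary)
```

The gap between the median and the 99.9% value is usually the first place to look when investigating congestion-induced jitter.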

Traffic and Topology Configuration

Unidirectional vs Bidirectional Traffic

doca-perftest supports two traffic-flow modes that fundamentally change how data moves between nodes and how resources are allocated.

Unidirectional Traffic (Default)

  • In unidirectional mode, traffic flows in one direction only.

  • The client (requestor) initiates operations, and the server (responder) receives them.

  • This is the default mode and provides clear, predictable performance metrics.

Bidirectional Traffic

In bidirectional mode, traffic flows in both directions simultaneously. Each side acts as both requestor and responder, creating full-duplex communication.

Bidirectional tests use two traffic runners (requestor and responder) that share resources, so the aggregate bandwidth may differ from twice the unidirectional result. Bidirectional mode is supported for the Write and Send verbs.

Run bidirectional traffic from the command line:

Bash
# Enable bidirectional traffic
doca_perftest -d mlx5_0 -n <server-name> -b

For JSON mode, use the "trafficDirection" field and set it to "BIDIR" or "UNIDIR".

Traffic Patterns

Traffic patterns provide built-in shortcuts for complex multi-node communication scenarios.

While these configurations were always possible through detailed JSON definitions, traffic patterns dramatically simplify setup for common topologies.

Example JSONs using traffic patterns are available under /usr/share/doc/doca-perftest/examples.

Available patterns:

  • ONE_TO_ONE

  • ONE_TO_MANY

  • MANY_TO_ONE

  • ALL_TO_ALL

  • BISECTION

  • RING

They collapse complex multi-node wiring into a few lines of JSON. Instead of manually listing dozens of connections, you specify a regex-like host list and a pattern (e.g., ALL_TO_ALL) and doca-perftest generates and synchronizes all connections for you.

One-to-One (O2O)

Simple point-to-point between two nodes; useful for baseline performance testing. 

"testNodes": [ {"hostname": "node01", "deviceName": "mlx5_0"},
               {"hostname": "node02", "deviceName": "mlx5_0"} ],
"trafficPattern": "ONE_TO_ONE"

Expected topology:

  • Total connections: 1

  • Routing: node01 → node02

One-to-Many (O2M)

Single sender to multiple receivers; the first node sends to all others.

"testNodes": [ {"hostname": "sender", "deviceName": "mlx5_0"},
               {"hostname": "receiver[1-10]", "deviceName": "mlx5_0"} ],
"trafficPattern": "ONE_TO_MANY"

Expected topology:

  • Total connections: 10

  • Routing: sender → receiver1, sender → receiver2, ..., sender → receiver10

Many-to-One (M2O)

Multiple senders to one receiver; all nodes send to the first node.

"testNodes": [  {"hostname": "aggregator", "deviceName": "mlx5_0"},
                {"hostname": "client[01-20]", "deviceName": "mlx5_0"}  ],
"trafficPattern": "MANY_TO_ONE"

Expected topology:

  • Total connections: 20

  • Routing: client01 → aggregator, client02 → aggregator, ..., client20 → aggregator

All-to-All (A2A)

Full-mesh connectivity; every node connects to every other node.

"testNodes": [  {"hostname": "compute[01-16]", "deviceName": "mlx5_0"} ],
"trafficPattern": "ALL_TO_ALL",
"trafficDirection": "UNIDIR"

Expected topology:

  • Total connections: 240 (Unidirectional) or 120 (Bidirectional)

  • Routing: Full mesh (every node connects to every other node)

Bisection (B)

Divides nodes into two equal halves; the first half connects to the second half. Requires an even number of nodes.

"testNodes": [  {"hostname": "rack1-[01-10]", "deviceName": "mlx5_0"},
                {"hostname": "rack2-[01-10]", "deviceName": "mlx5_0"} ],
"trafficPattern": "BISECTION"

Expected topology:

  • Total connections: 10

  • Routing: rack1-01 → rack2-01, rack1-02 → rack2-02, ..., rack1-10 → rack2-10

Ring (R)

Each node connects to the next in a circular ring: node i sends to node (i+1) % N. Requires at least 2 nodes.

"testNodes": [  {"hostname": "compute[01-08]", "deviceName": "mlx5_0"} ],
"trafficPattern": "RING"

Expected topology:

  • Total connections: 8

  • Routing: compute01 → compute02, compute02 → compute03, ..., compute08 → compute01

    All connections run simultaneously, not sequentially.
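
The connection counts quoted for each pattern follow directly from the pattern definitions. The sketch below reproduces them for an ordered node list. It is a simplified model of the descriptions above, not doca-perftest's internal generator, and the function name connections is hypothetical.

```python
from itertools import combinations, permutations

def connections(pattern, nodes, direction="UNIDIR"):
    """Return the (source, destination) pairs implied by a traffic pattern."""
    n = len(nodes)
    if pattern == "ONE_TO_ONE":
        if n != 2:
            raise ValueError("ONE_TO_ONE expects exactly two nodes")
        return [(nodes[0], nodes[1])]
    if pattern == "ONE_TO_MANY":   # the first node sends to all others
        return [(nodes[0], peer) for peer in nodes[1:]]
    if pattern == "MANY_TO_ONE":   # all other nodes send to the first node
        return [(peer, nodes[0]) for peer in nodes[1:]]
    if pattern == "ALL_TO_ALL":    # ordered pairs if UNIDIR, unordered if BIDIR
        pairs = combinations if direction == "BIDIR" else permutations
        return list(pairs(nodes, 2))
    if pattern == "BISECTION":     # first half connects to second half
        if n % 2:
            raise ValueError("BISECTION requires an even number of nodes")
        return list(zip(nodes[:n // 2], nodes[n // 2:]))
    if pattern == "RING":          # node i sends to node (i + 1) % N
        if n < 2:
            raise ValueError("RING requires at least 2 nodes")
        return [(nodes[i], nodes[(i + 1) % n]) for i in range(n)]
    raise ValueError(f"unknown pattern: {pattern}")

compute16 = [f"compute{i:02d}" for i in range(1, 17)]
print(len(connections("ALL_TO_ALL", compute16)))           # unidirectional full mesh
print(len(connections("ALL_TO_ALL", compute16, "BIDIR")))  # bidirectional full mesh
```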

Traffic Partitions

When nodes have multiple NICs, a traffic partition specifies which connections to establish based on the relationship between devices and hosts. Combined with a traffic pattern, it provides a concise way to describe common multi-device topologies in a single test definition.

Set via the "trafficPartition" JSON field or the --traffic_partition CLI flag.

| Partition | Usage | Connections Established |
| --- | --- | --- |
| GLOBAL | Test all possible connections (default). | nodeA:mlx5_0 ↔ nodeB:mlx5_0, nodeA:mlx5_0 ↔ nodeB:mlx5_1, ... |
| RAILED | Test each NIC against its counterpart on peer nodes. | nodeA:mlx5_0 ↔ nodeB:mlx5_0, nodeA:mlx5_1 ↔ nodeB:mlx5_1 |
| LOCAL | Test NIC-to-NIC traffic within the same host (loopback). | nodeA:mlx5_0 → nodeA:mlx5_1, nodeB:mlx5_0 → nodeB:mlx5_1 |
| REMOTE | Test only cross-host traffic, excluding loopback. | nodeA:mlx5_0 → nodeB:mlx5_0, nodeA:mlx5_0 → nodeB:mlx5_1 |
| CROSS_RAILED | Test each NIC against a different NIC on peer nodes. | nodeA:mlx5_0 ↔ nodeB:mlx5_1, nodeA:mlx5_0 ↔ nodeA:mlx5_1, nodeA:mlx5_1 ↔ nodeB:mlx5_0 |
| REMOTE_CROSS_RAILED | Same as CROSS_RAILED, but only across different hosts. | nodeA:mlx5_0 ↔ nodeB:mlx5_1, nodeA:mlx5_1 ↔ nodeB:mlx5_0 |

Example — railed all-to-all across a cluster with two NICs per node: 

"testNodes": [  {"hostname": "node[01-16]", "deviceName": "mlx5_[0,1]"} ],
"trafficPattern": "ALL_TO_ALL",
"trafficPartition": "RAILED"

Without the partition, achieving the same railed topology would require a separate test for each device with otherwise identical parameters.

Traffic partitions apply to any traffic pattern. The partition determines which connections are kept regardless of how they were generated. Some pattern-partition combinations may result in zero connections (e.g., LOCAL with a single device per node, or REMOTE on a one-to-one test within the same host).
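
A partition can be thought of as a predicate applied to every candidate connection. The sketch below models each partition as a filter over (host, device) endpoint pairs generated by a unidirectional all-to-all pattern. The predicates are inferred from the table above and are assumptions, in particular that GLOBAL and CROSS_RAILED also keep same-host pairs; the names PARTITIONS and apply_partition are hypothetical.

```python
from itertools import permutations

# Each predicate receives source and destination endpoints as (host, device).
PARTITIONS = {
    "GLOBAL": lambda s, d: True,
    "RAILED": lambda s, d: s[1] == d[1] and s[0] != d[0],
    "LOCAL": lambda s, d: s[0] == d[0],
    "REMOTE": lambda s, d: s[0] != d[0],
    "CROSS_RAILED": lambda s, d: s[1] != d[1],
    "REMOTE_CROSS_RAILED": lambda s, d: s[0] != d[0] and s[1] != d[1],
}

def apply_partition(partition, endpoints):
    """Keep only the all-to-all connections that the partition allows."""
    keep = PARTITIONS[partition]
    return [(s, d) for s, d in permutations(endpoints, 2) if keep(s, d)]

endpoints = [(host, dev) for host in ("nodeA", "nodeB")
             for dev in ("mlx5_0", "mlx5_1")]
counts = {name: len(apply_partition(name, endpoints)) for name in PARTITIONS}
print(counts)

# LOCAL with a single device per node leaves nothing to connect.
single_nic = [("nodeA", "mlx5_0"), ("nodeB", "mlx5_0")]
print(apply_partition("LOCAL", single_nic))
```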

Hostname, Device Name, and CUDA Device ID Ranged Selection

To streamline configuration for multi-node and multi-device scenarios, doca-perftest supports bracket-based range expansion in both JSON and CLI modes. This allows you to define large-scale clusters concisely.

Supported Syntax

| Feature | Syntax Example | Expansion Result |
| --- | --- | --- |
| Numeric Range | perf-host[0-3] | perf-host0, perf-host1, perf-host2, perf-host3 |
| Comma List | perf-host[0,2,4] | perf-host0, perf-host2, perf-host4 |
| Zero Padding | node[01-03] | node01, node02, node03 (Padding is preserved) |

Expansion Logic

When ranges are defined for both hostnames and device names, the tool generates all possible combinations (Cartesian product).

For example:

  • Input: hostname=host[1-2], devicename=mlx5_[0-1]

  • Result (4 combinations):

    1. host1:mlx5_0

    2. host1:mlx5_1

    3. host2:mlx5_0

    4. host2:mlx5_1
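
The expansion rules can be approximated in a few lines of Python. This sketch handles numeric ranges, comma lists, and zero padding as shown in the syntax table; it is not the tool's parser, and the function name expand is hypothetical.

```python
import re
from itertools import product

_BRACKET = re.compile(r"\[([^\]]+)\]")

def expand(spec):
    """Expand the bracket expressions in a hostname or device name."""
    match = _BRACKET.search(spec)
    if match is None:
        return [spec]
    prefix, suffix = spec[:match.start()], spec[match.end():]
    values = []
    for item in match.group(1).split(","):
        if "-" in item:
            low, high = item.split("-")
            # Preserve zero padding when the lower bound is padded (e.g., 01-03).
            width = len(low) if low.startswith("0") else 0
            values += [str(n).zfill(width) for n in range(int(low), int(high) + 1)]
        else:
            values.append(item)
    # Recurse so that names with several bracket groups also expand.
    return [prefix + v + rest for v in values for rest in expand(suffix)]

# Hostname and device ranges combine as a Cartesian product.
combos = list(product(expand("host[1-2]"), expand("mlx5_[0-1]")))
print(combos)
```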

CUDA Device ID Ranged Selection

The cudaDeviceId field supports the same bracket-based range expansion syntax. When both deviceName and cudaDeviceId use ranges, values are paired by order (not as a Cartesian product).

Examples:

  • deviceName = "mlx5_[0-1]" with cudaDeviceId = "[1,0]" creates the pairs: [mlx5_0, GPU1] and [mlx5_1, GPU0].

  • deviceName = "mlx5_[0-3]" with cudaDeviceId = "[0,0,1,1]" creates the pairs: [mlx5_0, GPU0], [mlx5_1, GPU0], [mlx5_2, GPU1], [mlx5_3, GPU1].

The number of deviceName elements must be equal to the number of cudaDeviceId elements.
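
The pairing rule differs from the Cartesian product used for hostnames and devices: elements are matched by position. A minimal sketch of that behavior, operating on already expanded lists, is shown below. The function name pair_devices_with_gpus and the error message are hypothetical.

```python
def pair_devices_with_gpus(device_names, cuda_device_ids):
    """Pair each NIC with the GPU at the same list position."""
    if len(device_names) != len(cuda_device_ids):
        raise ValueError(
            f"{len(device_names)} device names but "
            f"{len(cuda_device_ids)} CUDA device IDs"
        )
    return list(zip(device_names, cuda_device_ids))

# Equivalent to deviceName = "mlx5_[0-3]" with cudaDeviceId = "[0,0,1,1]".
pairs = pair_devices_with_gpus(["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"],
                               [0, 0, 1, 1])
print(pairs)
```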

Hardware and Memory Offloads

Multiprocess (Cores)

doca-perftest can run synchronized multi-process tests, ensuring that traffic starts simultaneously across all utilized cores. By default, the application runs a single process on one automatically selected core.

Process and core selection:

| CLI Flag | JSON Field | Description |
| --- | --- | --- |
| -N <count> | "num_processes" | Specifies the number of processes to run. The application will automatically select the optimal cores. |
| -C <cores> | "cores" | Explicitly specifies the exact core IDs or ranges to use. |

CLI examples:

Bash
# Run on 3 synchronized processes (cores auto-selected)
doca_perftest -d mlx5_0 -n <server-name> -N 3

# Run on a single specific core (Core 5)
doca_perftest -d mlx5_0 -n <server-name> -C 5

# Run on multiple explicitly defined cores (Cores 5 and 7)
doca_perftest -d mlx5_0 -n <server-name> -C 5,7

# Run on a contiguous range of cores (Cores 5 through 9)
doca_perftest -d mlx5_0 -n <server-name> -C 5-9
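
The core specifications used in these examples (a single ID, a comma list, or a range) can be interpreted as in the sketch below. It is an illustration only: whether the tool accepts mixed forms such as 0-3,8 is an assumption, and the function name parse_cores is hypothetical.

```python
def parse_cores(spec):
    """Turn a -C style specification into an explicit list of core IDs."""
    cores = []
    for part in spec.split(","):
        if "-" in part:
            first, last = (int(x) for x in part.split("-"))
            cores.extend(range(first, last + 1))   # ranges are inclusive
        else:
            cores.append(int(part))
    return cores

for spec in ("5", "5,7", "5-9"):
    print(spec, "->", parse_cores(spec))
```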


Topology-Aware GPU Selection

doca-perftest can automatically select the most suitable GPU for each network device based on PCIe topology proximity. The ranking follows NVIDIA's nvidia-smi topo hierarchy: NV > PIX > PXB > PHB > NODE > SYS. It also load-balances by choosing the least-loaded GPU among those with the best available connection type.

This ensures that the GPU closest to the NIC is chosen, minimizing latency and maximizing throughput.

Although auto-selection is the default behavior, users can still manually specify a GPU device using the -G argument in CLI mode, or the "cudaDeviceId" field in JSON mode.
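
The selection rule described above (best connection type first, then least-loaded GPU) can be sketched as a two-level sort key. This is a simplified model, not the tool's implementation. It assumes that "load" means the number of NICs already assigned to a GPU, that NVLink entries are normalized to "NV", and that ties are broken by the lower GPU ID. The names TOPO_RANK and select_gpu are hypothetical.

```python
# Lower rank means a closer connection, following the hierarchy
# NV > PIX > PXB > PHB > NODE > SYS.
TOPO_RANK = {"NV": 0, "PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}

def select_gpu(link_types, load):
    """Pick a GPU ID for one NIC.

    link_types maps GPU ID -> connection type between that GPU and the NIC.
    load maps GPU ID -> number of NICs already assigned to that GPU.
    """
    return min(link_types,
               key=lambda gpu: (TOPO_RANK[link_types[gpu]], load.get(gpu, 0), gpu))

links = {0: "SYS", 1: "PIX", 2: "PIX", 3: "NODE"}
print(select_gpu(links, load={1: 2, 2: 1}))   # GPUs 1 and 2 tie on PIX; GPU 2 is less loaded
print(select_gpu(links, load={}))             # no load recorded; lowest ID among PIX GPUs
```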

CLI examples:

Bash
# Automatically select both the GPU and memory type (recommended)
doca_perftest -d mlx5_0 -n <server-name> -M cuda

# Manually choose a specific GPU (e.g., GPU 0)
doca_perftest -d mlx5_0 -n <server-name> -G 0

# Deprecated syntax (still supported, equivalent to using auto-detect)
doca_perftest -d mlx5_0 -n <server-name> --cuda 0

Memory Types

doca-perftest supports several memory allocation strategies for RDMA operations. The memory type determines where the data read or written by the NIC resides (host DRAM, on-device memory, or GPU memory).

Select the memory type via the -M flag (CLI) or the "memoryType" field (JSON).

Host Memory (host)

Default mode using standard system RAM.

Bash
# Default host memory usage
doca_perftest -d mlx5_0 -n <server-name>

# Explicitly specify host memory
doca_perftest -d mlx5_0 -n <server-name> -M host

Null Memory Region (nullmr)

Does not allocate real memory; useful for ultra-low-latency synthetic tests.

Bash
# Null memory region for bandwidth testing
doca_perftest -d mlx5_0 -n <server-name> -M nullmr

Device Memory (device)

Allocates memory directly on the adapter hardware (limited by on-board capacity).

Bash
doca_perftest -d mlx5_0 -n <server-name> -M device

CUDA (GPU) Memory

RDMA operations can read from or write to GPU memory directly, bypassing host DRAM for maximum throughput and minimal latency on GPU-centric workloads.

doca-perftest supports several CUDA memory modes optimized for different hardware and driver configurations:

| Mode | Flag | Description |
| --- | --- | --- |
| Auto-Detection | -M cuda / cuda_auto_detect | Recommended. Picks the best available mode, trying cuda_data_direct → cuda_dmabuf → cuda_peermem in order. |
| Peermem | -M cuda_peermem | Traditional CUDA peer-memory allocation. Supported on all CUDA-capable systems, with slightly higher overhead. |
| DMA-BUF | -M cuda_dmabuf | Linux DMA-BUF framework for zero-copy GPU–NIC transfers. Requires CUDA 11.7+ and kernel support. |
| Data Direct | -M cuda_data_direct | Most efficient method using direct PCIe mappings. Requires specific hardware and driver support; lowest latency, highest bandwidth. |

Examples:

Bash
# Auto-detect best GPU memory type (recommended)
doca_perftest -d mlx5_0 -n server-name -M cuda -G 0

# With custom CUDA library path
doca_perftest -d mlx5_0 -n server-name -M cuda -G 0 --cuda_lib_path /usr/local/cuda-12/lib64

# Force a specific CUDA memory mode
doca_perftest -d mlx5_0 -n server-name -M cuda_peermem -G 0
doca_perftest -d mlx5_0 -n server-name -M cuda_dmabuf -G 0
doca_perftest -d mlx5_0 -n server-name -M cuda_data_direct -G 0

# Deprecated syntax (still supported, equivalent to cuda_auto_detect)
doca_perftest -d mlx5_0 -n server-name --cuda 0
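
The auto-detection fallback order listed in the table can be sketched as follows. How doca-perftest probes for each mode is not described in this guide, so the set of supported modes is passed in as a hypothetical input; the names AUTO_DETECT_ORDER and resolve_cuda_mode are also hypothetical.

```python
AUTO_DETECT_ORDER = ("cuda_data_direct", "cuda_dmabuf", "cuda_peermem")

def resolve_cuda_mode(requested, supported):
    """Return the concrete CUDA memory mode for a requested -M value."""
    if requested in ("cuda", "cuda_auto_detect"):
        # Try the most efficient mode first and fall back in order.
        for mode in AUTO_DETECT_ORDER:
            if mode in supported:
                return mode
        raise RuntimeError("no CUDA memory mode is available")
    if requested not in supported:
        raise RuntimeError(f"{requested} is not supported on this system")
    return requested

print(resolve_cuda_mode("cuda", {"cuda_dmabuf", "cuda_peermem"}))
```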

Execution Engine

While the memory type dictates where RDMA data resides, the execution engine determines who drives the RDMA data-path—specifically, which processor prepares Work Queue Elements (WQEs), rings doorbells, and polls completion queues. The two concepts are complementary and can be combined (for example, utilizing CUDA memory alongside either the CPU or the GPUNetIO engine).

Select the execution engine via the --engine flag (CLI) or the "engine" field (JSON).

CPU Engine (Default)

The host CPU drives the entire RDMA data-path. This is the default execution model and is fully compatible with all memory types, verbs, and traffic patterns.

No flag is required; the CPU engine is automatically utilized whenever the --engine flag is omitted.

GPUNetIO Engine (GPU-initiated RDMA)

With the gpunetio engine, WQE preparation, doorbell ringing, and Completion Queue (CQ) polling all execute inside CUDA kernels directly on the GPU. The host CPU is only involved during the initial setup, teardown, and final result collection.

The GPUNetIO engine is powered by DOCA GPUNetIO. For implementation details, refer to the DOCA GPUNetIO programming guide.

Key requirements:

  • Requires the DOCA Verbs driver (-r dv) and CUDA memory (-M cuda).

  • Supports the Write verb only (-v write) for both bandwidth and latency metrics.

  • Supports unidirectional traffic only (bidirectional is not supported).

  • Requires the doca-gpunetio package and the CUDA Toolkit.
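
When scenario files are generated automatically, the requirements above can be checked before launch. The sketch below validates a single test definition using the JSON field names shown in this guide. The defaults it assumes for omitted fields (ibv driver, host memory, write verb, unidirectional traffic) and the acceptance of any cuda-prefixed memory type are assumptions; the function name gpunetio_violations is hypothetical.

```python
def gpunetio_violations(test):
    """Return a list of reasons why a test cannot use the gpunetio engine."""
    if test.get("engine") != "gpunetio":
        return []   # the CPU engine has none of these restrictions
    problems = []
    if test.get("rdmaDriver", "ibv") != "dv":
        problems.append("gpunetio requires the DOCA Verbs driver (rdmaDriver: dv)")
    if not str(test.get("memoryType", "host")).startswith("cuda"):
        problems.append("gpunetio requires CUDA memory (memoryType: cuda)")
    if test.get("verb", "write") != "write":
        problems.append("gpunetio supports the write verb only")
    if test.get("trafficDirection", "UNIDIR") != "UNIDIR":
        problems.append("gpunetio supports unidirectional traffic only")
    return problems

valid = {"engine": "gpunetio", "memoryType": "cuda", "cudaDeviceId": 0,
         "rdmaDriver": "dv", "verb": "write", "metric": "bw"}
invalid = dict(valid, verb="send", trafficDirection="BIDIR")
print(gpunetio_violations(valid))
print(gpunetio_violations(invalid))
```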

Basic usage:

Bash
doca_perftest -d mlx5_0 -n server-name -M cuda -G 0 -r dv --engine gpunetio

JSON equivalent:

"engine": "gpunetio",
"memoryType": "cuda",
"cudaDeviceId": 0,
"rdmaDriver": "dv",
"verb": "write",
"metric": "bw"

RDMA Drivers

Two RDMA driver backends are supported. The available drivers depend on your installed packages and hardware.

| Driver | Description | Usage |
| --- | --- | --- |
| IBV (libibverbs) | The standard RDMA Verbs delivered as part of DOCA-OFED (and standard inbox drivers). Recommended for general compatibility across all IB/RoCE adapters. | -r ibv (default) |
| DV (doca_verbs) | The specialized DOCA RDMA Verbs backend. This provides a high-performance alternative to standard verbs and is optimized for the DOCA SDK ecosystem. | -r dv |

DOCA Verbs is not a mandatory dependency. doca-perftest runs without it and loads the DOCA Verbs driver on demand when -r dv is requested. The IBV driver is always available, whether or not the DOCA Verbs package is installed.

Advanced RDMA Tuning

Per-iteration-sync Flow (Lock-step Benchmarking)

Designed to mimic AI workloads, this flow ensures data transfer occurs in distinct, synchronized steps. By forcing every process to wait for all peers to complete an iteration before proceeding, it enables granular data validation and allows for Queue Pair (QP) parameter modification between steps.

Configuration constraints:

  • Mode restriction: Supported only in multi-node mode (simple Point-to-Point is not supported).

  • Traffic rules: Requires the ALL_TO_ALL pattern with BIDIR traffic.

  • Duration rules: Must be defined by specific iterations (time-based duration is not supported).

  • Invocation: Supported in both JSON mode ("iterationSyncType": "write_imm") and CLI mode (--traffic_pattern a2a --iteration_sync).

Logic and Implementation

The flow utilizes a bidirectional ALL_TO_ALL pattern. Each iteration consists of four distinct phases:

  1. Data phase: Every process sends a data message to all peers. The total msgSize is split across available QPs, with each QP writing to a specific offset to utilize the full buffer.

  2. Sync phase: Once data transfer completes, each process sends a Sync Message to all peers. The Sync Message is a zero-length RDMA Write with Immediate Data.

  3. Barrier phase: A process completes the iteration only after it has received confirmation for its own Sync Send and received Sync Messages from all peers.

  4. Post-iteration (management) phase: This occurs after synchronization but before the next iteration begins. It performs non-timed management tasks, such as modifying QPs, checking data validation results, or updating pointers.
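The buffer split described in the data phase can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function name `split_message` is hypothetical, and the rule that any remainder bytes go to the lowest-numbered QPs is an assumption made so that the chunks cover the whole buffer.

```python
def split_message(msg_size: int, num_qps: int) -> list[tuple[int, int]]:
    """Return one (offset, length) pair per QP, covering [0, msg_size)."""
    if num_qps <= 0:
        raise ValueError("num_qps must be positive")
    if msg_size < 0:
        raise ValueError("msg_size must be non-negative")

    base, extra = divmod(msg_size, num_qps)
    chunks = []
    offset = 0
    for qp in range(num_qps):
        # Assumption: the first `extra` QPs each carry one additional byte.
        length = base + (1 if qp < extra else 0)
        chunks.append((offset, length))
        offset += length
    return chunks
```

For example, `split_message(65536, 4)` yields four 16 KiB chunks at offsets 0, 16384, 32768, and 49152, so each QP writes to its own region of the buffer.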

Configuration

Add the following fields to your JSON configuration file:

| JSON Field | Value | Notes |
|---|---|---|
| iterationSync | "true" | Activates the flow logic. |
| trafficPattern | "ALL_2_ALL" | Required. Must be used even for 1:1 node connections. |
| trafficDirection | "BIDIR" | Required. Flow requires bidirectional exchange. |
| iterations | (Integer) | Required. Defines the run duration. Time-based "Duration" mode is not supported. |
| verb | "write"/"writeImm" | Determines the verb used in the Data phase. |
| metric | "bw" | Calculates both Bandwidth and Latency. |
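A pre-flight check of these fields might look like the sketch below. It only covers the companion fields (pattern, direction, iterations, verb, metric) and deliberately leaves out the field that activates the flow. The helper name `check_iteration_sync_fields` is hypothetical, and treating "bw" as the only accepted metric is an assumption based on the table above.

```python
def check_iteration_sync_fields(test: dict) -> list[str]:
    """Return a list of problems found in an iteration-sync test definition."""
    problems = []
    if test.get("trafficPattern") != "ALL_2_ALL":
        problems.append('trafficPattern must be "ALL_2_ALL"')
    if test.get("trafficDirection") != "BIDIR":
        problems.append('trafficDirection must be "BIDIR"')

    iterations = test.get("iterations")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(iterations, bool) or not isinstance(iterations, int) or iterations <= 0:
        problems.append("iterations must be a positive integer")

    if test.get("verb") not in ("write", "writeImm"):
        problems.append('verb must be "write" or "writeImm"')
    if test.get("metric") != "bw":
        problems.append('metric must be "bw"')
    return problems
```

An empty list means the definition satisfies the constraints listed in this section; the tool itself remains the authority on what it accepts.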

Measurement and Validation

  • Data validation integration: When dataValidation is set to true, the flow performs a bit-exact verification of all received data at the end of every iteration. This is highly effective for catching transient data corruption in complex A2A patterns. Validation occurs during the "post-iteration" management phase, strictly outside of the timed performance interval.

  • Bandwidth measurement: In bidirectional iteration-sync mode, bandwidth includes both TX and RX bytes, reflecting the total data exchanged per iteration.
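As a worked illustration of the bandwidth rule above, the sketch below sums TX and RX bytes before dividing by the elapsed time. The function name is hypothetical, and the use of 1 Gbit = 10^9 bits is an assumption about the reporting unit.

```python
def bidir_bandwidth_gbps(tx_bytes: int, rx_bytes: int, elapsed_sec: float) -> float:
    """Bandwidth in Gbit/sec counting both directions of one iteration."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    total_bits = (tx_bytes + rx_bytes) * 8
    return total_bits / elapsed_sec / 1e9  # assumes 1 Gbit = 1e9 bits
```

With 125 MB sent and 125 MB received in 10 ms, this gives 200 Gbit/sec, twice what a TX-only calculation would report.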

Limitations

  • Throughput overhead: Due to the heavy synchronization barrier, the measured "streaming" bandwidth will be lower than a standard continuous A2A test.

  • Mixed synchronization: A single scenario file cannot mix synchronization types. All tests must be either "Iteration-Sync" ("iterationSyncType": "write_imm") or standard ("iterationSyncType": "none").

  • Raw data output: When raw data output is enabled, iteration-sync and latency tests export per-iteration data in CSV format.

Start Packet Sequence Number

Start PSN controls the initial Packet Sequence Number for each Queue Pair (QP) at connection initialization. If unspecified, a random value is generated.

A fixed Start PSN is useful for debugging sequence-sensitive behavior, reproducing a run exactly, and testing interoperability.

| Interface | Configuration | Requirement |
|---|---|---|
| CLI | --start-psn <val1>,<val2>... | The number of values must exactly match the number of QPs. |
| JSON | "startPsn" object | Keys must be contiguous (e.g., qp0, qp1) and start at qp0. |

Example JSON configuration: 

"startPsn": {
    "qp0": 1000,
    "qp1": 1001
}
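When the configuration is generated by a script, both forms can be derived from one list of values, which keeps the keys contiguous by construction. The helper names below are hypothetical. The 24-bit range check reflects the width of the PSN field in InfiniBand/RoCE; whether doca-perftest accepts the full range is an assumption.

```python
PSN_MAX = (1 << 24) - 1  # the PSN field is 24 bits wide


def build_start_psn(values: list[int]) -> dict[str, int]:
    """Build a startPsn object with keys qp0, qp1, ... in order."""
    for value in values:
        if not 0 <= value <= PSN_MAX:
            raise ValueError(f"PSN {value} is outside 0..{PSN_MAX}")
    return {f"qp{i}": value for i, value in enumerate(values)}


def keys_are_contiguous(start_psn: dict[str, int]) -> bool:
    """True if the keys are exactly qp0..qpN-1 with no gaps."""
    return set(start_psn) == {f"qp{i}" for i in range(len(start_psn))}


def to_cli_value(values: list[int]) -> str:
    """Comma-separated form for the --start-psn flag."""
    return ",".join(str(value) for value in values)
```

`build_start_psn([1000, 1001])` reproduces the JSON object shown above, and `to_cli_value([1000, 1001])` gives the matching CLI value `1000,1001`.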

Enhanced Connection Establishment

Enhanced Connection Establishment (ECE) is an optional RDMA setup phase that aligns connection capabilities between the client and server before traffic begins.

When enabled, doca-perftest exchanges ECE parameters for each connection, leveraging the hardware-firmware negotiation that occurs during the Queue Pair (QP) transition from RESET to INIT.

High‑Level Flow

The ECE process ensures both sides agree on supported features before establishing the connection.

  1. The client queries its local ECE capabilities and sends them to the server via the control channel.

  2. The server applies the client's proposal, transitions its QP to INIT, and queries the device for the final accepted ECE configuration.

  3. The server sends the finalized ECE configuration back to the client.

  4. The client applies the finalized configuration, transitions its QP to INIT, and validates the negotiated result.

  5. Standard QP data exchange and RTR/RTS transitions proceed as usual.

ECE Configuration

| Interface | Instruction |
|---|---|
| CLI | Add the --use_ece flag. |
| JSON | Set "useEce": true in the test configuration. |

QP Hints

DOCA RDMA Verbs supports attaching opaque Congestion Control (CC) hints to Queue Pairs (QPs) for use by the Programmable Congestion Control (PCC) algorithm.

doca-perftest allows users to provide a binary hints file along with specific metadata (file size, vendor ID, and format ID). These parameters are passed directly to the PCC via the DOCA RDMA Verbs driver.

This feature is available only when using the DOCA RDMA Verbs driver (-r dv).

Configuring QP Hints

You can configure QP hints via CLI or JSON.

  • CLI – Pass a comma-separated list containing the file path and metadata using the --cc_group_hints flag.

  • JSON input – Add the ccGroupHints object to your test configuration: 

    "ccGroupHints": {
        "filePath": "/path/to/hints.bin",
        "fileSize": 1024,
        "vendorId": 1,
        "formatId": 1
    }
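If the configuration is generated programmatically, the size can be read from the file rather than typed by hand. This sketch assumes that fileSize is the size of the hints file in bytes, which the text above does not state explicitly; the helper name is hypothetical.

```python
import os


def build_cc_group_hints(file_path: str, vendor_id: int, format_id: int) -> dict:
    """Build a ccGroupHints object, taking fileSize from the file on disk."""
    return {
        "filePath": file_path,
        # Assumption: fileSize is the byte size of the hints file.
        "fileSize": os.path.getsize(file_path),
        "vendorId": vendor_id,
        "formatId": format_id,
    }
```

Reading the size from disk avoids a mismatch between the declared size and the actual file when the hints binary is regenerated.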
    

TPH

TLP Processing Hints (TPH) is a PCIe optimization that gives the CPU hints for cache management, reducing memory-access latency.

TPH requires ConnectX-6 or later hardware and a TPH-enabled kernel.

Parameters:

| Option | Meaning |
|---|---|
| --ph | Processing hint: 0 = Bidirectional (default), 1 = Requester, 2 = Completer, 3 = High-priority completer |
| --tph_core_id | Target CPU core for TPH handling |
| --tph_mem | Memory type: pm = Persistent, vm = Volatile |

Examples:

Bash
# Invalid: Core ID without memory type
doca_perftest -d mlx5_0 -n server-name --tph_core_id 0  # ERROR

# Invalid: Memory type without core ID
doca_perftest -d mlx5_0 -n server-name --tph_mem pm  # ERROR

# Valid: Both or neither
doca_perftest -d mlx5_0 -n server-name --ph 1  # OK (hints only)
doca_perftest -d mlx5_0 -n server-name --ph 1 --tph_core_id 0 --tph_mem pm  # OK (full config)
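The "both or neither" rule shown in the examples can be checked before launching a run. The sketch below is a hypothetical pre-flight helper, not part of doca-perftest; the accepted value sets are taken from the parameter table above.

```python
from typing import Optional


def validate_tph(ph: Optional[int] = None,
                 tph_core_id: Optional[int] = None,
                 tph_mem: Optional[str] = None) -> list[str]:
    """Return a list of problems with a TPH option combination."""
    problems = []
    if ph is not None and ph not in (0, 1, 2, 3):
        problems.append("--ph must be 0, 1, 2, or 3")
    if tph_mem is not None and tph_mem not in ("pm", "vm"):
        problems.append("--tph_mem must be 'pm' or 'vm'")
    # Core ID and memory type must be given together or not at all.
    if (tph_core_id is None) != (tph_mem is None):
        problems.append("--tph_core_id and --tph_mem must be used together")
    return problems
```

`validate_tph(ph=1)` and `validate_tph(ph=1, tph_core_id=0, tph_mem="pm")` return no problems, while supplying only one of the core ID and memory type reports the pairing error.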

Warmup Configuration

By default, doca-perftest performs a time-based warmup before measuring results. Two warmup modes are available:

| Option (CLI) | JSON Field | Description |
|---|---|---|
| -w <seconds> | warmupTimeSecs | Duration-based warmup (default). |
| --warmup_iterations <count> | warmupIterations | Iteration-based warmup. |

Constraints:

  • The two modes are mutually exclusive.

  • Iteration-sync tests reject duration-based warmup; use --warmup_iterations instead.

  • Latency tests reject iteration-based warmup; use -w instead.

  • Data validation requires warmup to be explicitly disabled (-w 0).
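These constraints can be expressed as a small check that a wrapper script runs before invoking the tool. The helper below is hypothetical. It assumes that a warmup time of 0 counts as "disabled" rather than as duration-based warmup, and it uses the metric name "lat" as it appears with the -m flag elsewhere in this guide.

```python
from typing import Optional


def check_warmup(warmup_secs: Optional[int] = None,
                 warmup_iterations: Optional[int] = None,
                 *,
                 iteration_sync: bool = False,
                 metric: str = "bw",
                 data_validation: bool = False) -> list[str]:
    """Return a list of warmup constraint violations (empty if none)."""
    problems = []
    if warmup_secs is not None and warmup_iterations is not None:
        problems.append("duration-based and iteration-based warmup are mutually exclusive")
    # Assumption: -w 0 means "disabled", so only a non-zero time is rejected here.
    if iteration_sync and warmup_secs:
        problems.append("iteration-sync tests require --warmup_iterations, not -w")
    if metric == "lat" and warmup_iterations is not None:
        problems.append("latency tests require -w, not --warmup_iterations")
    if data_validation and warmup_secs != 0:
        problems.append("data validation requires warmup to be disabled with -w 0")
    return problems
```

For instance, `check_warmup(warmup_secs=0, data_validation=True)` passes, while `check_warmup(data_validation=True)` reports that warmup was not explicitly disabled.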

QP Transport Timeout

The --qp_timeout flag (CLI) or "qpTimeOut" field (JSON) configures the transport-level timeout for each Queue Pair. This controls how long the QP waits before retransmitting an unacknowledged packet.

doca_perftest -d mlx5_0 -n server-name --qp_timeout 18
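In the InfiniBand verbs interface, the QP timeout attribute is an exponent: the local ACK timeout is 4.096 µs × 2^timeout, and 0 means wait indefinitely. Assuming --qp_timeout is passed through with that same encoding (this section does not state it), the value can be converted to seconds as sketched below; the function name is hypothetical.

```python
def qp_timeout_seconds(timeout: int) -> float:
    """Convert a verbs-style timeout exponent to seconds."""
    if not 0 <= timeout <= 31:  # the attribute is a 5-bit field
        raise ValueError("timeout must be in the range 0..31")
    if timeout == 0:
        return float("inf")  # 0 disables the timeout (infinite wait)
    return 4.096e-6 * (1 << timeout)
```

Under that assumption, the value 18 used in the example above corresponds to roughly 1.07 seconds, and 14 to roughly 67 ms.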

Multi-Verb WR Groups

The --wr_group flag (CLI) or "operations" field (JSON) enables posting multiple Work Requests per iteration, each with a different RDMA verb and message size. This is useful for benchmarking mixed-verb workloads.

# Post a Write of 64KB followed by a Send of 4KB per iteration
doca_perftest -d mlx5_0 -n server-name --wr_group write:65536,send:4096

Each WR in the group gets its own verb, message size, and memory region.
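A script that generates both CLI and JSON runs may want to parse the `verb:size` list once. The sketch below only handles the syntax shown in the CLI example above. The output key names (`verb`, `msgSize`) are illustrative assumptions and are not confirmed as the schema of the "operations" field.

```python
def parse_wr_group(spec: str) -> list[dict]:
    """Parse 'write:65536,send:4096' into an ordered list of WR descriptions."""
    operations = []
    for item in spec.split(","):
        verb, sep, size = item.strip().partition(":")
        if not sep or not verb or not size.isdigit():
            raise ValueError(f"expected <verb>:<bytes>, got {item!r}")
        # Key names are hypothetical; check the tool's JSON schema before use.
        operations.append({"verb": verb, "msgSize": int(size)})
    return operations
```

The order of the list matches the order in which the WRs are named on the command line.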

Telemetry and Diagnostics

QP Histogram

The QP histogram provides visibility into how work is distributed across multiple queue pairs during a test. This is highly useful for identifying load-balancing issues, scheduling inefficiencies, or hardware limitations when using multiple QPs.

Enabling QP histogram:

| Interface | Configuration | Execution Constraints |
|---|---|---|
| CLI | -H or --qp_histogram | Strictly restricted to one process (-N 1 is the default). |
| JSON | "printQpHistogram": true | Fully supports multi-process execution. |

CLI configuration example:

Bash
# Enable QP histogram with 8 queue pairs using the shorthand flag
doca_perftest -d mlx5_0 -n <server-name> -q 8 -H

# Or using the long flag
doca_perftest -d mlx5_0 -n <server-name> -q 8 --qp_histogram

Histogram Output

When running in CLI mode, the histogram is printed directly to standard output as a visual bar chart:

Bash
--------------------- QP WORK DISTRIBUTION ---------------------
Qp num 0:  ████████████████████████                     45.23 Gbit/sec  |  Relative deviation: -2.1%
Qp num 1:  █████████████████████████                    46.89 Gbit/sec  |  Relative deviation: 1.5%
Qp num 2:  ████████████████████████                     45.67 Gbit/sec  |  Relative deviation: -1.2%
Qp num 3:  █████████████████████████████                48.21 Gbit/sec  |  Relative deviation: 4.3%

When exporting results to JSON, the histogram data is nested within each entry of the resultsPerProcess array:

JSON
"connections": [
    {
        "resultsPerProcess": [
            {
                "qpHistogram": [
                    {
                        "qpIndex": 0,
                        "bwGbSec": 45.23,
                        "relativeDeviation": -2.1
                    },
                    {
                        "qpIndex": 1,
                        "bwGbSec": 46.89,
                        "relativeDeviation": 1.5
                    },
                    {
                        "qpIndex": 2,
                        "bwGbSec": 45.67,
                        "relativeDeviation": -1.2
                    },
                    {
                        "qpIndex": 3,
                        "bwGbSec": 48.21,
                        "relativeDeviation": 4.3
                    }
                ]
            }
        ]
    }
]
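An automated check can walk this structure and flag QPs whose deviation exceeds a chosen limit. The sketch below assumes the fragment above sits inside a top-level JSON object; the function name and the 3% default limit are illustrative choices, not values defined by doca-perftest.

```python
def flag_unbalanced_qps(results: dict, limit_pct: float = 3.0) -> list[int]:
    """Return the qpIndex of every QP whose |relativeDeviation| exceeds limit_pct."""
    flagged = []
    for connection in results.get("connections", []):
        for process in connection.get("resultsPerProcess", []):
            for entry in process.get("qpHistogram", []):
                if abs(entry["relativeDeviation"]) > limit_pct:
                    flagged.append(entry["qpIndex"])
    return flagged
```

Applied to the sample output above with the default limit, only QP 3 (deviation 4.3%) is flagged.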

Live Bandwidth Reporting

The --report_interval (-R) flag enables periodic live bandwidth reporting during CLI bandwidth tests. When set, a scrolling log of throughput samples (Gb/s, Mpps) is printed at the configured interval.

Bash
# Print live BW every 2 seconds
doca_perftest -d mlx5_0 -n server-name -m bw -R 2

Default is 0 (disabled). Live reporting is supported in both unidirectional and bidirectional modes.

Data Validation

Data validation verifies the integrity of RDMA traffic during bandwidth tests. When enabled, the requestor generates a deterministic payload for each message, and the responder compares the received data against the expected pattern.

To enable validation, set the dataValidation field to true in your test configuration. 

No other specific JSON changes are required, provided the test meets the constraints listed below.

Validation introduces CPU and memory overhead, which reduces measured bandwidth. In iteration-sync mode, however, validation runs during the inter-iteration gap, outside the timed interval, so the measurement is not affected.

Prerequisites and Constraints

  • Test type: Supported only for bandwidth tests (latency testing is not supported).

  • Supported modes:

    • Standard send verb tests.

    • Tests running in iteration-sync mode.

  • Buffer configuration: rxDepth must be greater than or equal to txDepth.

  • Warmup: Warmup must be explicitly disabled (-w 0).

  • Enhanced Reliability (ER): If "ER Auto Mode" is set, it will be automatically disabled when validation is active.

Output and Reporting

When validation is enabled, the JSON output includes a validationResults section.

  • Key metric: invalidDataSampleCount (the total number of messages that failed validation).

  • Logging: Individual failure logs are capped at the first 5,000 invalid samples. Additional failures are counted in the metric but not logged individually.

If validation is disabled, this section is omitted entirely.

Automation and CI Integration

doca-perftest provides several features designed explicitly for integration with automation frameworks, CI/CD pipelines, and regression testing systems. 

For automated environments, do not parse the human-readable terminal output, as this format is subject to change between versions to improve readability. Instead, utilize the structured JSON output (via --print_json_results or an output file) as your stable programmatic API.

Performance Thresholds

Performance thresholds allow automated pass/fail validation of benchmark results. When a threshold is set, doca-perftest compares the measured results against the specified value and reports whether the test passed or failed.

If any test fails its threshold check, doca-perftest exits with return code 2 (instead of the normal 0). This enables CI systems to detect regressions by checking the process exit code.

| Option | JSON Field | Unit | Validation Rule |
|---|---|---|---|
| --bw_threshold | bwThreshold | Gbit/sec | Test fails if average NIC bandwidth is below this value |
| --lat_threshold | latThreshold | usec | Test fails if the worst NIC median latency exceeds this value |

Both thresholds default to 0 (disabled). The threshold and pass/fail status are included in the JSON output.

Examples:

Bash
# Bandwidth regression test: fail if below 90 Gbit/sec
doca_perftest -d mlx5_0 -n server-name -m bw --bw_threshold 90

# Latency regression test: fail if median exceeds 5 usec
doca_perftest -d mlx5_0 -n server-name -m lat --lat_threshold 5.0
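In a CI job, the exit code is usually mapped to a job status. The sketch below shows one way to do that from Python using only the flags documented in this guide. The function names are hypothetical, and treating every other non-zero code as a generic error is an assumption.

```python
import subprocess

THRESHOLD_FAILED = 2  # documented exit code for a failed threshold check


def classify_exit(returncode: int) -> str:
    """Map a doca_perftest exit code to a CI status string."""
    if returncode == 0:
        return "pass"
    if returncode == THRESHOLD_FAILED:
        return "threshold-failed"
    return "error"  # assumption: any other code is a setup or runtime failure


def run_bw_regression(device: str, server: str, min_gbps: float) -> str:
    """Run a bandwidth test with a threshold and return its CI status."""
    cmd = [
        "doca_perftest", "-d", device, "-n", server,
        "-m", "bw", "--bw_threshold", str(min_gbps),
        "--non_interactive",
    ]
    return classify_exit(subprocess.run(cmd).returncode)
```

Separating the classification from the invocation keeps the status logic testable on machines where doca_perftest is not installed.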

Non-Interactive Mode

The --non_interactive flag suppresses real-time progress indicators (such as spinners, progress bars, and live table redraws). This is highly recommended when doca-perftest output is redirected to a file or pipe, or when a monitoring system reads stdout.

Bash
doca_perftest -d mlx5_0 -n server-name --non_interactive

Structured JSON Output to stdout

The --print_json_results flag (CLI) or "printJsonOutput": true field (JSON) prints the aggregated results summary directly to stdout in JSON format. This follows the exact same schema used for JSON output files.

This JSON output serves as a stable API contract and will not change without a corresponding schema version bump, making it the recommended method for programmatic result consumption.

Bash
# CLI: print JSON results to stdout
doca_perftest -d mlx5_0 -n server-name --print_json_results

JSON mode (add to the scenario configuration):

JSON
"printJsonOutput": true,
"outputFile": "results.json"

Inline JSON Configuration

The -j flag accepts a scenario configuration as a raw JSON string, eliminating the need to create a physical file on disk. This is especially useful for programmatic invocations where the configuration is generated dynamically.

Bash
doca_perftest -j '{"testNodes": [{"hostname": "node01", "deviceName": "mlx5_0"}, {"hostname": "node02", "deviceName": "mlx5_0"}], "trafficPattern": "ONE_TO_ONE"}'
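When the scenario is built in code, serializing it with a JSON library and passing it as a single argument avoids shell-quoting mistakes. The sketch below is a minimal illustration: the helper name is hypothetical, and the scenario contents are copied from the example above.

```python
import json


def inline_scenario_argv(scenario: dict) -> list[str]:
    """Build an argument vector that passes the scenario through -j."""
    # Compact separators keep the argument short; the content is unchanged.
    return ["doca_perftest", "-j", json.dumps(scenario, separators=(",", ":"))]


scenario = {
    "testNodes": [
        {"hostname": "node01", "deviceName": "mlx5_0"},
        {"hostname": "node02", "deviceName": "mlx5_0"},
    ],
    "trafficPattern": "ONE_TO_ONE",
}
argv = inline_scenario_argv(scenario)
```

Passing `argv` as a list to `subprocess.run` hands the JSON string to the tool as one argument, with no shell involved and therefore no quoting to get wrong.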

Running doca-perftest on BlueField-3 Devices

doca-perftest can generate traffic from either the x86 host or the BlueField Arm cores; the input JSON configuration determines which.

MPI Network Configuration

When launching doca-perftest from the server (regardless of whether the traffic originates from the x86 host or the BlueField), it is recommended to explicitly specify the MPI TCP network interface.

Add the subnet that connects the management server and the BlueField devices to the mpiTcpNetworkInterfaces field in your JSON input (e.g., "mpiTcpNetworkInterfaces": "10.7.8.0/24").

Traffic Originating from x86 Host (Server)

In this mode, traffic is generated by the x86 server. The RDMA device on the host (e.g., mlx5_0) performs DMA operations directly to/from host DRAM via PCIe.

Data path:

  • Path: NIC ↔ PCIe ↔ Host Memory

  • Bottlenecks: Performance is influenced by PCIe bandwidth and host CPU behavior, in addition to the network link and NIC capabilities.

JSON configuration:

  • hostName: Set to the x86 server hostname.

  • deviceName: Set to the RDMA device on the server (e.g., mlx5_0).

Traffic Originating from BlueField (Arm Cores)

In this mode, traffic is generated by the BlueField Arm cores, even if the test is launched from the x86 server. The RDMA device on the BlueField (e.g., p0, p1, mlx5_2, pf0vf0, depending on the operating mode) performs DMA operations to/from the BlueField's on-board DDR.

Data path:

  • Path: NIC ↔ DPU DDR (No PCIe hop)

  • Bottlenecks: Performance is typically limited by the network link, NIC, and DPU DDR bandwidth. The PCIe bus is not involved in the data path.

JSON configuration:

  • hostName: Set to the BlueField hostname.

  • deviceName: Set to the RDMA device on the BlueField (e.g., p0, mlx5_2).

    Device naming conventions may vary depending on the BlueField operating mode.
