DOCA GPUNetIO

This document provides an overview and configuration instructions for DOCA GPUNetIO API.

Introduction

The quality status of DOCA libraries is listed here.

DOCA GPUNetIO enables real-time GPU processing for network packets, making it ideal for application domains such as:

Signal processing
Network security
Information gathering
Input reconstruction

Traditional approaches often rely on a CPU-centric model, where the CPU coordinates with the NIC to receive packets in GPU memory using GPUDirect RDMA. Afterward, the CPU notifies a CUDA kernel on the GPU to process the packets. However, on low-power platforms, this CPU dependency can become a bottleneck, limiting GPU performance and increasing latency.

DOCA GPUNetIO addresses this challenge by offering a GPU-centric solution that removes the CPU from the critical path. By combining multiple NVIDIA technologies, it provides a highly efficient and scalable method for network packet processing.

Technologies integrated with DOCA GPUNetIO:

GPUDirect RDMA – Enables direct packet transfer between the NIC and GPU memory, eliminating unnecessary memory copies
GPUDirect Async Kernel-Initiated (GDAKI) – Allows CUDA kernels to control network operations without CPU intervention
- GDAKI is also named IBGDA when used with the RDMA protocol
GDRCopy Library – Allows the CPU to access GPU memory directly
NVIDIA BlueField DMA Engine – Supports GPU-triggered memory copies

The following is an example diagram of a CPU-centric approach:

The following is an example diagram of a GPU-centric approach:

Key features of DOCA GPUNetIO include:

GPUDirect Async Kernel-Initiated (GDAKI)
- GDAKI network communications – a GPU CUDA kernel can control network communications to send or receive data
  - GPU can control Ethernet communications (Ethernet/IP/UDP/TCP/ICMP)
  - GPU can control RDMA communications (InfiniBand or RoCE are supported)
  - CPU intervention is unnecessary in the application critical path
GPUDirect RDMA
- Enables direct data transfers to and from GPU memory without CPU staging copies.
DMA Engine Control
- CUDA kernels can initiate memory copies using BlueField's DMA engine
Semaphores for Low-Latency Communication
- Supports efficient message passing between CUDA kernels or between CUDA kernels and CPU threads
Smart Memory Allocation
- Allocates aligned GPU memory buffers, optimizing memory access
- GDRCopy library to allocate a GPU memory buffer accessible from the CPU
Accurate Send Scheduling
- Provides precise control over Ethernet packet transmission based on user-defined timestamps.

NVIDIA applications that use DOCA GPUNetIO include:

Aerial 5G SDK – For ultra-low latency 5G network operations
NIXL

– NVIDIA Inference Xfer Library (NIXL) is targeted for accelerating point to point communications in AI inference frameworks (e.g., NVIDIA Dynamo)
Holoscan Advanced Network Operator

– Powering real-time data processing in edge AI environments
NVQLink – New GPU RoCE transceiver Holoscan sensor bridge operator
UCX – new GDAKI module via GPUNetIO functions
NCCL – GIN transport enabled via GPUNetIO GPU communications

For more information about DOCA GPUNetIO, refer to the following NVIDIA blog posts:

Changes From Previous Releases

Changes in 3.3

DGX Spark for GPUNetIO application section added in this programming guide
ConnectX8 reliable doorbell feature enabled for GPUNetIO + Verbs
GPU packet processing application now measures the receive BW for UDP traffic
GPU packet processing application programming guide extended with instructions to run T-Rex or dpdk-testpmd packet generators

General Performance and Best Practices

To ensure optimal results and avoid common pitfalls when working with DOCA GPUNetIO, please adhere to the following guidelines:

Build Configuration

Use release builds: Always set buildtype = 'release' in your meson.build file when compiling samples or applications. Using the default debug mode will result in significantly lower performance and higher latency.

Hardware and System Setup

Topology matters: For maximum throughput, ensure your GPU and NIC have a PIX or PXB topology (connected via a single PCIe bridge). Avoid NODE or SYS topologies where data must traverse NUMA nodes or SMP interconnects.
GPU BAR1 size: Check your GPU's BAR1 size using nvidia-smi -q. If your application fails to map memory buffers to the NIC, you may need to enable "Resizable BAR" in your system BIOS to increase this space.
DGX spark exceptions: If your hardware does not support GPUDirect RDMA (like DGX Spark), you must use the DOCA_GPU_MEM_TYPE_CPU_GPU memory type and enable CPU proxy mode for transmission.

Programming Best Practices

CUDA context initialization: Before calling any DOCA GPUNetIO functions (like doca_gpu_create), ensure a CUDA context is active on the target device. A simple way to do this is by calling cudaFree(0).
Execution scopes: When using high-level Ethernet APIs, choose the widest possible execution scope (Warp or Block) rather than Thread scope. This reduces contention on atomic operations and significantly improves doorbell ringing efficiency.
Memory pointers: Be extremely careful with memory pointers. Always use memptr_gpu for data accessed within CUDA kernels and memptr_cpu for CPU-side management to avoid segmentation faults.

Permissions

Root access: All Ethernet samples and applications rely on DOCA Flow and must be executed with sudo or root privileges.
Non-root alternative: Verbs, RDMA, and DMA operations can run without root if you configure the NVreg_RegistryDwords="PeerMappingOverride=1;" option in the NVIDIA driver.

Navigation Hub

Choose the path that matches your current goal:

Goal	Description	Link
Set up my system	Install packages, configure NIC firmware (ConnectX/BlueField), and verify PCIe topology	Installation and Setup
Learn the concepts	Understand GDAKI, CPU vs. GPU control paths, and the GPUNetIO memory model	Architecture and Design
Start coding	Reference CPU and GPU functions for Ethernet, RDMA, Verbs, and DMA	API Reference
Run a demo	Walk through build instructions and execute provided sample applications	Sample Guide

Last updated: January 22, 2026