This guide provides instructions for building and developing applications that require telemetry data collection from NVIDIA® BlueField and NVIDIA® ConnectX® families of networking platforms.
Introduction
The doca_telemetry_pci library provides access to PCIe status and performance information from BlueField or ConnectX networking platforms.
DOCA Telemetry PCI is supported at alpha level.
Prerequisites
To utilize DOCA Telemetry PCI, your system must meet the following baseline requirements:
- Firmware: Version >=28/32/40.43.1000 is required for ConnectX-7, BlueField-3, and ConnectX-8 devices.
- Driver: The fwctl driver must be fully installed and actively loaded on the system.
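For scripted environment checks, the firmware requirement can be expressed as a small comparison routine. The following is a minimal sketch, not part of DOCA: the function name fw_version_is_sufficient is hypothetical, and it assumes the firmware version string has the form major.minor.subminor, with 28, 32, and 40 read as the accepted major numbers from the requirement above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Minimum minor and sub-minor numbers taken from the requirement xx.43.1000. */
#define MIN_FW_MINOR 43u
#define MIN_FW_SUBMINOR 1000u

/*
 * Hypothetical helper: returns true when a "major.minor.subminor" string is
 * at or above xx.43.1000 and the major number is one of 28, 32, or 40.
 * Assumption: the major number is matched exactly, never compared by range.
 */
bool fw_version_is_sufficient(const char *version)
{
    unsigned int major, minor, subminor;

    if (version == NULL || sscanf(version, "%u.%u.%u", &major, &minor, &subminor) != 3)
        return false;
    if (major != 28 && major != 32 && major != 40)
        return false;
    if (minor != MIN_FW_MINOR)
        return minor > MIN_FW_MINOR;
    return subminor >= MIN_FW_SUBMINOR;
}
```

How the version string is obtained (for example, from a firmware query tool) is outside the scope of this sketch.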
Verifying the fwctl Driver
To verify that the fwctl driver is successfully loaded, check the device directories:
$ ls /sys/class/fwctl/
$ ls /dev/fwctl
The expected output for a standard 2-port device is fwctl0 fwctl1.
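An application can perform the same check at startup before attempting to use the library. The following is a minimal sketch using only POSIX directory calls; the helper name count_fwctl_devices is hypothetical and is not part of DOCA.

```c
#include <dirent.h>
#include <string.h>

/*
 * Hypothetical helper: counts directory entries whose names start with
 * "fwctl" (for example fwctl0, fwctl1) in a directory such as
 * /sys/class/fwctl. Returns -1 if the directory cannot be opened, which
 * usually means the driver is not loaded.
 */
int count_fwctl_devices(const char *dir_path)
{
    DIR *dir = opendir(dir_path);
    struct dirent *entry;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((entry = readdir(dir)) != NULL) {
        if (strncmp(entry->d_name, "fwctl", 5) == 0)
            count++;
    }
    closedir(dir);
    return count;
}
```

Calling count_fwctl_devices("/sys/class/fwctl") on a standard 2-port device with the driver loaded would be expected to return 2, matching the ls output described above.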
Manually Loading the Driver
If the directories /sys/class/fwctl or /dev/fwctl do not exist or are empty, the module may be installed but inactive.
Check for the module's presence:
$ grep fwctl -R /lib/modules/$(uname -r)/
If the output confirms the presence of fwctl.ko and mlx5_fwctl.ko, manually load the module and verify its status:
$ sudo modprobe mlx5_fwctl
$ lsmod | grep fwctl
Reinstalling the DOCA Host Package
If you cannot locate the installed fwctl module while manually loading the driver, or if the modprobe command fails to load it successfully, you must reinstall the DOCA Host package.
- Download the package (DOCA 3.3.0 example):
$ wget https://www.mellanox.com/downloads/DOCA/DOCA_v3.3.0/host/doca-host_3.3.0-088000-26.01-ubuntu2204_amd64.deb
- Purge existing DOCA and OFED modules (sudo is applied to apt inside the loop, because for is a shell keyword that sudo cannot run directly):
$ for f in $( dpkg --list | grep doca | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
$ for f in $( dpkg --list | grep mlnx | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
$ for f in $( dpkg --list | grep dpdk | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
$ for f in $( dpkg --list | grep ofed | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
$ sudo /usr/sbin/ofed_uninstall.sh --force
$ sudo apt-get autoremove
- Install the new package and restart services:
$ sudo dpkg -i doca-host_3.3.0-088000-26.01-ubuntu2204_amd64.deb
$ sudo apt-get update
$ sudo apt-get -y install doca-all
$ sudo /etc/init.d/openibd restart
Once the reinstallation is complete, confirm that the module is loaded as described in "Verifying the fwctl Driver".
Environment
DOCA Telemetry-based applications can run on either the host machine (ConnectX-7 or BlueField-3 and newer) or the DPU target (BlueField-3 and newer).
Architecture
DOCA Telemetry PCI provides insights into PCIe devices, including:
- Management information: PCIe link and speed details, power usage, function count, error detection flags, and more.
- PCIe performance counters: Data transfer rates, error rates, stall counters, L0 recovery count, and other performance metrics.
- PCIe latency histogram: Helps understand the duration of PCIe operations.
To interact with a device, typically corresponding to a specific NIC port, create a DOCA Telemetry PCI context using doca_telemetry_pci_create().
Configuration Phase
Device Support
DOCA Telemetry PCI requires a device to operate. To select a device, refer to "DOCA Core Device Discovery".
Because device capabilities may vary, it is recommended to check that the device supports the PCIe telemetry options you need before opening it. The capability checks available for DOCA Telemetry PCI are outlined below:
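| Functionality | Method |
|---|---|
| PCI Management Information | doca_telemetry_pci_cap_management_info_is_supported |
| PCI Performance Counters Group 1 | doca_telemetry_pci_cap_perf_counters_1_is_supported |
| PCI Performance Counters Group 2 | |
| PCI Latency Histogram | doca_telemetry_pci_cap_latency_histogram_is_supported |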
Within the structures provided during the execution phase, some fields are populated only when a further sub-capability is also supported:
| Functionality | Field(s) | Method |
|---|---|---|
| PCI Management Information | | |
| PCI Performance Counters Group 1 | | |
Execution Phase
Retrieving PCIe Management Information
With a running doca_telemetry_pci context that supports PCIe management information, the user can call doca_telemetry_pci_read_management_info repeatedly. Each call returns the most recent data available.
The following is a more complete example:
// Fragment: dev and devinfo are assumed to have been obtained already
doca_error_t result;

// Check for ability to read management info
result = doca_telemetry_pci_cap_management_info_is_supported(devinfo);
if (result != DOCA_SUCCESS) {
    // Capability is not supported or an error occurred, stop
}
// Check any sub-capabilities if you require those fields

// Create PCI telemetry
struct doca_telemetry_pci *pci_telem;
result = doca_telemetry_pci_create(dev, &pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to create telemetry instance
}

// Start PCI telemetry
result = doca_telemetry_pci_start(pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to start telemetry instance
}

// Read management info
struct doca_telemetry_pci_management_info management_info = {0};

// Option 1: use DPN to identify the data subject
struct doca_telemetry_pci_dpn dpn = {0, 0, 0};
result = doca_telemetry_pci_read_management_info(pci_telem, dpn, &management_info);
if (result != DOCA_SUCCESS) {
    // Handle failure to read data
}

// Option 2: use PCI address to identify the data subject (replace SSSS.BB.DD.F with an actual PCI address)
result = doca_telemetry_pci_read_management_info_by_pci_addr(pci_telem, "SSSS.BB.DD.F", &management_info);
if (result != DOCA_SUCCESS) {
    // Handle failure to read data
}

// Use the data

// Cleanup
doca_telemetry_pci_stop(pci_telem);
doca_telemetry_pci_destroy(pci_telem);
Retrieving Performance Counters Group 1
With a running doca_telemetry_pci context that supports performance counters group 1, the user can call doca_telemetry_pci_read_perf_counters_1 repeatedly. Each call returns the most recent data available.
The following is a more complete example:
// Fragment: dev and devinfo are assumed to have been obtained already
doca_error_t result;

// Check for ability to read perf counters group 1
result = doca_telemetry_pci_cap_perf_counters_1_is_supported(devinfo);
if (result != DOCA_SUCCESS) {
    // Capability is not supported or an error occurred, stop
}
// Check any sub-capabilities if you require those fields

// Create PCI telemetry
struct doca_telemetry_pci *pci_telem;
result = doca_telemetry_pci_create(dev, &pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to create telemetry instance
}

// Start PCI telemetry
result = doca_telemetry_pci_start(pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to start telemetry instance
}

// Read perf counters group 1
struct doca_telemetry_pci_perf_counters_1 counters = {0};

// Option 1: use DPN to identify the data subject
struct doca_telemetry_pci_dpn dpn = {0, 0, 0};
result = doca_telemetry_pci_read_perf_counters_1(pci_telem, dpn, &counters);
if (result != DOCA_SUCCESS) {
    // Handle failure to read data
}

// Option 2: use PCI address to identify the data subject (replace SSSS.BB.DD.F with an actual PCI address)
result = doca_telemetry_pci_read_perf_counters_1_by_pci_addr(pci_telem, "SSSS.BB.DD.F", &counters);
if (result != DOCA_SUCCESS) {
    // Handle failure to read data
}

// Use the data

// Cleanup
doca_telemetry_pci_stop(pci_telem);
doca_telemetry_pci_destroy(pci_telem);
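Because each read returns a snapshot, a rate has to be derived by the application from two successive reads. The following is a minimal sketch of that arithmetic, assuming the counter of interest is a cumulative, monotonically increasing 64-bit value; whether a given field behaves this way should be confirmed against the API reference. The helper names are hypothetical and operate on plain integers rather than on the DOCA structure.

```c
#include <stdint.h>

/*
 * Difference between two readings of a cumulative 64-bit counter.
 * Unsigned subtraction remains correct across a single wraparound.
 */
uint64_t counter_delta(uint64_t previous, uint64_t current)
{
    return current - previous;
}

/*
 * Average events per second between two readings taken elapsed_ns apart.
 * Returns 0.0 for an empty interval to avoid dividing by zero.
 */
double counter_rate_per_sec(uint64_t previous, uint64_t current, uint64_t elapsed_ns)
{
    if (elapsed_ns == 0)
        return 0.0;
    return (double)counter_delta(previous, current) * 1e9 / (double)elapsed_ns;
}
```

The elapsed time would typically come from a monotonic clock sampled alongside each read, so that wall-clock adjustments do not distort the result.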
Retrieving Latency Histogram
With a running doca_telemetry_pci context that supports the latency histogram, the user must first call doca_telemetry_pci_get_latency_histogram_dimensions to learn the dimensions of the histogram, then allocate an array large enough to hold the histogram values. After that, the user can call doca_telemetry_pci_read_latency_histogram repeatedly. Each call returns the most recent data available.
The following is a more complete example:
// Fragment: dev and devinfo are assumed to have been obtained already
// malloc and free require <stdlib.h>
doca_error_t result;

// Check for ability to read the latency histogram
result = doca_telemetry_pci_cap_latency_histogram_is_supported(devinfo);
if (result != DOCA_SUCCESS) {
    // Capability is not supported or an error occurred, stop
}

// Create PCI telemetry
struct doca_telemetry_pci *pci_telem;
result = doca_telemetry_pci_create(dev, &pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to create telemetry instance
}

// Start PCI telemetry
result = doca_telemetry_pci_start(pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to start telemetry instance
}

// Learn the histogram's dimensions
uint32_t bucket_count = 0;
uint32_t bucket_width_ns = 0;

// Option 1: use DPN to identify the data subject
struct doca_telemetry_pci_dpn dpn = {0, 0, 0};
result = doca_telemetry_pci_get_latency_histogram_dimensions(pci_telem, dpn, &bucket_count, &bucket_width_ns);
if (result != DOCA_SUCCESS) {
    // Handle failure to get histogram dimensions
}

// Option 2: use PCI address to identify the data subject (replace SSSS.BB.DD.F with an actual PCI address)
result = doca_telemetry_pci_get_latency_histogram_dimensions_by_pci_addr(pci_telem, "SSSS.BB.DD.F", &bucket_count, &bucket_width_ns);
if (result != DOCA_SUCCESS) {
    // Handle failure to get histogram dimensions
}

// Allocate memory to hold histogram data
uint64_t *buckets_arr = malloc(bucket_count * sizeof(*buckets_arr));
if (buckets_arr == NULL) {
    // Handle failure to allocate memory
}

// Fetch histogram data
// Option 1: use DPN to identify the data subject
result = doca_telemetry_pci_read_latency_histogram(pci_telem, dpn, buckets_arr);
if (result != DOCA_SUCCESS) {
    // Handle failure to read data
}

// Option 2: use PCI address to identify the data subject (replace SSSS.BB.DD.F with an actual PCI address)
result = doca_telemetry_pci_read_latency_histogram_by_pci_addr(pci_telem, "SSSS.BB.DD.F", buckets_arr);
if (result != DOCA_SUCCESS) {
    // Handle failure to read data
}

// Use the data

// Cleanup
free(buckets_arr);
doca_telemetry_pci_stop(pci_telem);
doca_telemetry_pci_destroy(pci_telem);
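Once the bucket array has been read, a common way to use it is to estimate a latency percentile. The following is a minimal sketch, not part of DOCA: the helper name histogram_percentile_ns is hypothetical, and it assumes that bucket i covers the range from i * bucket_width_ns up to (i + 1) * bucket_width_ns nanoseconds. The actual bucket layout should be confirmed against the API reference.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns the upper edge, in nanoseconds, of the bucket
 * in which the requested percentile (0 to 100) falls, or 0 if the histogram
 * holds no samples.
 * Assumption: bucket i covers [i * width, (i + 1) * width) nanoseconds.
 */
uint64_t histogram_percentile_ns(const uint64_t *buckets, uint32_t bucket_count,
                                 uint32_t bucket_width_ns, double percentile)
{
    uint64_t total = 0;
    uint64_t running = 0;
    uint32_t i;

    for (i = 0; i < bucket_count; i++)
        total += buckets[i];
    if (total == 0)
        return 0;

    /* Number of samples that must be covered to reach the percentile. */
    double target = percentile / 100.0 * (double)total;

    for (i = 0; i < bucket_count; i++) {
        running += buckets[i];
        if ((double)running >= target)
            return (uint64_t)(i + 1) * bucket_width_ns;
    }
    return (uint64_t)bucket_count * bucket_width_ns;
}
```

The result is only as precise as the bucket width, since the position of samples inside a bucket is not known.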
Alternative Datapath Options
DOCA Telemetry PCI supports only CPU-based datapaths.