This guide provides instructions for building and developing applications that require telemetry data collection from NVIDIA® BlueField and NVIDIA® ConnectX® families of networking platforms.
Introduction
The DOCA Telemetry PHY library provides access to detailed telemetry data and statistics from NVIDIA BlueField and ConnectX networking platforms. It allows developers to monitor and analyze link, cable, and port status and statistics.
DOCA Telemetry PHY is currently supported at the alpha level.
Prerequisites
To utilize DOCA Telemetry PHY, your system must meet the following baseline requirements:
-
Firmware: Version >=28/32/40.43.1000 is required for ConnectX-7, BlueField-3, and ConnectX-8 devices.
-
Driver: The fwctl driver must be fully installed and actively loaded on the system.
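The firmware requirement above can be checked programmatically. The following is a minimal sketch, assuming the firmware version is available as a "major.minor.subminor" string and that the majors 28, 32, and 40 correspond to ConnectX-7, BlueField-3, and ConnectX-8 respectively (an interpretation of the ">=28/32/40.43.1000" notation). The function name `fw_version_meets_minimum` is hypothetical and not part of DOCA.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of DOCA.
 * Returns 1 if the version satisfies the assumed minimum (minor.subminor
 * of at least 43.1000 on major 28, 32, or 40), 0 if it does not, and -1
 * if the string is not of the form "major.minor.subminor".
 */
int fw_version_meets_minimum(const char *version)
{
    unsigned int major, minor, sub;
    char trailing;

    if (version == NULL)
        return -1;

    /* Exactly three numeric fields must match; a fourth match means
     * there were unexpected trailing characters. */
    if (sscanf(version, "%u.%u.%u%c", &major, &minor, &sub, &trailing) != 3)
        return -1;

    /* Assumption: only these majors belong to the listed device families. */
    if (major != 28 && major != 32 && major != 40)
        return 0;

    if (minor > 43)
        return 1;
    return (minor == 43 && sub >= 1000) ? 1 : 0;
}
```

For example, `fw_version_meets_minimum("28.43.1000")` returns 1, while `fw_version_meets_minimum("40.42.9999")` returns 0.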
Verifying the fwctl Driver
To verify that the fwctl driver is successfully loaded, check the device directories:
$ ls /sys/class/fwctl/
$ ls /dev/fwctl
The expected output for a standard 2-port device is fwctl0 fwctl1.
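An application can perform the same check before creating a context. The following is a minimal sketch using POSIX directory calls; the function name `count_fwctl_devices` is hypothetical, and the sketch only counts directory entries whose names begin with "fwctl" without validating the devices themselves.

```c
#include <dirent.h>
#include <string.h>

/*
 * Hypothetical helper, not part of DOCA.
 * Counts entries named "fwctl*" in dir_path (for example
 * "/sys/class/fwctl" or "/dev/fwctl").
 * Returns -1 if the directory cannot be opened.
 */
int count_fwctl_devices(const char *dir_path)
{
    DIR *dir;
    struct dirent *entry;
    int count = 0;

    if (dir_path == NULL)
        return -1;

    dir = opendir(dir_path);
    if (dir == NULL)
        return -1;

    while ((entry = readdir(dir)) != NULL) {
        if (strncmp(entry->d_name, "fwctl", 5) == 0)
            count++;
    }

    closedir(dir);
    return count;
}
```

A return value of -1 or 0 for `/sys/class/fwctl` corresponds to the missing or empty directory case, in which the driver may need to be loaded manually.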
Manually Loading the Driver
If the directories /sys/class/fwctl or /dev/fwctl do not exist or are empty, the module may be installed but inactive.
Check for the module's presence:
$ grep fwctl -R /lib/modules/$(uname -r)/
If the output confirms the presence of fwctl.ko and mlx5_fwctl.ko, manually load the module and verify its status:
$ sudo modprobe mlx5_fwctl
$ lsmod | grep fwctl
Reinstalling the DOCA Host Package
If the fwctl module cannot be found on the system, or if the modprobe command fails to load it, reinstall the DOCA Host package.
-
Download the package (DOCA 3.3.0 example):
$ wget https://www.mellanox.com/downloads/DOCA/DOCA_v3.3.0/host/doca-host_3.3.0-088000-26.01-ubuntu2204_amd64.deb
-
Purge existing DOCA and OFED modules:
$ for f in $( dpkg --list | grep doca | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
$ for f in $( dpkg --list | grep mlnx | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
$ for f in $( dpkg --list | grep dpdk | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
$ for f in $( dpkg --list | grep ofed | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
$ sudo /usr/sbin/ofed_uninstall.sh --force
$ sudo apt-get autoremove
-
Install the new package and restart services:
$ sudo dpkg -i doca-host_3.3.0-088000-26.01-ubuntu2204_amd64.deb
$ sudo apt-get update
$ sudo apt-get -y install doca-all
$ sudo /etc/init.d/openibd restart
Once the reinstallation is complete, confirm the module is successfully loaded according to section "DOCA Telemetry PHY | Verifying the fwctl Driver".
Environment
DOCA Telemetry PHY-based applications can run on:
-
Host machines with ConnectX-7 or BlueField-3 and newer
-
DPU targets with BlueField-3 and newer
Architecture
The DOCA Telemetry PHY library provides access to the following categories of telemetry data:
-
Operation information – Retrieves details related to active link technology and capabilities.
-
Supported information – Retrieves details related to supported link speed capabilities.
-
Troubleshooting information – Retrieves status information specifically for troubleshooting purposes.
-
Module information – Retrieves details related to cable technology and capabilities.
-
Counter and BER information – Retrieves details related to link statistics and Bit Error Rate (BER).
-
FEC histogram information – Retrieves details related to the link's Forward Error Correction (FEC) histogram.
-
Management cable information – Retrieves information related to the cable EEPROM module in raw format (single page or all related pages) or limited to cable Digital Diagnostic Monitoring (DDM).
To interact with a device (which typically corresponds to a specific port on a NIC), you must create a DOCA Telemetry PHY context using doca_telemetry_phy_create().
Each context operates independently of standard DOCA PHY contexts. Consequently, changes made to PHY configurations are not automatically reflected within the telemetry context.
Configuration Phase
Device Support
DOCA Telemetry PHY requires a device to operate. To select a device, refer to "DOCA Core Device Discovery".
As device capabilities may change, it is recommended to check your device using the doca_telemetry_phy_is_supported() method.
Each of the provided DOCA Telemetry PHY API functionalities can be checked independently using the corresponding method:
| Functionality | Method |
|---|---|
| Operation information | |
| Module information | |
| Supported information | |
| Troubleshooting information | |
| Counter and BER information | |
| FEC histogram information | |
| Management cable information | |
Output Structure Format
The calling application is strictly responsible for allocating the necessary output structures. Refer to the doca_telemetry_phy.h header file for exact details on the output structures provided by the doca_telemetry_phy context.
-
Returned values that fall outside the corresponding enumeration range or prescribed limit values must be interpreted as invalid or not applicable, unless expressly indicated otherwise.
-
Returned values that are not explicitly defined, yet fall within the corresponding enumeration range, must be interpreted as reserved or unknown, unless expressly indicated otherwise.
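The two interpretation rules above can be expressed as a small classification routine. The following is a minimal sketch, assuming an enumeration whose valid range is 0 to `range_max` and whose explicitly defined values are supplied by the caller. All names here are hypothetical and do not appear in doca_telemetry_phy.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a returned field value. */
enum value_class {
    VALUE_DEFINED,  /* explicitly defined in the enumeration */
    VALUE_RESERVED, /* inside the range but not defined: reserved or unknown */
    VALUE_INVALID   /* outside the range: invalid or not applicable */
};

/*
 * Classifies raw against an assumed range [0, range_max] and a
 * caller-supplied list of explicitly defined values.
 */
enum value_class classify_value(uint32_t raw, uint32_t range_max,
                                const uint32_t *defined, size_t defined_count)
{
    size_t i;

    if (raw > range_max)
        return VALUE_INVALID;

    for (i = 0; i < defined_count; i++) {
        if (defined[i] == raw)
            return VALUE_DEFINED;
    }

    return VALUE_RESERVED;
}
```

With defined values {0, 1, 4} and a range maximum of 7, the value 2 is classified as reserved and the value 9 as invalid. Fields documented with their own exceptions would need separate handling.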
Output Data Fields
The following table details the specific data fields retrieved by the Telemetry PHY context based on the requested functionality category:
| Category | Field / Functionality | Description |
|---|---|---|
| Operation | | Active protocol (InfiniBand or Ethernet). |
| | | Firmware PHY manager FSM state. |
| | | Physical state (related to InfiniBand or Ethernet). |
| | | Link speed active (related to InfiniBand or Ethernet). |
| | | Link width. |
| | | Active FEC mode. |
| | | Loopback mode. |
| | | Auto-negotiation status. |
| Supported | | Active protocol (InfiniBand or Ethernet). |
| | | Enabled link speed (related to InfiniBand or Ethernet). |
| | | Supported link speed (related to InfiniBand or Ethernet). |
| Troubleshooting | | Status operation code raw value. |
| | | Status operation code. |
| | | Group operation code. |
| | | Status message. |
| Module | | Number of supported module lanes. |
| | | Nominal bit rate in Gb/s. |
| | | Active protocol (InfiniBand or Ethernet). |
| | | Error code response for Control or Set configuration of the data path. |
| | | Cable vendor ID, name, manufacture date, part, serial, and revision numbers. |
| | | Static information (Memory map, cable technology, ID, type, compliance code). |
| | | Cable power and temperature info. |
| | | Active cable information (cable emphasis, wavelength). |
| | | Error counter information (latched Tx fault, Rx loss of signal, flags per lane). |
| | | Status information for the plugged cable. |
| | | Latency information. |
| Counter & BER | | Number of supported module lanes. |
| | | Unintentional link drops counter (no remote consideration). |
| | | Successful recovery events per active link. |
| | | Time passed since the last counters clear event (in msec). |
| | | Active protocol (InfiniBand or Ethernet). |
| | | Error counter caused by invalid bits that were not corrected by PHY correction mechanisms (Valid for InfiniBand only). |
| | | Error counter caused by invalid bits that were not corrected by the FEC algorithm. |
| | | Raw error counter caused by invalid bits received. |
| FEC Histogram | | Available number of bins. |
| | | Range of bin errors distribution. |
| | | Bin errors distribution according to |
| Cable Dump | | Number of pages returned. |
| | | Returned page information (page ID, bytes read from offsets 0 and 128, and size of returned data). |
| Cable DDM | | Number of supported module lanes. |
| | | Temperature parameters (high/low alarm flags, current value, and thresholds in Celsius). |
| | | Module voltage parameters (high/low alarm flags, current value, and thresholds in mV). |
| | | RX power parameters (high/low alarm flags, current value, and thresholds per lane in dBm). |
| | | TX power parameters (high/low alarm flags, current value, and thresholds per lane in dBm). |
| | | TX bias parameters (high/low alarm flags, current value, and thresholds per lane in mA). |
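As an illustration of how the FEC histogram fields in the table above might be consumed, the following is a minimal sketch. It assumes each bin carries a low and high bound on the number of errors and a count of occurrences falling in that range. The structure `sketch_fec_bin` and both functions are hypothetical and do not reflect the actual layout in doca_telemetry_phy.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical representation of one histogram bin. */
struct sketch_fec_bin {
    uint32_t low;   /* lowest error count covered by the bin (assumed) */
    uint32_t high;  /* highest error count covered by the bin (assumed) */
    uint64_t count; /* occurrences that fell into this bin */
};

/* Sums the occurrence counts over all bins. */
uint64_t sketch_fec_total(const struct sketch_fec_bin *bins, size_t num_bins)
{
    uint64_t total = 0;
    size_t i;

    for (i = 0; i < num_bins; i++)
        total += bins[i].count;
    return total;
}

/* Returns the index of the highest bin with a non-zero count,
 * or -1 if every bin is empty. */
int sketch_fec_highest_populated_bin(const struct sketch_fec_bin *bins,
                                     size_t num_bins)
{
    size_t i;

    for (i = num_bins; i > 0; i--) {
        if (bins[i - 1].count != 0)
            return (int)(i - 1);
    }
    return -1;
}
```

Tracking the highest populated bin over time is one way to notice a link whose error distribution is drifting toward the upper bins.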
Execution Phase
To retrieve telemetry data during the execution phase, pass your established context and the target output structure to the appropriate retrieval API.
For every function listed below, you must ensure that memory is properly allocated for the output structure before calling the API.
| Information Category | Retrieval API |
|---|---|
| Operation | |
| Supported | |
| Troubleshooting | |
| Module | |
| Counter and BER | |
| FEC histogram | |
| Cable single page | |
| Cable dump | |
| Cable DDM | |
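The caller-allocates pattern described in this section can be sketched as follows. This is only an illustration under stated assumptions: `sketch_operation_info` and `sketch_get_operation_info()` are hypothetical stand-ins for an output structure and a retrieval API, and the stub fills in placeholder values rather than real telemetry. Refer to doca_telemetry_phy.h for the actual structures and functions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical output structure; the real layout is defined by the library. */
struct sketch_operation_info {
    uint32_t link_width;
    uint32_t loopback_mode;
};

/* Hypothetical stand-in for a retrieval API: it writes into memory
 * provided by the caller and never allocates on its own. */
static int sketch_get_operation_info(struct sketch_operation_info *out)
{
    if (out == NULL)
        return -1;
    out->link_width = 4;    /* placeholder value, not real telemetry */
    out->loopback_mode = 0; /* placeholder value, not real telemetry */
    return 0;
}

/* Allocates the output structure, calls the retrieval function,
 * copies out one field, and releases the structure. */
int sketch_read_link_width(uint32_t *link_width)
{
    struct sketch_operation_info *info;
    int rc;

    if (link_width == NULL)
        return -1;

    info = calloc(1, sizeof(*info));
    if (info == NULL)
        return -1;

    rc = sketch_get_operation_info(info);
    if (rc == 0)
        *link_width = info->link_width;

    free(info);
    return rc;
}
```

Zero-initializing the structure with `calloc` keeps any field the retrieval call does not write in a known state, which makes partial results easier to detect.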
State Machine
The doca_telemetry_phy context transitions through specific operational states. This section outlines these states, the operations permitted within them, and how to transition between them.
Idle
The context has been created and is idle.
In this state, the application is expected to:
-
Destroy the context
-
Start the context for processing
Allowed operations:
-
Configuring the context according to section "DOCA Telemetry PHY | Configuration Phase"
It is possible to reach this state as follows:
| Previous State | Transition Action |
|---|---|
| None | Create the context |
| Running | Call stop |
Running
In this state, the application is expected to:
-
Stop the context.
-
Retrieve operation information.
-
Retrieve supported information.
-
Retrieve troubleshooting information.
-
Retrieve module information.
-
Retrieve counter and BER information.
-
Retrieve FEC histogram information.
-
Retrieve management cable single page information.
-
Retrieve management cable dump information.
-
Retrieve management cable DDM information.
Allowed operations:
-
Calling stop, moving the context to "Idle" state
It is possible to reach this state as follows:
| Previous State | Transition Action |
|---|---|
| Idle | Successfully start the context |
There are currently no state restrictions on the majority of API functions.
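The transitions described in this section can be modeled as a small table-free state function. The following is a minimal sketch for reasoning about application flow, not the library's implementation. All names are hypothetical, the `SKETCH_DESTROYED` state is added for illustration, and destroying from the Running state is rejected only because the text above lists destruction under Idle.

```c
#include <stddef.h>

/* Hypothetical model of the context states. */
enum sketch_state { SKETCH_IDLE, SKETCH_RUNNING, SKETCH_DESTROYED };

/* Hypothetical model of the actions that change state. */
enum sketch_action { SKETCH_START, SKETCH_STOP, SKETCH_DESTROY };

/*
 * Writes the resulting state to *next and returns 0 when the action is
 * one of the transitions listed for the current state. Returns -1 and
 * leaves *next unchanged otherwise.
 */
int sketch_transition(enum sketch_state current, enum sketch_action action,
                      enum sketch_state *next)
{
    if (next == NULL)
        return -1;

    switch (current) {
    case SKETCH_IDLE:
        if (action == SKETCH_START) {
            *next = SKETCH_RUNNING;
            return 0;
        }
        if (action == SKETCH_DESTROY) {
            *next = SKETCH_DESTROYED;
            return 0;
        }
        break;
    case SKETCH_RUNNING:
        if (action == SKETCH_STOP) {
            *next = SKETCH_IDLE;
            return 0;
        }
        break;
    case SKETCH_DESTROYED:
        break;
    }
    return -1;
}
```

A typical lifecycle in this model is start, one or more retrievals while Running, stop, and then destroy from Idle.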
Alternative Datapath Options
DOCA Telemetry PHY supports only CPU-based datapaths.
DOCA Telemetry PHY Sample
Running the Sample
-
Refer to the following documents:
-
DOCA Installation Guide for Linux for details on how to install BlueField-related software.
-
NVIDIA BlueField Platform Software Troubleshooting Guide for any issue you may encounter with the installation, compilation, or execution of DOCA samples.
-
To build a given sample, run the following command. If you downloaded the sample from GitHub, update the path in the first line to reflect the location of the sample file:
cd /opt/mellanox/doca/samples/doca_telemetry/telemetry_phy
meson /tmp/build
ninja -C /tmp/build
The binary doca_telemetry_phy is created under /tmp/build/.
Sample usage:
Usage: doca_telemetry_phy [DOCA Flags] [Program Flags]
DOCA Flags:
-h, --help Print a help synopsis
-v, --version Print program version information
-l, --log-level Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
--sdk-log-level Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
Program Flags:
-p, --pci-addr DOCA device PCI device address
-oi, --get-operation-info Retrieve operation info
-si, --get-supported-info Retrieve supported info
-ti, --get-troubleshooting-info Retrieve troubleshooting info
-mi, --get-module-info Retrieve module info
-bi, --get-counter-ber-info Retrieve counter and BER info
-fi, --get-fec-histogram-info Retrieve FEC Histogram info
-mcspi, --get-management-cable-single-page-info Retrieve management cable single page info
-mcdi, --get-management-cable-dump-info Retrieve management cable dump info
-mcddmi, --get-management-cable-ddm-info Retrieve management cable DDM info
The sample includes:
-
Locating and opening a DOCA device.
-
Creating a doca_telemetry_phy instance.
-
Retrieving and printing operation information.
-
Retrieving and printing supported information.
-
Retrieving and printing troubleshooting information.
-
Retrieving and printing module information.
-
Retrieving and printing counter and BER information.
-
Retrieving and printing FEC histogram information.
-
Retrieving and printing management cable single page information.
-
Retrieving and printing management cable dump information.
-
Retrieving and printing management cable DDM information.
-
Destroying the doca_telemetry_phy context.