DOCA SDK Documentation

DOCA Telemetry PHY

This guide provides instructions for building and developing applications that require telemetry data collection from NVIDIA® BlueField and NVIDIA® ConnectX® families of networking platforms.

Introduction

The DOCA Telemetry PHY library provides access to detailed telemetry data and statistics from NVIDIA BlueField and ConnectX networking platforms. It allows developers to monitor and analyze link, cable, and port status and statistics.

DOCA Telemetry PHY is currently supported at the alpha level.

Prerequisites

To utilize DOCA Telemetry PHY, your system must meet the following baseline requirements:

  • Firmware: Version 28.43.1000 or later for ConnectX-7, 32.43.1000 or later for BlueField-3, and 40.43.1000 or later for ConnectX-8 devices.

  • Driver: The fwctl driver must be fully installed and actively loaded on the system.
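The firmware requirement can be checked with a small helper. This is a minimal sketch, not part of DOCA: the function name fw_meets_minimum is hypothetical, and it assumes the version string has the form MAJOR.MINOR.SUB, where the leading 28, 32, or 40 identifies the device family listed above. The version string itself has to be obtained separately (on many systems the firmware-version line of `ethtool -i <interface>` reports it, but that is an assumption about your setup).

```shell
# Return 0 if a firmware version such as "28.43.1000" meets the minimum
# stated above: device prefix 28, 32, or 40 and release >= 43.1000.
# Hypothetical helper for illustration only.
fw_meets_minimum() {
    # Require at least the MAJOR.MINOR.SUB shape.
    case "$1" in
        *.*.*) ;;
        *) return 1 ;;
    esac
    _major=${1%%.*}
    _rest=${1#*.}
    _minor=${_rest%%.*}
    _sub=${_rest#*.}
    # Only the three device prefixes named in the prerequisites are accepted.
    case "$_major" in
        28|32|40) ;;
        *) return 1 ;;
    esac
    # Reject empty or non-numeric components (this also rejects extra fields).
    for _part in "$_minor" "$_sub"; do
        case "$_part" in
            ''|*[!0-9]*) return 1 ;;
        esac
    done
    [ "$_minor" -gt 43 ] || { [ "$_minor" -eq 43 ] && [ "$_sub" -ge 1000 ]; }
}

# Usage with two example version strings.
for v in 28.43.1000 32.42.1000; do
    if fw_meets_minimum "$v"; then
        echo "$v: meets the minimum"
    else
        echo "$v: too old or unsupported"
    fi
done
```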

Verifying the fwctl Driver

To verify that the fwctl driver is successfully loaded, check the device directories: 

$ ls /sys/class/fwctl/
$ ls /dev/fwctl

The expected output for a standard 2-port device is fwctl0 fwctl1.
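The same check can be scripted. The following is a minimal sketch; the function name fwctl_entries is hypothetical, and the only assumption is that device entries are named fwctlN as in the expected output above.

```shell
# List fwctl device entries under a sysfs-style class directory.
# Prints one entry name per line; returns 1 if the directory is
# missing or contains no fwctl* entries.
fwctl_entries() {
    _dir="${1:-/sys/class/fwctl}"
    [ -d "$_dir" ] || return 1
    _found=1
    for _entry in "$_dir"/fwctl*; do
        # An unmatched glob stays literal, so confirm the entry exists.
        [ -e "$_entry" ] || continue
        basename "$_entry"
        _found=0
    done
    return "$_found"
}

# Usage: report the result without aborting when the driver is absent.
if entries=$(fwctl_entries); then
    echo "fwctl loaded:" $entries
else
    echo "fwctl not loaded (no entries under /sys/class/fwctl)"
fi
```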

Manually Loading the Driver

If the directories /sys/class/fwctl or /dev/fwctl do not exist or are empty, the module may be installed but inactive.

Check for the module's presence:

$ grep fwctl -R /lib/modules/$(uname -r)/

If the output confirms the presence of fwctl.ko and mlx5_fwctl.ko, manually load the module and verify its status:

$ sudo modprobe mlx5_fwctl
$ lsmod | grep fwctl
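The presence check can be made explicit so that a script can decide between loading the module and reinstalling the package. This is a sketch under assumptions: the function name fwctl_modules_installed is hypothetical, and the `.ko*` pattern is used because some distributions ship compressed modules (for example fwctl.ko.xz or fwctl.ko.zst).

```shell
# Report whether both fwctl kernel modules exist under a modules tree.
# $1 = modules root (defaults to the running kernel's module directory).
fwctl_modules_installed() {
    _root="${1:-/lib/modules/$(uname -r)}"
    for _mod in fwctl mlx5_fwctl; do
        # -name matches the whole file name, so "fwctl.ko*" does not
        # match "mlx5_fwctl.ko".
        if [ -z "$(find "$_root" -name "${_mod}.ko*" 2>/dev/null | head -n 1)" ]; then
            echo "missing: ${_mod}.ko"
            return 1
        fi
    done
    echo "present: fwctl.ko mlx5_fwctl.ko"
}

# Usage: pick the next step described in this guide.
if fwctl_modules_installed; then
    echo "next step: sudo modprobe mlx5_fwctl"
else
    echo "next step: reinstall the DOCA Host package"
fi
```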

Reinstalling the DOCA Host Package

If the fwctl module files cannot be found, or if the modprobe command fails to load the module, reinstall the DOCA Host package.

  1. Download the package (DOCA 3.3.0 example):

    $ wget https://www.mellanox.com/downloads/DOCA/DOCA_v3.3.0/host/doca-host_3.3.0-088000-26.01-ubuntu2204_amd64.deb

  2. Purge existing DOCA and OFED modules:

    $ for f in $( dpkg --list | grep doca | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
    $ for f in $( dpkg --list | grep mlnx | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
    $ for f in $( dpkg --list | grep dpdk | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
    $ for f in $( dpkg --list | grep ofed | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
    $ sudo /usr/sbin/ofed_uninstall.sh --force
    $ sudo apt-get autoremove

  3. Install the new package and restart services:

    $ sudo dpkg -i doca-host_3.3.0-088000-26.01-ubuntu2204_amd64.deb
    $ sudo apt-get update
    $ sudo apt-get -y install doca-all
    $ sudo /etc/init.d/openibd restart

Once the reinstallation is complete, confirm that the module is loaded as described in "Verifying the fwctl Driver".
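Before purging, it can help to preview which packages the purge step would remove. The sketch below is not part of the DOCA tooling: the function name doca_purge_candidates is hypothetical, and it only reads `dpkg --list` output from standard input and prints the matching package names, using the same four patterns (doca, mlnx, dpdk, ofed) as the purge commands.

```shell
# Print the unique package names (column 2 of `dpkg --list` output read
# from stdin) on lines that mention doca, mlnx, dpdk, or ofed.
# Nothing is removed; this only lists candidates.
doca_purge_candidates() {
    awk '/doca|mlnx|dpdk|ofed/ {print $2}' | sort -u
}

# Usage: preview the candidates on a Debian-based host.
if command -v dpkg >/dev/null 2>&1; then
    dpkg --list | doca_purge_candidates
fi
```

Reviewing the list first avoids removing an unrelated package whose name or description happens to contain one of the patterns.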

Environment

DOCA Telemetry PHY-based applications can run on:

  • Host machines with ConnectX-7 or BlueField-3 and newer

  • DPU targets with BlueField-3 and newer

Architecture

The DOCA Telemetry PHY library provides access to the following categories of telemetry data:

  • Operation information – Retrieves details related to active link technology and capabilities.

  • Supported information – Retrieves details related to supported link speed capabilities.

  • Troubleshooting information – Retrieves status information specifically for troubleshooting purposes.

  • Module information – Retrieves details related to cable technology and capabilities.

  • Counter and BER information – Retrieves details related to link statistics and Bit Error Rate (BER).

  • FEC histogram information – Retrieves details related to the link's Forward Error Correction (FEC) histogram.

  • Management cable information – Retrieves information related to the cable EEPROM module in raw format (single page or all related pages) or limited to cable Digital Diagnostic Monitoring (DDM).

To interact with a device (which typically corresponds to a specific port on a NIC), you must create a DOCA Telemetry PHY context using doca_telemetry_phy_create().

Each context operates independently of standard DOCA PHY contexts. Consequently, changes made to PHY configurations are not automatically reflected within the telemetry context.

Configuration Phase

Device Support

DOCA Telemetry PHY requires a device to operate. To select a device, refer to "DOCA Core Device Discovery".

Because device capabilities may change, it is recommended to check your device using the doca_telemetry_phy_is_supported() method.

Each DOCA Telemetry PHY functionality can also be checked independently using the corresponding method:

Functionality                 Method
----------------------------  ----------------------------------------------------------
Operation information         doca_telemetry_phy_cap_operation_info_is_supported()
Module information            doca_telemetry_phy_cap_module_info_is_supported()
Supported information         doca_telemetry_phy_cap_supported_info_is_supported()
Troubleshooting information   doca_telemetry_phy_cap_troubleshooting_info_is_supported()
Counter and BER information   doca_telemetry_phy_cap_counter_and_ber_info_is_supported()
FEC histogram information     doca_telemetry_phy_cap_fec_histogram_info_is_supported()
Management cable information  doca_telemetry_phy_cap_management_cable_is_supported()

Output Structure Format

The calling application is strictly responsible for allocating the necessary output structures. Refer to the doca_telemetry_phy.h header file for exact details on the output structures provided by the doca_telemetry_phy context.

Data Interpretation Rules

  • Returned values that fall outside the corresponding enumeration range or prescribed limit values must be interpreted as invalid or not applicable, unless expressly indicated otherwise.

  • Returned values that are not explicitly defined, yet fall within the corresponding enumeration range, must be interpreted as reserved or unknown, unless expressly indicated otherwise.
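The two rules can be expressed as a small classifier. This is an illustrative sketch only: the function name classify_value is hypothetical, and the enumeration range and defined values in the usage lines are made up for the example. The real ranges and values come from doca_telemetry_phy.h.

```shell
# Classify a raw field value against an enumeration.
#   $1 = value, $2 = lowest enum value, $3 = highest enum value,
#   remaining arguments = values that are explicitly defined.
classify_value() {
    _val=$1
    _min=$2
    _max=$3
    shift 3
    # Rule 1: outside the enumeration range.
    if [ "$_val" -lt "$_min" ] || [ "$_val" -gt "$_max" ]; then
        echo "invalid-or-not-applicable"
        return 0
    fi
    for _defined in "$@"; do
        if [ "$_val" -eq "$_defined" ]; then
            echo "defined"
            return 0
        fi
    done
    # Rule 2: inside the range but not explicitly defined.
    echo "reserved-or-unknown"
}

# Usage with a made-up enumeration spanning 0..7 that defines 0, 1, 2, and 4.
classify_value 2 0 7 0 1 2 4
classify_value 3 0 7 0 1 2 4
classify_value 9 0 7 0 1 2 4
```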

Output Data Fields

The following table details the specific data fields retrieved by the Telemetry PHY context based on the requested functionality category:

Category         Field / Functionality          Description
---------------  -----------------------------  ------------------------------------------------------------
Operation        active_protocol                Active protocol (InfiniBand or Ethernet).
                 state                          Firmware PHY manager FSM state.
                 phy_state                      Physical state (related to InfiniBand or Ethernet).
                 link_speed_active              Link speed active (related to InfiniBand or Ethernet).
                 link_width                     Link width.
                 fec_mode_active                Active FEC mode.
                 loopback_mode                  Loopback mode.
                 auto_negotiation               Auto-negotiation status.

Supported        active_protocol                Active protocol (InfiniBand or Ethernet).
                 enabled_link_speed             Enabled link speed (related to InfiniBand or Ethernet).
                 supported_cable_speed          Supported link speed (related to InfiniBand or Ethernet).

Troubleshooting  status_opcode_raw              Status operation code raw value.
                 status_opcode                  Status operation code.
                 group_opcode                   Group operation code.
                 status_message                 Status message.

Module           number_of_lanes                Number of supported module lanes.
                 nominal_bit_rate               Nominal bit rate in Gb/s.
                 active_protocol                Active protocol (InfiniBand or Ethernet).
                 error_code                     Error code response for Control or Set configuration of the data path.
                 cable_vendor_info              Cable vendor ID, name, manufacture date, part, serial, and revision numbers.
                 cable_general_properties_info  Static information (memory map, cable technology, ID, type, compliance code).
                 cable_power_and_temp_info      Cable power and temperature info.
                 active_cable_info              Active cable information (cable emphasis, wavelength).
                 error_counter_info             Error counter information (latched Tx fault, Rx loss of signal, flags per lane).
                 status_info                    Status information for the plugged cable.
                 latency_info                   Latency information.

Counter & BER    number_of_lanes                Number of supported module lanes.
                 link_down_counter              Unintentional link drops counter (no remote consideration).
                 link_down_recovery_counter     Successful recovery events per active link.
                 time_since_last_clear          Time passed since the last counters clear event (in msec).
                 active_protocol                Active protocol (InfiniBand or Ethernet).
                 symbol_errors                  Error counter caused by invalid bits that were not corrected by PHY correction mechanisms (valid for InfiniBand only).
                 effective_errors               Error counter caused by invalid bits that were not corrected by the FEC algorithm.
                 raw_errors                     Raw error counter caused by invalid bits received.

FEC Histogram    num_of_bins                    Available number of bins.
                 bin_range                      Range of bin errors distribution.
                 bin_errors                     Bin errors distribution according to bin_range.

Cable Dump       num_of_pages                   Number of pages returned.
                 page_info                      Returned page information (page ID, bytes read from offsets 0 and 128, and size of returned data).

Cable DDM        number_of_lanes                Number of supported module lanes.
                 module_temperature             Temperature parameters (high/low alarm flags, current value, and thresholds in Celsius).
                 module_voltage                 Module voltage parameters (high/low alarm flags, current value, and thresholds in mV).
                 rx_power                       RX power parameters (high/low alarm flags, current value, and thresholds per lane in dBm).
                 tx_power                       TX power parameters (high/low alarm flags, current value, and thresholds per lane in dBm).
                 tx_bias                        TX bias parameters (high/low alarm flags, current value, and thresholds per lane in mA).

Execution Phase

To retrieve telemetry data during the execution phase, pass your established context and the target output structure to the appropriate retrieval API. 

Warning: Memory Allocation

For every function listed below, you must ensure that memory is properly allocated for the output structure before calling the API.

Information Category  Retrieval API
--------------------  ------------------------------------------------------------------------------------------------
Operation             doca_telemetry_phy_get_operation_info(context, &operation_info);
Supported             doca_telemetry_phy_get_supported_info(context, &supported_info);
Troubleshooting       doca_telemetry_phy_get_troubleshooting_info(context, &troubleshooting_info);
Module                doca_telemetry_phy_get_module_info(context, &module_info);
Counter and BER       doca_telemetry_phy_get_counter_and_ber_info(context, &counter_and_ber_info);
FEC histogram         doca_telemetry_phy_get_fec_histogram_info(context, &fec_histogram_info);
Cable single page     doca_telemetry_phy_get_management_cable_single_page_info(context, &management_cable_raw_info);
Cable dump            doca_telemetry_phy_get_management_cable_dump_info(context, &management_cable_raw_info);
Cable DDM             doca_telemetry_phy_get_management_cable_ddm_info(context, &management_cable_ddm_info);

State Machine

The doca_telemetry_phy context transitions through specific operational states. This section outlines these states, the operations permitted within them, and how to transition between them.

Idle

The context has been created and is idle.

In this state, the application is expected to do one of the following:

  • Destroy the context

  • Start the context for processing

Allowed operations:

  • Calling start, moving the context to "Running" state

It is possible to reach this state as follows:

Previous State  Transition Action
--------------  ------------------
None            Create the context
Running         Call stop

Running

In this state, the application is expected to perform any of the following:

  1. Stop the context.

  2. Retrieve operation information.

  3. Retrieve supported information.

  4. Retrieve troubleshooting information.

  5. Retrieve module information.

  6. Retrieve counter and BER information.

  7. Retrieve FEC histogram information.

  8. Retrieve management cable single page information.

  9. Retrieve management cable dump information.

  10. Retrieve management cable DDM information.

Allowed operations:

  • Calling stop, moving the context to "Idle" state

It is possible to reach this state as follows:

Previous State  Transition Action
--------------  ------------------------------
Idle            Successfully start the context

There are currently no state restrictions on the majority of API functions.
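The transitions described in this section can be summarized as a lookup. This is a conceptual sketch, not DOCA code: the function name next_state and the lowercase state and action labels are hypothetical, and treating "destroy" as a return to the "None" state is an assumption based on the Idle description.

```shell
# Return the state that follows an action, or fail (status 1) if the
# action is not listed for that state in this section.
#   $1 = current state (none, idle, running)
#   $2 = action (create, start, stop, destroy)
next_state() {
    case "$1:$2" in
        none:create)   echo idle ;;
        idle:start)    echo running ;;
        idle:destroy)  echo none ;;
        running:stop)  echo idle ;;
        *) return 1 ;;
    esac
}

# Usage: walk through a full context lifecycle.
state=none
for action in create start stop destroy; do
    state=$(next_state "$state" "$action")
    echo "$action -> $state"
done
```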

Alternative Datapath Options

DOCA Telemetry PHY supports only CPU-based datapaths.

DOCA Telemetry PHY Sample

Running the Sample

  1. Refer to the following documents:

    1. DOCA Installation Guide for Linux for details on how to install BlueField-related software.

    2. NVIDIA BlueField Platform Software Troubleshooting Guide for any issue you may encounter with the installation, compilation, or execution of DOCA samples.

  2. To build a given sample, run the following command. If you downloaded the sample from GitHub, update the path in the first line to reflect the location of the sample file:

    cd /opt/mellanox/doca/samples/doca_telemetry/telemetry_phy
    meson /tmp/build
    ninja -C /tmp/build
    

    The binary doca_telemetry_phy is created under /tmp/build/.

Sample usage: 

Usage: doca_telemetry_phy [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help                        Print a help synopsis
  -v, --version                     Print program version information
  -l, --log-level                   Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  --sdk-log-level                   Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>

Program Flags:
  -p, --pci-addr                    DOCA device PCI device address
  -oi, --get-operation-info         Retrieve operation info
  -si, --get-supported-info         Retrieve supported info
  -ti, --get-troubleshooting-info   Retrieve troubleshooting info
  -mi, --get-module-info            Retrieve module info
  -bi, --get-counter-ber-info       Retrieve counter and BER info
  -fi, --get-fec-histogram-info     Retrieve FEC Histogram info
  -mcspi, --get-management-cable-single-page-info Retrieve management cable single page info
  -mcdi, --get-management-cable-dump-info Retrieve management cable dump info
  -mcddmi, --get-management-cable-ddm-info Retrieve management cable DDM info
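To query every category in turn, the long options from the usage text can be generated in a loop. This is a minimal sketch: the function name phy_sample_commands is hypothetical, the PCI address 0000:03:00.0 is a placeholder, and the function only prints the command lines so they can be reviewed before being run.

```shell
# Print one sample invocation per telemetry category.
#   $1 = path to the sample binary, $2 = PCI address of the DOCA device.
# The long option names are taken from the usage text above.
phy_sample_commands() {
    _bin=$1
    _pci=$2
    for _opt in operation supported troubleshooting module counter-ber \
                fec-histogram management-cable-single-page \
                management-cable-dump management-cable-ddm; do
        echo "$_bin --pci-addr $_pci --get-${_opt}-info"
    done
}

# Usage: placeholder PCI address; replace it with the address of your device.
phy_sample_commands /tmp/build/doca_telemetry_phy 0000:03:00.0
```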

The sample includes:

  1. Locating and opening a DOCA device.

  2. Creating a doca_telemetry_phy instance.

  3. Retrieving and printing operation information.

  4. Retrieving and printing supported information.

  5. Retrieving and printing troubleshooting information.

  6. Retrieving and printing module information.

  7. Retrieving and printing counter and BER information.

  8. Retrieving and printing FEC histogram information.

  9. Retrieving and printing management cable single page information.

  10. Retrieving and printing management cable dump information.

  11. Retrieving and printing management cable DDM information.

  12. Destroying the doca_telemetry_phy context.
