DOCA SDK Documentation

DOCA Telemetry Traceback Application

Introduction

The DOCA Telemetry Traceback application provides a framework for monitoring telemetry data to identify long-term trends locally on the device. Because funneling massive volumes of raw telemetry data to a centralized data store can overwhelm network resources at a data-center scale, this application processes data directly on the NVIDIA® BlueField® DPU or NVIDIA® ConnectX® NIC.

The application gathers telemetry data into a circular buffer, retaining the last N samples from the device. This historical data is continuously evaluated against user-defined rules to detect specific "events" (anomalies or notable threshold breaches). When an event triggers, the application writes the preceding historical data and all subsequent active event data to a log file.

The included rules are purposefully simplistic, as this application is intended to serve as a reference traceback model rather than a fully featured data processing engine.

System Design

The doca_telemetry_traceback application runs on top of an NVIDIA® BlueField® DPU or NVIDIA® ConnectX® NIC.

Application Architecture

The doca_telemetry_traceback application is a standalone service composed of four primary components:

  1. DOCA Telemetry Diag (doca_telemetry_diag) instance – The underlying data provider. This instance is configured using parameters from the application's JSON configuration file (e.g., sample period, data IDs) alongside hardcoded application values. 

  2. Sample history circular buffer (samples_circular_buffer) – A contiguous block of memory that stores the telemetry samples. Each record is stored as timestamp, data_id[0], ... data_id[N]. To ensure there is an equal amount of historical data and active event data, the buffer size is automatically doubled based on the user's requested depth.

  3. Rule (rule) collection – A set of logical monitors. Each rule monitors a specific data_id against defined logic. When the logic conditions are met, an "event start" is triggered. The rule remains active until new data arrives that no longer meets the conditions, triggering an "event end".

  4. Events log file – The output destination for event data. Upon execution, the application logs its configuration so that the exact parameters behind each event are on record. When an event starts, the log captures the available history, an "event started" marker, all data collected during the event, and an "event end" marker.

    If multiple events activate simultaneously, individual begin/end rows are inserted to track their respective lifecycles. Data logging pauses once all events conclude and resumes upon the next activation.
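The sample history buffer described above can be pictured with a minimal C++ sketch. The SampleRing type, its member names, and the use of std::vector are illustrative assumptions and not the application's actual data structure; only the record layout (a timestamp followed by one value per data ID) and the doubled capacity come from the description. The sketch assumes a non-zero depth and that callers only ask for records that have been written.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical ring of fixed-width records: [timestamp, value_0, ..., value_{n-1}].
class SampleRing {
public:
    SampleRing(std::size_t history_depth, std::size_t num_data_ids)
        : record_width_(1 + num_data_ids),
          capacity_(2 * history_depth),  // doubled, as the text describes
          storage_(capacity_ * record_width_, 0) {}

    // Writes one record, overwriting the oldest one once the ring is full.
    void push(std::uint64_t timestamp, const std::vector<std::uint64_t>& values) {
        std::uint64_t* rec = &storage_[head_ * record_width_];
        rec[0] = timestamp;
        for (std::size_t i = 0; i + 1 < record_width_ && i < values.size(); ++i)
            rec[1 + i] = values[i];
        head_ = (head_ + 1) % capacity_;
        if (count_ < capacity_) ++count_;
    }

    // age 0 is the newest record, age 1 the one before it; requires age < size().
    const std::uint64_t* record(std::size_t age) const {
        const std::size_t idx = (head_ + capacity_ - 1 - age) % capacity_;
        return &storage_[idx * record_width_];
    }

    std::size_t size() const { return count_; }
    std::size_t capacity() const { return capacity_; }

private:
    std::size_t record_width_;
    std::size_t capacity_;
    std::vector<std::uint64_t> storage_;
    std::size_t head_ = 0;   // next slot to write
    std::size_t count_ = 0;  // number of valid records
};
```

With a requested depth of 2 the ring holds 4 records, so a fifth push overwrites the oldest one while the newest stays reachable as record(0).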

[Figure: application architecture diagram (0001-application-architecture.png)]

Traceback Configuration File

The application retrieves its operational parameters from a JSON configuration file.

Example configuration:

{
  "diag_sample_period_ns": 100000,
  "sample_history_depth": 10000,
  "data-ids": [
    {
      "name": "port_0_rx_bytes",
      "data_id": "0x1020000100000000"
    }
  ],
  "rules": [
    {
      "name": "Rx Bytes/s > 1MeBs",
      "data_id": "0x1020000100000000",
      "comparison_type": "differential_greater_than",
      "comparison_value": 1048576
    }
  ]
}
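As a worked check of the example values, the short C++ sketch below computes the time span that a differential rule compares across. The function name differential_window_ns is a hypothetical helper, not part of the application. It relies on the rule description later in this page, where the newest sample is compared with the sample at [N-(sample_history_depth-1)], which makes the two samples sample_history_depth - 1 periods apart.

```cpp
#include <cstdint>

// Time spanned by the differential comparison, in nanoseconds.
// The two compared samples are (history_depth - 1) sample periods apart.
constexpr std::uint64_t differential_window_ns(std::uint64_t sample_period_ns,
                                               std::uint64_t history_depth) {
    return sample_period_ns * (history_depth - 1);
}

// Values from the example configuration: 100000 ns period, depth 10000.
constexpr std::uint64_t kExampleWindowNs = differential_window_ns(100000, 10000);
static_assert(kExampleWindowNs == 999900000ULL, "just under one second");
```

Under these assumptions the example rule fires when the byte counter grows by more than 1048576 bytes (1 MiB) over roughly one second.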

| Field | Type | Notes |
|---|---|---|
| diag_sample_period_ns | Integer | The sampling interval in nanoseconds. This is passed directly to the diag instance via doca_telemetry_diag_set_sample_period. |
| sample_history_depth | Integer | The number of historical records a rule evaluates. The application automatically provisions a circular buffer double this size. |
| data-ids | Array | A collection of data ID definitions. |
| data-ids[N].name | String | A friendly name for the data_id, used as the column header in the events log. |
| data-ids[N].data_id | String | A 16-character hexadecimal value. Format: 0x[0-9a-fA-F]{16}. |
| rules | Array | A collection of rule definitions. |
| rules[N].name | String | A friendly name used to identify the rule in the events log. |
| rules[N].data_id | String | The 16-character hexadecimal data_id this rule monitors. Must match an ID defined in the data-ids array. |
| rules[N].comparison_type | String | Defines the rule's operational logic. Acceptable values: incremental_greater_than (inc_gt), differential_greater_than (dt_gt), differential_less_than (dt_lt), greater_than (gt), less_than (lt). See "Rule Definitions" for details. |
| rules[N].comparison_value | Integer | The numerical threshold used when evaluating the rule. See "Rule Definitions" for details. |
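The data_id format stated above (0x followed by exactly 16 hexadecimal digits) can be checked with a few lines of C++. This is a minimal sketch under the assumption that the string is converted to a 64-bit integer after validation; parse_data_id is a hypothetical helper name and is not taken from the application source.

```cpp
#include <cstdint>
#include <regex>
#include <string>

// Returns true and stores the numeric value when text matches 0x[0-9a-fA-F]{16}.
// Leaves out untouched when the text is rejected.
bool parse_data_id(const std::string& text, std::uint64_t& out) {
    static const std::regex pattern("^0x[0-9a-fA-F]{16}$");
    if (!std::regex_match(text, pattern)) return false;
    // Exactly 16 hex digits always fit in 64 bits, so stoull cannot overflow here.
    out = std::stoull(text.substr(2), nullptr, 16);
    return true;
}
```

Rejecting short or unprefixed strings up front keeps a typo in the JSON file from silently selecting a different counter.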

Rule Definitions

The application supports five distinct rule types. Each rule establishes its own hysteresis logic, defining precisely when an event activates and deactivates.

Incremental Greater Than (inc_gt)

Designed for incrementally increasing values to guarantee triggering each time a value increases by the comparison threshold.

Activation Trigger

Activates if the sampled value exceeds the calculated setpoint (Initial Setpoint = First Sampled Value + comparison_value).

Deactivation Trigger

Deactivates immediately once any new sampled value exceeds the current setpoint, and a new setpoint is calculated.

A single sample that overshoots the setpoint by several multiples of comparison_value still activates the rule only once.

Differential Greater Than (dt_gt)

Evaluates the delta between the current sample and an older historical sample. Requires at least sample_history_depth values in the buffer before functioning.

Activation Trigger

Activates if the delta between the newest sample and the sample at [N-(sample_history_depth-1)] exceeds the comparison_value.

Deactivation Trigger

Deactivates only after sample_history_depth consecutive samples are acquired where each new delta remains below the comparison_value.

Differential Less Than (dt_lt)

Operates using the identical historical delta logic as dt_gt, but activates when the delta is less than the comparison_value.
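The differential logic can be sketched in C++ as a small state machine. The class name, the RuleChange enum, and the use of std::deque are assumptions made for illustration; the application's own rule classes may be organized differently. The sketch also assumes a non-decreasing counter (so the unsigned subtraction is safe) and treats a delta equal to comparison_value as "not exceeding". A dt_lt variant would flip the comparison.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

enum class RuleChange { None, Started, Ended };

class DifferentialGreaterThan {
public:
    DifferentialGreaterThan(std::size_t depth, std::uint64_t threshold)
        : depth_(depth), threshold_(threshold) {}

    RuleChange process_sample(std::uint64_t value) {
        window_.push_back(value);
        if (window_.size() > depth_) window_.pop_front();
        if (window_.size() < depth_) return RuleChange::None;  // not enough history yet

        // front() is the sample at [N-(depth-1)]; assumes a non-decreasing counter.
        const std::uint64_t delta = window_.back() - window_.front();
        const bool exceeds = delta > threshold_;

        if (!active_) {
            if (!exceeds) return RuleChange::None;
            active_ = true;
            quiet_samples_ = 0;
            return RuleChange::Started;
        }
        if (exceeds) {  // still breaching: restart the quiet count
            quiet_samples_ = 0;
            return RuleChange::None;
        }
        if (++quiet_samples_ < depth_) return RuleChange::None;
        active_ = false;  // depth_ consecutive quiet samples were seen
        return RuleChange::Ended;
    }

    bool active() const { return active_; }

private:
    std::size_t depth_;
    std::uint64_t threshold_;
    std::deque<std::uint64_t> window_;
    bool active_ = false;
    std::size_t quiet_samples_ = 0;
};
```

Requiring a full run of quiet samples before ending the event is what gives the rule its hysteresis: a brief dip below the threshold does not close the event.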

Greater Than (gt)

Designed strictly for values that can fluctuate (increase and decrease). It is not suitable for metrics that only increment (e.g., total bytes transferred).

Activation Trigger

Activates when the sampled value exceeds the comparison_value for sample_history_depth consecutive samples.

Deactivation Trigger

Deactivates only after sample_history_depth consecutive samples fall below the comparison_value.

Less Than (lt)

Operates using the identical fluctuation logic as gt, but activates when the sampled value remains below the comparison_value for the required consecutive depth.
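The consecutive-sample logic of gt can be sketched in C++ with a single streak counter. The class and enum names are hypothetical, and treating a value equal to comparison_value as "below" is an assumption, since the text does not specify the boundary case. An lt variant would invert the comparison.

```cpp
#include <cstddef>
#include <cstdint>

enum class RuleChange { None, Started, Ended };

class GreaterThan {
public:
    GreaterThan(std::size_t depth, std::uint64_t threshold)
        : depth_(depth), threshold_(threshold) {}

    RuleChange process_sample(std::uint64_t value) {
        const bool above = value > threshold_;
        if (above == active_) {
            streak_ = 0;  // sample agrees with the current state: run interrupted
            return RuleChange::None;
        }
        // Sample argues for flipping the state; flip after depth_ in a row.
        if (++streak_ < depth_) return RuleChange::None;
        active_ = !active_;
        streak_ = 0;
        return active_ ? RuleChange::Started : RuleChange::Ended;
    }

    bool active() const { return active_; }

private:
    std::size_t depth_;
    std::uint64_t threshold_;
    bool active_ = false;
    std::size_t streak_ = 0;
};
```

Because the streak resets whenever a sample agrees with the current state, an isolated spike or dip shorter than sample_history_depth samples never changes the rule's state.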

DOCA Libraries

This application leverages the following DOCA library:

  • DOCA Telemetry Diag

Refer to the library's programming guide for details on its capabilities.

Compiling the Application

Please refer to the DOCA Installation Guide for Linux for details on how to install BlueField-related software.

DOCA reference applications are installed with full source code and build instructions. This allows you to compile them as-is or modify the source code to create custom versions.

For more information about the applications as well as development and compilation tips, refer to the DOCA Reference Applications page.

The source code for this application is located at:

/opt/mellanox/doca/applications/telemetry/traceback/

Compiling All Applications

By default, the DOCA Meson project is configured to build every reference application simultaneously.

$ cd /opt/mellanox/doca/applications/
$ meson /tmp/build
$ ninja -C /tmp/build

The compiled binaries will be generated in /tmp/build/telemetry/.

Compiling Only the Current Application

To significantly reduce build times, administrators can configure Meson to isolate and compile only the Telemetry Traceback application. This can be achieved via CLI flags or by modifying the configuration file.

Option 1: Command Line Configuration

Append the following flags to the meson setup command to disable all other applications and explicitly enable Traceback:

$ cd /opt/mellanox/doca/applications/
$ meson /tmp/build -Denable_all_applications=false -Denable_telemetry=true -Denable_telemetry_traceback=true
$ ninja -C /tmp/build

Option 2: Configuration File 

You can persistently isolate the build by editing the meson_options.txt file directly.

  1. Open /opt/mellanox/doca/applications/meson_options.txt and set the following parameters:

    enable_all_applications = false
    enable_telemetry = true
    enable_telemetry_traceback = true

    Configuration Rule

    If enable_all_applications is set to false, the main enable_telemetry flag must be set to true for any underlying telemetry application to compile successfully. The individual applications default to true, but you can toggle them on or off as needed once the main telemetry flag is enabled.

  2. Once the file is saved, run the standard compilation commands:

    $ cd /opt/mellanox/doca/applications/
    $ meson /tmp/build
    $ ninja -C /tmp/build
    

Running the Application

Application Execution

The DOCA Telemetry Traceback application is distributed as source code and must be compiled prior to execution.

Execution syntax: 

$ doca_telemetry_traceback [DOCA Flags] [Program Flags]

Execution example:

$ sudo doca_telemetry_traceback -d 03:00.0 --rules my_rules_file.json

Root Privileges

The application strictly requires sudo (root privileges) to access hardware-level telemetry data on the device.

PCIe Addressing

Ensure the -d parameter (e.g., 03:00.0) exactly matches the physical PCIe address of your target device.

Command Line Flags

General Flags

| Short Flag | Long Flag | Description |
|---|---|---|
| -h | --help | Prints a help synopsis and exits |
| -v | --version | Prints program version information and exits |
| -l | --log-level | Sets the numeric log level for the application: 10 (DISABLE), 20 (CRITICAL), 30 (ERROR), 40 (WARNING), 50 (INFO), 60 (DEBUG), 70 (TRACE, requires compilation with TRACE support) |
| N/A | --sdk-log-level | Sets the SDK numeric log level using the same 10-70 scale as above |
| N/A | --log-filter | Filters logs from specific modules (comma-separated list) |
| -j | --json | Parses command-line flags from a specified input JSON file |

Refer to DOCA Arg Parser for more information regarding the supported flags and execution modes.

Program Flags

| Short Flag | Long Flag | Description |
|---|---|---|
| -d | --device | Mandatory flag. The exact PCIe address of the device to read telemetry information from. |
| -r | --rules | The path to a JSON file containing the application configuration. See "Traceback Configuration File". |
| N/A | --event-log | Path to the output log file, which is populated with event information |
| N/A | --force-ownership | Forces ownership of the diag domain. Only applies when the device firmware is in single domain mode. |

Unsupported Data IDs

It is not guaranteed that all data IDs supported by the application are supported by your specific device. The application will report any unsupported IDs, requiring administrators to adjust their inputs accordingly.

Troubleshooting

Refer to the NVIDIA BlueField Platform Software Troubleshooting Guide for any issue encountered with the compilation, installation, or execution of the DOCA applications.

Application Code Flow

The DOCA Telemetry Traceback application executes in three distinct operational phases: argument parsing, component initialization, and a continuous data polling loop.

Phase 1: Argument Parsing

The application initializes the logger and parses both CLI arguments and JSON configurations to establish its operational parameters.

  • create_doca_logger_backend(): Creates the logger backend to display system messages.

  • parse_configuration(): Parses the CLI arguments. Values unique to the application instance are passed via the CLI, while values shared across a multi-server deployment are deferred to the JSON file.

  • load_file_into_buffer() and parse_json_configuration(): Loads the shared configuration from the defined JSON file (defaults to rules.json if not overridden via CLI).

  • validate_configuration(): Performs initial validity checks to catch expected configuration errors before allocating hardware resources.
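The kind of early checks that a validation step can perform is sketched below in C++. The struct layout, the function name check_configuration, and the specific checks and error strings are all assumptions chosen for illustration; the text only states that validity checks run before hardware resources are allocated and that each rule's data_id must match an entry in data-ids.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory form of the JSON configuration.
struct RuleConfig {
    std::string name;
    std::uint64_t data_id;
};

struct TracebackConfig {
    std::uint64_t sample_period_ns = 0;
    std::uint64_t sample_history_depth = 0;
    std::vector<std::uint64_t> data_ids;
    std::vector<RuleConfig> rules;
};

// Returns an empty string when the configuration is usable, else a reason.
std::string check_configuration(const TracebackConfig& cfg) {
    if (cfg.sample_period_ns == 0) return "diag_sample_period_ns must be positive";
    if (cfg.sample_history_depth == 0) return "sample_history_depth must be positive";
    if (cfg.data_ids.empty()) return "data-ids must not be empty";
    for (const RuleConfig& rule : cfg.rules) {
        const bool known = std::find(cfg.data_ids.begin(), cfg.data_ids.end(),
                                     rule.data_id) != cfg.data_ids.end();
        if (!known) return "rule '" + rule.name + "' references an undefined data_id";
    }
    return "";
}
```

Running such checks before opening the device means a malformed file fails fast with a readable message instead of surfacing later as a hardware configuration error.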

Phase 2: Component Initialization

The application prepares its memory buffers, defines its rule objects, and establishes the connection to the DOCA Telemetry Diag context.

Data and Logging Preparation

  • History buffer: Calls init_data_aligned_ring() to prepare the main circular data store. The buffer is deliberately sized to double the requested depth, ensuring there is exactly as much history available as there is active data.

  • Rule objects: Populates a C++ vector with rule objects. 

    Memory Optimization

    The rules vector is pre-sized with reserve() because the number of rules is known upfront, so the vector needs a single memory allocation and never reallocates while rules are added.

    Data Indexing

    For each rule, the application calculates an array index (sample_record_offset) of 1 + N, where N is the position of the rule's data_id among the configured data IDs. The offset of 1 is required because the application hardcodes the DOCA_TELEMETRY_DIAG_OUTPUT_FORMAT_1 format, in which each sample is a timestamp followed by the data values.

  • Events log: Creates or truncates the events.log file using std::ofstream and immediately writes the application configuration to the header via log_app_start_header().
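The rule-object preparation described above (a reserved vector and a per-rule record offset) can be sketched in C++ as follows. The Rule struct and build_rules function are hypothetical names, and interpreting N as the position of the rule's data_id in the configured data-ids list is an assumption based on the record layout of timestamp followed by the data values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal rule object: which ID it watches and where to read it.
struct Rule {
    std::uint64_t data_id;
    std::size_t sample_record_offset;
};

// Builds one Rule per requested data_id; IDs missing from data_ids are skipped.
std::vector<Rule> build_rules(const std::vector<std::uint64_t>& data_ids,
                              const std::vector<std::uint64_t>& rule_data_ids) {
    std::vector<Rule> rules;
    rules.reserve(rule_data_ids.size());  // size known upfront: one allocation
    for (std::uint64_t wanted : rule_data_ids) {
        for (std::size_t n = 0; n < data_ids.size(); ++n) {
            if (data_ids[n] == wanted) {
                rules.push_back(Rule{wanted, 1 + n});  // slot 0 holds the timestamp
                break;
            }
        }
    }
    return rules;
}
```

Precomputing the offset once means each rule can read its value from a record with a single indexed load during the polling loop.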

Hardware and Telemetry Initialization

  • Device open: Identifies and opens the requested DOCA device using open_doca_dev().

    The device pointer is stored in an RAII unique_ptr so that it is released automatically on every exit path, without goto-based cleanup chains.

  • Context creation: Verifies that the device supports telemetry data (requiring elevated permissions), creates the context via create_and_prepare_diag_ctx(), and starts the data sampling process with doca_telemetry_diag_start().

  • Context validation: Calls doca_telemetry_diag_get_sample_size() and doca_telemetry_diag_get_sample_period() to confirm the hardware accepted the requested parameters.

    If the actual sample size or period does not match the requested parameters, the application explicitly stops. While this could be worked around, the codebase relies on these assumptions for simplicity and requires them to be correct.
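The RAII pattern mentioned for the device pointer can be illustrated with a std::unique_ptr and a custom deleter. The sketch below deliberately uses stand-in types (FakeDevice, fake_device_close) rather than DOCA types or functions, so it shows only the ownership technique and makes no claim about the application's actual wrapper.

```cpp
#include <memory>

// Stand-in for an opaque C handle such as a device; not a DOCA type.
struct FakeDevice {
    int* close_count;
};

// Stand-in for the C-style close function that must run exactly once.
void fake_device_close(FakeDevice* dev) {
    ++*dev->close_count;
    delete dev;
}

// Deleter that routes unique_ptr destruction to the close function.
struct FakeDeviceCloser {
    void operator()(FakeDevice* dev) const { fake_device_close(dev); }
};

using DevicePtr = std::unique_ptr<FakeDevice, FakeDeviceCloser>;

// Every return path releases the device without explicit cleanup code.
bool use_device(int* close_count, bool fail_early) {
    DevicePtr dev(new FakeDevice{close_count});
    if (fail_early) return false;  // deleter runs here
    return true;                   // and here
}
```

The early-return path is the case that goto-style cleanup exists to handle in C; with the deleter attached to the pointer type, both paths close the handle exactly once.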

Phase 3: Main Execution Loop

The application enters a continuous polling state, checking the telemetry instance for new data, updating the circular buffer, and evaluating rule logic.

Setup and Signal Handling

  • Allocates a small read buffer (sample_buffer) to ingest data from the telemetry instance. 

    Data is read here, evaluated, mutated, and then copied into the history buffer. This decoupled design future-proofs the application, allowing users to read from multiple telemetry sources where the formats between the read buffer and history buffer could differ significantly.

  • Installs a SIGINT signal handler allowing administrators to terminate the process cleanly via CTRL+C.
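A common way to implement this kind of clean shutdown in C++ is a handler that only sets a flag, which the main loop then checks. The sketch below uses the standard std::signal facility; the function and variable names are hypothetical and the application may install its handler differently.

```cpp
#include <csignal>

// Only async-signal-safe work belongs in a handler: here, setting a flag.
static volatile std::sig_atomic_t g_stop_requested = 0;

extern "C" void handle_sigint(int) { g_stop_requested = 1; }

// Registers the handler; call once before entering the polling loop.
void install_sigint_handler() { std::signal(SIGINT, handle_sigint); }

// The polling loop checks this between iterations and exits when it is true.
bool stop_requested() { return g_stop_requested != 0; }
```

Letting the loop exit normally, rather than terminating inside the handler, is what allows destructors to run and release resources.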

Datapath Loop

  • Data polling: Calls doca_telemetry_diag_query_counters(). If no data is available (0 samples), the thread sleeps for diag_sample_period_ns to prevent CPU exhaustion. Typically, 1 sample is read, though 2+ may be read if the application experiences wake-up latency.

  • Sample processing:

    • Casts the read pointer to doca_telemetry_diag_data_sample_format_1.

    • Advances the circular buffer head (advance_head_idx()) and copies the hardware timestamp and data values into the new history record.  

      Bit manipulation is applied here to accommodate the DOCA_TELEMETRY_DIAG_TIMESTAMP_SOURCE_RTC format.

    • If any rules are currently active, the sample is immediately written to the events log file.
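The poll-then-sleep behavior of the datapath loop can be sketched in C++ as a single iteration. Here read_samples is a hypothetical callable standing in for the telemetry query, and poll_once is an illustrative name; the sketch does not reproduce the DOCA function signature.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

// read_samples stands in for the telemetry query and returns the number of
// samples it delivered on this call (hypothetical interface).
std::size_t poll_once(const std::function<std::size_t()>& read_samples,
                      std::uint64_t sample_period_ns) {
    const std::size_t n = read_samples();
    if (n == 0) {
        // Nothing new yet: yield the CPU for one sample period.
        const auto period = std::chrono::nanoseconds(
            static_cast<std::chrono::nanoseconds::rep>(sample_period_ns));
        std::this_thread::sleep_for(period);
    }
    return n;  // the caller processes n samples (possibly more than one)
}
```

Sleeping only when no data arrived keeps latency low under load while avoiding a busy loop when the device is idle.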

Rule Evaluation and State Changes

Each rule evaluates the new sample using monitor.process_sample().

If a rule does not change state during the processing of this new sample, there is no more work to be done, and the application immediately moves to process the next rule.

Rule Activates

The application increments active_monitors_count.

  • If this is the first rule activating, the application writes the data column headers and dumps the historical buffer to the events log.

  • For all rule activations (whether the first or the Nth rule), an "event started" message is explicitly written to the log. 

Note on Sequencing

All samples read before the rule activated have already been written to the history. Data received while any rule is active is logged strictly before the rules processing step for each sample, continuing until all rules deactivate.

Rule Deactivates

Decrements active_monitors_count and writes an "event completed" marker. If the count reaches 0, no further data is written to the log file until a rule activates again in the future.
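The activation and deactivation bookkeeping can be sketched in C++ as below. The EventLogSketch type collects lines in memory instead of writing a file, and the literal strings are placeholders rather than the application's real log format; only the counter name active_monitors_count and the ordering of steps follow the description.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class RuleChange { None, Started, Ended };

// Collects log lines in memory; the application writes to a file instead.
struct EventLogSketch {
    std::vector<std::string> lines;
    std::size_t active_monitors_count = 0;

    // Call for every new sample, before the rules process it.
    void on_sample(const std::string& row) {
        if (active_monitors_count > 0) lines.push_back(row);
    }

    // Call once per rule after it has processed the sample.
    void on_rule_change(RuleChange change, const std::string& rule_name) {
        if (change == RuleChange::Started) {
            if (active_monitors_count++ == 0) {
                // First active rule: emit headers and the buffered history.
                lines.push_back("<column headers>");
                lines.push_back("<history dump>");
            }
            lines.push_back("event started: " + rule_name);
        } else if (change == RuleChange::Ended && active_monitors_count > 0) {
            --active_monitors_count;
            lines.push_back("event completed: " + rule_name);
        }
    }
};
```

With two overlapping events, the history is dumped once, each rule gets its own start and completion marker, and samples stop being logged after the last rule ends.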

Because device cleanup is handled via C++ RAII unique_ptr objects, all resources are released automatically when the SIGINT signal breaks the execution loop.

References

  • /opt/mellanox/doca/applications/telemetry/traceback/
