NVIDIA UFM Enterprise User Manual

Telemetry

UFM Telemetry allows the collection and monitoring of InfiniBand fabric port statistics, such as network bandwidth, congestion, errors, latency, and more.

UFM Telemetry Capabilities

  • Real-time monitoring views

  • Monitoring of multiple attributes

  • Intelligent Counters for error and congestion counters

  • InfiniBand port-based error counters

  • InfiniBand congestion measurement based on the XmitWait counter

  • InfiniBand port-based bandwidth data

Telemetry Session Panels Supported Actions

  • Rearrangement via a straightforward drag-and-drop function

  • Resizing by hovering over the panel's border

Understanding Telemetry Types

By default, UFM collects two types of telemetry data, each serving a different monitoring purpose:

Primary Telemetry (High-Frequency)

  • Default Sample Rate: 30 seconds

  • Use Cases: Real-time monitoring, UFM dashboard charts, port threshold event detection, live telemetry sessions

  • Counters: Collects approximately 30 key performance counters covering bandwidth, congestion, and error metrics

  • Historical Data: Collected every 5 minutes and stored in UFM's SQLite database

Secondary Telemetry (Low-Frequency)

  • Default Sample Rate: 300 seconds (5 minutes)

  • Use Cases: Historical analysis, detailed diagnostics, extended monitoring scenarios

  • Counters: Collects approximately 120 extended counters for comprehensive fabric analysis

Telemetry Management Methods

  • Legacy Mode (via UFM): In this mode, telemetry instances are invoked during UFM startup and fully managed by UFM.

  • Clustered Telemetry (UTM) Mode: In this mode, telemetry instances are managed by the UFM Telemetry Manager (UTM) plugin.

  • Telemetry Microservice Mode: In this mode, primary telemetry runs as a dedicated background service.

Telemetry Microservice

Starting with UFM v6.25.1, primary telemetry data collection and computation runs as a dedicated background service, separate from the main UFM process. This architecture improves UFM responsiveness by offloading telemetry processing — including counter computation, rate calculation, and threshold event detection — to an independent process.

The telemetry microservice is enabled by default. To disable it and revert to the legacy in-process telemetry handling, set the following in gv.cfg:

[TelemetryService]
enabled = false

Additional optional settings:

Parameter    Default    Description
enabled      true       Enable or disable the telemetry microservice
port         8090       Internal API port used by the service
log_level    INFO       Logging verbosity (DEBUG, INFO, WARNING, ERROR)

The service logs are written to /opt/ufm/files/log/telemetry_service.log.

When disabled, UFM falls back to the legacy telemetry path with no loss of functionality.
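
The settings above can also be inspected programmatically. The following is a minimal sketch, assuming gv.cfg is a standard INI file that Python's configparser can read. The helper name read_telemetry_service_settings is hypothetical and not part of UFM; keys missing from the file fall back to the defaults listed in the table.

```python
import configparser


def read_telemetry_service_settings(cfg_text):
    """Return the [TelemetryService] settings from gv.cfg-style text.

    Hypothetical helper: any key (or the whole section) that is absent
    falls back to the documented default value.
    """
    # interpolation=None so that literal '%' characters elsewhere in the
    # file are not treated as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    section = "TelemetryService"
    return {
        "enabled": parser.getboolean(section, "enabled", fallback=True),
        "port": parser.getint(section, "port", fallback=8090),
        "log_level": parser.get(section, "log_level", fallback="INFO"),
    }
```

To use it, pass the contents of gv.cfg as a string. With only "enabled = false" set, the port and log level keep their defaults.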


Telemetry Instance: High-Frequency (Primary) Telemetry Instance

Description: A default telemetry session that collects a predefined set of ~30 counters covering bandwidth, congestion, and error metrics, which UFM analyzes and reports. These counters are used for:

  • Default Telemetry Session: an ongoing session that UFM uses to populate the UFM WebUI dashboard charts and to monitor and analyze port threshold events (the session interval is 30 seconds by default)

  • Real-Time Telemetry: allows users to define live telemetry sessions for monitoring small subsets of devices or ports and a selected set of counters. For more information, refer to Telemetry.

  • Historical Telemetry: based on the primary telemetry; collects statistical data from all fabric ports and stores it in an internal UFM SQLite database (the session interval is 5 minutes by default)

REST API:

  • For Default and Real-Time Telemetry: Monitoring REST API

  • For Historical Telemetry: History Telemetry Sessions REST API → History Telemetry Sessions

Telemetry Instance: Low-Frequency (Secondary) Telemetry Instance

Description: Starts automatically at UFM startup and collects an extended set of approximately 120 counters. For a list of the Secondary Telemetry Fields, refer to Low-Frequency (Secondary) Telemetry Fields.

REST API: N/A

To access the telemetry endpoints directly and list the supported counters:

For the High-Frequency (Primary) Telemetry Instance, run the following command:

curl -s 127.0.0.1:9001/csv/cset/converted_enterprise

For the Low-Frequency (Secondary) Telemetry Instance, run the following command: 

curl -s 127.0.0.1:9002/csv/xcset/low_freq_debug
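
The counter names can be extracted from the output of either endpoint. The following is a minimal sketch, assuming the endpoint returns CSV text whose first row is a header naming each field; that is an assumption about the response format, not something this page specifies. The helpers fetch_csv and counter_names are hypothetical, and the URLs are the ones used in the curl commands above.

```python
import csv
import io
import urllib.request

# Endpoints taken from the curl commands above.
PRIMARY_URL = "http://127.0.0.1:9001/csv/cset/converted_enterprise"
SECONDARY_URL = "http://127.0.0.1:9002/csv/xcset/low_freq_debug"


def fetch_csv(url, timeout=10.0):
    """Download the raw CSV text from a telemetry endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def counter_names(csv_text):
    """Return the field names found in the first row of the CSV text.

    Assumes the first row is a header; returns [] for empty input.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [name.strip() for name in header]
```

On a host where the endpoint is reachable, counter_names(fetch_csv(PRIMARY_URL)) would return the primary instance's field names. Keeping the parsing step separate from the download makes it usable on saved curl output as well.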

Historical Telemetry Collection in UFM

Storage Considerations

UFM periodically collects fabric port statistics and saves them in its SQLite database. Before starting UFM Enterprise, consider the following disk space utilization for various fabric sizes and durations.
The measurements in the table below were taken with the sampling interval set to once per 30 seconds.

Note that the default sampling rate is once per 300 seconds, so the disk utilization estimate should be adjusted accordingly.


Number of Nodes    Ports per Node    Storage per Hour    Storage per 15 Days      Storage per 30 Days
16                 8                 1.6 MB              576 MB (0.563 GB)        1152 MB (1.125 GB)
100                8                 11 MB               3960 MB (3.867 GB)       7920 MB (7.734 GB)
500                8                 50 MB               18000 MB (17.58 GB)      36000 MB (35.16 GB)
1000               8                 100 MB              36000 MB (35.16 GB)      72000 MB (70.31 GB)
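
To adjust the table's figures for a different sampling interval, a rough estimate can be computed as sketched below. The sketch rests on two assumptions that this page does not state: storage grows linearly with the number of samples (so it scales with 30 divided by the interval in seconds) and linearly with the duration. The helper estimate_storage_mb is hypothetical; the per-hour figures are the table's measurements at a 30-second interval.

```python
# Measured storage per hour (MB) at a 30-second sampling interval,
# keyed by number of nodes (8 ports per node), from the table above.
MEASURED_MB_PER_HOUR = {16: 1.6, 100: 11.0, 500: 50.0, 1000: 100.0}
MEASURED_INTERVAL_SEC = 30


def estimate_storage_mb(nodes, days, interval_sec=300):
    """Estimate historical telemetry storage in MB.

    Assumption (not from the manual): storage is proportional to the
    number of samples taken, and therefore to 1 / interval_sec.
    """
    if nodes not in MEASURED_MB_PER_HOUR:
        raise ValueError("no measurement for this fabric size")
    if interval_sec <= 0 or days < 0:
        raise ValueError("interval must be positive and days non-negative")
    # Rescale the measured hourly figure to the requested interval.
    per_hour = MEASURED_MB_PER_HOUR[nodes] * MEASURED_INTERVAL_SEC / interval_sec
    return per_hour * 24 * days
```

Under these assumptions, a 1000-node fabric at the default 300-second interval would need roughly 7200 MB for 30 days, one tenth of the 72000 MB measured at 30 seconds.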

