NVIDIA UFM Cable Validation Tool

Prometheus Endpoint

Cable Validation Tool - Prometheus Metrics Endpoint

Overview

The Cable Validation Tool (CVT) now provides a Prometheus-compatible metrics endpoint that exposes real-time cable health and performance data for monitoring and alerting. This endpoint enables integration with modern monitoring stacks like Prometheus, Grafana, and other observability tools.

Features

🎯 Key Capabilities

  • Real-time Metrics: Live cable validation data from network switches and hosts

  • Multi-format Support: Prometheus, JSON, and CSV output formats

  • Rich Labeling: Complete network topology context with peer relationships

  • High Performance: Multi-level caching optimized for frequent scraping

  • Production Ready: Handles 100K+ ports with memory-adaptive optimizations

📊 Metrics Categories

  • Power Metrics: RX/TX optical power per lane (up to 8 lanes per port)

  • BER Metrics: Effective and Raw Bit Error Rates

  • Temperature Metrics: Module temperature and thresholds

  • Counter Metrics: Transceiver reinsert/swap events

  • Validation Metrics: Port validation status with issue descriptions

  • Threshold Metrics: Power and temperature alarm thresholds

  • Timestamp Metrics: Data collection and report timestamps

API Endpoints

Base URL

https://<cvt-server>/cablevalidation/metrics

Available Endpoints

| Endpoint | Format | Content-Type | Description |
|----------|--------|--------------|-------------|
| /cablevalidation/metrics | Prometheus | text/plain | Standard Prometheus exposition format |
| /cablevalidation/metrics/json | JSON | application/json | Structured JSON for programmatic access |
| /cablevalidation/metrics/csv | CSV | text/plain | Comma-separated values for spreadsheet import |

Authentication

  • No authentication required for metrics endpoints

  • HTTPS enforced for security

  • Bypasses session handling for automated scraping

Sample Output

Prometheus Format

    # HELP effective_ber Effective Bit Error Rate
    # TYPE effective_ber gauge
    # HELP validation_status Port validation status with issue descriptions in labels (value = issue count)
    # TYPE validation_status gauge
    # HELP port_info Port information with status and validation details in labels
    # TYPE port_info gauge

    # Healthy port with performance metrics
    effective_ber{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 1.5e-254 1759345924622
    module_temperature{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 65.2 1759345924622
    validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0 1759345924622

    # Unplugged port with validation issue (power/temp metrics excluded due to NA values)
    validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1 1759345923752
    effective_ber{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 0.0 1759345923752
    time_since_last_clear{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 320035.6 1759345923752

    # Port info with detailed status
    port_info{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",phy_manager_state="Disable",module_oper_status="unplugged",cable_sn="N/A",cable_pn="N/A",protocol="Ethernet",module_fw_version="N/A"} 1 1759345923752
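For quick ad hoc checks outside Prometheus, sample lines like those above can be split into name, labels, value, and timestamp. The following is a minimal sketch, not part of CVT: `parse_sample` is a hypothetical helper, and it assumes one sample per line with an optional millisecond timestamp, as in the sample output. Escaped characters inside label values are returned as-is rather than unescaped.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp (ms).
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# One label pair such as node="ufm-host38" (backslash escapes are tolerated).
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one exposition line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a valid sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    ts = match.group("ts")
    return {
        "name": match.group("name"),
        "labels": labels,
        "value": float(match.group("value")),
        "timestamp_ms": int(ts) if ts is not None else None,
    }
```

For production use, a maintained parser such as the one in the official Prometheus Python client is the safer choice; this sketch only covers the line shapes shown in this document.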

JSON Format

    {
      "ufm-host38:enp3s0f0np0": {
        "timestamp": 1757524769.645,
        "port_info": {
          "node_name": "ufm-host38",
          "port_name": "enp3s0f0np0",
          "peer_node_name": "r-ufm-sw-eth01",
          "peer_port_name": "swp2",
          "node_type": "Host",
          "su_number": "SU1",
          "data_hall": "DH1"
        },
        "port_labels": {
          "cable_sn": "ABC123",
          "cable_pn": "DEF456",
          "protocol": "400G"
        },
        "port_stats": {
          "effective_ber": 1.5e-254,
          "module_temperature": 65.2,
          "rx_power_lane_0": -2.5
        },
        "validation_data": {
          "issues_count": 1,
          "last_report_time": 1757524769.645,
          "issues": {
            "WrongNeighbor": "Check cable connection to switch2"
          }
        }
      }
    }
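The JSON layout above keys each entry by `node:port` and nests issues under `validation_data`. As a minimal sketch of consuming it, the snippet below lists every port that reports at least one issue. `ports_with_issues` is a hypothetical helper name, and the embedded payload is an abbreviated copy of the sample above rather than live output.

```python
import json

# Abbreviated copy of the sample JSON response shown in this document.
SAMPLE = """
{
  "ufm-host38:enp3s0f0np0": {
    "port_stats": {"effective_ber": 1.5e-254, "module_temperature": 65.2},
    "validation_data": {
      "issues_count": 1,
      "issues": {"WrongNeighbor": "Check cable connection to switch2"}
    }
  }
}
"""


def ports_with_issues(payload):
    """Return {port_key: issues} for every port reporting at least one issue."""
    result = {}
    for port_key, port_data in payload.items():
        validation = port_data.get("validation_data", {})
        if validation.get("issues_count", 0) > 0:
            result[port_key] = validation.get("issues", {})
    return result


payload = json.loads(SAMPLE)
for key, issues in ports_with_issues(payload).items():
    for syndrome, action in issues.items():
        print(f"{key}: {syndrome} -> {action}")
```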

Metrics Reference

Power Metrics

| Metric | Type | Description | Unit |
|--------|------|-------------|------|
| rx_power_lane_N | gauge | RX optical power for lane N (0-7, not all lanes may be present) | dBm |
| tx_power_lane_N | gauge | TX optical power for lane N (0-7, not all lanes may be present) | dBm |
| rx_power_high_th | gauge | RX power high threshold | dBm |
| rx_power_low_th | gauge | RX power low threshold | dBm |

BER Metrics

| Metric | Type | Description |
|--------|------|-------------|
| effective_ber | gauge | Effective Bit Error Rate |
| raw_ber | gauge | Raw Bit Error Rate |

Temperature Metrics

| Metric | Type | Description | Unit |
|--------|------|-------------|------|
| module_temperature | gauge | Current module temperature | Celsius |
| temperature_high_th | gauge | Temperature high threshold | Celsius |
| temperature_low_th | gauge | Temperature low threshold | Celsius |

Status Metrics

| Metric | Type | Description | Values |
|--------|------|-------------|--------|
| port_status | gauge | Port plugged status | 1=Up, 0=Down |
| port_oper_status | gauge | Port operational status | 1=Up, 0=Down |

Counter Metrics

| Metric | Type | Description |
|--------|------|-------------|
| transceiver_reinsert_cnt | counter | Number of transceiver reinsert events |
| transceiver_swap_cnt | counter | Number of transceiver swap events |
| time_since_last_clear | gauge | Time since last counter clear (seconds) |

Validation Metrics

| Metric | Type | Description | Special Features |
|--------|------|-------------|------------------|
| validation_status | gauge | Port validation status with issue descriptions | Value = issue count, descriptions in labels |
| last_report_time | gauge | Timestamp of last validation report | Unix timestamp |

Validation Status Labels

The validation_status metric includes dynamic labels for each type of validation issue:

  • WrongNeighbor: "Check cable connection to correct switch"

  • MediaUnplugged: "Insert; Reseat or Replace Cable/Transceiver"

  • AnomalousPort: "Temperature exceeds threshold"

  • FlappingLink: "Reseat transceiver; Check Fiber"

  • UnknownNeighbor: "Verify neighbor device connectivity"

  • WrongPort: "Check port mapping in topology"

  • ExtraCable: "Remove unexpected cable connection"

  • UnreachableDevice: "Check device connectivity and power"

  • LinkDown_NoSignal: "Check physical connection"

  • ErrDisable_Flap: "Port disabled due to flapping"

  • AdminDown: "Port administratively disabled"

  • ErrDisable_Rx: "RX error disable condition"

  • NegotiationFail: "Check autonegotiation settings"

  • NicNameMismatch: "Verify NIC provisioning"

  • ModulePnMismatch: "Replace with compatible module"

Note: Commas in descriptions are automatically converted to semicolons to maintain Prometheus label format compatibility.

Examples:

    # Port with validation issues (unplugged cable)
    validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1

    # Port without issues (healthy connection)
    validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0

    # Port with multiple validation issues
    validation_status{node="switch1",port="1/1",WrongNeighbor="Check cable connection to switch2",AnomalousPort="Temperature exceeds threshold"} 2
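The comma-to-semicolon conversion and the "value = issue count" convention can be illustrated with a short sketch. This is an assumption about how such a line could be rendered, not CVT's actual code: `sanitize_label_value` and `format_validation_status` are hypothetical names, and the backslash, quote, and newline escaping follows the standard Prometheus text format.

```python
def sanitize_label_value(text):
    """Replace commas with semicolons, then escape for the exposition format."""
    text = text.replace(",", ";")
    # Escape backslashes first so the escapes added below are not doubled.
    text = text.replace("\\", "\\\\")
    text = text.replace('"', '\\"')
    text = text.replace("\n", "\\n")
    return text


def format_validation_status(topology, issues):
    """Render one validation_status sample; the value is the issue count."""
    pairs = list(topology.items()) + list(issues.items())
    labels = ",".join(f'{key}="{sanitize_label_value(value)}"' for key, value in pairs)
    return f"validation_status{{{labels}}} {len(issues)}"


print(format_validation_status(
    {"node": "switch1", "port": "1/1"},
    {"MediaUnplugged": "Insert, Reseat or Replace Cable/Transceiver"},
))
```

Because each syndrome becomes its own label name, the label set varies from port to port, which is why the PromQL queries in this document filter with matchers such as `MediaUnplugged!=""`.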

Labels

Topology Labels (All Metrics)

  • node: Switch or host name

  • port: Port identifier

  • peer_node: Connected peer node name

  • peer_port: Connected peer port identifier

  • node_type: Node type (Switch, Host, etc.)

  • su_number: Scalable Unit identifier

  • data_hall: Data hall location

Cable Labels (Status Metrics Only)

  • cable_sn: Cable serial number

  • cable_pn: Cable part number

  • protocol: Cable protocol (400G, InfiniBand, etc.)

  • port_status: Port status (Up, Down, etc.)

  • plugged: Module plugged status

Performance Characteristics

Update Frequency

  • Agent Data: Updated every 10 minutes (configurable)

  • Metrics Cache: Invalidated on data changes

  • Prometheus Scraping: Recommended 15-30 second intervals

Performance Metrics

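| Deployment Size | Response Time | Memory Usage | Caching Strategy |
|-----------------|---------------|--------------|------------------|
| < 10K ports | < 50ms | ~20MB | Full caching enabled |
| 10K-50K ports | < 200ms | ~100MB | Collection cache only |
| 50K+ ports | < 500ms | ~200MB | No caching, real-time generation |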
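The size tiers above amount to a simple selection rule. The sketch below shows one way to express it; the function name, the returned strings, and the treatment of the exact 10K and 50K boundaries are assumptions for illustration, since the document does not say which tier a boundary value falls into.

```python
# Thresholds taken from the deployment-size tiers in this document.
FULL_CACHE_MAX_PORTS = 10_000        # below this: full caching
COLLECTION_CACHE_MAX_PORTS = 50_000  # below this: collection cache only


def caching_strategy(port_count):
    """Pick a caching tier for a deployment of the given size (hypothetical helper)."""
    if port_count < 0:
        raise ValueError("port_count must be non-negative")
    if port_count < FULL_CACHE_MAX_PORTS:
        return "full"
    if port_count < COLLECTION_CACHE_MAX_PORTS:
        return "collection_only"
    return "none"  # real-time generation on every request


for size in (2_000, 25_000, 120_000):
    print(size, caching_strategy(size))
```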

Optimization Features

  • Multi-level Caching: Port, collection, and label caching

  • Memory Adaptive: Automatically adjusts for large deployments

  • Smart Change Detection: Only updates when cable/module data changes

  • Zero Value Handling: Includes all values for complete visibility

Troubleshooting

Common Issues

1. No Metrics Data

Symptoms: Empty response or no metrics

Causes:

  • CVT service not running

  • No topology loaded

  • No advanced stats collection

Solutions:

    # Check service status
    # Check topology loading
    # Check agent connectivity

2. Missing Port Data

Symptoms: Some ports not appearing in metrics

Causes:

  • Port not in loaded topology

  • Agent not deployed on switch

  • Advanced stats not collected

Solutions:

  • Verify topology includes all expected ports

  • Deploy agents on missing switches

  • Check agent connectivity and data collection

3. Stale Timestamps

Symptoms: Old timestamps in metrics

Causes:

  • Agent not sending updates

  • Network connectivity issues

Solutions:

  • Check agent logs for errors

  • Verify network connectivity to switches

  • Restart agents if necessary
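Stale timestamps can also be detected programmatically. The sketch below assumes the millisecond sample timestamps shown in the Prometheus sample output and the default 10-minute agent update interval; `is_stale` and the rule of tolerating two missed updates are illustrative choices, not CVT behavior.

```python
import time

# Default agent update interval stated in this document (10 minutes).
AGENT_INTERVAL_S = 600


def is_stale(sample_timestamp_ms, now_s=None, max_missed_updates=2):
    """Return True if the sample is older than max_missed_updates agent intervals."""
    if now_s is None:
        now_s = time.time()
    age_s = now_s - sample_timestamp_ms / 1000.0
    return age_s > max_missed_updates * AGENT_INTERVAL_S


# A sample taken 25 minutes before "now" counts as stale with the defaults.
now = 1_000_000.0
print(is_stale(int((now - 25 * 60) * 1000), now_s=now))
```

If the agent interval has been reconfigured, `AGENT_INTERVAL_S` would need to match it.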

4. Missing Validation Data

Symptoms: validation_status metrics missing or always 0

Causes:

  • Validation reports not being generated

  • Agent data filtering (switch not in topology)

  • Report processing errors

Solutions:

  • Verify validation is started on agents

  • Check switch IP exists in topology

  • Review agent and collector logs for errors

5. Inconsistent Issue Counts

Symptoms: validation_status count doesn't match expected issues

Causes:

  • Issues filtered by port

  • Report data synchronization issues

  • Processing errors

Solutions:

  • Check that report data includes port-specific issues

  • Verify advanced stats and reports arrive together

  • Review validation report structure

Performance Tuning

Environment Variables

TBD: not supported yet.

    # Adjust caching thresholds
    PROMETHEUS_MAX_CACHED_PORTS=10000

    # Disable detailed metrics for very large deployments
    PROMETHEUS_ENABLE_DETAILED_METRICS=false

    # Adjust cache TTL
    PROMETHEUS_CACHE_TTL=60

Memory Optimization

For deployments > 50K ports:

  • Collection-level caching automatically disabled

  • Port-level caching automatically disabled

  • Real-time generation used (acceptable 200-500ms response time)

Security Considerations

Access Control

  • HTTPS Required: All access must be over HTTPS

  • No Authentication: Designed for automated monitoring tools

  • Network Restrictions: Consider IP-based access control

Sensitive Data

  • Network Topology: Metrics expose network structure

  • Cable Information: Serial numbers and part numbers included

  • Performance Data: Could reveal network capacity information

    <Location /cablevalidation/metrics>
        Use BringupProxy
        SSLRequireSSL
        # Restrict to monitoring networks; a client in any one of these ranges is allowed
        <RequireAny>
            # Internal networks
            Require ip 10.0.0.0/8
            # Container networks
            Require ip 172.16.0.0/12
            # Private networks
            Require ip 192.168.0.0/16
        </RequireAny>
    </Location>

Integration Examples

Python Client Example

    import requests

    # Get metrics in different formats
    def get_cvt_metrics(server: str, port: int, response_format='prometheus'):
        endpoints = {
            'prometheus': '/cablevalidation/metrics',
            'json': '/cablevalidation/metrics/json',
            'csv': '/cablevalidation/metrics/csv'
        }
        url = f"https://{server}:{port}{endpoints[response_format]}"
        # verify=False skips TLS certificate checks (for example, self-signed
        # certificates); pass a CA bundle path instead where possible.
        response = requests.get(url, verify=False, timeout=10)
        response.raise_for_status()
        if response_format == 'json':
            return response.json()
        return response.text

    # Usage: report ports whose effective BER exceeds 1e-12.
    # Port 443 is assumed here (the HTTPS default).
    metrics = get_cvt_metrics('cvt-server.example.com', 443, 'json')
    for port_key, port_data in metrics.items():
        ber = port_data['port_stats'].get('effective_ber')
        if ber is not None and ber > 1e-12:
            print(f"High BER on {port_key}: {ber}")

Validation Monitoring Examples

    # Find all ports with validation issues
    validation_status > 0

    # Count affected ports per node, by syndrome type
    count by (node) (validation_status{WrongNeighbor!=""})
    count by (node) (validation_status{MediaUnplugged!=""})
    count by (node) (validation_status{LinkDown_NoSignal!=""})

    # Find specific issue types
    validation_status{MediaUnplugged!=""} > 0    # Unplugged cables
    validation_status{AdminDown!=""} > 0         # Administratively disabled ports
    validation_status{ModulePnMismatch!=""} > 0  # Hardware compatibility issues

    # Ports with multiple issue types
    validation_status{WrongNeighbor!="",AnomalousPort!=""}

    # Correlation with performance metrics (matched per port)
    (validation_status > 0) and on(node, port) (effective_ber > 1e-12)

    # Port status correlation (matched per port)
    validation_status{MediaUnplugged!=""} and on(node, port) port_info{module_oper_status="unplugged"}

Architecture

Data Flow

    Network Agents → CVT Collector → Advanced Stats + Report Data → Prometheus Collector → Metrics Endpoint
          ↓                ↓                      ↓                          ↓                    ↓
       (10 min)       (Real-time)            (Synchronized)          (Multi-level Cache)     (GET request)

Enhanced Data Processing
  1. Agent Data Validation: Switch IP validated against topology before processing

  2. Synchronized Processing: Advanced stats and validation reports processed together

  3. Optimized Issue Processing: Report data pre-processed to group issues by port (O(n+m) complexity)

  4. Independent Validation Cache: PortValidationStatus class with hash-based change detection

  5. Robust Syndrome Handling: Automatic fallback for unknown syndromes with developer warnings

  6. Smart Data Quality: NA values properly excluded, counters preserve semantics
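The O(n+m) issue processing in step 3 can be sketched as follows: one pass over the m report entries builds a per-port index, and one pass over the n ports does constant-time lookups. The record shape (`node`, `port`, `syndrome`, `description`) and both function names are assumptions made for illustration; the real report structure is not described in this document.

```python
from collections import defaultdict


def group_issues_by_port(report_issues):
    """One pass over m report entries: {(node, port): {syndrome: description}}."""
    grouped = defaultdict(dict)
    for issue in report_issues:
        key = (issue["node"], issue["port"])
        grouped[key][issue["syndrome"]] = issue["description"]
    return grouped


def attach_issues(ports, report_issues):
    """One pass over n ports with O(1) lookups, so O(n + m) overall."""
    grouped = group_issues_by_port(report_issues)
    return {port: dict(grouped.get(port, {})) for port in ports}


report = [
    {"node": "switch1", "port": "1/1", "syndrome": "WrongNeighbor",
     "description": "Check cable connection to switch2"},
    {"node": "switch1", "port": "1/1", "syndrome": "AnomalousPort",
     "description": "Temperature exceeds threshold"},
]
ports = [("switch1", "1/1"), ("switch1", "1/2")]
for port, issues in attach_issues(ports, report).items():
    print(port, len(issues))
```

The alternative of scanning the full report once per port would cost O(n·m), which is what the pre-processing step avoids.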

Caching Strategy

  1. Port-level Cache: Individual port metrics cached until data changes

  2. Collection-level Cache: Aggregated output cached for fast retrieval

  3. Label Cache: Stable topology/cable labels cached separately

  4. Validation Cache: Independent cache for validation status with hash-based change detection

  5. Metadata Cache: Static TYPE/HELP comments cached permanently
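Hash-based change detection, as used by the validation cache, can be sketched in a few lines: the cached output is regenerated only when a fingerprint of the issue content changes. `ValidationCacheEntry` and `issues_fingerprint` are hypothetical names, and SHA-256 over canonical JSON is one possible fingerprint; this is not the implementation of the `PortValidationStatus` class mentioned above.

```python
import hashlib
import json


def issues_fingerprint(issues):
    """Stable hash of an issues dict; key order does not affect the result."""
    canonical = json.dumps(issues, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ValidationCacheEntry:
    """Caches rendered output for one port until the issue content changes."""

    def __init__(self):
        self._fingerprint = None
        self.rendered = None
        self.render_count = 0  # how many times the output was regenerated

    def update(self, issues, render):
        fingerprint = issues_fingerprint(issues)
        if fingerprint != self._fingerprint:
            self._fingerprint = fingerprint
            self.rendered = render(issues)
            self.render_count += 1
        return self.rendered


entry = ValidationCacheEntry()
entry.update({"AdminDown": "Port administratively disabled"}, lambda i: f"{len(i)} issue(s)")
entry.update({"AdminDown": "Port administratively disabled"}, lambda i: f"{len(i)} issue(s)")
print(entry.rendered, entry.render_count)
```

Sorting keys before hashing means a report that lists the same issues in a different order does not invalidate the cache.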

Performance Optimizations

  • Push-based Updates: Metrics updated when advanced stats arrive

  • Smart Change Detection: Only cable/module changes invalidate caches

  • Memory Adaptive: Caching disabled automatically for large deployments

  • String Manipulation: Efficient JSON aggregation using string operations

  • Validation Processing: O(n+m) complexity with report preprocessing

  • Hash-based Cache: Validation cache only invalidated when issue content changes

  • Agent Data Filtering: Invalid switch IPs filtered early to prevent unnecessary processing

Monitoring Best Practices

Prometheus Configuration

  • Scrape Interval: 15-30 seconds (agent data refreshes every 10 minutes by default, so most scrapes return unchanged, cached values)

  • Timeout: 10 seconds (allows for cache generation)

  • Retention: Configure based on historical analysis needs

Alerting Guidelines

  • BER Thresholds: Alert when effective_ber > 1e-12

  • Temperature Limits: Alert when module_temperature approaches temperature_high_th

  • Validation Issues: Alert when validation_status > 0

  • Critical Issues: Alert on specific syndromes (MediaUnplugged, UnreachableDevice, ModulePnMismatch)

  • Infrastructure Issues: Alert on LinkDown_NoSignal, ErrDisable conditions

  • Counter Anomalies: Alert on rapid increases in transceiver_reinsert_cnt
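The guidelines above can be prototyped against the JSON endpoint's `port_stats` fields before committing them to Prometheus rules. The sketch below is illustrative only: `evaluate_port` is a hypothetical helper, and the 5 °C margin used to interpret "approaches temperature_high_th" is an assumption, not a documented value.

```python
BER_ALERT_THRESHOLD = 1e-12   # from the BER guideline above
TEMPERATURE_MARGIN_C = 5.0    # assumption: "approaches" means within 5 degrees Celsius


def evaluate_port(stats, issue_count):
    """Return the list of alert names triggered by one port's data."""
    alerts = []
    ber = stats.get("effective_ber")
    if ber is not None and ber > BER_ALERT_THRESHOLD:
        alerts.append("high_ber")
    temperature = stats.get("module_temperature")
    high_threshold = stats.get("temperature_high_th")
    if (temperature is not None and high_threshold is not None
            and temperature >= high_threshold - TEMPERATURE_MARGIN_C):
        alerts.append("temperature_near_threshold")
    if issue_count > 0:
        alerts.append("validation_issues")
    return alerts


print(evaluate_port(
    {"effective_ber": 1e-9, "module_temperature": 72.0, "temperature_high_th": 75.0},
    issue_count=1,
))
```

Missing values are skipped rather than treated as zero, in line with how the endpoint omits NA data.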

Sample Alerting Rules

    # Validation issues alert
    - alert: PortValidationIssues
      expr: validation_status > 0
      labels:
        severity: warning
      annotations:
        summary: "Port {{ $labels.node }}:{{ $labels.port }} has validation issues"
        description: "{{ $value }} validation issues detected"

    # Critical validation issues
    - alert: CriticalPortIssues
      expr: validation_status{MediaUnplugged!=""} > 0 or validation_status{UnreachableDevice!=""} > 0
      labels:
        severity: critical
      annotations:
        summary: "Critical issues on {{ $labels.node }}:{{ $labels.port }}"
        description: "{{ if $labels.MediaUnplugged }}Cable unplugged: {{ $labels.MediaUnplugged }}{{ end }}{{ if $labels.UnreachableDevice }}Device unreachable: {{ $labels.UnreachableDevice }}{{ end }}"

    # Infrastructure issues
    - alert: InfrastructureIssues
      expr: validation_status{LinkDown_NoSignal!=""} > 0 or validation_status{ModulePnMismatch!=""} > 0
      labels:
        severity: warning
      annotations:
        summary: "Infrastructure issue on {{ $labels.node }}:{{ $labels.port }}"

    # Administrative issues
    - alert: AdminIssues
      expr: validation_status{AdminDown!=""} > 0 or validation_status{NicNameMismatch!=""} > 0
      labels:
        severity: info
      annotations:
        summary: "Administrative issue on {{ $labels.node }}:{{ $labels.port }}"

Version Information

Release Notes

  • Version: 1.1.0

  • Release Date: October 2025

  • Compatibility: CVT 1.7.0 and later

  • Dependencies: Requires advanced stats collection enabled

New in Version 1.1.0
  • ✅ Validation Metrics Integration: Port validation status with actionable issue descriptions

  • ✅ Synchronized Data Processing: Advanced stats and validation reports processed together

  • ✅ Performance Optimizations: O(n+m) validation processing, hash-based change detection

  • ✅ Enhanced Security: Agent data validation prevents processing from unknown switches

  • ✅ Improved Data Quality: None-based initialization for gauges, proper counter semantics

  • ✅ Better Caching: Independent validation cache with content-based invalidation

  • ✅ Comprehensive Syndrome Coverage: 15+ validation issue types with fallback handling

  • ✅ Real-world Validation: Successfully tested with production data and unplugged ports

API Stability

  • Metric Names: Stable (no breaking changes planned)

  • Label Names: Stable (additions possible, no removals)

  • Output Format: Prometheus standard compliance maintained

  • Endpoint URLs: Stable API contract

Data Quality Improvements

Enhanced Counter Semantics

  • Gauges default to None: Missing sensor data excluded instead of showing false zeros

  • Counters preserve values: No unexpected resets when data temporarily unavailable

  • Proper NA handling: Invalid data marked as NA and excluded from metrics

  • Temperature accuracy: Fixed zero temperature issue by using actual amber timestamps
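The gauge and counter semantics above can be sketched as two small rules: a gauge whose value is `None` is left out of the output entirely, and a counter keeps its last known value when a report carries no data. Both helper names are hypothetical and the code only illustrates the described behavior.

```python
def render_gauges(name_to_value, labels):
    """Emit one line per gauge, skipping None so missing data never shows as 0."""
    lines = []
    for name, value in name_to_value.items():
        if value is None:
            continue  # sensor data unavailable: omit the sample entirely
        lines.append(f"{name}{{{labels}}} {value}")
    return lines


def merge_counter(previous, reported):
    """Keep the last known counter value when the new report has no data."""
    return previous if reported is None else reported


# module_temperature is unavailable, so only effective_ber is emitted.
for line in render_gauges({"module_temperature": None, "effective_ber": 0.0},
                          'node="ufm-host38",port="enp3s0f1np1"'):
    print(line)
print(merge_counter(3, None))
```

A genuine zero reading (such as `effective_ber` of 0.0) is still emitted; only absent data is excluded.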

Validation Integration Benefits

  • Synchronized processing: Performance metrics and validation issues always in sync

  • Rich context: Issue descriptions provide actionable corrective actions

  • Efficient processing: O(n+m) complexity prevents performance degradation

  • Smart caching: Validation cache independent of performance metrics cache

Support

Troubleshooting

  1. Check CVT Service: Ensure Cable Validation service is running

  2. Verify Topology: Confirm network topology is loaded

  3. Agent Status: Check that agents are deployed and collecting data

  4. Network Connectivity: Verify switch/host accessibility

Performance Monitoring

    # Check metrics endpoint response time
    time curl -k https://cvt-server/cablevalidation/metrics > /dev/null

Contact Information

  • Development Team: Cable Validation Engineering

  • Documentation: [Internal Wiki Link]

  • Support: [Support Channel/Email]


This endpoint provides comprehensive cable validation metrics for modern monitoring and observability workflows, enabling proactive network health management and automated alerting.

