Cable Validation Tool - Prometheus Metrics Endpoint
Overview
The Cable Validation Tool (CVT) now provides a Prometheus-compatible metrics endpoint that exposes real-time cable health and performance data for monitoring and alerting. This endpoint enables integration with modern monitoring stacks like Prometheus, Grafana, and other observability tools.
Features
🎯 Key Capabilities
- Real-time Metrics: Live cable validation data from network switches and hosts
- Multi-format Support: Prometheus, JSON, and CSV output formats
- Rich Labeling: Complete network topology context with peer relationships
- High Performance: Multi-level caching optimized for frequent scraping
- Production Ready: Handles 100K+ ports with memory-adaptive optimizations
📊 Metrics Categories
- Power Metrics: RX/TX optical power per lane (up to 8 lanes per port)
- BER Metrics: Effective and Raw Bit Error Rates
- Temperature Metrics: Module temperature and thresholds
- Counter Metrics: Transceiver reinsert/swap events
- Validation Metrics: Port validation status with issue descriptions
- Threshold Metrics: Power and temperature alarm thresholds
- Timestamp Metrics: Data collection and report timestamps
API Endpoints
Base URL
https://<cvt-server>/cablevalidation/metrics
Available Endpoints
| Endpoint | Format | Content-Type | Description |
|---|---|---|---|
| /cablevalidation/metrics | Prometheus | | Standard Prometheus exposition format |
| /cablevalidation/metrics/json | JSON | | Structured JSON for programmatic access |
| /cablevalidation/metrics/csv | CSV | | Comma-separated values for spreadsheet import |
Authentication
- No authentication required for metrics endpoints
- HTTPS enforced for security
- Bypasses session handling for automated scraping
Sample Output
Prometheus Format
    # HELP effective_ber Effective Bit Error Rate
    # TYPE effective_ber gauge
    # HELP validation_status Port validation status with issue descriptions in labels (value = issue count)
    # TYPE validation_status gauge
    # HELP port_info Port information with status and validation details in labels
    # TYPE port_info gauge

    # Healthy port with performance metrics
    effective_ber{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 1.5e-254 1759345924622
    module_temperature{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 65.2 1759345924622
    validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0 1759345924622

    # Unplugged port with validation issue (power/temp metrics excluded due to NA values)
    validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1 1759345923752
    effective_ber{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 0.0 1759345923752
    time_since_last_clear{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 320035.6 1759345923752

    # Port info with detailed status
    port_info{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",phy_manager_state="Disable",module_oper_status="unplugged",cable_sn="N/A",cable_pn="N/A",protocol="Ethernet",module_fw_version="N/A"} 1 1759345923752
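For consumers that read the Prometheus output without a Prometheus server, a single sample line can be split into its name, labels, value, and timestamp. The following is a minimal sketch, not part of CVT: `parse_sample` is a hypothetical helper, it only handles labeled samples of the shape shown above, and it assumes the trailing integer is a millisecond timestamp as in the standard exposition format.

```python
import re

# One sample line: name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+'
    r'(?P<value>\S+)(?:\s+(?P<ts>\d+))?$'
)
# One label pair inside the braces; values may contain escaped characters.
LABEL_RE = re.compile(r'(?P<key>[A-Za-z_][A-Za-z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return a dict for a labeled sample line, or None for comments/other lines."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        return None
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels"))
    }
    ts = match.group("ts")
    return {
        "name": match.group("name"),
        "labels": labels,
        "value": float(match.group("value")),
        "timestamp_ms": int(ts) if ts else None,
    }


line = ('validation_status{node="ufm-host38",port="enp3s0f1np1",'
        'MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1 1759345923752')
sample = parse_sample(line)
print(sample["name"], sample["value"], sample["labels"]["MediaUnplugged"])
```

For production use, an established exposition-format parser is likely a better choice than a hand-written regular expression.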
JSON Format
    {
      "ufm-host38:enp3s0f0np0": {
        "timestamp": 1757524769.645,
        "port_info": {
          "node_name": "ufm-host38",
          "port_name": "enp3s0f0np0",
          "peer_node_name": "r-ufm-sw-eth01",
          "peer_port_name": "swp2",
          "node_type": "Host",
          "su_number": "SU1",
          "data_hall": "DH1"
        },
        "port_labels": {
          "cable_sn": "ABC123",
          "cable_pn": "DEF456",
          "protocol": "400G"
        },
        "port_stats": {
          "effective_ber": 1.5e-254,
          "module_temperature": 65.2,
          "rx_power_lane_0": -2.5
        },
        "validation_data": {
          "issues_count": 1,
          "last_report_time": 1757524769.645,
          "issues": {
            "WrongNeighbor": "Check cable connection to switch2"
          }
        }
      }
    }
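The JSON document is keyed by `node:port`, with validation issues nested under `validation_data.issues`. As an illustration of walking that structure, here is a small sketch; `ports_with_issues` is a hypothetical helper, and the embedded payload is a trimmed copy of the sample above rather than live output.

```python
import json

# Trimmed copy of the sample payload shown above.
SAMPLE = """
{
  "ufm-host38:enp3s0f0np0": {
    "port_stats": {"effective_ber": 1.5e-254, "module_temperature": 65.2},
    "validation_data": {
      "issues_count": 1,
      "issues": {"WrongNeighbor": "Check cable connection to switch2"}
    }
  }
}
"""


def ports_with_issues(payload):
    """Return (port_key, issue_name, description) for every reported issue."""
    found = []
    for port_key, port in payload.items():
        # Ports without validation data simply contribute no rows.
        issues = port.get("validation_data", {}).get("issues", {})
        for name, description in issues.items():
            found.append((port_key, name, description))
    return found


for port_key, name, description in ports_with_issues(json.loads(SAMPLE)):
    print(f"{port_key}: {name} ({description})")
```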
Metrics Reference
Power Metrics
| Metric | Type | Description | Unit |
|---|---|---|---|
| rx_power_lane_N | gauge | RX optical power for lane N (0-7; not all lanes may be present) | dBm |
| | gauge | TX optical power for lane N (0-7; not all lanes may be present) | dBm |
| | gauge | RX power high threshold | dBm |
| | gauge | RX power low threshold | dBm |
BER Metrics
| Metric | Type | Description |
|---|---|---|
| effective_ber | gauge | Effective Bit Error Rate |
| | gauge | Raw Bit Error Rate |
Temperature Metrics
| Metric | Type | Description | Unit |
|---|---|---|---|
| module_temperature | gauge | Current module temperature | Celsius |
| temperature_high_th | gauge | Temperature high threshold | Celsius |
| | gauge | Temperature low threshold | Celsius |
Status Metrics
| Metric | Type | Description | Values |
|---|---|---|---|
| | gauge | Port plugged status | 1=Up, 0=Down |
| | gauge | Port operational status | 1=Up, 0=Down |
Counter Metrics
| Metric | Type | Description |
|---|---|---|
| transceiver_reinsert_cnt | counter | Number of transceiver reinsert events |
| | counter | Number of transceiver swap events |
| time_since_last_clear | gauge | Time since last counter clear (seconds) |
Validation Metrics
| Metric | Type | Description | Special Features |
|---|---|---|---|
| validation_status | gauge | Port validation status with issue descriptions | Value = issue count, descriptions in labels |
| | gauge | Timestamp of last validation report | Unix timestamp |
Validation Status Labels
The validation_status metric includes dynamic labels for each type of validation issue:
- WrongNeighbor: "Check cable connection to correct switch"
- MediaUnplugged: "Insert; Reseat or Replace Cable/Transceiver"
- AnomalousPort: "Temperature exceeds threshold"
- FlappingLink: "Reseat transceiver; Check Fiber"
- UnknownNeighbor: "Verify neighbor device connectivity"
- WrongPort: "Check port mapping in topology"
- ExtraCable: "Remove unexpected cable connection"
- UnreachableDevice: "Check device connectivity and power"
- LinkDown_NoSignal: "Check physical connection"
- ErrDisable_Flap: "Port disabled due to flapping"
- AdminDown: "Port administratively disabled"
- ErrDisable_Rx: "RX error disable condition"
- NegotiationFail: "Check autonegotiation settings"
- NicNameMismatch: "Verify NIC provisioning"
- ModulePnMismatch: "Replace with compatible module"
Note: Commas in descriptions are automatically converted to semicolons to maintain Prometheus label format compatibility.
Examples:
    # Port with validation issues (unplugged cable)
    validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1

    # Port without issues (healthy connection)
    validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0

    # Port with multiple validation issues
    validation_status{node="switch1",port="1/1",WrongNeighbor="Check cable connection to switch2",AnomalousPort="Temperature exceeds threshold"} 2
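The comma-to-semicolon conversion mentioned in the note can be pictured with a short sketch. This is not CVT's actual code: `sanitize_label_value` and `render_label` are hypothetical names, and the extra escaping of backslashes, double quotes, and newlines follows the general Prometheus text-format rules rather than anything stated in this document.

```python
def sanitize_label_value(description: str) -> str:
    """Make an issue description safe to embed as a Prometheus label value."""
    # Documented behavior: commas become semicolons.
    cleaned = description.replace(",", ";")
    # Assumption: also apply the standard text-format escapes.
    return (
        cleaned.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_label(name: str, description: str) -> str:
    """Render one syndrome label, e.g. MediaUnplugged="...". """
    return f'{name}="{sanitize_label_value(description)}"'


print(render_label("MediaUnplugged", "Insert, Reseat or Replace Cable/Transceiver"))
```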
Labels
Topology Labels (All Metrics)
- node: Switch or host name
- port: Port identifier
- peer_node: Connected peer node name
- peer_port: Connected peer port identifier
- node_type: Node type (Switch, Host, etc.)
- su_number: Scalable Unit identifier
- data_hall: Data hall location
Cable Labels (Status Metrics Only)
- cable_sn: Cable serial number
- cable_pn: Cable part number
- protocol: Cable protocol (400G, InfiniBand, etc.)
- port_status: Port status (Up, Down, etc.)
- plugged: Module plugged status
Performance Characteristics
Update Frequency
- Agent Data: Updated every 10 minutes (configurable)
- Metrics Cache: Invalidated on data changes
- Prometheus Scraping: Recommended 15-30 second intervals
Performance Metrics
| Deployment Size | Response Time | Memory Usage | Caching Strategy |
|---|---|---|---|
| < 10K ports | < 50ms | ~20MB | Full caching enabled |
| 10K-50K ports | < 200ms | ~100MB | Collection cache only |
| 50K+ ports | < 500ms | ~200MB | No caching, real-time generation |
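The tiers in this table amount to a simple size-based decision. The sketch below restates them as code for illustration only; `caching_strategy` and its return strings are hypothetical, and assigning exactly 10,000 and 50,000 ports to the higher tier is an assumption, since the table does not say how boundary values are handled.

```python
def caching_strategy(port_count: int) -> str:
    """Map a deployment size to the caching tier listed in the table."""
    if port_count < 10_000:
        return "full"             # port, collection, and label caches
    if port_count < 50_000:
        return "collection_only"  # aggregated output cached, no per-port cache
    return "none"                 # metrics generated on every request


for size in (2_500, 30_000, 120_000):
    print(size, caching_strategy(size))
```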
Optimization Features
- Multi-level Caching: Port, collection, and label caching
- Memory Adaptive: Automatically adjusts for large deployments
- Smart Change Detection: Only updates when cable/module data changes
- Zero Value Handling: Includes all values for complete visibility
Troubleshooting
Common Issues
1. No Metrics Data
Symptoms: Empty response or no metrics
Causes:
- CVT service not running
- No topology loaded
- No advanced stats collection
Solutions:
- Check service status
- Check topology loading
- Check agent connectivity
2. Missing Port Data
Symptoms: Some ports not appearing in metrics
Causes:
- Port not in loaded topology
- Agent not deployed on switch
- Advanced stats not collected
Solutions:
- Verify topology includes all expected ports
- Deploy agents on missing switches
- Check agent connectivity and data collection
3. Stale Timestamps
Symptoms: Old timestamps in metrics
Causes:
- Agent not sending updates
- Network connectivity issues
Solutions:
- Check agent logs for errors
- Verify network connectivity to switches
- Restart agents if necessary
4. Missing Validation Data
Symptoms: validation_status metrics missing or always 0
Causes:
- Validation reports not being generated
- Agent data filtering (switch not in topology)
- Report processing errors
Solutions:
- Verify validation is started on agents
- Check switch IP exists in topology
- Review agent and collector logs for errors
5. Inconsistent Issue Counts
Symptoms: validation_status count doesn't match expected issues
Causes:
- Issues filtered by port
- Report data synchronization issues
- Processing errors
Solutions:
- Check that report data includes port-specific issues
- Verify advanced stats and reports arrive together
- Review validation report structure
Performance Tuning
Environment Variables
Not supported yet (TBD). The variables below are placeholders and currently have no effect.

    # Adjust caching thresholds
    PROMETHEUS_MAX_CACHED_PORTS=10000
    # Disable detailed metrics for very large deployments
    PROMETHEUS_ENABLE_DETAILED_METRICS=false
    # Adjust cache TTL
    PROMETHEUS_CACHE_TTL=60
Memory Optimization
For deployments > 50K ports:
- Collection-level caching automatically disabled
- Port-level caching automatically disabled
- Real-time generation used (acceptable 200-500ms response time)
Security Considerations
Access Control
- HTTPS Required: All access must be over HTTPS
- No Authentication: Designed for automated monitoring tools
- Network Restrictions: Consider IP-based access control
Sensitive Data
- Network Topology: Metrics expose network structure
- Cable Information: Serial numbers and part numbers included
- Performance Data: Could reveal network capacity information
Recommended Security
    <Location /cablevalidation/metrics>
        Use BringupProxy
        SSLRequireSSL
        # Restrict to monitoring networks. RequireAny admits a client that
        # matches any one range; RequireAll would demand all three at once.
        <RequireAny>
            # Internal networks
            Require ip 10.0.0.0/8
            # Container networks
            Require ip 172.16.0.0/12
            # Private networks
            Require ip 192.168.0.0/16
        </RequireAny>
    </Location>
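The intent of this rule is that a scraper located in any one of the listed networks is admitted. To check which addresses that covers before rolling out the restriction, the same ranges can be evaluated with the standard `ipaddress` module. This is an offline sanity check under that reading of the rule, and `is_allowed` is a hypothetical helper, not part of CVT or Apache.

```python
import ipaddress

# Same ranges as the access rule above.
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_allowed(client_ip: str) -> bool:
    """True if the client address falls inside at least one allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


for ip in ("10.1.2.3", "172.20.0.5", "172.32.0.1", "8.8.8.8"):
    print(ip, is_allowed(ip))
```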
Integration Examples
Python Client Example
    import requests

    # Get metrics in different formats
    def get_cvt_metrics(server: str, port: int, response_format: str = 'prometheus'):
        endpoints = {
            'prometheus': '/cablevalidation/metrics',
            'json': '/cablevalidation/metrics/json',
            'csv': '/cablevalidation/metrics/csv'
        }
        url = f"https://{server}:{port}{endpoints[response_format]}"
        # verify=False skips TLS certificate verification; acceptable only for
        # self-signed lab certificates. Pass a CA bundle path in production.
        response = requests.get(url, verify=False, timeout=10)
        response.raise_for_status()
        if response_format == 'json':
            return response.json()
        return response.text

    # Usage (443 is the standard HTTPS port; adjust for your deployment)
    metrics = get_cvt_metrics('cvt-server.example.com', 443, 'json')
    for port_key, port_data in metrics.items():
        # Gauges with no data are omitted, so the key may be absent
        ber = port_data['port_stats'].get('effective_ber')
        if ber is not None and ber > 1e-12:
            print(f"High BER on {port_key}: {ber}")
Validation Monitoring Examples
    # Find all ports with validation issues
    validation_status > 0

    # Count affected ports per node for a given syndrome
    # (count rather than sum: each sample's value is the port's total issue count)
    count by (node) (validation_status{WrongNeighbor!=""})
    count by (node) (validation_status{MediaUnplugged!=""})
    count by (node) (validation_status{LinkDown_NoSignal!=""})

    # Find specific issue types
    validation_status{MediaUnplugged!=""} > 0      # Unplugged cables
    validation_status{AdminDown!=""} > 0           # Administratively disabled ports
    validation_status{ModulePnMismatch!=""} > 0    # Hardware compatibility issues

    # Ports with multiple issue types
    validation_status{WrongNeighbor!="",AnomalousPort!=""}

    # Correlation with performance metrics
    # (match on node and port because validation_status carries extra issue labels)
    (validation_status > 0) and on(node, port) (effective_ber > 1e-12)

    # Port status correlation
    validation_status{MediaUnplugged!=""} and on(node, port) port_info{module_oper_status="unplugged"}
Architecture
Data Flow
    Network Agents → CVT Collector → Advanced Stats + Report Data → Prometheus Collector → Metrics Endpoint
           ↓               ↓                       ↓                          ↓                    ↓
       (10 min)       (Real-time)           (Synchronized)           (Multi-level Cache)     (GET request)
Enhanced Data Processing
- Agent Data Validation: Switch IP validated against topology before processing
- Synchronized Processing: Advanced stats and validation reports processed together
- Optimized Issue Processing: Report data pre-processed to group issues by port (O(n+m) complexity)
- Independent Validation Cache: PortValidationStatus class with hash-based change detection
- Robust Syndrome Handling: Automatic fallback for unknown syndromes with developer warnings
- Smart Data Quality: NA values properly excluded, counters preserve semantics
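The O(n+m) figure for issue processing follows from grouping the m reported issues by port once and then doing a constant-time lookup for each of the n ports, instead of rescanning the report for every port. A minimal sketch of that idea is shown here; the function names and the `(port_key, syndrome, description)` tuple shape are assumptions for illustration, not the collector's real data model.

```python
from collections import defaultdict


def group_issues_by_port(report_issues):
    """Single pass over the report: O(m) for m issues."""
    grouped = defaultdict(dict)
    for port_key, syndrome, description in report_issues:
        grouped[port_key][syndrome] = description
    return grouped


def validation_counts(port_keys, report_issues):
    """Issue count per port: O(m) grouping plus O(n) dictionary lookups."""
    grouped = group_issues_by_port(report_issues)
    # .get() avoids creating empty entries for healthy ports.
    return {key: len(grouped.get(key, {})) for key in port_keys}


issues = [
    ("switch1:1/1", "WrongNeighbor", "Check cable connection to switch2"),
    ("switch1:1/1", "AnomalousPort", "Temperature exceeds threshold"),
    ("ufm-host38:enp3s0f1np1", "MediaUnplugged", "Insert; Reseat or Replace Cable/Transceiver"),
]
ports = ["switch1:1/1", "ufm-host38:enp3s0f1np1", "ufm-host38:enp3s0f0np0"]
print(validation_counts(ports, issues))
```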
Caching Strategy
- Port-level Cache: Individual port metrics cached until data changes
- Collection-level Cache: Aggregated output cached for fast retrieval
- Label Cache: Stable topology/cable labels cached separately
- Validation Cache: Independent cache for validation status with hash-based change detection
- Metadata Cache: Static TYPE/HELP comments cached permanently
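Hash-based change detection, as used for the validation cache, can be sketched as follows: a digest of each port's issue set is stored, and the rendered output is regenerated only when that digest changes. The `ValidationCache` class below is a hypothetical illustration and should not be read as the `PortValidationStatus` implementation; the choice of SHA-256 over canonical JSON is likewise an assumption.

```python
import hashlib
import json


class ValidationCache:
    """Caches rendered validation output per port, keyed by content hash."""

    def __init__(self):
        self._digests = {}
        self._rendered = {}

    @staticmethod
    def _digest(issues):
        # sort_keys makes the hash independent of dictionary ordering.
        canonical = json.dumps(issues, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def update(self, port_key, issues, render):
        """Re-render only if the issue content changed; return True if it did."""
        digest = self._digest(issues)
        if self._digests.get(port_key) == digest:
            return False
        self._digests[port_key] = digest
        self._rendered[port_key] = render(port_key, issues)
        return True

    def get(self, port_key):
        return self._rendered.get(port_key)


cache = ValidationCache()
render = lambda key, issues: f"{key} has {len(issues)} issue(s)"
print(cache.update("switch1:1/1", {"WrongNeighbor": "Check cable"}, render))
print(cache.update("switch1:1/1", {"WrongNeighbor": "Check cable"}, render))
```

A repeated report with identical issues leaves the cached entry untouched, which is the property the cache relies on to avoid needless regeneration.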
Performance Optimizations
- Push-based Updates: Metrics updated when advanced stats arrive
- Smart Change Detection: Only cable/module changes invalidate caches
- Memory Adaptive: Caching disabled automatically for large deployments
- String Manipulation: Efficient JSON aggregation using string operations
- Validation Processing: O(n+m) complexity with report preprocessing
- Hash-based Cache: Validation cache only invalidated when issue content changes
- Agent Data Filtering: Invalid switch IPs filtered early to prevent unnecessary processing
Monitoring Best Practices
Prometheus Configuration
- Scrape Interval: 15-30 seconds (agent data itself refreshes every 10 minutes by default, so values change only that often between scrapes)
- Timeout: 10 seconds (allows for cache generation)
- Retention: Configure based on historical analysis needs
Alerting Guidelines
- BER Thresholds: Alert when effective_ber > 1e-12
- Temperature Limits: Alert when module_temperature approaches temperature_high_th
- Validation Issues: Alert when validation_status > 0
- Critical Issues: Alert on specific syndromes (MediaUnplugged, UnreachableDevice, ModulePnMismatch)
- Infrastructure Issues: Alert on LinkDown_NoSignal, ErrDisable conditions
- Counter Anomalies: Alert on rapid increases in transceiver_reinsert_cnt
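For scripts that consume the JSON output instead of Prometheus rules, the BER and temperature guidelines can be approximated in a few lines. This is a sketch under stated assumptions: `evaluate_port` is hypothetical, the 5 °C margin is one arbitrary reading of "approaches", and the presence of `temperature_high_th` as a key next to `module_temperature` is assumed rather than documented.

```python
def evaluate_port(stats, ber_limit=1e-12, temp_margin_c=5.0):
    """Return alert names for one port's stats dict (missing values are skipped)."""
    alerts = []

    ber = stats.get("effective_ber")
    if ber is not None and ber > ber_limit:
        alerts.append("high_ber")

    temperature = stats.get("module_temperature")
    high_threshold = stats.get("temperature_high_th")
    if (
        temperature is not None
        and high_threshold is not None
        and temperature >= high_threshold - temp_margin_c
    ):
        alerts.append("temperature_near_threshold")

    return alerts


print(evaluate_port({"effective_ber": 1.5e-254, "module_temperature": 65.2,
                     "temperature_high_th": 70.0}))
```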
Sample Alerting Rules
    # Validation issues alert
    - alert: PortValidationIssues
      expr: validation_status > 0
      labels:
        severity: warning
      annotations:
        summary: "Port {{ $labels.node }}:{{ $labels.port }} has validation issues"
        description: "{{ $value }} validation issues detected"

    # Critical validation issues
    - alert: CriticalPortIssues
      expr: validation_status{MediaUnplugged!=""} > 0 or validation_status{UnreachableDevice!=""} > 0
      labels:
        severity: critical
      annotations:
        summary: "Critical issues on {{ $labels.node }}:{{ $labels.port }}"
        description: "{{ if $labels.MediaUnplugged }}Cable unplugged: {{ $labels.MediaUnplugged }}{{ end }}{{ if $labels.UnreachableDevice }}Device unreachable: {{ $labels.UnreachableDevice }}{{ end }}"

    # Infrastructure issues
    - alert: InfrastructureIssues
      expr: validation_status{LinkDown_NoSignal!=""} > 0 or validation_status{ModulePnMismatch!=""} > 0
      labels:
        severity: warning
      annotations:
        summary: "Infrastructure issue on {{ $labels.node }}:{{ $labels.port }}"

    # Administrative issues
    - alert: AdminIssues
      expr: validation_status{AdminDown!=""} > 0 or validation_status{NicNameMismatch!=""} > 0
      labels:
        severity: info
      annotations:
        summary: "Administrative issue on {{ $labels.node }}:{{ $labels.port }}"
Version Information
Release Notes
- Version: 1.1.0
- Release Date: October 2025
- Compatibility: CVT 1.7.0 and later
- Dependencies: Requires advanced stats collection enabled
New in Version 1.1.0
- ✅ Validation Metrics Integration: Port validation status with actionable issue descriptions
- ✅ Synchronized Data Processing: Advanced stats and validation reports processed together
- ✅ Performance Optimizations: O(n+m) validation processing, hash-based change detection
- ✅ Enhanced Security: Agent data validation prevents processing from unknown switches
- ✅ Improved Data Quality: None-based initialization for gauges, proper counter semantics
- ✅ Better Caching: Independent validation cache with content-based invalidation
- ✅ Comprehensive Syndrome Coverage: 15+ validation issue types with fallback handling
- ✅ Real-world Validation: Successfully tested with production data and unplugged ports
API Stability
- Metric Names: Stable (no breaking changes planned)
- Label Names: Stable (additions possible, no removals)
- Output Format: Prometheus standard compliance maintained
- Endpoint URLs: Stable API contract
Data Quality Improvements
Enhanced Counter Semantics
- Gauges default to None: Missing sensor data excluded instead of showing false zeros
- Counters preserve values: No unexpected resets when data temporarily unavailable
- Proper NA handling: Invalid data marked as NA and excluded from metrics
- Temperature accuracy: Fixed zero temperature issue by using actual amber timestamps
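The difference between a missing gauge and a genuine zero is easiest to see in code: a value of None is skipped entirely, while 0.0 is still emitted. The following sketch illustrates that rule only; `render_gauges` is a hypothetical helper and does not reproduce the real collector's output path.

```python
def render_gauges(metric_values, labels):
    """Render gauge lines, skipping metrics whose value is None."""
    label_text = ",".join(f'{key}="{value}"' for key, value in labels.items())
    lines = []
    for name, value in metric_values.items():
        if value is None:
            # No sensor data: omit the sample rather than report a false zero.
            continue
        lines.append(f"{name}{{{label_text}}} {value}")
    return lines


values = {"module_temperature": None, "effective_ber": 0.0}
for line in render_gauges(values, {"node": "ufm-host38", "port": "enp3s0f1np1"}):
    print(line)
```

This matches the unplugged-port sample earlier in the document, where power and temperature metrics are absent but effective_ber is reported as 0.0.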
Validation Integration Benefits
- Synchronized processing: Performance metrics and validation issues always in sync
- Rich context: Issue descriptions provide actionable corrective actions
- Efficient processing: O(n+m) complexity prevents performance degradation
- Smart caching: Validation cache independent of performance metrics cache
Support
Troubleshooting
- Check CVT Service: Ensure Cable Validation service is running
- Verify Topology: Confirm network topology is loaded
- Agent Status: Check that agents are deployed and collecting data
- Network Connectivity: Verify switch/host accessibility
Performance Monitoring
    # Check metrics endpoint response time
    time curl -k https://cvt-server/cablevalidation/metrics > /dev/null
Contact Information
- Development Team: Cable Validation Engineering
- Documentation: [Internal Wiki Link]
- Support: [Support Channel/Email]
This endpoint provides comprehensive cable validation metrics for modern monitoring and observability workflows, enabling proactive network health management and automated alerting.