NVIDIA NVOS User Manual for InfiniBand Switches

gNMI Streaming

The gRPC Network Management Interface (gNMI) collects system resource, interface, and counter information from NVOS and exports it to your gNMI client.

Configure the gNMI Agent using NVUE CLI Commands

You can view and set the gNMI server state on NVOS with the following NVUE CLI commands:

Show command:

nvos@switch:~$ nv show system gnmi-server
             operational    applied      
-----------  -------------  -----------
state        enabled        enabled         
certificate  self-signed    self-signed 
is-running   yes                                    
version      4.13.0-3000-2                          

Set command:

nvos@switch:~$ nv set system gnmi-server state <enabled | disabled>

Unset command:

nvos@switch:~$ nv unset system gnmi-server state

The gNMI server is enabled by default. The unset command restores the state to this default (enabled).

Supported Subscription Modes

NVOS supports the following gNMI subscription modes:

  • STREAM Mode: The client opens a long-lived subscription and the server keeps sending updates, either when a value changes (ON_CHANGE) or at a fixed interval (SAMPLE). This mode is suitable for scenarios that need ongoing, near real-time telemetry.

  • ONCE Mode: This mode retrieves the data once and then terminates the subscription. It's ideal for scenarios where a single snapshot of the data is needed without ongoing updates.

  • POLL Mode: The subscription stays open, and the server sends the current data each time the client sends a poll request. This mode lets the client decide when data is fetched, for example at intervals it schedules itself.

Supported stream modes:

  • ON_CHANGE: Data updates are sent only when the value of the data item changes.

  • SAMPLE: The client receives periodic samples of telemetry data at a specified interval. This mode suits monitoring and analytics scenarios that need regular updates rather than a notification for every change.

Key Parameters for STREAM SAMPLE Mode:

  • sample_interval (mandatory): Defines the interval at which samples are sent to the client. This parameter controls the frequency of data transmission.

  • suppress_redundant (optional, default false): Determines whether redundant data updates, which have not changed since the last sample, should be suppressed. This helps in reducing unnecessary data transmission and optimizing network usage.

  • heartbeat_interval (optional, default disabled): Specifies the interval for sending heartbeat messages to indicate that the connection is still active. Heartbeats help in monitoring the health of the connection and detecting failures.
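The interaction between sample_interval, suppress_redundant, and heartbeat_interval can be sketched as a small decision function that runs once per sample tick. This is an illustrative model only, not the NVOS implementation: it assumes, following the general gNMI specification, that a heartbeat re-sends the current value even when it has not changed. The names SampleState and should_send are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleState:
    """Per-path bookkeeping for one STREAM SAMPLE subscription (hypothetical)."""
    last_sent_value: Optional[object] = None
    last_sent_at: Optional[float] = None


def should_send(state, value, now, suppress_redundant=False, heartbeat_interval=None):
    """Decide at a sample tick whether an update is sent, and record it if so.

    `now` and `heartbeat_interval` are in seconds (an assumption of this sketch).
    """
    first = state.last_sent_at is None
    changed = value != state.last_sent_value
    # Assumed heartbeat semantics: force an update once the interval has elapsed.
    heartbeat_due = (
        heartbeat_interval is not None
        and not first
        and now - state.last_sent_at >= heartbeat_interval
    )
    send = first or not suppress_redundant or changed or heartbeat_due
    if send:
        state.last_sent_value = value
        state.last_sent_at = now
    return send


# Sample every 30 s, suppress unchanged values, heartbeat every 120 s.
state = SampleState()
decisions = [
    should_send(state, 256, t, suppress_redundant=True, heartbeat_interval=120)
    for t in (0, 30, 60, 90, 120, 150)
]
print(decisions)
```

In this model, an unchanged value produces an update only on the first tick and again at the 120-second tick, when the heartbeat is due. With suppress_redundant left at its default of false, every tick produces an update.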

Subscription Parameter Constraints:

 To ensure stable telemetry streaming, the gNMI server enforces the following validation checks. A subscription request will fail if any of these conditions are met:

  • Invalid Mode Constraints: The sample_interval or heartbeat_interval parameter is configured when the subscription mode is not STREAM (for example, when the mode is ONCE or POLL).

  • Minimum Interval: The sample_interval for a STREAM subscription is below the minimum allowed value of 1 second (1s).

  • Heartbeat Constraints: The sample_interval is strictly greater than the heartbeat_interval.

  • ON_CHANGE Restrictions: The sample_interval or suppress_redundant parameters are configured when the stream mode is set to ON_CHANGE.
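A client can apply the same four checks before sending a request, which avoids a round trip that the server would reject. The following is a minimal sketch that mirrors the rules listed above. The function name, the use of seconds as the unit, and the error strings are assumptions made for illustration and do not reflect the server's actual error messages.

```python
from typing import List, Optional

MIN_SAMPLE_INTERVAL_S = 1.0  # minimum sample_interval stated in the text


def validate_subscription(
    mode: str,
    stream_mode: Optional[str] = None,
    sample_interval: Optional[float] = None,
    heartbeat_interval: Optional[float] = None,
    suppress_redundant: bool = False,
) -> List[str]:
    """Return the list of violated constraints (empty list means valid)."""
    errors = []

    # Invalid mode: interval parameters only make sense for STREAM.
    if mode.upper() != "STREAM":
        if sample_interval is not None or heartbeat_interval is not None:
            errors.append("sample_interval/heartbeat_interval require STREAM mode")
        return errors

    # ON_CHANGE restrictions.
    if (stream_mode or "").upper() == "ON_CHANGE":
        if sample_interval is not None or suppress_redundant:
            errors.append("sample_interval/suppress_redundant not allowed with ON_CHANGE")

    if sample_interval is not None:
        # Minimum interval.
        if sample_interval < MIN_SAMPLE_INTERVAL_S:
            errors.append("sample_interval is below the 1s minimum")
        # Heartbeat constraint.
        if heartbeat_interval is not None and sample_interval > heartbeat_interval:
            errors.append("sample_interval is greater than heartbeat_interval")

    return errors


# A 30 s sample interval with a 120 s heartbeat passes all checks.
print(validate_subscription("STREAM", "SAMPLE", sample_interval=30,
                            heartbeat_interval=120, suppress_redundant=True))
# A heartbeat shorter than the sample interval is reported.
print(validate_subscription("STREAM", "SAMPLE", sample_interval=30,
                            heartbeat_interval=10))
```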

gRPC Tunnel (gNMI Dial-out)

The gRPC Tunnel (gNMI Dial-out) feature enables outbound telemetry streaming from the switch to remote collectors or controllers. It allows devices to establish secure, client-initiated connections, making it possible to operate in environments with NAT or restrictive firewall policies where inbound connectivity is not feasible.

The feature provides a configurable framework for managing multiple tunnel endpoints, supporting secure communication using TLS with certificate-based authentication. It integrates with the system’s configuration and service infrastructure to handle connection lifecycle, retries, and operational state tracking.

The gRPC Tunnel can be controlled using the following CLI commands:

Display gRPC Tunnel servers: 

nvos@switch:~$ nv show system grpc-tunnel server

Configure new server:

nvos@switch:~$ nv set system grpc-tunnel server <test-server> address <address>
nvos@switch:~$ nv set system grpc-tunnel server <test-server> port <port>

Set certificate:

nvos@switch:~$ nv set system grpc-tunnel server <test-server> certificate my-client-cert

Supported Models

Models Overview

The NVOS gNMI Model is based on OpenConfig YANG models, extended with NVIDIA-specific augments where required.
It provides a consistent, vendor-neutral telemetry structure while allowing NVIDIA to expose additional InfiniBand, platform, and diagnostic data.

The gNMI YANG models consist of:

  • Standard OpenConfig models (baseline support)

  • NVIDIA Models (NVOS-specific enrichment)

  • Legacy NVIDIA models retained for backward compatibility

OpenConfig Models

Model

Supported Data

openconfig-interfaces

Base interface configuration, state, and counters: Name, Description, AdminStatus, OperStatus, Enabled, IfIndex, LoopbackMode, and base interface counters (InPkts, OutPkts, InOctets, OutOctets, InUnicastPkts, OutUnicastPkts, InMulticastPkts, OutMulticastPkts, InBroadcastPkts, OutBroadcastPkts, InDiscards, OutDiscards, InErrors, OutErrors), plus InfiniBand-specific interface state (IBSpeed, Speed, IBSubnet, LogicalPortState, PhysicalPortState, MaintenanceState, MTU, MaxSupportedMTUs, SupportedIBSpeeds, SupportedWidths, VLCapabilities, OperationalVL) and InfiniBand port counters (SymbolErrorCounter, XmitWait, RcvErrors, RcvRemotePhyErrors, RcvSwitchRelayErrors, LocalLinkIntegrityErrors, ExcessiveBufferOverrun, LinkErrorRecovery, LinkDowned, QP1Dropped, VL15Dropped and related IB statistics).

openconfig-system

System identity, software, and resource usage: Hostname, BootTime, SoftwareVersion, Location, Contact, RoutingMAC, CPU utilization (aggregate Total/Average), and system memory usage (Physical, Used).

openconfig-platform

Chassis, ASIC, PSU, fan, storage, and other hardware inventory: Component Name, Type, Description, ModelName, PartNo, SerialNo, FirmwareVersion, OperStatus, Temperature, plus component-specific data for fans (Speed, Status), PSUs (Enabled, InputVoltage, InputCurrent, OutputVoltage, OutputCurrent, OutputPower, Status), ASICs (Name, Temperature), chassis/switch (SerialNo, ModelName, PartNo, OperStatus), storage (TotalSize), and platform health (Health Status, LastUnhealthy, UnhealthyCount), as well as the last reboot reason (LastRebootTime, LastRebootReason).

openconfig-platform-transceiver

Optical transceiver module and channel monitoring: module presence and identity (Present, FormFactor, VendorPart, SerialNo), electrical and thermal telemetry (SupplyVoltage, LaserTemperature, module temperature thresholds – Lower, Upper), per-channel optical DOM data (InputPower, OutputPower, LaserBiasCurrent), and per-channel / host-lane status flags (RxCDRLoL, RxLOS, TxCDRLoL, TxLOS, TxFault, TxAdEqFault) and module temperature / voltage alarm flags.

openconfig-platform-healthz

Component health status and history: Status, LastUnhealthy, UnhealthyCount.

NVIDIA Models

These models extend OpenConfig to expose NVIDIA-specific telemetry that is not covered by the base OpenConfig schemas.

Model

Supported Data

nvidia-interfaces-infiniband

InfiniBand-specific interface configuration and state: IBSpeed, Speed, IBSubnet, LogicalPortState, PhysicalPortState, MaintenanceState, MTU, MaxSupportedMTUs, SupportedIBSpeeds, SupportedWidths, VLCapabilities, OperationalVL, SpeedNegotiate and related InfiniBand admin/oper fields.

nvidia-interfaces-infiniband-errors-ext

InfiniBand-specific error and status counters: ExcessiveBufferOverrun, LinkErrorRecovery, LinkDowned, LocalLinkIntegrityErrors, RcvErrors, RcvRemotePhyErrors, RcvSwitchRelayErrors, QP1Dropped, VL15Dropped and similar InfiniBand-specific port error counters.

nvidia-system-augments

NVIDIA-specific system metadata: system Location and Contact, plus other NVIDIA system-level extensions modeled as augments to the openconfig-system tree (superseding the legacy platform-general location/contact fields).

nvidia-system-events

Structured system event reporting: EventId, TypeId, Text, Resource, Severity, TimeCreated.

nvidia-if-phy-augments

Enhanced physical-layer diagnostics and BER/FEC telemetry: general PHY and BER state (TimeSinceLastClear, EffectiveErrors, ReceivedBits, SymbolErrors, RawBER, EffectiveBER, SymbolBER, ProfileFECInUse, ZeroHist), per-lane BER and error counters (per-channel RawBER and RawErrors), RS histogram bins (RSCorrectedError counters), link-down statistics (TotalEvents, IntentionalEvents, UnintentionalEvents) and reasons (Local/Remote reason code and status), recovery statistics (LastLogicRecoveryAttempts, LastSerdesEqRecoveryAttempts, TimeBetweenLastTwoRecoveries, TimeInLastLogicRecoveryEvent, TimeInLastSerdesEqRecoveryEvent, TimeSinceLastRecovery, TotalSuccessfulRecoveryEvents), PLR metrics (PLR_BW_LossPercent, PLR_CodesLoss, PLR_RcvCodes, PLR_RcvCodeErr, PLR_RcvUncorrectableCode, PLR_SyncEvents, PLR_XmitCodes, PLR_XmitRetryCodes, PLR_XmitRetryEvents, PLR_XmitRetryEventsWithinTsecMax), and InfiniBand port error and port statistic counters (PortBufferOverrunErrors, PortDLIDMappingErrors, PortInactiveDiscards, PortLocalPhysicalErrors, PortLoopingErrors, PortMalformedPacketErrors, PortNeighborMTUDiscards, PortVLMappingErrors, PortRcvData, PortRcvPkts, PortUnicastRcvPkts, PortUnicastXmitPkts, PortMulticastRcvPkts, PortMulticastXmitPkts, PortXmitData, PortXmitPkts, RQGeneralError, SyncHeaderErrorCounter).

nvidia-platform-integrated-circuit-augments

ASIC power telemetry over standard integrated-circuit model: LongTermAvgPower, ShortTermAvgPower (average power values per monitoring interval on ASIC integrated-circuit power).

nvidia-platform-storage-augments

Switch-local storage utilization: TotalSize for the logical switch storage device.

nvidia-platform-transceiver-augments

Transceiver firmware and alarm model: DataPathFirmwareFault, ModuleFirmwareFault, ModuleErrorType and generic alarm state (AlarmStatus, AlarmSeverity, AlarmThreshold) for module temperature and supply voltage, and for channel InputPower, OutputPower and LaserBiasCurrent (replacing legacy module/channel-specific alarm flags).

nvidia-ib-router

InfiniBand (IB) router state model: router enabled status and SWID count, per-subnet operational state (logical-state, physical-state, subnet-prefix, validity, per-ASIC GIDs), and per-subnet interface counters (in/out octets, packets, discards, errors, unicast/multicast packets, excessive-buffer-overrun, xmit-wait, link-downed, link-error-recovery, rcv-switch-relay-errors, rcv-constraints-errors, local-link-integrity-errors).

nvidia-system-image-augments

System image partition information: Partition Id, Image build-id, current and next partitions, partition description, partition install disk location, Software Release.

nvidia-platform-reboot

Reboot reason details (last-reboot-details) and reboot counters: power-failure, critical-error, user-initiated, total.

Legacy NVIDIA Models

NVOS exposes a set of legacy NVIDIA YANG models for backward compatibility.
These models exist only to support deprecated gNMI xpaths. All of their data is available through the models above, and the legacy models are planned for removal in a future NVOS release.

Model

Supported Data (legacy)

nvidia-platform-general-ext

Legacy platform-wide system and resource information: Contact, Location, NOSVersion, PlatformName, MemoryTotalSize, MemoryUsed, DiskTotalSize, DiskUsed, AmbientTemperature and LeakSensor Id/State.

nvidia-platform-general-ext-versions

Legacy system component firmware inventory: FWVersionBIOS, FWVersionBMC, FWVersionFPGA, FWVersionEROT and FWVersionCPLD / FWVersionSMA entries (per-id version and id).

nvidia-platform-asic

Legacy ASIC-specific telemetry model: ASICName, ASICTemp, LongTermAvgPower, ShortTermAvgPower.

nvidia-if-phy-diag

Legacy PHY diagnostic model: CableProtoCapExt, CoreToPhyLinkProtoEnabled, CoreToPhyLinkWidthEnabled, ETH-AN/IB-PHY/PD/PHY-HST/PHY-Manager FSM and link mode fields, LoopbackMode, FECModeRequest, ProfileFECInUse, EffectiveBER, RawBER, SymbolBER, EffectiveErrors, PhyReceivedBits, SymbolErrors, RS histogram bins (RS_Num_Corr_Err_Bin0–Bin15), PLR_* metrics, InfiniBand port-errors and port-statistics counters, link-down and recovery metrics (LinkDown, IntentionalLinkDownEvents, UnintentionalLinkDownEvents, LinkDownReasonCode/Status Local/Remote, TimeSinceLastClear, TimeBetweenLastTwoRecoveries, TimeInLastLogic/SerdesEqRecoveryEvent, TimeSinceLastRecovery, TotalSuccessfulRecoveryEvents, ZeroHist), and related PHY diagnostic leaves.

nvidia-platform-transceiver-diag

Legacy transceiver diagnostics model: ModuleOperStatus, DataPathFirmwareFault, ModuleFirmwareFault, ModuleErrorType, module TemperatureHigh/Low Alarm and Warning flags, VccHigh/Low Alarm and Warning flags, and channel-level flags for TxAdEqFault, TxFault, TxCDRLoL, TxLOS, RxCDRLoL, RxLOS. 

YANG Model Availability

The YANG models above are available on the NVIDIA Enterprise Support Portal → Downloads → Switches and Gateways → Switch Software → QM-3 NVOS InfiniBand → More files.

NVOS YANG Package Structure

The NVOS YANG package is provided as a tar archive with the following structure:

models/
  ietf                      IETF standard base YANG models
  openconfig                OpenConfig models with NVIDIA Model augments
  nvos                      NVOS-specific OpenConfig augments kept for legacy
                            backward compatibility
  not-supported             Deviation modules that mark non-supported leaves and
                            nodes in the models above
  gnmi-supported-paths.html Reference list of all gNMI-supported paths in this
                            release

gNMI Connection and Rate Limiting

The gNMI service enforces limitations on the number of active and incoming gRPC connections to ensure system stability and optimal resource usage.

  • Maximum Established Connections:
    The gNMI server supports a maximum of 10 concurrently established gRPC connections at any given time. Once this limit is reached, new connection attempts will be rejected until at least one of the existing connections is terminated.

  • Source IP–Based Rate Limiting:
    The gNMI server allows up to 10 concurrent TCP connections from the same source IP address. If additional connection requests are initiated from that IP while the limit is reached, those connection attempts will be dropped automatically. The new connections will only be accepted when the number of active TCP sessions from that IP drops below the configured threshold.

  • To enhance the security of gNMI communications, it is strongly recommended to implement mutual TLS (mTLS) authentication together with SPIFFE (Secure Production Identity Framework For Everyone):

    • Mutual TLS (mTLS): Ensures that both client and server authenticate each other using trusted X.509 certificates, thereby preventing unauthorized access and man‑in‑the‑middle attacks.

    • SPIFFE Integration: Leverages SPIFFE IDs to provide consistent, identity-based authentication and authorization across services. This minimizes dependence on static credentials and simplifies certificate management.

gNMI Client Requests

A gNMI client on a host can request capabilities and data from the switch. The examples below use the gNMIc client.

The following example shows a gNMIc STREAM SAMPLE mode request for specific Interface data, with a sample interval of 30 seconds, suppress redundant flag enabled, and heartbeat interval of 120 seconds:

gnmic -a "IP" --port 9339 --skip-verify subscribe --prefix "interfaces"  --path "/interface[name=sw1p1]"  --target nvos -u admin -p ***** --mode stream --stream-mode sample --sample-interval 30s --suppress-redundant --heartbeat-interval 120s
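When subscriptions are launched from a script, the command line above can be assembled programmatically. The sketch below builds the argument list using only the gnmic flags that appear in this section. The function name and its parameters are hypothetical, and the address and credentials shown are placeholders.

```python
import shlex
from typing import List, Optional


def build_subscribe_command(
    address: str,
    prefix: str,
    path: str,
    username: str,
    password: str,
    port: int = 9339,
    target: str = "nvos",
    mode: str = "stream",
    stream_mode: Optional[str] = None,
    sample_interval: Optional[str] = None,
    heartbeat_interval: Optional[str] = None,
    suppress_redundant: bool = False,
) -> List[str]:
    """Build a gnmic subscribe argv list from the flags used in this section."""
    argv = [
        "gnmic", "-a", address, "--port", str(port), "--skip-verify",
        "subscribe", "--prefix", prefix, "--path", path,
        "--target", target, "-u", username, "-p", password,
        "--mode", mode,
    ]
    # Optional STREAM parameters are appended only when they are set.
    if stream_mode:
        argv += ["--stream-mode", stream_mode]
    if sample_interval:
        argv += ["--sample-interval", sample_interval]
    if suppress_redundant:
        argv.append("--suppress-redundant")
    if heartbeat_interval:
        argv += ["--heartbeat-interval", heartbeat_interval]
    return argv


argv = build_subscribe_command(
    "IP", "interfaces", "/interface[name=sw1p1]", "admin", "<password>",
    stream_mode="sample", sample_interval="30s",
    heartbeat_interval="120s", suppress_redundant=True,
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list directly to subprocess.run, rather than joining it into one string, avoids shell quoting issues with the brackets in the path.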

The following example shows a gNMIc STREAM ON_CHANGE mode request for system events, with the updates-only flag enabled:

gnmic -a "IP" --port 9339 --skip-verify subscribe --prefix "/system-events"  --path "" --target nvos -u admin -p ***** --mode stream --stream-mode on-change --updates-only

The following example shows a gNMIc ONCE mode request and server response for IB interface MTU (-d for debug mode):

gnmic -a "IP" --port 9339 --skip-verify subscribe --prefix "interfaces"  --path "/interface[name=sw1p1]/infiniband/state/mtu" -d --target nvos -u admin -p ***** --mode once
{
  "source": "IP",
  "subscription-name": "default-1709707931",
  "timestamp": 1709707925858795109,
  "time": "2024-03-06T08:52:05.858795109+02:00",
  "prefix": "interfaces/interface[name=sw1p1]",
  "target": "nvos",
  "updates": [
    {
      "Path": "infiniband/state/mtu",
      "values": {
        "infiniband/state/mtu": 256
      }
    }
  ]
} 
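A script that consumes this JSON output typically needs the full path and value of each update. The following sketch parses the response shown above and joins the prefix with each relative path. The helper name extract_values is hypothetical, and the code assumes the response has the same shape as this example.

```python
import json

# The server response from the example above, as printed by gNMIc.
response_text = """
{
  "source": "IP",
  "subscription-name": "default-1709707931",
  "timestamp": 1709707925858795109,
  "time": "2024-03-06T08:52:05.858795109+02:00",
  "prefix": "interfaces/interface[name=sw1p1]",
  "target": "nvos",
  "updates": [
    {
      "Path": "infiniband/state/mtu",
      "values": {
        "infiniband/state/mtu": 256
      }
    }
  ]
}
"""


def extract_values(response: dict) -> dict:
    """Map each full path (prefix + relative path) to its reported value."""
    prefix = response.get("prefix", "").rstrip("/")
    result = {}
    for update in response.get("updates", []):
        for rel_path, value in update.get("values", {}).items():
            full_path = f"{prefix}/{rel_path}" if prefix else rel_path
            result[full_path] = value
    return result


values = extract_values(json.loads(response_text))
print(values["interfaces/interface[name=sw1p1]/infiniband/state/mtu"])  # prints 256
```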

The following example shows a gNMIc ONCE request for all supported paths:

gnmic -a "IP" --port 9339 --skip-verify subscribe --prefix "/"  --path "" --target nvos -u admin -p ***** --mode once

The following example shows a gNMIc POLL mode request and server response for FAN1/1 speed:

gnmic -a "IP" --port 9339 --skip-verify subscribe --prefix "components"  --path "component[name=FAN1/1]/fan/state/speed"  --target nvos -u admin -p *****  --format flat --mode poll
components/component[name=FAN1/1]/fan/state/speed: 33
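The flat format prints one "path: value" pair per line, which makes it easy to post-process. The sketch below splits such a line into its path and value. It is a minimal example that assumes the path itself contains no ": " sequence, and the helper name parse_flat_line is hypothetical.

```python
def parse_flat_line(line: str):
    """Split one flat-format line into (path, value).

    The value is converted to int when possible and left as a string otherwise.
    """
    path, sep, raw_value = line.partition(": ")
    if not sep:
        raise ValueError(f"not a flat-format line: {line!r}")
    raw_value = raw_value.strip()
    try:
        value = int(raw_value)
    except ValueError:
        value = raw_value
    return path.strip(), value


path, speed = parse_flat_line("components/component[name=FAN1/1]/fan/state/speed: 33")
print(path)   # components/component[name=FAN1/1]/fan/state/speed
print(speed)  # 33
```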

The following example shows a gNMIc STREAM mode request for the "text" leaf of a specific system event, with PROTO encoding:

gnmic -a "IP" --port 9339 --skip-verify subscribe --prefix "system-events"  --path "system-event[event-id=38]/state/text"  --target nvos -u admin -p ***** --encoding proto --format prototext --mode stream
 
sync_response:  true

update:  {
  timestamp:  1719295967820127958
  prefix:  {
    elem:  {
      name:  "system-events"
    }
    elem:  {
      name:  "system-event"
      key:  {
        key:  "event-id"
        value:  "38"
      }
    }
    target:  "nvos"
  }
  update:  {
    path:  {
      elem:  {
        name:  "state"
      }
      elem:  {
        name:  "text"
      }
    }
    val:  {
      string_val:  "Interface admin state is up"
    }
  }
}

A list of supported events can be found in the Event Management page.

The following example shows a grpcurl command that describes the server through the gRPC reflection service:

docker run fullstorydev/grpcurl -H username:admin -H password:***** -insecure "IP":9339 describe

gnmi.gNMI is a service:
service gNMI {
  rpc Capabilities ( .gnmi.CapabilityRequest ) returns ( .gnmi.CapabilityResponse );
  rpc Get ( .gnmi.GetRequest ) returns ( .gnmi.GetResponse );
  rpc Set ( .gnmi.SetRequest ) returns ( .gnmi.SetResponse );
  rpc Subscribe ( stream .gnmi.SubscribeRequest ) returns ( stream .gnmi.SubscribeResponse );
}
grpc.reflection.v1.ServerReflection is a service:
service ServerReflection {
  rpc ServerReflectionInfo ( stream .grpc.reflection.v1.ServerReflectionRequest ) returns ( stream .grpc.reflection.v1.ServerReflectionResponse );
}
grpc.reflection.v1alpha.ServerReflection is a service:
service ServerReflection {
  rpc ServerReflectionInfo ( stream .grpc.reflection.v1alpha.ServerReflectionRequest ) returns ( stream .grpc.reflection.v1alpha.ServerReflectionResponse );
}

The following example shows a gNMIc ONCE mode request for all supported paths, with the output in flat format:

gnmic -a "IP" --port 9339 --skip-verify subscribe --prefix "/"  --path ""  --target nvos -u admin -p ***** --mode once --format flat 

The following example shows a gNMIc Capabilities request to retrieve the set of capabilities that is supported by the server:

gnmic -a "IP" --port 9339 --skip-verify capabilities -u admin -p *****

Related Information

gNMI Streaming Commands
