DOCA SDK Documentation

HBN Service Troubleshooting


HBN Container Stuck in init-sfs

The HBN container starts as init-sfs and should transition to doca-hbn within 2 minutes, as can be seen using crictl ps. Sometimes, however, it remains stuck as init-sfs.
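
The 2-minute transition can be watched with a small polling loop. The following is a minimal sketch, assuming crictl is on the PATH and that the container name appears as plain text in the crictl ps listing. The function name wait_for_container and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Poll "crictl ps" until a container whose listing matches NAME appears,
# or give up after TIMEOUT seconds (default 120, the 2 minutes cited above).
wait_for_container() {
    local name="$1" timeout="${2:-120}" interval="${3:-5}" waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if crictl ps 2>/dev/null | grep -q -- "$name"; then
            echo "$name is running"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "$name did not appear within ${timeout}s"
    return 1
}
```

Calling wait_for_container doca-hbn returns non-zero if the container is still not listed as doca-hbn after the timeout, which is the stuck condition described here.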

This can happen if the interface p0_if is missing. Run the command ip -br link show dev p0_if on BlueField and inside the container to check whether p0_if is present. If it is missing, make sure the firmware is upgraded to the latest version, then perform a BlueField system-level reset for the new firmware to take effect.
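
The interface check can be wrapped in a helper that reports a clear result. This is a sketch only; the function name check_iface is hypothetical, and it relies solely on the exit status of the ip command named above.

```shell
#!/usr/bin/env bash
# Report whether a network interface exists, using the same
# "ip -br link show dev" check described above.
check_iface() {
    local ifname="$1"
    if ip -br link show dev "$ifname" >/dev/null 2>&1; then
        echo "$ifname: present"
    else
        echo "$ifname: missing"
        return 1
    fi
}
```

Running check_iface p0_if on BlueField and again inside the HBN container shows on which side the interface is absent.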

Host-side PF/VF Down After BlueField Reboot

In general, the host can use any interface manager to manage host interfaces belonging to BlueField. When the host uses an interface manager other than Netplan or NetworkManager, some ports may remain down after BlueField reboot.

Apply the following workaround if interfaces stay down:

  1. Restart openibd: 

    systemctl restart openibd
    


  2. Recreate SR-IOV interfaces if they are needed.

  3. Replay the interface configuration. For example:

    • If using ifupdown2:

        ifreload -a

    • If using Netplan:

        netplan apply
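
The workaround steps can be combined into one host-side script. The sketch below assumes it runs as root on the host. The function names are hypothetical, and the SR-IOV step is left as a placeholder because it depends on the deployment.

```shell
#!/usr/bin/env bash
# Print the command that replays interface configuration, based on
# which of the two tools named above is installed.
pick_replay_cmd() {
    if command -v ifreload >/dev/null 2>&1; then
        echo "ifreload -a"
    elif command -v netplan >/dev/null 2>&1; then
        echo "netplan apply"
    else
        echo "no supported interface manager found" >&2
        return 1
    fi
}

# Apply the workaround: restart openibd, then replay the configuration.
# SR-IOV re-creation (step 2) is site-specific and left as a placeholder.
recover_host_ports() {
    local replay_cmd
    systemctl restart openibd || return 1
    # (recreate SR-IOV interfaces here if they are needed)
    replay_cmd=$(pick_replay_cmd) || return 1
    $replay_cmd
}
```

Detecting the tool with command -v keeps the script usable on hosts with either ifupdown2 or Netplan, without hard-coding one of them.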

BGP Session not Establishing

One of the main causes of a BGP session not getting established is a mismatch in MTU configuration. Make sure the MTU on all interfaces is the same. For example, if BGP is failing on p0, check and verify that there is a matching MTU value for p0, p0_if_r, p0_if, and the remote peer of p0.
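
The local side of the MTU comparison can be scripted by reading the MTU that the Linux kernel exposes under /sys/class/net. This is a minimal sketch with a hypothetical function name, compare_mtu. It must run where the listed interfaces are visible, and the remote peer still has to be checked on the peer itself.

```shell
#!/usr/bin/env bash
# Directory where the kernel exposes per-interface attributes.
NET_SYSFS="${NET_SYSFS:-/sys/class/net}"

# Print the MTU of every named interface; return non-zero if any
# interface is missing or if the MTUs are not all identical.
compare_mtu() {
    local ifname mtu first="" rc=0
    for ifname in "$@"; do
        if [ ! -r "$NET_SYSFS/$ifname/mtu" ]; then
            echo "$ifname: not found"
            rc=1
            continue
        fi
        mtu=$(cat "$NET_SYSFS/$ifname/mtu")
        echo "$ifname: $mtu"
        if [ -z "$first" ]; then
            first="$mtu"
        elif [ "$mtu" != "$first" ]; then
            rc=1
        fi
    done
    return "$rc"
}
```

For the example above, compare_mtu p0 p0_if_r p0_if prints one line per interface and fails if the values differ.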

Generating Support Information

To generate an HBN support dump, run the hbn-support command inside the container. The dump includes the HBN container image version, which is read from /etc/image-version:

root@bf2:/tmp# hbn-support
Please send /var/support/hbn_support_doca-hbn-service-bf2-s15-1-ipmi_20240820_211214.txz to Cumulus support.

The generated dump is available under /var/support in the HBN container and contains any process core dumps and log files. Core files are stored under /var/support/core and are collected by hbn-support. The /var/support directory is also mounted on the BlueField Arm side at /var/lib/hbn/var/support.
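
Because the directory is also mounted on the Arm side, the newest archive can be located there without entering the container. The sketch below assumes the archives keep the .txz suffix shown in the example output. The function name latest_support_dump is hypothetical.

```shell
#!/usr/bin/env bash
# Print the most recently modified .txz archive in the support directory
# (defaults to the Arm-side mount point mentioned above).
latest_support_dump() {
    local dir="${1:-/var/lib/hbn/var/support}"
    ls -t "$dir"/*.txz 2>/dev/null | head -n 1
}
```

The function prints nothing if no archive exists yet.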

For BlueField, the BFB version can be checked from /etc/mlnx-release.

The firmware version can be collected using mlxfwmanager.

A BlueField support dump can be collected using the sos command:

root@bf2:/tmp# sos report -a --all-logs --batch

Example output:

sos report (version 4.8.0)

This command will collect system configuration and diagnostic
information from this Ubuntu system.
...
...
  Finished running plugins

Creating compressed archive...

Your sos report has been generated and saved in:
        /tmp/sosreport-bf2-s15-1-ipmi-2024-08-20-cpdvegw.tar.xz

 Size   19.37MiB
 Owner  root
 sha256 0890a855623a1a2dd5089c9cd6d57d81e71f3805ac06c2d9fc0dab556ccd5ffc

Please send this file to your support representative.

SFC Troubleshooting

To troubleshoot flows going through SFC interfaces, the first step is to disable the nl2doca service in the HBN container:

root@bf2:/tmp# supervisorctl stop nl2doca
nl2doca: stopped

Stopping nl2doca effectively stops hardware offloading and switches to software forwarding. All packets then appear in tcpdump captures on BlueField interfaces.

tcpdump can be run on SF interfaces as well as on VLAN, VXLAN, and uplink interfaces to determine where a packet gets dropped or which path a packet takes.

General nl2doca Troubleshooting

The following steps can be used to make sure the nl2doca daemon is up and running:

  1. Make sure there are no errors in the nl2doca log file at /var/log/hbn/nl2docad.log.

  2. To check the status of the nl2doca daemon under supervisor, run:

    supervisorctl status nl2doca
    


  3. Use ps to check that the actual nl2doca process is running:

    ps -eaf | grep nl2doca
    root          18       1  0 06:31 ?        00:00:00 /bin/bash /usr/bin/nl2doca-docker-start
    root        1437      18  0 06:31 ?        00:05:49 /usr/sbin/nl2docad
    


  4. If the nl2doca process has crashed, its core file should be in /var/support/core/.

  5. Check whether the file /cumulus/nl2docad/run/stats/punt is accessible. If it is not, nl2doca may be stuck and should be restarted:

    supervisorctl restart nl2doca
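
The checks in this list can be run together from inside the HBN container. The following is a sketch under stated assumptions: the function names are hypothetical, and matching on the word "error" is only a guess at how failures appear in the log.

```shell
#!/usr/bin/env bash
NL2DOCA_LOG="${NL2DOCA_LOG:-/var/log/hbn/nl2docad.log}"
PUNT_STATS="${PUNT_STATS:-/cumulus/nl2docad/run/stats/punt}"

# Count log lines containing "error" (case-insensitive). Matching on the
# word "error" is an assumption about how nl2docad reports failures.
count_log_errors() {
    grep -ci 'error' "$NL2DOCA_LOG" 2>/dev/null || true
}

# Succeed if the punt statistics file can be read.
punt_stats_ok() {
    cat "$PUNT_STATS" >/dev/null 2>&1
}

# Run the checks from the list above in order and print a short summary.
nl2doca_health() {
    echo "log lines mentioning errors: $(count_log_errors)"
    supervisorctl status nl2doca
    ps -eaf | grep '[n]l2docad' || echo "nl2docad process not found"
    ls /var/support/core/ 2>/dev/null
    if punt_stats_ok; then
        echo "punt stats: accessible"
    else
        echo "punt stats: not accessible; nl2doca may need a restart"
    fi
}
```

The summary only reports findings. Restarting nl2doca is left as a manual decision.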
    


nl2doca Offload Troubleshooting

If a certain traffic flow does not work as expected, disable nl2doca (i.e., disable hardware offloading):

supervisorctl stop nl2doca

If the traffic starts working with hardware offloading disabled, the problem is an offloading issue. If it does not, use tcpdump on various interfaces to see where the packet gets dropped.
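
The stop, test, and re-enable sequence can be wrapped so that offloading is always turned back on afterwards. This is a minimal sketch. The function name test_without_offload is hypothetical, and it uses the standard supervisorctl start subcommand, which this document does not otherwise show.

```shell
#!/usr/bin/env bash
# Run a command (for example, a connectivity test) with nl2doca stopped,
# then start nl2doca again and return the command's exit status.
test_without_offload() {
    local rc=0
    supervisorctl stop nl2doca
    "$@" || rc=$?
    supervisorctl start nl2doca
    return "$rc"
}
```

For example, test_without_offload ping -c 3 PEER_IP, where PEER_IP is a placeholder for the affected destination. If the test succeeds here but fails with nl2doca running, that points to an offloading issue.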

Offloaded entries can be checked in the following files, which contain the programming status of every IP prefix and MAC address known to the system.

  • Bridge entries are available in the file /cumulus/nl2docad/run/software-tables/17. It includes all the MAC addresses in the system including local and remote MAC addresses.

    Example format:

    - flow-entry: 0xaaab0cef4190
      flow-pattern:
        fid: 112
        dst mac: 00:00:5e:00:01:01
      flow-actions:
        SET VRF: 2
        OUTPUT-PD-PORT: 20(TO_RTR_INTF)
        STATS:
          pkts: 1719
          bytes: 191286
    


  • Router entries are available in the file /cumulus/nl2docad/run/software-tables/18. It includes all the IP prefixes known to the system.

    Example format for an entry with ECMP:

    - flow-entry: 0xaaaada723700
      flow-pattern:
         IPV6: LPM
         VRF: 0
         destination-ip: ::/0
      flow-actions :
         ECMP: 2
         STATS:
            pkts: 0
            bytes: 0

    Example format for an entry without ECMP:

    - flow-entry: 0xaaaada7e1400
      flow-pattern:
         IPV4: LPM
         VRF: 0
         destination-ip: 60.1.0.93/32
      flow-actions :
         SET FID: 200
         SMAC: 00:04:4b:a7:88:00
         DMAC: 00:03:00:08:00:12
         OUTPUT-PD-PORT: 19(TO_BR_INTF)
         STATS:
            pkts: 0
            bytes: 0
    


  • ECMP entries are available in the file /cumulus/nl2docad/run/software-tables/19. It includes all the next hops in the system.

    Example format:

    - ECMP: 2
      ref-count: 2
      num-next-hops: 2
      entries:
      - { index: 0, fid: 4100, src mac: 'b8:ce:f6:99:49:6a', dst mac: '00:02:00:00:00:0a' }
      - { index: 1, fid: 4101, src mac: 'b8:ce:f6:99:49:6b', dst mac: '00:02:00:00:00:0e' }
          
    


To check counters for packets going to the kernel, run:

cat /cumulus/nl2docad/run/stats/punt
PUNT miss pkts:3154 bytes:312326
PUNT miss drop pkts:0 bytes:0
PUNT control pkts:31493 bytes:2853186
PUNT control drop pkts:0 bytes:0
ACL PUNT pkts:68 bytes:7364
ACL drop pkts:0 bytes:0
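
Non-zero drop counters can be picked out of this file automatically. The sketch below assumes the "name pkts:N bytes:M" layout shown above. The function name punt_drops is hypothetical.

```shell
#!/usr/bin/env bash
# Print every punt counter line whose name contains "drop" and whose
# packet count is greater than zero.
punt_drops() {
    local file="${1:-/cumulus/nl2docad/run/stats/punt}"
    awk '/drop/ {
        for (i = 1; i <= NF; i++) {
            if (index($i, "pkts:") == 1 && substr($i, 6) + 0 > 0) {
                print
            }
        }
    }' "$file"
}
```

With the sample counters above, the function prints nothing because every drop counter is zero.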

For a specific type of packet flow, the programming can be checked in the block-specific files.

For example, to check L2 EVPN ENCAP flows for remote MAC 8a:88:d0:b1:92:b1 on port pf0vf0_if, the basic offload flow should look as follows:

RxPort (pf0vf0_if) -> BR (Overlay) -> RTR (Underlay) -> BR (Underlay) -> TxPort (one of the uplinks, p0_if or p1_if, based on the ECMP hash).

Step-by-step procedure:

  1. Navigate to the interface file /cumulus/nl2docad/run/software-tables/20.

  2. Check for the RxPort (pf0vf0_if):

    Interface: pf0vf0_if
        PD PORT: 6
        HW PORT: 16
        NETDEV PORT: 11
        Bridge-id: 61
        Untagged FID: 112
    

    FID 112 is assigned to the receive port.

  3. Check the bridge table file /cumulus/nl2docad/run/software-tables/17 with destination MAC 8a:88:d0:b1:92:b1 and FID 112:

    flow-pattern:
      fid: 112
      dst mac: 8a:88:d0:b1:92:b1
    flow-actions:
      VXLAN ENCAP:
        ENCAP dst ip: 6.0.0.26
        ENCAP vni id: 1000112
      SET VRF: 0
      OUTPUT-PD-PORT: 20(TO_RTR_INTF)
      STATS:
        pkts: 100
        bytes: 10200
    


  4. Check the router table file /cumulus/nl2docad/run/software-tables/18 with destination IP 6.0.0.26 and VRF 0:

    flow-pattern:
      IPV4: LPM
      VRF: 0
      ip dst: 6.0.0.26/32
    flow-actions :
      ECMP: 1
      OUTPUT PD PORT: 2(TO_BR_INTF)
      STATS:
        pkts: 300
        bytes: 44400
    


  5. Check the ECMP table file /cumulus/nl2docad/run/software-tables/19 with ECMP 1:

    - ECMP: 1
      ref-count: 7
      num-next-hops: 2
      entries:
      - { index: 0, fid: 4100, src mac: 'b8:ce:f6:99:49:6a', dst mac: '00:02:00:00:00:2f' }
      - { index: 1, fid: 4115, src mac: 'b8:ce:f6:99:49:6b', dst mac: '00:02:00:00:00:33' }
    


  6. The ECMP hash calculation picks one of these paths for the next-hop rewrite. Check the bridge table file for the corresponding entry (fid=4100, dst mac: 00:02:00:00:00:2f or fid=4115, dst mac: 00:02:00:00:00:33):

    flow-pattern:
      fid: 4100
      dst mac: 00:02:00:00:00:2f
    flow-actions:
      OUTPUT-PD-PORT: 36(p0_if)
      STATS:
        pkts: 1099
        bytes: 162652
    

    This entry shows the packet going out on the uplink.
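
The bridge-table lookups in steps 3 and 6 can be done with a small filter instead of reading the file by eye. This is a sketch that assumes every entry in the file begins with a "- flow-entry:" line, as in the example format shown earlier. The function name find_bridge_entry is hypothetical.

```shell
#!/usr/bin/env bash
# Print every flow entry in FILE whose pattern has the given FID and
# destination MAC. Entries are assumed to start with "- flow-entry:".
find_bridge_entry() {
    local file="$1" fid="$2" mac="$3"
    awk -v fid="$fid" -v mac="$mac" '
        # Print the buffered entry if both fields matched, then reset.
        function emit() {
            if (block != "" && has_fid && has_mac) printf "%s", block
            block = ""; has_fid = 0; has_mac = 0
        }
        $1 == "-" && $2 == "flow-entry:" { emit() }
        {
            block = block $0 "\n"
            if ($1 == "fid:" && $2 == fid) has_fid = 1
            if ($1 == "dst" && $2 == "mac:" && tolower($3) == tolower(mac)) has_mac = 1
        }
        END { emit() }
    ' "$file"
}
```

For step 3 this would be find_bridge_entry /cumulus/nl2docad/run/software-tables/17 112 8a:88:d0:b1:92:b1, and for step 6 the same file with fid 4100 and MAC 00:02:00:00:00:2f.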

NVUE Troubleshooting

To check the status of the NVUE daemon, run:

supervisorctl status nvued

To restart the NVUE daemon, run:

supervisorctl restart nvued
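
The two commands can be combined so that a restart happens only when the daemon is not running. The sketch below relies on supervisorctl reporting the RUNNING state in its status output. The function name ensure_nvued is hypothetical.

```shell
#!/usr/bin/env bash
# Restart nvued only if supervisor does not report it as RUNNING.
ensure_nvued() {
    if supervisorctl status nvued | grep -q RUNNING; then
        echo "nvued is running"
    else
        supervisorctl restart nvued
    fi
}
```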

