Networking Solutions

Technology Preview for K8s Cluster Using NVIDIA DPUs and Host-Based Networking (HBN)


Scope

This technology preview document demonstrates the NVIDIA Host-Based Networking (HBN) service running on BlueField DPUs in a Kubernetes use case.

Abbreviations and Acronyms

Term    Definition                            Term     Definition
BGP     Border Gateway Protocol               LACP     Link Aggregation Control Protocol
CNI     Container Network Interface           LAG      Link Aggregation Group
DOCA    Datacenter-on-a-Chip Architecture     PF       Physical Function
DPU     Data Processing Unit                  SDN      Software-Defined Networking
ECMP    Equal-Cost Multipath                  SR-IOV   Single Root I/O Virtualization
EVPN    Ethernet Virtual Private Network      VF       Virtual Function
HBN     Host-Based Networking                 VXLAN    Virtual Extensible Local Area Network
K8s     Kubernetes



Introduction

The BlueField®-2 data processing unit (DPU) provides innovative offload, acceleration, security, and efficiency in every host.

BlueField-2 combines the power of ConnectX®-6 with programmable Arm cores and hardware acceleration engines for software-defined storage, networking, security, and management workloads.

NVIDIA HBN is a service running on the DPU that simplifies network provisioning by terminating the Ethernet layer 2 network in the DPU. This allows the physical network to become a "plug-and-play", BGP-managed layer 3 network.

With HBN, the workload servers are connected to the physical switches over router ports using unnumbered BGP. This configuration is automatic, does not require any IP subnet or address allocation for the underlay network, and provides built-in active-active high availability and load balancing based on ECMP.

The DPUs in the servers act as virtual tunnel endpoints (VTEPs) for the host network, providing a stretched layer 2 between all the nodes in the cluster over VXLAN using EVPN technology.

The configuration of HBN on the DPUs is almost identical to that of physical NVIDIA switches, as it uses the same NVUE CLI commands or NVUE programmatic API.

At the time of publication, HBN has some throughput limitations that will be addressed in upcoming releases. RoCE support in HBN will also be added in a future release.

Solution Architecture

Logical Design

HBN.png


Our deployment uses HBN to create two stretched layer 2 networks: one for the primary network (Calico running in IP-in-IP configuration) using VLAN 60, and one for a secondary SR-IOV network using VLAN 50.

The primary network runs over the physical function (PF) of the DPU, with the pods attached through veth-based kernel interfaces (virtual Ethernet interfaces) named eth0. The secondary network allocates a virtual function (VF) to each pod, named net1, out of a pool of up to 16 VFs supported by HBN per DPU. This network can utilize the DPU's hardware acceleration capabilities, and RoCE support over it is planned for a future release.

The external access for the deployment (i.e., lab network access and Internet connectivity) is achieved through a gateway node connected to a leaf and acting as the default gateway for the management network 172.60.0.0/24 on VLAN 60.

The traffic that traverses between the servers through the leaf switches (with the exception of the gateway node) uses layer 3 only (packets routed between BGP neighbors) and relies on ECMP for high availability and load balancing.

The gateway node used in this setup uses a VTEP in the leaf.

Please note that the configuration and deployment of the gateway node used to provide external access in this example is not included in the scope of this document.

Please note that the gateway solution used in this example does not support high availability. In a production environment, it is recommended to use a high-availability gateway deployment (i.e., one connected to more than a single leaf switch).

The presented deployment, which represents a single rack, can easily be scaled out by adding switches to create a large-scale, multi-rack deployment that can accommodate hundreds of nodes.

The main advantage of using HBN in the deployment is that very little switch configuration is required: Each switch added needs a unique /32 address and possibly an AS number (leaf switches), but otherwise no additional configuration is needed.

The same applies for the configuration of HBN on each DPU, as it requires only a unique /32 address and a unique AS number.

BGP peering uses the "unnumbered" mode, which discovers the neighbor on each link automatically and peers over the IPv6 link-local addresses of that link, so no per-link subnets need to be allocated. The result is a kind of "plug-and-play" connectivity.
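As a rough illustration of how little per-device state this scheme needs, the sketch below derives the loopback address and AS number of the HBN instance on a given node, following the numbering used in this setup (10.10.10.11 and 65111 for the master, .12/65112 and .13/65113 for the workers). The function name and the node index are assumptions made for this example, and extending the pattern beyond the three nodes shown is likewise an assumption.

```shell
# Hypothetical helper: derive per-DPU HBN parameters from a node index.
# Index 1 = master, 2 = Worker1, 3 = Worker2 (numbering assumed for this sketch).
hbn_params() (
    idx=$1
    # Only single-digit indexes keep the two-digit suffix pattern valid.
    case $idx in
        [1-9]) ;;
        *) echo "node index must be 1-9" >&2; exit 1 ;;
    esac
    echo "10.10.10.1${idx} 6511${idx}"
)

for i in 1 2 3; do
    # Split "<loopback> <asn>" into $1 and $2
    set -- $(hbn_params "$i")
    echo "node ${i}: loopback $1/32, AS $2"
done
```

Everything else in the HBN configuration is identical across the DPUs, so these two values (the loopback also serves as router ID and VXLAN source address) are all that has to be planned per node.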


Used IP addresses:

Below are the IP addresses set on various interfaces in the setup.

Please note that the tmfifo_net0 interfaces are virtual interfaces created automatically on both sides of the RShim connection between the host and the DPU.

We can use these interfaces to install DPU software from the host, as an alternative to the out-of-band 1GbE physical port on the DPU.


interfaces.png


Device    Description                            Interface      IP Address
Master    K8s master node and deployment node    ens2f0np0      172.60.0.11/24
                                                 tmfifo_net0    192.168.100.1/30
Worker1   K8s worker node 1                      ens2f0np0      172.60.0.12/24
                                                 tmfifo_net0    192.168.100.1/30
Worker2   K8s worker node 2                      ens2f0np0      172.60.0.13/24
                                                 tmfifo_net0    192.168.100.1/30
DPU       Any of the used DPUs                   tmfifo_net0    192.168.100.2/30


Switch and HBN configuration and connectivity:

Switch         Description                   Router ID          AS Number   Links
Leaf1          Leaf (TOR) switch 1           10.10.10.1/32      65101       To DPUs: swp1-3; to spines: swp31-32; to gateway node: swp30
Leaf2          Leaf (TOR) switch 2           10.10.10.2/32      65102       To DPUs: swp1-3; to spines: swp31-32
Spine1         Spine switch 1                10.10.10.101/32    65199       To leafs: swp1-2
Spine2         Spine switch 2                10.10.10.102/32    65199       To leafs: swp1-2
Master HBN     HBN on the master node DPU    10.10.10.11/32     65111       To leafs: p0_sf, p1_sf
Worker 1 HBN   HBN on Worker1 node DPU       10.10.10.12/32     65112       To leafs: p0_sf, p1_sf
Worker 2 HBN   HBN on Worker2 node DPU       10.10.10.13/32     65113       To leafs: p0_sf, p1_sf


VTEPs configuration:

VTEP         Interfaces                VLAN   VNI
Leaf1        swp30                     60     10060
HBN (Any)    pf0vf0_sf - pf0vf15_sf    50     10050
             pf0hpf_sf                 60     10060
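Both stretched networks in this setup follow the same VLAN-to-VNI numbering (VLAN 50 maps to VNI 10050, VLAN 60 to VNI 10060). A minimal sketch, assuming that the offset of 10000 is simply a local convention, prints the matching NVUE bridge commands. The function name is hypothetical.

```shell
# Hypothetical helper: print the NVUE VLAN-to-VNI mapping command for a VLAN,
# assuming the "VNI = 10000 + VLAN ID" convention seen in the VTEP table.
vlan_vni_cmd() (
    vlan=$1
    vni=$((10000 + vlan))
    echo "nv set bridge domain br_default vlan ${vlan} vni ${vni}"
)

for vlan in 50 60; do
    vlan_vni_cmd "$vlan"
done
```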

Software Stack Components

hbn-sw-stack.png

Bill of Materials

bom.png

Deployment and Configuration

Configuring the Network Switches

  • The NVIDIA® SN3700 switches are installed with NVIDIA® Cumulus® Linux 5.3 OS

  • Each node is connected to two TOR switches over two 100Gb/s router ports (interfaces swp1-3 on the TOR switches)

  • The TOR switches are connected to the two spine switches using router ports (interfaces swp31-32 on the TORs and swp1-2 on the spines)

  • In addition, a gateway node is connected through a VTEP to one of the TORs, providing external network access and Internet connectivity on the management network (VLAN 60, interface swp30 on Leaf1)

Configure the switches using the following NVUE commands:

Leaf1 Console

nv set interface lo ip address 10.10.10.1/32
nv set interface swp30,swp31-32,swp1-3
nv set interface swp30 link mtu 9000
nv set interface swp30 bridge domain br_default
nv set interface swp30 bridge domain br_default access 60
nv set bridge domain br_default vlan 60
nv set bridge domain br_default vlan 60 vni 10060
nv set nve vxlan source address 10.10.10.1
nv set nve vxlan arp-nd-suppress on
nv set system global anycast-mac 44:38:39:BE:EF:AA
nv set evpn enable on
nv set router bgp autonomous-system 65101
nv set router bgp router-id 10.10.10.1
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31 peer-group underlay
nv set vrf default router bgp neighbor swp32 peer-group underlay
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group hbn remote-as external
nv set vrf default router bgp neighbor swp1 peer-group hbn
nv set vrf default router bgp neighbor swp2 peer-group hbn
nv set vrf default router bgp neighbor swp3 peer-group hbn
nv set vrf default router bgp peer-group hbn address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply -y

Leaf2 Console

nv set interface lo ip address 10.10.10.2/32
nv set interface swp31-32,swp1-3
nv set router bgp autonomous-system 65102
nv set router bgp router-id 10.10.10.2
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31 peer-group underlay
nv set vrf default router bgp neighbor swp32 peer-group underlay
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group hbn remote-as external
nv set vrf default router bgp neighbor swp1 peer-group hbn
nv set vrf default router bgp neighbor swp2 peer-group hbn
nv set vrf default router bgp neighbor swp3 peer-group hbn
nv set vrf default router bgp peer-group hbn address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply -y

Spine1 Console

nv set interface lo ip address 10.10.10.101/32
nv set interface swp1-2
nv set router bgp autonomous-system 65199
nv set router bgp router-id 10.10.10.101
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp1 peer-group underlay
nv set vrf default router bgp neighbor swp2 peer-group underlay
nv set vrf default router bgp address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply -y

Spine2 Console

nv set interface lo ip address 10.10.10.102/32
nv set interface swp1-2
nv set router bgp autonomous-system 65199
nv set router bgp router-id 10.10.10.102
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp1 peer-group underlay
nv set vrf default router bgp neighbor swp2 peer-group underlay
nv set vrf default router bgp address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply -y
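
The per-port neighbor lines are the only part of a leaf configuration that grows with the number of attached DPUs. As a sketch (the function name and its arguments are assumptions for illustration), a small loop can print them for a contiguous port range so they can be reviewed before being pasted into the switch console:

```shell
# Hypothetical helper: print BGP neighbor commands for ports swp<first>..swp<last>
# in the given peer group. It only prints text; nothing is applied to the switch.
bgp_neighbor_cmds() (
    group=$1 first=$2 last=$3
    port=$first
    while [ "$port" -le "$last" ]; do
        echo "nv set vrf default router bgp neighbor swp${port} peer-group ${group}"
        port=$((port + 1))
    done
)

# DPU-facing and spine-facing ports of a leaf in this setup
bgp_neighbor_cmds hbn 1 3
bgp_neighbor_cmds underlay 31 32
```

Adding a fourth server to the rack would then only mean widening the first range to swp1-4 on both leafs.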

Host Preparation

  1. Install Ubuntu 22.04 on the servers and make sure it is up-to-date:

    Server Console

    $ sudo apt update
    $ sudo apt upgrade
    $ sudo reboot
    
  2. To allow password-less sudo, add the local user to the sudoers file on each host:

    Server Console

    $ sudo vi /etc/sudoers
    
    #includedir /etc/sudoers.d
    #K8s cluster deployment user with sudo privileges without password
    user ALL=(ALL) NOPASSWD:ALL
    
  3. On the deployment node, generate an SSH key and copy it to each node. For example:

    Deployment Node Console (Master node is used)

    $ ssh-keygen
    
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/user/.ssh/id_rsa):
    Created directory '/home/user/.ssh'.
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/user/.ssh/id_rsa.
    Your public key has been saved in /home/user/.ssh/id_rsa.pub.
    The key fingerprint is:
    SHA256:PaZkvxV4K/h8q32zPWdZhG1VS0DSisAlehXVuiseLgA user@master
    The key's randomart image is:
    +---[RSA 2048]----+
    |      ...+oo+o..o|
    |      .oo   .o. o|
    |     . .. . o  +.|
    |   E  .  o +  . +|
    |    .   S = +  o |
    |     . o = + o  .|
    |      . o.o +   o|
    |       ..+.*. o+o|
    |        oo*ooo.++|
    +----[SHA256]-----+
    
    $ ssh-copy-id -i ~/.ssh/id_rsa user@10.10.0.1
    
    /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/user/.ssh/id_rsa.pub"
    The authenticity of host '10.10.0.1 (10.10.0.1)' can't be established.
    ECDSA key fingerprint is SHA256:uyglY5g0CgPNGDm+XKuSkFAbx0RLaPijpktANgXRlD8.
    Are you sure you want to continue connecting (yes/no)? yes
    /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
    /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
    user@10.10.0.1's password:
    
    Number of key(s) added: 1
    
  4. Now try logging into the machine, and verify that you can log in without a password.
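
With several nodes, steps 3 and 4 can be repeated in a loop. The sketch below is one way to do it, assuming the same user name on every node. The function names and the DRY_RUN switch are hypothetical, and the node addresses are passed in by the caller. The BatchMode=yes option makes ssh fail instead of prompting for a password, so the final check succeeds only if key-based login really works.

```shell
# run: execute a command, or just print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# distribute_key USER HOST...: copy the public key to each host, then verify
# that a non-interactive login works. Stops at the first failure.
distribute_key() (
    user=$1
    shift
    for host in "$@"; do
        run ssh-copy-id -i "$HOME/.ssh/id_rsa" "${user}@${host}" || exit 1
        run ssh -o BatchMode=yes "${user}@${host}" true || exit 1
    done
)

# Example: print the commands for one node without contacting it
DRY_RUN=1 distribute_key user 10.10.0.1
```

Without DRY_RUN set, the same call performs the copy and the login check for real.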

Deploying the BFB

  1. Download the DOCA host drivers package from the DOCA webpage by scrolling down to the bottom of the page and selecting the relevant package. It installs all the software needed to access and provision the DPU from the host.

    doca-host.png

  2. Install the DOCA host drivers package you downloaded:

    Server Console

    # wget https://www.mellanox.com/downloads/DOCA/DOCA_v1.5.1/doca-host-repo-ubuntu2204_1.5.1-0.1.8.1.5.1007.1.5.8.1.1.2.1_amd64.deb
    # dpkg -i doca-host-repo-ubuntu2204_1.5.1-0.1.8.1.5.1007.1.5.8.1.1.2.1_amd64.deb
    # apt-get update
    # apt install doca-runtime
    # apt install doca-tools
    
  3. Download BFB 4.0.3 with DOCA 2.0.2v2:
    bfb.png

  4. Create the config file bf.cfg:

    Server Console

    # echo 'ENABLE_SFC_HBN=yes' > bf.cfg
    
  5. Install the BFB (the DPU's operating system image):

    Server Console

    # bfb-install -c bf.cfg -r rshim0 -b DOCA_2.0.2_BSP_4.0.3_Ubuntu_22.04-10.23-04.prod.bfb
    
  6. After the installation is complete, perform a full power cycle of the servers, allowing the DPU firmware to reboot and upgrade if needed.

  7. After the servers return, assign IP addresses to the first interface of the DPU on each host using netplan. This is an example for the master node. The same should be done for the workers (172.60.0.12 and 172.60.0.13). Notice the default route to the gateway node (172.60.0.254) to provide external/Internet connectivity:

    Server Console

    # vi /etc/netplan/00-installer-config.yaml
    
    network:
      ethernets:
        eno1:
          dhcp4: true
        eno2:
          dhcp4: true
        eno3:
          dhcp4: true
        eno4:
          dhcp4: true
        ens2f0np0:
          dhcp4: false
          mtu: 9000
          addresses: [172.60.0.11/24]
          nameservers:
            addresses: [8.8.8.8]
            search: []
          routes:
            - to: default
              via: 172.60.0.254
        ens2f1np1:
          dhcp4: false
      version: 2
    
  8. Apply the settings:

    Server Console

    # netplan apply
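
Only the host address differs between the three netplan files. As an illustration (the function name is an assumption, and the output still has to be merged into the full file by hand), the high-speed interface stanza from step 7 can be produced from that one value:

```shell
# Hypothetical helper: print the ens2f0np0 stanza for a given host address.
# The MTU, DNS server, and gateway are the values used in this setup.
netplan_stanza() (
    addr=$1
    cat <<EOF
    ens2f0np0:
      dhcp4: false
      mtu: 9000
      addresses: [${addr}/24]
      nameservers:
        addresses: [8.8.8.8]
      routes:
        - to: default
          via: 172.60.0.254
EOF
)

netplan_stanza 172.60.0.12   # Worker1
```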
    

Installing DOCA Container Configs Package

Log into the DPU in one of the following ways:

  • Using the OOB management interface (if it is connected and has obtained an IP address over DHCP)

  • Using the built-in network interface over RShim (tmfifo_net0)

This example uses the RShim option:

  1. Use the following command to assign an IP address to the tmfifo_net0 interface:

    Server Console

    # ip address add 192.168.100.1/30 dev tmfifo_net0
    

    When you first log into the DPU, use ubuntu/ubuntu as credentials.

  2. You will be asked to modify the password:

    Server Console

    # ssh ubuntu@192.168.100.2
    The authenticity of host '192.168.100.2 (192.168.100.2)' can't be established.
    ED25519 key fingerprint is SHA256:S2gzl4QzVUY0g3GRsl9VLi3tYHQdIe7oQ+8I8tr95c4.
    This key is not known by any other names
    Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
    Warning: Permanently added '192.168.100.2' (ED25519) to the list of known hosts.
    ubuntu@192.168.100.2's password: 
    You are required to change your password immediately (administrator enforced)
    Welcome to Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-1049-bluefield aarch64)
    
     * Documentation:  https://help.ubuntu.com
     * Management:     https://landscape.canonical.com
     * Support:        https://ubuntu.com/advantage
    
      System information as of Mon Jan 16 12:53:57 UTC 2023
    
      System load:                  0.09
      Usage of /:                   6.7% of 58.00GB
      Memory usage:                 13%
      Swap usage:                   0%
      Processes:                    485
      Users logged in:              0
      IPv4 address for docker0:     172.17.0.1
      IPv4 address for mgmt:        127.0.0.1
      IPv6 address for mgmt:        ::1
      IPv4 address for oob_net0:    10.10.7.37
      IPv4 address for tmfifo_net0: 192.168.100.2
    
    0 updates can be applied immediately.
    
    
    
    The programs included with the Ubuntu system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.
    
    Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
    applicable law.
    
    WARNING: Your password has expired.
    You must change your password now and login again!
    Changing password for ubuntu.
    Current password: 
    New password: 
    Retype new password: 
    passwd: password updated successfully
    Connection to 192.168.100.2 closed.
    
  3. Once you are logged into the DPU, download the DOCA container configs package and install it. This package includes the necessary scripts and configurations for activating the HBN container on the DPU:

    DPU Console

    # wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/doca/doca_container_configs/versions/2.0.2v2/zip -O doca_container_configs_2.0.2v2.zip --no-check-certificate
    # unzip -o doca_container_configs_2.0.2v2.zip -d doca_container_configs_2.0.2v2
    # cd doca_container_configs_2.0.2v2/scripts/doca_hbn/1.4.0
    # chmod +x hbn-dpu-setup.sh
    # ./hbn-dpu-setup.sh
    # cd ../../../configs/2.0.2/
    # cp doca_hbn.yaml /etc/kubelet.d/
    

    You will not be able to pull the zip file from the Internet if the DPU's out-of-band management interface is not used in your setup.

    In this case, please pull it to the host and use scp to copy it to the DPU over the RShim's network interface (tmfifo_net0).

    On the host console:

    # wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/doca/doca_container_configs/versions/2.0.2v2/zip -O doca_container_configs_2.0.2v2.zip --no-check-certificate
    # scp  doca_container_configs_2.0.2v2.zip ubuntu@192.168.100.2:/home/ubuntu/
    
  4. Reboot the DPU:

    DPU Console

    # reboot
    

Configuring HBN

After the DPU returns, it will run the "doca-hbn" container.

  1. To find its ID (will appear in the first column under CONTAINER), run:

    DPU Console

    # crictl ps
    
  2. Connect to that container:

    DPU Console

    # crictl exec -it <container-id> bash
    
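  3. Use the following NVUE commands to configure HBN. This must be done on each DPU. The three blocks below are for the master, Worker1, and Worker2 DPUs, in that order, and differ only in the loopback address, VXLAN source address, AS number, and router ID: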

    nv set bridge domain br_default vlan 50 vni 10050
    nv set bridge domain br_default vlan 60 vni 10060
    nv set evpn enable on
    nv set interface lo ip address 10.10.10.11/32
    nv set interface p0_sf,p1_sf link state up
    nv set interface p0_sf,p1_sf,pf0hpf_sf,pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf type swp
    nv set interface pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf bridge domain br_default access 50
    nv set interface pf0hpf_sf,pf1hpf_sf bridge domain br_default access 60
    nv set nve vxlan arp-nd-suppress on
    nv set nve vxlan enable on
    nv set nve vxlan source address 10.10.10.11
    nv set vrf default router bgp peer-group underlay remote-as external
    nv set vrf default router bgp neighbor p0_sf peer-group underlay
    nv set vrf default router bgp neighbor p1_sf peer-group underlay
    nv set router bgp autonomous-system 65111
    nv set router bgp enable on
    nv set router bgp router-id 10.10.10.11
    nv set router policy route-map LOOPBACK rule 1 action permit
    nv set router policy route-map LOOPBACK rule 1 match interface lo
    nv set system global anycast-mac 44:38:39:BE:EF:AA
    nv set vrf default router bgp address-family ipv4-unicast enable on
    nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
    nv set vrf default router bgp address-family l2vpn-evpn enable on
    nv set vrf default router bgp enable on
    nv set vrf default router bgp neighbor p0_sf type unnumbered
    nv set vrf default router bgp neighbor p1_sf type unnumbered
    nv set vrf default router bgp path-selection multipath aspath-ignore on
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast policy outbound route-map LOOPBACK
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast soft-reconfiguration on
    nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
    nv config apply -y
    
    nv set bridge domain br_default vlan 50 vni 10050
    nv set bridge domain br_default vlan 60 vni 10060
    nv set evpn enable on
    nv set interface lo ip address 10.10.10.12/32
    nv set interface p0_sf,p1_sf link state up
    nv set interface p0_sf,p1_sf,pf0hpf_sf,pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf type swp
    nv set interface pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf bridge domain br_default access 50
    nv set interface pf0hpf_sf,pf1hpf_sf bridge domain br_default access 60
    nv set nve vxlan arp-nd-suppress on
    nv set nve vxlan enable on
    nv set nve vxlan source address 10.10.10.12
    nv set vrf default router bgp peer-group underlay remote-as external
    nv set vrf default router bgp neighbor p0_sf peer-group underlay
    nv set vrf default router bgp neighbor p1_sf peer-group underlay
    nv set router bgp autonomous-system 65112
    nv set router bgp enable on
    nv set router bgp router-id 10.10.10.12
    nv set router policy route-map LOOPBACK rule 1 action permit
    nv set router policy route-map LOOPBACK rule 1 match interface lo
    nv set system global anycast-mac 44:38:39:BE:EF:AA
    nv set vrf default router bgp address-family ipv4-unicast enable on
    nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
    nv set vrf default router bgp address-family l2vpn-evpn enable on
    nv set vrf default router bgp enable on
    nv set vrf default router bgp neighbor p0_sf type unnumbered
    nv set vrf default router bgp neighbor p1_sf type unnumbered
    nv set vrf default router bgp path-selection multipath aspath-ignore on
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast policy outbound route-map LOOPBACK
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast soft-reconfiguration on
    nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
    nv config apply -y
    
    nv set bridge domain br_default vlan 50 vni 10050
    nv set bridge domain br_default vlan 60 vni 10060
    nv set evpn enable on
    nv set interface lo ip address 10.10.10.13/32
    nv set interface p0_sf,p1_sf link state up
    nv set interface p0_sf,p1_sf,pf0hpf_sf,pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf type swp
    nv set interface pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf bridge domain br_default access 50
    nv set interface pf0hpf_sf,pf1hpf_sf bridge domain br_default access 60
    nv set nve vxlan arp-nd-suppress on
    nv set nve vxlan enable on
    nv set nve vxlan source address 10.10.10.13
    nv set vrf default router bgp peer-group underlay remote-as external
    nv set vrf default router bgp neighbor p0_sf peer-group underlay
    nv set vrf default router bgp neighbor p1_sf peer-group underlay
    nv set router bgp autonomous-system 65113
    nv set router bgp enable on
    nv set router bgp router-id 10.10.10.13
    nv set router policy route-map LOOPBACK rule 1 action permit
    nv set router policy route-map LOOPBACK rule 1 match interface lo
    nv set system global anycast-mac 44:38:39:BE:EF:AA
    nv set vrf default router bgp address-family ipv4-unicast enable on
    nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
    nv set vrf default router bgp address-family l2vpn-evpn enable on
    nv set vrf default router bgp enable on
    nv set vrf default router bgp neighbor p0_sf type unnumbered
    nv set vrf default router bgp neighbor p1_sf type unnumbered
    nv set vrf default router bgp path-selection multipath aspath-ignore on
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast policy outbound route-map LOOPBACK
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast soft-reconfiguration on
    nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
    nv config apply -y
    
  4. You can exit back to the host and validate the connectivity between the hosts by pinging them over the high-speed interface:

    Server Console

    $ ping 172.60.0.11
    $ ping 172.60.0.12
    $ ping 172.60.0.13
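
The comma-separated VF representor lists in step 3 are long and easy to mistype. A minimal sketch that builds such a list for a given VF count is shown below, assuming the pf0vf<N>_sf naming used in the HBN configuration. The function name is hypothetical, and the list comes out in numeric order rather than the order used in step 3; that should not matter for a set of interfaces, but it is an assumption worth checking.

```shell
# Hypothetical helper: print pf0vf0_sf,...,pf0vf<count-1>_sf as one comma-separated list.
vf_sf_list() (
    count=$1
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Prepend a comma only when the list already has an entry
        list="${list:+${list},}pf0vf${i}_sf"
        i=$((i + 1))
    done
    echo "$list"
)

# The HBN configuration in step 3 bridges 14 VF representors
# (pf0vf0_sf to pf0vf13_sf) on VLAN 50.
echo "nv set interface $(vf_sf_list 14) bridge domain br_default access 50"
```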
    

Deploying Kubernetes

Now we will deploy Kubernetes on the hosts using kubespray.

  1. On the deployment node (master node can also be used as the deployment node), run:

    Deployment Node Console (Master node is used)

    $ cd ~
    $ sudo apt -y install python3-pip jq
    $ wget https://github.com/kubernetes-sigs/kubespray/archive/refs/tags/v2.22.0.tar.gz
    $ tar -zxf v2.22.0.tar.gz
    $ cd kubespray-2.22.0
    $ sudo pip3 install -r requirements.txt
    
  2. Create the initial cluster configuration for three nodes. We use the addresses we assigned to the DPU interface. The high-speed network is used for both the primary and the secondary network of our Kubernetes cluster:

    Deployment Node Console (Master Node is used)

    $ cp -rfp inventory/sample inventory/mycluster
    $ declare -a IPS=(172.60.0.11 172.60.0.12 172.60.0.13)
    $ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
    
  3. Edit the hosts list as follows:

    Deployment Node Console (Master Node is used)

    $ vi inventory/mycluster/hosts.yaml
    
    all:
      hosts:
        master:
          ansible_host: 172.60.0.11
          ip: 172.60.0.11
          access_ip: 172.60.0.11
        worker1:
          ansible_host: 172.60.0.12
          ip: 172.60.0.12
          access_ip: 172.60.0.12
        worker2:
          ansible_host: 172.60.0.13
          ip: 172.60.0.13
          access_ip: 172.60.0.13
      children:
        kube_control_plane:
          hosts:
            master:
        kube_node:
          hosts:
            worker1:
            worker2:
        etcd:
          hosts:
            master:
        k8s_cluster:
          children:
            kube_control_plane:
            kube_node:
        calico_rr:
          hosts: {}
    
  4. Edit the Calico configuration to use IP-in-IP:

    Deployment Node Console (Master Node is used)

    $ vi inventory/mycluster/group_vars/k8s_cluster/k8s-net-calico.yml  
    
    calico_network_backend: "bird"  
    calico_ipip_mode: "Always"  
    calico_vxlan_mode: "Never" 
    
  5. Deploy the cluster:

    Deployment Node Console (Master Node is used)

    $ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
    
  6. Label the worker nodes as workers:

    Master Node Console

    # kubectl label nodes worker1 node-role.kubernetes.io/worker=
    # kubectl label nodes worker2 node-role.kubernetes.io/worker=
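
The download, extract, and cd commands in step 1 must all refer to the same kubespray release. One way to keep them in sync is to derive every name from a single variable. The variable names below are assumptions for this sketch, and the commands are only printed so they can be checked before running:

```shell
# Single source of truth for the kubespray release (tag format: vX.Y.Z).
KUBESPRAY_TAG="v2.22.0"

# The tag archive is named <tag>.tar.gz and unpacks to kubespray-<version>,
# where <version> is the tag without the leading "v".
KUBESPRAY_TARBALL="${KUBESPRAY_TAG}.tar.gz"
KUBESPRAY_DIR="kubespray-${KUBESPRAY_TAG#v}"
KUBESPRAY_URL="https://github.com/kubernetes-sigs/kubespray/archive/refs/tags/${KUBESPRAY_TARBALL}"

echo "wget ${KUBESPRAY_URL}"
echo "tar -zxf ${KUBESPRAY_TARBALL}"
echo "cd ${KUBESPRAY_DIR}"
```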
    

Installing the NVIDIA Network Operator

  1. On the master node, install helm and add the Mellanox repo:

    Master Node Console

    # snap install helm --classic
    # helm repo add mellanox https://mellanox.github.io/network-operator && helm repo update
    
  2. Create the values.yaml file:

    # nano values.yaml  
    
    nfd:
      enabled: true
    
    sriovNetworkOperator:
      enabled: true
    
    deployCR: true
    ofedDriver:
      deploy: false
    
    nvPeerDriver:
      deploy: false
    
    rdmaSharedDevicePlugin:
      deploy: false
    
    sriovDevicePlugin:
      deploy: false
    
    secondaryNetwork:
      deploy: true
    
  3. Install the operator:

    Master Node Console

    # helm install -f ./values.yaml -n network-operator --create-namespace --wait mellanox/network-operator --generate-name
    
  4. Validate the installation:

    Master Node Console

    # kubectl -n network-operator get pods
    
  5. Create the policy.yaml file:

    # nano policy.yaml  
    
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: mlnxnic
      namespace: network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/custom-rdma.capable: "true"   
      resourceName: mlnxnet
      priority: 99
      mtu: 9000
      numVfs: 16
      nicSelector:
        pfNames: [ "ens2f0np0" ]
      deviceType: netdevice
      isRdma: true
    
  6. Apply it: 

    Master Node Console

    # kubectl apply -f policy.yaml
    
  7. Create the network.yaml file: 

    # nano network.yaml  
    
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: mlnx-network
      namespace: network-operator
    spec:
      ipam: |
        {
          "datastore": "kubernetes",
          "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
          },
          "log_file": "/tmp/whereabouts.log",
          "log_level": "debug",
          "type": "whereabouts",
          "range": "172.50.0.0/24"
        }
      networkNamespace: default
      resourceName: mlnxnet
    
  8. Apply it: 

    Master Node Console

    # kubectl apply -f network.yaml
    
  9. Wait a few minutes for the configuration to complete and then validate the network and its resources:

    Master Node Console

    # kubectl get network-attachment-definitions.k8s.cni.cncf.io
    # kubectl get node worker1 -o json | jq '.status.allocatable' 
    # kubectl get node worker2 -o json | jq '.status.allocatable'
    

    It may be necessary to perform a full power cycle on the servers if firmware configuration changes are required.
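Rather than waiting a fixed time in step 9, the check can be retried until it passes. The helper below is a generic sketch (the names retry and vfs_ready and their arguments are assumptions). It retries any command a limited number of times, and the example check looks for the nvidia.com/mlnxnet resource that the test pods request later in this document.

```shell
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, at most ATTEMPTS times,
# sleeping DELAY seconds between tries. Returns 0 on success, 1 if all tries fail.
retry() (
    attempts=$1 delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            exit 0
        fi
        # No need to sleep after the last attempt
        [ "$n" -lt "$attempts" ] && sleep "$delay"
        n=$((n + 1))
    done
    exit 1
)

# Hypothetical check: succeeds once worker1 advertises the SR-IOV resource.
vfs_ready() {
    kubectl get node worker1 -o json |
        jq -e '.status.allocatable["nvidia.com/mlnxnet"] != null' >/dev/null
}

# Usage on the master node (30 tries, 10 seconds apart):
#   retry 30 10 vfs_ready
```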

Validating the Deployment

Running Test Daemon-Set

On the master node:

  1. Create the following yaml file:

    # nano testds.yaml
    
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: example-daemon
      labels:
        app: example-ds
    spec:
      selector:
        matchLabels:
          app: example-ds
      template:
        metadata:
          labels:
            app: example-ds
          annotations:
            k8s.v1.cni.cncf.io/networks: mlnx-network
        spec:
          containers:
          - image: ubuntu
            name: example-ds-pod
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                memory: 16Gi
                cpu: 8
                nvidia.com/mlnxnet: '1'
              requests:
                memory: 16Gi
                cpu: 8
                nvidia.com/mlnxnet: '1'
            command:
            - sleep
            - inf
    
  2. Then create the DaemonSet:

    Master Node Console

    # kubectl create -f testds.yaml
    

Running TCP Throughput Test

Now we will run a TCP throughput test on top of the high-speed secondary network between our pods running on different worker nodes.

Open an additional terminal window to the master node and connect to each of the pods:

  1. Check the names of the pods:

    Master Node Console

    # kubectl get pods
    
  2. Connect to the desired pod:

    Master Node Console

    # kubectl exec -it <pod-name> -- bash
    
  3. On the first pod, install iproute2 and iperf for the TCP test over the high-speed secondary network:

    First Pod Console

    # apt update
    # apt install iproute2 iperf -y
    
  4. Check the IP address of the first pod on the high speed secondary network net1:

    First Pod Console

    # ip addr
    
  5. Run iperf on it:

    First Pod Console

    # iperf -s
    
  6. On the second pod, install iperf and run it in client mode, connecting to the first pod:

    Second Pod Console

    # apt update
    # apt install iperf -y
    # iperf -c <server-address> -P 10
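
To avoid reading the server address off the ip addr output by eye, it can be extracted from the one-line (-o) output format of the ip command. This is a sketch, assuming the secondary interface is named net1 as described in the logical design. The function name is hypothetical.

```shell
# Hypothetical helper: read one line of "ip -o -4 addr show" output on stdin and
# print the IPv4 address without its prefix length.
first_ipv4() (
    # Fields: index, interface, "inet", address/prefix, remainder
    read -r _idx _ifname _family cidr _rest || exit 1
    echo "${cidr%%/*}"
)

# In the first pod:   ip -o -4 addr show dev net1 | first_ipv4
# Sample line for illustration (address made up from the 172.50.0.0/24 range):
echo '3: net1    inet 172.50.0.1/24 brd 172.50.0.255 scope global net1' | first_ipv4
```

The printed address is the value to pass as <server-address> to iperf -c on the second pod.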
    


    Done!

Authors


SD.jpg

Shachar Dor

Shachar Dor joined the Solutions Lab team after working more than ten years as a software architect at NVIDIA Networking (previously Mellanox Technologies), where he was responsible for the architecture of network management products and solutions. Shachar's focus is on networking technologies, especially around fabric bring-up, configuration, monitoring, and life-cycle management. 

Shachar has a strong background in software architecture, design, and programming through his work on multiple projects and technologies also prior to joining the company. 







