RDG for Canonical Charmed OpenStack with NVIDIA Networking and Accelerated OVN for High Performance Workloads

Created on Sep 7, 2022

Scope

This article is covering the full design, scale considerations and deployment steps of the Canonical Charmed OpenStack cloud solution based on Ubuntu 22.04 with inbox network drivers and OpenStack Yoga packages over highly available 100GbE NVIDIA networking with OVN hardware acceleration.

Abbreviations and Acronyms

Term	Definition	Term	Definition
AI	Artificial Intelligence	ML2	Modular Layer 2 Openstack Plugin
ASAP²	Accelerated Switching and Packet Processing®	MLAG	Multi-Chassis Link Aggregation
BGP	Border Gateway Protocol	MLNX_OFED	NVIDIA Mellanox OpenFabrics Enterprise Distribution for Linux (network driver)
BOM	Bill of Materials	NFV	Network Functions Virtualization
CPU	Central Processing Unit	NIC	Network Interface Card
CUDA	Compute Unified Device Architecture	OS	Operating System
DHCP	Dynamic Host Configuration Protocol	OVN	Open Virtual Network
DPDK	Data Plane Development Kit	OVS	Open vSwitch
EVPN	Ethernet VPN	PF	Physical Function
EVPN-MH	EVPN Multihoming	RDG	Reference Deployment Guide
FW	FirmWare	RDMA	Remote Direct Memory Access
GPU	Graphics Processing Unit	RoCE	RDMA over Converged Ethernet
HA	High Availability	SDN	Software Defined Networking
IP	Internet Protocol	SR-IOV	Single Root Input/Output Virtualization
IPMI	Intelligent Platform Management Interface	VF	Virtual Function
L3	IP Network Layer 3	VF-LAG	Virtual Function Link Aggregation
LACP	Link Aggregation Control Protocol	VLAN	Virtual LAN
MGMT	Management	VM	Virtual Machine

Introduction

Canonical Charmed OpenStack is an enterprise cloud platform based on Ubuntu OS with OpenStack packages, and OpenStack Charmed Operators for simplified deployment and operations.

This Reference Deployment Guide (RDG) demonstrates a step by step full deployment of a multi-tenant Charmed OpenStack cloud solution for common as well as high performance workloads. The deployment uses Nvidia highly available 100GbE fabric with hardware accelerated ML2/OVN as SDN with stateful Firewall and NAT services.

The use cases covered in this article include Geneve for East-West traffic and Floating IP DNAT for North-South, both fully accelerated with cloud Security Policy enforcement at line rate.

The demonstrated benchmark validation tests can be used as a reference for multiple workload use cases such as NFV, Big Data and AI, TCP/UDP, DPDK, RoCE/RDMA and GPUDirect/RDMA stacks.

References

Canonical Charmed OpenStack

OpenStack Charms Deployment Guide

Canonical Juju Charm Hub

NVIDIA Cumulus EVPN-Multihoming

NVIDIA GPUDirect

Data Plane Development Kit (DPDK) Home

Solution Architecture

Key Components and Technologies

NVIDIA A100 GPU
NVIDIA A100 Tensor Core GPU delivers unprecedented acceleration at every scale to power the world’s highest-performing elastic data centers for AI, data analytics, and HPC. Powered by the NVIDIA Ampere Architecture, A100 is the engine of the NVIDIA data center platform. A100 provides up to 20X higher performance over the prior generation and can be partitioned into seven GPU instances to dynamically adjust to shifting demands. Available in 40GB and 80GB memory versions, A100 80GB debuts the world’s fastest memory bandwidth at over 2 terabytes per second (TB/s) to run the largest models and datasets.

NVIDIA ConnectX SmartNICs
10/25/40/50/100/200 and 400G Ethernet Network Adapters
The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offer advanced hardware offloads and accelerations.
NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.

NVIDIA Cumulus Linux
NVIDIA® Cumulus® Linux is the industry's most innovative open network operating system that allows you to automate, customize, and scale your data center network like no other.

NVIDIA LinkX Cables
The NVIDIA® LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400GbE in Ethernet and 100, 200 and 400Gb/s InfiniBand products for Cloud, HPC, hyperscale, Enterprise, telco, storage and artificial intelligence, data center applications.

Canonical Charmed OpenStack
Canonical Charmed OpenStack is an enterprise cost-effective cloud platform, designed to run mission-critical workloads for telcos, financial institutions, hardware manufacturers, government institutions and enterprise.

Logical Design

Image description: Logical Design Main Components

Note

In this reference design, we used dedicated nodes for MAAS and Juju controllers without any HA configuration. In general, It is possible to co-locate MAAS and Juju controllers on the same physical machines, and the controllers can also be configured for HA.

Image description: Logical Design OpenStack Components

Note

In this reference design, we configured one of the OpenStack nodes to serve as a dedicated "controller" running only OpenStack control services, and two nodes to run compute services and host VMs. Juju charms allows flexible application distribution across nodes.

The only OpenStack applications configured as HA cluster in the charm bundle used to deploy the solution described in this article, are OVN-Central and MySQL DB. A full-HA application deployment is supported as well by Canonical Charmed OpenStack.

Network Fabric Design

Reference Network Architecture

The reference network architecture used for the solution described in this Reference Deployment Guide contains the following building blocks:

Fully stretched EVPN Multihoming (EVPN-MH) networking architecture. It is a standards-based replacement for MLAG in data centers deploying CLOS topologies with many advantages over existing solutions - For more information, refer to NVIDIA Cumulus EVPN-MH
2 x MSN3700C Spine switches
2 x MSN3700C Leaf switches per rack with Multihoming configuration and without any inter-leafs peer-links.
Host servers with 2 x 100Gbps ports, configured with LACP Active-Active bonding
- The bond interface is used in Openvswitch running on the Host
- Multiple L3 Openvswitch VLAN interfaces are configured on the Host
- The IP Subnets of the VLAN interfaces are mapped into MAAS spaces and used for Juju multi-space deployment
A converged and highly available 100GbE high speed fabric is used for all data, control, provisioning and management networks stretched over the datacenter using EVPN
1GbE dedicated IPMI fabric
MAAS node is configured as cloud components default gateway for internet access over the OAM network space
A dedicated node is configured as a default gateway of the Public network
MAAS and Juju Controller nodes were not deployed in server HA configuration or with bond networking
The entire fabric is configured to support Jumbo Frames (optional)

Image description: Reference Fabric Small Scale

Image description: Network Architecture Diagram

Note

The External Gateway node is using only as a default gateway for the Public network and not related to the VM's Floating IPs
For extreme message-rate workloads, additional network architectures, such as Routed Leaf Spine without an overlay, are available.

Large Scale

Image description: Large Scale Fabric

Note

Maximum Scale for 2 layers leaf spine fabric with the selected switches:

16 x MSN3700C switches as Spine
32 x MSN3700C switches as Leaf
16 x Racks
256 x Nodes (16 per rack)

This is a Non-Blocking scale topology without requiring any inter-leafs peer-link, due to EVPN-MH architecture.

Host Accelerated Bonding Logical Design

In the solution described in this article, enhanced SR-IOV with bonding support (ASAP² VF-LAG) is used to offload network processing from the host and VM into the network adapter hardware, while providing fast data plane with high availability functionality.

Two Virtual Functions, each on a different physical port of the same NIC, are bonded and allocated to the VM as a single LAGed VF. The bonded interface is connected to a single or multiple ToR switches, using Active-Standby or Active-Active bond modes.

Image description: VF-LAG components

For additional information, please refer to QSG: High Availability with Mellanox ASAP2 Enhanced SR-IOV (VF-LAG)..

Host and Application Logical Design

Image description: Host HW/SW Components

Compute host components:

NVIDIA A100 GPU Devices
NVIDIA ConnectX6-Dx High Speed NIC with a dual physical port, configured with LACP bonding in MLAG topology, and providing VF-LAG redundancy to the VM
Storage Drives for local OS usage
Ubuntu 22.04 as a base OS
OpenStack Yoga packages
Charmed OpenStack Platform software stack with:
- KVM-based hypervisor
- Openvswitch (OVS) with hardware offload support
- ML2/OVN Mechanism Driver

Virtual Machine components:

Ubuntu 22.04 as base OS
NVIDIA GPU devices allocated using PCI passthrough, allowing to bypass the compute server hypervisor
NVIDIA SR-IOV Virtual Function (VF) allocated using PCI passthrough, allowing to bypass the compute server hypervisor
NVIDIA cUDA and MLNX_OFED drivers for GPUDirect RDMA use case
DPDK user space libraries for accelerated network processing use case with VM kernel bypass
Performance and benchmark testing toolset, including iperf3, dpdk-apps and perftest-tools

Software Stack Components

Image description: Solution SW Stack Components

Bill of Materials (BOM)

Image description: Bill of Material Inventory

Deployment and Configuration

Wiring

Image description: Deployment Wiring

Network Fabric

NIC Firmware Upgrade and Settings

Please make sure to upgrade the ConnectX NIC firmware to the latest release, as listed here.

There are multiple ways to update the NIC firmware. One or them is by installing the mstflint package on the server hosting the NIC - Firmware Update Instructions.

In the following RDG, firmware update is not automated as part of the deployment. However, MAAS commissioning scripts can be used for this purpose - for additional information, refer to Canonical MAAS Commissioning Script Reference

Switch NOS Upgrade

Please make sure to upgrade Cumulus Linux to the latest release. Use the following links for further instructions and details: Upgrading Cumulus Linux or Installing a New Cumulus Linux Image.

Note

Starting from Cumulus Linux 4.2.0, the default password for the cumulus user account has changed to "cumulus", and must be changed upon first login.

Switch Configuration - Summary

Note

The tables in this section are aimed to explain the switches configurations and naming terminology used in the full configuration files.

For example in Leaf switch "Leaf0-1" which is located in Rack 0, VLANs 10 which is used for Internal network, is configured on interface swp9-swp10 which are members in BOND interfaces bond1-bond2, respectively, with MTU of 9000. This bond is configured to use EVPN multihoming segment mac-address 44:38:39:BE:EF:01.

Detailed switch configuration can be found in the next sections, and the tables below are introduced as a complementary visual tool for the full configuration files.

Networks Identifiers

Network	VLAN ID	EVPN VNI	MAAS Space
PXE/OAM	3000	3000	oam-space
Public	9	9	public-space
Internal	10	10	internal-space
Geneve overlay tenant	40	40	overlay-space
Provider-vlan tenant	101	101	N/A

Leaf-Host Interfaces

Rack-Leaf	Leaf Interface	Bond Interface	VLANs and Mode	MTU	MH Segment MAC
0-1	swp9	bond1	Tagged: 10,40,101,9 Untagged: 3000	9000	44:38:39:BE:EF:01
0-1	swp10	bond2	Tagged: 10,40,101,9 Untagged: 3000	9000	44:38:39:BE:EF:01
0-1	swp22	bond_ext	Untagged: 9	9000	44:38:39:BE:EF:01
0-1	swp23-24	N/A	Tagged: 9, Untagged: 3000	9000	N/A
0-2	swp9	bond1	Tagged: 10,40,101,9 Untagged: 3000	9000	44:38:39:BE:EF:01
0-2	swp10	bond2	Tagged: 10,40,101,9 Untagged: 3000	9000	44:38:39:BE:EF:01
0-2	swp22	bond_ext	Untagged: 9	9000	44:38:39:BE:EF:01
1-1	swp9	bond1	Tagged: 10,40,101,9 Untagged: 3000	9000	44:38:39:BE:EF:02
1-1	swp10	bond2	Tagged: 10,40,101,9 Untagged: 3000	9000	44:38:39:BE:EF:02
1-2	swp9	bond1	Tagged: 10,40,101,9 Untagged: 3000	9000	44:38:39:BE:EF:02
1-2	swp10	bond2	Tagged: 10,40,101,9 Untagged: 3000	9000	44:38:39:BE:EF:02

Leaf-Spine Interfaces

Rack-Leaf	Leaf Interfaces	Spine0 Interface	Spine1 Interface	MTU
0-1	swp31, swp32	swp13	swp13	9216 (default)
0-2	swp31, swp32	swp14	swp14	9216 (default)
1-1	swp31, swp32	swp15	swp15	9216 (default)
1-2	swp31, swp32	swp16	swp16	9216 (default)

Switch Interfaces Topology

Image description: Switch Interfaces Topology

Switch Configuration - Detailed

Note

The configuration below is provided as an NVUE commands set, and is matching the reference network architecture used in this article.

Rack1 configuration contains two bond interfaces although only a single host is used in the current reference network architecture.

Leaf0-1

Click to expand...

nv set interface lo ip address 10.10.10.1/32
nv set interface swp9-10,swp22,swp31-32
nv set interface bond1 bond member swp9
nv set interface bond2 bond member swp10
nv set interface bond_ext bond member swp22
nv set interface bond1 bond lacp-bypass on
nv set interface bond2 bond lacp-bypass on
nv set interface bond_ext bond lacp-bypass on
nv set interface bond1 link mtu 9000
nv set interface bond2 link mtu 9000
nv set interface bond_ext link mtu 9000
nv set interface bond_ext description External_GW_bond
nv set interface bond1-2 bridge domain br_default vlan 9,10,40,101,3000
nv set interface bond1-2 bridge domain br_default untagged 3000
nv set interface bond_ext bridge domain br_default vlan 9
nv set interface bond_ext bridge domain br_default untagged 9
nv set bridge domain br_default vlan 9 vni 9
nv set bridge domain br_default vlan 10 vni 10
nv set bridge domain br_default vlan 40 vni 40
nv set bridge domain br_default vlan 101 vni 101
nv set bridge domain br_default vlan 3000 vni 3000
nv set nve vxlan source address 10.10.10.1
nv set nve vxlan arp-nd-suppress on 
nv set system global anycast-mac 44:38:39:BE:EF:AA
nv set evpn enable on
nv set router bgp autonomous-system 65101
nv set router bgp router-id 10.10.10.1
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31 peer-group underlay
nv set vrf default router bgp neighbor swp32 peer-group underlay
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv set evpn multihoming enable on
nv set interface bond1 evpn multihoming segment local-id 1
nv set interface bond2 evpn multihoming segment local-id 2
nv set interface bond_ext evpn multihoming segment local-id 9
nv set interface bond1-2 evpn multihoming segment mac-address 44:38:39:BE:EF:01
nv set interface bond_ext evpn multihoming segment mac-address 44:38:39:BE:EF:01
nv set interface bond1-2 evpn multihoming segment df-preference 50000
nv set interface bond_ext evpn multihoming segment df-preference 50000
nv set interface swp31-32 evpn multihoming uplink on
nv set qos roce
nv set interface swp23 description juju
nv set interface swp24 description maas
nv set interface swp23-24 bridge domain br_default vlan 3000,9
nv set interface swp23-24 bridge domain br_default untagged 3000
nv config apply  -y

Leaf0-2

Click to expand...

nv set interface lo ip address 10.10.10.2/32
nv set interface swp9-10,swp22,swp31-32
nv set interface bond1 bond member swp9
nv set interface bond2 bond member swp10
nv set interface bond_ext bond member swp22
nv set interface bond1 bond lacp-bypass on
nv set interface bond2 bond lacp-bypass on
nv set interface bond_ext bond lacp-bypass on
nv set interface bond1 link mtu 9000
nv set interface bond2 link mtu 9000
nv set interface bond_ext link mtu 9000
nv set interface bond_ext description External_GW_bond
nv set interface bond1-2 bridge domain br_default vlan 9,10,40,101,3000
nv set interface bond1-2 bridge domain br_default untagged 3000
nv set interface bond_ext bridge domain br_default vlan 9
nv set interface bond_ext bridge domain br_default untagged 9
nv set bridge domain br_default vlan 9 vni 9
nv set bridge domain br_default vlan 10 vni 10
nv set bridge domain br_default vlan 40 vni 40
nv set bridge domain br_default vlan 101 vni 101
nv set bridge domain br_default vlan 3000 vni 3000
nv set nve vxlan source address 10.10.10.2
nv set nve vxlan arp-nd-suppress on 
nv set system global anycast-mac 44:38:39:BE:EF:AA
nv set evpn enable on
nv set router bgp autonomous-system 65102
nv set router bgp router-id 10.10.10.2
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31 peer-group underlay
nv set vrf default router bgp neighbor swp32 peer-group underlay
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv set evpn multihoming enable on
nv set interface bond1 evpn multihoming segment local-id 1
nv set interface bond2 evpn multihoming segment local-id 2
nv set interface bond_ext evpn multihoming segment local-id 9
nv set interface bond1-2 evpn multihoming segment mac-address 44:38:39:BE:EF:01
nv set interface bond_ext evpn multihoming segment mac-address 44:38:39:BE:EF:01
nv set interface bond1-2 evpn multihoming segment df-preference 50000
nv set interface bond_ext evpn multihoming segment df-preference 50000
nv set interface swp31-32 evpn multihoming uplink on
nv set qos roce
nv config apply  -y

Leaf1-1

Click to expand...

nv set interface lo ip address 10.10.10.3/32
nv set interface swp9-10,swp31-32
nv set interface bond1 bond member swp9
nv set interface bond2 bond member swp10
nv set interface bond1 bond lacp-bypass on
nv set interface bond2 bond lacp-bypass on
nv set interface bond1 link mtu 9000
nv set interface bond2 link mtu 9000
nv set interface bond1-2 bridge domain br_default vlan 9,10,40,101,3000
nv set interface bond1-2 bridge domain br_default untagged 3000
nv set bridge domain br_default vlan 9 vni 9
nv set bridge domain br_default vlan 10 vni 10
nv set bridge domain br_default vlan 40 vni 40
nv set bridge domain br_default vlan 101 vni 101
nv set bridge domain br_default vlan 3000 vni 3000
nv set nve vxlan source address 10.10.10.3
nv set nve vxlan arp-nd-suppress on 
nv set system global anycast-mac 44:38:39:BE:EF:AA
nv set evpn enable on
nv set router bgp autonomous-system 65103
nv set router bgp router-id 10.10.10.3
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31 peer-group underlay
nv set vrf default router bgp neighbor swp32 peer-group underlay
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv set evpn multihoming enable on
nv set interface bond1 evpn multihoming segment local-id 1
nv set interface bond2 evpn multihoming segment local-id 2
nv set interface bond1-2 evpn multihoming segment mac-address 44:38:39:BE:EF:02
nv set interface bond1-2 evpn multihoming segment df-preference 50000
nv set interface swp31-32 evpn multihoming uplink on
nv set qos roce
nv config apply  -y

Leaf1-2

Click to expand...

nv set interface lo ip address 10.10.10.4/32
nv set interface swp9-10,swp31-32
nv set interface bond1 bond member swp9
nv set interface bond2 bond member swp10
nv set interface bond1 bond lacp-bypass on
nv set interface bond2 bond lacp-bypass on
nv set interface bond1 link mtu 9000
nv set interface bond2 link mtu 9000
nv set interface bond1-2 bridge domain br_default vlan 9,10,40,101,3000
nv set interface bond1-2 bridge domain br_default untagged 3000
nv set bridge domain br_default vlan 9 vni 9
nv set bridge domain br_default vlan 10 vni 10
nv set bridge domain br_default vlan 40 vni 40
nv set bridge domain br_default vlan 101 vni 101
nv set bridge domain br_default vlan 3000 vni 3000
nv set nve vxlan source address 10.10.10.4
nv set nve vxlan arp-nd-suppress on 
nv set system global anycast-mac 44:38:39:BE:EF:AA
nv set evpn enable on
nv set router bgp autonomous-system 65104
nv set router bgp router-id 10.10.10.4
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31 peer-group underlay
nv set vrf default router bgp neighbor swp32 peer-group underlay
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv set evpn multihoming enable on
nv set interface bond1 evpn multihoming segment local-id 1
nv set interface bond2 evpn multihoming segment local-id 2
nv set interface bond1-2 evpn multihoming segment mac-address 44:38:39:BE:EF:02
nv set interface bond1-2 evpn multihoming segment df-preference 50000
nv set interface swp31-32 evpn multihoming uplink on
nv set qos roce
nv config apply  -y

Spine0

Click to expand...

nv set interface lo ip address 10.10.10.101/32
nv set interface swp13-16
nv set router bgp autonomous-system 65199
nv set router bgp router-id 10.10.10.101
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp13 peer-group underlay
nv set vrf default router bgp neighbor swp14 peer-group underlay
nv set vrf default router bgp neighbor swp15 peer-group underlay
nv set vrf default router bgp neighbor swp16 peer-group underlay
nv set vrf default router bgp address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv set qos roce
nv config apply  -y

Spine1

Click to expand...

nv set interface lo ip address 10.10.10.102/32
nv set interface swp13-16
nv set router bgp autonomous-system 65199
nv set router bgp router-id 10.10.10.102
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp13 peer-group underlay
nv set vrf default router bgp neighbor swp14 peer-group underlay
nv set vrf default router bgp neighbor swp15 peer-group underlay
nv set vrf default router bgp neighbor swp16 peer-group underlay
nv set vrf default router bgp address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv set qos roce
nv config apply  -y

Verification

Confirm the interfaces status on the Leaf switches. Make sure all interfaces are UP and configured with the correct MTU. Verify the correct LLDP neighbors:

Leaf0-2

Leaf0-2$ net show int
State  Name        Spd   MTU    Mode        LLDP                           Summary
-----  ----------  ----  -----  ----------  -----------------------------  -------------------------
UP     lo          N/A   65536  Loopback                                   IP: 127.0.0.1/8
       lo                                                                  IP: 10.10.10.2/32
       lo                                                                  IP: ::1/128
UP     eth0        1G    1500   Mgmt             						   Master: mgmt(UP)
       eth0                                                                IP: /24(DHCP)
PRTDN  swp1        N/A   9216   Default
PRTDN  swp2        N/A   9216   Default
PRTDN  swp3        N/A   9216   Default
UP     swp9        100G  9000   BondMember                                 Master: bond1(UP)
UP     swp10       100G  9000   BondMember                                 Master: bond2(UP)
UP     swp22       100G  9000   BondMember                                 Master: bond_ext(UP)
UP     swp31       100G  9216   Default     Spine0 (swp16)
UP     swp32       100G  9216   Default     Spine1 (swp16)
UP     bond1       100G  9000   802.3ad                                    Master: br_default(UP)
       bond1                                                               Bond Members: swp9(UP)
UP     bond2       100G  9000   802.3ad                                    Master: br_default(UP)
       bond2                                                               Bond Members: swp10(UP)
UP     bond_ext    100G  9000   802.3ad                                    Master: br_default(UP)
       bond_ext                                                            Bond Members: swp22(UP)
UP     br_default  N/A   9216   Bridge/L2
UP     mgmt        N/A   65536  VRF                                        IP: 127.0.0.1/8
       mgmt                                                                IP: ::1/128
UP     vxlan48     N/A   9216   Trunk/L2                                   Master: br_default(UP)

Confirm the BGP/EVPN neighbors discovery on all switches:

Leaf0-2

Leaf0-2$ net show bgp summary 
show bgp ipv4 unicast summary
=============================
BGP router identifier 10.10.10.2, local AS number 65102 vrf-id 0
BGP table version 18
RIB entries 11, using 2200 bytes of memory
Peers 2, using 46 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor           V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
Spine0(swp31)      4      65199   1212494   1212671        0    0    0 05w3d22h            4        6
Spine11(swp32)     4      65199   1212534   1212696        0    0    0 05w3d22h            4        6

Total number of neighbors 2


show bgp ipv6 unicast summary
=============================
% No BGP neighbors found


show bgp l2vpn evpn summary
===========================
BGP router identifier 10.10.10.2, local AS number 65102 vrf-id 0
BGP table version 0
RIB entries 63, using 12 KiB of memory
Peers 2, using 46 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor           V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
Spine0(swp31)      4      65199   1212494   1212671        0    0    0 05w3d22h          124      192
Spine11(swp32)     4      65199   1212534   1212696        0    0    0 05w3d22h          124      192

Total number of neighbors 2

Host

Prerequisites

Hardware specifications per host are described in the Bill Of Materials section.
ConnectX-6 Dx adapters configuration:
- MAAS/Juju/OpenStack Controller Node
  - Latest Firmware
  - Ports are set to operate in Ethernet mode (LINK_TYPE_P0/1 Firmware parameter is set to ETH)
  - PXE boot enabled on the ports on Flexboot BIOS
- OpenStack Compute Nodes
  - Latest Firmware
  - Ports are set to operate in Ethernet mode (LINK_TYPE_P0/1 Firmware parameter is set to ETH)
  - PXE boot enabled on the ports on Flexboot BIOS
  - SRIOV_EN firmware parameter is set to True
  - NUM_OF_VFS firmware parameter is set to a value matching the number of Virtual Functions used in the OpenStack bundle deployment file
  - ADVANCED_PCI_SETTINGS firmware parameter is set to True, and MAX_ACC_OUT_READ firmware parameter is set to a value of 44 for optimized bandwidth test results
  - ATS_ENABLED firmware parameter is set to True - For GPUDirect RDMA usage in Virtual Machines context
BIOS Configuration:
- OpenStack Controller Nodes
  - PXE boot is set in server boot order
- OpenStack Compute Nodes
  - Virtualization and SR-IOV enabled
  - Hyperthreading disabled
  - PXE boot is set in server boot order
  - ACS enabled - For GPUDirect RDMA usage in Virtual Machine

Cloud Deployment

MAAS Controller

MAAS Node Installation

Install Ubuntu 22.04 OS on the node, and log into it using SSH
- Configure IP addresses on the interface connected to the high speed fabric
  - IP address from the PXE/OAM subnet (untagged) - In our case, we used 192.168.25.1
  - VLAN IP for the public subnet (in our case, VLAN ID 9)
Follow the instructions specified in the MAAS Installation Guide in order to complete the steps:
- Install MAAS from a snap
- Install and setup PostgreSQL
- Initialize MAAS and verify the services are running

MAAS Node Configuration

Follow the instructions specified in the MAAS Installation Guide in order to complete the steps (Make sure to select "CLI" method under How to configure MAAS section):
- Create an admin user
- Generate an API-key, and login to MAAS CLI
- Set an upstream DNS
- Set up SSH for the admin user
- Import images
- Enable DHCP on the PXE untagged VLAN (subnet 192.168.25.0/24)

MAAS Networking Configuration

Login to MAAS UI and apply the following settings under the Subnets tab:
- Locate the auto-discovered PXE/OAM fabric, and change its name to "fabric-high-speed". Make sure its untagged VLAN appears with MAAS-Provided DHCP
- Add the following spaces:
  - oam-space
  - internal-space
  - overlay-space
  - public-space
- Edit the fabric-high-speed untagged VLAN
  - Name: untagged-pxe-oam
  - Space: oam-space
  - MTU: 9000
- Edit the fabric-high-speed Subnet
  - Name: pxe-oam-subnet
  - Gateway IP: 192.168.25.1
    
    This is the IP address we assigned to the MAAS node on the interface connected to the high speed fabric, as in our solution example we use the MAAS node as the default Gateway of the deployed machines on the OAM network.
- Add and edit the following VLANs
  - v9-public
    - VID: 9
    - Space: public-space
    - MTU: 9000
    - DHCP: Disabled
  - v10-internal
    - VID: 10
    - Space: internal-space
    - MTU: 9000
    - DHCP: Disabled
  - v40-overlay
    - VID: 40
    - Space: overlay-space
    - MTU: 9000
    - DHCP: Disabled
- Add and edit the following Subnets
  - public-subnet
    - CIDR: 10.7.208.0/24
    - Fabric: fabric-high-speed
    - VLAN: 9(v9-public)
    - Reserved Ranges: per network requirements
      
      In our solution example, public network IPs are assigned by the MAAS to the deployed machines. However, they are also assigned by OpenStack Neutron to the Virtual Instances as Floating IPs. Make sure to reserve the IP range used by Neutron.
  - internal-subnet
    - CIDR: 172.18.0.0/24
    - Fabric: fabric-high-speed
    - VLAN: 10(v10-internal)
  - overlay-subnet
    - CIDR: 172.16.0.0/24
    - Fabric: fabric-high-speed
    - VLAN: 40(v40-overlay)

Image description: MAAS Fabrics

BareMetal MAAS Machines

Machines Inventory Creation and Commissioning

Under the MAAS UI "Machines" tab, add and commision four machines:
- Juju Controller node
- OpenStack Compute node 1
- OpenStack Compute node 2
- OpenStack Controller node

It is also possible to add and commision all machines by running the following maas-cli script on the MAAS node:

Note

Edit the file below with the servers IPMI IPs, username and password.

#Juju controller
maas admin machines create \
    hostname=controller \
    architecture=amd64 \
    power_type=ipmi \
    power_parameters_power_driver=LAN_2_0 \
    power_parameters_power_user=***** \
    power_parameters_power_pass=***** \
    power_parameters_power_address=192.168.0.10

#OpenStack Compute servers

maas admin machines create \
    hostname=node1 \
    architecture=amd64 \
    power_type=ipmi \
    power_parameters_power_driver=LAN_2_0 \
    power_parameters_power_user=***** \
    power_parameters_power_pass=***** \
    power_parameters_power_address=192.168.0.11

maas admin machines create \
    hostname=node2 \
    architecture=amd64 \
    power_type=ipmi \
    power_parameters_power_driver=LAN_2_0 \
    power_parameters_power_user=***** \
    power_parameters_power_pass=***** \
    power_parameters_power_address=192.168.0.12

#Openstack Controller server

maas admin machines create \
    hostname=node3 \
    architecture=amd64 \
    power_type=ipmi \
    power_parameters_power_driver=LAN_2_0 \
    power_parameters_power_user=***** \
    power_parameters_power_pass=***** \
    power_parameters_power_address=192.168.0.13

ubuntu@maas-opstk:~$ chmod +x nodes-inventory.sh 
ubuntu@maas-opstk:~$ ./nodes-inventory.sh

Machines Network Configuration

Note

The configuration below is using Open vSwitch bridge type in order to utilize the OVS-based HW Offload capabilities for all workloads and traffic types on the high speed NIC.

Once the machines are commissioned and in "Ready" state, proceed with the following Network configuration per machine:
- controller (Juju Controller node)
  - Physical -> Edit: Fabric fabric-high-speed, VLAN untagged, Subnet pxe-oam-subnet, IP Mode Auto assign
  - Physical -> Add VLAN: VLAN v9-public, Subnet public-subnet, IP Mode Auto assign
- node1 (OpenStack Compute node 1)
  - 2 x Physical -> Create Bond: Name bond0, Bond mode 802.3ad, IP Mode Unconfigured
  - bond0 -> Create Bridge: Name br-nvda, Type Open vSwitch (ovs), Fabric fabric-high-speed, VLAN untagged, Subnet pxe-oam-subnet, IP Mode Auto assign
  - br-nvda -> Add VLAN: VLAN v9-public, Subnet public-subnet, IP Mode Auto assign
  - br-nvda -> Add VLAN: VLAN v10-internal, Subnet internal-subnet, IP Mode Auto assign
  - br-nvda -> Add VLAN: VLAN v40-overlay, Subnet overlay-subnet, IP Mode Auto assign
- node2 (OpenStack Compute node 2)
  - 2 x Physical -> Create Bond: Name bond0, Bond mode 802.3ad, IP Mode Unconfigured
  - bond0 -> Create Bridge: Name br-nvda, Type Open vSwitch (ovs), Fabric fabric-high-speed, VLAN untagged, Subnet pxe-oam-subnet, IP Mode Auto assign
  - br-nvda -> Add VLAN: VLAN v9-public, Subnet public-subnet, IP Mode Auto assign
  - br-nvda -> Add VLAN: VLAN v10-internal, Subnet internal-subnet, IP Mode Auto assign
  - br-nvda -> Add VLAN: VLAN v40-overlay, Subnet overlay-subnet, IP Mode Auto assign
- node 3 (OpenStack Controller node)
  - 2 x Physical -> Create Bond: Name bond0, Bond mode 802.3ad, IP Mode Unconfigured
  - bond0 -> Create Bridge: Name br-nvda, Type Open vSwitch (ovs), Fabric fabric-high-speed, VLAN untagged, Subnet pxe-oam-subnet, IP Mode Auto assign
  - br-nvda -> Add VLAN: VLAN v9-public, Subnet public-subnet, IP Mode Auto assign
  - br-nvda -> Add VLAN: VLAN v10-internal, Subnet internal-subnet, IP Mode Auto assign
  - br-nvda -> Add VLAN: VLAN v40-overlay, Subnet overlay-subnet, IP Mode Auto assign
  - The image below is an example for node network configuration: Image description: MAAS Machine Network Configuration

Machines Tagging

Create Tags and assign to the machines according to their roles.

In the example below, the MAAS CLI is used for creating and assigning tags. It is also possible to use the MAAS UI.

Create new tags.

ubuntu@maas-opstk:~$ maas admin tags create name=juju comment="Juju Controller"
ubuntu@maas-opstk:~$ maas admin tags create name=controller  comment="OpenStack Controller"
ubuntu@maas-opstk:~$ maas admin tags create name=compute_sriov  comment="Performance tuning kernel parameters" kernel_opts="default_hugepagesz=1G hugepagesz=1G hugepages=96 intel_iommu=on iommu=pt blacklist=nouveau rd.blacklist=nouveau isolcpus=2-23"

Note

The compute nodes tag includes kernel parameters setting for performance tuning, such as hugepages and isolated CPUs. The isolated cores are correlated with OpenStack Nova cpu-dedicated-set configuration used in the charm deployment bundle.

Identify machines system IDs:

ubuntu@maas-opstk:~$ sudo apt install jq 
ubuntu@maas-opstk:~$ maas admin machines read | jq '.[] | .hostname, .system_id'
"controller"
"nsfyqw"
"node1"
"gxmyax"
"node2"
"yte7xf"
"node3"
"s67rep"

Assign tags to the relevant machines using its system IDs:

ubuntu@maas-opstk:~$ maas admin tag update-nodes juju add=nsfyqw
ubuntu@maas-opstk:~$ maas admin tag update-nodes controller add=s67rep
ubuntu@maas-opstk:~$ maas admin tag update-nodes compute_sriov add=gxmyax add=yte7xf

Juju Controller Bootstrap

Bootstrap the juju controller machine tagged with "juju":

ubuntu@maas-opstk:~$ juju bootstrap --bootstrap-series=focal --constraints tags=juju mymaas maas-controller --debug

OpenStack Charm Bundle File Configuration

The following "openstack-bundle-jammy-multi-space-nvidia-network.yaml" bundle file was used in our solution to allow our desired deployment according to the solution design guidelines. It includes the following main characteristics:
- Jammy-based OS image for the deployed nodes with OpenStack Yoga
- Multi-space charms configuration matching the solution network design
- 2 x Compute machines, 1 x Control machine
- Hardware Offload Enabled
- NVIDIA CX6-Dx and A100 GPU in NOVA PCI whitelist for device passthrough allocation
- Hybrid VF Pool (NVIDIA CX6-Dx PFs are used for both Geneve overlay and vlan-provider accelerated VFs)
- Dedicated CPU cores from the same NUMA node associated with the NVIDIA ConnectX6-Dx NIC for enhanced performance VMs
- Jumbo MTU

openstack-bundle-jammy-multi-space-nvidia-network.yaml

# Please refer to the OpenStack Charms Deployment Guide for more information.
# https://docs.openstack.org/project-deploy-guide/charm-deployment-guide
#
#
# *** MULTI-SPACE OPENSTACK CLOUD DEPLOYMENT ***
#
# Note: mysql-innodb-cluster cluster space should be the same space of shared-db and db-router of all other router apps
#

series: &series jammy

variables:
  openstack-origin: &openstack-origin distro
  worker-multiplier: &worker-multiplier 0.25
  
  #Network spaces
  oam-space:           &oam-space           oam-space
  public-space:        &public-space        public-space
  internal-space:      &internal-space      internal-space
  overlay-space:       &overlay-space       overlay-space
  #As internal space ise used as default space for most application, space constrains is ued to make sure all applications can communicate with their default GW located on the OAM space 
  space-constr:        &space-constr    spaces=oam-space
  
machines:
  '0':
    constraints: spaces=oam-space,public-space,internal-space,overlay-space tags=compute_sriov
  '1':
    constraints: spaces=oam-space,public-space,internal-space,overlay-space tags=compute_sriov
  '2':
    constraints: spaces=oam-space,public-space,internal-space,overlay-space tags=controller
    
applications:
  
  glance-mysql-router:
    charm: ch:mysql-router
    channel: 8.0/stable
    bindings:
      "": *internal-space
      
  glance:
    charm: ch:glance
    num_units: 1
    options:
      worker-multiplier: *worker-multiplier
      openstack-origin: *openstack-origin
    to:
    - lxd:2
    channel: yoga/stable
    bindings:
      "": *internal-space
      admin: *oam-space
      public: *public-space
      
  keystone-mysql-router:
    charm: ch:mysql-router
    channel: 8.0/stable
    bindings:
      "": *internal-space
      
  keystone:
    charm: ch:keystone
    num_units: 1
    options:
      worker-multiplier: *worker-multiplier
      openstack-origin: *openstack-origin
    to:
    - lxd:2
    channel: yoga/stable
    bindings:
      "": *internal-space
      admin: *oam-space
      public: *public-space
      
  neutron-mysql-router:
    charm: ch:mysql-router
    channel: 8.0/stable
    bindings:
      "": *internal-space
      
  neutron-api-plugin-ovn:
    charm: ch:neutron-api-plugin-ovn
    channel: yoga/stable
    options:
      enable-distributed-floating-ip: false
      dns-servers: 8.8.8.8
    bindings:
      "": *internal-space
      neutron-plugin: *internal-space
      
  neutron-api:
    charm: ch:neutron-api
    num_units: 1
    options:
      neutron-security-groups: true
      enable-ml2-dns: true
      flat-network-providers: ''
      worker-multiplier: *worker-multiplier
      openstack-origin: *openstack-origin
      enable-ml2-port-security: true
      enable-hardware-offload: true
      global-physnet-mtu: 9000
      vlan-ranges: tenantvlan:100:200
    to:
    - lxd:2
    channel: yoga/stable
    bindings:
      "": *internal-space
      admin: *oam-space
      public: *public-space
      neutron-plugin-api-subordinate: *internal-space
      
  placement-mysql-router:
    charm: ch:mysql-router
    channel: 8.0/stable
    bindings:
      "": *internal-space
      
  placement:
    charm: ch:placement
    num_units: 1
    options:
      worker-multiplier: *worker-multiplier
      openstack-origin: *openstack-origin
    to:
    - lxd:2
    channel: yoga/stable
    bindings:
      "": *internal-space
      admin: *oam-space
      public: *public-space
      
  nova-mysql-router:
    charm: ch:mysql-router
    channel: 8.0/stable
    bindings:
      "": *internal-space
      
  nova-cloud-controller:
    charm: ch:nova-cloud-controller
    num_units: 1
    options:
      network-manager: Neutron
      worker-multiplier: *worker-multiplier
      openstack-origin: *openstack-origin
      pci-alias: '{"vendor_id":"10de","product_id":"20f1","name":"a100-gpu","device_type":"type-PF"}'
    to:
    - lxd:2
    channel: yoga/stable
    bindings:
      "": *internal-space
      admin: *oam-space
      public: *public-space
      
  nova-compute:
    charm: ch:nova-compute
    num_units: 2
    options:
      config-flags: default_ephemeral_format=ext4
      enable-live-migration: true
      enable-resize: true
      migration-auth-type: ssh
      openstack-origin: *openstack-origin
      pci-passthrough-whitelist: '[{"devname": "enp63s0f0", "physical_network": null}, {"devname": "enp63s0f1", "physical_network": "tenantvlan"}, {"vendor_id": "10de", "product_id": "20f1"}]'
      pci-alias: '{"vendor_id":"10de","product_id":"20f1","name":"a100-gpu","device_type":"type-PF"}'
      cpu-dedicated-set: 2-23
    to:
    - '0'
    - '1'
    channel: yoga/stable
    bindings:
      "": *internal-space
      
  dashboard-mysql-router:
    charm: ch:mysql-router
    channel: 8.0/stable
    bindings:
      "": *internal-space
      
  openstack-dashboard:
    charm: ch:openstack-dashboard
    num_units: 1
    options:
      openstack-origin: *openstack-origin
    to:
    - lxd:2
    constraints: *space-constr
    channel: yoga/stable
    bindings:
      "": *internal-space
      public: *public-space
      cluster: *public-space 
      
  rabbitmq-server:
    charm: ch:rabbitmq-server
    channel: 3.9/stable
    num_units: 1
    to:
    - lxd:2
    constraints: *space-constr
    bindings:
      "": *internal-space
      
  mysql-innodb-cluster:
    charm: ch:mysql-innodb-cluster
    num_units: 3
    to:
    - lxd:0
    - lxd:1
    - lxd:2
    channel: 8.0/stable
    constraints: *space-constr
    bindings:
      "": *internal-space
      
  ovn-central:
    charm: ch:ovn-central
    num_units: 3
    options:
      source: *openstack-origin
    to:
    - lxd:0
    - lxd:1
    - lxd:2
    channel: 22.03/stable
    constraints: *space-constr
    bindings:
      "": *internal-space
      
  ovn-chassis:
    charm: ch:ovn-chassis
    # Please update the `bridge-interface-mappings` to values suitable for the
    # hardware used in your deployment. See the referenced documentation at the
    # top of this file.
    options:
      ovn-bridge-mappings: tenantvlan:br-nvda
      bridge-interface-mappings: br-nvda:bond0
      enable-hardware-offload: true
      sriov-numvfs: "enp63s0f0:8 enp63s0f1:8"
    channel: 22.03/stable
    bindings:
      "": *internal-space
      data: *overlay-space
      
  vault-mysql-router:
    charm: ch:mysql-router
    channel: 8.0/stable
    bindings:
      "": *internal-space
      
  vault:
    charm: ch:vault
    channel: 1.7/stable
    num_units: 1
    to:
    - lxd:2
    constraints: *space-constr
    bindings:
      "": *internal-space
      access: *public-space
      
relations:
- - nova-compute:amqp
  - rabbitmq-server:amqp
- - nova-cloud-controller:identity-service
  - keystone:identity-service
- - glance:identity-service
  - keystone:identity-service
- - neutron-api:identity-service
  - keystone:identity-service
- - neutron-api:amqp
  - rabbitmq-server:amqp
- - glance:amqp
  - rabbitmq-server:amqp
- - nova-cloud-controller:image-service
  - glance:image-service
- - nova-compute:image-service
  - glance:image-service
- - nova-cloud-controller:cloud-compute
  - nova-compute:cloud-compute
- - nova-cloud-controller:amqp
  - rabbitmq-server:amqp
- - openstack-dashboard:identity-service
  - keystone:identity-service
- - nova-cloud-controller:neutron-api
  - neutron-api:neutron-api
- - placement:identity-service
  - keystone:identity-service
- - placement:placement
  - nova-cloud-controller:placement
- - keystone:shared-db
  - keystone-mysql-router:shared-db
- - glance:shared-db
  - glance-mysql-router:shared-db
- - nova-cloud-controller:shared-db
  - nova-mysql-router:shared-db
- - neutron-api:shared-db
  - neutron-mysql-router:shared-db
- - openstack-dashboard:shared-db
  - dashboard-mysql-router:shared-db
- - placement:shared-db
  - placement-mysql-router:shared-db
- - vault:shared-db
  - vault-mysql-router:shared-db
- - keystone-mysql-router:db-router
  - mysql-innodb-cluster:db-router
- - nova-mysql-router:db-router
  - mysql-innodb-cluster:db-router
- - glance-mysql-router:db-router
  - mysql-innodb-cluster:db-router
- - neutron-mysql-router:db-router
  - mysql-innodb-cluster:db-router
- - dashboard-mysql-router:db-router
  - mysql-innodb-cluster:db-router
- - placement-mysql-router:db-router
  - mysql-innodb-cluster:db-router
- - vault-mysql-router:db-router
  - mysql-innodb-cluster:db-router
- - neutron-api-plugin-ovn:neutron-plugin
  - neutron-api:neutron-plugin-api-subordinate
- - ovn-central:certificates
  - vault:certificates
- - ovn-central:ovsdb-cms
  - neutron-api-plugin-ovn:ovsdb-cms
- - neutron-api:certificates
  - vault:certificates
- - ovn-chassis:nova-compute
  - nova-compute:neutron-plugin
- - ovn-chassis:certificates
  - vault:certificates
- - ovn-chassis:ovsdb
  - ovn-central:ovsdb
- - vault:certificates
  - neutron-api-plugin-ovn:certificates
- - vault:certificates
  - glance:certificates
- - vault:certificates
  - keystone:certificates
- - vault:certificates
  - nova-cloud-controller:certificates
- - vault:certificates
  - openstack-dashboard:certificates
- - vault:certificates
  - placement:certificates
- - vault:certificates
  - mysql-innodb-cluster:certificates

OpenStack Cloud Deployment

Verify the MAAS configured spaces were loaded by juju:

ubuntu@maas-opstk:~$ juju spaces
Name            Space ID  Subnets        
alpha           0                        
oam-space       1         192.168.25.0/24
public-space    2         10.7.208.0/24  
overlay-space   3         172.16.0.0/24  
internal-space  4         172.18.0.0/24

Note

Run "juju reload-spaces" to force the operation in case it was not loaded.

Create a new model for the OpenStack cloud deployment:

ubuntu@maas-opstk:~$ juju add-model --config default-series=focal openstack

Deploy the OpenStack cloud using the prepared deployment bundle file:

ubuntu@maas-opstk:~$ juju deploy ./openstack-bundle-jammy-multi-space-nvidia-network.yaml

Follow deployment progress and status:

ubuntu@maas-opstk:~$ juju debug-log --replay
ubuntu@maas-opstk:~$ juju status

Post Deployment Operations

Once the deployment is stabilized, and the "juju status" indicates all applications are Active, collect the Vault app Public Address from the output, and proceed with the actions below.

Vault Initialization/CA Certificate

Install the vault client on the MAAS node:

ubuntu@maas-opstk:~$ sudo snap install vault

Initialize Vault:

ubuntu@maas-opstk:~$ export VAULT_ADDR="http://<Vault App Public Address>:8200"
ubuntu@maas-opstk:~$ vault operator init -key-shares=5 -key-threshold=3

Unseal Vault:

ubuntu@maas-opstk:~$ vault operator unseal <Key1>
ubuntu@maas-opstk:~$ vault operator unseal <Key2>
ubuntu@maas-opstk:~$ vault operator unseal <Key3>

Authorize the Vault Charm:

ubuntu@maas-opstk:~$ export VAULT_TOKEN=<Vault Initial Root Token>
ubuntu@maas-opstk:~$ vault token create -ttl=10m
ubuntu@maas-opstk:~$ juju run-action --wait vault/leader authorize-charm token=$VAULT_TOKEN

Add CA certificate:

ubuntu@maas-opstk:~$ juju run-action --wait vault/leader generate-root-ca

Monitor the "juju status" output until all units are in Ready state:
```
ubuntu@maas-opstk:~$ juju status
```
Note

For more details regarding Vault initialization and adding CA certificate, refer to:

https://opendev.org/openstack/charm-vault/src/branch/master/src/README.md#post-deployment-tasks

https://docs.openstack.org/charm-guide/latest/admin/security/tls.html#add-a-ca-certificate

SR-IOV and Hardware Acceleration Enablement Verification

The following file should be created on each compute node after the ovn-chassis app is ready:

/etc/netplan/150-charm-ovn.yaml

###############################################################################
# [ WARNING ]
# Configuration file maintained by Juju. Local changes may be overwritten.
# Config managed by ovn-chassis charm
###############################################################################
network:
  version: 2
  ethernets:
    enp63s0f0:
      virtual-function-count: 8
      embedded-switch-mode: switchdev
      delay-virtual-functions-rebind: true
    
    enp63s0f1:
      virtual-function-count: 8
      embedded-switch-mode: switchdev
      delay-virtual-functions-rebind: true

For optimal performance benchmark, increase the number of NIC MSIX queues per SR-IOV Virtual Function on every compute node

Install the mstflint package.
```
# apt install mstflint -y
```

Locate the Connect-X Adapter PCI ID.

root@node1:/home/ubuntu# lspci | grep -i nox
3f:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
3f:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]

Configure NUM_VF_MSIX NIC FW Parameter

root@node1:/home/ubuntu# mstconfig -d 3f:00.0 s NUM_VF_MSIX=63

Device #1:
----------

Device type:    ConnectX6DX     
Name:           MCX623106AC-CDA_Ax
Description:    ConnectX-6 Dx EN adapter card; 100GbE; Dual-port QSFP56; PCIe 4.0 x16; Crypto and Secure Boot
Device:         3f:00.0         

Configurations:                              Next Boot       New
         NUM_VF_MSIX                         11              63              

 Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.

In order to apply the SR-IOV and hardware acceleration configuration, reboot the compute nodes, one at a time:
```
root@node1:/home/ubuntu# reboot
```

After the compute node boots up, verify the configuration was applied:

root@node1:/home/ubuntu# lspci | grep -i nox
3f:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
3f:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
3f:00.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:00.3 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:00.4 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:00.5 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:00.6 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:00.7 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:01.0 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:01.1 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:08.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:08.3 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:08.4 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:08.5 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:08.6 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:08.7 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:09.0 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
3f:09.1 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function

root@node1:/home/ubuntu# lspci | grep "Virtual Function" | wc -l
16

root@node1:/home/ubuntu# devlink dev eswitch show pci/0000:3f:00.0
pci/0000:3f:00.0: mode switchdev inline-mode none encap-mode basic

root@node1:/home/ubuntu# devlink dev eswitch show pci/0000:3f:00.1
pci/0000:3f:00.1: mode switchdev inline-mode none encap-mode basic

Monitor the "juju status" output, and verify all units and applications are recovered:
```
ubuntu@maas-opstk:~$ juju status
```

QoS Settings

Apply the following QoS configuration on both compute nodes:

Note

The following configuration section is required for optimal RDMA benchmark testing using Lossless RoCE configuration,and it is tuned for prioritizing a specific DSCP marker that will be used later on in the benchmark test.

Please notice it will not survive a reboot.

Configure OVS to copy the inner DSCP into the Geneve encapsulated header:

root@node1:/home/ubuntu# ovs-vsctl set Open_vSwitch . external_ids:ovn-encap-tos=inherit

Configure both physical bond interfaces with PFC/DSCP configuration adjusted for RDMA:

root@node1:/home/ubuntu# apt-get install python2
root@node1:/home/ubuntu# git clone https://github.com/Mellanox/mlnx-tools
root@node1:/home/ubuntu# cd mlnx-tools/python/
root@node1:/home/ubuntu# cat /proc/net/bonding/bond0 | grep "Slave Int"
Slave Interface: enp63s0f0
Slave Interface: enp63s0f1
root@node1:/home/ubuntu# python2 mlnx_qos -i enp63s0f0 --trust=dscp --pfc=0,0,0,1,0,0,0,0
root@node1:/home/ubuntu# python2 mlnx_qos -i enp63s0f1 --trust=dscp --pfc=0,0,0,1,0,0,0,0

OpenStack Cloud Operations Verification

Install OpenStack client on the MAAS node:

ubuntu@maas-opstk:~$ sudo apt install python3-openstackclient -y

Create cloud access credentials:

ubuntu@maas-opstk:~$ sudo git clone https://github.com/openstack-charmers/openstack-bundles ~/openstack-bundles
ubuntu@maas-opstk:~$ source ~/openstack-bundles/stable/openstack-base/openrc

Confirm you can access the cloud from the command line:

ubuntu@maas-opstk:~$ openstack service list

+----------------------------------+-----------+-----------+
| ID                               | Name      | Type      |
+----------------------------------+-----------+-----------+
| 23fe81313c3b476cbf5bd29d5f0570fe | nova      | compute   |
| bb0783870a314bd0a171c3714d7cf44b | neutron   | network   |
| d3060dd804184a98b91e79191c27b8e3 | keystone  | identity  |
| d5c3d2c0b3f04602b728fc1480eee879 | glance    | image     |
| d946299bca114c6c9e16ffe3109bf7e1 | placement | placement |
+----------------------------------+-----------+-----------+

Applications and Use Cases

Accelerated OVN Packet Processing (SDN Acceleration)

Note

The following use cases demonstrate SDN layer acceleration using hardware offload capabilities. The tests include a Telco grade benchmark that aims to push SDN offload into optimal performance and validate its functionality.

Use Case Topology

The following topology describes VM instances located on remote compute nodes with hardware accelerated bond, and running different workloads over Geneve overlay tenant network.

Image description: SDN Acceleration Use Case Topology

Use Case Configuration

VM image

Upload the Ubuntu VM cloud image to the image store:

Note

The VM image was built using a disk-image-builder tool. The following article can be used as a reference: How-to: Create OpenStack Cloud Image with NVIDIA GPU and Network Drivers.
```
$ openstack image create --container-format bare --disk-format qcow2 --file ~/images/ubuntu-perf.qcow2 ubuntu-perf
```

VM Flavor

Create a flavor:

$ openstack flavor create m1.packet --ram 8192 --disk 20 --vcpus 20

Set hugepages and cpu-pinning parameters:

$ openstack flavor set m1.packet --property hw:mem_page_size=large
$ openstack flavor set m1.packet --property hw:cpu_policy=dedicated

Security Policy

Create a stateful security group policy to apply on the management network ports:

$ openstack security group create mgmt_policy
$ openstack security group rule create mgmt_policy --protocol tcp --ingress --dst-port 22

Create a stateful security group policy to apply on the data network ports:

$ openstack security group create data_policy
$ openstack security group rule create data_policy --protocol icmp --ingress
$ openstack security group rule create data_policy --protocol icmp --egress

SSH Keys

Create an SSH key pair:

$ openstack keypair create --public-key ~/.ssh/id_rsa.pub bastion

VM Networks and Ports

Create a management overlay network:

$ openstack network create gen_mgmt --provider-network-type geneve --share
$ openstack subnet create gen_mgmt_subnet --dhcp --network gen_mgmt --subnet-range 22.22.22.0/24

Create 2 normal management network ports with management security policy:

$ openstack port create normal1 --network gen_mgmt --security-group mgmt_policy
$ openstack port create normal2 --network gen_mgmt --security-group mgmt_policy

Create a data overlay network:

$ openstack network create gen_data --provider-network-type geneve --share
$ openstack subnet create gen_data_subnet --dhcp --network gen_data --subnet-range 33.33.33.0/24 --gateway none

Create 3 data network accelerated ports with data security policy:

$ openstack port create direct_overlay1 --vnic-type=direct --network gen_data --binding-profile '{"capabilities":["switchdev"]}' --security-group data_policy
$ openstack port create direct_overlay2 --vnic-type=direct --network gen_data --binding-profile '{"capabilities":["switchdev"]}' --security-group data_policy

VM Instances

Create 2 VM instances, one on each compute node:

$ openstack server create --key-name bastion --flavor m1.packet --image ubuntu-perf --port normal1 --port direct_overlay1 vm1 --availability-zone nova:node1.maas

$ openstack server create --key-name bastion --flavor m1.packet --image ubuntu-perf --port normal2 --port direct_overlay2 vm2 --availability-zone nova:node2.maas

VM Public Access

Create a vlan-provider external network and subnet:

Note

Make sure to use the VLAN ID configured in the network for allowing public access. In our solution VLAN ID 9 is used.

$ openstack network create vlan_public --provider-physical-network tenantvlan --provider-network-type vlan --provider-segment 9 --share --external  
$ openstack subnet create public_subnet --no-dhcp --network vlan_public --subnet-range 10.7.208.0/24 --allocation-pool start=10.7.208.65,end=10.7.208.85 --gateway 10.7.208.1

Create a public router, and attach the public and the management subnets:

$ openstack router create public_router 
$ openstack router set public_router --external-gateway vlan_public
$ openstack router add subnet public_router gen_mgmt_subnet

Create Floating IPs from the public subnet range, and attach them to the VM instances:

$ openstack floating ip create --floating-ip-address 10.7.208.90 vlan_public 
$ openstack floating ip create --floating-ip-address 10.7.208.91 vlan_public
$ openstack server add floating ip vm1 10.7.208.90 
$ openstack server add floating ip vm2 10.7.208.91

Use Case Setup Validation

Verify that the instances were created successfully:

$ penstack server list
+--------------------------------------+------+--------+-----------------------------------------------------------+-------------+-----------+
| ID                                   | Name | Status | Networks                                                  | Image       | Flavor    |
+--------------------------------------+------+--------+-----------------------------------------------------------+-------------+-----------+
| d19fe378-01a8-4b5e-ab7b-dd2c85edffbf | vm1  | ACTIVE | gen_data=33.33.33.219; gen_mgmt=10.7.208.90, 22.22.22.220 | ubuntu-perf | m1.packet |
| 3747aa17-6a97-4bde-9fbe-bd553155a73c | vm2  | ACTIVE | gen_data=33.33.33.84;  gen_mgmt=10.7.208.91, 22.22.22.181 | ubuntu-perf | m1.packet |
+--------------------------------------+------+--------+-----------------------------------------------------------+-------------+-----------+

$ ssh -i ~/.ssh/id_rsa 10.7.208.90

$ ssh -i ~/.ssh/id_rsa 10.7.208.91

Verify that the VM instances can ping each other over the accelerated data interface:

$ ubuntu@vm1:~$ ping -c 4 33.33.33.84
PING 33.33.33.84 (33.33.33.84) 56(84) bytes of data.
64 bytes from 33.33.33.84: icmp_seq=1 ttl=64 time=0.492 ms
64 bytes from 33.33.33.84: icmp_seq=2 ttl=64 time=0.466 ms
64 bytes from 33.33.33.84: icmp_seq=3 ttl=64 time=0.437 ms
64 bytes from 33.33.33.84: icmp_seq=4 ttl=64 time=0.387 ms

--- 33.33.33.84 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3075ms
rtt min/avg/max/mdev = 0.387/0.445/0.492/0.038 ms

Verify Jumbo Frame connectivity over the accelerated data network:

[root@vm1:/home/ubuntu# ip link show ens5 | grep mtu
3: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc mq state UP mode DEFAULT group default qlen 1000

root@vm1:/home/ubuntu# ping -M do -s 8914 33.33.33.84
PING 33.33.33.84 (33.33.33.84) 8914(8942) bytes of data.
8922 bytes from 33.33.33.84: icmp_seq=1 ttl=64 time=0.295 ms
8922 bytes from 33.33.33.84: icmp_seq=2 ttl=64 time=0.236 ms
8922 bytes from 33.33.33.84: icmp_seq=3 ttl=64 time=0.236 ms
8922 bytes from 33.33.33.84: icmp_seq=4 ttl=64 time=0.214 ms
^C
--- 33.33.33.84 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3053ms
rtt min/avg/max/mdev = 0.214/0.245/0.295/0.030 ms

Verify that the security policy is enforced on the data network by trying to establish iperf connection between the instances - Such a connection should be blocked:

root@vm2:/home/ubuntu# iperf3 -s -p 5101
-----------------------------------------------------------
Server listening on 5101
-----------------------------------------------------------

root@vm1:/home/ubuntu# iperf3 -c 33.33.33.84 -p 5101

tcp connect failed: Connection timed out

Add to the data security policy a rule allowing iperf TCP ports:

$ openstack security group rule create data_policy --protocol tcp --ingress --dst-port 5001:5200

Verify that iperf connection is now allowed:

root@vm2:/home/ubuntu# iperf3 -s -p 5101
-----------------------------------------------------------
Server listening on 5101
-----------------------------------------------------------

root@vm1:/home/ubuntu# iperf3 -c 33.33.33.84 -p 5101 
Connecting to host 33.33.33.84, port 5101
[  5] local 33.33.33.219 port 47498 connected to 33.33.33.84 port 5101
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.42 GBytes  38.0 Gbits/sec    0   2.20 MBytes       
[  5]   1.00-2.00   sec  4.24 GBytes  36.5 Gbits/sec    0   2.42 MBytes       
[  5]   2.00-3.00   sec  4.47 GBytes  38.4 Gbits/sec    0   2.42 MBytes       
[  5]   3.00-4.00   sec  4.47 GBytes  38.4 Gbits/sec    0   2.54 MBytes       
[  5]   4.00-5.00   sec  4.08 GBytes  35.1 Gbits/sec    0   2.54 MBytes       
[  5]   5.00-6.00   sec  4.66 GBytes  40.0 Gbits/sec    0   2.54 MBytes       
[  5]   6.00-7.00   sec  4.13 GBytes  35.5 Gbits/sec    0   2.54 MBytes       
[  5]   7.00-8.00   sec  4.65 GBytes  39.9 Gbits/sec    0   2.54 MBytes       
[  5]   8.00-9.00   sec  4.66 GBytes  40.1 Gbits/sec    0   2.54 MBytes       
[  5]   9.00-10.00  sec  4.67 GBytes  40.1 Gbits/sec    0   2.54 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  44.5 GBytes  38.2 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  44.5 GBytes  38.0 Gbits/sec                  receiver

Verify that the iperf traffic is offloaded to the hardware by capturing traffic on the physical bond interfaces used for Geneve encapsulation:
```
root@node1:/home/ubuntu# tcpdump -en -i enp63s0f0 vlan 40 | grep 5101

root@node1:/home/ubuntu# tcpdump -en -i enp63s0f1 vlan 40 | grep 5101
```
- Only the first connection packets that are flowing via slow path until the connection is offloaded to the hardware will be seen in the capture.
- Hardware Offload of North-South NAT traffic over the Floating IP can be validated as well using the same method

Use Case Benchmarks

TCP Throughput

The following section is describing an iperf3 TCP throughput benchmark test between two VMs hosted on remote compute nodes with hardware acceleration and the configuration steps required to assure an optimized result over the accelerated bond used topology.

Note

The performance results listed in this document are indicative and should not be considered as formal performance targets for NVIDIA products.

Create 2 VM instances as instructed in previous steps:

Note

The VM image used for this test is based on Ubuntu 22.04 and includes iperf3 and sysstat packages

On both compute nodes hosting the VM instances, verify that CPU pinning was applied and that the VM was allocated with host isolated cores on the same NUMA node as the NIC (2-23 in this case):

#root@node2:/home/ubuntu# virsh list --all
 Id   Name                State
-----------------------------------
 4    instance-00000008   running

root@node2:/home/ubuntu# virsh vcpupin 4
 VCPU   CPU Affinity
----------------------
 0      2
 1      5
 2      11
 3      8
 4      14
 5      17
 6      23
 7      20
 8      4
 9      15
 10     7
 11     10
 12     16
 13     13
 14     19
 15     22
 16     12
 17     3
 18     9
 19     6

root@node2:/home/ubuntu# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 128784 MB
node 0 free: 93679 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 129012 MB
node 1 free: 92576 MB
node distances:
node   0   1 
  0:  10  32 
  1:  32  10
 
root@node2:/home/ubuntu# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-5.15.0-48-generic root=UUID=19167c2d-e067-44a7-9176-2c784af688bc ro default_hugepagesz=1G hugepagesz=1G hugepages=64 intel_iommu=on iommu=pt blacklist=nouveau rd.blacklist=nouveau isolcpus=2-23

In our example, the ConnectX NIC is associated with Numa node 0, which is hosting CPU cores 0-23.

Cores 2-23 were isolated from the hypervisor (grub file), and dedicated to Nova instance usage (see "cpu-dedicated-set" in the cloud deployment bundle file).

On the VM instances, verify that the number of accelerated interface channels (MSIX queues) is identical to the number of vCPUs allocated to the VM (20 in this case):

root@vm1:/home/ubuntu#  ethtool -l ens5
Channel parameters for ens5:
Pre-set maximums:
RX:             n/a
TX:             n/a
Other:          n/a
Combined:       20
Current hardware settings:
RX:             n/a
TX:             n/a
Other:          n/a
Combined:       20

On both VM instances, run the following performance tunning script to set IRQ affinity per vCPU:

Note

Make sure the "P0" variable is configured with the accelerated interface name as appears in the VM.

The script should be executed after every VM reboot.

#!/bin/bash

P0=ens5

#1. Stop services 
systemctl stop irqbalance

#2. Set IRQ affinity
function int2hex
{
        CHUNKS=$(( $1/64 ))
        COREID=$1
        HEX=""
        for (( CHUNK=0; CHUNK<${CHUNKS} ; CHUNK++ ))
        do
                HEX=$HEX"0000000000000000"
                COREID=$((COREID-64))
        done
        printf "%x$HEX" $(echo $((2**$COREID)) )
}

for PF in $P0 
do
        PF_PCI=`ls -l /sys/class/net/$PF/device | tr "/" " " | awk '{print $NF}'`
        IRQ_LIST=`cat /proc/interrupts | grep $PF_PCI | tr ":" " " | awk '{print $1}'`
        CORE=0
        for IRQ in $IRQ_LIST
        do
                affinity=$( int2hex $CORE )
                echo $affinity > /proc/irq/$IRQ/smp_affinity
                CORE=$(((CORE+1)%20))
        done
done

#3. Enable aRFS

echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
ethtool -K $P0 ntuple on
for f in /sys/class/net/$P0/queues/rx-*/rps_flow_cnt; do echo 32768 > $f; done

root@vm1:/home/ubuntu# ./perf_tune.sh

root@vm2:/home/ubuntu# ./perf_tune.sh

On VM2, run the following script to start iperf3 server thread per dedicated vCPU (20 in this case):

Note

Change the number of threads/vCPUs to use per requirement. Make sure to use ports that are allowed by the security policy.
```
#!/bin/bash

for I in {0..19} 
do
( taskset -c $I iperf3 -s -p $((5001+I*2)) > /dev/null  & )
done
```
```
root@vm2:/home/ubuntu# ./iperf3S.sh
```
On VM1, run the following script to start iperf3 client thread per dedicated vCPU (20 in this case), and guarantee optimal traffic distribution between LAG ports:

Note

Set a VM2 IP address for the "IPERF_SERVER" parameter.

Change the number of threads/vCPUs, Duration and Size per requirement.

Make sure to use ports that are allowed by the security policy.

The script requires a package which is installed as part of the perf_tune.sh script, for measuring the average idle CPU during the test.
```
#!/bin/bash
#set -x
DUR=60
IPERF_SERVER="33.33.33.84"
VCPUS=20
SIZE=256K

echo Numer of vCPU iperf3 threads $VCPUS
echo Running a test with size of $SIZE for $DUR sec
echo ""
echo AVG_CPU_IDLE   TOTAL_THROUGHPUT
#echo "Total THROUGHPUT:"
for I in `seq 0 $((VCPUS-1))` ;
do (taskset -c $((I)) iperf3 -c $IPERF_SERVER -p $((5001+I*2)) -i 1 -l $SIZE -t $DUR -f g -Z & ); done | grep sender | awk '{ SUM+=$7 } END { print SUM}' &
CPU_IDLE=$(sar 1 $((DUR)) | grep Average | awk '{print $NF}')
echo -n "$CPU_IDLE        "
wait
```
```
root@vm1:/home/ubuntu# ./iperf3C.sh
Numer of vCPU iperf3 threads 20
Running a test with size of 256K for 60 sec

AVG_CPU_IDLE TOTAL_THROUGHPUT 
83.42        179.45
```
Notes
- The test is demonstrating a throughput of around 180Gbps with a very low CPU usage over a leaf-spine fabric with Geneve encapsulation
- The test above was executed on compute nodes with Intel Xeon 8380 CPU @ 2.30GHz (40-Cores)
- Full traffic hardware offload was verified during the test
- A rate of 170Gbps is reached when running the same test with Security Group policy applied

RDMA (RoCE) Bandwidth and Latency

The following section is describing RDMA bandwidth and latency benchmark tests between two VMs hosted on remote compute nodes with hardware acceleration and the configuration steps required to assure an optimized result over the used topology.

Note

The performance results listed in this document are indicative and should not be considered as formal performance targets for NVIDIA products.

Create 2 VM instances as instructed in previous steps.

Note

The VM image used for this test is based on Ubuntu 22.04, and includes perftest tools - Please refer to perftest for more information
On both compute nodes hosting the VM instances, verify that the QoS configuration described in the "QoS Settings" section above was applied

Create a stateless security group policy. This is required for running RoCE workloads:

$ openstack security group create data_sl_policy --stateless
$ openstack security group rule create data_sl_policy --protocol icmp --ingress
$ openstack security group rule create data_sl_policy --protocol icmp --egress
$ openstack security group rule create data_sl_policy --protocol udp --ingress --dst-port 4000:6000
$ openstack security group rule create data_sl_policy --protocol udp --egress --dst-port 4000:6000

Change the stateful security policy applied on the accelerated VM ports to the newly created stateless security policy:

$ openstack port set direct_overlay1 --no-security-group --disable-port-security
$ openstack port set direct_overlay2 --no-security-group --disable-port-security
$ openstack port set  direct_overlay1 --security-group data_sl_policy --enable-port-security
$ openstack port set  direct_overlay2 --security-group data_sl_policy --enable-port-security

Start the ib_write_bw server on VM2 using the following command:

root@vm2:/home/ubuntu# ib_write_bw -F -q 2048 --tclass=96 --report_gbits  -R

Start the ib_write_bw client on VM1 using the following command to guarantee optimal traffic distribution between LAG ports for a 60 second duration bandwidth test:

root@vm1:/home/ubuntu# ib_write_bw -F -q 2048 --tclass=96 --report_gbits -D 60 33.33.33.84 -R

.
.
.
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:84
 remote address: LID 0000 QPN 0x0b57 PSN 0x58f526
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:84
 remote address: LID 0000 QPN 0x0b58 PSN 0x36f467
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:84
 remote address: LID 0000 QPN 0x0b59 PSN 0xde7a43
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:84
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      10941107         0.00               191.21             0.364705
---------------------------------------------------------------------------------------

Notes

The test is demonstrating an average bandwidth of 191Gbps over a leaf-spine fabric with Geneve encapsulation
The test above was executed on compute nodes with Intel Xeon 8380 CPU @ 2.30GHz (40-Cores)

Now, start the ib_write_lat server on VM2 using the following command:

root@vm2:/home/ubuntu# ib_write_lat -F --tclass=96 --report_gbits  -R

Start the ib_write_lat client on VM1 using the following command for a 60 second duration latency test:

root@vm1:/home/ubuntu# ib_write_lat -F --tclass=96 --report_gbits -D 60 33.33.33.84 -R

---------------------------------------------------------------------------------------
                    RDMA_Write Latency Test
 Dual-port       : OFF          Device         : rocep0s5
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: OFF
 ibv_wr* API     : ON
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 220[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 Waiting for client rdma_cm QP to connect
 Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0b61 PSN 0x14c131
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:84
 remote address: LID 0000 QPN 0x0168 PSN 0xbee28f
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:02
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec] 
 2       1000          3.50           12.40        3.61                4.58             1.28            8.72                    12.40  
---------------------------------------------------------------------------------------

Notes

The test is performed between compute nodes over a leaf-spine fabric with Geneve encapsulation
The test above was executed on compute nodes with Intel Xeon 8380 CPU @ 2.30GHz (40-Cores)

DPDK Frame Rate

The following section is describing DPDK frame rate benchmark test for small frames between two VMs hosted on remote compute nodes with hardware acceleration and the configuration steps required to assure an optimized result over the used topology.

Note

The performance results listed in this document are indicative and should not be considered as formal performance targets for NVIDIA products.

Create 2 VM instances as instructed in previous steps. This time create VM2 with two accelerated ports, as required by the TREX testing tool
Note
- The VM image used for this test is based on Ubuntu 22.04. It includes the following software: DPDK v21.11.2 and TREX traffic generator v2.87. It is configured with 2 x hugepages of 1G size.
- Disable security groups on the accelerated ports.

On the Receiver VM1 (the instance with the single accelerated port), verify that hugepages were allocated, and start the TestPMD application:

Note

Use the PCI ID of the SR-IOV VF inside the VM.

Collect the MAC address of the port from the output of the command below.

# cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       2
HugePages_Free:        2
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:         2097152 kB  

# dpdk-testpmd -c 0x1ff -n 4 -m 1024 -a 00:05.0 -- --burst=64 --txd=1024 --rxd=1024 --mbcache=512 --rxq=4 --txq=4 --nb-cores=4 --rss-udp --forward-mode=5tswap -a -i

On the Transmitter TRex VM2:

Verify that hugepages are allocated as instructed in the previous step

Install the TREX traffic generator

root@vm2:/home/ubuntu# mkdir /root/trex
root@vm2:~/trex# cd /root/trex
root@vm2:~/trex# wget --no-check-certificate https://trex-tgn.cisco.com/trex/release/v2.87.tar.gz
root@vm2:~/trex# tar -xzvf v2.87.tar.gz
root@vm2:~/trex# chmod 777 /root -R
root@vm2:~/trex# ln -s -f /usr/lib/x86_64-linux-gnu/libc.a /usr/lib/x86_64-linux-gnu/liblibc.a

Create the following UDP packet stream configuration file under the /root/trex/<version> directory:

Note

Change the IP src to the IP of the first accelerated port on VM2, and dst to the IP of the accelerated port on VM1.

from trex_stl_lib.api import *
 
class STLS1(object):
 
    def create_stream (self):
        
        pkt = Ether()/IP(src="33.33.33.111",dst="33.33.33.222")/UDP(dport=5999)/(18*'x')
                  
        vm = STLScVmRaw( [
                                STLVmFlowVar(name="v_port",
                                                min_value=4337,
                                                  max_value=5337,
                                                  size=2, op="inc"),
                                STLVmWrFlowVar(fv_name="v_port",
                                            pkt_offset= "UDP.sport" ),
                                STLVmFixChecksumHw(l3_offset="IP",l4_offset="UDP",l4_type=CTRexVmInsFixHwCs.L4_TYPE_UDP),
 
                            ]
                        )
 
        return STLStream(packet = STLPktBuilder(pkt = pkt ,vm = vm ) ,
                                mode = STLTXCont(pps = 8000000) )
 
 
    def get_streams (self, direction = 0, **kwargs):
        # create 1 stream
        return [ self.create_stream() ]
 
 
# dynamic load - used for trex console or simulator
def register():
    return STLS1()

Run the DPDK port setup interactive wizard by following the steps specified below. When requested, use the MAC address of the TestPMD VM1 you collected in previous steps:

root@vm2:~# cd /root/trex/v2.87 
root@vm2:~/trex/v2.87# ./dpdk_setup_ports.py -i
By default, IP based configuration file will be created. Do you want to use MAC based config? (y/N)y
+----+------+---------+-------------------+------------------------------------------+------------+----------+----------+
| ID | NUMA |   PCI   |        MAC        |                   Name                   |   Driver   | Linux IF |  Active  |
+====+======+=========+===================+==========================================+============+==========+==========+
| 0  | -1   | 00:03.0 | fa:16:3e:25:ab:57 | Virtio network device                    | virtio-pci | ens3     | *Active* |
+----+------+---------+-------------------+------------------------------------------+------------+----------+----------+
| 1  | -1   | 00:05.0 | fa:16:3e:44:85:c1 | ConnectX Family mlx5Gen Virtual Function | mlx5_core  | ens5     |          |
+----+------+---------+-------------------+------------------------------------------+------------+----------+----------+
| 2  | -1   | 00:06.0 | fa:16:3e:15:4e:67 | ConnectX Family mlx5Gen Virtual Function | mlx5_core  | ens6     |          |
+----+------+---------+-------------------+------------------------------------------+------------+----------+----------+
Please choose an even number of interfaces from the list above, either by ID, PCI or Linux IF
Stateful will use order of interfaces: Client1 Server1 Client2 Server2 etc. for flows.
Stateless can be in any order.
Enter list of interfaces separated by space (for example: 1 3) : 1 2

For interface 1, assuming loopback to its dual interface 2.
Destination MAC is fa:16:3e:15:4e:67. Change it to MAC of DUT? (y/N).y
Please enter a new destination MAC of interface 1: FA:16:3E:E0:10:06
For interface 2, assuming loopback to its dual interface 1.
Destination MAC is fa:16:3e:44:85:c1. Change it to MAC of DUT? (y/N).y
Please enter a new destination MAC of interface 2: FA:16:3E:E0:10:06
Print preview of generated config? (Y/n)
### Config file generated by dpdk_setup_ports.py ###

- version: 2
  interfaces: ['00:05.0', '00:06.0']
  port_info:
      - dest_mac: fa:16:3e:e0:10:06
        src_mac:  fa:16:3e:44:85:c1
      - dest_mac: fa:16:3e:e0:10:06
        src_mac:  fa:16:3e:15:4e:67

  platform:
      master_thread_id: 0
      latency_thread_id: 1
      dual_if:
        - socket: 0
          threads: [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]


Save the config to file? (Y/n)Y
Default filename is /etc/trex_cfg.yaml
Press ENTER to confirm or enter new file: 
Saved to /etc/trex_cfg.yaml.

Run the TRex application in the background. In this case we used 8 out of 20 allocated cores:
```
root@vm2:~/trex/v2.87# nohup ./t-rex-64 --no-ofed-check -i -c 8 &
```

Run the TRex Console:

root@vm2:~/trex/v2.87# ./trex-console  

Using 'python3' as Python interpeter


Connecting to RPC server on localhost:4501                   [SUCCESS]


Connecting to publisher server on localhost:4500             [SUCCESS]


Acquiring ports [0, 1]:                                      [SUCCESS]


Server Info:

Server version:   v2.87 @ STL
Server mode:      Stateless
Server CPU:       8 x Intel Xeon Processor (Icelake)
Ports count:      2 x 100Gbps @ ConnectX Family mlx5Gen Virtual Function

-=TRex Console v3.0=-

Type 'help' or '?' for supported actions

trex>

Run the TRex Console UI (TUI):
```
trex>tui
```
Start a 30MPPS stream using the stream configuration file created in previous steps:
```
tui>start -f udp_rss.py -m 30mpps -p 0
```

Check the test results:

Global Statistitcs

connection   : localhost, Port 4501                       total_tx_L2  : 15.38 Gbps                     
version      : STL @ v2.87                                total_tx_L1  : 20.18 Gbps                     
cpu_util.    : 26.39% @ 8 cores (8 per dual port)         total_rx     : 14.36 Gbps                     
rx_cpu_util. : 0.0% / 0 pps                               total_pps    : 30.03 Mpps                     
async_util.  : 0.05% / 10.87 Kbps                         drop_rate    : 0 bps                          
total_cps.   : 0 cps                                      queue_full   : 0 pkts

The test above demonstrates a 30Mpps frame rate for small UDP frames DPDK workload with 0 drop rate.
This test was executed on compute nodes with Intel Xeon 8380 CPU @ 2.30GHz (40-Cores)
Frame rate of 20Mpps is reached when running the same test with Security Group policy applied

Accelerated Data Processing (GPU)

GPUDirect RDMA

GPUDirect RDMA provides direct communication between NVIDIA GPUs in remote systems.

It bypasses the system CPUs and eliminates the required buffer copies of data via the system memory, resulting in a significant performance boost.

Image description: GPUDirect RDMA Flow

Use Case Topology

The following topology describes VM instances with hardware accelerated bond interface and A100 GPU located on remote compute nodes, and running GPUDirect RDMA workload over Geneve overlay tenant network.

Image description: GPUDirect RDMA Acceleration Use Case Topology

Use Case Benchmarks

GPUDirect-enabled RDMA Bandwidth

Note

The performance results listed in this document are indicative and should not be considered as formal performance targets for NVIDIA products.

Notes

Performing an optimal GPUDirect RDMA Benchmark test requires a server with PCIe Bridges. The network adapter and GPU used in this test should be located under the same PCIe Bridge device and associated with the same CPU NUMA node.
- The "lspci -tv" command can be used to display the device hierarchy and verify that the adapter/GPU PCI devices are hosted under the same PCIe Bridge.
- "lspci -vvv -s <PCI_Device_ID>" can be used to identify the NUMA node associated with the adapter/GPU PCI devices.
GPUDirect RDMA in a virtual environment requires enablement of ATS (Address Translation Services) on the Network adapter, as well as ACS (Access Control Services) on the PCIe Bridge and server BIOS.
In the servers used for this test, the Network-RDMA device (ConnectX-6Dx) and GPU device (PCIe A100) share NUMA Node 0, and are connected under the same PCIe Bridge device.
For the GPUDirect RDMA benchmark test described in this section, the virtual instance guest OS must include CUDA and MLNX_OFED v5.6, or later.
Some of the configurations applied in this section are not persistent, and therefore, have to be reapplied after a server/instance reboot.
NVIDIA Multi-Instance GPU (MIG) must be disabled for this test.
On AMD-based servers, it is required to set the network adapter ATC registry in order to optimize the GPUDirect RDMA Benchmark results.

Prepare the setup for running a GPUDirect RDMA test over a virtualized environment by applying the following steps on both compute nodes:
- - Delete any existing instance on the compute nodes.
  - Install the mstflint package.
    # apt install mstflint
  - Locate the network adapter PCI ID, and enable ATS in firmware.
    # lspci | grep -i nox 3f:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] 3f:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] # mstconfig -d 3f:00.0 set ATS_ENABLED=true
  - Reboot the compute nodes to apply the new firmware configuration.
  - Stop the server during the boot process in the BIOS menu, and make sure that ACS is enabled.

Image description: BIOS ACS Configuration Example

Once the server is rebooted, verify that the adapter firmware parameters have been applied.

# mstconfig -d 3f:00.0 q | grep "ATS_ENABLED"
  ATS_ENABLED                         True(1)

1. - Enable ACS on the PCIe Bridge device that is hosting the adapter and GPU.
    Note
    In many server architectures, there are multiple chained PCIe Bridge devices serving a bulk of PCIe slots. It may be possible that the adapter and GPU will be connected to different sub devices in this PCIe bridge chain.
    
    The provided command will enable ACS on ALL PCIe Bridge devices in the system.
    
    This step is not persistent, and has to be re-applied every time the server is rebooted while there are no virtual instances running.
    # for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do setpci -v -s ${BDF} ECAP_ACS+0x6.w=0x5D ; done;
  - Verify that the ACS Direct Translation was enabled on the PCIe Bridge device hosting the adapter and GPU.
    # lspci -s <PCIe_Bridge_Device_ID> -vvv | grep ACSCtl ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
  - On servers with AMD CPU, set the network adapter ATC Registry.
    # mstmcra 3f:00.0 0xa1334.16:5 24
  - To verify that it was set:
    # mstmcra 3f:00.0 0xa1334.16:5 0x00000018

Create a new flavor with A100 GPU alias and ratio:

$ openstack flavor create m100.gpu --ram 8192 --disk 20 --vcpus 10
$ openstack flavor set m100.gpu --property "pci_passthrough:alias"="a100-gpu:1" 
$ openstack flavor set m100.gpu --property hw:cpu_policy=dedicated
$ openstack flavor set m100.gpu --property hw:mem_page_size=large

Create 2 VM instances, as instructed in previous steps.
Note
- The same networks and ports created in the previous use-case can be used in this use-case as well.
- Use the newly created flavor.
- Use a stateless security group for this use case.
- The VM image used for this test is based on Ubuntu 22.04, and includes the following software: CUDA v11.7, MLNX_OFED v5.7-1.0.2.0, perftest tool set compiled with CUDA support. The VM image was built using a disk-image-builder tool. The following article can be used as a reference: How-to: Create OpenStack Cloud Image with NVIDIA GPU and Network Drivers.

# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Red Hat, Inc. Virtio GPU (rev 01)
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
00:06.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
00:07.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
00:08.0 Unclassified device [00ff]: Red Hat, Inc. Virtio RNG

On both VMs, load the nvidia-peermem module:

# modprobe nvidia-peermem
# lsmod | grep -i peermem 
nvidia_peermem         16384  0
ib_core               430080  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              40816640  18 nvidia_uvm,nvidia_peermem,nvidia_modeset

On both VMs, disable MIG, enable the GPU device persistence mode, and lock the GPU clock on the maximum allowed speed.
Note
- Apply the following settings only when the bandwidth test result is not satisfactory.
- MIG disable is required for A100 GPUs
- Do NOT set a value higher than allowed per specific GPU device.
  - "nvidia-smi -i <device id> -q -d clock" command can be used to identify the Max Allowed Clock of a device.
  - For the A100 device we used in this test, the Max Allowed Clock is 1410 MHz.
```
# nvidia-smi -i 0 -mig 0
Disabled MIG Mode for GPU 00000000:00:06.0
All done.

# nvidia-smi -i 0 -pm 1
Persistence mode is already Enabled for GPU 00000000:00:06.0.
All done.

# nvidia-smi -i 0 -lgc 1410
GPU clocks set to "(gpuClkMin 1410, gpuClkMax 1410)" for GPU 00000000:00:06.0
All done.
```
Start the GPUDirect RDMA ib_write_bw server on one of the virtual instances:
Note
- ib_write_bw is provided as part of the perftest tool set compiled with CUDA support on the VM image with CUDA and MLNX_OFED
- It is possible to run a network-based test without GPUDirect RDMA by omitting the "use_cuda" flag
```
root@vm2:/home/ubuntu# ib_write_bw -F -q 4096 --tclass=96 --report_gbits -R --use_cuda=0

************************************
* Waiting for client to connect... *
************************************
```

Start the GPUDirect ib_write_bw client on the second instance by specifying the IP of the remote instance and a test packet size:

root@vm1:/home/ubuntu# ib_write_bw -F -q 4096 --tclass=96 --report_gbits  -R -D 30 33.33.33.17  --use_cuda=0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 00:06

Picking device No. 0
[pid = 2471, dev = 0] device name = [NVIDIA A100-PCIE-40GB]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 536870912 bytes GPU buffer
allocated GPU buffer address at 00007f8aa0000000 pointer=0x7f8aa0000000
.
.
.
 remote address: LID 0000 QPN 0x2ccd PSN 0xface62
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:17
 remote address: LID 0000 QPN 0x2cce PSN 0x61e9a4
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:17
 remote address: LID 0000 QPN 0x2ccf PSN 0x7d428
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:17
 remote address: LID 0000 QPN 0x2cd0 PSN 0x3904a8
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:17
 remote address: LID 0000 QPN 0x2cd1 PSN 0x6b878c
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:17
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      5681224          0.00               186.16             0.355076
---------------------------------------------------------------------------------------
deallocating RX GPU buffer 00007f8aa0000000
destroying current CUDA Ctx

The test is demonstrating an average bandwidth of 186Gbps for a packet size of 32KB over a leaf-spine fabric with Geneve encapsulation
The test above was executed on AMD servers with PCIe Gen4 support, that are optimized for GPUDirect RDMA.
Similar result was achieved for an RDMA bandwidth test without GPUDirect on the same servers.

Authors

Itai Levy

Over the past few years, Itai Levy has worked as a Solutions Architect and member of the NVIDIA Networking “Solutions Labs” team. Itai designs and executes cutting-edge solutions around Cloud Computing, Software-Defined Networking, Storage and Security. His main areas of expertise include NVIDIA BlueField Data Processing Unit (DPU) solutions and accelerated K8s/OpenStack platforms.

Last updated: June 30, 2026