InfiniBand Cluster Bring-up Procedure

Creating a Point-to-Point Excel File

The Point-to-Point (PTP) Excel file centralizes all the physical information of the project and explicitly describes how to connect each cable. For the list of supported cables, see the NVIDIA LinkX Cables and Transceivers page.

To create the Excel file:

  1. Open an Excel file. A template file is available for download: https://content.mellanox.com/PTP Template/ptp-example.xls

  2. Create two sheets, as described below:

    • Legend – describes basic properties for each element of the cluster. Each element should include the following properties: 

      • Name – the naming convention for the element. Best practice is to use the element's base name with * before and after it (for example, *dgx*).

      • Model – element model

        The “Model” is the “device format” as defined under “/usr/share/ibdm2.1.1/ibnl”. If the model in use is not in the supported list, create a new one as described in:

        https://linux.die.net/man/1/ibdm-topo-file

        https://linux.die.net/man/1/ibdm-ibnl-file


      • Switch/HCA – whether the element is a switch or an HCA

      • Speed – element speed

      • Comments – general comments

        NDR Example:

        Name    Model     Switch/HCA   Speed     Comments
        *dgx*   HCA_12    hca          4x-100G   NDR
        *clf*   MQM9700   switch       4x-100G   NDR
        *csp*   MQM9700   switch       4x-100G   NDR

        XDR Example:

        Name    Model      Switch/HCA   Speed     Comments
        *dgx*   HCA_12     hca          4x-200G   XDR
        *clf*   Q3400-RA   switch       4x-200G   XDR
        *csp*   Q3400-RA   switch       4x-200G   XDR


    • PTP – explicitly describes how to connect each cable. The table has two main parts, Source and Destination, which contain mostly the same columns. Each line should include the following for each end of the cable:

      • Rack – device rack

      • U – device location in the rack

      • Name – name of the device (must comply with the naming convention specified for the device type in the Legend sheet)

      • HCA/port – HCA name and port (in the Destination part, port only)

        Example:

        Source                                     Destination
        Rack       U    Name           HCA/port    Rack             U    Name           Port
        SU1-1 A2   23   cl02s01dgx01   1           Leaves SU1 A38   25   cl02s01clf01   1
        SU1-1 A2   23   cl02s01dgx01   2           Leaves SU1 A38   27   cl02s01clf02   1
        SU1-1 A2   23   cl02s01dgx01   3           Leaves SU1 A38   29   cl02s01clf03   1
        SU1-1 A2   23   cl02s01dgx01   4           Leaves SU1 A38   31   cl02s01clf04   1
        SU1-1 A2   23   cl02s01dgx01   5           Leaves SU1 A38   33   cl02s01clf05   1
        SU1-1 A2   23   cl02s01dgx01   6           Leaves SU1 A38   35   cl02s01clf06   1
        SU1-1 A2   23   cl02s01dgx01   7           Leaves SU1 A38   37   cl02s01clf07   1
        SU1-1 A2   23   cl02s01dgx01   8           Leaves SU1 A38   39   cl02s01clf08   1
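For larger clusters, PTP rows are often generated rather than typed by hand. The following Python sketch reproduces the pattern of the example rows (HCA ports 1 to 8 of one node, each cabled to the same port on a different leaf switch) and emits CSV text that can be imported into the PTP sheet. The function names, the column keys, and the assumption that leaves sit two rack units apart are illustrative only and are not part of the template.

```python
import csv
import io

# Hypothetical column keys; the real sheet uses Source/Destination headers.
COLUMNS = ["src_rack", "src_u", "src_name", "src_hca_port",
           "dst_rack", "dst_u", "dst_name", "dst_port"]


def rail_rows(node, src_rack, src_u, leaf_prefix, dst_rack,
              first_leaf_u, leaf_port=1, n_rails=8):
    """Return one PTP row per HCA port: HCA port i of `node` goes to leaf i."""
    rows = []
    for rail in range(1, n_rails + 1):
        rows.append({
            "src_rack": src_rack,
            "src_u": src_u,
            "src_name": node,
            "src_hca_port": rail,
            "dst_rack": dst_rack,
            # Assumption: leaves are stacked two rack units apart.
            "dst_u": first_leaf_u + 2 * (rail - 1),
            "dst_name": f"{leaf_prefix}{rail:02d}",
            "dst_port": leaf_port,
        })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = rail_rows("cl02s01dgx01", "SU1-1 A2", 23,
                 "cl02s01clf", "Leaves SU1 A38", 25)
print(to_csv(rows))
```

A second node would typically reuse the same call with a different node name and leaf_port value, but that depends on the cabling plan of the specific cluster.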

NOTES:

  • The destination device should always be a switch (HCAs should always be specified in the Source part).

  • For switches, use the real (physical) port numbers.

  • HCA ports can be named or enumerated as you wish, but you must verify that there is a proper mapping from the HCA port enumeration to the real HCA interface name (this mapping is referred to on the next step's page).
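The HCA port mapping mentioned in the last note can be checked mechanically once it is written down. The Python sketch below assumes a hypothetical mapping from the HCA port numbers used in the PTP sheet to interface names; the mlx5_N values are placeholders and must be replaced with the names actually reported on the host. It lists PTP ports that have no mapping and interface names that are assigned to more than one port.

```python
from collections import Counter

# Hypothetical mapping: PTP HCA port number -> interface name on the host.
# The mlx5_* values are placeholders, not taken from a real system.
# The mapping is left incomplete on purpose to show what the check reports.
HCA_PORT_TO_IFACE = {1: "mlx5_0", 2: "mlx5_1", 3: "mlx5_2", 4: "mlx5_3"}


def check_mapping(ptp_ports, mapping):
    """Return (unmapped ports, interface names used by more than one port)."""
    unmapped = sorted(p for p in set(ptp_ports) if p not in mapping)
    counts = Counter(mapping.values())
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return unmapped, duplicated


# HCA ports 1..8 are the ones used in the PTP example.
unmapped, duplicated = check_mapping(range(1, 9), HCA_PORT_TO_IFACE)
print("unmapped HCA ports:", unmapped)
print("duplicated interfaces:", duplicated)
```

Both lists should be empty before the PTP file is used in the next step.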


In the provided examples, the element name *dgx* denotes the device with the identifier cl02s01dgx01.
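The relationship between Legend names and PTP device names can be sketched in Python as follows. The sketch assumes that * in a Legend name behaves like a shell-style wildcard, which is consistent with the example above but is an interpretation rather than a documented rule. It looks up the Legend row for a device name and uses the result to check that the destination of a PTP row is a switch. The function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Legend rows from the NDR example: (name pattern, model, element type).
LEGEND = [
    ("*dgx*", "HCA_12", "hca"),
    ("*clf*", "MQM9700", "switch"),
    ("*csp*", "MQM9700", "switch"),
]


def legend_entry(device_name):
    """Return the single Legend row whose pattern matches, else raise ValueError."""
    hits = [row for row in LEGEND if fnmatchcase(device_name, row[0])]
    if len(hits) != 1:
        raise ValueError(f"{device_name!r} matches {len(hits)} Legend patterns")
    return hits[0]


def check_link(src_name, dst_name):
    """Return a list of problems for one PTP row (empty when the row is fine)."""
    legend_entry(src_name)  # raises if the source name fits no Legend pattern
    problems = []
    if legend_entry(dst_name)[2] != "switch":
        problems.append(f"destination {dst_name} is not a switch")
    return problems


print(legend_entry("cl02s01dgx01"))
print(check_link("cl02s01dgx01", "cl02s01clf01"))
print(check_link("cl02s01clf01", "cl02s01dgx01"))
```

Requiring exactly one matching pattern also catches names that are ambiguous between two element types, which is one reason to keep the base names distinct.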


Make sure to use clear and meaningful names that describe each element, its role, and its location in both the topology and the cluster.

