DOCA Platform Framework

Reporting Issues with sosreport

Environment

  • DOCA Platform Framework Kubernetes cluster

Issue

  • What is the recommended way to generate a sosreport in DOCA Platform Framework?

  • By default, it may not be possible to connect to DOCA Platform Framework nodes via SSH from outside the cluster, but sosreport may still need to be run for troubleshooting purposes.

Generating a sosreport with a debug pod

Target Host Cluster

Create a secret containing the kubeconfig

In order to run sosreport, a kubeconfig is needed to access the API Server.

  1. Create a secret containing the kubeconfig

kubectl create secret generic admin-config --from-file=kubeconfig=<path_to_kubeconfig>

Deploy sos-report

  1. Display the list of nodes in the cluster and export the name of the selected node as NODE_NAME. The following command displays the list of nodes:

kubectl get nodes

  2. Then create a debug pod by deploying the following manifest:

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: dpf-sosreport
spec:
  nodeName: ${NODE_NAME}
  containers:
  - name: sosreport
    image: ghcr.io/nvidia/sosreport:latest
    env:
    - name: CASE_ID
      value: "${CASE_ID}"
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
      runAsUser: 0
    volumeMounts:
      - mountPath: /host
        name: host
      - mountPath: /run
        name: run
      - mountPath: /var/log
        name: varlog
        # sosreport checks whether this file exists before executing the kubernetes plugin;
        # without it, no kubernetes output will be available.
      - mountPath: /etc/kubernetes/admin.conf
        name: adminconf
        subPath: kubeconfig
      - mountPath: /etc/localtime
        name: localtime
      - mountPath: /etc/machine-id
        name: machineid
      - mountPath: /boot
        name: boot
      - mountPath: /usr/lib/modules/
        name: modules
  volumes:
    - hostPath:
        path: /
      name: host
    - hostPath:
        path: /run
      name: run
    - hostPath:
        path: /boot
      name: boot
    - hostPath:
        path: /usr/lib/modules/
      name: modules
    - hostPath:
        path: /var/log
      name: varlog
    - secret:
        secretName: admin-config
      name: adminconf
    - hostPath:
        path: /etc/localtime
      name: localtime
    - hostPath:
        path: /etc/machine-id
      name: machineid
  restartPolicy: Never
  hostIPC: true
  hostNetwork: true
  hostPID: true
EOF
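
Because the heredoc delimiter above is unquoted, the shell substitutes ${NODE_NAME} and ${CASE_ID} into the manifest before kubectl receives it, so both variables need to be set first. The following is a minimal sketch of a guard, assuming a POSIX-compatible shell. The helper name require_vars is hypothetical, and the values shown are placeholders to be replaced:

```shell
# Hypothetical helper: fail early if any named variable is unset or empty.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    # Read the variable whose name is stored in $name (portable indirection).
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Placeholders: use a name from `kubectl get nodes` and your own case identifier.
export NODE_NAME="<node_name>"
export CASE_ID="<case_id>"

require_vars NODE_NAME CASE_ID && echo "variables set; the manifest can be created"
```

Running the guard before the manifest command avoids creating a pod with an empty nodeName or an empty CASE_ID value.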

Target Tenant Cluster

Find the tenant cluster kubeconfig

In order to run sosreport, a kubeconfig is needed to access the API Server. When the report has to be generated for a tenant cluster, the kubeconfig must be retrieved from the host cluster.

  1. Get the name of the kubeconfig secret from the dpucluster spec.

export KUBECONFIG_NAME=$(kubectl get dpucluster -n ${NAMESPACE} ${CLUSTER_NAME} -o jsonpath='{.spec.kubeconfig}')

  2. Create the kubeconfig file from the secret data

kubectl get secrets -n ${NAMESPACE} ${KUBECONFIG_NAME} -o json \
  | jq -r '.data["admin.conf"]' \
  | base64 --decode \
  > /tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig

  3. Create a secret containing the kubeconfig in the tenant cluster

kubectl create secret generic admin-config --from-file=kubeconfig=/tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig \
  --kubeconfig=/tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig
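
If the secret has no admin.conf key, jq -r prints null and the decoded file will not be a usable kubeconfig. The kubeconfig file extracted above can be sanity-checked before it is relied on. The sketch below is only a heuristic: check_kubeconfig is a hypothetical helper, and looking for a top-level clusters: key assumes the usual YAML kubeconfig layout:

```shell
# Hypothetical helper: succeed only if the file is non-empty and looks like a kubeconfig.
check_kubeconfig() {
  local file="$1"
  if [ ! -s "${file}" ]; then
    echo "kubeconfig is missing or empty: ${file}" >&2
    return 1
  fi
  # Heuristic (assumption): a YAML kubeconfig normally has a top-level "clusters:" key.
  if ! grep -q '^clusters:' "${file}"; then
    echo "file does not look like a kubeconfig: ${file}" >&2
    return 1
  fi
}
```

For example, check_kubeconfig "/tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig" returns a non-zero status and prints a message when the file is empty or was decoded from a missing key.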

Deploy sos-report

  1. Display the list of nodes in the tenant cluster and export the name of the selected node as NODE_NAME. The following command displays the list of nodes:

kubectl --kubeconfig=/tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig get nodes

  2. Then create a debug pod by deploying the following manifest:

cat <<EOF | kubectl --kubeconfig=/tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig create -f -
apiVersion: v1
kind: Pod
metadata:
  name: dpf-sosreport
spec:
  nodeName: ${NODE_NAME}
  containers:
  - name: sosreport
    image: ghcr.io/nvidia/sosreport:latest
    env:
    - name: CASE_ID
      value: "${CASE_ID}"
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
      runAsUser: 0
    volumeMounts:
      - mountPath: /host
        name: host
      - mountPath: /run
        name: run
      - mountPath: /var/log
        name: varlog
        # sosreport checks whether this file exists before executing the kubernetes plugin;
        # without it, no kubernetes output will be available.
      - mountPath: /etc/kubernetes/admin.conf
        name: adminconf
        subPath: kubeconfig
      - mountPath: /etc/localtime
        name: localtime
      - mountPath: /etc/machine-id
        name: machineid
      - mountPath: /boot
        name: boot
      - mountPath: /usr/lib/modules/
        name: modules
  volumes:
    - hostPath:
        path: /
      name: host
    - hostPath:
        path: /run
      name: run
    - hostPath:
        path: /boot
      name: boot
    - hostPath:
        path: /usr/lib/modules/
      name: modules
    - hostPath:
        path: /var/log
      name: varlog
    - secret:
        secretName: admin-config
      name: adminconf
    - hostPath:
        path: /etc/localtime
      name: localtime
    - hostPath:
        path: /etc/machine-id
      name: machineid
  restartPolicy: Never
  hostIPC: true
  hostNetwork: true
  hostPID: true
EOF
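
Once the pod is created, the report is only complete when the container has finished. The following is a minimal sketch for blocking until then. It assumes the sosreport container exits on its own, so that the pod reaches the Succeeded phase. The helper name wait_for_sosreport is hypothetical, the 30m timeout is an arbitrary choice, and kubectl wait --for=jsonpath requires a reasonably recent kubectl:

```shell
# Hypothetical helper: block until the dpf-sosreport pod has completed.
# $1 (optional): kubeconfig path for a tenant cluster; omit for the host cluster.
wait_for_sosreport() {
  local kubeconfig="${1:-}"
  if [ -n "${kubeconfig}" ]; then
    kubectl --kubeconfig="${kubeconfig}" wait pod/dpf-sosreport \
      --for=jsonpath='{.status.phase}'=Succeeded --timeout=30m
  else
    kubectl wait pod/dpf-sosreport \
      --for=jsonpath='{.status.phase}'=Succeeded --timeout=30m
  fi
}
```

For a tenant cluster this would be called as wait_for_sosreport "/tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig". If the wait times out, the pod logs (kubectl logs dpf-sosreport) are the first place to look.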

Retrieve the generated report

The final report archive is available under /tmp in the node filesystem.

To extract it, run:

tar -x --xz -f sosreport-<node_name>-<case_id>-<date>-xxx.tar.xz
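
When several reports have accumulated, it can help to pick the most recent archive and extract it into its own directory. The sketch below assumes the commands run where the archive is visible (for example on the node itself, under /tmp). Both helper names are hypothetical, and the default destination directory is an arbitrary choice:

```shell
# Hypothetical helper: print the most recently modified sosreport archive in a directory.
latest_sosreport() {
  local dir="${1:-/tmp}"
  # ls -t sorts by modification time, newest first; prints nothing if no archive matches.
  ls -t "${dir}"/sosreport-*.tar.xz 2>/dev/null | head -n 1
}

# Hypothetical helper: extract an archive into a dedicated directory.
extract_sosreport() {
  local archive="$1"
  local dest="${2:-./sosreport-extracted}"
  mkdir -p "${dest}"
  tar -x --xz -f "${archive}" -C "${dest}"
}
```

A typical call would be extract_sosreport "$(latest_sosreport /tmp)", which keeps the extracted tree out of the current directory.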
