Environment
- DOCA Platform Framework Kubernetes cluster
Issue
- What is the recommended way to generate a sosreport in DOCA Platform Framework?
- By default, it may not be possible to connect to DOCA Platform Framework nodes via SSH from outside the cluster, but sosreport may still need to be run for troubleshooting.
Generating a sos report with a debug pod
Target Host Cluster
Create a secret containing the kubeconfig
In order to run sosreport, a kubeconfig is needed to access the API Server.
- Create a secret containing the kubeconfig:
kubectl create secret generic admin-config --from-file=kubeconfig=<path_to_kubeconfig>
Deploy sos-report
- List the nodes in the cluster, then export the name of the node to collect from as NODE_NAME, along with the CASE_ID that the manifest passes to the container:
kubectl get nodes
export NODE_NAME=<node_name>
export CASE_ID=<case_id>
- Then create a debug pod by deploying the following manifest:
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: dpf-sosreport
spec:
  nodeName: ${NODE_NAME}
  containers:
  - name: sosreport
    image: ghcr.io/nvidia/sosreport:latest
    env:
    - name: CASE_ID
      value: "${CASE_ID}"
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
      runAsUser: 0
    volumeMounts:
    - mountPath: /host
      name: host
    - mountPath: /run
      name: run
    - mountPath: /var/log
      name: varlog
    # sosreport checks whether this file exists before running the kubernetes plugin;
    # without it, no kubernetes output will be available.
    - mountPath: /etc/kubernetes/admin.conf
      name: adminconf
      subPath: kubeconfig
    - mountPath: /etc/localtime
      name: localtime
    - mountPath: /etc/machine-id
      name: machineid
    - mountPath: /boot
      name: boot
    - mountPath: /usr/lib/modules/
      name: modules
  volumes:
  - hostPath:
      path: /
    name: host
  - hostPath:
      path: /run
    name: run
  - hostPath:
      path: /boot
    name: boot
  - hostPath:
      path: /usr/lib/modules/
    name: modules
  - hostPath:
      path: /var/log
    name: varlog
  - secret:
      secretName: admin-config
    name: adminconf
  - hostPath:
      path: /etc/localtime
    name: localtime
  - hostPath:
      path: /etc/machine-id
    name: machineid
  restartPolicy: Never
  hostIPC: true
  hostNetwork: true
  hostPID: true
EOF
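Because the heredoc is unquoted, the shell substitutes ${NODE_NAME} and ${CASE_ID} before kubectl receives the manifest. If either variable is unset, the manifest is submitted with empty values, which would likely let the scheduler place the pod on an arbitrary node. The following is a minimal sketch of a guard that can be run before the manifest; `require_vars` is a hypothetical helper name, not part of DOCA Platform Framework or sosreport.

```shell
# Sketch: fail early if a variable used by the manifest is unset or empty.
# require_vars is a hypothetical helper name.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: run this before the heredoc and stop on failure.
# require_vars NODE_NAME CASE_ID || exit 1
```

The function returns non-zero if any listed variable is empty, and reports every missing name rather than stopping at the first one.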
Target Tenant Cluster
Find the tenant cluster kubeconfig
In order to run sosreport, a kubeconfig is needed to access the API Server. When the report is generated for a tenant cluster, that kubeconfig must first be retrieved from the host cluster.
- Get the name of the kubeconfig secret from the dpucluster spec:
export KUBECONFIG_NAME=$(kubectl get dpucluster -n ${NAMESPACE} ${CLUSTER_NAME} -o jsonpath='{.spec.kubeconfig}')
- Create the kubeconfig file from the secret data:
kubectl get secrets -n ${NAMESPACE} ${KUBECONFIG_NAME} -o json \
| jq -r '.data["admin.conf"]' \
| base64 --decode \
> /tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig
- Create a secret containing the kubeconfig in the tenant cluster:
kubectl create secret generic admin-config --from-file=kubeconfig=/tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig \
--kubeconfig=/tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig
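The first two steps above can be combined into one function with basic error checks. This is a sketch only: `fetch_tenant_kubeconfig` is a hypothetical helper name, and it uses kubectl's jsonpath output with an escaped dot to read the `admin.conf` key, so jq is not required. The `.spec.kubeconfig` field and the `admin.conf` key are taken from the commands above.

```shell
# Sketch: fetch a tenant cluster kubeconfig from the host cluster.
# fetch_tenant_kubeconfig is a hypothetical helper name, not part of DPF.
fetch_tenant_kubeconfig() {
  namespace="$1"
  cluster_name="$2"
  out="/tmp/${namespace}-${cluster_name}.kubeconfig"

  # Step 1: name of the secret that holds the tenant kubeconfig.
  secret_name=$(kubectl get dpucluster -n "$namespace" "$cluster_name" \
    -o jsonpath='{.spec.kubeconfig}') || return 1
  if [ -z "$secret_name" ]; then
    echo "error: no kubeconfig secret referenced by dpucluster ${namespace}/${cluster_name}" >&2
    return 1
  fi

  # Step 2: decode the admin.conf key of that secret into a file.
  # The escaped dot selects the literal key "admin.conf".
  kubectl get secret -n "$namespace" "$secret_name" \
    -o jsonpath='{.data.admin\.conf}' | base64 --decode > "$out" || return 1

  # Refuse to continue with an empty file (for example if kubectl failed).
  if [ ! -s "$out" ]; then
    echo "error: empty kubeconfig written to $out" >&2
    return 1
  fi
  echo "$out"
}

# Usage:
# fetch_tenant_kubeconfig "${NAMESPACE}" "${CLUSTER_NAME}"
```

On success the function prints the path of the written file, which matches the /tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig path used in the rest of this section.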
Deploy sos-report
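- List the nodes in the tenant cluster, then export the name of the node to collect from as NODE_NAME, along with the CASE_ID that the manifest passes to the container:
kubectl --kubeconfig=/tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig get nodes
export NODE_NAME=<node_name>
export CASE_ID=<case_id>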
- Then create a debug pod by deploying the following manifest:
cat <<EOF | kubectl --kubeconfig=/tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig create -f -
apiVersion: v1
kind: Pod
metadata:
  name: dpf-sosreport
spec:
  nodeName: ${NODE_NAME}
  containers:
  - name: sosreport
    image: ghcr.io/nvidia/sosreport:latest
    env:
    - name: CASE_ID
      value: "${CASE_ID}"
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
      runAsUser: 0
    volumeMounts:
    - mountPath: /host
      name: host
    - mountPath: /run
      name: run
    - mountPath: /var/log
      name: varlog
    # sosreport checks whether this file exists before running the kubernetes plugin;
    # without it, no kubernetes output will be available.
    - mountPath: /etc/kubernetes/admin.conf
      name: adminconf
      subPath: kubeconfig
    - mountPath: /etc/localtime
      name: localtime
    - mountPath: /etc/machine-id
      name: machineid
    - mountPath: /boot
      name: boot
    - mountPath: /usr/lib/modules/
      name: modules
  volumes:
  - hostPath:
      path: /
    name: host
  - hostPath:
      path: /run
    name: run
  - hostPath:
      path: /boot
    name: boot
  - hostPath:
      path: /usr/lib/modules/
    name: modules
  - hostPath:
      path: /var/log
    name: varlog
  - secret:
      secretName: admin-config
    name: adminconf
  - hostPath:
      path: /etc/localtime
    name: localtime
  - hostPath:
      path: /etc/machine-id
    name: machineid
  restartPolicy: Never
  hostIPC: true
  hostNetwork: true
  hostPID: true
EOF
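Once the debug pod has been created, on either the host or the tenant cluster, it can help to wait for it to finish before looking for the archive. The sketch below assumes that the container exits successfully when sosreport completes, so that the pod reaches the Succeeded phase, and that kubectl is version 1.23 or later (needed for `wait --for=jsonpath`). `wait_for_sosreport` is a hypothetical helper name.

```shell
# Sketch: wait for the dpf-sosreport pod to finish, then show its logs.
# wait_for_sosreport is a hypothetical helper. Pass the tenant kubeconfig
# path as $1 for a tenant cluster, or an empty string for the host cluster.
wait_for_sosreport() {
  kubeconfig="${1:-}"
  timeout="${2:-30m}"
  if [ -n "$kubeconfig" ]; then
    set -- "--kubeconfig=$kubeconfig"
  else
    set --
  fi
  # Assumption: the container exits 0 when sosreport completes, so the pod
  # phase becomes Succeeded (restartPolicy is Never in the manifest).
  kubectl "$@" wait --for=jsonpath='{.status.phase}'=Succeeded \
    pod/dpf-sosreport --timeout="$timeout" || {
      # On timeout or failure, show the pod state to aid troubleshooting.
      kubectl "$@" describe pod dpf-sosreport
      return 1
    }
  kubectl "$@" logs dpf-sosreport
}

# Usage (host cluster):   wait_for_sosreport
# Usage (tenant cluster): wait_for_sosreport "/tmp/${NAMESPACE}-${CLUSTER_NAME}.kubeconfig"
```

The default timeout of 30 minutes is an arbitrary choice for this sketch; adjust it to the size of the node being collected.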
Retrieve the generated report
The final report archive is available under /tmp on the node's filesystem.
To extract it, run:
tar -x --xz -f sosreport-<node_name>-<case_id>-<date>-xxx.tar.xz
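If the archive was copied over a slow or unreliable link, it may be worth checking it before extraction. The following sketch lists the archive first, which reads it end to end, and then extracts it into a chosen directory. `extract_sosreport` is a hypothetical helper name, and the destination directory is an illustrative addition.

```shell
# Sketch: validate, then extract, a sosreport archive into a directory.
# extract_sosreport is a hypothetical helper name.
extract_sosreport() {
  archive="$1"
  dest="${2:-.}"
  if [ ! -f "$archive" ]; then
    echo "error: archive not found: $archive" >&2
    return 1
  fi
  mkdir -p "$dest" || return 1
  # -t lists the archive without writing files, so a truncated or
  # corrupted archive is detected before anything is extracted.
  tar -t --xz -f "$archive" > /dev/null || return 1
  tar -x --xz -f "$archive" -C "$dest"
}

# Usage:
# extract_sosreport sosreport-<node_name>-<case_id>-<date>-xxx.tar.xz ./report
```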