Overview
The DPF Operator requires several prerequisite components to function properly in a Kubernetes environment. This document provides comprehensive guidance on the Helm chart dependencies and their configuration values needed for a successful DPF Operator deployment.
Important Note
Starting with DPF v25.7, all Helm dependencies have been removed from the DPF chart. This means that all dependencies must be installed manually before installing the DPF chart itself.
Prerequisites Overview
The following table lists all required, conditional, and optional Helm chart dependencies with their specific versions and purposes:
| Helm Chart | Version | Description | Required | Post/Pre-installation |
|---|---|---|---|---|
| cert-manager | v1.19.3 | Certificate management for Kubernetes, provides automatic TLS certificate issuance and renewal | Yes | Pre-installation |
| argo-cd | 9.4.1 | GitOps continuous delivery tool for Kubernetes, necessary for DPUService integration | Yes | Pre-installation |
| node-feature-discovery | 0.18.3 | Discovers and advertises hardware features and capabilities of DPUs in the cluster | Yes | Pre-installation |
| maintenance-operator | 0.3.0 | Manages node maintenance operations and ensures graceful handling of node updates | Yes | Pre-installation |
| kamaji | 1.2.0 | Kubernetes cluster management platform for creating and managing the DPU Kubernetes clusters | Conditional | Pre-installation |
| local-path-provisioner | 0.0.34 | Provides the local-path storage class used by the default Kamaji etcd configuration | Conditional | Pre-installation |
| kube-state-metrics | 5.25.1 | Exposes DPF Operator related objects as metrics | No | Post-installation |
| kube-prometheus-stack | 80.4.1 | Complete monitoring stack with Prometheus and Grafana for collecting and visualizing metrics | No | Post-installation |
| loki | 6.53.0 | Kubernetes log aggregation and storage, integrates with Grafana | No | Post-installation |
| opentelemetry-collector | 0.146.0 | Collects and exports metrics, logs, and traces to observability backends | No | Post-installation |
Conditional means the component is required for the default installation described in the user guides, but can be replaced in custom deployments.
Some components can only be installed after the DPF Operator itself is installed.
This applies to kube-state-metrics and kube-prometheus-stack (Grafana dashboards), which rely on ConfigMaps created by the DPF Operator for their configuration.
See Running Argo CD in a separate namespace for the configuration required to use Argo CD running in a different namespace.
Running Argo CD in a separate namespace
DPF supports running Argo CD in a namespace other than dpf-operator-system. When Argo CD is installed outside dpf-operator-system, ensure that dpf-operator-system is included in the Argo CD Helm value configs.params.application.namespaces (or an equivalent configuration) so Argo CD reconciles Applications in dpf-operator-system. Also set spec.overrides.argoCDNamespace in the DPFOperatorConfig to the namespace where Argo CD is installed. See the DPFOperatorConfig guide for an example.
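As a rough sketch of the two settings described above, assuming Argo CD is installed in a namespace called argocd (a hypothetical name), the fragments might look like the following. The apiVersion, metadata.name, and metadata.namespace of the DPFOperatorConfig are assumptions made for illustration and should be checked against the DPFOperatorConfig guide.

```yaml
# Argo CD Helm values (fragment): let Argo CD reconcile Applications
# that live in dpf-operator-system.
configs:
  params:
    application.namespaces: dpf-operator-system
---
# DPFOperatorConfig (fragment): tell DPF where Argo CD is installed.
# Assumptions: apiVersion, metadata.name, and metadata.namespace.
# "argocd" is a hypothetical namespace name.
apiVersion: operator.dpu.nvidia.com/v1alpha1
kind: DPFOperatorConfig
metadata:
  name: dpfoperatorconfig
  namespace: dpf-operator-system
spec:
  overrides:
    argoCDNamespace: argocd
```

The first fragment belongs in the values file passed to the Argo CD chart; the second is a Kubernetes manifest, so the two are kept as separate YAML documents.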
Installation Options
Option 1: Using Helmfile
We provide a working helmfile configuration that can be used to install all dependencies with the correct values.
The helmfiles are located at deploy/helmfiles/ in the DPF repository.
This approach ensures consistent deployment across different environments and simplifies the installation process.
The default Helmfile installs both kamaji and local-path-provisioner. Apply the required changes to files in deploy/helmfiles/ if you do not want to install these components, or if you want to replace the default local-path storage class with another storage class for Kamaji etcd.
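For the storage class change, one possible override is sketched below. It assumes the Kamaji values keep the kamaji-etcd.persistentVolumeClaim.storageClassName key that appears in the kamaji values in this document; my-storage-class is a placeholder, not a name defined by DPF.

```yaml
# Kamaji Helm values (fragment)
kamaji-etcd:
  persistentVolumeClaim:
    # Placeholder: replace with a storage class that exists in your cluster.
    storageClassName: my-storage-class
```

The kube-prometheus-stack and loki values in this document also reference local-path for their persistent volumes, so they may need a matching change if local-path-provisioner is not installed.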
Option 2: Manual Installation
If you prefer to install dependencies manually, you can use the individual Helm chart values provided in the sections below.
Required Configuration Values
The following section provides the specific Helm chart values that must be configured before installing each dependency. These configurations ensure proper integration with the DPF Operator and optimal performance in your environment.
Helm Chart Values
cert-manager
startupapicheck:
  enabled: false
crds:
  enabled: true
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/master
              operator: Exists
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
tolerations:
  - operator: Exists
    effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
  - operator: Exists
    effect: NoSchedule
    key: node-role.kubernetes.io/master
cainjector:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
  tolerations:
    - operator: Exists
      effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
    - operator: Exists
      effect: NoSchedule
      key: node-role.kubernetes.io/master
webhook:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
  tolerations:
    - operator: Exists
      effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
    - operator: Exists
      effect: NoSchedule
      key: node-role.kubernetes.io/master
argo-cd
## Disable the ApplicationSet controller.
applicationSet:
  replicas: 0
dex:
  enabled: false
notifications:
  enabled: false
global:
  podLabels:
    ovn.dpu.nvidia.com/skip-injection: ""
  affinity:
    nodeAffinity:
      # -- Default node affinity rules. Either: `none`, `soft` or `hard`
      type: hard
      # -- Default match expressions for node affinity
      matchExpressions:
        - key: "node-role.kubernetes.io/control-plane"
          operator: Exists
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
redis:
  image:
    repository: mirror.gcr.io/redis
configs:
  params:
    # Argo CD can be deployed to a different namespace.
    # Setting namespaces to dpf-operator-system ensures Argo CD reconciles applications in that namespace.
    application.namespaces: dpf-operator-system
node-feature-discovery
# Node Feature Discovery configuration
master:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "node-role.kubernetes.io/master"
                operator: Exists
          - matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: Exists
  tolerations:
    # Note: beginning with v0.18.3 the master toleration was dropped from the chart's default values.yaml.
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
worker:
  enable: true
  hostNetwork: true
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
  config:
    sources:
      pci:
        deviceClassWhitelist:
          - "0200"
        deviceLabelFields:
          - "class"
          - "vendor"
          - "device"
gc:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "node-role.kubernetes.io/master"
                operator: Exists
          - matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: Exists
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
maintenance-operator
# Maintenance Operator Chart configuration
operatorConfig:
  deploy: true
  maxParallelOperations: 60%
operator:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "node-role.kubernetes.io/master"
                operator: Exists
          - matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: Exists
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
kamaji
# Kamaji configuration
# Number of Kamaji controller replicas for High Availability
replicas: 2
resources: null
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "node-role.kubernetes.io/master"
              operator: Exists
        - matchExpressions:
            - key: "node-role.kubernetes.io/control-plane"
              operator: Exists
tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
kamaji-etcd:
  persistentVolumeClaim:
    storageClassName: local-path
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "node-role.kubernetes.io/master"
                operator: Exists
          - matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: Exists
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  jobs:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "node-role.kubernetes.io/master"
                  operator: Exists
            - matchExpressions:
                - key: "node-role.kubernetes.io/control-plane"
                  operator: Exists
    tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
  datastore:
    enabled: true
    annotations:
      helm.sh/resource-policy: keep
    name: default
image:
  repository: ghcr.io/nvidia/kamaji
  tag: v1.34.0-25.9.3
  pullPolicy: Always
cfssl:
  image:
    tag: v1.6.5
local-path-provisioner
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "node-role.kubernetes.io/master"
              operator: Exists
        - matchExpressions:
            - key: "node-role.kubernetes.io/control-plane"
              operator: Exists
tolerations:
  - operator: Exists
    effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
  - operator: Exists
    effect: NoSchedule
    key: node-role.kubernetes.io/master
kube-state-metrics
# Kube State Metrics configuration
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "node-role.kubernetes.io/master"
              operator: Exists
        - matchExpressions:
            - key: "node-role.kubernetes.io/control-plane"
              operator: Exists
tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
extraArgs:
  - --custom-resource-state-config-file=/etc/customresourcestate/config.yaml
  - --metric-labels-allowlist=pods=[svc.dpu.nvidia.com/service],daemonsets=[svc.dpu.nvidia.com/service],deployments=[svc.dpu.nvidia.com/service]
volumes:
  - configMap:
      defaultMode: 420
      name: dpf-operator-customresourcestate-config
    name: customresourcestate-config
volumeMounts:
  - mountPath: /etc/customresourcestate
    name: customresourcestate-config
    readOnly: true
prometheus:
  monitor:
    enabled: true
    http:
      honorLabels: true
rbac:
  extraRules:
    - apiGroups:
        - svc.dpu.nvidia.com
        - operator.dpu.nvidia.com
        - provisioning.dpu.nvidia.com
        - storage.dpu.nvidia.com
        - vpc.dpu.nvidia.com
      resources:
        - '*'
      verbs: ["list", "watch"]
    - apiGroups: ["apiextensions.k8s.io"]
      resources: ["customresourcedefinitions"]
      verbs: ["list", "watch"]
kube-prometheus-stack
# kube-prometheus-stack configuration
#
# This configuration replaces the separate prometheus and grafana helm releases
# with a unified kube-prometheus-stack release that includes:
# - Prometheus Operator
# - Prometheus
# - Grafana
#
# Key features:
# - Grafana automatically discovers dashboards from ConfigMaps with label grafana_dashboard: "1"
# - The dpf-operator chart creates ConfigMaps with these labels for its dashboards
# - Prometheus datasource is automatically configured with uid: prometheus (matching dashboard expectations)
# - Both Prometheus and Grafana are scheduled on control-plane nodes with appropriate tolerations
#
# Note: kube-state-metrics is deployed separately and should be installed independently
kubeStateMetrics:
  enabled: false
nodeExporter:
  enabled: false
alertmanager:
  enabled: false
crds:
  enabled: true
  upgradeJob:
    enabled: false
    # If enabled, schedule CRD upgrade job on control-plane nodes
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "node-role.kubernetes.io/master"
                  operator: Exists
            - matchExpressions:
                - key: "node-role.kubernetes.io/control-plane"
                  operator: Exists
    tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
# Add cluster label to all built-in ServiceMonitors for management cluster
# These relabelings distinguish management cluster metrics from Kamaji tenant cluster metrics
coreDns:
  serviceMonitor:
    relabelings:
      - action: replace
        targetLabel: cluster
        replacement: management
kubeProxy:
  serviceMonitor:
    relabelings:
      - action: replace
        targetLabel: cluster
        replacement: management
kubeEtcd:
  serviceMonitor:
    relabelings:
      - action: replace
        targetLabel: cluster
        replacement: management
kubeApiServer:
  serviceMonitor:
    relabelings:
      - action: replace
        targetLabel: cluster
        replacement: management
kubeControllerManager:
  serviceMonitor:
    relabelings:
      - action: replace
        targetLabel: cluster
        replacement: management
kubeScheduler:
  serviceMonitor:
    relabelings:
      - action: replace
        targetLabel: cluster
        replacement: management
kubelet:
  serviceMonitor:
    relabelings:
      - action: replace
        targetLabel: cluster
        replacement: management
# Prometheus configuration
prometheus:
  prometheusSpec:
    # Add cluster label via external labels
    # External labels are attached only when Prometheus talks to external systems
    # (remote write, federation, Alertmanager); they are not visible in local queries
    externalLabels:
      cluster: management
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "node-role.kubernetes.io/master"
                  operator: Exists
            - matchExpressions:
                - key: "node-role.kubernetes.io/control-plane"
                  operator: Exists
    tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
    # Persistent volume configuration
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: local-path
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 8Gi
    # Service account with permissions to scrape metrics
    serviceAccountName: kube-prometheus-stack-prometheus
    # Additional scrape configs for DPF Operator metrics
    additionalScrapeConfigs:
      - job_name: 'doca-platform-framework'
        scrape_interval: 15s
        metrics_path: /metrics
        scheme: https
        authorization:
          type: Bearer
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_dpu_nvidia_com_component]
            action: keep
            regex: ".*-controller-manager"
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: metrics
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
        # Add cluster label to ALL scraped metrics for Grafana multicluster support
        # This makes the cluster label visible in local queries (unlike externalLabels)
        # Note: The control plane components (kube-apiserver, kube-controller-manager, kube-scheduler)
        # already have cluster labels via their ServiceMonitor relabelings above
        metric_relabel_configs:
          - action: replace
            target_label: cluster
            replacement: management
    # Allow monitoring of all ServiceMonitors
    # Setting to {} alone isn't enough - need to disable the default helm values behavior
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    # Allow monitoring of all namespaces
    serviceMonitorNamespaceSelector: {}
    # Allow monitoring of all PodMonitors
    podMonitorSelectorNilUsesHelmValues: false
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}
# Grafana configuration
grafana:
  enabled: true
  # Schedule grafana on control-plane nodes
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "node-role.kubernetes.io/master"
                operator: Exists
          - matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: Exists
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  # Persistent volume configuration
  persistence:
    enabled: true
    storageClassName: local-path
  # Disable init container that changes ownership (causes issues with some storage classes)
  initChownData:
    enabled: false
  # Datasource configuration
  # kube-prometheus-stack automatically creates a Prometheus datasource with uid: prometheus
  # which matches what the dpf-operator dashboards expect
  # Additional datasources
  additionalDataSources:
    - name: Loki
      type: loki
      uid: loki
      access: proxy
      url: http://loki.dpf-operator-system.svc.cluster.local:3100
      isDefault: false
      editable: true
      jsonData:
        maxLines: 1000
        derivedFields:
          # Automatically extract trace IDs from logs (if present)
          - datasourceName: Tempo
            matcherRegex: "traceID=(\\w+)"
            name: TraceID
            url: "$${__value.raw}"
  # Sidecar configuration
  sidecar:
    # Datasources sidecar - provisions datasources from ConfigMaps/Secrets
    datasources:
      enabled: true
      # This is critical - without it, Grafana won't load datasources on startup
      defaultDatasourceEnabled: true
      # Note: The sidecar writes datasources but by default skips the initial reload (REQ_SKIP_INIT: true)
      # The lifecycle hook above handles triggering the initial reload
    # Dashboards sidecar - provisions dashboards from ConfigMaps
    dashboards:
      enabled: true
      # Label that the sidecar will look for in ConfigMaps
      label: grafana_dashboard
      labelValue: "1"
      # Search in dpf-operator-system namespace for dashboard ConfigMaps
      searchNamespace: dpf-operator-system
      # Use folder annotation to organize dashboards into folders
      folderAnnotation: grafana_folder
      # Allow the sidecar to create dashboard providers automatically
      provider:
        foldersFromFilesStructure: true
      # Enable multicluster dashboard support
      # This allows dashboards to display metrics from multiple clusters with proper cluster labels
      multicluster:
        global:
          enabled: true
# Prometheus Operator configuration
prometheusOperator:
  # Schedule operator on control-plane nodes
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "node-role.kubernetes.io/master"
                operator: Exists
          - matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: Exists
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  # Admission webhooks configuration
  admissionWebhooks:
    # Patch job creates/patches webhook certificates
    patch:
      # Schedule patch job on control-plane nodes
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "node-role.kubernetes.io/master"
                    operator: Exists
              - matchExpressions:
                  - key: "node-role.kubernetes.io/control-plane"
                    operator: Exists
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
  # Create CRDs
  createCustomResource: true
  # Prometheus operator resources
  resources:
    limits:
      cpu: 200m
      memory: 200Mi
    requests:
      cpu: 100m
      memory: 100Mi
loki
# Loki configuration for management cluster
# This deployment receives logs from OpenTelemetry Collectors running on both
# the management cluster and DPU clusters
deploymentMode: SingleBinary
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  # Enable OTLP ingestion
  server:
    http_listen_port: 3100
    grpc_listen_port: 9095
    log_level: info
  storage:
    type: 'filesystem'
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  # Limits configuration (includes OTLP config)
  limits_config:
    retention_period: 168h # 7 days
    max_query_series: 10000
    max_query_lookback: 720h # 30 days
    ingestion_rate_mb: 50
    ingestion_burst_size_mb: 100
    per_stream_rate_limit: 10MB
    per_stream_rate_limit_burst: 20MB
    allow_structured_metadata: true
    otlp_config:
      resource_attributes:
        attributes_config:
          - action: index_label
            attributes:
              - k8s.namespace.name
              - k8s.pod.name
              - k8s.container.name
              - cluster
# Single binary mode configuration
singleBinary:
  replicas: 1
  # Schedule on control-plane nodes
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "node-role.kubernetes.io/master"
                operator: Exists
          - matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: Exists
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  # Resources
  resources:
    limits:
      cpu: 1000m
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 1Gi
  # Persistence
  persistence:
    enabled: true
    storageClass: local-path
    size: 10Gi
# Gateway disabled - not needed in SingleBinary mode
# All access goes directly to the Loki service on port 3100
gateway:
  enabled: false
# Read/Write components (disabled in single binary mode)
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0
# Disable components not needed in single binary mode
chunksCache:
  enabled: false
resultsCache:
  enabled: false
# Monitoring configuration
monitoring:
  serviceMonitor:
    enabled: true
    labels:
      release: kube-prometheus-stack
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
# Test configuration
test:
  enabled: false
# Loki canary (synthetic log generator for testing)
lokiCanary:
  enabled: false
opentelemetry-collector
# OpenTelemetry Collector configuration for management cluster
# This collector receives logs and metrics from:
# 1. Local management cluster pods (via filelog receiver)
# 2. OpenTelemetry Collectors running on DPU clusters (via OTLP receiver)
mode: daemonset
# Image configuration (required as of chart version 0.145.0)
# Use contrib distribution for Loki exporter support
image:
  repository: otel/opentelemetry-collector-contrib
  tag: "" # defaults to chart appVersion
# Run on all management cluster nodes to collect logs
# Tolerations allow running on control-plane nodes
tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
# Presets for Kubernetes integration
presets:
  logsCollection:
    enabled: true
    includeCollectorLogs: true
  kubernetesAttributes:
    enabled: true
    extractAllPodLabels: true
    extractAllPodAnnotations: false
# OpenTelemetry Collector configuration
config:
  receivers:
    # OTLP receiver for logs and metrics from DPU clusters
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
  processors:
    batch:
      timeout: 10s
      send_batch_size: 1024
    memory_limiter:
      check_interval: 5s
      limit_mib: 1024
      spike_limit_mib: 256
    k8sattributes:
      auth_type: "serviceAccount"
      passthrough: false
      extract:
        metadata:
          - k8s.namespace.name
          - k8s.deployment.name
          - k8s.statefulset.name
          - k8s.daemonset.name
          - k8s.cronjob.name
          - k8s.job.name
          - k8s.node.name
          - k8s.pod.name
          - k8s.pod.uid
          - k8s.pod.start_time
    # Add management cluster label (only if not already set by DPU cluster)
    resource:
      attributes:
        - key: cluster
          value: "management"
          action: insert
  exporters:
    # Export logs to Loki via OTLP (directly to Loki, bypassing gateway)
    otlphttp/loki:
      endpoint: http://loki:3100/otlp
      tls:
        insecure: true
    # Export metrics to Prometheus (via remote write)
    prometheusremotewrite:
      endpoint: http://kube-prometheus-stack-prometheus.dpf-operator-system.svc.cluster.local:9090/api/v1/write
      resource_to_telemetry_conversion:
        enabled: true
    # Debug exporter for troubleshooting
    debug:
      verbosity: basic
      sampling_initial: 5
      sampling_thereafter: 200
  service:
    pipelines:
      logs:
        receivers: [otlp, filelog]
        processors: [memory_limiter, k8sattributes, resource, batch]
        exporters: [otlphttp/loki, debug]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, resource, batch]
        exporters: [prometheusremotewrite, debug]
# Resources for the collector deployment
resources:
  limits:
    cpu: 500m
    memory: 1Gi
  requests:
    cpu: 200m
    memory: 512Mi
# Service configuration
# Use NodePort to allow DPU clusters to reach this collector
service:
  enabled: true
  type: NodePort
# Ports configuration
ports:
  otlp:
    enabled: true
    containerPort: 4317
    servicePort: 4317
    protocol: TCP
  otlp-http:
    enabled: true
    containerPort: 4318
    servicePort: 4318
    # Fixed NodePort for DPU clusters to use. Chosen from the lower static band of
    # the NodePort range to avoid collisions with dynamically auto-assigned NodePorts,
    # which are allocated from the upper band first.
    # See https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/3668-reserved-service-nodeport-range
    nodePort: 30050
    protocol: TCP
  metrics:
    enabled: true
    containerPort: 8888
    servicePort: 8888
    protocol: TCP
# ServiceAccount configuration
serviceAccount:
  create: true
  name: opentelemetry-collector
# ClusterRole permissions
clusterRole:
  create: true
  rules:
    - apiGroups: [""]
      resources: ["pods", "namespaces", "nodes"]
      verbs: ["get", "list", "watch"]
    - apiGroups: ["apps"]
      resources: ["replicasets", "deployments", "daemonsets", "statefulsets"]
      verbs: ["get", "list", "watch"]
    - apiGroups: ["batch"]
      resources: ["jobs", "cronjobs"]
      verbs: ["get", "list", "watch"]
# ServiceMonitor for Prometheus monitoring
serviceMonitor:
  enabled: true
  metricsEndpoints:
    - port: metrics
Last updated: June 24, 2026