NVIDIA UFM Cable Validation Tool

Simple CVT Performance Tuning Guide

Easy Configuration - Just 3 Variables

Based on extensive testing and real-world deployments, CVT performance can be tuned with three environment variables:

Core Configuration:

    # 1. Agent Deployment (~200MB image + local container operations)
    export CVT_DEPLOYMENT_MAX_WORKERS=60

    # 2. Everything Else (validation, connectivity, DNS, etc.)
    export CVT_MAX_WORKERS=150

    # 3. Batching Control (when to split large deployments)
    export CVT_BATCHING_THRESHOLD=10000

How to Decide Values

CVT_DEPLOYMENT_MAX_WORKERS (Agent Deployment)

Question: How much bandwidth can your management network handle?

Note: Deployment includes multiple phases: image fetch (~200MB), save to disk, load image, and container creation. The process is not purely bandwidth-limited, so worker counts can be higher than bandwidth alone would suggest.

| Network Speed | Recommended Value | Reasoning |
|---------------|-------------------|-----------|
| 1G            | 20                | 20 × 200MB = ~4GB concurrent + local ops |
| 10G           | 60                | 60 × 200MB = ~12GB concurrent + local ops |
| 25G+          | 100-160           | Higher bandwidth + parallel local operations |
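
The table above can be expressed as a small lookup. The following is a minimal sketch, not part of CVT: the function name `cvt_deployment_workers` is hypothetical, speeds between the listed tiers fall back to the next lower tier, and starting at the low end (100) of the 25G+ range is an assumption.

```shell
# Hypothetical helper (not part of CVT): map the management-link speed in Mb/s
# to a starting CVT_DEPLOYMENT_MAX_WORKERS value from the table above.
cvt_deployment_workers() {
    local speed_mbps="$1"
    if [ "$speed_mbps" -ge 25000 ]; then
        echo 100   # assumption: start at the low end of the 100-160 range
    elif [ "$speed_mbps" -ge 10000 ]; then
        echo 60
    else
        echo 20    # 1G links, and anything slower than 10G
    fi
}

# Example: a 10G management interface
export CVT_DEPLOYMENT_MAX_WORKERS="$(cvt_deployment_workers 10000)"
```

On Linux, the negotiated link speed in Mb/s can usually be read from /sys/class/net/<interface>/speed and passed to the function.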

CVT_MAX_WORKERS (Everything Else)

Question: How many CPU cores does your server have?

| Server CPU Cores | Recommended Value | Reasoning |
|------------------|-------------------|-----------|
| 8-32 cores       | 30-50             | Conservative scaling |
| 32-128 cores     | 75-150            | Balanced scaling |
| 128+ cores       | 150-300           | Aggressive scaling |

But watch for switch overload! If you see many timeout errors, reduce this value.
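
As a rough illustration of the core-count table, the sketch below picks the low end of each recommended range, which leaves headroom before switches start timing out. The function name `cvt_max_workers` is hypothetical, and assigning the overlapping boundary values (32 and 128 cores) to the higher tier is an assumption.

```shell
# Hypothetical helper (not part of CVT): map a CPU core count to a starting
# CVT_MAX_WORKERS value, using the low end of each range in the table above.
cvt_max_workers() {
    local cores="$1"
    if [ "$cores" -ge 128 ]; then
        echo 150
    elif [ "$cores" -ge 32 ]; then
        echo 75    # assumption: 32 cores is treated as the middle tier
    else
        echo 30
    fi
}

# Example: a 448-core server starts at 150 workers
export CVT_MAX_WORKERS="$(cvt_max_workers 448)"
```

On a Linux host, `nproc` reports the number of available cores and can supply the argument.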

CVT_BATCHING_THRESHOLD (When to Batch)

Question: How many switches do you have?

| Switch Count  | Recommended Value | What Happens |
|---------------|-------------------|--------------|
| <5,000        | 10000 (default)   | Single batch (faster) |
| 5,000-20,000  | 5000              | Light batching |
| 20,000+       | 3000              | More aggressive batching |
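
The same table, sketched as a lookup by switch count. The function name `cvt_batching_threshold` is hypothetical and not part of CVT.

```shell
# Hypothetical helper (not part of CVT): choose CVT_BATCHING_THRESHOLD
# from the number of switches, following the table above.
cvt_batching_threshold() {
    local switches="$1"
    if [ "$switches" -ge 20000 ]; then
        echo 3000
    elif [ "$switches" -ge 5000 ]; then
        echo 5000
    else
        echo 10000   # default: small fabrics run as a single batch
    fi
}

# Example: a fabric with 12,000 switches
export CVT_BATCHING_THRESHOLD="$(cvt_batching_threshold 12000)"
```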

Quick Start by Deployment Size:

Small Deployment (<1,000 switches)

    export CVT_DEPLOYMENT_MAX_WORKERS=20
    export CVT_MAX_WORKERS=50
    # No need to change batching threshold

Medium Deployment (1,000-10,000 switches)

    export CVT_DEPLOYMENT_MAX_WORKERS=40
    export CVT_MAX_WORKERS=100
    # No need to change batching threshold

Large Deployment (10,000-30,000 switches)

    export CVT_DEPLOYMENT_MAX_WORKERS=60
    export CVT_MAX_WORKERS=150
    export CVT_BATCHING_THRESHOLD=5000

Hyperscale Deployment (30,000+ switches)

    export CVT_DEPLOYMENT_MAX_WORKERS=80
    export CVT_MAX_WORKERS=200
    export CVT_BATCHING_THRESHOLD=3000
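
For scripted setups, the four presets above could be selected by switch count. This is only a sketch: `cvt_quick_start` is a hypothetical name, and placing the boundary values (1,000, 10,000, and 30,000 switches) in the larger tier is an assumption.

```shell
# Hypothetical helper (not part of CVT): print the export lines of the
# quick-start preset that matches a given switch count.
cvt_quick_start() {
    local switches="$1"
    if [ "$switches" -ge 30000 ]; then        # Hyperscale
        echo "export CVT_DEPLOYMENT_MAX_WORKERS=80"
        echo "export CVT_MAX_WORKERS=200"
        echo "export CVT_BATCHING_THRESHOLD=3000"
    elif [ "$switches" -ge 10000 ]; then      # Large
        echo "export CVT_DEPLOYMENT_MAX_WORKERS=60"
        echo "export CVT_MAX_WORKERS=150"
        echo "export CVT_BATCHING_THRESHOLD=5000"
    elif [ "$switches" -ge 1000 ]; then       # Medium: default threshold
        echo "export CVT_DEPLOYMENT_MAX_WORKERS=40"
        echo "export CVT_MAX_WORKERS=100"
    else                                      # Small: default threshold
        echo "export CVT_DEPLOYMENT_MAX_WORKERS=20"
        echo "export CVT_MAX_WORKERS=50"
    fi
}

# Apply the preset for a 12,000-switch fabric in the current shell
eval "$(cvt_quick_start 12000)"
```

Printing the export lines instead of setting variables directly makes it easy to review the preset before applying it.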

How to Know If Your Settings Are Good

Good Signs:

  • ✅ Server load increases during validation (better CPU utilization)

  • ✅ Validation completes faster than before

  • ✅ No significant increase in timeout errors

  • ✅ Network bandwidth stays below 80%

Warning Signs:

  • ⚠️ Many "Timeout while trying to start validation" errors

  • ⚠️ "Connection refused" errors from switches

  • ⚠️ Server load stays low (underutilization)

  • ⚠️ Network bandwidth hits 90%+

Adjustment Strategy:

  1. Too many timeouts: Reduce CVT_MAX_WORKERS by 25-50

  2. Server underutilized: Increase CVT_MAX_WORKERS by 25-50

  3. Deployment too slow: Increase CVT_DEPLOYMENT_MAX_WORKERS (if network allows)

  4. Memory issues: Reduce CVT_BATCHING_THRESHOLD
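
The first adjustment rule can be sketched as follows. Both function names are hypothetical, the log file location depends on your installation, and the step of 25 reads the "25-50" guidance as a worker count, which is an assumption.

```shell
# Hypothetical helper: count start-timeout errors in a log file.
# grep -c exits non-zero when the count is 0, so the status is ignored.
count_start_timeouts() {
    grep -c 'Timeout while trying to start validation' "$1" || true
}

# Hypothetical helper: step CVT_MAX_WORKERS down by 25 when the timeout
# count exceeds a baseline taken from an earlier run; otherwise keep it.
next_max_workers() {
    local current="$1" timeouts="$2" baseline="$3" step=25
    local new="$current"
    if [ "$timeouts" -gt "$baseline" ]; then
        new=$((current - step))
        if [ "$new" -lt 1 ]; then
            new=1   # never drop below one worker
        fi
    fi
    echo "$new"
}

# Example: 150 workers, 40 timeouts against a baseline of 5
next_max_workers 150 40 5
```

The example prints 125. Rerun the validation after each change and compare the timeout count against the same baseline before stepping again.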

Expected Performance

Your Customer's 32,149 Switches:

Current Configuration:

    export CVT_DEPLOYMENT_MAX_WORKERS=60   # Optimal for 10G network with 200MB image
    export CVT_MAX_WORKERS=150             # Good for 448-core server
    export CVT_BATCHING_THRESHOLD=10000    # Will use batching (32K > 10K)

Expected Results:

  • Current: 7m 13s

  • Optimized: 2-3 minutes (60-75% improvement)

  • Server Load: Should increase from 5-7 to 15-20
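
As a quick check of the figures above, the arithmetic below converts the current run time to seconds and applies a 60% and a 75% reduction. It uses shell integer arithmetic, so results are truncated to whole seconds.

```shell
# 7m 13s expressed in seconds
current_s=$((7 * 60 + 13))            # 433

# Time remaining after a 60% and a 75% reduction
after_60=$((current_s * 40 / 100))    # 173 s
after_75=$((current_s * 25 / 100))    # 108 s

printf '60%% faster: %dm %ds\n' $((after_60 / 60)) $((after_60 % 60))
printf '75%% faster: %dm %ds\n' $((after_75 / 60)) $((after_75 % 60))
```

This prints 2m 53s and 1m 48s, which matches the quoted 2-3 minute range.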

Monitoring and Troubleshooting

Critical Monitoring Points:

Watch for Device Overload:

  1. Timeout Errors: Increase in "Timeout while trying to start validation" messages

  2. Connection Refused: "Connection error" messages from devices

  3. Response Times: Slower device response times

  4. Failure Rate: Higher percentage of failed device connections

Server Utilization Monitoring:

  1. Load Average: Should increase from baseline to target ranges

  2. CPU Usage: Better utilization of available cores

  3. Memory: Watch for any memory pressure during large deployments

  4. Network: Monitor bandwidth utilization on management interface
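
One way to check the load average from a script is sketched below. The function names are hypothetical; /proc/loadavg is Linux-specific, and the comparison truncates the load to an integer for simplicity.

```shell
# Print the 1-minute load average. Reads /proc/loadavg by default (Linux);
# another file in the same format can be passed as the first argument.
load_1min() {
    cut -d ' ' -f 1 "${1:-/proc/loadavg}"
}

# Succeed if the load, truncated to an integer, lies within [low, high].
load_in_range() {
    local load_int="${1%%.*}" low="$2" high="$3"
    [ "$load_int" -ge "$low" ] && [ "$load_int" -le "$high" ]
}

# Example usage during a validation run, with a target range of 15-20:
#   load_in_range "$(load_1min)" 15 20 && echo "load within target"
```

Sampling this a few times during a validation run gives a simple before-and-after comparison against the baseline.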

Success Criteria by Worker Count

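| Worker Count | Expected Load Average | Expected Time Improvement | Risk Level |
|--------------|-----------------------|---------------------------|------------|
| 50-75        | 8-12                  | 30-50% faster             | Low        |
| 100-150      | 12-18                 | 50-70% faster             | Medium     |
| 200+         | 18-25                 | 70%+ faster               | High       |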
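
For automated checks, the table above can be encoded as a lookup from worker count to the expected load-average range. The function name `expected_load_range` is hypothetical. Worker counts that fall between the table's rows (for example 90 or 175) are not covered by the source, so the sketch returns a failure status for them instead of guessing.

```shell
# Hypothetical helper: print "low high" for the expected load average at a
# given worker count, or fail if the table above has no row for that count.
expected_load_range() {
    local workers="$1"
    if [ "$workers" -ge 200 ]; then
        echo "18 25"
    elif [ "$workers" -ge 100 ] && [ "$workers" -le 150 ]; then
        echo "12 18"
    elif [ "$workers" -ge 50 ] && [ "$workers" -le 75 ]; then
        echo "8 12"
    else
        return 1   # not covered by the table
    fi
}

# Example: look up the range for 150 workers
expected_load_range 150
```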

Red Flags (Scale Back If You See)

  • ⚠️ Significant increase in timeout errors

  • ⚠️ "Connection refused" errors from devices

  • ⚠️ Device response times getting slower

  • ⚠️ Higher failure rates than baseline

Green Lights (Scale Up If You See)

  • ✅ Stable or improved device response times

  • ✅ No increase in connection errors

  • ✅ Server load well below target range

  • ✅ Good success rate maintained

System Tuning for Large Deployments

File Descriptor Limits

    # Increase file descriptor limits for high concurrency
    ulimit -n 65536

    # Make permanent by adding to /etc/security/limits.conf:
    echo "* soft nofile 65536" >> /etc/security/limits.conf
    echo "* hard nofile 65536" >> /etc/security/limits.conf
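
Before changing anything, it can help to check the limit that is actually in effect. The sketch below assumes the 65536 target from this section; `fd_limit_ok` is a hypothetical helper name.

```shell
# Succeed if a limit (a number or the word "unlimited") meets the requirement.
fd_limit_ok() {
    local limit="$1" required="${2:-65536}"
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$required" ]
}

# Check the soft open-file limit of the current shell
if fd_limit_ok "$(ulimit -n)"; then
    echo "open-file limit is sufficient"
else
    echo "open-file limit is $(ulimit -n); raise it to at least 65536"
fi
```

Entries in /etc/security/limits.conf are applied at login, so they take effect in new sessions. If CVT runs as a systemd service or inside a container, the limit may need to be set there instead.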

Network Optimization

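    # Optimize network settings for high concurrency (run as root)
    echo 65536 > /proc/sys/net/core/somaxconn
    echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
    # tcp_tw_recycle was removed in Linux 4.12 and can break clients behind NAT;
    # apply it only on older kernels that still expose the parameter
    [ -e /proc/sys/net/ipv4/tcp_tw_recycle ] && echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle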
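
Values written under /proc/sys are lost on reboot. A possible way to persist them is a sysctl.d fragment, sketched below. The function name and the file name 90-cvt.conf are arbitrary choices for illustration. `tcp_tw_recycle` is left out because the parameter no longer exists on kernels 4.12 and later.

```shell
# Print a sysctl.d-style fragment for the network settings in this section.
cvt_sysctl_fragment() {
    cat <<'EOF'
net.core.somaxconn = 65536
net.ipv4.tcp_tw_reuse = 1
EOF
}

# To persist and load the settings (as root):
#   cvt_sysctl_fragment > /etc/sysctl.d/90-cvt.conf
#   sysctl --system
cvt_sysctl_fragment
```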

Memory Settings

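    # For very large deployments (20,000+ devices)
    # Mode 1 always allows memory overcommit
    echo 1 > /proc/sys/vm/overcommit_memory
    # Note: overcommit_ratio is only consulted when overcommit_memory is 2,
    # so this value has no effect while mode 1 is active
    echo 80 > /proc/sys/vm/overcommit_ratio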

Agent Deployment Time Estimates

Deployment Duration by Cluster Size:

| Devices | Workers | Network | Estimated Time |
|---------|---------|---------|----------------|
| 1,000   | 20      | 1G      | 1-2 hours      |
| 4,000   | 60      | 10G     | 2-4 hours      |
| 10,000  | 80      | 25G     | 4-8 hours      |
| 25,000+ | 100-160 | 40G+    | 10-20 hours    |

Note: Agent deployment includes image fetch (~200MB per device), local save, image load, and container creation. With the reduced image size and parallel local operations, deployment is significantly faster than with larger images.
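
To see how much of these estimates is raw transfer time, the sketch below computes an idealized lower bound: every device pulls a 200MB image through a single management link at full line rate. That single-link bottleneck is an assumption, and protocol overhead and all local operations are ignored. The function name `min_transfer_seconds` is hypothetical.

```shell
# Ideal transfer time in seconds:
#   devices * image size (MB) * 8 bits / link speed (Mb/s)
min_transfer_seconds() {
    local devices="$1" link_mbps="$2" image_mb="${3:-200}"
    echo $(( devices * image_mb * 8 / link_mbps ))
}

min_transfer_seconds 1000 1000     # 1,000 devices on 1G:  1600 s, about 27 minutes
min_transfer_seconds 4000 10000    # 4,000 devices on 10G:  640 s, about 11 minutes
```

Both figures are well below the corresponding estimates in the table, which is consistent with the note that local save, load, and container creation account for much of the deployment time.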

Bottom Line

Most customers only need to set 2 variables:

  1. CVT_DEPLOYMENT_MAX_WORKERS (based on network bandwidth)

  2. CVT_MAX_WORKERS (based on server CPU and device tolerance)

The third variable (CVT_BATCHING_THRESHOLD) usually works fine at default!
