Simple CVT Performance Tuning Guide
Easy Configuration - Just 3 Variables
Based on extensive testing and real-world deployments, CVT performance can be optimized with just these simple settings:
Core Configuration:
    # 1. Agent Deployment (~200MB image + local container operations)
    export CVT_DEPLOYMENT_MAX_WORKERS=60

    # 2. Everything Else (validation, connectivity, DNS, etc.)
    export CVT_MAX_WORKERS=150

    # 3. Batching Control (when to split large deployments)
    export CVT_BATCHING_THRESHOLD=10000
How to Decide Values
CVT_DEPLOYMENT_MAX_WORKERS (Agent Deployment)
Question: How much bandwidth can your management network handle?
Note: Deployment includes multiple phases: image fetch (~200MB), save to disk, load image, and container creation. The process is not purely bandwidth-limited, so we can use higher worker counts.
| Network Speed | Recommended Value | Reasoning |
|---|---|---|
| 1G | 20 | 20 × 200MB = ~4GB concurrent + local ops |
| 10G | 60 | 60 × 200MB = ~12GB concurrent + local ops |
| 25G+ | 100-160 | Higher bandwidth + parallel local operations |
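The table can be turned into a small lookup. This is a minimal sketch, not part of CVT: the function name `suggest_deploy_workers` is hypothetical, and it assumes you take the lower bound of each recommended range as a starting point.

```shell
# Hypothetical helper: map management-link speed (in Mb/s) to a starting
# worker count, using the lower bound of each row in the table.
suggest_deploy_workers() {
    speed_mbps=$1
    if [ "$speed_mbps" -ge 25000 ]; then
        echo 100    # 25G+: table suggests 100-160
    elif [ "$speed_mbps" -ge 10000 ]; then
        echo 60     # 10G
    else
        echo 20     # 1G or slower
    fi
}

# Example: a 10G management link. On Linux the negotiated speed can often be
# read from /sys/class/net/<interface>/speed (value in Mb/s).
export CVT_DEPLOYMENT_MAX_WORKERS="$(suggest_deploy_workers 10000)"
echo "CVT_DEPLOYMENT_MAX_WORKERS=$CVT_DEPLOYMENT_MAX_WORKERS"
```

Starting at the lower bound leaves room to raise the value once bandwidth monitoring shows headroom.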
CVT_MAX_WORKERS (Everything Else)
Question: How many CPU cores does your server have?
| Server CPU Cores | Recommended Value | Reasoning |
|---|---|---|
| 8-32 cores | 30-50 | Conservative scaling |
| 32-128 cores | 75-150 | Balanced scaling |
| 128+ cores | 150-300 | Aggressive scaling |
But watch for switch overload! If you see many timeout errors, reduce this value.
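One way to derive a starting value from the core count is sketched below. The helper `suggest_max_workers` is hypothetical, and picking the lower bound of each range is an assumption made here so that switch overload is less likely on the first run.

```shell
# Hypothetical helper: map CPU core count to a starting CVT_MAX_WORKERS,
# using the lower bound of each range in the table.
suggest_max_workers() {
    cores=$1
    if [ "$cores" -ge 128 ]; then
        echo 150    # 128+ cores: table suggests 150-300
    elif [ "$cores" -ge 32 ]; then
        echo 75     # 32-128 cores: table suggests 75-150
    else
        echo 30     # 8-32 cores: table suggests 30-50
    fi
}

# Detect the number of online cores (nproc on Linux, getconf as a fallback).
detected_cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
export CVT_MAX_WORKERS="$(suggest_max_workers "$detected_cores")"
echo "cores=$detected_cores CVT_MAX_WORKERS=$CVT_MAX_WORKERS"
```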
CVT_BATCHING_THRESHOLD (When to Batch)
Question: How many switches do you have?
| Switch Count | Recommended Value | What Happens |
|---|---|---|
| <5,000 | 10000 (default) | Single batch (faster) |
| 5,000-20,000 | 5000 | Light batching |
| 20,000+ | 3000 | More aggressive batching |
Quick Start by Deployment Size:
Small Deployment (<1,000 switches)
    export CVT_DEPLOYMENT_MAX_WORKERS=20
    export CVT_MAX_WORKERS=50
    # No need to change batching threshold
Medium Deployment (1,000-10,000 switches)
    export CVT_DEPLOYMENT_MAX_WORKERS=40
    export CVT_MAX_WORKERS=100
    # No need to change batching threshold
Large Deployment (10,000-30,000 switches)
    export CVT_DEPLOYMENT_MAX_WORKERS=60
    export CVT_MAX_WORKERS=150
    export CVT_BATCHING_THRESHOLD=5000
Hyperscale Deployment (30,000+ switches)
    export CVT_DEPLOYMENT_MAX_WORKERS=80
    export CVT_MAX_WORKERS=200
    export CVT_BATCHING_THRESHOLD=3000
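If you manage several environments, the four profiles above can be selected from the switch count. The sketch below is illustrative only; `cvt_profile` is a hypothetical function name, and the values are copied from the profiles in this section.

```shell
# Hypothetical helper: print the export line for the profile that matches
# the given switch count. The smaller profiles leave
# CVT_BATCHING_THRESHOLD at its default by not setting it.
cvt_profile() {
    switches=$1
    if [ "$switches" -ge 30000 ]; then
        echo "export CVT_DEPLOYMENT_MAX_WORKERS=80 CVT_MAX_WORKERS=200 CVT_BATCHING_THRESHOLD=3000"
    elif [ "$switches" -ge 10000 ]; then
        echo "export CVT_DEPLOYMENT_MAX_WORKERS=60 CVT_MAX_WORKERS=150 CVT_BATCHING_THRESHOLD=5000"
    elif [ "$switches" -ge 1000 ]; then
        echo "export CVT_DEPLOYMENT_MAX_WORKERS=40 CVT_MAX_WORKERS=100"
    else
        echo "export CVT_DEPLOYMENT_MAX_WORKERS=20 CVT_MAX_WORKERS=50"
    fi
}

# Example: apply the profile for a 12,000-switch deployment.
eval "$(cvt_profile 12000)"
echo "workers=$CVT_MAX_WORKERS threshold=$CVT_BATCHING_THRESHOLD"
```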
How to Know If Your Settings Are Good
Good Signs:
- ✅ Server load increases during validation (better CPU utilization)
- ✅ Validation completes faster than before
- ✅ No significant increase in timeout errors
- ✅ Network bandwidth stays below 80%
Warning Signs:
- ⚠️ Many "Timeout while trying to start validation" errors
- ⚠️ "Connection refused" errors from switches
- ⚠️ Server load stays low (underutilization)
- ⚠️ Network bandwidth hits 90%+
Adjustment Strategy:
- Too many timeouts: Reduce `CVT_MAX_WORKERS` by 25-50
- Server underutilized: Increase `CVT_MAX_WORKERS` by 25-50
- Deployment too slow: Increase `CVT_DEPLOYMENT_MAX_WORKERS` (if network allows)
- Memory issues: Reduce `CVT_BATCHING_THRESHOLD`
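The timeout rule above can be sketched as a small script. Everything here other than the error string is an assumption for illustration: the function names, the step of 25, the floor of 30, and the idea that timeout messages end up in a log file you can grep.

```shell
# Count timeout errors in a log file (grep -c prints the number of
# matching lines, or 0 when there are none).
count_timeouts() {
    grep -c "Timeout while trying to start validation" "$1"
}

# Step CVT_MAX_WORKERS down by 25 when the timeout count exceeds a limit,
# never going below an assumed floor of 30. Otherwise keep the value.
next_max_workers() {
    current=$1
    timeouts=$2
    limit=$3
    step=25
    floor=30
    if [ "$timeouts" -gt "$limit" ]; then
        next=$((current - step))
        if [ "$next" -lt "$floor" ]; then
            next=$floor
        fi
        echo "$next"
    else
        echo "$current"
    fi
}

# Example: 12 timeouts against a tolerance of 10 lowers 150 workers to 125.
echo "next CVT_MAX_WORKERS: $(next_max_workers 150 12 10)"
```

With a real log, the count would come from something like `count_timeouts /path/to/your/cvt.log`, where the path depends on your installation.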
Expected Performance
Your Customer's 32,149 Switches:
Current Configuration:
    export CVT_DEPLOYMENT_MAX_WORKERS=60   # Optimal for 10G network with 200MB image
    export CVT_MAX_WORKERS=150             # Good for 448-core server
    export CVT_BATCHING_THRESHOLD=10000    # Will use batching (32K > 10K)
Expected Results:
- Current: 7m 13s
- Optimized: 2-3 minutes (roughly 58-72% less run time)
- Server Load: Should increase from 5-7 to 15-20
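The improvement percentage follows directly from the two run times. A quick check in shell integer arithmetic (the helper name `pct_reduction` is illustrative, and results are truncated to whole percent):

```shell
# Percentage reduction in run time, given before/after durations in seconds.
pct_reduction() {
    before=$1
    after=$2
    echo $(( (before - after) * 100 / before ))
}

baseline=$((7 * 60 + 13))   # 7m 13s = 433 seconds
echo "3 minutes: $(pct_reduction "$baseline" 180)% less run time"
echo "2 minutes: $(pct_reduction "$baseline" 120)% less run time"
```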
Monitoring and Troubleshooting
Critical Monitoring Points:
Watch for Device Overload:
- Timeout Errors: Increase in "Timeout while trying to start validation" messages
- Connection Refused: "Connection error" messages from devices
- Response Times: Slower device response times
- Failure Rate: Higher percentage of failed device connections
Server Utilization Monitoring:
- Load Average: Should increase from baseline to target ranges
- CPU Usage: Better utilization of available cores
- Memory: Watch for any memory pressure during large deployments
- Network: Monitor bandwidth utilization on management interface
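To check the load average against a target range during a run, something like the following may help. It is a sketch assuming a Linux host (it reads `/proc/loadavg`); the function names and the 15-20 range in the example are illustrative.

```shell
# Succeeds (exit status 0) when the load value lies within [low, high].
# awk does the comparison because the shell cannot compare decimals.
load_in_range() {
    awk -v l="$1" -v lo="$2" -v hi="$3" 'BEGIN { exit !(l >= lo && l <= hi) }'
}

# First field of /proc/loadavg is the 1-minute load average.
current_load() {
    cut -d ' ' -f 1 /proc/loadavg
}

if [ -r /proc/loadavg ]; then
    load=$(current_load)
    if load_in_range "$load" 15 20; then
        echo "load $load is within the 15-20 target"
    else
        echo "load $load is outside the 15-20 target"
    fi
fi
```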
Success Criteria by Worker Count
| Worker Count | Expected Load Average | Expected Time Improvement | Risk Level |
|---|---|---|---|
| 50-75 | 8-12 | 30-50% faster | Low |
| 100-150 | 12-18 | 50-70% faster | Medium |
| 200+ | 18-25 | 70%+ faster | High |
Red Flags (Scale Back If You See)
- ⚠️ Significant increase in timeout errors
- ⚠️ "Connection refused" errors from devices
- ⚠️ Device response times getting slower
- ⚠️ Higher failure rates than baseline
Green Lights (Scale Up If You See)
- ✅ Stable or improved device response times
- ✅ No increase in connection errors
- ✅ Server load well below target range
- ✅ Good success rate maintained
System Tuning for Large Deployments
File Descriptor Limits
    # Increase file descriptor limits for high concurrency
    ulimit -n 65536

    # Make permanent by adding to /etc/security/limits.conf:
    echo "* soft nofile 65536" >> /etc/security/limits.conf
    echo "* hard nofile 65536" >> /etc/security/limits.conf
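Before starting a large run, it can be worth confirming that the limit actually applies to the current shell, since changes to `limits.conf` only take effect for new login sessions. A minimal check, with `fd_limit_ok` as a hypothetical helper name:

```shell
# Succeeds when the soft open-file limit of this shell is at least the
# required value ("unlimited" always passes).
fd_limit_ok() {
    required=$1
    current=$(ulimit -n)
    [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]
}

if fd_limit_ok 65536; then
    echo "open-file limit is sufficient: $(ulimit -n)"
else
    echo "open-file limit too low: $(ulimit -n) (want 65536)"
fi
```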
Network Optimization
    # Optimize network settings for high concurrency
    echo 65536 > /proc/sys/net/core/somaxconn
    echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
    # tcp_tw_recycle is intentionally not set: it broke connections from
    # clients behind NAT and was removed in Linux 4.12.
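Values written under `/proc/sys` are lost on reboot. One way to persist them is a drop-in file under `/etc/sysctl.d`, sketched below. The function name and the file name `90-cvt-tuning.conf` are assumptions; `tcp_tw_recycle` is left out because it was removed in Linux 4.12.

```shell
# Write the network settings as a sysctl drop-in file at the given path.
write_sysctl_fragment() {
    target=$1
    cat > "$target" <<'EOF'
# Network settings for high-concurrency CVT runs
net.core.somaxconn = 65536
net.ipv4.tcp_tw_reuse = 1
EOF
}

# Example (run as root, file name is illustrative):
#   write_sysctl_fragment /etc/sysctl.d/90-cvt-tuning.conf
#   sysctl --system    # reload all sysctl configuration files
```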
Memory Settings
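    # For very large deployments (20,000+ devices)
    echo 1 > /proc/sys/vm/overcommit_memory
    # Note: overcommit_ratio only takes effect when overcommit_memory=2,
    # so it is ignored under mode 1 above.
    echo 80 > /proc/sys/vm/overcommit_ratio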
Agent Deployment Time Estimates
Deployment Duration by Cluster Size:
| Devices | Workers | Network | Estimated Time |
|---|---|---|---|
| 1,000 | 20 | 1G | 1-2 hours |
| 4,000 | 60 | 10G | 2-4 hours |
| 10,000 | 80 | 25G | 4-8 hours |
| 25,000+ | 100-160 | 40G+ | 10-20 hours |
Note: Agent deployment includes image fetch (~200MB per device), local save, image load, and container creation. With the reduced image size and parallel local operations, deployment is significantly faster than with larger images.
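For comparison, a lower bound on the image transfer alone can be computed from the device count and link speed. This sketch assumes every device pulls the full ~200MB image through a single management link running at line rate, which is optimistic; the function name is illustrative.

```shell
# Lower bound, in seconds, for moving one ~200MB image to each device
# over a link of the given speed (in Gb/s), ignoring all local operations.
min_transfer_seconds() {
    devices=$1
    link_gbps=$2
    image_mb=200
    # megabytes -> megabits (x8), divided by link rate in megabits per second
    echo $(( devices * image_mb * 8 / (link_gbps * 1000) ))
}

echo "1,000 devices on 1G:   $(min_transfer_seconds 1000 1) s"
echo "4,000 devices on 10G:  $(min_transfer_seconds 4000 10) s"
echo "10,000 devices on 25G: $(min_transfer_seconds 10000 25) s"
```

Under these assumptions the raw transfer takes well under an hour in each case, which suggests the estimates in the table are dominated by the save, load, and container creation steps rather than by bandwidth.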
Bottom Line
Most customers only need to set 2 variables:
- `CVT_DEPLOYMENT_MAX_WORKERS` (based on network bandwidth)
- `CVT_MAX_WORKERS` (based on server CPU and device tolerance)
The third variable (CVT_BATCHING_THRESHOLD) usually works fine at default!