Overview
This sizing guide provides hardware and network recommendations for Cable Validation deployments based on cluster size. Recommendations are based on performance analysis of enterprise deployments and optimal resource utilization patterns.
Important Note: Cable Validation Tool (CVT) handles both switches and hosts in modern deployments. Legacy names in the codebase (e.g., "SwitchAgentMgr", "switch_ip") date from when CVT handled only switches; they now apply to all managed devices (switches, hosts, HCAs, etc.).
Sizing Methodology
Key Factors:
- Device Overload Threshold: Individual devices (switches/hosts) can handle ~5-10 concurrent REST API calls
- Network Bandwidth: 10G MGMT interface provides ~800-900 MB/s practical throughput
- CPU Utilization: Target 15-25 load average for optimal performance
- Memory Requirements: ~50-100 MB per 1000 devices for topology and batch processing
- Batch Processing: Optimal batch sizes scale with worker count
- Mixed Workloads: Switches and hosts may have different response characteristics
Quick Configuration Reference
🎯 Simple 3-Variable Configuration
CVT performance can be optimized with just three environment variables:

    # 1. Agent Deployment (~200MB image + local container operations)
    export CVT_DEPLOYMENT_MAX_WORKERS=60

    # 2. Everything Else (validation, connectivity, DNS, etc.)
    export CVT_MAX_WORKERS=150

    # 3. Batching Control (when to split large deployments)
    export CVT_BATCHING_THRESHOLD=10000

Note: Agent deployment includes multiple phases: image fetch (~200MB), save to disk, load image, and container creation. Higher worker counts are possible because the process isn't purely bandwidth-limited.
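
As a rough illustration of how a wrapper script might read these three variables, the sketch below parses them from the environment in Python. The function name `load_cvt_settings` is hypothetical, and the fallback values for the two worker variables are assumptions that simply mirror the example above (only the 10000 batching default is stated in this guide); CVT's own parsing may differ.

```python
import os

# Fallbacks for this sketch only. 10000 is the batching default named in this
# guide; 60 and 150 mirror the example above and are NOT confirmed CVT defaults.
ASSUMED_FALLBACKS = {
    "CVT_DEPLOYMENT_MAX_WORKERS": 60,
    "CVT_MAX_WORKERS": 150,
    "CVT_BATCHING_THRESHOLD": 10000,
}


def load_cvt_settings(env=None):
    """Return the three tuning variables as positive integers."""
    env = os.environ if env is None else env
    settings = {}
    for name, fallback in ASSUMED_FALLBACKS.items():
        raw = env.get(name, "")
        # Empty or unset variables fall back; anything else must parse as int.
        value = int(raw) if raw.strip() else fallback
        if value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value}")
        settings[name] = value
    return settings
```

Calling `load_cvt_settings()` with no argument reads the current process environment; passing a plain dictionary makes the function easy to exercise without touching real settings.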
See the Simple Tuning Guide for detailed configuration guidance.
Cluster Sizing Table
| Cluster Size | Recommended CPUs | Recommended Memory | Recommended MAX_WORKERS | DEPLOYMENT_MAX_WORKERS | MGMT Bandwidth | Expected Time | Notes |
|---|---|---|---|---|---|---|---|
| Small Clusters (1-1,000 devices) | | | | | | | |
| 100 devices | 4-8 cores | 4-8 GB | 30-50 | 20 | 1G | 30-60 seconds | Single server, basic setup |
| 500 devices | 8-16 cores | 8-16 GB | 50-75 | 20-40 | 1G | 1-2 minutes | Development/test environment |
| 1,000 devices | 16-32 cores | 16-32 GB | 50-100 | 20-40 | 1G | 2-3 minutes | Small production deployment |
| Medium Clusters (1,000-10,000 devices) | | | | | | | |
| 2,500 devices | 32-64 cores | 32-64 GB | 75-100 | 40 | 10G | 2-3 minutes | Regional deployment |
| 5,000 devices | 64-128 cores | 64-128 GB | 100-150 | 40-60 | 10G | 3-5 minutes | Large regional deployment |
| 7,500 devices | 96-192 cores | 96-192 GB | 125-175 | 60 | 10G | 4-6 minutes | Multi-site deployment |
| 10,000 devices | 128-256 cores | 128-256 GB | 150-200 | 60 | 10G | 5-8 minutes | Enterprise deployment |
| Large Clusters (10,000-25,000 devices) | | | | | | | |
| 15,000 devices | 192-384 cores | 192-384 GB | 175-225 | 60-80 | 25G+ | 6-10 minutes | Large enterprise |
| 20,000 devices | 256-512 cores | 256-512 GB | 200-250 | 80 | 25G+ | 8-12 minutes | Hyperscale deployment |
| 25,000 devices | 320-640 cores | 320-640 GB | 225-275 | 80-100 | 40G+ | 10-15 minutes | Hyperscale deployment |
| Hyperscale Clusters (25,000+ devices) | | | | | | | |
| 30,000 devices | 384-768 cores | 384-768 GB | 250-300 | 100-120 | 40G+ | 12-18 minutes | Hyperscale datacenter |
| 35,000 devices | 448-896 cores | 448-896 GB | 275-325 | 120-140 | 40G+ | 15-20 minutes | Hyperscale datacenter |
| 40,000 devices | 512-1024 cores | 512 GB-1 TB | 300-350 | 140-160 | 40G+ | 18-25 minutes | Massive hyperscale |
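
One way to use the sizing table programmatically is a simple lookup that rounds a device count up to the next listed row. The sketch below is illustrative only: `SIZING_ROWS` holds a subset of the rows above, `sizing_row_for` is a hypothetical helper name, and rounding up to the next row (rather than interpolating) is an assumption chosen to err on the side of more resources.

```python
from bisect import bisect_left

# (devices, cpu_cores, memory_gb, max_workers, deployment_max_workers, mgmt_network)
# Ranges are (low, high); values are copied from a subset of the table rows above.
SIZING_ROWS = [
    (100, (4, 8), (4, 8), (30, 50), (20, 20), "1G"),
    (1_000, (16, 32), (16, 32), (50, 100), (20, 40), "1G"),
    (5_000, (64, 128), (64, 128), (100, 150), (40, 60), "10G"),
    (10_000, (128, 256), (128, 256), (150, 200), (60, 60), "10G"),
    (25_000, (320, 640), (320, 640), (225, 275), (80, 100), "40G+"),
    (40_000, (512, 1024), (512, 1024), (300, 350), (140, 160), "40G+"),
]


def sizing_row_for(device_count):
    """Return the first row whose cluster size is >= device_count."""
    if device_count < 1:
        raise ValueError("device_count must be at least 1")
    sizes = [row[0] for row in SIZING_ROWS]
    index = bisect_left(sizes, device_count)
    if index == len(SIZING_ROWS):
        raise ValueError("beyond the largest row in the table (40,000 devices)")
    return SIZING_ROWS[index]
```

For example, `sizing_row_for(3000)` returns the 5,000-device row, so a 3,000-device cluster would start from 64-128 cores and 100-150 validation workers.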
Detailed Recommendations by Scale
Small Clusters (1-1,000 devices)
Characteristics:
- Single server deployment
- Basic network infrastructure
- Development/test environments
- Device Mix: Primarily switches, some hosts/HCAs
Sizing Logic:
- CPU: 1 core per 25-50 devices
- Memory: 10-20 MB per device for topology data
- Workers: Conservative scaling to avoid device overload
- Network: 1G sufficient for small clusters
Medium Clusters (1,000-10,000 devices)
Characteristics:
- Production deployments
- 10G management networks
- Regional or multi-site deployments
- Device Mix: Mixed switches and hosts, HCAs in compute clusters
Sizing Logic:
- CPU: 1 core per 40-80 devices (better efficiency at scale)
- Memory: 8-15 MB per device (shared topology data)
- Workers: Balanced scaling considering device capacity
- Network: 10G required for concurrent processing
- Host Considerations: Hosts may respond differently than switches
Large Clusters (10,000-25,000 devices)
Characteristics:
- Enterprise-scale deployments
- High-performance requirements
- 25G+ management networks
- Device Mix: Large numbers of compute hosts + infrastructure switches
Sizing Logic:
- CPU: 1 core per 60-100 devices (enterprise efficiency)
- Memory: 5-12 MB per device (optimized topology handling)
- Workers: Approaching device overload thresholds
- Network: 25G+ to handle concurrent load
- Mixed Response: Account for different device response characteristics
Hyperscale Clusters (25,000+ devices)
Characteristics:
- Massive datacenter deployments
- Enterprise-grade hardware (for example, a 448-core server)
- 40G+ management networks
- Device Mix: Thousands of compute hosts + infrastructure switches
Sizing Logic:
- CPU: 1 core per 80-120 devices (maximum efficiency)
- Memory: 3-10 MB per device (highly optimized)
- Workers: At or near device overload limits
- Network: 40G+ essential for performance
- Device Diversity: Must handle switches, hosts, HCAs, storage devices
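
The per-tier CPU ratios in the four sections above can be turned into a quick first-pass estimate. The following is a minimal sketch, not CVT code: `estimate_cores` is a hypothetical helper, and assigning a boundary value such as 1,000 devices to the smaller tier is an assumption. The result is a rough figure and can differ from the Cluster Sizing Table, which remains the reference.

```python
import math

# (upper device bound of the tier, (fewest, most) devices per core),
# taken from the "Sizing Logic" lists above.
TIER_RATIOS = [
    (1_000, (25, 50)),       # Small clusters
    (10_000, (40, 80)),      # Medium clusters
    (25_000, (60, 100)),     # Large clusters
    (math.inf, (80, 120)),   # Hyperscale clusters
]


def estimate_cores(device_count):
    """Return (min_cores, max_cores) from the tier's devices-per-core ratio."""
    if device_count < 1:
        raise ValueError("device_count must be at least 1")
    for upper_bound, (fewest, most) in TIER_RATIOS:
        if device_count <= upper_bound:
            # More devices per core means fewer cores, so 'most' gives the minimum.
            return math.ceil(device_count / most), math.ceil(device_count / fewest)
```

For instance, `estimate_cores(5000)` gives `(63, 125)`, close to the 64-128 cores listed for 5,000 devices in the table.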
Network Bandwidth Analysis
Bandwidth Requirements by Cluster Size
| Cluster Size | Concurrent Workers | Peak Bandwidth Required | Network Recommendation | Device Types |
|---|---|---|---|---|
| 1,000 | 50 workers | ~50-100 MB/s | 1G (sufficient) | Switches + some hosts |
| 5,000 | 125 workers | ~200-400 MB/s | 1G (tight) / 10G (recommended) | Mixed switches/hosts |
| 10,000 | 175 workers | ~400-700 MB/s | 10G (required) | Balanced switches/hosts |
| 25,000 | 250 workers | ~600-900 MB/s | 10G (tight) / 25G (recommended) | Majority hosts + switches |
| 40,000 | 350 workers | ~800-1200 MB/s | 25G (minimum) / 40G (optimal) | Large compute + storage |
Bandwidth Calculation Logic:
- Per Worker: ~2-4 MB/s during active validation startup
- Peak Usage: During initial topology push to all devices
- Sustained Usage: Much lower during normal validation operation
- Burst Patterns: High bandwidth during startup, lower during monitoring
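
The per-worker figure above lends itself to a back-of-the-envelope peak estimate. The sketch below multiplies the worker count by the ~2-4 MB/s range and compares the result with a link's practical throughput supplied by the caller. Both function names are hypothetical, and treating peak load as strictly linear in worker count is a simplifying assumption, so the output can differ somewhat from the observed figures in the table.

```python
def peak_bandwidth_mb_s(workers, per_worker_mb_s=(2, 4)):
    """Return (low, high) peak bandwidth in MB/s for a given worker count."""
    low, high = per_worker_mb_s
    return workers * low, workers * high


def link_utilization(workers, practical_link_mb_s):
    """Return (low, high) fraction of the link used at peak."""
    low, high = peak_bandwidth_mb_s(workers)
    return low / practical_link_mb_s, high / practical_link_mb_s
```

With 175 workers on a 10G link at the guide's lower practical figure of 800 MB/s, `link_utilization(175, 800)` returns `(0.4375, 0.875)`, which matches the observation that 10G starts getting tight around 10,000 devices.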
Key Insights:
- Bandwidth is BURSTY: High during startup, low during validation
- 10G Limit: Starts getting tight around 10,000 devices
- 25G Sweet Spot: Good performance for 25,000-40,000 devices
- 40G Future-Proof: Optimal for large hyperscale deployments
Memory Usage Patterns
Memory Breakdown by Component
| Component | Memory per 1000 Devices | Notes |
|---|---|---|
| Topology Data | 20-40 MB | Device definitions, links, mixed switches/hosts |
| Batch Processing | 15-30 MB | Temporary data during processing |
| Connection Pools | 5-10 MB | HTTP session management |
| Results Storage | 10-20 MB | Validation results and reports |
| Device Metadata | 5-15 MB | Host-specific data, HCA mappings |
| Total | 55-115 MB | Per 1000 devices (switches + hosts) |
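
The component figures in the table above can be summed and scaled to a given cluster size. This is a minimal sketch under the assumption that each component grows linearly with device count, as the per-1000-devices framing suggests; `estimate_memory_mb` is a hypothetical helper, not part of CVT.

```python
# MB per 1,000 devices as (low, high), copied from the table above.
MEMORY_MB_PER_1000 = {
    "topology_data": (20, 40),
    "batch_processing": (15, 30),
    "connection_pools": (5, 10),
    "results_storage": (10, 20),
    "device_metadata": (5, 15),
}


def estimate_memory_mb(device_count):
    """Return (low, high) working-set estimate in MB for device_count devices."""
    scale = device_count / 1000
    low = sum(lo for lo, _ in MEMORY_MB_PER_1000.values()) * scale
    high = sum(hi for _, hi in MEMORY_MB_PER_1000.values()) * scale
    return low, high
```

For 10,000 devices this yields `(550.0, 1150.0)` MB, that is, ten times the 55-115 MB total row.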
Performance Optimization Guidelines
Quick Performance Tuning
For detailed tuning instructions and troubleshooting, see the Simple Tuning Guide which provides:
- Easy-to-follow configuration decisions
- Monitoring guidance and success criteria
- Troubleshooting common issues
- System tuning for large deployments
CPU Optimization
- Target Load: 15-25 average load during processing
- NUMA Awareness: Use dual-socket servers for 20,000+ devices
- Worker Scaling: Adjust CVT_MAX_WORKERS based on CPU cores and observed load
- Device Mix: Account for different CPU requirements of switches vs hosts
- Monitoring: If load stays low, increase workers; if timeout errors increase, reduce workers
Memory Optimization
- Batching: Use CVT_BATCHING_THRESHOLD to control memory usage on large deployments
- Connection Pooling: Automatically scales with worker count
- Garbage Collection: Monitor for large deployments (20,000+ devices)
- Device Metadata: Additional memory for host-specific data (HCA mappings, etc.)
Network Optimization
- Bandwidth Planning: Set CVT_DEPLOYMENT_MAX_WORKERS based on management network capacity
- Connection Reuse: Essential for large deployments (handled automatically)
- Bandwidth Monitoring: Watch for saturation at scale
- Device Response Variance: Hosts may respond differently than switches
- Burst Patterns: High bandwidth during startup, lower during validation operation
Device Type Considerations
Switches vs Hosts Performance Characteristics
| Device Type | Typical Response Time | Concurrent Call Limit | Special Considerations |
|---|---|---|---|
| Network Switches | 1-3 seconds | 5-10 concurrent | REST API on switch OS |
| Compute Hosts | 2-5 seconds | 3-8 concurrent | Agent on host OS, may be busier |
| Storage Devices | 1-4 seconds | 5-12 concurrent | Usually dedicated management |
| HCA Devices | 1-2 seconds | 8-15 concurrent | Lightweight agent |
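
To illustrate what respecting the per-device concurrent call limits above could look like in client code, here is a small sketch that caps in-flight calls per device with one `asyncio.Semaphore` each. It is not how CVT is implemented: `DeviceLimiter`, the device type keys, and the choice of the low end of each range as the cap are all assumptions made for this example.

```python
import asyncio

# Low end of each "Concurrent Call Limit" range above, used as a cautious cap.
CONSERVATIVE_LIMIT = {"switch": 5, "host": 3, "storage": 5, "hca": 8}


class DeviceLimiter:
    """Keeps one semaphore per device so in-flight calls never exceed its cap."""

    def __init__(self):
        self._semaphores = {}

    def _semaphore(self, device_id, device_type):
        # Create the semaphore lazily the first time a device is seen.
        if device_id not in self._semaphores:
            cap = CONSERVATIVE_LIMIT[device_type]
            self._semaphores[device_id] = asyncio.Semaphore(cap)
        return self._semaphores[device_id]

    async def call(self, device_id, device_type, request):
        """Await request() while holding one of the device's slots."""
        async with self._semaphore(device_id, device_type):
            return await request()
```

Here `request` is any zero-argument coroutine function, such as a wrapper around an HTTP call; callers beyond the cap simply wait until a slot frees up instead of overloading the device.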
Example Configurations
Small Deployment (<1,000 devices)

    # Network: 1G management interface
    # Server: 8-32 cores
    export CVT_DEPLOYMENT_MAX_WORKERS=20
    export CVT_MAX_WORKERS=50
    # CVT_BATCHING_THRESHOLD=10000 (default, no need to change)

Medium Deployment (1,000-10,000 devices)

    # Network: 10G management interface
    # Server: 64-128 cores
    export CVT_DEPLOYMENT_MAX_WORKERS=40
    export CVT_MAX_WORKERS=100
    # CVT_BATCHING_THRESHOLD=10000 (default, no need to change)

Large Deployment (10,000-30,000 devices)

    # Network: 25G+ management interface
    # Server: 192-384 cores
    export CVT_DEPLOYMENT_MAX_WORKERS=60
    export CVT_MAX_WORKERS=150
    export CVT_BATCHING_THRESHOLD=5000

Hyperscale Deployment (30,000+ devices)

    # Network: 40G+ management interface
    # Server: 448+ cores
    export CVT_DEPLOYMENT_MAX_WORKERS=80
    export CVT_MAX_WORKERS=200
    export CVT_BATCHING_THRESHOLD=3000

Configuration Guidelines
CVT_DEPLOYMENT_MAX_WORKERS (Agent Deployment):
- Based on network bandwidth and local container operations (~200MB per device image)
- Deployment includes: image fetch, save to disk, load image, container creation
- 1G network: 20 workers (~4GB concurrent + local ops)
- 10G network: 60 workers (~12GB concurrent + local ops)
- 25G+ network: 100-160 workers (higher bandwidth + parallel local operations)
CVT_MAX_WORKERS (Validation Operations):
- Based on server CPU cores and device capacity
- 8-32 cores: 30-50 workers
- 32-128 cores: 75-150 workers
- 128+ cores: 150-300 workers
- Watch for device timeout errors and reduce if needed
CVT_BATCHING_THRESHOLD (Batch Processing):
- <5,000 devices: 10000 (default, single batch)
- 5,000-20,000 devices: 5000 (light batching)
- 20,000+ devices: 3000 (aggressive batching)
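
The three guideline lists above can be condensed into a small starting-point calculator. The sketch below is an illustration rather than an official tool: `recommend_settings` and `as_exports` are hypothetical names, the low end of each worker range is used as a cautious starting value, and values that sit on a shared boundary (32 cores, 128 cores, 20,000 devices) are assigned to the larger tier by assumption.

```python
def recommend_settings(network_gbps, cpu_cores, device_count):
    """Return starting values for the three CVT variables."""
    # Deployment workers follow management network speed.
    if network_gbps >= 25:
        deployment = 100  # low end of the 100-160 range
    elif network_gbps >= 10:
        deployment = 60
    else:
        deployment = 20

    # Validation workers follow CPU cores (low end of each range).
    if cpu_cores >= 128:
        max_workers = 150
    elif cpu_cores >= 32:
        max_workers = 75
    else:
        max_workers = 30

    # Batching threshold follows cluster size.
    if device_count >= 20_000:
        threshold = 3000
    elif device_count >= 5_000:
        threshold = 5000
    else:
        threshold = 10000

    return {
        "CVT_DEPLOYMENT_MAX_WORKERS": deployment,
        "CVT_MAX_WORKERS": max_workers,
        "CVT_BATCHING_THRESHOLD": threshold,
    }


def as_exports(settings):
    """Render the settings as shell export lines."""
    return "\n".join(f"export {name}={value}" for name, value in settings.items())
```

For a 64-core server on a 10G network managing 8,000 devices, `print(as_exports(recommend_settings(10, 64, 8000)))` emits export lines for 60 deployment workers, 75 validation workers, and a batching threshold of 5000. These are starting points to be tuned against observed load and timeout errors.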
Scaling Considerations
Vertical Scaling Limits
- Single Server: Effective up to ~40,000 devices (switches + hosts)
- CPU Bound: Beyond 40,000 devices, consider distributed processing
- Memory Bound: Rarely an issue with modern servers (hosts require slightly more memory)
- Network Bound: Primary constraint for large deployments
- Device Mix: Higher host percentage may require more resources
Horizontal Scaling Options
- Multiple Collectors: Split clusters across multiple servers
- Geographic Distribution: Regional collectors for global deployments
- Load Balancing: Distribute devices across multiple validation instances
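
As a sketch of the "Multiple Collectors" option, the function below splits a device list into near-equal chunks so that no collector exceeds the ~40,000-device single-server limit mentioned above. The helper name `split_across_collectors` is hypothetical, and an even split is just one possible policy; a real deployment might instead group devices by site or region.

```python
import math

SINGLE_SERVER_LIMIT = 40_000  # "Effective up to ~40,000 devices" per this guide


def split_across_collectors(devices, per_collector_limit=SINGLE_SERVER_LIMIT):
    """Split devices into the fewest near-equal chunks within the limit."""
    if per_collector_limit < 1:
        raise ValueError("per_collector_limit must be at least 1")
    if not devices:
        return []
    collectors = math.ceil(len(devices) / per_collector_limit)
    base, extra = divmod(len(devices), collectors)
    chunks, start = [], 0
    for index in range(collectors):
        # The first 'extra' chunks take one additional device each.
        size = base + (1 if index < extra else 0)
        chunks.append(devices[start:start + size])
        start += size
    return chunks
```

With a limit of 4, `split_across_collectors(list(range(10)), 4)` returns `[[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]`: three collectors, each within the limit.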
Device Mix Impact on Sizing
Typical Cluster Compositions
| Cluster Type | Switches % | Hosts % | Notes | Sizing Impact |
|---|---|---|---|---|
| Infrastructure-Heavy | 80% | 20% | Network-focused deployment | Lower memory, higher network load |
| Compute-Heavy | 30% | 70% | HPC/AI clusters | Higher memory, variable response times |
| Balanced | 50% | 50% | Mixed enterprise deployment | Standard sizing applies |
| Storage-Heavy | 40% | 60% | Storage clusters with many storage hosts | Higher memory, faster responses |
Sizing Adjustments by Device Mix
Infrastructure-Heavy Clusters (80% switches):
- CPU: Use lower end of range
- Memory: Use lower end of range
- Workers: Can be more aggressive
- Network: Higher bandwidth needs per device
Compute-Heavy Clusters (70% hosts):
- CPU: Use higher end of range
- Memory: Use higher end of range (HCA mappings, host metadata)
- Workers: More conservative (hosts may be busier)
- Network: Variable load patterns
Balanced Clusters (50/50 mix):
- CPU: Use middle of range
- Memory: Use middle of range
- Workers: Standard recommendations apply
- Network: Standard bandwidth planning
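
The CPU and memory adjustments above amount to choosing a point within a recommended range based on the cluster's device mix. A minimal sketch, assuming the three mix labels used in this section (the helper `pick_from_range` and its label keys are hypothetical):

```python
# Position within a (low, high) range for CPU and memory sizing.
MIX_POSITION = {
    "infrastructure-heavy": 0.0,  # lower end of range
    "balanced": 0.5,              # middle of range
    "compute-heavy": 1.0,         # higher end of range
}


def pick_from_range(low, high, mix):
    """Return the CPU or memory value to use for the given device mix."""
    return low + (high - low) * MIX_POSITION[mix]
```

For a 5,000-device cluster with a 64-128 core range, `pick_from_range(64, 128, "balanced")` returns `96.0`. This mapping applies to CPU and memory only; worker counts move in the opposite direction, since compute-heavy clusters call for more conservative worker settings.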
Monitoring and Alerting
Key Metrics to Monitor
- Server Load: Target 15-25 during processing
- Memory Usage: Should stay well below allocated
- Network Utilization: Watch for bandwidth saturation (especially during agent deployment)
- Device Response Times: Primary performance indicator
- Error Rates: Timeout and connection errors
- Validation Completion Time: Compare against expected times in sizing table
Success Indicators
Good Signs (can increase CVT_MAX_WORKERS):
- ✅ Server load increases during validation (better CPU utilization)
- ✅ Validation completes faster than baseline
- ✅ No significant increase in timeout errors
- ✅ Network bandwidth stays below 80%
Warning Signs (reduce CVT_MAX_WORKERS):
- ⚠️ Many "Timeout while trying to start validation" errors
- ⚠️ "Connection refused" errors from devices
- ⚠️ Server load stays low (underutilization; without errors, this points to increasing workers instead)
- ⚠️ Network bandwidth hits 90%+
Scaling Triggers
- Scale Up Workers: Load < 10, low error rates, fast completion
- Scale Down Workers: High error rates, device timeouts, network saturation
- Increase Deployment Workers: Network utilization < 50% during agent deployment
- Decrease Deployment Workers: Network bandwidth > 80% during agent deployment
- Adjust Batching: Memory usage > 80% (reduce CVT_BATCHING_THRESHOLD)
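
The scaling triggers above can be expressed as a simple decision function over observed metrics. The sketch below is illustrative only. The guide gives no numeric definition of "low" or "high" error rates, so the 1% and 5% cut-offs are assumptions, as are the `Metrics` fields and the use of 90% network utilization as the saturation point. The "fast completion" criterion is left out because it needs a baseline to compare against.

```python
from dataclasses import dataclass

LOW_ERROR_RATE = 0.01   # assumption: the guide only says "low error rates"
HIGH_ERROR_RATE = 0.05  # assumption: the guide only says "high error rates"


@dataclass
class Metrics:
    load_average: float
    error_rate: float           # fraction of device calls that time out or fail
    network_utilization: float  # fraction of the management link in use
    memory_utilization: float   # fraction of allocated memory in use
    deploying_agents: bool = False


def tuning_actions(m):
    """Return a list of suggested adjustments for the observed metrics."""
    actions = []
    if m.error_rate >= HIGH_ERROR_RATE or m.network_utilization >= 0.9:
        actions.append("decrease CVT_MAX_WORKERS")
    elif m.load_average < 10 and m.error_rate <= LOW_ERROR_RATE:
        actions.append("increase CVT_MAX_WORKERS")
    if m.deploying_agents:
        if m.network_utilization > 0.8:
            actions.append("decrease CVT_DEPLOYMENT_MAX_WORKERS")
        elif m.network_utilization < 0.5:
            actions.append("increase CVT_DEPLOYMENT_MAX_WORKERS")
    if m.memory_utilization > 0.8:
        actions.append("reduce CVT_BATCHING_THRESHOLD")
    return actions
```

An idle server with no errors (`load_average=6`, `error_rate=0`) yields a single suggestion to increase CVT_MAX_WORKERS, while a run with 10% errors and memory above 80% suggests both fewer workers and a lower batching threshold.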
Performance Expectations by Cluster Size
| Cluster Size | Expected Validation Time | Expected Load Average | Target Worker Count |
|---|---|---|---|
| 1,000 devices | 1-3 minutes | 8-12 | 50-100 |
| 5,000 devices | 3-5 minutes | 12-18 | 100-150 |
| 10,000 devices | 5-8 minutes | 15-22 | 150-200 |
| 25,000 devices | 10-15 minutes | 18-25 | 225-275 |
| 40,000 devices | 18-25 minutes | 20-28 | 300-350 |
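
A measured validation time can be checked against the expectations in the table above with a small helper. This is a sketch only: `compare_to_expected` is a hypothetical name, and comparing against the next larger listed cluster size (rather than interpolating between rows) is an assumption.

```python
# Expected validation time in minutes as (low, high), from the table above.
EXPECTED_MINUTES = {
    1_000: (1, 3),
    5_000: (3, 5),
    10_000: (5, 8),
    25_000: (10, 15),
    40_000: (18, 25),
}


def compare_to_expected(device_count, measured_minutes):
    """Classify a measured run against the nearest row at or above device_count."""
    for size in sorted(EXPECTED_MINUTES):
        if device_count <= size:
            low, high = EXPECTED_MINUTES[size]
            if measured_minutes < low:
                return "faster than expected"
            if measured_minutes > high:
                return "slower than expected"
            return "within expected range"
    raise ValueError("beyond the largest cluster size in the table")
```

For example, an 8,000-device run that takes 9.5 minutes is compared with the 10,000-device row (5-8 minutes) and reported as slower than expected, which would be a cue to review worker settings and error rates.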
Document Relationship
This Cluster Sizing Guide provides comprehensive hardware and infrastructure planning:
- CPU, memory, and network capacity planning
- Expected performance at different scales
- Device type considerations (switches, hosts, HCAs)
- Detailed sizing methodology
For day-to-day performance tuning, refer to the Simple Tuning Guide:
- Simple 3-variable configuration
- Quick-start configurations by deployment size
- Troubleshooting and monitoring guidance
- Practical tuning adjustments
Last updated: