Problem
During the Sept 30 workshop, provisioner capacity (6 replicas for the default org) became a bottleneck when ~10 users deployed workspaces simultaneously. Each provisioner handles one operation at a time, so at most 6 of those deployments could run concurrently and the rest had to queue. Provisioners are required for every workspace create/delete/update operation, so insufficient capacity causes timeouts and a poor user experience.
Context
Current State:
- Default org: 6 replicas @ 500m CPU / 512 MB memory each
- Experimental org: 2 replicas
- Demo org: 2 replicas
- Manual scaling required before workshops
Current Limitations:
- Each provisioner replica processes one Terraform run at a time (1 provisioner = 1 concurrent operation)
- Each workspace create/delete/update occupies 1 provisioner
- No auto-scaling based on queue depth
- No clear guidelines on when to scale
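The one-operation-per-provisioner limit above can be turned into a rough queueing estimate. The sketch below is only an illustration: it assumes every build takes the same time, and the 180-second build duration is a hypothetical placeholder, not a measured value from the workshop.

```python
import math


def estimate_build_waves(concurrent_requests: int, replicas: int) -> int:
    """Number of sequential 'waves' of builds when each replica runs one job at a time."""
    if concurrent_requests < 0 or replicas <= 0:
        raise ValueError("concurrent_requests must be >= 0 and replicas must be > 0")
    return math.ceil(concurrent_requests / replicas)


def worst_case_wait_seconds(concurrent_requests: int, replicas: int,
                            build_seconds: float) -> float:
    """Time the last request waits before its build starts, assuming equal build times."""
    waves = estimate_build_waves(concurrent_requests, replicas)
    return max(waves - 1, 0) * build_seconds


if __name__ == "__main__":
    # Workshop scenario: ~10 simultaneous deployments on 6 replicas.
    print(estimate_build_waves(10, 6))            # 2 waves
    # 180 s per build is a hypothetical figure used only for illustration.
    print(worst_case_wait_seconds(10, 6, 180.0))  # 180.0
```

Under these assumptions, 4 of the 10 users wait a full build cycle before their own build even starts, which is consistent with the timeouts described in the Problem section.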
Requirements
Capacity Planning Guidelines
Manual Scaling Procedures
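As a starting point for a manual procedure, the sketch below wraps `kubectl scale` in a small Python helper. It assumes the provisioners run as a Deployment named `coder-provisioner-default` in the `coder` namespace (the names used in the HPA example in this issue); the helper names themselves are hypothetical. Two caveats apply: an active HPA on the same Deployment will override a manual replica count, and a later Helm upgrade may reset it if replicas are set in chart values.

```python
import subprocess
from typing import List


def build_scale_command(deployment: str, replicas: int,
                        namespace: str = "coder") -> List[str]:
    """Build the kubectl argv that sets the replica count of a Deployment."""
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}",
        "--namespace", namespace,
    ]


def scale(deployment: str, replicas: int, namespace: str = "coder",
          dry_run: bool = True) -> List[str]:
    """Print the command (dry run) or execute it against the current kube context."""
    cmd = build_scale_command(deployment, replicas, namespace)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Dry run by default so the command can be reviewed before a workshop.
    scale("coder-provisioner-default", 10)
```

Keeping the dry run as the default lets the command be pasted into a runbook and reviewed before it is run against the cluster.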
Auto-Scaling Implementation (Long-term)
Resource Limit Optimization
Monitoring & Alerting
Success Criteria
- Clear scaling guidelines available for workshop planning
- Manual scaling can be performed in <2 minutes
- (Long-term) Auto-scaling triggers before users experience delays
- Zero workspace timeouts due to provisioner capacity during workshops
- Provisioner resource usage optimized (no OOMKills)
Implementation Notes
HPA Example:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coder-provisioner-default-hpa
  namespace: coder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coder-provisioner-default
  minReplicas: 6
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
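To reason about how this HPA would behave, it can help to work through the scaling formula from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), evaluated per metric, with the largest result winning. The sketch below is a simplified model of that rule only. It ignores stabilization windows, pod readiness, and missing metrics, and the utilization figures in the usage example are made up for illustration.

```python
import math
from typing import Iterable, Tuple


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule for one metric, clamped to [min_replicas, max_replicas]."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # within tolerance: no scaling action
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def hpa_decision(current_replicas: int, metrics: Iterable[Tuple[float, float]],
                 min_replicas: int, max_replicas: int) -> int:
    """With several metrics, the HPA uses the largest proposed replica count."""
    return max(
        desired_replicas(current_replicas, current, target, min_replicas, max_replicas)
        for current, target in metrics
    )


if __name__ == "__main__":
    # Hypothetical readings: CPU at 95% (target 70), memory at 60% (target 80).
    print(hpa_decision(6, [(95, 70), (60, 80)], min_replicas=6, max_replicas=15))  # 9
```

Note that utilization targets are computed against each container's resource requests, so the provisioner Deployment needs CPU and memory requests set for these metrics to work.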
Queue Depth Metric (requires custom implementation):
- Coder API may expose provisioner job queue
- Export as Prometheus metric
- Use for HPA scaling decisions
Related
Sept 30 Workshop Postmortem
Incident Runbook - High Resource Contention
Incident Runbook - Provisioner Failures
#1 (Storage optimization)
#6 (Monitoring and alerting)
Capacity Planning Guidelines
- <10 concurrent users: 6 replicas (current default)
- 10-15 concurrent users: 8 replicas
- 15-20 concurrent users: 10 replicas
- 20-30 concurrent users: 12-15 replicas
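The replica tiers above can be encoded as a small lookup for workshop planning. This is a sketch under two assumptions that the guidelines do not state: where tiers share a boundary (15 and 20 users), the larger tier is chosen, and anything above 30 users is treated as out of scope rather than extrapolated. The function name is hypothetical.

```python
from typing import List, Tuple

# (exclusive upper bound on concurrent users, (min replicas, max replicas))
_TIERS: List[Tuple[int, Tuple[int, int]]] = [
    (10, (6, 6)),    # <10 users: current default
    (15, (8, 8)),    # 10-14 users
    (20, (10, 10)),  # 15-19 users (boundary assigned to the larger tier)
]
_TOP_TIER_MAX_USERS = 30
_TOP_TIER_REPLICAS = (12, 15)  # 20-30 users


def recommended_replicas(concurrent_users: int) -> Tuple[int, int]:
    """Return the (min, max) replica recommendation for the default org."""
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be >= 0")
    for upper_bound, replica_range in _TIERS:
        if concurrent_users < upper_bound:
            return replica_range
    if concurrent_users <= _TOP_TIER_MAX_USERS:
        return _TOP_TIER_REPLICAS
    raise ValueError("guidelines do not cover more than 30 concurrent users")


if __name__ == "__main__":
    for users in (8, 12, 15, 25):
        print(users, recommended_replicas(users))
```

Returning a range rather than a single number keeps the 12-15 replica tier intact, leaving the final choice to whoever plans the workshop.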