Implement provisioner auto-scaling and capacity guidelines #8

@blink-so

Problem

During the Sept 30 workshop, provisioner capacity (6 replicas for the default org) became a bottleneck when ~10 users deployed workspaces simultaneously. Provisioners are critical for workspace create/delete/update operations, and insufficient capacity causes timeouts and a poor user experience.

Context

Current State:

  • Default org: 6 replicas @ 500m CPU / 512 MB memory each
  • Experimental org: 2 replicas
  • Demo org: 2 replicas
  • Manual scaling required before workshops

Current Limitations:

  • Terraform runs are single-threaded (1 provisioner = 1 concurrent operation)
  • Each workspace create/delete/update occupies 1 provisioner
  • No auto-scaling based on queue depth
  • No clear guidelines on when to scale
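The one-job-per-provisioner limit makes the bottleneck easy to estimate. The sketch below is a rough model only: it assumes every job takes the same time, and the 3-minute job duration in the usage example is a hypothetical figure, not a measurement from the workshop.

```python
import math


def queue_waves(concurrent_jobs: int, replicas: int) -> int:
    """Number of sequential 'waves' needed when each replica runs one job at a time."""
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    if concurrent_jobs < 0:
        raise ValueError("concurrent_jobs must not be negative")
    return math.ceil(concurrent_jobs / replicas)


def worst_case_wait_minutes(concurrent_jobs: int, replicas: int, job_minutes: float) -> float:
    """Time the last job waits before it starts, assuming uniform job duration."""
    waves = queue_waves(concurrent_jobs, replicas)
    return max(waves - 1, 0) * job_minutes


if __name__ == "__main__":
    # 10 simultaneous workspace builds on 6 replicas need two waves.
    print(queue_waves(10, 6))
    # Hypothetical 3-minute Terraform run: the second wave waits one full run.
    print(worst_case_wait_minutes(10, 6, 3.0))
```

With 10 jobs and 6 replicas, 4 jobs wait for a full Terraform run before they even start, which is consistent with the delays seen at ~10 users.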

Requirements

Capacity Planning Guidelines

  • Document scaling recommendations:
    • <10 concurrent users: 6 replicas (current default)
    • 10-15 concurrent users: 8 replicas
    • 15-20 concurrent users: 10 replicas
    • 20-30 concurrent users: 12-15 replicas
  • Add to pre-workshop checklist (Create pre-workshop validation checklist and runbook #4)
  • Add to workshop planning guide
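The tiers above could be encoded as a small lookup for the planning guide or a checklist script. This is a sketch, not an agreed policy: the tiers share boundary values (15 and 20 each appear in two ranges), so it assumes a boundary value belongs to the larger tier, and it treats anything above 30 users as outside the guidelines.

```python
from typing import List, Tuple

# (exclusive upper bound on concurrent users, (min replicas, max replicas))
GUIDELINE_TIERS: List[Tuple[int, Tuple[int, int]]] = [
    (10, (6, 6)),    # <10 users: current default
    (15, (8, 8)),    # 10-15 users
    (20, (10, 10)),  # 15-20 users
]
TOP_TIER_MAX_USERS = 30
TOP_TIER = (12, 15)  # 20-30 users


def recommended_replicas(concurrent_users: int) -> Tuple[int, int]:
    """Return the (min, max) replica recommendation for the default org."""
    if concurrent_users < 0:
        raise ValueError("concurrent_users must not be negative")
    for upper_bound, replica_range in GUIDELINE_TIERS:
        if concurrent_users < upper_bound:
            return replica_range
    if concurrent_users <= TOP_TIER_MAX_USERS:
        return TOP_TIER
    raise ValueError("guidelines do not cover more than 30 concurrent users")


if __name__ == "__main__":
    for users in (8, 10, 15, 25):
        print(users, recommended_replicas(users))
```

Assigning boundary values to the larger tier errs on the side of spare capacity; the opposite choice is equally defensible and only changes the comparison operators.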

Manual Scaling Procedures

  • Document commands for scaling each org's provisioners:
    kubectl scale deployment coder-provisioner-default -n coder --replicas=10
    kubectl scale deployment coder-provisioner-experimental -n coder --replicas=4
    kubectl scale deployment coder-provisioner-demo -n coder --replicas=4
  • Add to incident runbook (Create pre-workshop validation checklist and runbook #4) ✅ (already added)
  • Create pre-workshop scaling checklist item
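For the checklist item, the commands above could be generated from a single table of target replica counts. The helper below is a hypothetical sketch: it assumes every org's deployment follows the `coder-provisioner-<org>` naming seen in the commands above, and it only builds the argument lists rather than running them.

```python
from typing import Dict, List

NAMESPACE = "coder"


def scale_commands(replicas_by_org: Dict[str, int]) -> List[List[str]]:
    """Build one `kubectl scale` argument list per org."""
    commands = []
    for org, replicas in replicas_by_org.items():
        if replicas < 0:
            raise ValueError(f"replica count for {org!r} must not be negative")
        commands.append([
            "kubectl", "scale", "deployment", f"coder-provisioner-{org}",
            "-n", NAMESPACE, f"--replicas={replicas}",
        ])
    return commands


if __name__ == "__main__":
    targets = {"default": 10, "experimental": 4, "demo": 4}
    for command in scale_commands(targets):
        # Print for review; pass each list to subprocess.run(command, check=True) to apply.
        print(" ".join(command))
```

Printing the commands first keeps a human in the loop, which fits a pre-workshop checklist better than scaling silently.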

Auto-Scaling Implementation (Long-term)

  • Investigate Horizontal Pod Autoscaler (HPA) for provisioners
  • Define custom metrics for provisioner queue depth
  • Implement HPA based on:
    • Provisioner queue depth
    • CPU/memory utilization
    • Active Terraform jobs
  • Test auto-scaling behavior under load
  • Document auto-scaling configuration in Terraform
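To reason about how a queue-depth target would behave, it helps to model the HPA's core calculation: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min/max bounds. The sketch below is a simplified model only. The real controller also accounts for unready pods and stabilization windows and takes the largest result across metrics. The "pending jobs per replica" metric and its target of 1 are assumptions for illustration.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int = 6,
    max_replicas: int = 15,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA calculation for a single metric."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current size.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical metric: average pending jobs per replica, target of 1.
    print(desired_replicas(6, 2.0, 1.0))  # queue twice the target
    print(desired_replicas(6, 5.0, 1.0))  # large backlog, capped at max_replicas
    print(desired_replicas(6, 0.5, 1.0))  # quiet period, held at min_replicas
```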

Resource Limit Optimization

  • Evaluate whether 500m CPU / 512 MB is sufficient
  • Monitor for OOMKilled or CPU throttling events
  • Consider increasing to 1 CPU / 1 GB if needed
  • Document resource limit adjustment procedure

Monitoring & Alerting

  • Add metrics for:
    • Provisioner queue depth
    • Provisioner job duration (p50, p95, p99)
    • Provisioner failure rate
    • Number of active provisioner replicas
  • Alert when:
    • Queue depth > 5 jobs for > 2 minutes
    • Provisioner failure rate > 5%
    • Average job duration > 5 minutes
  • Add to monitoring dashboard (Implement comprehensive resource monitoring and alerting #6)
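The alert conditions above can be prototyped before they are written as real alerting rules. The sketch below only illustrates the threshold logic. The sample format (time-ordered `(timestamp_seconds, depth)` pairs) and the function names are assumptions, and in production these checks would more likely live in the monitoring system than in application code.

```python
from typing import List, Sequence, Tuple


def queue_depth_alert(
    samples: Sequence[Tuple[float, int]],
    threshold: int = 5,
    hold_seconds: float = 120.0,
) -> bool:
    """True if depth has stayed above threshold for more than hold_seconds up to the latest sample."""
    breach_start = None
    for timestamp, depth in samples:
        if depth > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # any dip resets the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > hold_seconds


def failure_rate_alert(failed: int, total: int, max_rate: float = 0.05) -> bool:
    """True if more than 5% of provisioner jobs failed."""
    return total > 0 and failed / total > max_rate


def duration_alert(durations_seconds: List[float], max_average: float = 300.0) -> bool:
    """True if the average job duration exceeds 5 minutes."""
    if not durations_seconds:
        return False
    return sum(durations_seconds) / len(durations_seconds) > max_average


if __name__ == "__main__":
    print(queue_depth_alert([(0, 6), (60, 7), (121, 8)]))
    print(failure_rate_alert(failed=6, total=100))
    print(duration_alert([200.0, 500.0]))
```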

Success Criteria

  • Clear scaling guidelines available for workshop planning
  • Manual scaling can be performed in <2 minutes
  • (Long-term) Auto-scaling triggers before users experience delays
  • Zero workspace timeouts due to provisioner capacity during workshops
  • Provisioner resource usage optimized (no OOMKills)

Implementation Notes

HPA Example:

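apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coder-provisioner-default-hpa
  namespace: coder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coder-provisioner-default
  # Bounds match the capacity guidelines: current default up to the 20-30 user tier.
  minReplicas: 6
  maxReplicas: 15
  # Utilization targets are a percentage of each container's resource *requests*,
  # so the provisioner pods must set CPU and memory requests for this to work.
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80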

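Queue Depth Metric (requires custom implementation):

  • The Coder API may expose the provisioner job queue (to be confirmed)
  • Export the queue depth as a Prometheus metric
  • Expose that metric to the HPA through a metrics adapter (e.g. Prometheus Adapter) and use it for scaling decisions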
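A minimal sketch of the export step, assuming the job list has already been fetched from somewhere. The job dictionaries (`org` and `status` keys), the `"pending"` status value, and the metric name `coder_provisioner_queue_depth` are all hypothetical; none of them are taken from the Coder API. The function renders the Prometheus text exposition format using only the standard library.

```python
from collections import Counter
from typing import Dict, Iterable

METRIC_NAME = "coder_provisioner_queue_depth"  # hypothetical metric name


def render_queue_depth(jobs: Iterable[Dict[str, str]]) -> str:
    """Render pending-job counts per org in Prometheus text exposition format."""
    pending = Counter(job["org"] for job in jobs if job["status"] == "pending")
    lines = [
        f"# HELP {METRIC_NAME} Provisioner jobs waiting for a free provisioner.",
        f"# TYPE {METRIC_NAME} gauge",
    ]
    for org in sorted(pending):
        lines.append(f'{METRIC_NAME}{{org="{org}"}} {pending[org]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job records; the real field names depend on the data source.
    sample_jobs = [
        {"org": "default", "status": "pending"},
        {"org": "default", "status": "running"},
        {"org": "demo", "status": "pending"},
        {"org": "default", "status": "pending"},
    ]
    print(render_queue_depth(sample_jobs), end="")
```

Orgs with no pending jobs produce no sample line in this sketch; a real exporter would probably emit an explicit 0 so that alerts and the HPA see a continuous series.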

Related

Sept 30 Workshop Postmortem
Incident Runbook - High Resource Contention
Incident Runbook - Provisioner Failures
#1 (Storage optimization)
#6 (Monitoring and alerting)
