Skip to content

UpgradeCRDs adds a hard-coded 30s sleep and re-applies all CRDs on every reconcile (including no-ops / drift-detection sync) #1843

@vikramhh

Description

@vikramhh

Summary

When a Helm chart is deployed with HelmChart.Options.UpgradeOptions.UpgradeCRDs: true, the addon-controller
re-applies the chart's entire crds/ folder and then does an unconditional time.Sleep(30 * time.Second)
on every upgrade — even when nothing has changed. Combined with SyncModeContinuousWithDriftDetection
(where shouldUpgrade always returns true), this means each UpgradeCRDs: true chart pays a flat 30s on
every reconcile, regardless of whether the chart version, values, or the CRDs themselves changed.

When several such charts are deployed by the same profile, their upgrades are serialized, so the penalty is
roughly 30s × (number of UpgradeCRDs charts) added to each reconcile/rollout.

Affected versions

Observed on v1.1.1; the same code is present on the latest release v1.11.1 and current main
(controllers/handlers_helm.go). Not fixed.

Details

1. The 30s sleep is unconditional and counts no-op patches

upgradeCRDs (in controllers/handlers_helm.go):

// upgradeCRDsInFile
_, err = dr.Patch(ctx, crds[i].GetName(), types.ApplyPatchType, crdData, options)
if err != nil { return upgradedCRDs, err }
upgradedCRDs += 1            // counts EVERY patch, not actual changes

// upgradeCRDs
if upgradedCRDs > 0 {
    const waitForCRDs = 30
    time.Sleep(waitForCRDs * time.Second)   // blind fixed sleep
}
  • The counter increments for every CRD server-side-applied, not for CRDs that actually changed. An SSA
    patch of an identical CRD is a server-side no-op but still increments upgradedCRDs.
  • So whenever the chart ships any CRD under crds/, upgradedCRDs > 0 is always true and the code sleeps a
    fixed 30 seconds — even when no CRD was modified.
  • The sleep is a fixed duration rather than a wait on each CRD's Established/NamesAccepted condition,
    which normally settles in well under a second.

2. In drift-detection sync mode, this runs on every reconcile

shouldUpgrade short-circuits for SyncModeContinuousWithDriftDetection:

func shouldUpgrade(...) bool {
    if requestedChart.HelmChartAction == HelmChartActionUninstall { return false }
    if clusterSummary...SyncMode != SyncModeContinuousWithDriftDetection {
        // ... compare values hash; compare installed vs requested chart version;
        //     return false when unchanged ...
    }
    // With drift detection mode, there is reconciliation due to configuration drift even
    // when version is same. So skip this check in SyncModeContinuousWithDriftDetection
    return true
}

The upgrade path is handleChart → handleUpgrade → doUpgradeRelease → upgradeRelease, and upgradeRelease
calls upgradeCRDs before the helm upgrade. Because shouldUpgrade returns true on every reconcile in
drift mode, upgradeCRDs (and its 30s sleep) fires on every reconcile for every UpgradeCRDs: true chart —
not only when the chart version/values changed.

(First install is unaffected: the install path handleInstall → doInstallRelease → installRelease uses
Helm's install action, which applies crds/ natively and never calls upgradeCRDs. The cost only appears
on the 2nd and subsequent reconciles, which take the upgrade path.)

Reproduction

  1. Deploy a chart that ships CRDs in crds/ (e.g. node-feature-discovery, kube-prometheus-stack,
    envoy-gateway) via a ClusterProfile/Profile with syncMode: ContinuousWithDriftDetection and
    helmCharts[].options.upgradeOptions.upgradeCRDs: true.
  2. Let it install (rev1) — fast.
  3. Trigger any reconcile of the ClusterSummary (e.g. change an unrelated chart's values, or just let drift
    reconciliation run). No change to this chart is required.
  4. Observe: the chart is upgraded again, its crds/ are re-applied via SSA (no-op), and the rollout stalls
    ~30s on this chart. With N such charts under one profile the stalls serialize to ~30s × N.

Measured impact (downstream)

In a downstream product, three UpgradeCRDs: true charts sit in a serialized addon rollout chain. A
no-op second apply stretched the chain from ~35s to ~127–133s — i.e. roughly +30s × 3 ≈ +90s — entirely
attributable to these sleeps, with no CRD actually changing.

Suggested fixes (any one helps; ideally all three)

  1. Replace the fixed time.Sleep(30s) with a readiness wait — poll each applied CRD's
    Established (and NamesAccepted) condition with a timeout, instead of sleeping a flat 30s. This is
    typically sub-second.
  2. Only count CRDs that actually changed — inspect the SSA patch response (e.g. compare
    metadata.generation/resourceVersion before vs after, or use a server-side dry-run/diff) and skip the
    wait entirely when nothing changed. A no-op re-apply should not cost 30s.
  3. Gate CRD re-apply on a real chart-version change — even in drift mode, only re-apply crds/ when the
    installed chart version differs from the requested one, rather than on every drift reconcile.

Environment

  • projectsveltos addon-controller v1.1.1 (also reproduced by inspection on v1.11.1 / main)
  • Sync mode: ContinuousWithDriftDetection
  • Charts deployed via k0rdent/kcm forwarding ServiceHelmOptions.UpgradeOptions = {UpgradeCRDs: true}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions