UpgradeCRDs adds a hard-coded 30s sleep and re-applies all CRDs on every reconcile (including no-ops / drift-detection sync)

## Summary

When a Helm chart is deployed with `HelmChart.Options.UpgradeOptions.UpgradeCRDs: true`, the addon-controller
re-applies the chart's entire `crds/` folder and then does an unconditional **`time.Sleep(30 * time.Second)`**
on *every* upgrade — even when nothing has changed. Combined with `SyncModeContinuousWithDriftDetection`
(where `shouldUpgrade` always returns `true`), this means each `UpgradeCRDs: true` chart pays a flat 30s on
**every reconcile**, regardless of whether the chart version, values, or the CRDs themselves changed.

When several such charts are deployed by the same profile, their upgrades are serialized, so the penalty is
roughly **30s × (number of `UpgradeCRDs` charts)** added to each reconcile/rollout.

## Affected versions

Observed on **v1.1.1**; the same code is present on the latest release **v1.11.1** and current `main`
(`controllers/handlers_helm.go`). Not fixed.

## Details

### 1. The 30s sleep is unconditional and counts no-op patches

`upgradeCRDs` (in `controllers/handlers_helm.go`):

```go
// upgradeCRDsInFile
_, err = dr.Patch(ctx, crds[i].GetName(), types.ApplyPatchType, crdData, options)
if err != nil { return upgradedCRDs, err }
upgradedCRDs += 1            // counts EVERY patch, not actual changes

// upgradeCRDs
if upgradedCRDs > 0 {
    const waitForCRDs = 30
    time.Sleep(waitForCRDs * time.Second)   // blind fixed sleep
}
```

- The counter increments for **every** CRD server-side-applied, not for CRDs that actually changed. An SSA
  patch of an identical CRD is a server-side no-op but still increments `upgradedCRDs`.
- So whenever the chart ships any CRD under `crds/`, `upgradedCRDs > 0` is always true and the code sleeps a
  fixed 30 seconds — even when no CRD was modified.
- The sleep is a fixed duration rather than a wait on each CRD's `Established`/`NamesAccepted` condition,
  which normally settles in well under a second.

### 2. In drift-detection sync mode, this runs on every reconcile

`shouldUpgrade` short-circuits for `SyncModeContinuousWithDriftDetection`:

```go
func shouldUpgrade(...) bool {
    if requestedChart.HelmChartAction == HelmChartActionUninstall { return false }
    if clusterSummary...SyncMode != SyncModeContinuousWithDriftDetection {
        // ... compare values hash; compare installed vs requested chart version;
        //     return false when unchanged ...
    }
    // With drift detection mode, there is reconciliation due to configuration drift even
    // when version is same. So skip this check in SyncModeContinuousWithDriftDetection
    return true
}
```

The upgrade path is `handleChart → handleUpgrade → doUpgradeRelease → upgradeRelease`, and `upgradeRelease`
calls `upgradeCRDs` *before* the helm upgrade. Because `shouldUpgrade` returns `true` on every reconcile in
drift mode, `upgradeCRDs` (and its 30s sleep) fires on every reconcile for every `UpgradeCRDs: true` chart —
not only when the chart version/values changed.

(First install is unaffected: the install path `handleInstall → doInstallRelease → installRelease` uses
Helm's install action, which applies `crds/` natively and never calls `upgradeCRDs`. The cost only appears
on the 2nd and subsequent reconciles, which take the upgrade path.)

## Reproduction

1. Deploy a chart that ships CRDs in `crds/` (e.g. node-feature-discovery, kube-prometheus-stack,
   envoy-gateway) via a ClusterProfile/Profile with `syncMode: ContinuousWithDriftDetection` and
   `helmCharts[].options.upgradeOptions.upgradeCRDs: true`.
2. Let it install (rev1) — fast.
3. Trigger any reconcile of the ClusterSummary (e.g. change an unrelated chart's values, or just let drift
   reconciliation run). **No change to this chart is required.**
4. Observe: the chart is upgraded again, its `crds/` are re-applied via SSA (no-op), and the rollout stalls
   ~30s on this chart. With N such charts under one profile the stalls serialize to ~30s × N.

### Measured impact (downstream)

In a downstream product, three `UpgradeCRDs: true` charts sit in a serialized addon rollout chain. A
no-op second apply stretched the chain from ~35s to ~127–133s — i.e. roughly +30s × 3 ≈ +90s — entirely
attributable to these sleeps, with no CRD actually changing.

## Suggested fixes (any one helps; ideally all three)

1. **Replace the fixed `time.Sleep(30s)` with a readiness wait** — poll each applied CRD's
   `Established` (and `NamesAccepted`) condition with a timeout, instead of sleeping a flat 30s. This is
   typically sub-second.
2. **Only count CRDs that actually changed** — inspect the SSA patch response (e.g. compare
   `metadata.generation`/`resourceVersion` before vs after, or use a server-side dry-run/diff) and skip the
   wait entirely when nothing changed. A no-op re-apply should not cost 30s.
3. **Gate CRD re-apply on a real chart-version change** — even in drift mode, only re-apply `crds/` when the
   installed chart version differs from the requested one, rather than on every drift reconcile.

## Environment

- projectsveltos addon-controller v1.1.1 (also reproduced by inspection on v1.11.1 / `main`)
- Sync mode: `ContinuousWithDriftDetection`
- Charts deployed via k0rdent/kcm forwarding `ServiceHelmOptions.UpgradeOptions = {UpgradeCRDs: true}`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UpgradeCRDs adds a hard-coded 30s sleep and re-applies all CRDs on every reconcile (including no-ops / drift-detection sync) #1843

Summary

Affected versions

Details

1. The 30s sleep is unconditional and counts no-op patches

2. In drift-detection sync mode, this runs on every reconcile

Reproduction

Measured impact (downstream)

Suggested fixes (any one helps; ideally all three)

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

UpgradeCRDs adds a hard-coded 30s sleep and re-applies all CRDs on every reconcile (including no-ops / drift-detection sync) #1843

Description

Summary

Affected versions

Details

1. The 30s sleep is unconditional and counts no-op patches

2. In drift-detection sync mode, this runs on every reconcile

Reproduction

Measured impact (downstream)

Suggested fixes (any one helps; ideally all three)

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions