feat(worker): audit-only orphan-customer-DB / redis-namespace sweep (flag-gated OFF) #102

Merged
mastermanas805 merged 2 commits into master from feat/orphan-db-sweep-audit-only
Jun 8, 2026

@mastermanas805

What

A new River periodic job orphan_db_sweep (internal/jobs/orphan_db_sweep.go) that addresses the ~25 orphaned customer DB / redis-namespace drain-backlog — in DETECTION / DRY-RUN mode only. It lists instant-customer-* namespaces, flags the ones whose token has no non-terminal (pending/active/paused/suspended) resources row and is past the provisioning grace window, then LOGS each candidate (token masked via logsafe.Token) + emits candidate metrics. It drops nothing in audit-only mode.

truehomie-2026-06-03 safety — NO destructive drop runs by default

On 2026-06-03 an active Pro customer's DB + role were dropped by an unidentified, unaudited path. Accordingly:

  • There is no manual / raw DROP anywhere in this job.
  • The actual deprovision sits behind a second, destructive flag and, when (and only when) enabled, routes through the existing audited provisioner DeprovisionResource chokepoint — the same path the TTL reaper (expire.go) uses.
  • For this PR the destructive path is wired but intentionally unreachable-by-default: we ship audit-only, review the dry-run candidate list, and only later (a deliberate operator action) consider lighting the destructive flag.

How it differs from orphan_sweep_reconciler.go PASS 4

PASS 4 already lists instant-customer-* namespaces and deletes the orphans immediately as a side-effect of a large reconciler. This job is a separate, conservative, observability-first surface: hourly, audit-only by default, flag-gated, with per-kind candidate metrics so we can measure the backlog and review the dry-run list before any reclamation.

Flag gates — both default OFF / fail-closed

| Env var | Default | Effect |
| --- | --- | --- |
| ORPHAN_DB_SWEEP_ENABLED | false | Master flag. Off → Work is a DEBUG no-op: no namespace List, no DB read, no metric. |
| ORPHAN_DB_SWEEP_DESTRUCTIVE_ENABLED | false | Destructive flag. Meaningless unless the master is also on and a provisioner is wired. When all true, a confirmed orphan routes through the audited DeprovisionResource chokepoint only. |

destructiveArmed() = Enabled && DestructiveEnabled && provisioner != nil (defense-in-depth).

Fail-safe / fail-open

  • Namespace-List error or live-token DB-read error → zero candidates (never an empty live-token set that a destructive caller could read as "drop everything").
  • Candidate whose token reappears live at the destructive re-confirm → skipped.
  • A generic (unmapped-kind) orphan has no proven backing type → skipped (no guessed DROP). When in doubt, skip + log.

Metrics (rule 25 — infra follow-up needed)

This repo does not own infra/, so the alert + dashboard tile + METRICS-CATALOG.md row are an explicit follow-up. Exact metric names:

  • instant_orphan_db_sweep_candidates_total{kind}: counter; kind in {customer_namespace, redis_namespace}. One increment per detected orphan candidate.
  • instant_orphan_db_sweep_candidates_current{kind}: gauge; the orphan-candidate count observed by the most recent tick (falls to 0 when the backlog drains).

Both are lazy *Vec; both kind label values are primed in metrics_test.go so the tiles render from process start.
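
To make the counter-versus-gauge semantics and the priming concrete, here is a stdlib-only sketch. It deliberately does not use a real metrics client; the `labeled` type and the `prime` / `recordTick` names are hypothetical stand-ins for a labeled counter and gauge.

```go
package main

import (
	"fmt"
	"sync"
)

// kinds are the two label values named in this PR.
var kinds = []string{"customer_namespace", "redis_namespace"}

// labeled is a tiny stand-in for a labeled metric vector.
type labeled struct {
	mu sync.Mutex
	v  map[string]float64
}

func newLabeled() *labeled { return &labeled{v: map[string]float64{}} }

func (l *labeled) add(kind string, d float64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.v[kind] += d
}

func (l *labeled) set(kind string, x float64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.v[kind] = x
}

func (l *labeled) get(kind string) (float64, bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	x, ok := l.v[kind]
	return x, ok
}

type sweepMetrics struct{ total, current *labeled }

// prime touches every label value so each series exists at 0 from process start.
func (m *sweepMetrics) prime() {
	for _, k := range kinds {
		m.total.add(k, 0)
		m.current.set(k, 0)
	}
}

// recordTick adds this tick's candidates to the counter and overwrites the
// gauge, including an explicit 0 for kinds with no candidates this tick.
func (m *sweepMetrics) recordTick(found map[string]int) {
	for _, k := range kinds {
		n := float64(found[k])
		m.total.add(k, n)
		m.current.set(k, n)
	}
}

func main() {
	m := &sweepMetrics{total: newLabeled(), current: newLabeled()}
	m.prime()
	m.recordTick(map[string]int{"customer_namespace": 3})
	m.recordTick(map[string]int{}) // backlog drained
	total, _ := m.total.get("customer_namespace")
	current, _ := m.current.get("customer_namespace")
	fmt.Println(total, current) // prints 3 0
}
```

Note that in this sketch the counter keeps growing for as long as the same orphan is re-detected on successive ticks, while the gauge tracks only the latest tick.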

Suggested alerts for the infra PR:

  • sum(instant_orphan_db_sweep_candidates_total) by (kind) > 0 for 1h → P2 (standing orphan backlog = real cost: live pod, no owner; review the dry-run log before enabling destructive reclamation).
  • max(instant_orphan_db_sweep_candidates_current) by (kind) > 25 → P2 (the documented drain-backlog from the task brief; above it, accumulation is outpacing reclamation).
  • Suggested dashboard tile: instant_orphan_db_sweep_candidates_current per kind on infra/newrelic/dashboards/instanode-reliability.json.

Wiring

  • Registered in StartWorkers reusing the same seams as the orphan-sweep reconciler: K8sNamespaceLister (namespace List + age check) and the audited ResourceDeprovisioner (= provClient, only for the flag-gated destructive arm).
  • Periodic entry in buildPeriodicJobs — hourly, reconcileInsertOpts (carries UniqueOpts so replicas:2 doesn't double-run; passes TestPeriodicJobs_AllCarryUniqueOpts), RunOnStart=false.

Coverage block (rule 17)

Symptom:        orphaned customer DB / instant-customer-* redis namespace with no
                active/whitelisted resource row (the ~25 drain-backlog)
Enumeration:    rg -F 'instant-customer-' ; rg 'DeprovisionResource|ListCustomerNamespaces'
                (reuses customerNamespacePrefix + the audited reaper chokepoint —
                no new namespace scheme, no new DROP path)
Sites found:    1 new detection surface (orphan_db_sweep.go); 0 new drop paths
Sites touched:  1  (audit-only; destructive arm routes through the EXISTING
                   audited provisioner chokepoint, reused not re-implemented)
Coverage test:  TestOrphanDBSweep_Detection (orphan/live/pending/within-grace),
                TestOrphanDBSweep_MasterFlagOff (off → no-op, lister never called),
                TestOrphanDBSweep_FailOpenLiveTokenError + ...NamespaceListError
                (DB/k8s blip → 0 candidates, deprovisioner never called),
                TestOrphanDBSweep_DestructiveArmRoutesThroughAuditedChokepoint
                (BOTH flags on → audited DeprovisionResource w/ right token+type),
                TestOrphanDBSweep_DestructiveSkipsWhenTokenReappearsLive (fail-safe).
                New file at 100% statement coverage.
Live verified:  N/A — flag-gated OFF, ships inert (DEBUG no-op every tick until
                an operator lights ORPHAN_DB_SWEEP_ENABLED). No prod behavior
                change on merge; the live-URL gate is the operator's first
                flag-on review of the dry-run candidate log.

Gate

make gate green (the EXACT CI deploy.yml test step — build + vet + go test ./... -short -count=1). golangci-lint run on touched packages: 0 issues.

Explicit statement

No destructive drop runs by default. Both flags default OFF / fail-closed; the destructive arm is unreachable on merge and, even when armed, never issues a raw DROP — it only routes through the existing audited provisioner deprovision chokepoint.

🤖 Generated with Claude Code

mastermanas805 and others added 2 commits June 8, 2026 23:46
…flag-gated OFF)

New River periodic job orphan_db_sweep (hourly, reconcile queue, UniqueOpts)
that addresses the ~25 orphaned customer DB / redis namespace drain-backlog —
in DETECTION / DRY-RUN mode only. It lists instant-customer-* namespaces, flags
the ones whose token has NO non-terminal (pending/active/paused/suspended)
resources row and is past the provisioning grace window, then LOGS each
candidate (token masked via logsafe.Token) and emits the candidate metrics.
It DROPS NOTHING in audit-only mode.

truehomie-2026-06-03 safety: there is NO manual / raw DROP anywhere in this
job. The destructive teardown sits behind a SECOND flag and, when (and only
when) enabled, routes through the AUDITED provisioner DeprovisionResource
chokepoint — the same path the TTL reaper (expire.go) uses. For this PR that
path is intentionally unreachable-by-default.

Two flag gates, BOTH default OFF / fail-closed:
  ORPHAN_DB_SWEEP_ENABLED            — master flag; off → Work is a DEBUG no-op
                                       (no namespace List, no DB read, no metric).
  ORPHAN_DB_SWEEP_DESTRUCTIVE_ENABLED — destructive flag; meaningless unless the
                                       master is also on AND a provisioner is
                                       wired. Routes through the audited
                                       chokepoint only.

Fail-safe / fail-open: a namespace-List error or a live-token DB-read error
degrades to ZERO candidates (never an empty-set that a destructive caller could
read as "drop everything"); a candidate whose token reappears live at the
destructive re-confirm is SKIPPED; a generic (unmapped-kind) orphan is SKIPPED
(no proven backing type → no guessed DROP). When in doubt, skip + log.

Metrics (lazy *Vec, both labels primed in metrics_test):
  instant_orphan_db_sweep_candidates_total{kind}   — counter, kind in
    {customer_namespace, redis_namespace}
  instant_orphan_db_sweep_candidates_current{kind} — gauge, current backlog
Alert + dashboard + catalog live in the infra repo (not owned here) — see PR
body for the exact metric names + suggested alerts (rule 25 follow-up).

Tests: candidate detection (orphan vs live vs pending vs within-grace), kind
classification, both flag gates (off → no-op), masking, fail-open paths, and
that the destructive deprovisioner is NEVER called in audit-only mode. New file
at 100% statement coverage. make gate green (build + vet + go test -short).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ner helper

diff-cover flagged the `if provClient != nil` branch in the StartWorkers wiring
(integration-only, not unit-reachable). Extract the typed-nil-safe conversion
into orphanDBSweepDeprovisionerFor (mirrors NewExpireAnonymousWorker's handling)
and unit-test both arms directly, so the wiring call site is a single
non-branching expression covered by TestStartWorkers_FullBoot and the branch
logic is covered by the new unit test. New code back to 100% patch coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
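
For readers unfamiliar with the typed-nil issue the second commit refers to, here is a minimal sketch. The `deprovisioner` interface shape, the `provClient` struct, and the `deprovisionerFor` name are hypothetical simplifications, not the code extracted in that commit.

```go
package main

import "fmt"

// deprovisioner is a hypothetical shape for the audited chokepoint interface.
type deprovisioner interface {
	DeprovisionResource(token, resourceType string) error
}

// provClient is a stand-in concrete client.
type provClient struct{}

func (*provClient) DeprovisionResource(token, resourceType string) error { return nil }

// deprovisionerFor converts a possibly-nil *provClient into the interface.
// Returning an untyped nil keeps a later `!= nil` check meaningful.
func deprovisionerFor(c *provClient) deprovisioner {
	if c == nil {
		return nil
	}
	return c
}

func main() {
	var c *provClient // nil pointer

	// Direct assignment wraps the nil pointer in a non-nil interface value.
	var naive deprovisioner = c
	fmt.Println(naive != nil) // prints true

	// The helper returns a true nil interface instead.
	fmt.Println(deprovisionerFor(c) != nil) // prints false
}
```

Without such a helper, a guard like `provisioner != nil` could pass for a nil client, which is the case a defense-in-depth check is meant to catch.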
@mastermanas805 mastermanas805 merged commit a0f4014 into master Jun 8, 2026
12 checks passed
@mastermanas805 mastermanas805 deleted the feat/orphan-db-sweep-audit-only branch June 8, 2026 18:35