Skip to content

fix: auto-reset analytics credentials on auth failure#18

Merged
passcod merged 2 commits intomainfrom
fix-auth-credential-reset
Mar 20, 2026
Merged

fix: auto-reset analytics credentials on auth failure#18
passcod merged 2 commits intomainfrom
fix-auth-credential-reset

Conversation

@passcod
Copy link
Member

@passcod passcod commented Mar 19, 2026

Problem

When a restore's setup-auth init container fails to set the analytics user password (as happened with the \getenv-based psql variable binding bug fixed in #17), the operator gets stuck forever in a tight retry loop:

WARN replica reconciliation error: Database connection error: db error
FATAL: password authentication failed for user "analytics"
DETAIL: User "analytics" has no password assigned.

The restore sits in Switching phase indefinitely, the replica stays Restoring, and the database is completely inaccessible from the outside. This has been observed on tamanu-replica-palau-prod for over a week.

Fix

When the replica reconciler encounters an authentication failure (SQLSTATE 28P01/28000) connecting to any restore, it now automatically recovers via a credential-reset sequence:

  1. Detect the auth failure (SQLSTATE codes, with message-text fallback)
  2. Scale the restore deployment to 0 so the PVC is free
  3. Run a credential-reset Job that mounts the PVC directly and uses postgres --single (no TCP, no auth layer) to ALTER ROLE ... WITH PASSWORD '...', reading the password from the existing replica-creds secret
  4. Scale the deployment back to 1 once the job succeeds
  5. Resume normal reconciliation — the next connection attempt will succeed

The fix is fully idempotent and handles job failure (scales back up, retries next reconcile).

Scope

  • restore/builders.rs: build_credential_reset_job, credential_reset_job_name
  • restore.rs: pub(crate) re-exports
  • replica.rs: is_auth_error, ensure_credential_reset helpers; wired into reconcile_schema_migration for both the source (active) and target (switching) restore connection attempts

This guards against the specific prior bug (#17) and any future regression in the credential-setting pathway.

🦸 Review Hero

  • Run Review Hero
  • Auto-fix review suggestions
  • Auto-fix CI failures

When the replica reconciler cannot connect to a restore (Active or
Switching) due to an authentication failure, it now automatically
recovers rather than retrying indefinitely.

The recovery sequence:
1. Detect the auth failure via SQLSTATE 28P01/28000 (with a message-
   text fallback for cases where the db error code isn't surfaced).
2. Scale the restore deployment to 0 so the PVC is free.
3. Create a credential-reset Job that mounts the PVC directly and
   uses postgres --single (no TCP, no auth layer) to issue
   ALTER ROLE ... WITH PASSWORD '...', reading the password from
   the existing replica-creds secret.
4. Once the job succeeds, scale the deployment back to 1 and
   resume normal reconciliation.

This is resilient to the prior bug (fixed in 838318d) where
\getenv-based psql variable binding silently failed to set the
analytics password, and guards against any future regression in
that pathway.

Affected code:
- restore/builders.rs: build_credential_reset_job, credential_reset_job_name
- restore.rs: pub(crate) re-exports of the above
- replica.rs: is_auth_error, ensure_credential_reset helpers;
  wired into reconcile_schema_migration for both source and
  target restore connection attempts
@review-hero
Copy link

review-hero bot commented Mar 19, 2026

🦸 Review Hero Summary
3 agents reviewed this PR | 3 critical | 3 suggestions | 0 nitpicks

Local fix prompt (copy to your coding agent)

Fix these issues identified on the pull request. One commit per issue fixed.


src/controllers/replica.rs:898: The scale_patch JSON value {"spec": {"replicas": N}} and PatchParams::apply("postgres-restore-operator") are reconstructed three separate times inside ensure_credential_reset (scale-down path, succeeded path, failed path). Each call to serde_json::json! allocates a new Value tree and each PatchParams::apply heap-allocates the field manager string. These are all called in the hot reconcile loop. Hoist them to constants or local variables defined once at the top of the function to avoid redundant allocations on every invocation.


src/controllers/replica.rs:982: The pod list fetches full Pod objects (including spec, all containers, volumes, etc.) but only consumes status.phase. On a busy cluster this transfers unnecessary data from the API server on every reconcile loop iteration that hits this path. Use ListParams::default().labels(&label).limit(2) to cap the result set, and consider adding &kube::api::ListParams field selectors or restricting returned fields via server-side filtering if the client version supports it. Given there should be at most one pod per restore, a limit of 1–2 prevents an accidental full-list scan if labels are ever reused.


src/controllers/replica.rs:1165: The return value of ensure_credential_reset is silently discarded. The function is documented to return Ok(true) when the job has succeeded and credentials are reset (signalling the caller to retry the connection), but both call sites unconditionally return Ok(false) regardless. The current behaviour is not incorrect (the next reconcile will naturally retry), but it means the deployment is left scaled down for one extra reconcile cycle after a successful reset — the deployment is scaled back to 1 inside ensure_credential_reset on Ok(true), but the caller stops without logging or acting on that. Either use the result (if ensure_credential_reset(...).await? { /* log & fall through */ } else { return Ok(false); }) or change the return type to Result<()> and move the scale-up / job-deletion to a separate pass.


src/controllers/restore/builders.rs:57: The script uses set -ex, which traces every command before execution. This means the shell will print the expanded echo "ALTER ROLE ${ANALYTICS_USERNAME} WITH PASSWORD '${ANALYTICS_PASSWORD}';" line — with the plaintext password — to stdout/container logs. Anyone with access to pod logs (kubectl logs, log aggregation pipelines) will see the credential. Fix: remove set -x or, better, write the SQL to a temp file and redirect it rather than inlining it in an echo.

ANALYTICS_USERNAME is sourced from replica.spec.analytics_username — a user-controlled CRD field — and is interpolated verbatim into the SQL string passed to postgres --single. A value like analytics' WITH SUPERUSER; ALTER ROLE postgres WITH PASSWORD 'pwned'; -- would be executed as valid SQL, potentially escalating privileges or corrupting other roles on the restored database. Fix: validate/quote the username before building the script, or pass it as a psql variable (-v username=...) which is not interpolated as SQL.


src/controllers/restore/builders.rs:65: The shell script embeds both ANALYTICS_USERNAME and ANALYTICS_PASSWORD inside a double-quoted echo string. If the password contains a double-quote character (e.g. pass"word), the shell command becomes malformed: echo "ALTER ROLE user WITH PASSWORD 'pass"word';". Even if the password is operator-generated, single quotes in the password break the SQL literal syntax ('pass'word' is a SQL error). Similarly, ANALYTICS_USERNAME comes from user input (replica.spec.analytics_username) and is not SQL-identifier-quoted, so a username like my-user or one containing a single quote will cause a syntax error. The comment claiming safety applies only to the shell-command level and only if the password never contains ". Fix: use a heredoc with single quotes to isolate the password from shell expansion, escape single quotes in the password value (${ANALYTICS_PASSWORD//\'/\'\'} or dollar-quoting in the SQL), and double-quote the username identifier.

@review-hero
Copy link

review-hero bot commented Mar 19, 2026

🦸 Review Hero Auto-Fix
Applied fixes for 3 review comments.

View logs

@passcod passcod merged commit 1141dee into main Mar 20, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant