fix: auto-reset analytics credentials on auth failure by passcod · Pull Request #18 · beyondessential/postgres-restore-operator

passcod · 2026-03-19T23:15:06Z

Problem

When a restore's setup-auth init container fails to set the analytics user password (as happened with the \getenv-based psql variable binding bug fixed in #17), the operator gets stuck forever in a tight retry loop:

WARN replica reconciliation error: Database connection error: db error
FATAL: password authentication failed for user "analytics"
DETAIL: User "analytics" has no password assigned.

The restore sits in Switching phase indefinitely, the replica stays Restoring, and the database is completely inaccessible from the outside. This has been observed on tamanu-replica-palau-prod for over a week.

Fix

When the replica reconciler encounters an authentication failure (SQLSTATE 28P01/28000) connecting to any restore, it now automatically recovers via a credential-reset sequence:

Detect the auth failure (SQLSTATE codes, with message-text fallback)
Scale the restore deployment to 0 so the PVC is free
Run a credential-reset Job that mounts the PVC directly and uses postgres --single (no TCP, no auth layer) to ALTER ROLE ... WITH PASSWORD '...', reading the password from the existing replica-creds secret
Scale the deployment back to 1 once the job succeeds
Resume normal reconciliation — the next connection attempt will succeed

The fix is fully idempotent and handles job failure (scales back up, retries next reconcile).

Scope

restore/builders.rs: build_credential_reset_job, credential_reset_job_name
restore.rs: pub(crate) re-exports
replica.rs: is_auth_error, ensure_credential_reset helpers; wired into reconcile_schema_migration for both the source (active) and target (switching) restore connection attempts

This guards against the specific prior bug (#17) and any future regression in the credential-setting pathway.

🦸 Review Hero

Run Review Hero
Auto-fix review suggestions
Auto-fix CI failures

When the replica reconciler cannot connect to a restore (Active or Switching) due to an authentication failure, it now automatically recovers rather than retrying indefinitely. The recovery sequence: 1. Detect the auth failure via SQLSTATE 28P01/28000 (with a message- text fallback for cases where the db error code isn't surfaced). 2. Scale the restore deployment to 0 so the PVC is free. 3. Create a credential-reset Job that mounts the PVC directly and uses postgres --single (no TCP, no auth layer) to issue ALTER ROLE ... WITH PASSWORD '...', reading the password from the existing replica-creds secret. 4. Once the job succeeds, scale the deployment back to 1 and resume normal reconciliation. This is resilient to the prior bug (fixed in 838318d) where \getenv-based psql variable binding silently failed to set the analytics password, and guards against any future regression in that pathway. Affected code: - restore/builders.rs: build_credential_reset_job, credential_reset_job_name - restore.rs: pub(crate) re-exports of the above - replica.rs: is_auth_error, ensure_credential_reset helpers; wired into reconcile_schema_migration for both source and target restore connection attempts

src/controllers/replica.rs

src/controllers/restore/builders.rs

review-hero · 2026-03-19T23:36:15Z

🦸 Review Hero Summary
3 agents reviewed this PR | 3 critical | 3 suggestions | 0 nitpicks

Local fix prompt (copy to your coding agent)

Fix these issues identified on the pull request. One commit per issue fixed.

src/controllers/replica.rs:898: The scale_patch JSON value {"spec": {"replicas": N}} and PatchParams::apply("postgres-restore-operator") are reconstructed three separate times inside ensure_credential_reset (scale-down path, succeeded path, failed path). Each call to serde_json::json! allocates a new Value tree and each PatchParams::apply heap-allocates the field manager string. These are all called in the hot reconcile loop. Hoist them to constants or local variables defined once at the top of the function to avoid redundant allocations on every invocation.

src/controllers/replica.rs:982: The pod list fetches full Pod objects (including spec, all containers, volumes, etc.) but only consumes status.phase. On a busy cluster this transfers unnecessary data from the API server on every reconcile loop iteration that hits this path. Use ListParams::default().labels(&label).limit(2) to cap the result set, and consider adding &kube::api::ListParams field selectors or restricting returned fields via server-side filtering if the client version supports it. Given there should be at most one pod per restore, a limit of 1–2 prevents an accidental full-list scan if labels are ever reused.

src/controllers/replica.rs:1165: The return value of ensure_credential_reset is silently discarded. The function is documented to return Ok(true) when the job has succeeded and credentials are reset (signalling the caller to retry the connection), but both call sites unconditionally return Ok(false) regardless. The current behaviour is not incorrect (the next reconcile will naturally retry), but it means the deployment is left scaled down for one extra reconcile cycle after a successful reset — the deployment is scaled back to 1 inside ensure_credential_reset on Ok(true), but the caller stops without logging or acting on that. Either use the result (if ensure_credential_reset(...).await? { /* log & fall through */ } else { return Ok(false); }) or change the return type to Result<()> and move the scale-up / job-deletion to a separate pass.

src/controllers/restore/builders.rs:57: The script uses set -ex, which traces every command before execution. This means the shell will print the expanded echo "ALTER ROLE ${ANALYTICS_USERNAME} WITH PASSWORD '${ANALYTICS_PASSWORD}';" line — with the plaintext password — to stdout/container logs. Anyone with access to pod logs (kubectl logs, log aggregation pipelines) will see the credential. Fix: remove set -x or, better, write the SQL to a temp file and redirect it rather than inlining it in an echo.

ANALYTICS_USERNAME is sourced from replica.spec.analytics_username — a user-controlled CRD field — and is interpolated verbatim into the SQL string passed to postgres --single. A value like analytics' WITH SUPERUSER; ALTER ROLE postgres WITH PASSWORD 'pwned'; -- would be executed as valid SQL, potentially escalating privileges or corrupting other roles on the restored database. Fix: validate/quote the username before building the script, or pass it as a psql variable (-v username=...) which is not interpolated as SQL.

src/controllers/restore/builders.rs:65: The shell script embeds both ANALYTICS_USERNAME and ANALYTICS_PASSWORD inside a double-quoted echo string. If the password contains a double-quote character (e.g. pass"word), the shell command becomes malformed: echo "ALTER ROLE user WITH PASSWORD 'pass"word';". Even if the password is operator-generated, single quotes in the password break the SQL literal syntax ('pass'word' is a SQL error). Similarly, ANALYTICS_USERNAME comes from user input (replica.spec.analytics_username) and is not SQL-identifier-quoted, so a username like my-user or one containing a single quote will cause a syntax error. The comment claiming safety applies only to the shell-command level and only if the password never contains ". Fix: use a heredoc with single quotes to isolate the password from shell expansion, escape single quotes in the password value (${ANALYTICS_PASSWORD//\'/\'\'} or dollar-quoting in the SQL), and double-quote the username identifier.

review-hero · 2026-03-19T23:45:47Z

🦸 Review Hero Auto-Fix
Applied fixes for 3 review comments.

View logs

review-hero bot reviewed Mar 19, 2026

View reviewed changes

src/controllers/replica.rs Show resolved Hide resolved

src/controllers/replica.rs Show resolved Hide resolved

src/controllers/replica.rs Show resolved Hide resolved

src/controllers/restore/builders.rs Show resolved Hide resolved

src/controllers/restore/builders.rs Show resolved Hide resolved

fix: auto-fix 3 review suggestions

37c40b2

passcod merged commit 1141dee into main Mar 20, 2026
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: auto-reset analytics credentials on auth failure#18

fix: auto-reset analytics credentials on auth failure#18
passcod merged 2 commits intomainfrom
fix-auth-credential-reset

passcod commented Mar 19, 2026 •

edited by review-hero bot

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

review-hero bot commented Mar 19, 2026

Uh oh!

review-hero bot commented Mar 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

passcod commented Mar 19, 2026 • edited by review-hero bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Scope

🦸 Review Hero

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

review-hero bot commented Mar 19, 2026

Uh oh!

review-hero bot commented Mar 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

passcod commented Mar 19, 2026 •

edited by review-hero bot

Loading