fix: auto-reset analytics credentials on auth failure#18
Conversation
When the replica reconciler cannot connect to a restore (Active or Switching) due to an authentication failure, it now automatically recovers rather than retrying indefinitely. The recovery sequence: 1. Detect the auth failure via SQLSTATE 28P01/28000 (with a message- text fallback for cases where the db error code isn't surfaced). 2. Scale the restore deployment to 0 so the PVC is free. 3. Create a credential-reset Job that mounts the PVC directly and uses postgres --single (no TCP, no auth layer) to issue ALTER ROLE ... WITH PASSWORD '...', reading the password from the existing replica-creds secret. 4. Once the job succeeds, scale the deployment back to 1 and resume normal reconciliation. This is resilient to the prior bug (fixed in 838318d) where \getenv-based psql variable binding silently failed to set the analytics password, and guards against any future regression in that pathway. Affected code: - restore/builders.rs: build_credential_reset_job, credential_reset_job_name - restore.rs: pub(crate) re-exports of the above - replica.rs: is_auth_error, ensure_credential_reset helpers; wired into reconcile_schema_migration for both source and target restore connection attempts
|
🦸 Review Hero Summary Local fix prompt (copy to your coding agent)Fix these issues identified on the pull request. One commit per issue fixed.
|
|
🦸 Review Hero Auto-Fix |
Problem
When a restore's
setup-authinit container fails to set the analytics user password (as happened with the\getenv-based psql variable binding bug fixed in #17), the operator gets stuck forever in a tight retry loop:The restore sits in
Switchingphase indefinitely, the replica staysRestoring, and the database is completely inaccessible from the outside. This has been observed ontamanu-replica-palau-prodfor over a week.Fix
When the replica reconciler encounters an authentication failure (SQLSTATE
28P01/28000) connecting to any restore, it now automatically recovers via a credential-reset sequence:postgres --single(no TCP, no auth layer) toALTER ROLE ... WITH PASSWORD '...', reading the password from the existingreplica-credssecretThe fix is fully idempotent and handles job failure (scales back up, retries next reconcile).
Scope
restore/builders.rs:build_credential_reset_job,credential_reset_job_namerestore.rs:pub(crate)re-exportsreplica.rs:is_auth_error,ensure_credential_resethelpers; wired intoreconcile_schema_migrationfor both the source (active) and target (switching) restore connection attemptsThis guards against the specific prior bug (#17) and any future regression in the credential-setting pathway.
🦸 Review Hero