fix: ignore stale resource state by jason-lynch · Pull Request #296 · pgEdge/control-plane

jason-lynch · 2026-03-11T21:55:15Z

Summary

When a resource Create or Refresh fails, we still record the resource in state so that it can be cleaned up by a later delete operation. To distinguish this state, we set a special NeedsRecreate flag. However, we did not account for this in the helper functions that retrieve resources from the state. This made it possible for callers to get stale state for a resource that no longer exists.

This commit fixes that issue by adding a check for the NeedsRecreate flag in the helper functions that all the resources use to inspect the state.

I've also added a related fix in NodeResource.Create to handle cases where the first instance resource is in a bad state.

Testing

The scenario that this PR fixes:

# create a database with one node and two instances:
cp1-req create-database <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-2"] }
    ]
  }
}
EOF

# once that finishes, stop the host-1 instance
docker service scale postgres-storefront-n1-689qacsi=0

# then update the database to remove the first instance
cp1-req update-database storefront <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "nodes": [
      { "name": "n1", "host_ids": ["host-2"] }
    ]
  }
}
EOF

Prior to this fix, you would get an error like:

failed to execute plan update: failed to plan database update: node n1 has no primary instance

coderabbitai · 2026-03-11T21:55:29Z

📝 Walkthrough

Walkthrough

NodeResource.Create now scans all InstanceIDs, ignoring ErrNotFound until finding a non-empty PrimaryInstanceID; FromState returns not-found when ResourceData.NeedsRecreate is true; InstanceResource.isFirstTimeSetup signature was simplified to return only a bool and its logic reduced to an identifier existence check.

Changes

Cohort / File(s)	Summary
Node / Instance database `server/internal/database/node_resource.go`, `server/internal/database/instance_resource.go`	`NodeResource.Create` iterates all `InstanceIDs`, ignores `ErrNotFound`, and assigns `n.PrimaryInstanceID` from the first instance that exposes a non-empty `PrimaryInstanceID` (adds `errors` import). `InstanceResource.isFirstTimeSetup` signature changed to `bool` and simplified to check `rc.State.Get(r.Identifier())`.
Resource state handling `server/internal/resource/state.go`	`FromState` now returns a not-found error when `ResourceData.NeedsRecreate` is true, short-circuiting before converting state to a resource.

Poem

🐇 I hopped through instance lists today,
I checked each burrow, sniffed the primary way,
If a state begs "recreate", I skip that den,
I find the true leader, then hop again,
Little whiskers, careful checks — hooray! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix: ignore stale resource state' accurately captures the main objective of the PR, which is to prevent stale resources marked with NeedsRecreate from being returned by helper functions.
Description check	✅ Passed	The PR description follows the template structure with Summary, Changes (implicit in detailed explanation), and Testing sections, though the checklist is not included.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix/ignore-stale-resource-state

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

mmols

LGTM, I suggest following later with an E2E focused around this case

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/internal/database/node_resource.go`:
- Around line 68-75: The loop currently returns on any error from
resource.FromContext[*InstanceResource](rc, InstanceResourceIdentifier(id));
instead, check if err == resource.ErrNotFound and in that case skip this id and
continue scanning for other instances, only returning an error for
non-ErrNotFound failures; once a valid instance is returned, if
instance.PrimaryInstanceID != "" assign it to n.PrimaryInstanceID and break as
before. Ensure you reference the FromContext[*InstanceResource] call, the
InstanceResourceIdentifier(id) argument, the resource.ErrNotFound sentinel, and
the instance.PrimaryInstanceID check when making this change.

In `@server/internal/resource/state.go`:
- Around line 425-426: FromState must not map NeedsRecreate to ErrNotFound
because callers (e.g., update_database.go and plan_restore.go) rely on
ErrNotFound meaning “truly absent”; instead introduce and return a distinct
exported sentinel (e.g., ErrNeedsRecreate) from the resource package when
data.NeedsRecreate is true, update FromState to return that sentinel (not
ErrNotFound), and then adjust only the resource-lifecycle-only helpers or
callers that actually want to treat recreate as absent to check errors.Is(err,
resource.ErrNeedsRecreate) (or keep existing resource-only filtering) while
leaving callers that use errors.Is(..., resource.ErrNotFound) unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 07210c25-2804-43c5-9e2b-3a09d85bbf64

📥 Commits

Reviewing files that changed from the base of the PR and between 004c9df and d1a1c05.

📒 Files selected for processing (2)

server/internal/database/node_resource.go
server/internal/resource/state.go

server/internal/database/node_resource.go

server/internal/resource/state.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/internal/database/node_resource.go`:
- Around line 66-77: In node_resource.go inside the loop that iterates
n.InstanceIDs (the block calling resource.FromContext[*InstanceResource] with
InstanceResourceIdentifier), reset n.PrimaryInstanceID to empty string before
the loop begins, and after calling resource.FromContext add a continue when err
!= nil && errors.Is(err, resource.ErrNotFound) so the nil instance is skipped;
keep the existing return for other errors and only access
instance.PrimaryInstanceID when instance != nil to avoid nil dereference.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6d896f10-3bce-47e1-8708-f1c254c5ad5f

📥 Commits

Reviewing files that changed from the base of the PR and between d1a1c05 and 1b1f8d7.

📒 Files selected for processing (2)

server/internal/database/node_resource.go
server/internal/resource/state.go

🚧 Files skipped from review as they are similar to previous changes (1)

server/internal/resource/state.go

server/internal/database/node_resource.go

When a resource `Create` or `Refresh` fails, we still record the resource in state so that it can be cleaned up by a later delete operation. To distinguish this state, we set a special `NeedsRecreate` flag. However, we did not account for this in the helper functions that retrieve resources from the state. This made it possible for callers to get stale state for a resource that no longer exists. This commit fixes that issue by adding a check for the `NeedsRecreate` flag in the helper functions that all the resources use to inspect the state. I've also added a related fix in `NodeResource.Create` to handle cases where the first instance resource is in a bad state.

coderabbitai

🧹 Nitpick comments (1)

server/internal/database/instance_resource.go (1)
400-403: Tighten the inline comment to match the actual semantics.

This check now means “the resource has no prior state entry,” not strictly “it was never successfully created.” Updating the comment would make the retry behavior clearer. Based on learnings: In server/internal/database/postgres_database.go, isFirstTimeSetup() intentionally uses resource presence in state rather than NeedsRecreate because the post-restore flow is destructive and must not rerun on retry.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/database/instance_resource.go` around lines 400 - 403, Update
the inline comment above the check that calls rc.State.Get(r.Identifier()) and
returns !ok to reflect that it checks for absence of a prior state entry (i.e.,
"no prior state entry exists") instead of claiming the resource was "never
successfully created"; reference the exact symbols rc.State.Get and
r.Identifier() in the comment and, if helpful, mirror the phrasing used by
isFirstTimeSetup() in server/internal/database/postgres_database.go to make the
retry/destructive-restore semantics clear.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@server/internal/database/instance_resource.go`:
- Around line 400-403: Update the inline comment above the check that calls
rc.State.Get(r.Identifier()) and returns !ok to reflect that it checks for
absence of a prior state entry (i.e., "no prior state entry exists") instead of
claiming the resource was "never successfully created"; reference the exact
symbols rc.State.Get and r.Identifier() in the comment and, if helpful, mirror
the phrasing used by isFirstTimeSetup() in
server/internal/database/postgres_database.go to make the
retry/destructive-restore semantics clear.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ebe0a971-1404-4c77-b291-92f53de036f1

📥 Commits

Reviewing files that changed from the base of the PR and between 1b1f8d7 and 16cd1bb.

📒 Files selected for processing (3)

server/internal/database/instance_resource.go
server/internal/database/node_resource.go
server/internal/resource/state.go

🚧 Files skipped from review as they are similar to previous changes (2)

server/internal/resource/state.go
server/internal/database/node_resource.go

jason-lynch requested a review from mmols March 11, 2026 21:55

mmols approved these changes Mar 11, 2026

View reviewed changes

coderabbitai bot reviewed Mar 11, 2026

View reviewed changes

server/internal/database/node_resource.go Show resolved Hide resolved

server/internal/resource/state.go Show resolved Hide resolved

jason-lynch force-pushed the fix/ignore-stale-resource-state branch from d1a1c05 to 1b1f8d7 Compare March 11, 2026 22:10

coderabbitai bot reviewed Mar 11, 2026

View reviewed changes

server/internal/database/node_resource.go Show resolved Hide resolved

jason-lynch force-pushed the fix/ignore-stale-resource-state branch from 1b1f8d7 to 16cd1bb Compare March 11, 2026 22:41

coderabbitai bot reviewed Mar 11, 2026

View reviewed changes

jason-lynch merged commit 1dc6516 into main Mar 11, 2026
3 checks passed

jason-lynch deleted the fix/ignore-stale-resource-state branch March 11, 2026 23:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: ignore stale resource state#296

fix: ignore stale resource state#296
jason-lynch merged 1 commit intomainfrom
fix/ignore-stale-resource-state

jason-lynch commented Mar 11, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Mar 11, 2026 •

edited

Loading

Walkthrough

Changes

Poem

❌ Failed checks (1 warning)

Uh oh!

mmols left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jason-lynch commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Uh oh!

coderabbitai bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Poem

❌ Failed checks (1 warning)

Uh oh!

mmols left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jason-lynch commented Mar 11, 2026 •

edited

Loading

coderabbitai bot commented Mar 11, 2026 •

edited

Loading