Skip to content

fix: ignore stale resource state#296

Merged
jason-lynch merged 1 commit intomainfrom
fix/ignore-stale-resource-state
Mar 11, 2026
Merged

fix: ignore stale resource state#296
jason-lynch merged 1 commit intomainfrom
fix/ignore-stale-resource-state

Conversation

@jason-lynch
Copy link
Member

@jason-lynch jason-lynch commented Mar 11, 2026

Summary

When a resource Create or Refresh fails, we still record the resource in state so that it can be cleaned up by a later delete operation. To distinguish this state, we set a special NeedsRecreate flag. However, we did not account for this in the helper functions that retrieve resources from the state. This made it possible for callers to get stale state for a resource that no longer exists.

This commit fixes that issue by adding a check for the NeedsRecreate flag in the helper functions that all the resources use to inspect the state.

I've also added a related fix in NodeResource.Create to handle cases where the first instance resource is in a bad state.

Testing

The scenario that this PR fixes:

# create a database with one node and two instances:
cp1-req create-database <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-2"] }
    ]
  }
}
EOF

# once that finishes, stop the host-1 instance
docker service scale postgres-storefront-n1-689qacsi=0

# then update the database to remove the first instance
cp1-req update-database storefront <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "nodes": [
      { "name": "n1", "host_ids": ["host-2"] }
    ]
  }
}
EOF

Prior to this fix, you would get an error like:

failed to execute plan update: failed to plan database update: node n1 has no primary instance

@jason-lynch jason-lynch requested a review from mmols March 11, 2026 21:55
@coderabbitai
Copy link

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Walkthrough

NodeResource.Create now scans all InstanceIDs, ignoring ErrNotFound until finding a non-empty PrimaryInstanceID; FromState returns not-found when ResourceData.NeedsRecreate is true; InstanceResource.isFirstTimeSetup signature was simplified to return only a bool and its logic reduced to an identifier existence check.

Changes

Cohort / File(s) Summary
Node / Instance database
server/internal/database/node_resource.go, server/internal/database/instance_resource.go
NodeResource.Create iterates all InstanceIDs, ignores ErrNotFound, and assigns n.PrimaryInstanceID from the first instance that exposes a non-empty PrimaryInstanceID (adds errors import). InstanceResource.isFirstTimeSetup signature changed to bool and simplified to check rc.State.Get(r.Identifier()).
Resource state handling
server/internal/resource/state.go
FromState now returns a not-found error when ResourceData.NeedsRecreate is true, short-circuiting before converting state to a resource.

Poem

🐇 I hopped through instance lists today,
I checked each burrow, sniffed the primary way,
If a state begs "recreate", I skip that den,
I find the true leader, then hop again,
Little whiskers, careful checks — hooray! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: ignore stale resource state' accurately captures the main objective of the PR, which is to prevent stale resources marked with NeedsRecreate from being returned by helper functions.
Description check ✅ Passed The PR description follows the template structure with Summary, Changes (implicit in detailed explanation), and Testing sections, though the checklist is not included.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/ignore-stale-resource-state

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Member

@mmols mmols left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I suggest following later with an E2E focused around this case

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/internal/database/node_resource.go`:
- Around line 68-75: The loop currently returns on any error from
resource.FromContext[*InstanceResource](rc, InstanceResourceIdentifier(id));
instead, check if err == resource.ErrNotFound and in that case skip this id and
continue scanning for other instances, only returning an error for
non-ErrNotFound failures; once a valid instance is returned, if
instance.PrimaryInstanceID != "" assign it to n.PrimaryInstanceID and break as
before. Ensure you reference the FromContext[*InstanceResource] call, the
InstanceResourceIdentifier(id) argument, the resource.ErrNotFound sentinel, and
the instance.PrimaryInstanceID check when making this change.

In `@server/internal/resource/state.go`:
- Around line 425-426: FromState must not map NeedsRecreate to ErrNotFound
because callers (e.g., update_database.go and plan_restore.go) rely on
ErrNotFound meaning “truly absent”; instead introduce and return a distinct
exported sentinel (e.g., ErrNeedsRecreate) from the resource package when
data.NeedsRecreate is true, update FromState to return that sentinel (not
ErrNotFound), and then adjust only the resource-lifecycle-only helpers or
callers that actually want to treat recreate as absent to check errors.Is(err,
resource.ErrNeedsRecreate) (or keep existing resource-only filtering) while
leaving callers that use errors.Is(..., resource.ErrNotFound) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 07210c25-2804-43c5-9e2b-3a09d85bbf64

📥 Commits

Reviewing files that changed from the base of the PR and between 004c9df and d1a1c05.

📒 Files selected for processing (2)
  • server/internal/database/node_resource.go
  • server/internal/resource/state.go

@jason-lynch jason-lynch force-pushed the fix/ignore-stale-resource-state branch from d1a1c05 to 1b1f8d7 Compare March 11, 2026 22:10
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/internal/database/node_resource.go`:
- Around line 66-77: In node_resource.go inside the loop that iterates
n.InstanceIDs (the block calling resource.FromContext[*InstanceResource] with
InstanceResourceIdentifier), reset n.PrimaryInstanceID to empty string before
the loop begins, and after calling resource.FromContext add a continue when err
!= nil && errors.Is(err, resource.ErrNotFound) so the nil instance is skipped;
keep the existing return for other errors and only access
instance.PrimaryInstanceID when instance != nil to avoid nil dereference.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6d896f10-3bce-47e1-8708-f1c254c5ad5f

📥 Commits

Reviewing files that changed from the base of the PR and between d1a1c05 and 1b1f8d7.

📒 Files selected for processing (2)
  • server/internal/database/node_resource.go
  • server/internal/resource/state.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/internal/resource/state.go

When a resource `Create` or `Refresh` fails, we still record the
resource in state so that it can be cleaned up by a later delete
operation. To distinguish this state, we set a special
`NeedsRecreate` flag. However, we did not account for this in the
helper functions that retrieve resources from the state. This made it
possible for callers to get stale state for a resource that no longer
exists.

This commit fixes that issue by adding a check for the `NeedsRecreate`
flag in the helper functions that all the resources use to inspect the
state.

I've also added a related fix in `NodeResource.Create` to handle cases
where the first instance resource is in a bad state.
@jason-lynch jason-lynch force-pushed the fix/ignore-stale-resource-state branch from 1b1f8d7 to 16cd1bb Compare March 11, 2026 22:41
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
server/internal/database/instance_resource.go (1)

400-403: Tighten the inline comment to match the actual semantics.

This check now means “the resource has no prior state entry,” not strictly “it was never successfully created.” Updating the comment would make the retry behavior clearer. Based on learnings: In server/internal/database/postgres_database.go, isFirstTimeSetup() intentionally uses resource presence in state rather than NeedsRecreate because the post-restore flow is destructive and must not rerun on retry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/database/instance_resource.go` around lines 400 - 403, Update
the inline comment above the check that calls rc.State.Get(r.Identifier()) and
returns !ok to reflect that it checks for absence of a prior state entry (i.e.,
"no prior state entry exists") instead of claiming the resource was "never
successfully created"; reference the exact symbols rc.State.Get and
r.Identifier() in the comment and, if helpful, mirror the phrasing used by
isFirstTimeSetup() in server/internal/database/postgres_database.go to make the
retry/destructive-restore semantics clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@server/internal/database/instance_resource.go`:
- Around line 400-403: Update the inline comment above the check that calls
rc.State.Get(r.Identifier()) and returns !ok to reflect that it checks for
absence of a prior state entry (i.e., "no prior state entry exists") instead of
claiming the resource was "never successfully created"; reference the exact
symbols rc.State.Get and r.Identifier() in the comment and, if helpful, mirror
the phrasing used by isFirstTimeSetup() in
server/internal/database/postgres_database.go to make the
retry/destructive-restore semantics clear.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ebe0a971-1404-4c77-b291-92f53de036f1

📥 Commits

Reviewing files that changed from the base of the PR and between 1b1f8d7 and 16cd1bb.

📒 Files selected for processing (3)
  • server/internal/database/instance_resource.go
  • server/internal/database/node_resource.go
  • server/internal/resource/state.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • server/internal/resource/state.go
  • server/internal/database/node_resource.go

@jason-lynch jason-lynch merged commit 1dc6516 into main Mar 11, 2026
3 checks passed
@jason-lynch jason-lynch deleted the fix/ignore-stale-resource-state branch March 11, 2026 23:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants