
fix: enable Spock repair mode for service user creation #287

Merged

rshoemaker merged 5 commits into main from fix/PLAT-478/svc_user_tx_with_repair_mode
Mar 11, 2026

Conversation

@rshoemaker
Contributor

@rshoemaker rshoemaker commented Mar 6, 2026

fix: resolve concurrent service user creation failures

Summary

  • Deduplicate ServiceUserRole by ServiceID so only one database user role is created per service, shared across all instances
  • Replace HostExecutor with PrimaryExecutor for dynamic routing to the current primary node
  • Remove Spock repair mode — the role is created once on the primary and Spock replicates it to all other nodes automatically
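The deduplication described above relies on the resource system keying state by identifier. A minimal standalone sketch of that mechanism (the types, field names, and `dedupe` helper here are illustrative stand-ins, not the project's actual resource API):

```go
package main

import "fmt"

// Identifier and ServiceUserRole mirror the shapes discussed in the PR;
// the names and fields are assumptions for this sketch.
type Identifier struct {
	ID   string
	Type string
}

type ServiceUserRole struct {
	ServiceID string
}

// Keying the identifier on ServiceID means every instance of the same
// service produces the same identifier.
func (r ServiceUserRole) Identifier() Identifier {
	return Identifier{ID: r.ServiceID, Type: "service_user_role"}
}

// dedupe collapses resources that share an identifier, mirroring the
// "only one ends up in the state" behavior the fix depends on.
func dedupe(roles []ServiceUserRole) []ServiceUserRole {
	seen := map[Identifier]bool{}
	var out []ServiceUserRole
	for _, r := range roles {
		id := r.Identifier()
		if !seen[id] {
			seen[id] = true
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// Three instances of the same service yield one shared role.
	roles := []ServiceUserRole{{"svc-a"}, {"svc-a"}, {"svc-a"}}
	fmt.Println(len(dedupe(roles))) // 1
}
```

With identity keyed on ServiceID, concurrent creation cannot happen at all, which is why no retry or repair mode is needed.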

Context

TestProvisionMultiHostMCPService was failing intermittently with:

ERROR: tuple concurrently updated (SQLSTATE XX000)

When a database has multiple MCP service instances (e.g., on host-1, host-2, host-3), each previously got its own ServiceUserRole resource. These executed concurrently on the same Postgres instance, and their GRANT statements conflicted at the catalog level.

The fix changes ServiceUserRole identity from ServiceInstanceID to ServiceID, so the resource system deduplicates them — only one role is created regardless of how many service instances exist. This eliminates concurrent creation entirely rather than retrying through it.

Other changes:

  • GenerateServiceUsername now takes only serviceID (no hostID)
  • ServiceInstanceResource gains a ServiceSpecID field to avoid conflation with the Docker Swarm service ID set during Refresh
  • ResourceVersion bumped to "3" to trigger recreation

Verified with 30 consecutive test runs, 0 failures.

Test plan

  • Unit tests pass (make test)
  • All service E2E tests pass (8/8)
  • TestProvisionMultiHostMCPService passes 30/30 runs

PLAT-478

@coderabbitai

coderabbitai bot commented Mar 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Service user-role resources were restructured to use ServiceID/ServiceSpecID and NodeName routing, username generation now derives from serviceID only, creation moved into a per-database transactional helper, and primary connections now accept an explicit dbName (AdminDSN(dbName)).

Changes

Cohort / File(s) Summary
Service user role core
server/internal/orchestrator/swarm/service_user_role.go
Public shape changed to ServiceID, NodeName, DatabaseID, DatabaseName, Username, Password; ResourceVersion() -> "3"; DiffIgnore() now ignores /node_name; Identifier() uses ServiceID; Executor() returns resource.PrimaryExecutor(r.NodeName); creation refactored into createUserRole(...); connectToPrimary now takes dbName and uses AdminDSN(dbName).
Username generation & tests
server/internal/database/service_instance.go, server/internal/database/service_instance_test.go
GenerateServiceUsername signature changed to GenerateServiceUsername(serviceID string); usernames now based on serviceID only (svc_{service_id} / long form with hash); tests updated; ServiceInstanceSpec.PostgresHostID renamed to NodeName.
Orchestrator wiring & MCP config
server/internal/orchestrator/swarm/mcp_config_resource.go, server/internal/orchestrator/swarm/orchestrator.go
MCPConfigResource and related wiring now carry ServiceID; dependencies and credentials lookup use ServiceID instead of instance IDs.
Service instance resources
server/internal/orchestrator/swarm/service_instance.go, server/internal/orchestrator/swarm/service_instance_spec.go
Added ServiceSpecID to ServiceInstanceResource; DiffIgnore() updated to include /service_spec_id; Dependencies() and populateCredentials() now reference ServiceSpecID / ServiceSpec.ServiceID for ServiceUserRole resolution.
Plan/update routing
server/internal/workflows/plan_update.go
Replaced postgres-host routing with nodeName: getServiceResources signature and call sites updated; validation and ServiceInstanceSpec population now use NodeName instead of postgresHostID.

Poem

🐇 I hop through code with nimble feet,
One service name, one routing seat,
Transactions wrap each role I make,
Primary calls for safety's sake,
I nibble bugs and leave things neat.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning — The PR title 'fix: enable Spock repair mode for service user creation' is misleading: the PR actually removes Spock repair mode and refocuses on deduplication via ServiceID, so it describes the opposite of the main change. Resolution: update the title to reflect the actual primary fix, e.g. 'fix: deduplicate ServiceUserRole by ServiceID to prevent concurrent creation failures'.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 40.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (1 passed)

  • Description check ✅ Passed — The PR description covers the summary, context, changes, and verification clearly, but the test plan checklist is incomplete, with unchecked boxes and no actual test results or references.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/PLAT-478/svc_user_tx_with_repair_mode


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 130-133: The call to postgres.IsSpockEnabled().Scalar silently
hides query errors because Scalar currently returns nil on row.Scan failures;
update the Scalar implementation in the postgres statement (the method that
implements Scalar on the Statement/Query type) so that any row.Scan error is
returned to the caller instead of nil, ensuring postgres.IsSpockEnabled().Scalar
propagates the scan/DB error back to the caller (so the existing error handling
in service_user_role.go around enabled, err := ... Scalar(ctx, tx) works as
intended).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c1ad74e6-e8db-4423-8957-d107629152ec

📥 Commits

Reviewing files that changed from the base of the PR and between 87528bb and c216d57.

📒 Files selected for processing (1)
  • server/internal/orchestrator/swarm/service_user_role.go

@jason-lynch
Member

jason-lynch commented Mar 10, 2026

The Spock repair mode change is necessary (although unrelated to the E2E failures since there’s only one database instance), but I think the retries are papering over the real issue. As you pointed out, the problem is that each instance has its own ServiceUserRole, but they all use the same role name.

I think we should remove the retries and do either one of the following:

  • Make each role name unique to the instance.

Or:

  • Change the ServiceUserRole identity to use ServiceID rather than ServiceInstanceID, like:
func ServiceUserRoleIdentifier(serviceID string) resource.Identifier {
	return resource.Identifier{
		ID:   serviceID,
		Type: ResourceTypeServiceUserRole,
	}
}

// ...

func (r *ServiceUserRole) Identifier() resource.Identifier {
	return ServiceUserRoleIdentifier(r.ServiceID)
}

We already deduplicate resources by ID, so even if every GenerateServiceInstanceResources method returns the same resource, only one of them will end up in the state, and it will only be deleted after all of its dependents are deleted. We use that same pattern in a few places, such as the pgBackRest stanza and the Patroni cluster info in Etcd.

@rshoemaker
Contributor Author

@jason-lynch I've switched from using ServiceInstanceID to ServiceID so there's only one db role shared across service instances.

@@ -34,6 +34,7 @@
// - users.yaml: Application-owned, written only on first Create if init_users is set
type MCPConfigResource struct {
ServiceInstanceID string `json:"service_instance_id"`
Member

@jason-lynch jason-lynch Mar 10, 2026

Sorry to request another change here - but now that we're making one of these per service rather than one per instance, we shouldn't have the instance ID or the host ID as properties.

Contributor Author

@jason-lynch did you mean to suggest removing ServiceInstanceID from here in mcp_config_resource or did you mean to make the suggestion in service_user_role? The service user role is now per-service, but the config file resource is still per-service-instance.

Assuming you meant for the comment to be in service_user_role, do you think HostID can also be removed? I think that might be unnecessary also.

Member

Assuming you meant for the comment to be in service_user_role, do you think HostID can also be removed? I think that might be unnecessary also.

Yes, you're right. I also made this comment on the wrong resource, and you're right that the same comment about removing HostID still applies. As mentioned offline, we'll want to use PrimaryExecutor instead of HostExecutor.

}

func (r *MCPConfigResource) Executor() resource.Executor {
return resource.HostExecutor(r.HostID)
Member

@jason-lynch jason-lynch Mar 10, 2026

I didn't notice this before, but using the service host here is going to prevent us from running this service on a host without a database instance. Instead, we should pick any one of the nodes in the database and use resource.PrimaryExecutor(r.NodeName).

Member

I got these files mixed up when I was reviewing this. Sorry about that!

Member

@jason-lynch jason-lynch left a comment

Approving since you fixed the original issue. Feel free to address the other comments here or in a separate PR.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
server/internal/orchestrator/swarm/service_instance.go (1)

31-39: ⚠️ Potential issue | 🟠 Major

Backfill ServiceSpecID during Refresh() or upgraded resources stay broken.

ServiceSpecID is now the stable dependency key, but it is ignored in diffs and never repopulated from the desired spec. Any ServiceInstanceResource already in state before this PR keeps the zero value forever, so later dependency resolution still points at ServiceUserRoleIdentifier("") instead of the shared role. Set it from desired.ServiceSpec.ServiceID while refreshing current state.

♻️ Suggested backfill
desired, err := resource.FromContext[*ServiceInstanceSpecResource](rc, ServiceInstanceSpecResourceIdentifier(s.ServiceInstanceID))
if err != nil {
	return fmt.Errorf("failed to get desired service spec from state: %w", err)
}
+ s.ServiceSpecID = desired.ServiceSpec.ServiceID

resp, err := client.ServiceInspectByLabels(ctx, map[string]string{

Also applies to: 45-67, 69-94

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/orchestrator/swarm/service_instance.go` around lines 31 - 39,
During Refresh(), populate ServiceInstanceResource.ServiceSpecID from the
desired spec so resources created before this change get backfilled: in the
Refresh implementation where you map current+desired state (look for function
Refresh and the code handling ServiceInstanceResource and desired.ServiceSpec),
set resource.ServiceSpecID = desired.ServiceSpec.ServiceID (or the equivalent
desired.ServiceSpecID) when desired is non-nil; do this when constructing or
updating the current resource so dependency resolution uses the stable
ServiceSpecID while preserving other fields like ServiceID (Docker Swarm service
ID).
server/internal/database/service_instance.go (1)

128-148: ⚠️ Potential issue | 🟠 Major

Distinct service IDs can collapse onto the same database username.

Because normalization happens before both the short-name return and the hash input, foo-bar and foo_bar now generate the same username. With one shared role per service, those two resources would fight over the same Postgres role/password. Preserve the original serviceID in the uniqueness material so normalization does not erase distinctions between valid service IDs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/database/service_instance.go` around lines 128 - 148, The
GenerateServiceUsername function currently normalizes serviceID before both
building the short username and computing the hash, causing collisions for
inputs like "foo-bar" vs "foo_bar"; fix by saving the original serviceID (e.g.,
orig := serviceID) before replacing hyphens, use the sanitized value for the
returned username and for length/truncation logic (the short-name check and raw
prefix), but compute the SHA256 hash/suffix from the original unmodified
serviceID (sha256.Sum256([]byte(orig))) so normalization doesn't remove
uniqueness between distinct IDs.
server/internal/orchestrator/swarm/service_user_role.go (1)

173-185: ⚠️ Potential issue | 🟠 Major

Revoke the app-db grants before DROP ROLE.

The service user role created at lines 173-183 has grants on r.DatabaseName (CONNECT, schema/table privileges, and default privileges), but the Delete method (lines 212-229) attempts DROP ROLE without revoking these first. PostgreSQL will not drop a role with active privileges, so the role persists in the database. Run REASSIGN OWNED BY and DROP OWNED BY against r.DatabaseName before dropping the role.
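The ordering the bot calls for can be sketched as a statement sequence: `REASSIGN OWNED` transfers object ownership, `DROP OWNED` revokes the role's privileges in the current database, and only then can `DROP ROLE` succeed. This is a sketch, not the project's actual Delete method; identifier sanitization and the fallback owner are assumed to be handled by the caller:

```go
package main

import "fmt"

// dropRoleStatements returns the suggested cleanup order. These must run
// while connected to the application database, since OWNED BY is scoped
// to the current database.
func dropRoleStatements(username, fallbackOwner string) []string {
	return []string{
		// Transfer ownership of anything the role owns.
		fmt.Sprintf("REASSIGN OWNED BY %s TO %s", username, fallbackOwner),
		// Revoke the role's remaining privileges in this database.
		fmt.Sprintf("DROP OWNED BY %s", username),
		// Now the role has no dependencies and can be dropped.
		fmt.Sprintf("DROP ROLE IF EXISTS %s", username),
	}
}

func main() {
	for _, s := range dropRoleStatements("svc_foo", "postgres") {
		fmt.Println(s + ";")
	}
}
```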

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/orchestrator/swarm/service_user_role.go` around lines 173 -
185, The Delete method currently drops the role without revoking DB-specific
privileges created in the grants block (the grants variable referencing
r.DatabaseName and r.Username), causing DROP ROLE to fail; update the Delete
implementation (the method that calls DROP ROLE for r.Username) to first run
REASSIGN OWNED BY <user> TO <target> and DROP OWNED BY <user> IN DATABASE <db>
(or equivalent GRANT/REVOKE statements) against
sanitizeIdentifier(r.DatabaseName) for sanitizeIdentifier(r.Username), then
revoke schema/table/default privileges (the same patterns used when creating
grants) before attempting DROP ROLE so the role can be removed cleanly. Ensure
use of the same sanitizeIdentifier(r.Username) and
sanitizeIdentifier(r.DatabaseName) helpers and execute these statements via the
same tx/Exec flow used by grants.
🧹 Nitpick comments (1)
server/internal/orchestrator/swarm/service_user_role.go (1)

125-192: Add transient error retry for concurrent role creation.

The createUserRole() method checks if the role exists before creating it, but concurrent requests can still race past this check and both attempt CREATE ROLE, triggering PostgreSQL's catalog conflict error (XX000 / "tuple concurrently updated"). While the workflow eventually retries failed creates on the next reconciliation, adding immediate exponential backoff retry for transient catalog conflicts here would eliminate the transient failure window and improve resilience during parallel service provisioning. The utils.Retry() function is already available in the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/orchestrator/swarm/service_user_role.go` around lines 125 -
192, The createUserRole function can race on CREATE ROLE; wrap the role-creation
sequence in utils.Retry with exponential backoff: for each attempt open a new
transaction (use connectToPrimary -> conn.Begin), check/enable Spock, generate
postgres.CreateUserRole(...) and call statements.Exec (and then apply
grants.Exec and tx.Commit); on transient catalog-conflict errors (Postgres
SQLSTATE "XX000" or error text contains "tuple concurrently updated" /
pgconn.PgError.Code == "XX000") return a retryable error to utils.Retry,
otherwise return the error immediately; ensure tx.Rollback on non-nil tx before
retrying and only return nil after a successful Commit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/internal/orchestrator/swarm/mcp_config_resource.go`:
- Around line 37-38: The Refresh logic in MCPConfigResource must not attempt
credential resolution for legacy state: remove or skip the call to
populateCredentials() inside MCPConfigResource.Refresh (and any direct calls to
ServiceUserRoleIdentifier when r.ServiceID == ""), and instead only verify
existence/validity of config.yaml in Refresh; leave populateCredentials() to be
invoked only during reconciliation where ServiceID is guaranteed (or backfill
ServiceID prior to calling populateCredentials()). Update Refresh to
early-return after checking config.yaml if ServiceID is empty, and ensure no
ServiceUserRoleIdentifier("") calls occur from Refresh.

---

Outside diff comments:
In `@server/internal/database/service_instance.go`:
- Around line 128-148: The GenerateServiceUsername function currently normalizes
serviceID before both building the short username and computing the hash,
causing collisions for inputs like "foo-bar" vs "foo_bar"; fix by saving the
original serviceID (e.g., orig := serviceID) before replacing hyphens, use the
sanitized value for the returned username and for length/truncation logic (the
short-name check and raw prefix), but compute the SHA256 hash/suffix from the
original unmodified serviceID (sha256.Sum256([]byte(orig))) so normalization
doesn't remove uniqueness between distinct IDs.

In `@server/internal/orchestrator/swarm/service_instance.go`:
- Around line 31-39: During Refresh(), populate
ServiceInstanceResource.ServiceSpecID from the desired spec so resources created
before this change get backfilled: in the Refresh implementation where you map
current+desired state (look for function Refresh and the code handling
ServiceInstanceResource and desired.ServiceSpec), set resource.ServiceSpecID =
desired.ServiceSpec.ServiceID (or the equivalent desired.ServiceSpecID) when
desired is non-nil; do this when constructing or updating the current resource
so dependency resolution uses the stable ServiceSpecID while preserving other
fields like ServiceID (Docker Swarm service ID).

In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 173-185: The Delete method currently drops the role without
revoking DB-specific privileges created in the grants block (the grants variable
referencing r.DatabaseName and r.Username), causing DROP ROLE to fail; update
the Delete implementation (the method that calls DROP ROLE for r.Username) to
first run REASSIGN OWNED BY <user> TO <target> and DROP OWNED BY <user> IN
DATABASE <db> (or equivalent GRANT/REVOKE statements) against
sanitizeIdentifier(r.DatabaseName) for sanitizeIdentifier(r.Username), then
revoke schema/table/default privileges (the same patterns used when creating
grants) before attempting DROP ROLE so the role can be removed cleanly. Ensure
use of the same sanitizeIdentifier(r.Username) and
sanitizeIdentifier(r.DatabaseName) helpers and execute these statements via the
same tx/Exec flow used by grants.

---

Nitpick comments:
In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 125-192: The createUserRole function can race on CREATE ROLE; wrap
the role-creation sequence in utils.Retry with exponential backoff: for each
attempt open a new transaction (use connectToPrimary -> conn.Begin),
check/enable Spock, generate postgres.CreateUserRole(...) and call
statements.Exec (and then apply grants.Exec and tx.Commit); on transient
catalog-conflict errors (Postgres SQLSTATE "XX000" or error text contains "tuple
concurrently updated" / pgconn.PgError.Code == "XX000") return a retryable error
to utils.Retry, otherwise return the error immediately; ensure tx.Rollback on
non-nil tx before retrying and only return nil after a successful Commit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3f6b7faf-dbc0-4fb1-a191-7a4b0f888deb

📥 Commits

Reviewing files that changed from the base of the PR and between 3f4a388 and 2984822.

📒 Files selected for processing (7)
  • server/internal/database/service_instance.go
  • server/internal/database/service_instance_test.go
  • server/internal/orchestrator/swarm/mcp_config_resource.go
  • server/internal/orchestrator/swarm/orchestrator.go
  • server/internal/orchestrator/swarm/service_instance.go
  • server/internal/orchestrator/swarm/service_instance_spec.go
  • server/internal/orchestrator/swarm/service_user_role.go

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
server/internal/workflows/plan_update.go (1)

60-67: Consider adding a comment clarifying NodeName routing semantics.

The code picks nodeInstances[0].NodeName for routing. While this works correctly (the PrimaryExecutor resolution dynamically finds the actual primary via Patroni regardless of which node name is used), a brief comment would help future readers understand that any valid node name suffices for primary routing.

💡 Suggested clarifying comment
 	// Generate service instance resources.
-	// Pick any node name for ServiceUserRole PrimaryExecutor routing.
+	// Pick any node name for ServiceUserRole PrimaryExecutor routing.
+	// The actual primary is resolved dynamically via Patroni during execution,
+	// so any valid database node name works here.
 	var nodeName string
 	if len(nodeInstances) > 0 {
 		nodeName = nodeInstances[0].NodeName
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/workflows/plan_update.go` around lines 60 - 67, Add a short
clarifying comment above the nodeName selection in the plan update logic
explaining that choosing nodeInstances[0].NodeName is intentional because
ServiceUserRole/PrimaryExecutor routing will resolve the actual primary via
Patroni irrespective of which valid node name is provided; reference the
nodeName variable, nodeInstances slice, PrimaryExecutor and ServiceUserRole
resolution so readers understand any node name suffices for routing when
input.Spec.Services is non-empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@server/internal/workflows/plan_update.go`:
- Around line 60-67: Add a short clarifying comment above the nodeName selection
in the plan update logic explaining that choosing nodeInstances[0].NodeName is
intentional because ServiceUserRole/PrimaryExecutor routing will resolve the
actual primary via Patroni irrespective of which valid node name is provided;
reference the nodeName variable, nodeInstances slice, PrimaryExecutor and
ServiceUserRole resolution so readers understand any node name suffices for
routing when input.Spec.Services is non-empty.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 76f8b620-cfb4-4f54-a4fd-ee7c4b329895

📥 Commits

Reviewing files that changed from the base of the PR and between 2984822 and c7bbfc6.

📒 Files selected for processing (4)
  • server/internal/database/service_instance.go
  • server/internal/orchestrator/swarm/orchestrator.go
  • server/internal/orchestrator/swarm/service_user_role.go
  • server/internal/workflows/plan_update.go

Member

@jason-lynch jason-lynch left a comment

Awesome! Thank you for making these changes on the fly.

Wrap service user role creation in a transaction with Spock repair
mode enabled. This prevents "tuple concurrently updated" errors
when the same user is being created on multiple instances
simultaneously and Spock replicates the changes.

Wrap createUserRole in a retry loop (3 attempts, exponential backoff)
to handle transient "tuple concurrently updated" errors when multiple
service user roles are created concurrently against the same Postgres
instance.

Also connect to the application database instead of "postgres" so
that Spock repair mode is correctly enabled (Spock is installed
per-database).

# Conflicts:
#	server/internal/orchestrator/swarm/service_user_role.go
Change the ServiceUserRole resource identity from ServiceInstanceID to
ServiceID so that only one database user role is created per service,
shared across all instances. This eliminates the concurrent "tuple
concurrently updated" (SQLSTATE XX000) errors that occurred when
multiple service instances tried to create separate roles on the same
Postgres instance simultaneously.

  - ServiceUserRoleIdentifier now keys on ServiceID
  - GenerateServiceUsername takes only serviceID (no hostID)
  - All dependent resources updated to look up role by ServiceID
  - MCPConfigResource gains a ServiceID field for the lookup
  - ResourceVersion bumped to "3" to trigger recreation

ServiceInstanceResource.ServiceID gets overwritten during Refresh with
the Docker Swarm service ID, so using it for the ServiceUserRole
dependency lookup caused "not found" errors on subsequent updates.

Add a ServiceSpecID field that holds the logical service ID from the
spec and use it in Dependencies() instead.

Before the dedup change, ServiceInstanceResource.Dependencies() used
ServiceUserRoleIdentifier(s.ServiceInstanceID), which is stable — it
never gets overwritten during Refresh. The ServiceID field wasn't
used in any dependency lookups.

The dedup change switched to ServiceUserRoleIdentifier(s.ServiceID),
which inadvertently used the field that Refresh overwrites with the
Docker Swarm service ID.

Remove ServiceInstanceID and HostID from ServiceUserRole since the
resource is now per-service, not per-instance. Replace HostExecutor
with PrimaryExecutor to dynamically route to the current primary
rather than hardcoding a specific host.

Remove Spock repair mode from role creation — the role is created
once on the primary and Spock replicates it to all other nodes
automatically.
@rshoemaker rshoemaker force-pushed the fix/PLAT-478/svc_user_tx_with_repair_mode branch from c7bbfc6 to 756c1d1 on March 11, 2026 at 15:34
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
server/internal/orchestrator/swarm/mcp_config_resource.go (1)

257-258: ⚠️ Potential issue | 🟡 Minor

Keep credential resolution out of Refresh().

This helper is still invoked from Refresh(), so a missing or transient ServiceUserRole is reported as MCP config drift before config.yaml is even checked. Refresh() for this resource should stay focused on the file on disk; resolve credentials in Create()/Update() only.

Based on learnings, "Refresh should only verify the existence of config.yaml. Tokens.yaml and users.yaml are owned and managed by the MCP server at runtime and should not be rewritten on Update."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/orchestrator/swarm/mcp_config_resource.go` around lines 257 -
258, populateCredentials is being invoked from Refresh(), causing
missing/transient ServiceUserRole errors to surface as config drift; remove any
call to MCPConfigResource.populateCredentials (and any
resource.FromContext[*ServiceUserRole] lookup using ServiceUserRoleIdentifier)
from Refresh(), and instead call populateCredentials only from
MCPConfigResource.Create() and MCPConfigResource.Update() paths after validating
config.yaml; ensure Refresh() only checks for existence/contents of config.yaml
and does not touch tokens.yaml or users.yaml.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/internal/orchestrator/swarm/service_instance.go`:
- Around line 35-36: The etcd upsert is still writing the ephemeral Docker Swarm
ID (ServiceID) instead of the stable spec identifier, so change the persistence
calls in Create() (and the other upsert sites around the same area) to use
ServiceSpecID when storing the logical service identifier; update any upsert/Put
call that currently uses s.ServiceID to use s.ServiceSpecID, and ensure
Refresh() continues to set ServiceID for runtime use while persistence uses
ServiceSpecID so retries/Re-create paths don't overwrite etcd with the
Docker-assigned ID; touch the same change for the other occurrences mentioned
(the block around lines 64-65) and keep ProvisionServices behavior unchanged.

In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 59-60: ResourceVersion on ServiceUserRole is not a recreate
trigger in this framework (resource/resource.go ignores version-only changes),
so simply returning "3" in ServiceUserRole.ResourceVersion won't force a
rebuild; you must add an explicit recreate path. Update the ServiceUserRole
reconciliation/ensure logic to detect the version bump (compare the stored
resource meta or previous ResourceVersion) and return an explicit recreate
signal (e.g., return the framework's recreate result or error that causes
delete+create) or implement a dedicated NeedsRecreate/EnsureResult flag in
ServiceUserRole so the framework will delete and recreate dependents instead of
treating "3" as no-op (refer to ServiceUserRole.ResourceVersion and
server/internal/resource/resource.go for where to hook this logic).

---

Duplicate comments:
In `@server/internal/orchestrator/swarm/mcp_config_resource.go`:
- Around line 257-258: populateCredentials is being invoked from Refresh(),
causing missing/transient ServiceUserRole errors to surface as config drift;
remove any call to MCPConfigResource.populateCredentials (and any
resource.FromContext[*ServiceUserRole] lookup using ServiceUserRoleIdentifier)
from Refresh(), and instead call populateCredentials only from
MCPConfigResource.Create() and MCPConfigResource.Update() paths after validating
config.yaml; ensure Refresh() only checks for existence/contents of config.yaml
and does not touch tokens.yaml or users.yaml.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 15313633-372b-403b-ac28-46163496d4ea

📥 Commits

Reviewing files that changed from the base of the PR and between c7bbfc6 and 756c1d1.

📒 Files selected for processing (8)
  • server/internal/database/service_instance.go
  • server/internal/database/service_instance_test.go
  • server/internal/orchestrator/swarm/mcp_config_resource.go
  • server/internal/orchestrator/swarm/orchestrator.go
  • server/internal/orchestrator/swarm/service_instance.go
  • server/internal/orchestrator/swarm/service_instance_spec.go
  • server/internal/orchestrator/swarm/service_user_role.go
  • server/internal/workflows/plan_update.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • server/internal/orchestrator/swarm/service_instance_spec.go
  • server/internal/orchestrator/swarm/orchestrator.go

@rshoemaker rshoemaker merged commit 004c9df into main Mar 11, 2026
3 checks passed
