fix: enable Spock repair mode for service user creation #287
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Service user-role resources were restructured to use ServiceID/ServiceSpecID and NodeName routing, username generation now derives from serviceID only, creation moved into a per-database transactional helper, and primary connections now accept an explicit dbName (AdminDSN(dbName)).
🚥 Pre-merge checks: 1 passed, 2 failed (warnings)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 130-133: The call to postgres.IsSpockEnabled().Scalar silently
hides query errors because Scalar currently returns nil on row.Scan failures;
update the Scalar implementation in the postgres statement (the method that
implements Scalar on the Statement/Query type) so that any row.Scan error is
returned to the caller instead of nil, ensuring postgres.IsSpockEnabled().Scalar
propagates the scan/DB error back to the caller (so the existing error handling
in service_user_role.go around enabled, err := ... Scalar(ctx, tx) works as
intended).
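To make the suggested fix concrete, here is a minimal sketch of a Scalar implementation that propagates Scan errors instead of swallowing them. The Statement type, its fields, and the pgx usage are assumptions for illustration; the real type in the postgres package may be shaped differently.

package postgres

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5"
)

// Statement is a stand-in for the query type referenced in the review;
// the real type in this repo likely carries different fields.
type Statement struct {
	SQL  string
	Args []any
}

// Scalar runs the statement and returns its single scalar result.
// The key change is that Scan errors are returned to the caller
// instead of being dropped (which previously made IsSpockEnabled()
// look like "not enabled" on any query failure).
func (s Statement) Scalar(ctx context.Context, tx pgx.Tx) (any, error) {
	row := tx.QueryRow(ctx, s.SQL, s.Args...)

	var value any
	if err := row.Scan(&value); err != nil {
		return nil, fmt.Errorf("scalar query failed: %w", err)
	}
	return value, nil
}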
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c1ad74e6-e8db-4423-8957-d107629152ec
📒 Files selected for processing (1)
server/internal/orchestrator/swarm/service_user_role.go
The Spock repair mode change is necessary (although unrelated to the E2E failures since there’s only one database instance), but I think the retries are papering over the real issue. As you pointed out, the problem is that each instance has its own ServiceUserRole resource. I think we should remove the retries and do either one of the following:
Or:
func ServiceUserRoleIdentifier(serviceID string) resource.Identifier {
return resource.Identifier{
ID: serviceID,
Type: ResourceTypeServiceUserRole,
}
}
// ...
func (r *ServiceUserRole) Identifier() resource.Identifier {
return ServiceUserRoleIdentifier(r.ServiceID)
}

We already deduplicate resources by ID, so even if every service instance declares this resource, only one role will be created.
@jason-lynch I've switched from using ServiceInstanceID to ServiceID for the ServiceUserRole identity as suggested.
@@ -34,6 +34,7 @@
// - users.yaml: Application-owned, written only on first Create if init_users is set
type MCPConfigResource struct {
	ServiceInstanceID string `json:"service_instance_id"`
Sorry to request another change here - but now that we're making one of these per service rather than one per instance, we shouldn't have the instance ID or the host ID as properties.
@jason-lynch did you mean to suggest removing ServiceInstanceID from here in mcp_config_resource or did you mean to make the suggestion in service_user_role? The service user role is now per-service, but the config file resource is still per-service-instance.
Assuming you meant for the comment to be in service_user_role, do you think HostID can also be removed? I think that might be unnecessary also.
Assuming you meant for the comment to be in service_user_role, do you think HostID can also be removed? I think that might be unnecessary also.
Yes, you're right. I also made this comment on the wrong resource, and you're right that the same comment about removing HostID still applies. As mentioned offline, we'll want to use PrimaryExecutor instead of HostExecutor.
}

func (r *MCPConfigResource) Executor() resource.Executor {
	return resource.HostExecutor(r.HostID)
}
I didn't notice this before, but using the service host here is going to prevent us from running this service on a host without a database instance. Instead, we should pick any one of the nodes in the database and use resource.PrimaryExecutor(r.NodeName).
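A minimal sketch of that routing change, which per the later commits landed on ServiceUserRole rather than MCPConfigResource; it assumes the resource.PrimaryExecutor(nodeName) helper referenced above and a NodeName field on the resource:

func (r *ServiceUserRole) Executor() resource.Executor {
	// Route execution to whichever database node is currently the Patroni
	// primary, instead of pinning the role to the host running the service.
	return resource.PrimaryExecutor(r.NodeName)
}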
I got these files mixed up when I was reviewing this. Sorry about that!
jason-lynch left a comment:
Approving since you fixed the original issue. Feel free to address the other comments here or in a separate PR.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
server/internal/orchestrator/swarm/service_instance.go (1)
31-39: ⚠️ Potential issue | 🟠 Major

Backfill ServiceSpecID during Refresh() or upgraded resources stay broken.

ServiceSpecID is now the stable dependency key, but it is ignored in diffs and never repopulated from the desired spec. Any ServiceInstanceResource already in state before this PR keeps the zero value forever, so later dependency resolution still points at ServiceUserRoleIdentifier("") instead of the shared role. Set it from desired.ServiceSpec.ServiceID while refreshing current state.

♻️ Suggested backfill
  desired, err := resource.FromContext[*ServiceInstanceSpecResource](rc, ServiceInstanceSpecResourceIdentifier(s.ServiceInstanceID))
  if err != nil {
      return fmt.Errorf("failed to get desired service spec from state: %w", err)
  }
+ s.ServiceSpecID = desired.ServiceSpec.ServiceID
  resp, err := client.ServiceInspectByLabels(ctx, map[string]string{

Also applies to: 45-67, 69-94
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/swarm/service_instance.go` around lines 31 - 39, During Refresh(), populate ServiceInstanceResource.ServiceSpecID from the desired spec so resources created before this change get backfilled: in the Refresh implementation where you map current+desired state (look for function Refresh and the code handling ServiceInstanceResource and desired.ServiceSpec), set resource.ServiceSpecID = desired.ServiceSpec.ServiceID (or the equivalent desired.ServiceSpecID) when desired is non-nil; do this when constructing or updating the current resource so dependency resolution uses the stable ServiceSpecID while preserving other fields like ServiceID (Docker Swarm service ID).

server/internal/database/service_instance.go (1)
128-148: ⚠️ Potential issue | 🟠 Major

Distinct service IDs can collapse onto the same database username.

Because normalization happens before both the short-name return and the hash input, foo-bar and foo_bar now generate the same username. With one shared role per service, those two resources would fight over the same Postgres role/password. Preserve the original serviceID in the uniqueness material so normalization does not erase distinctions between valid service IDs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/database/service_instance.go` around lines 128 - 148, The GenerateServiceUsername function currently normalizes serviceID before both building the short username and computing the hash, causing collisions for inputs like "foo-bar" vs "foo_bar"; fix by saving the original serviceID (e.g., orig := serviceID) before replacing hyphens, use the sanitized value for the returned username and for length/truncation logic (the short-name check and raw prefix), but compute the SHA256 hash/suffix from the original unmodified serviceID (sha256.Sum256([]byte(orig))) so normalization doesn't remove uniqueness between distinct IDs.
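To make that concrete, here is a hedged sketch of one way to keep distinct service IDs distinct: always fold a short hash of the original, unmodified serviceID into the generated name. The svc_ prefix, length limit, and truncation rules are illustrative; the repo's actual GenerateServiceUsername may differ.

package database

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// GenerateServiceUsername builds a Postgres role name from a logical
// service ID. The human-readable part is sanitized, but the uniqueness
// suffix is hashed from the original serviceID so "foo-bar" and "foo_bar"
// cannot collapse onto the same role.
func GenerateServiceUsername(serviceID string) string {
	const maxLen = 63 // Postgres identifier length limit (assumed here)

	sum := sha256.Sum256([]byte(serviceID)) // hash the original, not the sanitized form
	suffix := hex.EncodeToString(sum[:])[:8]

	name := "svc_" + strings.ReplaceAll(serviceID, "-", "_")
	if len(name)+1+len(suffix) > maxLen {
		name = name[:maxLen-1-len(suffix)]
	}
	return name + "_" + suffix
}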
server/internal/orchestrator/swarm/service_user_role.go (1)

173-185: ⚠️ Potential issue | 🟠 Major

Revoke the app-db grants before DROP ROLE.

The service user role created at lines 173-183 has grants on r.DatabaseName (CONNECT, schema/table privileges, and default privileges), but the Delete method (lines 212-229) attempts DROP ROLE without revoking these first. PostgreSQL will not drop a role with active privileges, so the role persists in the database. Run REASSIGN OWNED BY and DROP OWNED BY against r.DatabaseName before dropping the role.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/swarm/service_user_role.go` around lines 173 - 185, The Delete method currently drops the role without revoking DB-specific privileges created in the grants block (the grants variable referencing r.DatabaseName and r.Username), causing DROP ROLE to fail; update the Delete implementation (the method that calls DROP ROLE for r.Username) to first run REASSIGN OWNED BY <user> TO <target> and DROP OWNED BY <user> IN DATABASE <db> (or equivalent GRANT/REVOKE statements) against sanitizeIdentifier(r.DatabaseName) for sanitizeIdentifier(r.Username), then revoke schema/table/default privileges (the same patterns used when creating grants) before attempting DROP ROLE so the role can be removed cleanly. Ensure use of the same sanitizeIdentifier(r.Username) and sanitizeIdentifier(r.DatabaseName) helpers and execute these statements via the same tx/Exec flow used by grants.
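A sketch of the cleanup order the comment describes, assuming the connectToPrimary and sanitizeIdentifier helpers mentioned in the prompt; the reassignment target (postgres) and the method name are illustrative:

func (r *ServiceUserRole) deleteUserRole(ctx context.Context) error {
	// REASSIGN OWNED BY and DROP OWNED BY act on the database you are
	// connected to, so connect to the application database, not "postgres".
	conn, err := r.connectToPrimary(ctx, r.DatabaseName)
	if err != nil {
		return err
	}
	defer conn.Close(ctx)

	user := sanitizeIdentifier(r.Username)
	stmts := []string{
		// Hand any owned objects to the admin role (target name assumed).
		"REASSIGN OWNED BY " + user + " TO postgres",
		// Revokes every privilege the role holds in this database,
		// including default privileges, and drops leftover owned objects.
		"DROP OWNED BY " + user,
		// Only now can the cluster-wide role be removed.
		"DROP ROLE IF EXISTS " + user,
	}
	for _, stmt := range stmts {
		if _, err := conn.Exec(ctx, stmt); err != nil {
			return fmt.Errorf("cleaning up role %s: %w", r.Username, err)
		}
	}
	return nil
}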
🧹 Nitpick comments (1)
server/internal/orchestrator/swarm/service_user_role.go (1)
125-192: Add transient error retry for concurrent role creation.

The createUserRole() method checks if the role exists before creating it, but concurrent requests can still race past this check and both attempt CREATE ROLE, triggering PostgreSQL's catalog conflict error (XX000 / "tuple concurrently updated"). While the workflow eventually retries failed creates on the next reconciliation, adding immediate exponential backoff retry for transient catalog conflicts here would eliminate the transient failure window and improve resilience during parallel service provisioning. The utils.Retry() function is already available in the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/swarm/service_user_role.go` around lines 125 - 192, The createUserRole function can race on CREATE ROLE; wrap the role-creation sequence in utils.Retry with exponential backoff: for each attempt open a new transaction (use connectToPrimary -> conn.Begin), check/enable Spock, generate postgres.CreateUserRole(...) and call statements.Exec (and then apply grants.Exec and tx.Commit); on transient catalog-conflict errors (Postgres SQLSTATE "XX000" or error text contains "tuple concurrently updated" / pgconn.PgError.Code == "XX000") return a retryable error to utils.Retry, otherwise return the error immediately; ensure tx.Rollback on non-nil tx before retrying and only return nil after a successful Commit.
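A sketch of what that retry could look like; the repo's utils.Retry helper is not shown here, so this uses a plain loop, and the pgconn-based check is an assumption about how the driver surfaces SQLSTATE XX000:

package swarm

import (
	"context"
	"errors"
	"strings"
	"time"

	"github.com/jackc/pgx/v5/pgconn"
)

// isCatalogConflict reports whether err is the transient "tuple concurrently
// updated" catalog conflict (SQLSTATE XX000) described above.
func isCatalogConflict(err error) bool {
	var pgErr *pgconn.PgError
	if errors.As(err, &pgErr) && pgErr.Code == "XX000" {
		return true
	}
	return err != nil && strings.Contains(err.Error(), "tuple concurrently updated")
}

// retryCatalogConflicts retries attempt with exponential backoff, but only
// for transient catalog conflicts; any other error fails immediately.
func retryCatalogConflicts(ctx context.Context, attempt func(context.Context) error) error {
	const maxAttempts = 3
	backoff := 100 * time.Millisecond

	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = attempt(ctx); err == nil {
			return nil
		}
		if !isCatalogConflict(err) {
			return err
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return err
}

Note that the PR ultimately sidesteps this race by deduplicating the role per service, so a retry like this would only be a belt-and-braces measure.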
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/internal/orchestrator/swarm/mcp_config_resource.go`:
- Around line 37-38: The Refresh logic in MCPConfigResource must not attempt
credential resolution for legacy state: remove or skip the call to
populateCredentials() inside MCPConfigResource.Refresh (and any direct calls to
ServiceUserRoleIdentifier when r.ServiceID == ""), and instead only verify
existence/validity of config.yaml in Refresh; leave populateCredentials() to be
invoked only during reconciliation where ServiceID is guaranteed (or backfill
ServiceID prior to calling populateCredentials()). Update Refresh to
early-return after checking config.yaml if ServiceID is empty, and ensure no
ServiceUserRoleIdentifier("") calls occur from Refresh.
---
Outside diff comments:
In `@server/internal/database/service_instance.go`:
- Around line 128-148: The GenerateServiceUsername function currently normalizes
serviceID before both building the short username and computing the hash,
causing collisions for inputs like "foo-bar" vs "foo_bar"; fix by saving the
original serviceID (e.g., orig := serviceID) before replacing hyphens, use the
sanitized value for the returned username and for length/truncation logic (the
short-name check and raw prefix), but compute the SHA256 hash/suffix from the
original unmodified serviceID (sha256.Sum256([]byte(orig))) so normalization
doesn't remove uniqueness between distinct IDs.
In `@server/internal/orchestrator/swarm/service_instance.go`:
- Around line 31-39: During Refresh(), populate
ServiceInstanceResource.ServiceSpecID from the desired spec so resources created
before this change get backfilled: in the Refresh implementation where you map
current+desired state (look for function Refresh and the code handling
ServiceInstanceResource and desired.ServiceSpec), set resource.ServiceSpecID =
desired.ServiceSpec.ServiceID (or the equivalent desired.ServiceSpecID) when
desired is non-nil; do this when constructing or updating the current resource
so dependency resolution uses the stable ServiceSpecID while preserving other
fields like ServiceID (Docker Swarm service ID).
In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 173-185: The Delete method currently drops the role without
revoking DB-specific privileges created in the grants block (the grants variable
referencing r.DatabaseName and r.Username), causing DROP ROLE to fail; update
the Delete implementation (the method that calls DROP ROLE for r.Username) to
first run REASSIGN OWNED BY <user> TO <target> and DROP OWNED BY <user> IN
DATABASE <db> (or equivalent GRANT/REVOKE statements) against
sanitizeIdentifier(r.DatabaseName) for sanitizeIdentifier(r.Username), then
revoke schema/table/default privileges (the same patterns used when creating
grants) before attempting DROP ROLE so the role can be removed cleanly. Ensure
use of the same sanitizeIdentifier(r.Username) and
sanitizeIdentifier(r.DatabaseName) helpers and execute these statements via the
same tx/Exec flow used by grants.
---
Nitpick comments:
In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 125-192: The createUserRole function can race on CREATE ROLE; wrap
the role-creation sequence in utils.Retry with exponential backoff: for each
attempt open a new transaction (use connectToPrimary -> conn.Begin),
check/enable Spock, generate postgres.CreateUserRole(...) and call
statements.Exec (and then apply grants.Exec and tx.Commit); on transient
catalog-conflict errors (Postgres SQLSTATE "XX000" or error text contains "tuple
concurrently updated" / pgconn.PgError.Code == "XX000") return a retryable error
to utils.Retry, otherwise return the error immediately; ensure tx.Rollback on
non-nil tx before retrying and only return nil after a successful Commit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3f6b7faf-dbc0-4fb1-a191-7a4b0f888deb
📒 Files selected for processing (7)
server/internal/database/service_instance.go
server/internal/database/service_instance_test.go
server/internal/orchestrator/swarm/mcp_config_resource.go
server/internal/orchestrator/swarm/orchestrator.go
server/internal/orchestrator/swarm/service_instance.go
server/internal/orchestrator/swarm/service_instance_spec.go
server/internal/orchestrator/swarm/service_user_role.go
🧹 Nitpick comments (1)
server/internal/workflows/plan_update.go (1)
60-67: Consider adding a comment clarifying NodeName routing semantics.

The code picks nodeInstances[0].NodeName for routing. While this works correctly (the PrimaryExecutor resolution dynamically finds the actual primary via Patroni regardless of which node name is used), a brief comment would help future readers understand that any valid node name suffices for primary routing.

💡 Suggested clarifying comment
  // Generate service instance resources.
- // Pick any node name for ServiceUserRole PrimaryExecutor routing.
+ // Pick any node name for ServiceUserRole PrimaryExecutor routing.
+ // The actual primary is resolved dynamically via Patroni during execution,
+ // so any valid database node name works here.
  var nodeName string
  if len(nodeInstances) > 0 {
      nodeName = nodeInstances[0].NodeName
  }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/workflows/plan_update.go` around lines 60 - 67, Add a short clarifying comment above the nodeName selection in the plan update logic explaining that choosing nodeInstances[0].NodeName is intentional because ServiceUserRole/PrimaryExecutor routing will resolve the actual primary via Patroni irrespective of which valid node name is provided; reference the nodeName variable, nodeInstances slice, PrimaryExecutor and ServiceUserRole resolution so readers understand any node name suffices for routing when input.Spec.Services is non-empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@server/internal/workflows/plan_update.go`:
- Around line 60-67: Add a short clarifying comment above the nodeName selection
in the plan update logic explaining that choosing nodeInstances[0].NodeName is
intentional because ServiceUserRole/PrimaryExecutor routing will resolve the
actual primary via Patroni irrespective of which valid node name is provided;
reference the nodeName variable, nodeInstances slice, PrimaryExecutor and
ServiceUserRole resolution so readers understand any node name suffices for
routing when input.Spec.Services is non-empty.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 76f8b620-cfb4-4f54-a4fd-ee7c4b329895
📒 Files selected for processing (4)
server/internal/database/service_instance.go
server/internal/orchestrator/swarm/orchestrator.go
server/internal/orchestrator/swarm/service_user_role.go
server/internal/workflows/plan_update.go
jason-lynch left a comment:
Awesome! Thank you for making these changes on the fly.
Wrap service user role creation in a transaction with Spock repair mode enabled. This prevents "tuple concurrently updated" errors when the same user is being created on multiple instances simultaneously and Spock replicates the changes.
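For context, a sketch of what that wrapping could look like; it assumes Spock exposes a spock.repair_mode(boolean) SQL function (as in pgEdge's documentation) and a connectToPrimary helper, neither of which is shown in this diff. The later commits removed repair mode again once the role became a single per-service resource created only on the primary.

func (r *ServiceUserRole) createUserRole(ctx context.Context) error {
	// Connect to the application database: Spock is installed per-database,
	// so repair mode must be enabled on the database being modified.
	conn, err := r.connectToPrimary(ctx, r.DatabaseName)
	if err != nil {
		return err
	}
	defer conn.Close(ctx)

	tx, err := conn.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx)

	// Keep the role DDL out of the replication stream so that the same role
	// being created on another instance does not collide when Spock replays
	// changes ("tuple concurrently updated").
	if _, err := tx.Exec(ctx, "SELECT spock.repair_mode(true)"); err != nil {
		return fmt.Errorf("enabling spock repair mode: %w", err)
	}

	// ... CREATE ROLE and GRANT statements go here, e.g. the statements
	// produced by postgres.CreateUserRole(...) in the real code ...

	return tx.Commit(ctx)
}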
Wrap createUserRole in a retry loop (3 attempts, exponential backoff) to handle transient "tuple concurrently updated" errors when multiple service user roles are created concurrently against the same Postgres instance. Also connect to the application database instead of "postgres" so that Spock repair mode is correctly enabled (Spock is installed per-database).

# Conflicts:
#	server/internal/orchestrator/swarm/service_user_role.go
Change the ServiceUserRole resource identity from ServiceInstanceID to ServiceID so that only one database user role is created per service, shared across all instances. This eliminates the concurrent "tuple concurrently updated" (SQLSTATE XX000) errors that occurred when multiple service instances tried to create separate roles on the same Postgres instance simultaneously. - ServiceUserRoleIdentifier now keys on ServiceID - GenerateServiceUsername takes only serviceID (no hostID) - All dependent resources updated to look up role by ServiceID - MCPConfigResource gains a ServiceID field for the lookup - ResourceVersion bumped to "3" to trigger recreation
ServiceInstanceResource.ServiceID gets overwritten during Refresh with the Docker Swarm service ID, so using it for the ServiceUserRole dependency lookup caused "not found" errors on subsequent updates. Add a ServiceSpecID field that holds the logical service ID from the spec and use it in Dependencies() instead. Before the dedup change, ServiceInstanceResource.Dependencies() used ServiceUserRoleIdentifier(s.ServiceInstanceID), which is stable — it never gets overwritten during Refresh. The ServiceID field wasn't used in any dependency lookups. The dedup change switched to ServiceUserRoleIdentifier(s.ServiceID), which inadvertently used the field that Refresh overwrites with the Docker Swarm service ID.
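A sketch of the resulting dependency lookup, assuming Dependencies() returns resource identifiers as elsewhere in this package; the exact shape of the real method may differ:

func (s *ServiceInstanceResource) Dependencies() []resource.Identifier {
	// ServiceSpecID is the logical service ID from the spec and is never
	// overwritten by Refresh, unlike ServiceID, which Refresh replaces
	// with the Docker Swarm service ID.
	return []resource.Identifier{
		ServiceUserRoleIdentifier(s.ServiceSpecID),
	}
}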
Remove ServiceInstanceID and HostID from ServiceUserRole since the resource is now per-service, not per-instance. Replace HostExecutor with PrimaryExecutor to dynamically route to the current primary rather than hardcoding a specific host. Remove Spock repair mode from role creation — the role is created once on the primary and Spock replicates it to all other nodes automatically.
Force-pushed from c7bbfc6 to 756c1d1.
Actionable comments posted: 2
♻️ Duplicate comments (1)
server/internal/orchestrator/swarm/mcp_config_resource.go (1)
257-258: ⚠️ Potential issue | 🟡 Minor

Keep credential resolution out of Refresh().

This helper is still invoked from Refresh(), so a missing or transient ServiceUserRole is reported as MCP config drift before config.yaml is even checked. Refresh() for this resource should stay focused on the file on disk; resolve credentials in Create()/Update() only.

Based on learnings, "Refresh should only verify the existence of config.yaml. Tokens.yaml and users.yaml are owned and managed by the MCP server at runtime and should not be rewritten on Update."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/swarm/mcp_config_resource.go` around lines 257 - 258, populateCredentials is being invoked from Refresh(), causing missing/transient ServiceUserRole errors to surface as config drift; remove any call to MCPConfigResource.populateCredentials (and any resource.FromContext[*ServiceUserRole] lookup using ServiceUserRoleIdentifier) from Refresh(), and instead call populateCredentials only from MCPConfigResource.Create() and MCPConfigResource.Update() paths after validating config.yaml; ensure Refresh() only checks for existence/contents of config.yaml and does not touch tokens.yaml or users.yaml.
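A sketch of the Refresh shape the comment argues for, checking only config.yaml and leaving credential resolution to Create/Update; the signature, configDir field, and return convention are assumptions about this resource framework:

func (r *MCPConfigResource) Refresh(ctx context.Context) (bool, error) {
	path := filepath.Join(r.configDir, "config.yaml")
	if _, err := os.Stat(path); errors.Is(err, os.ErrNotExist) {
		return false, nil // missing file simply means the resource needs creating
	} else if err != nil {
		return false, fmt.Errorf("checking %s: %w", path, err)
	}
	// No populateCredentials() here: a missing or transient ServiceUserRole
	// must not surface as MCP config drift during Refresh.
	return true, nil
}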
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/internal/orchestrator/swarm/service_instance.go`:
- Around line 35-36: The etcd upsert is still writing the ephemeral Docker Swarm
ID (ServiceID) instead of the stable spec identifier, so change the persistence
calls in Create() (and the other upsert sites around the same area) to use
ServiceSpecID when storing the logical service identifier; update any upsert/Put
call that currently uses s.ServiceID to use s.ServiceSpecID, and ensure
Refresh() continues to set ServiceID for runtime use while persistence uses
ServiceSpecID so retries/Re-create paths don't overwrite etcd with the
Docker-assigned ID; touch the same change for the other occurrences mentioned
(the block around lines 64-65) and keep ProvisionServices behavior unchanged.
In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 59-60: ResourceVersion on ServiceUserRole is not a recreate
trigger in this framework (resource/resource.go ignores version-only changes),
so simply returning "3" in ServiceUserRole.ResourceVersion won't force a
rebuild; you must add an explicit recreate path. Update the ServiceUserRole
reconciliation/ensure logic to detect the version bump (compare the stored
resource meta or previous ResourceVersion) and return an explicit recreate
signal (e.g., return the framework's recreate result or error that causes
delete+create) or implement a dedicated NeedsRecreate/EnsureResult flag in
ServiceUserRole so the framework will delete and recreate dependents instead of
treating "3" as no-op (refer to ServiceUserRole.ResourceVersion and
server/internal/resource/resource.go for where to hook this logic).
---
Duplicate comments:
In `@server/internal/orchestrator/swarm/mcp_config_resource.go`:
- Around line 257-258: populateCredentials is being invoked from Refresh(),
causing missing/transient ServiceUserRole errors to surface as config drift;
remove any call to MCPConfigResource.populateCredentials (and any
resource.FromContext[*ServiceUserRole] lookup using ServiceUserRoleIdentifier)
from Refresh(), and instead call populateCredentials only from
MCPConfigResource.Create() and MCPConfigResource.Update() paths after validating
config.yaml; ensure Refresh() only checks for existence/contents of config.yaml
and does not touch tokens.yaml or users.yaml.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 15313633-372b-403b-ac28-46163496d4ea
📒 Files selected for processing (8)
server/internal/database/service_instance.go
server/internal/database/service_instance_test.go
server/internal/orchestrator/swarm/mcp_config_resource.go
server/internal/orchestrator/swarm/orchestrator.go
server/internal/orchestrator/swarm/service_instance.go
server/internal/orchestrator/swarm/service_instance_spec.go
server/internal/orchestrator/swarm/service_user_role.go
server/internal/workflows/plan_update.go
🚧 Files skipped from review as they are similar to previous changes (2)
- server/internal/orchestrator/swarm/service_instance_spec.go
- server/internal/orchestrator/swarm/orchestrator.go
fix: resolve concurrent service user creation failures
Summary
- Replace HostExecutor with PrimaryExecutor for dynamic routing to the current primary node

Context
TestProvisionMultiHostMCPService was failing intermittently with:

ERROR: tuple concurrently updated (SQLSTATE XX000)
When a database has multiple MCP service instances (e.g., on host-1, host-2, host-3), each previously got its own ServiceUserRole resource. These executed concurrently on the same Postgres instance, and their GRANT statements conflicted at the catalog level.

The fix changes ServiceUserRole identity from ServiceInstanceID to ServiceID, so the resource system deduplicates them: only one role is created regardless of how many service instances exist. This eliminates concurrent creation entirely rather than retrying through it.

Other changes:

- GenerateServiceUsername now takes only serviceID (no hostID)
- ServiceInstanceResource gains a ServiceSpecID field to avoid conflation with the Docker Swarm service ID set during Refresh
- ResourceVersion bumped to "3" to trigger recreation

Verified with 30 consecutive test runs, 0 failures.
Test plan
- Unit tests pass (make test)
- TestProvisionMultiHostMCPService passes 30/30 runs

PLAT-478