Skip to content

fix(tables): serialize schema mutations to prevent parallel column clobber#4812

Merged
TheodoreSpeaks merged 6 commits into
stagingfrom
fix/mothership-add-multiple-wf-column
May 30, 2026
Merged

fix(tables): serialize schema mutations to prevent parallel column clobber#4812
TheodoreSpeaks merged 6 commits into
stagingfrom
fix/mothership-add-multiple-wf-column

Conversation

@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator

Summary

  • Every table schema mutator in lib/table/service.ts (add/rename/delete column, change type/constraints, add/update/delete workflow group + outputs) did an unguarded read-modify-write of the schema JSONB — read outside any lock, compute, blind-write the whole schema. Parallel calls (e.g. Mothership fanning out three add_workflow_group_output calls for separate columns) all read the same baseline and last-writer-wins, so columns silently vanished while every call reported success.
  • Added a withLockedTable helper that takes a transaction-scoped advisory lock keyed on tableId, then re-reads the table inside the lock so each serialized writer computes against the prior writer's committed columns. Wired all 11 mutators through it.
  • Used an advisory lock (not SELECT ... FOR UPDATE on the definition row) so it adds no edges to the row-lock graph — a FOR UPDATE would invert lock order against the row-count trigger and risk deadlocks on the row-write path. Bumped idle_in_transaction_session_timeout for the two paths that load a workflow inside the lock.
  • Scope: serializes all schema/structure changes per-table across the grid UI, public v1 API, and Mothership tool. Row data writes and other tables are unaffected.

Type of Change

  • Bug fix

Testing

  • Tested manually
  • bun run lint clean, bun run check:api-validation:strict passed
  • 169 table service + user_table tool tests pass

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link
Copy Markdown

vercel Bot commented May 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped May 30, 2026 7:24pm

Request Review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 30, 2026

PR Summary

Medium Risk
Serializes all per-table schema/structure mutations and touches row JSONB on rename/delete; incorrect locking or timeout tuning could cause waits or failed large-table operations, but the change directly addresses data-loss races in a core persistence path.

Overview
Fixes lost columns when multiple schema changes hit the same user table at once (e.g. parallel Mothership add_workflow_group_output calls). Table mutators no longer do unguarded read–modify–write on the schema JSONB.

Introduces withLockedTable: a per-table transaction advisory lock, then a fresh read inside that transaction so each writer builds on the previous commit. Column add/rename/delete, type/constraint updates, and workflow-group CRUD all run through it; writes use the same trx. getTableById accepts an optional transaction. Heavy workflow loading for updateWorkflowGroup / addWorkflowGroupOutput is moved outside the lock where possible; remap leaf types are applied only if workflowId is unchanged under the lock.

Also allows metadata.description to be null on workflow state wire payloads, and scales idle-in-transaction timeout for type-change paths that scan rows inside the lock.

Reviewed by Cursor Bugbot for commit dd6b077. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/lib/table/service.ts
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 30, 2026

Greptile Summary

This PR fixes a silent column-clobber race in lib/table/service.ts where all 11 schema mutators (add/rename/delete column, type/constraint changes, workflow group operations) performed unguarded read-modify-write of the schema JSONB column — allowing last-writer-wins semantics that made columns silently disappear when Mothership fanned out concurrent calls. The fix introduces a withLockedTable helper that acquires a per-table PostgreSQL advisory lock, re-reads the table inside the lock, and hands the fresh snapshot to each mutator callback.

  • All 11 schema mutators are now serialized through withLockedTable; workflow loads (addWorkflowGroupOutput, updateWorkflowGroup) are moved to a pre-lock phase-1 pass to keep the critical section fast, addressing lock-hold concerns from the previous review.
  • updateColumnConstraints now sets idleMs alongside statementMs (matching updateColumnType), preventing the 5 s idle_in_transaction_session_timeout from aborting valid constraint changes on large tables.
  • workflowStateSchema.metadata.description is widened to .nullable().optional() to match the database reality of null descriptions.

Confidence Score: 5/5

Safe to merge — the advisory-lock fix is architecturally sound, the critical section is kept tight, and all previously raised concerns (workflow load inside lock, idle timeout for updateColumnConstraints) are addressed.

All 11 schema mutators are correctly serialized through the new advisory lock. The two pre-lock phase-1 patterns include proper staleness guards under the lock. The updateColumnConstraints idle-timeout gap from the prior review is fixed. The only observations are two pre-existing unguarded schema writers (updateTableMetadata, pruneStaleWorkflowGroupOutputs) that fall outside the PR's stated scope, plus a one-word doc comment correction.

apps/sim/lib/table/service.ts — the two pre-existing unguarded schema writers (updateTableMetadata lines 713–766, pruneStaleWorkflowGroupOutputs lines 796–854) are the most natural follow-up candidates for serialization coverage.

Important Files Changed

Filename Overview
apps/sim/lib/table/service.ts Core of the fix: withLockedTable helper added, all 11 schema mutators wired through it. Pre-lock workflow loads correctly moved outside the advisory lock critical section. updateColumnConstraints idle timeout addressed. Two pre-existing unguarded schema writers remain outside the lock.
apps/sim/lib/api/contracts/workflows.ts Schema broadening: description in workflowStateSchema.metadata now accepts null in addition to `string

Sequence Diagram

sequenceDiagram
    participant C1 as Caller 1
    participant C2 as Caller 2
    participant WLT as withLockedTable
    participant PG as PostgreSQL

    C1->>WLT: addWorkflowGroupOutput (phase 1)
    C2->>WLT: addWorkflowGroupOutput (phase 1)
    Note over C1,C2: Both load workflow outside lock (parallel OK)
    C1->>PG: loadWorkflowFromNormalizedTables
    C2->>PG: loadWorkflowFromNormalizedTables
    PG-->>C1: workflow graph
    PG-->>C2: workflow graph

    C1->>PG: BEGIN + setTableTxTimeouts + pg_advisory_xact_lock
    Note over C1,PG: C1 acquires lock
    C2->>PG: BEGIN + setTableTxTimeouts + pg_advisory_xact_lock
    Note over C2,PG: C2 waits (lock_timeout=3s)

    C1->>PG: SELECT table (fresh read under lock)
    PG-->>C1: current schema
    C1->>PG: UPDATE schema (new column added)
    C1->>PG: COMMIT → lock released
    Note over C1: Lock released

    Note over C2: C2 now acquires lock
    C2->>PG: SELECT table (sees C1 committed column)
    PG-->>C2: fresh schema with C1 column
    C2->>PG: UPDATE schema (adds its column on top)
    C2->>PG: COMMIT
Loading

Reviews (5): Last reviewed commit: "fix(tables): scale idle timeout in updat..." | Re-trigger Greptile

Comment thread apps/sim/lib/table/service.ts Outdated
Comment thread apps/sim/lib/table/service.ts
Comment thread apps/sim/lib/table/service.ts Outdated
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

Addressed in d05348d:

P1 — advisory lock held during workflow load (addWorkflowGroupOutput + updateWorkflowGroup) — Moved the loadWorkflowFromNormalizedTables call out of the locked critical section. Both functions now resolve the workflow output / remap leaf-types in a phase 1 (no lock) pre-read, then take the advisory lock only for the fast in-memory splice + schema UPDATE. addWorkflowGroupOutput captures workflowId from the pre-read group and re-validates inside the lock that the group still maps to the same workflow (throws a clear conflict if it was remapped concurrently). updateWorkflowGroup's remap type-resolution is best-effort out-of-lock and applied against the fresh schema under the lock. Dropped the 30s idle_in_transaction_session_timeout bump since no I/O happens inside the lock anymore — this resolves the lock-timeout-under-fan-out concern.

P2 — getTableById tx type — Switched the tx param from the file-local DbTransaction to the shared DbOrTx union, matching the rest of the service.

Typecheck, lint, check:api-validation:strict, and the 169 table/tool tests all pass.

@greptile review

Comment thread apps/sim/lib/table/service.ts
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

Addressed in 9cbcfcc:

Idle timeout aborts type change (updateColumnType) — valid catch. The per-row compatibility check iterates in Node between the row SELECT and the schema UPDATE, leaving the transaction idle; on a large table that gap could exceed the default 5s idle_in_transaction_session_timeout and abort a valid type change (a regression from wrapping the op in a transaction). Now scales both statement_timeout and idle_in_transaction_session_timeout to row count (max(60s, rows·2ms), capped at 10min), so the Node-side check has the same row-proportional budget as the SQL work.

Typecheck, lint, check:api-validation:strict, and 169 table/tool tests pass.

@greptile review

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 9cbcfcc. Configure here.

Comment thread apps/sim/lib/table/service.ts
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

Addressed in dd6b077:

Stale remap types after workflow change (updateWorkflowGroup) — good consistency catch vs addWorkflowGroupOutput. Phase 1 now records the workflowId it resolved the remap leaf-types against; phase 2 only applies those types if the group still points at that same workflow under the lock. If a concurrent writer changed workflowId in between, the resolved types are stale, so we leave column types unchanged (best-effort, same as a resolution failure) and log it, rather than stamping types from the old workflow.

Typecheck, lint, check:api-validation:strict, and 169 table/tool tests pass.

@greptile review

Comment thread apps/sim/lib/table/service.ts Outdated
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

Addressed in 704944b:

updateColumnConstraints idle timeout — applied the same idleMs scaling as updateColumnType for consistency. The idle gap here is smaller (the required/unique checks are server-side queries, not a per-row Node loop), but matching the pattern keeps every in-lock validation path uniformly safe against the 5s idle_in_transaction_session_timeout on large tables.

Typecheck, lint, check:api-validation:strict, and 169 table/tool tests pass.

@greptile review

@TheodoreSpeaks TheodoreSpeaks merged commit 329bf48 into staging May 30, 2026
14 checks passed
@TheodoreSpeaks TheodoreSpeaks deleted the fix/mothership-add-multiple-wf-column branch May 30, 2026 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant