Skip to content

Fixes #28718: prevent lost-update race in PolicyConditionUpdater#28720

Open
harshach wants to merge 2 commits into
mainfrom
harshach/toronto
Open

Fixes #28718: prevent lost-update race in PolicyConditionUpdater#28720
harshach wants to merge 2 commits into
mainfrom
harshach/toronto

Conversation

@harshach
Copy link
Copy Markdown
Collaborator

@harshach harshach commented Jun 4, 2026

Describe your changes:

Fixes #28718

I reworked PolicyConditionUpdater.updateAllPolicyConditions because concurrent tag/glossary renames touching overlapping policy rows could clobber each other's condition rewrites — each changed policy was written with a blind full-row update and no conflict check (classic lost-update). The blind getDao().update(policy) is replaced with an optimistic content-based compare-and-swap (new EntityDAO.updateIfMatches, which writes only if the stored JSON still equals the snapshot last read) plus a bounded retry that re-reads with a FOR UPDATE lock and re-applies the rewriter. I chose content-CAS over the existing version-CAS so the automated rewrite keeps its silent semantics (no version bump, no version-history entry). I also paginated the policy scan to drop the silent 10000-policy cap.

Type of change:

  • Bug fix

High-level design:

EntityDAO.updateIfMatches(...) adds AND json = CAST(:expectedJson AS JSON) (MySQL) / AND json = (:expectedJson :: jsonb) (Postgres) to the row update, returning rows-updated (0 = a concurrent writer won → retry). rewriteSinglePolicy re-reads the row via findByIdForUpdate (SELECT json ... FOR UPDATE), re-applies the rewriter, and CAS-writes; on conflict it re-reads and retries up to 5×. The locking read is essential: this runs inside the enclosing rename/delete @Transaction, so under MySQL's default REPEATABLE READ a non-locking re-read returns the transaction's stale snapshot and the retry could never converge (the second rename would be silently dropped) — FOR UPDATE returns the latest committed row and serializes conflicting writers. Version-CAS (updateWithVersion) was rejected because it only detects conflicts when each write bumps the version, which would break the deliberate silent-rewrite behavior and desync version history.

Tests:

Use cases covered

  • Two tags referenced by the same policy rule are renamed concurrently — both FQN rewrites survive in the policy condition.
  • Instances with >10000 policies are fully scanned (no silent tail truncation).

Unit tests

  • Existing PolicyConditionUpdaterTest (41 tests) still green; the pure rewrite helpers are unchanged.

Backend integration tests

  • Added PolicyResourceIT.test_concurrentTagRenamesBothUpdatePolicyCondition (barrier-aligned concurrent renames, repeated, asserts both rewrites persist). Runs against MySQL in CI.

Ingestion integration tests

  • Not applicable (no ingestion changes).

Playwright (UI) tests

  • Not applicable (no UI changes).

Manual testing performed

Reproduced the race on real MySQL 8.3 with two pymysql connections in REPEATABLE READ transactions (the same isolation as production), each holding a snapshot of the policy row:

  1. Connection A CAS-writes rename Create teams page similar to tags #1 and commits.
  2. Connection B's CAS for rename Replace catalog to openmetadata #2 correctly returns 0 rows (conflict detected).
  3. With a plain re-read, B keeps reading its stale snapshot, exhausts all 5 retries, and rename Replace catalog to openmetadata #2 is lostmatchAnyTag(A2, B).
  4. With the FOR UPDATE re-read, B reads the latest committed row, re-applies, and succeeds in 1 attempt — both survive: matchAnyTag(A2, B2).

Same harness, only the re-read strategy differs, confirming both the bug and the fix.

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • My PR is linked to a GitHub issue via Fixes #<issue-number> above.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added tests (unit / integration / Playwright as applicable) and listed them above.
  • I have added a test that covers the exact scenario we are fixing (test_concurrentTagRenamesBothUpdatePolicyCondition, see Lost-update race in PolicyConditionUpdater on concurrent tag/glossary renames #28718).

Concurrent tag/glossary renames touching overlapping policy rows could
clobber each other's condition rewrites: each changed policy was written
with a blind full-row update and no conflict check (classic lost-update).

Replace the blind getDao().update(policy) with an optimistic
content-based compare-and-swap (new EntityDAO.updateIfMatches) plus a
bounded re-read / re-apply retry, so a concurrent writer's change is
never lost. Content-CAS is used instead of version-CAS so the automated
rewrite keeps its silent semantics (no version bump, no version history).
Also paginate the policy scan to drop the silent 10000-policy cap.

Validated the CAS SQL on real MySQL 8.3 and Postgres 15; added a
concurrent-rename regression test in PolicyResourceIT.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Jun 4, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

🟡 Playwright Results — all passed (9 flaky)

✅ 4271 passed · ❌ 0 failed · 🟡 9 flaky · ⏭️ 88 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 301 0 0 4
✅ Shard 2 804 0 0 9
🟡 Shard 3 801 0 4 8
✅ Shard 4 855 0 0 12
🟡 Shard 5 719 0 2 47
🟡 Shard 6 791 0 3 8
🟡 9 flaky test(s) (passed on retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test upvote and downvote buttons (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/AddRoleAndAssignToUser.spec.ts › Verify assigned role to new user (shard 3, 1 retry)
  • Flow/ExploreAggregationCountsMatching.spec.ts › should verify left panel counts and tab search results for normal search (shard 3, 1 retry)
  • Pages/Entity.spec.ts › Domain Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should handle lineage expansion buttons for dashboardDataModel (shard 5, 1 retry)
  • Pages/InputOutputPorts.spec.ts › Remove single input port (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for table -> table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

The content-CAS retry re-read used a non-locking SELECT. Inside the
enclosing rename/delete transaction, MySQL's default REPEATABLE READ
serves the transaction's stale snapshot for non-locking reads, so after
a CAS conflict the retry kept re-reading the pre-conflict row and the
CAS never converged -- the second concurrent rename was dropped (the
exact bug this PR fixes). Postgres READ COMMITTED happened to work.

Re-read with EntityDAO.findByIdForUpdate (SELECT ... FOR UPDATE) so the
retry sees the latest committed row and conflicting writers serialize.

Verified on real MySQL 8.3 in REPEATABLE READ transactions: two
concurrent renames both survive with the locking read; the second is
lost with a plain read (retry exhausts re-reading the stale snapshot).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Jun 4, 2026

Code Review ✅ Approved 1 resolved / 1 findings

Prevents lost-update race conditions in PolicyConditionUpdater by implementing a content-based compare-and-swap with bounded retries. The concurrent update logic now correctly handles stale re-reads, ensuring all policy condition rewrites persist.

✅ 1 resolved
Bug: CAS retry re-read is stale under MySQL REPEATABLE READ, fix defeated

📄 openmetadata-service/src/main/java/org/openmetadata/service/security/policyevaluator/PolicyConditionUpdater.java:191-205 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityDAO.java:785-799 📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/PolicyResourceIT.java:856-870
updateAllPolicyConditions runs inside an enclosing DB transaction: for tag renames it is called from TagRepository.TagUpdater.updateNameAndParent, whose entitySpecificUpdate is annotated @Transaction (TagRepository.java:925/1007); role deletes go through bulkHardDeleteSubtree (also @Transaction). The new rewriteSinglePolicy retry loop relies on re-reading the latest committed row via dao.findJsonById(...) (PolicyConditionUpdater.java:199) after a CAS conflict, then re-applying and CAS-writing again.

MySQL's default isolation is REPEATABLE READ, and the deployment does not override it (only useLocalTransactionState=true is set in conf). Under REPEATABLE READ, a non-locking consistent read (plain SELECT, which is what findById/findJsonById issues) returns the snapshot established at the transaction's first read for the whole transaction lifetime. So after a concurrent writer commits J1 and our updateIfMatches ... WHERE json = :expectedJson(J0) returns 0 rows, the retry's findJsonById still returns the stale J0, the rewriter re-produces J0+ourChange, and the CAS keeps matching J0 — which no longer exists — so it returns 0 again. The loop exhausts all 5 attempts, logs the WARN at line 208, and silently drops this writer's rename.

Net effect on MySQL (the primary supported engine): the lost-update is now detected but still not recovered — the second concurrent rename is dropped, exactly the bug this PR claims to fix. The included regression test test_concurrentTagRenamesBothUpdatePolicyCondition will therefore be flaky/failing on MySQL (it asserts BOTH FQNs survive). Postgres works because its default READ COMMITTED re-reads fresh committed data each statement; the author's manual test exercised raw SQL outside the app transaction, masking this.

Fix options: (a) make the retry re-read a locking read — add a findJsonByIdForUpdate issuing SELECT json ... FOR UPDATE, which under REPEATABLE READ reads the latest committed version and serializes conflicting writers; (b) run the CAS read+write at READ COMMITTED for this operation; or (c) keep the version-CAS / EntityUpdater path. Option (a) is the most targeted.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 5, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Lost-update race in PolicyConditionUpdater on concurrent tag/glossary renames

2 participants