Fixes #28718: prevent lost-update race in PolicyConditionUpdater by harshach · Pull Request #28720 · open-metadata/OpenMetadata

harshach · 2026-06-04T19:50:18Z

Describe your changes:

I reworked PolicyConditionUpdater.updateAllPolicyConditions because concurrent tag/glossary renames touching overlapping policy rows could clobber each other's condition rewrites — each changed policy was written with a blind full-row update and no conflict check (classic lost-update). The blind getDao().update(policy) is replaced with an optimistic content-based compare-and-swap (new EntityDAO.updateIfMatches, which writes only if the stored JSON still equals the snapshot last read) plus a bounded retry that re-reads with a FOR UPDATE lock and re-applies the rewriter. I chose content-CAS over the existing version-CAS so the automated rewrite keeps its silent semantics (no version bump, no version-history entry). I also paginated the policy scan to drop the silent 10000-policy cap.

Type of change:

Bug fix

High-level design:

EntityDAO.updateIfMatches(...) adds AND json = CAST(:expectedJson AS JSON) (MySQL) / AND json = (:expectedJson :: jsonb) (Postgres) to the row update, returning rows-updated (0 = a concurrent writer won → retry). rewriteSinglePolicy re-reads the row via findByIdForUpdate (SELECT json ... FOR UPDATE), re-applies the rewriter, and CAS-writes; on conflict it re-reads and retries up to 5×. The locking read is essential: this runs inside the enclosing rename/delete @Transaction, so under MySQL's default REPEATABLE READ a non-locking re-read returns the transaction's stale snapshot and the retry could never converge (the second rename would be silently dropped) — FOR UPDATE returns the latest committed row and serializes conflicting writers. Version-CAS (updateWithVersion) was rejected because it only detects conflicts when each write bumps the version, which would break the deliberate silent-rewrite behavior and desync version history.

Tests:

Use cases covered

Two tags referenced by the same policy rule are renamed concurrently — both FQN rewrites survive in the policy condition.
Instances with >10000 policies are fully scanned (no silent tail truncation).

Unit tests

Existing PolicyConditionUpdaterTest (41 tests) still green; the pure rewrite helpers are unchanged.

Backend integration tests

Added PolicyResourceIT.test_concurrentTagRenamesBothUpdatePolicyCondition (barrier-aligned concurrent renames, repeated, asserts both rewrites persist). Runs against MySQL in CI.

Ingestion integration tests

Not applicable (no ingestion changes).

Playwright (UI) tests

Not applicable (no UI changes).

Manual testing performed

Reproduced the race on real MySQL 8.3 with two pymysql connections in REPEATABLE READ transactions (the same isolation as production), each holding a snapshot of the policy row:

Connection A CAS-writes rename Create teams page similar to tags #1 and commits.
Connection B's CAS for rename Replace catalog to openmetadata #2 correctly returns 0 rows (conflict detected).
With a plain re-read, B keeps reading its stale snapshot, exhausts all 5 retries, and rename Replace catalog to openmetadata #2 is lost — matchAnyTag(A2, B).
With the FOR UPDATE re-read, B reads the latest committed row, re-applies, and succeeds in 1 attempt — both survive: matchAnyTag(A2, B2).

Same harness, only the re-read strategy differs, confirming both the bug and the fix.

UI screen recording / screenshots:

Not applicable.

Checklist:

I have read the CONTRIBUTING document.
My PR title is Fixes <issue-number>: <short explanation>
My PR is linked to a GitHub issue via Fixes #<issue-number> above.
I have commented on my code, particularly in hard-to-understand areas.
I have added tests (unit / integration / Playwright as applicable) and listed them above.
I have added a test that covers the exact scenario we are fixing (test_concurrentTagRenamesBothUpdatePolicyCondition, see Lost-update race in PolicyConditionUpdater on concurrent tag/glossary renames #28718).

Concurrent tag/glossary renames touching overlapping policy rows could clobber each other's condition rewrites: each changed policy was written with a blind full-row update and no conflict check (classic lost-update). Replace the blind getDao().update(policy) with an optimistic content-based compare-and-swap (new EntityDAO.updateIfMatches) plus a bounded re-read / re-apply retry, so a concurrent writer's change is never lost. Content-CAS is used instead of version-CAS so the automated rewrite keeps its silent semantics (no version bump, no version history). Also paginate the policy scan to drop the silent 10000-policy cap. Validated the CAS SQL on real MySQL 8.3 and Postgres 15; added a concurrent-rename regression test in PolicyResourceIT. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-04T22:23:27Z

🟡 Playwright Results — all passed (9 flaky)

✅ 4271 passed · ❌ 0 failed · 🟡 9 flaky · ⏭️ 88 skipped

Shard	Passed	Flaky	Skipped
✅ Shard 1	301	0	4
✅ Shard 2	804	0	9
🟡 Shard 3	801	4	8
✅ Shard 4	855	0	12
🟡 Shard 5	719	2	47
🟡 Shard 6	791	3	8

🟡 9 flaky test(s) (passed on retry)

Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test upvote and downvote buttons (shard 3, 1 retry)
Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
Flow/AddRoleAndAssignToUser.spec.ts › Verify assigned role to new user (shard 3, 1 retry)
Flow/ExploreAggregationCountsMatching.spec.ts › should verify left panel counts and tab search results for normal search (shard 3, 1 retry)
Pages/Entity.spec.ts › Domain Add, Update and Remove (shard 5, 1 retry)
Pages/ExplorePageRightPanel.spec.ts › Should handle lineage expansion buttons for dashboardDataModel (shard 5, 1 retry)
Pages/InputOutputPorts.spec.ts › Remove single input port (shard 6, 1 retry)
Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for table -> table (shard 6, 1 retry)
Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally

# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

The content-CAS retry re-read used a non-locking SELECT. Inside the enclosing rename/delete transaction, MySQL's default REPEATABLE READ serves the transaction's stale snapshot for non-locking reads, so after a CAS conflict the retry kept re-reading the pre-conflict row and the CAS never converged -- the second concurrent rename was dropped (the exact bug this PR fixes). Postgres READ COMMITTED happened to work. Re-read with EntityDAO.findByIdForUpdate (SELECT ... FOR UPDATE) so the retry sees the latest committed row and conflicting writers serialize. Verified on real MySQL 8.3 in REPEATABLE READ transactions: two concurrent renames both survive with the locking read; the second is lost with a plain read (retry exhausts re-reading the stale snapshot). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gitar-bot · 2026-06-04T23:16:46Z

Code Review ✅ Approved 1 resolved / 1 findings

Prevents lost-update race conditions in PolicyConditionUpdater by implementing a content-based compare-and-swap with bounded retries. The concurrent update logic now correctly handles stale re-reads, ensuring all policy condition rewrites persist.

✅ 1 resolved

✅ Bug: CAS retry re-read is stale under MySQL REPEATABLE READ, fix defeated

📄 openmetadata-service/src/main/java/org/openmetadata/service/security/policyevaluator/PolicyConditionUpdater.java:191-205 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityDAO.java:785-799 📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/PolicyResourceIT.java:856-870
updateAllPolicyConditions runs inside an enclosing DB transaction: for tag renames it is called from TagRepository.TagUpdater.updateNameAndParent, whose entitySpecificUpdate is annotated @Transaction (TagRepository.java:925/1007); role deletes go through bulkHardDeleteSubtree (also @Transaction). The new rewriteSinglePolicy retry loop relies on re-reading the latest committed row via dao.findJsonById(...) (PolicyConditionUpdater.java:199) after a CAS conflict, then re-applying and CAS-writing again.

MySQL's default isolation is REPEATABLE READ, and the deployment does not override it (only useLocalTransactionState=true is set in conf). Under REPEATABLE READ, a non-locking consistent read (plain SELECT, which is what findById/findJsonById issues) returns the snapshot established at the transaction's first read for the whole transaction lifetime. So after a concurrent writer commits J1 and our updateIfMatches ... WHERE json = :expectedJson(J0) returns 0 rows, the retry's findJsonById still returns the stale J0, the rewriter re-produces J0+ourChange, and the CAS keeps matching J0 — which no longer exists — so it returns 0 again. The loop exhausts all 5 attempts, logs the WARN at line 208, and silently drops this writer's rename.

Net effect on MySQL (the primary supported engine): the lost-update is now detected but still not recovered — the second concurrent rename is dropped, exactly the bug this PR claims to fix. The included regression test test_concurrentTagRenamesBothUpdatePolicyCondition will therefore be flaky/failing on MySQL (it asserts BOTH FQNs survive). Postgres works because its default READ COMMITTED re-reads fresh committed data each statement; the author's manual test exercised raw SQL outside the app transaction, masking this.

Fix options: (a) make the retry re-read a locking read — add a findJsonByIdForUpdate issuing SELECT json ... FOR UPDATE, which under REPEATABLE READ reads the latest committed version and serializes conflicting writers; (b) run the CAS read+write at READ COMMITTED for this operation; or (c) keep the version-CAS / EntityUpdater path. Option (a) is the most targeted.

Options

Display: compact → Showing less information.

Comment with these commands to change:

`Compact`
`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

sonarqubecloud · 2026-06-05T00:48:52Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Jun 4, 2026

gitar-bot Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread .../src/main/java/org/openmetadata/service/security/policyevaluator/PolicyConditionUpdater.java

harshach temporarily deployed to test June 4, 2026 20:01 — with GitHub Actions Inactive

harshach temporarily deployed to test June 4, 2026 23:25 — with GitHub Actions Inactive

yan-3005 approved these changes Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes #28718: prevent lost-update race in PolicyConditionUpdater#28720

Fixes #28718: prevent lost-update race in PolicyConditionUpdater#28720
harshach wants to merge 2 commits into
mainfrom
harshach/toronto

harshach commented Jun 4, 2026 •

edited

Loading

Uh oh!

Uh oh!

github-actions Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

gitar-bot Bot commented Jun 4, 2026

Uh oh!

sonarqubecloud Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

harshach commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe your changes:

Type of change:

High-level design:

Tests:

Use cases covered

Unit tests

Backend integration tests

Ingestion integration tests

Playwright (UI) tests

Manual testing performed

UI screen recording / screenshots:

Checklist:

Uh oh!

Uh oh!

github-actions Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🟡 Playwright Results — all passed (9 flaky)

Uh oh!

gitar-bot Bot commented Jun 4, 2026

Uh oh!

sonarqubecloud Bot commented Jun 5, 2026

Quality Gate passed for 'open-metadata-ingestion'

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

harshach commented Jun 4, 2026 •

edited

Loading

github-actions Bot commented Jun 4, 2026 •

edited

Loading