feat(txn): surface transaction abort reasons to clients#9747
Open
rahst12 wants to merge 2 commits into
Open
Conversation
6eff47d to
1042bb1
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When Dgraph aborts a transaction, Zero knows why — but the client never finds out.
Zero distinguishes at least three causes internally:
while a move is in flight), and
change, not a real conflict).
All three collapse into
src.Aborted = trueinsideServer.commit()(
dgraph/cmd/zero/oracle.go), and by the time the error reaches the caller it is oneopaque string:
Worse, the
checkPredspath builds specific, human-readable messages for predicatemoves and then throws them away. With no discriminator, a client cannot make the one
decision the category enables: retry immediately (conflict) vs. back off
(predicate is moving).
Why we can't just name the conflicting predicate
For conflicts, you might expect the abort to name the contended predicate/UID. It can't —
not without extra bookkeeping. For every mutated edge, Alpha computes a conflict key in
GetConflictKey()(posting/list.go) as a one-way fingerprint:Only this
uint64is retained (conflicts map[uint64]struct{}inposting/oracle.go);the predicate name and UID are dropped immediately. It is then base-36 encoded into
TxnContext.Keysand shipped to Zero, which detects a conflict purely by matchingfingerprints — it never sees a predicate name. Because the value is a fingerprint XORed
with a UID, it cannot be inverted to recover either input. Naming the predicate would
require remembering what was hashed (a per-transaction reverse map on Alpha), which is
larger scope. So this PR delivers the category, which needs no reverse mapping, and
leaves predicate/UID fidelity to follow-up work.
Fix
Thread a categorized reason out of Zero to the client. No change to conflict detection,
and no proto change — the reason rides on the existing
codes.AbortedgRPC status as"<reason>: <detail>".dgraph/cmd/zero/oracle.go—commit()tags each abort with a category(
conflict,stale-startts,predicate-move); predicate-move now reuses the existingcheckPredsmessages instead of swallowing them.worker/mutation.go—CommitOverNetworkforwards the reason and records theabort metric, instead of flattening it to the reasonless
dgo.ErrAborted.edgraph/server.go— both commit paths (CommitNowand explicitCommitOrAbort)pass the reason through and set
Aborted.Backward compatible: the existing message is preserved and the status code stays
codes.Aborted, so existing client retry logic is unaffected. This phase iscategory-only; naming the contended predicate/UID is planned follow-up.
Future work
This PR is deliberately scoped to the category only — the cheapest, lowest-risk slice
that is immediately useful to clients — because it needs no reverse mapping, no new proto
fields, and no change to conflict detection. All correctness risk stays in what we
report, not what we decide. Richer fidelity is layered, additive work that can land
independently:
Alpha (
fingerprint → {predicate, uid, kind}), populated where the conflict key isalready computed. The aborting transaction's own Alpha still holds this metadata, so
Zero need only echo back which fingerprint matched — no hash inversion. The cost lands
only on the abort path. Target the single-request
CommitNowpath first.gRPC rich-error model (
ErrorInfodetails) so clients get typed fields instead ofparsing a message string. Still no proto change.
TxnContextfield in thedgoproto, giving uniform structured access across alllanguage clients — at the cost of a coordinated cross-repo release.
predicate-movecategory. Today allcheckPredsfailures report aspredicate-move, but some are not moves (e.g. a malformed/internal predicate key, or apredicate not currently served by any group). A later phase can split these into honest
categories (
predicate-unavailable,internal) and revisit retryability — aninternal/malformed abort is not retryable, unlike a move.Checklist
Conventional Commits syntax, leading
with
fix:,feat:,chore:,ci:, etc.