feat(txn): surface transaction abort reasons to clients #9747

Open
rahst12 wants to merge 2 commits into dgraph-io:main from rahst12:txn-abort-reason-surface-phase-1
rahst12 commented Jun 17, 2026

Problem

When Dgraph aborts a transaction, Zero knows why — but the client never finds out.
Zero distinguishes at least three causes internally:

  • a write-write conflict,
  • a predicate move (the tablet relocated mid-transaction, or commits are blocked
    while a move is in flight), and
  • a stale start-ts (the transaction predates the current Zero leader — i.e. a leader
    change, not a real conflict).

All three collapse into src.Aborted = true inside Server.commit()
(dgraph/cmd/zero/oracle.go), and by the time the error reaches the caller it is one
opaque string:

Transaction has been aborted. Please retry

Worse, the checkPreds path builds specific, human-readable messages for predicate
moves and then throws them away. With no discriminator, a client cannot make the one
decision the category enables: retry immediately (conflict) vs. back off
(predicate is moving).

Why we can't just name the conflicting predicate

For conflicts, you might expect the abort to name the contended predicate/UID. It can't —
not without extra bookkeeping. For every mutated edge, Alpha computes a conflict key in
GetConflictKey() (posting/list.go) as a one-way fingerprint:

farm.Fingerprint64(key) ^ uid

Only this uint64 is retained (conflicts map[uint64]struct{} in posting/oracle.go);
the predicate name and UID are dropped immediately. It is then base-36 encoded into
TxnContext.Keys and shipped to Zero, which detects a conflict purely by matching
fingerprints — it never sees a predicate name. Because the value is a fingerprint XORed
with a UID, it cannot be inverted to recover either input. Naming the predicate would
require remembering what was hashed (a per-transaction reverse map on Alpha), which is
larger scope. So this PR delivers the category, which needs no reverse mapping, and
leaves predicate/UID fidelity to follow-up work.
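
A minimal sketch of why the conflict key cannot be reversed, assuming a stand-in hash: `fingerprint64` below uses FNV-1a from the standard library in place of `farm.Fingerprint64`, and `conflictKey`, `encodeKey`, and `collidingUID` are hypothetical names for illustration, not functions from the Dgraph code base. For any other key, some UID yields the same uint64, so the value alone does not identify a predicate.

```go
package main

import (
	"hash/fnv"
	"strconv"
)

// fingerprint64 is a stand-in for farm.Fingerprint64. FNV-1a is used only so
// the sketch runs with the standard library; it is NOT the hash Dgraph uses.
func fingerprint64(key []byte) uint64 {
	h := fnv.New64a()
	h.Write(key)
	return h.Sum64()
}

// conflictKey has the shape described above: fingerprint(key) XOR uid.
func conflictKey(key []byte, uid uint64) uint64 {
	return fingerprint64(key) ^ uid
}

// encodeKey renders a conflict key in base 36, the textual form the
// description says is placed in TxnContext.Keys.
func encodeKey(ck uint64) string {
	return strconv.FormatUint(ck, 36)
}

// collidingUID returns the UID that, paired with otherKey, produces the same
// conflict key as (key, uid). Such a UID exists for every otherKey, which is
// why the uint64 alone cannot tell you which predicate was hashed.
func collidingUID(key []byte, uid uint64, otherKey []byte) uint64 {
	return conflictKey(key, uid) ^ fingerprint64(otherKey)
}
```

For example, `conflictKey([]byte("email"), collidingUID([]byte("name"), 42, []byte("email")))` equals `conflictKey([]byte("name"), 42)`, even though the two inputs share neither predicate nor UID.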

Fix

Thread a categorized reason out of Zero to the client. No change to conflict detection,
and no proto change — the reason rides on the existing codes.Aborted gRPC status as
"<reason>: <detail>".

  • dgraph/cmd/zero/oracle.go: commit() tags each abort with a category
    (conflict, stale-startts, predicate-move); predicate-move now reuses the existing
    checkPreds messages instead of swallowing them.
  • worker/mutation.go: CommitOverNetwork forwards the reason and records the
    abort metric, instead of flattening it to the reasonless dgo.ErrAborted.
  • edgraph/server.go: both commit paths (CommitNow and explicit CommitOrAbort)
    pass the reason through and set Aborted.

Backward compatible: the existing message is preserved and the status code stays
codes.Aborted, so existing client retry logic is unaffected. This phase is
category-only; naming the contended predicate/UID is planned follow-up.
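
To illustrate what a client could do with the category, here is a minimal sketch that splits a "<reason>: <detail>" message and picks a retry delay. It assumes the message text has already been pulled from the gRPC status (in Go, for instance, via `status.Convert(err).Message()` from `google.golang.org/grpc/status`). `splitAbortMessage` and `retryDelay` are hypothetical helpers, the backoff constants are arbitrary, and treating stale-startts like a conflict is an assumption rather than something this PR specifies.

```go
package main

import (
	"strings"
	"time"
)

// Category names as listed in the PR description.
const (
	reasonConflict      = "conflict"
	reasonStaleStartTs  = "stale-startts"
	reasonPredicateMove = "predicate-move"
)

// splitAbortMessage splits "<reason>: <detail>". ok is false when the prefix
// is not a known category, e.g. a message from a server without this change.
func splitAbortMessage(msg string) (reason, detail string, ok bool) {
	prefix, rest, found := strings.Cut(msg, ": ")
	if !found {
		return "", msg, false
	}
	switch prefix {
	case reasonConflict, reasonStaleStartTs, reasonPredicateMove:
		return prefix, rest, true
	}
	return "", msg, false
}

// retryDelay is a hypothetical policy: back off exponentially while a
// predicate is moving, retry immediately otherwise. Handling stale-startts
// and unknown messages as "retry now" is an assumption of this sketch.
func retryDelay(msg string, attempt int) time.Duration {
	reason, _, ok := splitAbortMessage(msg)
	if !ok || reason != reasonPredicateMove {
		return 0
	}
	// Clamp the exponent so the delay stays between 100ms and 3.2s.
	if attempt < 0 {
		attempt = 0
	}
	if attempt > 5 {
		attempt = 5
	}
	base := 100 * time.Millisecond
	return base << uint(attempt)
}
```

A message with no recognized prefix, such as the legacy "Transaction has been aborted. Please retry", falls through to the immediate-retry path, which mirrors how existing clients behave today.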

Future work

This PR is deliberately scoped to the category only — the cheapest, lowest-risk slice
that is immediately useful to clients — because it needs no reverse mapping, no new proto
fields, and no change to conflict detection. All correctness risk stays in what we
report, not what we decide. Richer fidelity is layered, additive work that can land
independently:

  • Name the contended predicate(s). Retain a small per-transaction reverse map on
    Alpha (fingerprint → {predicate, uid, kind}), populated where the conflict key is
    already computed. The aborting transaction's own Alpha still holds this metadata, so
    Zero need only echo back which fingerprint matched — no hash inversion. The cost lands
    only on the abort path. Target the single-request CommitNow path first.
  • Structured, machine-readable detail. Carry category + predicate + uid/token via the
    gRPC rich-error model (ErrorInfo details) so clients get typed fields instead of
    parsing a message string. Still no proto change.
  • First-class API field. Eventually promote the reason/conflict detail to a real
    TxnContext field in the dgo proto, giving uniform structured access across all
    language clients — at the cost of a coordinated cross-repo release.
  • Refine the predicate-move category. Today all checkPreds failures report as
    predicate-move, but some are not moves (e.g. a malformed/internal predicate key, or a
    predicate not currently served by any group). A later phase can split these into honest
    categories (predicate-unavailable, internal) and revisit retryability — an
    internal/malformed abort is not retryable, unlike a move.

Checklist

  • The PR title follows the
    Conventional Commits syntax, leading
    with fix:, feat:, chore:, ci:, etc.
  • Code compiles correctly and linting (via trunk) passes locally
  • Tests added for new functionality, or regression tests for bug fixes added as applicable
  • Refer to dgraph4j, pydgraph, and dgraph-docs as corresponding updates are made.

matthewmcneely force-pushed the txn-abort-reason-surface-phase-1 branch from 6eff47d to 1042bb1 on June 18, 2026 17:33