Skip to content

fix: confirm durable writes before advancing indexer state#784

Open
damilolaedwards wants to merge 1 commit into
ethpandaops:masterfrom
damilolaedwards:fix/durable-write-before-state-advance
Open

fix: confirm durable writes before advancing indexer state#784
damilolaedwards wants to merge 1 commit into
ethpandaops:masterfrom
damilolaedwards:fix/durable-write-before-state-advance

Conversation

@damilolaedwards

Copy link
Copy Markdown
Contributor

Summary

Three indexer paths mutated durable-facing state (a cursor, or the in-memory copy of data) before the corresponding database or blockdb write was confirmed, and only logged the write error. On a transient failure the data was then lost with no retry. Each is fixed by confirming the write first and propagating the error.

Transaction matcher drops a match on a persist error

runTransactionMatcher advanced MatchHeight unconditionally and discarded the persistMatches return value, returning only the state-persist result. On sqlite a failed statement does not abort the transaction, so the tx committed with the match missing but the height advanced; on postgres the tx rolled back but the in-memory height did not. Either way the failed range was never revisited and its request rows kept an empty tx hash. The matches and the advanced height are now persisted together in one transaction, and the in-memory height is rolled back if that transaction fails, so the range is retried.

Bid cache loses the flushed window on a db error

checkAndFlush and flushAll removed bids from the cache (and advanced minSlot) before calling InsertBids; on error they logged and returned with the bids already gone, and they cannot be re-added. The write now happens first, and the cached copies are dropped only after it succeeds.

Finalization can leave a block body unrecoverable on a blockdb write error

finalizeEpoch deletes the unfinalized_blocks rows in the committed finalization transaction, then writes the block bodies to the blockdb with only-logged errors, then advances lastFinalizedEpoch and frees the bodies from the cache regardless of the write result. If the blockdb write failed, the body then existed nowhere. On a blockdb write failure the epoch now returns without advancing, which routes it through the same resync path already used for a failed finalization transaction, leaving the cached bodies in place for the resync to rewrite.

Tests

  • Matcher: a failed persist rolls the height back and stores nothing; a successful one advances and stores it.
  • Bid cache: bids are kept when the db write fails and dropped only after it succeeds.

Both fail without their fix. The finalization change has no unit test (that path needs a full indexer) and reuses the existing resync behavior. go build, go vet and gofmt are clean.

Three indexer paths advanced a cursor or dropped their in-memory copy of data
before confirming the durable write, and only logged the write error, so a
transient failure lost the data with no retry.

- The system-contract transaction matcher discarded the persistMatches error and
  advanced MatchHeight regardless, so a failed range was skipped and its request
  rows kept an empty tx hash forever. Persist the matches and the height together
  and roll the height back on error.
- The bid cache removed bids from the cache before writing them, losing the
  flushed window on a transient db error. Write first, drop the cached copies
  only after the write succeeds.
- Epoch finalization advanced lastFinalizedEpoch and freed the block bodies even
  when the blockdb write failed, after the unfinalized_blocks rows had already
  been deleted, leaving the bodies unrecoverable. Trigger a resync of the epoch
  on a blockdb write failure instead of advancing.

Add tests for the matcher height rollback and the bid-cache write ordering.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant