Skip to content

fix(purger): preserve partial-batch progress in error log on mid-loop failure#55

Open
endrju19 wants to merge 1 commit into
mainfrom
batchloggin
Open

fix(purger): preserve partial-batch progress in error log on mid-loop failure#55
endrju19 wants to merge 1 commit into
mainfrom
batchloggin

Conversation

@endrju19
Copy link
Copy Markdown
Collaborator

Summary

OutboxPurger.tick() purges DELIVERED entries in a do { ... } while loop where each iteration runs an independent transaction. When iteration N throws (transient DB error, lock timeout, etc.), iterations 0..N-1 committed in their own transactions and their deletes are durable — but the previous error log emitted only the bare exception with no count, so an operator paged on "Outbox purge failed" had no way to tell whether 0 or 9000 entries were purged before the failure without inspecting the database directly.

Why this matters

  • Operations: on-call engineer sees "Outbox purge failed after 7 batches (700 entries purged this tick), will retry at next scheduled interval" instead of just "Outbox purge failed, ...". Tells them at a glance whether the failure was early (real outage) or late (transient hiccup after most of the work was done).
  • Throughput observability: the next tick at config.interval later will re-issue the purge from a fresh cutoff. If N-1 batches succeeded, the next tick has N-1 fewer rows to scan. Logging the count makes this visible without DB inspection.

What changed

  • OutboxPurger.kt — hoist totalDeleted and batches out of the try block so the catch handler can include them as logger arguments. No behavioural change beyond the log message format — the broad catch is correctly placed at the scheduler-task boundary (must stay broad to prevent ScheduledExecutorService from cancelling future ticks).

  • okapi-core/build.gradle.kts — swap testRuntimeOnly(libs.slf4jSimple) for testImplementation(libs.logbackClassic). Logback is now the sole SLF4J binding on the test classpath (no multi-binding warning) and is required by the new ListAppender capture in the test. Production code stays free of any logging backend.

  • OutboxPurgerTest.kt — new test "error log preserves partial batch progress on mid-loop failure". Asserts on the typed argument array (Int batches, Int totalDeleted) attached to the captured ILoggingEvent, not on the formatted message text — so future log-wording tweaks won't break the regression check.

How this was identified

Multi-agent review of #54 (pr-review-toolkit:silent-failure-hunter) flagged this as MEDIUM. #54's body explicitly defers the logging improvement to a separate follow-up; this PR is that follow-up.

Out of scope

  • The analogous OutboxScheduler.tick() catch — no batch loop, no partial-progress information to lose.
  • Restructuring the broad catch (correctly placed at the scheduler task boundary).
  • Telemetry / metrics — Micrometer counters for the purger live in okapi-micrometer; this PR is logging-only.

Compatibility with #54

Branches off main (where transactionRunner is still nullable with elvis fallback). When #54 lands, this commit rebases trivially — the hoisted-counter pattern is structurally identical regardless of whether the runner is nullable.

Test plan

  • ./gradlew :okapi-core:test — green
  • ./gradlew ktlintCheck — green
  • ./gradlew build — full build green (all tests, all modules)
  • Manual check: only the purger's error message string changed; the scheduler's similar message untouched

… failure

OutboxPurger.tick() purges in a do-while loop where each iteration runs an
independent transaction. When iteration N throws, batches 0..N-1 are already
durably committed -- but the previous error log emitted only the bare exception
with no count, leaving an operator paged on "Outbox purge failed" no way to
tell whether 0 or 9000 entries were purged before the failure without querying
the database directly.

Hoist totalDeleted and batches out of the try block so the catch handler can
include them in the error log. No behavioural change beyond the log message
format -- the catch is correctly placed at the scheduler task boundary and
must stay broad to prevent ScheduledExecutorService from cancelling future
ticks.

Test infrastructure: swap testRuntimeOnly(slf4j-simple) for
testImplementation(logback-classic) in okapi-core so the new test can capture
ILoggingEvent via ListAppender. Logback remains the sole SLF4J binding on the
test classpath -- no multi-binding warning -- and production stays free of any
logging backend.

The new test asserts on the logger's typed argument array (Int batches,
Int totalDeleted), not on the formatted message text, so future log-wording
tweaks won't break the regression check.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant