
fix: cap putObjectWithVersion retries to prevent runaway workers #291

Merged
kptdobe merged 1 commit into main from fix/put-object-unbounded-retry on Jun 15, 2026

kptdobe (Contributor) commented Jun 15, 2026

Summary

  • putObjectWithVersion had two unbounded recursive retry paths (new-object 404 path and existing-object update path). Neither passed an attempt counter, so the function could loop indefinitely on sustained 412 conflicts.
  • Each iteration re-invoked writeAuditEntry (which itself retries 6× with jitter), producing 7–75 audit-error log entries per request and wall times of 66 s–939 s before Cloudflare canceled the worker.
  • Root trigger: high-concurrency translation writes to the same .da/translation/active/*.json file from adobecom/da-cc and adobecom/edu — all writers compete for the same audit shard ETag, and no one ever wins the race.
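To make the inner retry loop concrete, here is a minimal sketch of a "retry with jitter" helper in the style described for writeAuditEntry. The real implementation is not shown in this PR, so everything here is an assumption: the names `backoffDelayMs` and `retryWithJitter`, the full-jitter formula, the 50 ms base and 2000 ms cap, and the reading of "retries 6×" as one initial attempt plus six retries.

```javascript
// Hypothetical sketch; not the writeAuditEntry code from this repository.

// Full-jitter exponential backoff: pick a delay uniformly in
// [0, min(capMs, baseMs * 2^attempt)). baseMs and capMs are assumed values.
function backoffDelayMs(attempt, { baseMs = 50, capMs = 2000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Runs `operation` once, then retries it up to `retries` more times,
// sleeping for a jittered delay between attempts. Rethrows the last error
// once the retries are exhausted.
async function retryWithJitter(operation, {
  retries = 6, // assumption: "retries 6x" means 6 retries after the first attempt
  baseMs = 50,
  capMs = 2000,
  random = Math.random,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === retries) break; // no sleep after the final failure
      await sleep(backoffDelayMs(attempt, { baseMs, capMs, random }));
    }
  }
  throw lastError;
}
```

A helper like this is bounded on its own, but when an outer loop with no cap calls it on every iteration, the total number of attempts and log entries grows with the outer loop, which matches the amplification described in the summary.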

Fix: adds MAX_PUT_ATTEMPTS = 5 and threads a putAttempt counter through both retry paths. Returns 412 when the cap is hit instead of looping forever.
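The shape of the fix can be sketched as follows. This is not the code from this PR: `createStore` is a hypothetical in-memory stand-in for a conditional-write object store, `putWithCappedRetries` is an illustrative name, and the 200/201 status codes are assumptions. Only `MAX_PUT_ATTEMPTS = 5`, the `putAttempt` counter, and the 412 result at the cap come from the description above.

```javascript
// Hypothetical sketch; not the putObjectWithVersion code from this PR.
const MAX_PUT_ATTEMPTS = 5; // value taken from the PR description

// In-memory stand-in for an object store with conditional writes (assumption).
// put() answers 412 when the caller's precondition no longer holds.
function createStore() {
  const objects = new Map(); // key -> { body, etag }
  let nextEtag = 1;
  return {
    get(key) {
      return objects.get(key);
    },
    put(key, body, { ifMatch, ifNoneMatch } = {}) {
      const current = objects.get(key);
      // Create path: fail if someone else created the object first.
      if (ifNoneMatch === '*' && current) return { status: 412 };
      // Update path: fail if the object changed since we read its ETag.
      if (ifMatch !== undefined && (!current || current.etag !== ifMatch)) {
        return { status: 412 };
      }
      const etag = String(nextEtag++);
      objects.set(key, { body, etag });
      return { status: current ? 200 : 201, etag };
    },
  };
}

// Both retry paths (create and update) pass putAttempt + 1 on recursion,
// so the recursion depth is bounded by MAX_PUT_ATTEMPTS.
async function putWithCappedRetries(store, key, body, putAttempt = 0) {
  const current = await store.get(key);
  const precondition = current ? { ifMatch: current.etag } : { ifNoneMatch: '*' };
  const res = await store.put(key, body, precondition);
  if (res.status !== 412) return res;
  // Cap reached: surface the conflict to the caller instead of looping.
  if (putAttempt >= MAX_PUT_ATTEMPTS) return { status: 412 };
  return putWithCappedRetries(store, key, body, putAttempt + 1);
}
```

With the counter starting at 0, a request that conflicts every time performs one initial write plus five retries, then returns 412. Returning the conflict lets the client decide whether to retry later rather than holding the worker open.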

Test plan

  • putObjectWithVersion stops retrying new document after MAX_PUT_ATTEMPTS — asserts exactly 6 S3 calls (initial + 5 retries) then returns 412
  • putObjectWithVersion stops retrying existing document after MAX_PUT_ATTEMPTS — same assertion for the update path
  • All 396 existing tests still pass
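The two new test cases can be approximated with a call-counting stub. The sketch below is hypothetical: the real test suite and S3 client are not shown in this PR, so `createAlwaysConflictingClient`, `putCapped`, and `runScenario` are illustrative names, the function under test is a simplified loop-based stand-in, and only PUT calls are counted, which is an assumption about what the real tests count as "S3 calls".

```javascript
// Hypothetical stand-ins; the real tests and S3 client are not shown in this PR.
const MAX_PUT_ATTEMPTS = 5; // value taken from the PR description

// Stub client: HEAD reports either a missing (404) or existing (200) object,
// and every PUT is recorded and answered with a 412 precondition failure.
function createAlwaysConflictingClient(headStatus) {
  const calls = [];
  return {
    calls,
    async head() {
      return { status: headStatus, etag: headStatus === 200 ? 'stale-etag' : undefined };
    },
    async put(request) {
      calls.push(request);
      return { status: 412 };
    },
  };
}

// Simplified function under test: the same cap, written as a loop.
async function putCapped(client, key, body) {
  for (let putAttempt = 0; putAttempt <= MAX_PUT_ATTEMPTS; putAttempt += 1) {
    const head = await client.head(key);
    const condition = head.status === 404 ? { ifNoneMatch: '*' } : { ifMatch: head.etag };
    const res = await client.put({ key, body, ...condition });
    if (res.status !== 412) return res;
  }
  return { status: 412 };
}

// Runs one scenario and reports what a test would assert on.
async function runScenario(headStatus) {
  const client = createAlwaysConflictingClient(headStatus);
  const res = await putCapped(client, 'doc.json', '{}');
  return { status: res.status, putCalls: client.calls.length };
}
```

Calling `runScenario(404)` exercises the new-document path and `runScenario(200)` the existing-document path; under sustained conflicts, each should report a 412 status after six PUT calls.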

Related

  • Observed in production logs (2026-06-15): 264 canceled POST requests with avg wall time 109 s, max 939 s, all in adobecom/da-cc and adobecom/edu translation paths
  • da-nx#468 serialized concurrent versionsource writes; the /source/ POST path has the same contention and is not yet protected by that fix

🤖 Generated with Claude Code

putObjectWithVersion had two unbounded recursive retry paths (404 create
and existing-object update) — neither passed an attempt counter, so a
sustained 412 storm (e.g. high-concurrency translation writes to the
same .da/translation/active/*.json file) caused the function to loop
indefinitely. Each iteration re-invoked writeAuditEntry, which itself
retried 6× with jitter, leading to 7–75 audit-error log entries per
request and wall times of 66s–939s before Cloudflare canceled the worker.

Adds MAX_PUT_ATTEMPTS = 5 and threads a putAttempt counter through both
retry paths. Returns 412 once the cap is hit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov Bot commented Jun 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

kptdobe requested a review from bosschaert June 15, 2026 09:12
kptdobe merged commit fa5dad5 into main Jun 15, 2026
6 checks passed
kptdobe deleted the fix/put-object-unbounded-retry branch June 15, 2026 10:10
adobe-bot pushed a commit that referenced this pull request Jun 15, 2026
## [1.10.1](v1.10.0...v1.10.1) (2026-06-15)

### Bug Fixes

* cap putObjectWithVersion retries to prevent runaway workers ([#291](#291)) ([fa5dad5](fa5dad5)), closes [hi#concurrency](https://github.com/hi/issues/concurrency)

adobe-bot (Collaborator) commented

🎉 This PR is included in version 1.10.1 🎉

