disagg: adapt GC merge fan-in after S3 read failures #10800
ti-chi-bot[bot] merged 5 commits into pingcap:master from
Conversation
Pantheon AI review complete: 0 issues found.
📝 Walkthrough
This PR introduces adaptive capacity management for DeltaMerge GC merges and bounded retry-exhaustion handling for S3 reads. When S3 errors occur during background GC, the mergeable-segments cap is reduced incrementally; on successful merges, it recovers. Exhausted S3 stream retries now throw ErrorCodes::S3_ERROR.
Changes
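The bounded-retry behavior described above can be sketched as a retry loop that throws a typed error once its budget is exhausted. This is a minimal illustrative sketch: S3Error, readWithBoundedRetries, and the attempt callback are hypothetical stand-ins, not TiFlash's actual S3RandomAccessFile API.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>

// Typed error so callers can distinguish retry exhaustion; the real code
// uses ErrorCodes::S3_ERROR, this class is only a stand-in.
struct S3Error : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// `attempt` performs a single read try and returns std::nullopt on a
// transient failure. Once the bounded retry budget is exhausted, throw
// instead of retrying forever, so the GC thread can catch the error and
// shrink its merge fan-in before rethrowing.
inline size_t readWithBoundedRetries(const std::function<std::optional<size_t>()> & attempt, int max_retries)
{
    for (int i = 0; i <= max_retries; ++i)
        if (auto got = attempt())
            return *got; // success: caller may later recover the GC merge cap
    throw S3Error("S3 read failed after bounded retries");
}
```

Surfacing a typed error at the exhaustion point (rather than returning a sentinel) is what lets the GC path react specifically to S3 failures, as the sequence diagram below shows.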
Sequence Diagram(s)
sequenceDiagram
participant GC as DeltaMerge GC Thread
participant Merge as gcTrySegmentMerge()
participant S3 as S3RandomAccessFile
participant Store as DeltaMergeStore
GC->>Merge: onSyncGc() - attempt merge
Merge->>S3: read()/seek() with bounded retries
alt S3 succeeds
S3-->>Merge: data/offset
Merge->>Store: recoverGcMergeableSegmentsCap()
Store->>Store: cap += recover_step (up to default)
Merge-->>GC: success
else S3 retry exhausted
S3->>S3: throwRetryExhaustedError()
S3-->>Merge: Exception(S3_ERROR)
Merge->>GC: Exception propagates to onSyncGc
GC->>Merge: catch S3_ERROR
GC->>Store: reduceGcMergeableSegmentsCap()
Store->>Store: cap = cap/2 (down to min)
GC->>GC: rethrow, retry later with lower cap
end
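The cap adjustments in the diagram (halve on S3_ERROR, step back up on success) amount to multiplicative decrease with additive recovery. A minimal sketch follows; the method names mirror the diagram, but the default, minimum, and recovery-step constants are assumptions for illustration, not TiFlash's actual values.

```cpp
#include <algorithm>

// Adaptive GC merge fan-in cap, as in the sequence diagram above.
// The three constants are illustrative assumptions.
struct GcMergeableSegmentsCap
{
    static constexpr int default_cap = 16; // assumed default fan-in
    static constexpr int min_cap = 2;      // assumed lower bound
    static constexpr int recover_step = 1; // assumed additive recovery step

    int cap = default_cap;

    // On S3_ERROR from gcTrySegmentMerge: multiplicative decrease,
    // never below the minimum.
    void reduceGcMergeableSegmentsCap() { cap = std::max(cap / 2, min_cap); }

    // After a successful merge: additive recovery back toward the default.
    void recoverGcMergeableSegmentsCap() { cap = std::min(cap + recover_step, default_cap); }
};
```

This halve-fast/recover-slowly shape reacts quickly to S3 trouble while probing back up cautiously, similar in spirit to AIMD congestion control.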
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
/cherry-pick release-nextgen-20251011
@JaySon-Huang: once the present PR merges, I will cherry-pick it on top of release-nextgen-20251011 in the new PR and assign it to you.
Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
/cherry-pick release-nextgen-202603
@JaySon-Huang: once the present PR merges, I will cherry-pick it on top of release-nextgen-202603 in the new PR and assign it to you.
Details: In response to this:
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: CalvinNeo, JinheLin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@JaySon-Huang: new pull request created to branch
Details: In response to this:
@JaySon-Huang: new pull request created to branch
Details: In response to this:
ref #10794, close #10798
33e66a4 disagg: tighten S3 forward seek reopen threshold
bd070ca disagg: classify retry-exhausted S3 reads
7cbe114 disagg: adapt GC merge fan-in after S3 errors
085cb15 disagg: clarify S3 retry exhaustion flow
1f87ade disagg: narrow GC cap adjustment triggers
Co-authored-by: JaySon-Huang <tshent@qq.com>
What problem does this PR solve?
Issue Number: close #10798, ref #10794
Problem Summary:
Background GC merges can still fan in too many segments after bounded S3 stream retries are exhausted. That keeps a single GC merge exposed to long remote-read windows, while medium forward seeks still prefer draining the existing stream instead of reopening from the target offset.
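The seek trade-off mentioned above (drain the open stream for short forward gaps, reopen from the target offset for long ones) can be sketched as a simple threshold decision. Both decideForwardSeek and the 1 MiB threshold are hypothetical, chosen only to illustrate the policy; tightening the threshold makes medium forward seeks reopen instead of draining.

```cpp
#include <cstdint>

// For a forward seek on an already-open S3 stream: draining (reading and
// discarding) a small gap avoids the latency of issuing a new ranged GET,
// while a large gap is cheaper to skip by reopening at the target offset.
enum class SeekAction { DrainExistingStream, ReopenAtTargetOffset };

constexpr int64_t reopen_threshold_bytes = 1 << 20; // assumed: 1 MiB

inline SeekAction decideForwardSeek(int64_t current_offset, int64_t target_offset)
{
    const int64_t gap = target_offset - current_offset;
    // Gaps up to the threshold are drained in place; larger gaps reopen.
    return gap <= reopen_threshold_bytes ? SeekAction::DrainExistingStream
                                         : SeekAction::ReopenAtTargetOffset;
}
```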
What is changed and how it works?
S3RandomAccessFile::read and S3RandomAccessFile::seek throw ErrorCodes::S3_ERROR after the bounded stream retry budget is exhausted, with doc comments for the public APIs and concise comments around the retry-exhaustion branch.
Add gc_mergeable_segments_cap so background GC merge fan-in is capped and can be reduced after S3_ERROR.
Reduce the cap when gcTrySegmentMerge fails with S3_ERROR, and recover the cap only after checkSegmentUpdate succeeds.
Check List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
New Features
Chores