disagg: bound stream retries and reopen large forward seeks#10795
Conversation
|
Review Complete Findings: 0 issues ℹ️ Learn more details on Pantheon AI. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughReworks S3RandomAccessFile read/seek retry logic to bound retries and reopen at target offsets for large forward seeks; adds a seek failpoint, new ProfileEvent, MockS3Client GET observation hooks, and tests exercising retry/seek semantics. Changes
Sequence Diagram(s)sequenceDiagram
participant Reader as File consumer
participant S3File as S3RandomAccessFile
participant S3 as S3/HTTP backend
participant Mock as MockS3Client (tests)
Reader->>S3File: read()/seek(offset)
S3File->>S3: GET (Range=cur_offset or target)
S3-->>S3File: HTTP 200/206 stream
alt stream error during ignore/read
S3File->>S3File: shouldRetryStreamError? (increment cur_retry)
alt retry within budget
S3File->>S3File: reopenAt(target_offset) / initialize(action)
S3File->>S3: new GET at reopened offset
else retry exhausted
S3File-->>Reader: fail (EOF/error)
end
else forward seek > threshold
S3File->>S3File: reopenAt(target_offset)
S3File->>S3: GET (Range=target_offset-)
end
S3File->>ProfileEvents: increment S3IOSeekReopen / S3ReadBytes as appropriate
Note over Mock,S3: Tests observe GetObject calls and ranges via MockS3Client hooks
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dbms/src/Storages/S3/S3RandomAccessFile.cpp`:
- Around line 379-385: S3RandomAccessFile::reopenAt currently assigns cur_offset
before calling initialize(action), which can leave the object in an inconsistent
state if initialize() throws; change the logic to either call initialize(action)
first and only set cur_offset on success, or wrap initialize(action) in a
try/catch and restore the previous cur_offset before rethrowing; also ensure
cur_retry reset semantics remain correct (reset after a successful initialize or
restore on failure) so that members cur_offset, cur_retry and initialize(action)
are updated atomically.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: ce087909-16c3-4f22-9852-bf43aa66b3de
📒 Files selected for processing (3)
dbms/src/Common/ProfileEvents.cppdbms/src/Storages/S3/S3RandomAccessFile.cppdbms/src/Storages/S3/S3RandomAccessFile.h
|
/hold |
|
/cherry-pick release-nextgen-20251011 |
|
@JaySon-Huang: once the present PR merges, I will cherry-pick it on top of release-nextgen-20251011 in the new PR and assign it to you. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
|
/cherry-pick release-nextgen-202603 |
|
@JaySon-Huang: once the present PR merges, I will cherry-pick it on top of release-nextgen-202603 in the new PR and assign it to you. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
|
/unhold |
[LGTM Timeline notifier]Timeline:
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: CalvinNeo, JinheLin, Lloyd-Pottiger The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@JaySon-Huang: new pull request created to branch DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
|
@JaySon-Huang: new pull request created to branch DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
What problem does this PR solve?
Issue Number: close #10794
Problem Summary:
S3RandomAccessFilecould retry forever after a retryable stream-sideread()/ignore()failure. In the reported OSS case, repeated206range responses could keep ending with stream errors, and the unbounded outer retry loop could stall DeltaMerge GC. Forward seek was also still draining the current HTTP body withignore(), which is fragile for large skips.What is changed and how it works?
The change does two things:
read()/seek()tomax_retry = 3total attempts, while keepingcur_retryscoped to per-initializeGetObjectfailures onlyreopenAt(...)so backward seek and stream retry recovery reopen from a committed offset with a fresh initialize budget128 KiBthreshold: small skips still drain the current stream, while large skips reopen from the target offset directlyMockS3Clientto observeGetObjectcount and last range so the threshold split can be asserted in unit testsCheck List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
New Features
Bug Fixes
Observability
Tests
Chores