fix(buffers): disk buffers v2 reopen buffer after restart crash loop by blt · Pull Request #25570 · vectordotdev/vector

blt · 2026-06-03T23:45:26Z

Summary

We have observed in Antithesis scenario vector_to_vector_e2e_disk that
the disk buffer can be driven to a state where Vector will crash loop
without recovery. This happens when the reader is set to resume from a
data file that no longer exists and will not be created.

Strictly, the disk buffer reader is meant to chase the writer. This
happens because the deletion of completed data files is not atomic. When
a reader finishes an acks a data file delete_completed_data_file:

unlinks the .dat file
advances the reader id in the ledger
fsyncs the ledger

If Vector crashes after the unlink but before the ledger is fsync'ed
this will cause the next instance of Vector to attempt to read a .dat
that will never appear, crashing said instance. Mostly. OS page cache
complicates this some, as the newly introduced
reopen_recovers_with_reader_resume_data_file_is_missing discusses.

Vector configuration

How did you test this PR?

Change Type

Is this a breaking change?

Yes
No

Does this PR include user facing changes?

Yes. Please add a changelog fragment based on our guidelines.
No. A maintainer will apply the no-changelog label to this PR.

References

Notes

Please read our Vector contributor resources.
Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
Some CI checks run only after we manually approve them.
- We recommend adding a pre-push hook, please see this template.
- Alternatively, we recommend running the following locally before pushing to the remote branch:
  - make fmt
  - make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
  - make test
After a review is requested, please avoid force pushes to help us review incrementally.
- Feel free to push as many commits as you want. They will be squashed into one before merging.
- For example, you can run git merge origin master and git push.
If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

blt · 2026-06-03T23:45:41Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

fix(buffers): disk buffers v2 reopen buffer after restart crash loop #25570 👈 (View in Graphite)
chore(tests): Antithesis Vector end-to-end ack scenario #25571
chore(tests): antithesis harness, durability harness scenario #25562 : 1 other dependent PR (#25525 )
master

This stack of pull requests is managed by Graphite. Learn more about stacking.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 17992e7285

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-04T16:38:38Z

+                                data_file_path = data_file_path.to_string_lossy().as_ref(),
+                                "Reader resume data file is missing; it was fully acknowledged before deletion. Advancing past it."
+                            );
                            self.ledger.increment_acked_reader_file_id();


Repair ledger accounting before skipping missing files

When a crash happens after delete_completed_data_file unlinks a data file but before its mmap changes are durably flushed, the restored ledger can still include that file's bytes and last-read record state even though the file is gone. This branch only advances the reader file id, so the skipped file's size remains in total_buffer_size and its record ids may be treated as gaps; after the remaining files drain, the buffer can still appear non-empty/full and the reader can wait for a next writer file instead of reaching an empty state. The recovery path needs to make the same accounting durable before unlink, or explicitly repair the skipped file's ledger state here.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6dfab2e1a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-04T17:13:36Z

+                                data_file_path = data_file_path.to_string_lossy().as_ref(),
+                                "Reader resume data file is missing; it was fully acknowledged before deletion. Advancing past it."
+                            );
                            self.ledger.increment_acked_reader_file_id();


Do not assume missing reader files were acknowledged

This branch also matches buffers affected by the creation-side durability bug this patch calls out: before this change (and still on non-Unix, where sync_directory is a no-op), a crash after the ledger/file contents were flushed but before the new data file's directory entry was durable can leave reader_file_id < writer_file_id with the reader's data file missing even though none of its records were delivered. Incrementing the acked reader file id here silently skips those unread records (and leaves the ledger's byte/record accounting inconsistent), so the recovery path should distinguish/repair the deleted-after-ack case rather than treating every missing file below the writer as fully acknowledged.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 80aa8c93f1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-04T17:35:29Z

            let data_file_mmap = self
                .ledger
                .filesystem()


Recheck reader position before mmaping writer file

When recovery skips a missing reader file here, ensure_ready_for_read() can advance reader_file_id until it equals writer_file_id and open the writer's current file. Because the outer while condition is not rechecked before this mmap, a crash after the writer rolled to a freshly created but still-empty next data file still reopens into open_mmap_readable() on that empty writer file, which fails instead of falling back to the slow path. Recheck the reader/writer IDs after ensure_ready_for_read() and break once they are equal.

Useful? React with 👍 / 👎.

We have observed in Antithesis scenario `vector_to_vector_e2e_disk` that the disk buffer can be driven to a state where Vector will crash loop without recovery. This happens when the reader is set to resume from a data file that no longer exists and will not be created. Strictly, the disk buffer reader is meant to chase the writer. This happens because the deletion of completed data files is not atomic. When a reader finishes an acks a data file `delete_completed_data_file`: * unlinks the .dat file * advances the reader id in the ledger * fsyncs the ledger If Vector crashes after the unlink but before the ledger is fsync'ed this will cause the next instance of Vector to attempt to read a .dat that will never appear, crashing said instance. Mostly. OS page cache complicates this some, as the newly introduced `reopen_recovers_with_reader_resume_data_file_is_missing` discusses. REF SMPTNG-749

This commit fixes the crash-loop by carefully re-arranging when the ledger is updated and ensuring that directory metadata is also fsynced. This later is important as ledger still updates after the unlink takes place. If a file is gone we can assume that it has been read iff the ledger points to a new dat file, meaning that the data was previously read entirely and readers may skip the 'stale' reference. I have left this skip event as a warn! but considering how likely this is in practice in the event of node termination, for example, it might rightly be a debug. Overflow of dat file indexes remains a concern but 1. I would like to demonstrate this in an antithesis run and 2. that belongs in a separate commit.

This was referenced Jun 3, 2026

chore(tests): antithesis harness, durability harness scenario #25562

Open

chore(unit tests): failing disk_v2 bug demonstrations #25525

Closed

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from 7b0865a to db4960b Compare June 3, 2026 23:46

This comment has been minimized.

Sign in to view

blt force-pushed the blt/antithesis-harness-first-scenario branch from 3614c5a to 5209ae6 Compare June 4, 2026 00:25

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from db4960b to e3cc97c Compare June 4, 2026 00:25

blt mentioned this pull request Jun 4, 2026

chore(tests): Antithesis Vector end-to-end ack scenario #25571

Open

9 tasks

blt changed the base branch from blt/antithesis-harness-first-scenario to graphite-base/25570 June 4, 2026 01:06

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from e3cc97c to f6542c0 Compare June 4, 2026 01:06

blt changed the base branch from graphite-base/25570 to blt/antithesis_vector_end-to-end_ack_scenario June 4, 2026 01:06

blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from 3cce5aa to 84df4d7 Compare June 4, 2026 04:48

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from f6542c0 to 61e1f3e Compare June 4, 2026 04:48

github-advanced-security AI found potential problems Jun 4, 2026

View reviewed changes

Comment thread changelog.d/disk_v2_reopen_recovers_missing_data_file.fix.md Fixed

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch 2 times, most recently from 6cf0d05 to 597b732 Compare June 4, 2026 13:17

blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from 84df4d7 to c066e34 Compare June 4, 2026 13:17

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from 597b732 to b3c5011 Compare June 4, 2026 14:57

blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch 2 times, most recently from 8ff610c to 4811fb8 Compare June 4, 2026 15:05

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch 2 times, most recently from 05d79f8 to fe305fa Compare June 4, 2026 15:20

blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from 4811fb8 to 85cfa4b Compare June 4, 2026 15:20

github-actions Bot added the domain: vdev Anything related to the vdev tooling label Jun 4, 2026

blt changed the title ~~disk buffers v2 reopen buffer after restart crash loop~~ fix: disk buffers v2 reopen buffer after restart crash loop Jun 4, 2026

blt changed the title ~~fix: disk buffers v2 reopen buffer after restart crash loop~~ fix(buffers): disk buffers v2 reopen buffer after restart crash loop Jun 4, 2026

blt marked this pull request as ready for review June 4, 2026 15:28

blt requested a review from a team as a code owner June 4, 2026 15:28

blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from 85cfa4b to d40777e Compare June 4, 2026 16:32

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from fe305fa to 17992e7 Compare June 4, 2026 16:32

chatgpt-codex-connector Bot reviewed Jun 4, 2026

View reviewed changes

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from 17992e7 to 6dfab2e Compare June 4, 2026 17:08

blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from d40777e to 7ecc846 Compare June 4, 2026 17:08

github-actions Bot removed the domain: vdev Anything related to the vdev tooling label Jun 4, 2026

chatgpt-codex-connector Bot reviewed Jun 4, 2026

View reviewed changes

blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from 7ecc846 to b4576c4 Compare June 4, 2026 17:30

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from 6dfab2e to 80aa8c9 Compare June 4, 2026 17:30

chatgpt-codex-connector Bot reviewed Jun 4, 2026

View reviewed changes

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from 80aa8c9 to cf9f640 Compare June 4, 2026 18:28

blt marked this pull request as draft June 4, 2026 18:42

blt added 2 commits June 12, 2026 16:55

blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from b4576c4 to f538c36 Compare June 12, 2026 16:57

blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from cf9f640 to 0e78f49 Compare June 12, 2026 16:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(buffers): disk buffers v2 reopen buffer after restart crash loop#25570

fix(buffers): disk buffers v2 reopen buffer after restart crash loop#25570
blt wants to merge 2 commits into
blt/antithesis_vector_end-to-end_ack_scenariofrom
blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop

blt commented Jun 3, 2026 •

edited

Loading

Uh oh!

blt commented Jun 3, 2026 •

edited

Loading

Uh oh!

This comment has been minimized.

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

blt commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Vector configuration

How did you test this PR?

Change Type

Is this a breaking change?

Does this PR include user facing changes?

References

Notes

Uh oh!

blt commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment has been minimized.

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

blt commented Jun 3, 2026 •

edited

Loading

blt commented Jun 3, 2026 •

edited

Loading