storage: add diagnostics for invalid upsert diff_sum state#35433
storage: add diagnostics for invalid upsert diff_sum state#35433DAlperin merged 1 commit intoMaterializeInc:mainfrom
Conversation
- Enrich the ensure_decoded panic with the key hash, value byte length, and decoded row shape (column count, byte length) without logging PII. - Warn when a persist feedback batch contains a key with net diff outside [-1, 1], indicating duplicate data in the shard. Includes whether the operator is in rehydration or steady-state to distinguish pre-existing corruption from active bugs. - Add trace-level logging to consolidate_chunk tagging each call as rehydration or steady-state.
PR SummaryLow Risk Overview
During persist feedback ingestion, the continual-feedback operator now warns when a batch contains any key whose consolidated net diff is outside Written by Cursor Bugbot for commit 96fb326. This will update automatically on new commits. Configure here. |
|
Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone. PR title guidelines
Pre-merge checklist
|
| // If diff_sum is odd, value_xor holds the bincode of a | ||
| // single value (even XORs cancel out). Try to decode it | ||
| // so we can log the shape (not contents) for debugging. | ||
| let value_byte_len = usize::try_from(consolidating.len_sum.0 / other).ok(); |
There was a problem hiding this comment.
Division overflow can panic before diagnostic message
Low Severity
The expression consolidating.len_sum.0 / other performs a standard i64 division which panics on overflow when len_sum.0 is i64::MIN and other is -1. Since len_sum is Wrapping<i64>, its inner value can be any i64 — especially in the corrupted state this code path handles. A diff_sum of -1 (one extra retraction) is a plausible error case. The overflow panic would replace the intended diagnostic panic message, defeating the purpose of this PR. Using checked_div would preserve the diagnostics.
martykulma
left a comment
There was a problem hiding this comment.
I assume the goal here is to correlate messages, not necessarily identify the key, is that correct?
Instead of logging the key, could we log a hash of the key?
|
@martykulma the UpsertKey is a sha256 hash, not actually the data |


Remove these sections if your commit already has a good description!
Motivation
Why does this change exist? Link to a GitHub issue, design doc, Slack
thread, or explain the problem in a sentence or two. A reviewer who has
no context should understand why after reading this section.
If this implements or addresses an existing issue, it's enough to link to that:
Closes
Fixes
etc.
Description
What does this PR actually do? Focus on the approach and any non-obvious
decisions. The diff shows the code --- use this space to explain what the
diff can't tell a reviewer.
Verification
How do you know this change is correct? Describe new or existing automated
tests, or manual steps you took.