storage: add diagnostics for invalid upsert diff_sum state #35433

Merged
DAlperin merged 1 commit into MaterializeInc:main from DAlperin:dov/log-upsert-data
Mar 12, 2026

@DAlperin
Member

  • Enrich the ensure_decoded panic with the key hash, value byte length, and decoded row shape (column count, byte length) without logging PII.
  • Warn when a persist feedback batch contains a key with net diff outside [-1, 1], indicating duplicate data in the shard. Includes whether the operator is in rehydration or steady-state to distinguish pre-existing corruption from active bugs.
  • Add trace-level logging to consolidate_chunk tagging each call as rehydration or steady-state.
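The net-diff check in the second bullet can be pictured with a minimal sketch. It assumes a batch is a list of (key hash, diff) pairs; `KeyHash`, `keys_with_invalid_net_diff`, and the log line format are hypothetical names for illustration and are not taken from the PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a hashed key; the real `UpsertKey` type is not shown here.
type KeyHash = u64;

/// Sums the diffs per key and returns every key whose net diff falls
/// outside [-1, 1], sorted by key for stable output.
fn keys_with_invalid_net_diff(batch: &[(KeyHash, i64)]) -> Vec<(KeyHash, i64)> {
    let mut net: HashMap<KeyHash, i64> = HashMap::new();
    for &(key, diff) in batch {
        *net.entry(key).or_insert(0) += diff;
    }
    let mut bad: Vec<(KeyHash, i64)> = net
        .into_iter()
        .filter(|&(_, diff)| !(-1..=1).contains(&diff))
        .collect();
    bad.sort();
    bad
}

fn main() {
    // Key 2 is inserted twice and key 4 is retracted twice; keys 1 and 3 are fine.
    let batch = [(1, 1), (2, 1), (2, 1), (3, 1), (3, -1), (4, -1), (4, -1)];
    // Assumed flag: whether the operator is still rehydrating.
    let rehydrating = true;
    let phase = if rehydrating { "rehydration" } else { "steady-state" };
    for (key, diff) in keys_with_invalid_net_diff(&batch) {
        eprintln!("invalid net diff: key={key} net_diff={diff} phase={phase}");
    }
}
```

Tagging each warning with the phase is what lets a reader tell duplicates that were already in the shard from ones produced while the operator was running.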

@DAlperin DAlperin requested a review from a team as a code owner March 11, 2026 15:43

cursor bot commented Mar 11, 2026

PR Summary

Low Risk
Primarily adds logging and richer panic diagnostics; behavior changes are limited to additional work on error paths and extra warn/trace output in hot loops.

Overview
Adds targeted diagnostics around corrupted UPSERT consolidation state.

StateValue::ensure_decoded now accepts an optional UpsertKey and, when diff_sum is not 0/1, attempts a best-effort decode to include non-PII value shape (row byte length/column count), inferred value byte length, and the key in the panic message. Both classic and continual-feedback upsert paths pass the key through.

During persist feedback ingestion, the continual-feedback operator now warns when a batch contains any key whose consolidated net diff is outside [-1, 1], and consolidate_chunk emits trace logs tagging calls as rehydration vs steady-state for easier correlation.

Written by Cursor Bugbot for commit 96fb326.
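To see why a single value can be recovered only when diff_sum is odd, here is a simplified sketch of an XOR-based consolidating accumulator. The field names `value_xor`, `len_sum`, and `diff_sum` appear in this discussion, but the struct layout and the `merge` method are assumptions made for illustration and do not reproduce the actual implementation.

```rust
use std::num::Wrapping;

/// Hypothetical, simplified accumulator for one key.
#[derive(Default)]
struct Consolidating {
    value_xor: Vec<u8>,
    len_sum: Wrapping<i64>,
    diff_sum: Wrapping<i64>,
}

impl Consolidating {
    /// Folds one (value, diff) update into the accumulator.
    fn merge(&mut self, value: &[u8], diff: i64) {
        self.diff_sum += Wrapping(diff);
        self.len_sum += Wrapping(value.len() as i64) * Wrapping(diff);
        // XOR only for odd diffs: an even number of copies cancels itself out.
        if diff % 2 != 0 {
            if self.value_xor.len() < value.len() {
                self.value_xor.resize(value.len(), 0);
            }
            for (acc, byte) in self.value_xor.iter_mut().zip(value) {
                *acc ^= *byte;
            }
        }
    }
}

fn main() {
    // Healthy case: insert, retract, insert a new value. One value survives.
    let mut ok = Consolidating::default();
    ok.merge(b"apple", 1);
    ok.merge(b"apple", -1);
    ok.merge(b"pear", 1);
    let len = ok.len_sum.0 as usize;
    println!("diff_sum={} value={:?}", ok.diff_sum, &ok.value_xor[..len]);

    // Corrupted case: the same value inserted twice. The XOR cancels to
    // zeros, so there is nothing to decode even though diff_sum is 2.
    let mut dup = Consolidating::default();
    dup.merge(b"pear", 1);
    dup.merge(b"pear", 1);
    println!("diff_sum={} value_xor={:?}", dup.diff_sum, dup.value_xor);
}
```

In this sketch the duplicate case leaves only the length and diff sums to report, which matches the summary's point that the decode is best-effort.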

@github-actions

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.

// If diff_sum is odd, value_xor holds the bincode of a
// single value (even XORs cancel out). Try to decode it
// so we can log the shape (not contents) for debugging.
let value_byte_len = usize::try_from(consolidating.len_sum.0 / other).ok();

Division overflow can panic before diagnostic message

Low Severity

The expression consolidating.len_sum.0 / other performs a standard i64 division which panics on overflow when len_sum.0 is i64::MIN and other is -1. Since len_sum is Wrapping<i64>, its inner value can be any i64 — especially in the corrupted state this code path handles. A diff_sum of -1 (one extra retraction) is a plausible error case. The overflow panic would replace the intended diagnostic panic message, defeating the purpose of this PR. Using checked_div would preserve the diagnostics.
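A minimal sketch of the suggested checked_div approach follows. The helper name `value_byte_len` and its signature are hypothetical; only `i64::checked_div` and `usize::try_from` are standard library calls.

```rust
/// Hypothetical helper: derives a per-value byte length from the summed
/// length and the diff sum. Returns None instead of panicking when the
/// division overflows (i64::MIN / -1), the divisor is zero, or the
/// quotient is negative.
fn value_byte_len(len_sum: i64, diff_sum: i64) -> Option<usize> {
    len_sum
        .checked_div(diff_sum)
        .and_then(|len| usize::try_from(len).ok())
}

fn main() {
    // Three copies of a 4-byte value.
    assert_eq!(value_byte_len(12, 3), Some(4));
    // The overflow case from the review comment: no panic, just None.
    assert_eq!(value_byte_len(i64::MIN, -1), None);
    // A zero divisor is handled the same way.
    assert_eq!(value_byte_len(10, 0), None);
    // A negative quotient cannot be a byte length.
    assert_eq!(value_byte_len(-5, 1), None);
    println!("all cases handled without panicking");
}
```

Because the result is already an Option, the surrounding diagnostic can print it as-is and the intended panic message is preserved.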

Contributor

@martykulma martykulma left a comment

I assume the goal here is to correlate messages, not necessarily identify the key, is that correct?

Instead of logging the key, could we log a hash of the key?

@DAlperin
Member Author

@martykulma the UpsertKey is already a SHA-256 hash, not the actual key data

@DAlperin DAlperin merged commit e1e5d20 into MaterializeInc:main Mar 12, 2026
128 checks passed