Skip to content

persist: Harden durable-state and binary decoding against malformed input#36985

Open
def- wants to merge 11 commits into
MaterializeInc:mainfrom
def-:fuzz-persist
Open

persist: Harden durable-state and binary decoding against malformed input#36985
def- wants to merge 11 commits into
MaterializeInc:mainfrom
def-:fuzz-persist

Conversation

@def-

@def- def- commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Hardens decode paths in persist, persist-client, persist-types, pgrepr, and postgres-util against malformed/untrusted input: reject rollup-less/empty-frontier/hollow-only/invalid StateDiff state instead of panicking or debug-asserting, guard UUID parsing, fix i16 overflow in pgrepr numeric-scale binary decode, and propagate u16-conversion errors in the postgres table-desc protos.

Found by the cargo-fuzz suite (separate infra PR). Each fix has a regression test.

🤖 Generated with Claude Code

def- and others added 11 commits June 11, 2026 10:57
`PostgresKeyDesc::cols` is `Vec<u16>` but the proto carries it as
`Vec<u32>`; the decode path did `c.try_into().expect("values roundtrip")`,
which panicked on any value above 65535 — reachable from untrusted
proto bytes (the fuzz target found this from many distinct inputs).
Propagate the conversion error via `TryFromProtoError` instead.

Also fixes two build errors found while bringing up new fuzz crates:

- mysql-util fuzz target was importing via `mz_mysql_util::desc::*`,
  but `desc` is a private module; use the top-level re-exports.
- pgwire's `Codec` was a fully-private struct, blocking the
  `fuzz_exports` re-export. Make it `pub(crate)` so the same-crate
  `pub use` in lib.rs works under `#[cfg(feature = "fuzz")]`.
…nDesc

Same shape as the `ProtoPostgresKeyDesc.cols` fix: `col_num` is `u16`
in Rust but `u32` on the wire, and the decode path used
`.expect("u16 must roundtrip")` — reachable from untrusted proto
bytes (the fuzz target found this from many distinct inputs once the
key-desc panic was out of the way).

Also bumps `pgwire::codec::Codec` from `pub(crate)` to `pub` — needed
for the `fuzz_exports` re-export to compile (`pub use` can't widen a
`pub(crate)` item to a fully-public reexport). `mod codec` remains
private, so the type is only reachable via the `fuzz` feature.
`Numeric`'s binary `FromSql` computed `(units - weight - 1) * 4` in i16,
but `units`/`weight` are attacker-controlled i16 header fields from a
client bind parameter. Extreme `weight` values overflow i16 — a panic
under overflow checks (a client-triggerable crash) and a silent
wraparound to a wrong scale otherwise. Compute the scale (and the
`in_scale` comparison) in i32; the values are well within range.

Found by the new value_decode_binary cargo-fuzz target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…panicking

`from_proto_with_type` zipped the proto's `children` against the fields
its declared data type expects using `zip_eq`, which panics when the
counts disagree. `children` is reconstructed from (untrusted-on-disk)
Parquet part bytes, so a corrupted or crafted part panicked the reader.
Check the lengths and return a `TryFromProtoError` on mismatch.

Found by an exploratory cargo-fuzz target over the `ProtoArrayData ->
ArrayData` decode path. (That target is not committed: adversarial-but-
build-valid layouts also panic inside arrow's own ArrayData build/align/
equal routines, which is an upstream concern, so the target can't yet run
green in CI.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-asserting

`StateDiff`'s proto decode `debug_assert_eq!(field_diffs.validate(), Ok(()))`
on field diffs reconstructed from an untrusted consensus blob. Under debug
assertions (and fuzzing) a corrupt/crafted diff panics; in release the
assert is compiled out and the unvalidated diffs flow through. Validate
and return a `TryFromProtoError::InvalidPersistState` on failure so the
behaviour is consistent and a bad blob is a decode error, not a panic.

Found by the new state_diff_proto_roundtrip cargo-fuzz target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tead of asserting

`BatchPart::from_proto`'s inline branch asserted (`assert_eq!`/`assert_none!`,
not `debug_assert` — so it panicked even in release) that the hollow-only
fields (`encoded_size_bytes`, `key_lower`, `key_stats`, `diffs_sum`) were
unset. These are decoded from an untrusted blob (read on every rollup/state
load), so a crafted/corrupted inline `ProtoHollowBatchPart` panicked the
reader. Validate and return a `TryFromProtoError` instead.

Found by the rollup_proto_roundtrip cargo-fuzz target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`Description::from_proto` passed the decoded `lower` antichain straight to
`Description::new`, which asserts a non-empty lower frontier. `lower` comes
from an untrusted blob, so a crafted/corrupted `ProtoHollowBatch` with an
empty lower frontier panicked on rollup/state load. Validate it and return
a `TryFromProtoError` instead.

Found by the rollup_proto_roundtrip cargo-fuzz target (a second, distinct
rollup decode panic after the inline-batch-part fix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lformed input

`parse_id` (persist-client) and `ShardId::from_str` (persist-types) hand the
post-prefix string straight to `Uuid::parse_str`, which panics — rather than
returning an error — while building its parse error for some malformed inputs:
non-ASCII bytes make it slice the input on a non-char boundary. A crafted
`ProtoRollup` reaches these via the ID `RustType` decode, so the panic is
reachable from untrusted durable state. Reject anything that can't be a UUID
(fewer than 32 chars, or non-ASCII) before calling the crate, matching the
guard in `mz_repr::strconv::parse_uuid`. Regression for the
rollup_proto_roundtrip cargo-fuzz finding.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Every persisted state maintains the invariant that it has at least one
rollup (State::latest_rollup .expect()s it). ProtoRollup::into_rust called
latest_rollup while decoding, so an untrusted/corrupt rollup proto whose
state carries no rollups panicked instead of being rejected — a
rollup_proto_roundtrip fuzz crash (and a decode-time availability risk for
a corrupt blob).

Validate the invariant in from_proto and return TryFromProtoError::
InvalidPersistState when it is violated, before latest_rollup is reached
(and before the decoded Rollup is later re-encoded). Adds a regression
test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
My earlier guard rejected any rollup proto whose state had no rollups, but
that was too broad: it broke applier_version_state, which decodes a freshly
initialized state that legitimately has no rollups yet. latest_rollup -- the
actual panic site -- is only reached when the proto carries diffs (to validate
the diff bounds against it), so move the check there. A rollup-less state
without diffs decodes again, while a proto that has diffs but no rollups is
still rejected instead of panicking. The regression test is updated to build
the diffs-but-no-rollups case it now guards.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A rollup's trace is decoded from an untrusted blob, but Trace::unflatten
reconstructed the spine through code whose invariants are enforced by panics,
so a corrupted or crafted blob could kill the decoder instead of failing with
a decode error. The rollup_proto_roundtrip fuzz target hit the first of these
in release qualification: legacy batches that don't tile the timeline
contiguously trip Spine::insert's (always-on) contiguity assert. Probing the
neighboring reconstruction code found the rest of the class, all reachable
from crafted bytes (cargo-fuzz builds enable debug assertions and overflow
checks, so the debug_asserts and arithmetic count too):

* Spine::insert also asserts a non-empty time range per legacy batch.
* An absurd batch len overflows the spine's maintenance arithmetic
  (len.next_power_of_two(), summing lens of merged batches).
* An absurd spine batch level overflows `level + 1` and sizes a vec![]
  allocation; a level past the first batch's indexes out of bounds.
* More descs than parts underflows the padding subtraction.
* A spine batch whose parts don't tile its id range trips SpineBatch::id's
  debug_asserts.
* MergeState::push_batch asserts id/desc chaining within a level, expects at
  most two batches per level, and that no incomplete merge is present.
* Anything else invalid (e.g. a batch since past the trace since) only failed
  Trace::validate, which ran under debug_assert: a panic in fuzz/dev builds,
  silent acceptance of corrupted state in release.

Validate all of this in unflatten (plus a fallible try_push_batch that
push_batch now wraps) and run Trace::validate unconditionally at the end,
returning decode errors. For real blobs nothing changes: every state a
correct writer flattens satisfies these checks (the flatten/unflatten
roundtrip proptest exercises both the structured and legacy paths), and prod
treats a decode error the same as the old panic -- UntypedState::decode
expects rollups to decode and State::apply_diffs panics on apply errors --
just with a diagnosable message instead of an assert deep in spine internals.

Adds decode-rejection regression tests for each vector alongside the
existing rollup proto tests, named after what the crafted proto violates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@def- def- requested a review from DAlperin June 12, 2026 10:40
@def- def- marked this pull request as ready for review June 12, 2026 10:41
@def- def- requested review from a team as code owners June 12, 2026 10:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant