persist: Harden durable-state and binary decoding against malformed input#36985
Open
def- wants to merge 11 commits into
Open
persist: Harden durable-state and binary decoding against malformed input#36985def- wants to merge 11 commits into
def- wants to merge 11 commits into
Conversation
`PostgresKeyDesc::cols` is `Vec<u16>` but the proto carries it as
`Vec<u32>`; the decode path did `c.try_into().expect("values roundtrip")`,
which panicked on any value above 65535 — reachable from untrusted
proto bytes (the fuzz target found this from many distinct inputs).
Propagate the conversion error via `TryFromProtoError` instead.
Also fixes two build errors found while bringing up new fuzz crates:
- mysql-util fuzz target was importing via `mz_mysql_util::desc::*`,
but `desc` is a private module; use the top-level re-exports.
- pgwire's `Codec` was a fully-private struct, blocking the
`fuzz_exports` re-export. Make it `pub(crate)` so the same-crate
`pub use` in lib.rs works under `#[cfg(feature = "fuzz")]`.
…nDesc
Same shape as the `ProtoPostgresKeyDesc.cols` fix: `col_num` is `u16`
in Rust but `u32` on the wire, and the decode path used
`.expect("u16 must roundtrip")` — reachable from untrusted proto
bytes (the fuzz target found this from many distinct inputs once the
key-desc panic was out of the way).
Also bumps `pgwire::codec::Codec` from `pub(crate)` to `pub` — needed
for the `fuzz_exports` re-export to compile (`pub use` can't widen a
`pub(crate)` item to a fully-public reexport). `mod codec` remains
private, so the type is only reachable via the `fuzz` feature.
`Numeric`'s binary `FromSql` computed `(units - weight - 1) * 4` in i16, but `units`/`weight` are attacker-controlled i16 header fields from a client bind parameter. Extreme `weight` values overflow i16 — a panic under overflow checks (a client-triggerable crash) and a silent wraparound to a wrong scale otherwise. Compute the scale (and the `in_scale` comparison) in i32; the values are well within range. Found by the new value_decode_binary cargo-fuzz target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…panicking `from_proto_with_type` zipped the proto's `children` against the fields its declared data type expects using `zip_eq`, which panics when the counts disagree. `children` is reconstructed from (untrusted-on-disk) Parquet part bytes, so a corrupted or crafted part panicked the reader. Check the lengths and return a `TryFromProtoError` on mismatch. Found by an exploratory cargo-fuzz target over the `ProtoArrayData -> ArrayData` decode path. (That target is not committed: adversarial-but- build-valid layouts also panic inside arrow's own ArrayData build/align/ equal routines, which is an upstream concern, so the target can't yet run green in CI.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-asserting `StateDiff`'s proto decode `debug_assert_eq!(field_diffs.validate(), Ok(()))` on field diffs reconstructed from an untrusted consensus blob. Under debug assertions (and fuzzing) a corrupt/crafted diff panics; in release the assert is compiled out and the unvalidated diffs flow through. Validate and return a `TryFromProtoError::InvalidPersistState` on failure so the behaviour is consistent and a bad blob is a decode error, not a panic. Found by the new state_diff_proto_roundtrip cargo-fuzz target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tead of asserting `BatchPart::from_proto`'s inline branch asserted (`assert_eq!`/`assert_none!`, not `debug_assert` — so it panicked even in release) that the hollow-only fields (`encoded_size_bytes`, `key_lower`, `key_stats`, `diffs_sum`) were unset. These are decoded from an untrusted blob (read on every rollup/state load), so a crafted/corrupted inline `ProtoHollowBatchPart` panicked the reader. Validate and return a `TryFromProtoError` instead. Found by the rollup_proto_roundtrip cargo-fuzz target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`Description::from_proto` passed the decoded `lower` antichain straight to `Description::new`, which asserts a non-empty lower frontier. `lower` comes from an untrusted blob, so a crafted/corrupted `ProtoHollowBatch` with an empty lower frontier panicked on rollup/state load. Validate it and return a `TryFromProtoError` instead. Found by the rollup_proto_roundtrip cargo-fuzz target (a second, distinct rollup decode panic after the inline-batch-part fix). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lformed input `parse_id` (persist-client) and `ShardId::from_str` (persist-types) hand the post-prefix string straight to `Uuid::parse_str`, which panics — rather than returning an error — while building its parse error for some malformed inputs: non-ASCII bytes make it slice the input on a non-char boundary. A crafted `ProtoRollup` reaches these via the ID `RustType` decode, so the panic is reachable from untrusted durable state. Reject anything that can't be a UUID (fewer than 32 chars, or non-ASCII) before calling the crate, matching the guard in `mz_repr::strconv::parse_uuid`. Regression for the rollup_proto_roundtrip cargo-fuzz finding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Every persisted state maintains the invariant that it has at least one rollup (State::latest_rollup .expect()s it). ProtoRollup::into_rust called latest_rollup while decoding, so an untrusted/corrupt rollup proto whose state carries no rollups panicked instead of being rejected — a rollup_proto_roundtrip fuzz crash (and a decode-time availability risk for a corrupt blob). Validate the invariant in from_proto and return TryFromProtoError:: InvalidPersistState when it is violated, before latest_rollup is reached (and before the decoded Rollup is later re-encoded). Adds a regression test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
My earlier guard rejected any rollup proto whose state had no rollups, but that was too broad: it broke applier_version_state, which decodes a freshly initialized state that legitimately has no rollups yet. latest_rollup -- the actual panic site -- is only reached when the proto carries diffs (to validate the diff bounds against it), so move the check there. A rollup-less state without diffs decodes again, while a proto that has diffs but no rollups is still rejected instead of panicking. The regression test is updated to build the diffs-but-no-rollups case it now guards. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A rollup's trace is decoded from an untrusted blob, but Trace::unflatten reconstructed the spine through code whose invariants are enforced by panics, so a corrupted or crafted blob could kill the decoder instead of failing with a decode error. The rollup_proto_roundtrip fuzz target hit the first of these in release qualification: legacy batches that don't tile the timeline contiguously trip Spine::insert's (always-on) contiguity assert. Probing the neighboring reconstruction code found the rest of the class, all reachable from crafted bytes (cargo-fuzz builds enable debug assertions and overflow checks, so the debug_asserts and arithmetic count too): * Spine::insert also asserts a non-empty time range per legacy batch. * An absurd batch len overflows the spine's maintenance arithmetic (len.next_power_of_two(), summing lens of merged batches). * An absurd spine batch level overflows `level + 1` and sizes a vec![] allocation; a level past the first batch's indexes out of bounds. * More descs than parts underflows the padding subtraction. * A spine batch whose parts don't tile its id range trips SpineBatch::id's debug_asserts. * MergeState::push_batch asserts id/desc chaining within a level, expects at most two batches per level, and that no incomplete merge is present. * Anything else invalid (e.g. a batch since past the trace since) only failed Trace::validate, which ran under debug_assert: a panic in fuzz/dev builds, silent acceptance of corrupted state in release. Validate all of this in unflatten (plus a fallible try_push_batch that push_batch now wraps) and run Trace::validate unconditionally at the end, returning decode errors. For real blobs nothing changes: every state a correct writer flattens satisfies these checks (the flatten/unflatten roundtrip proptest exercises both the structured and legacy paths), and prod treats a decode error the same as the old panic -- UntypedState::decode expects rollups to decode and State::apply_diffs panics on apply errors -- just with a diagnosable message instead of an assert deep in spine internals. Adds decode-rejection regression tests for each vector alongside the existing rollup proto tests, named after what the crafted proto violates. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hardens decode paths in
persist,persist-client,persist-types,pgrepr, andpostgres-utilagainst malformed/untrusted input: reject rollup-less/empty-frontier/hollow-only/invalid StateDiff state instead of panicking or debug-asserting, guard UUID parsing, fix i16 overflow in pgrepr numeric-scale binary decode, and propagate u16-conversion errors in the postgres table-desc protos.Found by the cargo-fuzz suite (separate infra PR). Each fix has a regression test.
🤖 Generated with Claude Code