Skip to content

fix(merge): adversarial-review hardening — CS-1, adapter self-verify, chunk-level test verifier#6426

Open
g-talbot wants to merge 2 commits into
gtt/streaming-merge-engine-multi-rg-2-adapterfrom
gtt/streaming-merge-engine-multi-rg-3-hardening
Open

fix(merge): adversarial-review hardening — CS-1, adapter self-verify, chunk-level test verifier#6426
g-talbot wants to merge 2 commits into
gtt/streaming-merge-engine-multi-rg-2-adapterfrom
gtt/streaming-merge-engine-multi-rg-3-hardening

Conversation

@g-talbot
Copy link
Copy Markdown
Contributor

Slice 3 of 3 from the former PR #6410. Stacked on #6425 (adapter) and #6424 (engine). PR #6423 (legacy promotion + schema evolution) stacks on top of this.

Closes six findings from the adversarial review of the merger (.planning/reviews/2026-05-13-adversarial-review-parquet-merger.md):

F1: metastore rg_partition_prefix_len mirrors the writer's KV stamp (CS-1)

merge_parquet_split_metadata used to unconditionally demote to 0 when output.num_row_groups > 1. That contradicted the streaming writer (which stamps the inputs' prefix unchanged on prefix-aligned multi-RG output and verifies via assert_unique_rg_prefix_keys). On-disk KV said 2, metastore said 0 — aligned splits leaked into the unaligned bucket every merge.

Fix: new MergeOutputFile.output_rg_partition_prefix_len field carries the writer's actual stamp. Each writer (legacy merge/writer.rs and streaming merge/streaming/output.rs) populates it with what it actually wrote. merge_parquet_split_metadata propagates that one source of truth. Old test_output_prefix_len_demoted_when_multi_rg is replaced by test_output_prefix_len_carries_writers_value_when_demoted + a new test_output_prefix_len_preserved_on_multi_rg_streaming_engine regression case.

F8: legacy adapter rejects SS-1-violating inputs up front

The adapter walked rows in physical order and emitted one RG per prefix-value run. An unsorted input like [A,A,B,B,A,A] produced a 3-RG file with two RGs sharing prefix A, violating PA-3. The downstream merge engine would catch it later — but only after the bad file had been written.

Fix: compute_prefix_value_slices tracks each slice's composite prefix-value bytes. On duplicate, returns LegacyAdapterError::InputNotSorted { target, first_offset, second_offset }. Test: test_unsorted_legacy_input_rejected_by_adapter.

F12: consumer-side SS-3 (discovered during F10 wiring)

find_prefix_parquet_col_indices had a hard-coded "missing from schema → bail" branch. Combined with the adapter's producer-side SS-3 (which produces files claiming prefix_len=2 while missing one of the named cols from the schema), this meant adapter-produced SS-3 files were unreadable by the streaming engine.

Fix: returns Vec<Option<PrefixColumn>> (analogous to the adapter's resolve_prefix_sort_cols). extract_rg_composite_prefix_key emits a constant null marker (encode_byte_array_prefix(&[])) for None slots — constant across every RG → no ordering signal → ordering is driven entirely by the present columns. The SS-3 verifier in test_missing_prefix_col_treated_as_null_satisfies_alignment is now enabled.

Known limitation: cross-file SS-3 (some inputs have the col, others don't) uses [0x00, 0x00] which sorts BEFORE non-null per SS-2 — single-file SS-3 (adapter's case) is unaffected because every RG contributes the same constant.

F2 / F9 / F11: chunk-level verification in prefix-claiming tests

Tests asserting num_row_groups == N + qh.rg_partition_prefix_len KV stamped would have passed even with off-by-one slice boundaries or scrambled column content. Now six tests call assert_unique_rg_prefix_keys to verify PA-1 (intra-RG min == max) + PA-3 (inter-RG uniqueness) directly from chunk-level statistics:

  • test_streaming_merge_with_prefix_len_two, test_multi_rg_metric_aligned_input_produces_multi_rg_output, test_streaming_merge_with_desc_prefix_col (streaming engine outputs).
  • test_target_prefix_len_two_splits_by_metric_and_service, test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg, test_missing_prefix_col_treated_as_null_satisfies_alignment (adapter outputs).

Also: assert_unique_rg_prefix_keys no longer short-circuits on single-RG files — an unsorted single-RG file CAN have min != max on a prefix column, so the chunk check now runs in every nonzero-RG case.

Test plan

  • cargo test -p quickwit-parquet-engine --all-features — 485 unit tests pass at this slice's HEAD.
  • cargo clippy -p quickwit-parquet-engine --tests --all-features with -Dwarnings.
  • cargo fmt --all -- --check.
  • Tree at slice 3 HEAD equals the old PR-6410 HEAD (no functional changes, just commit topology).

🤖 Generated with Claude Code

g-talbot and others added 2 commits May 13, 2026 11:22
…KV stamp

The streaming merge engine produces sort-prefix-aligned multi-RG output
and stamps `qh.rg_partition_prefix_len = input_meta.rg_partition_prefix_len`
in the file's KV (verified by `assert_unique_rg_prefix_keys` before close).
`merge_parquet_split_metadata` then ran after and unconditionally demoted
to 0 whenever `output.num_row_groups > 1` — breaking CS-1 (metastore must
mirror on-disk KV) for every multi-RG streaming-engine output. Aligned
splits got tagged 0 in the metastore on every merge and leaked out of
the prefix-aligned compaction bucket on the next pass.

Carry the value the writer actually stamped via a new
`MergeOutputFile.output_rg_partition_prefix_len` field, then propagate
it as-is in metadata aggregation. Both engines populate the field:
- Legacy `merge/writer.rs` reports its demoted value (row-count-driven
  RG boundaries can't honor prefix alignment, so it stamps 0 on multi-RG).
- Streaming `merge/streaming/output.rs` reports the inputs' prefix
  unchanged (it splits at prefix transitions and the writer verifies).

CS-1 holds by construction — same source of truth, no re-derivation.

Tests:
- `test_output_prefix_len_demoted_when_multi_rg` → renamed to
  `test_output_prefix_len_carries_writers_value_when_demoted`; now
  asserts that the metastore mirrors the writer's reported value.
- New `test_output_prefix_len_preserved_on_multi_rg_streaming_engine`
  asserts that a multi-RG streaming output (writer reports prefix_len=2)
  keeps the prefix in the metastore — the regression case for F1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onger test verifiers

Three adversarial-review findings on the prefix/RG machinery, bundled
because they touch the same producer/consumer contract:

**F8: Legacy adapter rejects SS-1-violating input upfront.**
The adapter walked rows in physical order and emitted one RG per
prefix-value run. An unsorted legacy input (rows `[A,A,B,B,A,A]`)
produced a 3-RG file where two RGs shared prefix `A`, violating PA-3.
The streaming merge engine would later reject it mid-merge — but only
after a quietly-bad file had been built. Now `compute_prefix_value_slices`
tracks each slice's composite prefix-value bytes and bails with
`LegacyAdapterError::InputNotSorted` on duplicates, surfacing the
SS-1 violation before any file lands on disk.

**F12: Consumer-side SS-3 (cross-layer divergence, discovered while
wiring F2's chunk-level verifier into the SS-3 test).** The adapter
implements SS-3 correctly (missing-from-schema → synthesized NullArray
during slice computation, file stamps `prefix_len = N`). The streaming
engine's reader did not: `find_prefix_parquet_col_indices` hard-required
every named prefix column to be physically present, so a file the
adapter produced from an SS-3 input was unreadable by the merge engine.
Now `find_prefix_parquet_col_indices` returns `Vec<Option<PrefixColumn>>`
and `extract_rg_composite_prefix_key` emits a constant null marker
(`encode_byte_array_prefix(&[])`) for None slots. The column contributes
no cross-RG ordering signal (constant everywhere) so region boundaries
are driven entirely by the present columns. Both halves of SS-3 now
agree end-to-end.

Known limitation: cross-file SS-3 — where some inputs have a sort
column and others don't — uses [0x00, 0x00] for the null contribution,
which sorts BEFORE non-null per the encoded-empty-string convention.
That weakly violates SS-2 (nulls sort last). Single-file SS-3 is
correct because every RG in such a file contributes the same constant.
If cross-file SS-3 becomes a production scenario, the encoding needs
a leading-0xff sentinel instead. Not exercised today.

**F2/F9/F11: Wire `assert_unique_rg_prefix_keys` into prefix-claiming
tests.** Tests asserting `num_row_groups == N` + KV stamped to N would
have passed even with an off-by-one in slice-boundary detection or
column-content scrambling. The verifier reads chunk-level statistics
directly: PA-1 (intra-RG `min == max`) + PA-3 (inter-RG uniqueness)
on the composite key. Wired into six tests:
- streaming engine: `test_streaming_merge_with_prefix_len_two`,
  `test_multi_rg_metric_aligned_input_produces_multi_rg_output`,
  `test_streaming_merge_with_desc_prefix_col`
- legacy adapter: `test_target_prefix_len_two_splits_by_metric_and_service`,
  `test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg`,
  `test_missing_prefix_col_treated_as_null_satisfies_alignment` (now
  passes thanks to F12).

Also: `assert_unique_rg_prefix_keys` no longer short-circuits on
single-RG files — they still go through PA-1 because an unsorted
single-RG file CAN have `min != max` on a prefix column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant