feat(legacy-adapter): prefix-aware output with caller-supplied target_prefix_len#6425
Open
g-talbot wants to merge 2 commits into
Conversation
This was referenced May 13, 2026
93bb993 to
bd66d6c
Compare
0c3ae7c to
bc10992
Compare
bd66d6c to
5c7d823
Compare
bc10992 to
2085c8d
Compare
5c7d823 to
dfe8bb8
Compare
ceed395 to
7b8317b
Compare
dfe8bb8 to
c6e88ba
Compare
7b8317b to
a5dfd72
Compare
c6e88ba to
536389b
Compare
The legacy adapter previously consolidated multi-RG legacy inputs into a single oversized row group and left `rg_partition_prefix_len` at the original's (typically `0`). The streaming merge engine then sent these single-RG/prefix=0 inputs through the new sub-region splitting path — correct, but it forfeits the prefix-aware fast path for outputs derived from legacy inputs and gives up the row-group pruning that prefix alignment enables. After consolidating, the adapter now slices the resulting record batch at first-sort-col transitions (typically `metric_name`) and emits one parquet row group per slice, stamping the re-encoded file with `qh.rg_partition_prefix_len = 1`. The merge engine then reads it through the prefix-aware fast path: one region per metric_name, the existing duplicate-prefix invariant on read validates uniqueness. Fallback: if the original file has no `qh.sort_fields` KV, the sort-fields string fails to parse, the first column can't be resolved in the arrow schema, or the consolidated batch is empty, the adapter reverts to a single-RG re-encode without claiming any prefix alignment. That input still works — the engine's prefix_len=0 sub-region splitting path picks it up. This keeps the adapter robust for files written by very early versions of the indexer that may pre-date the standard KV layout. Implementation: `reencode_prefix_aligned` replaces `reencode_as_single_row_group` and either dispatches to the new multi-RG writer or to the legacy single-RG writer based on whether the first sort col is resolvable. `RowConverter` handles the prefix-value equality check uniformly across dictionary, utf8, and primitive types. The KV injection helper replaces (rather than appends) any existing `qh.rg_partition_prefix_len` so re-runs and files mistakenly carrying a stale value still land at the freshly synthesized prefix. Tests: - `test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg` — 3 metrics × 40 rows, multi-RG input → 3 prefix-aligned output RGs and `qh.rg_partition_prefix_len = 1` KV. - `test_legacy_input_single_metric_yields_one_rg_with_prefix_kv` — one metric → one RG, prefix KV still stamped (vacuously aligned). - `test_legacy_input_without_sort_fields_falls_back_to_single_rg` — fallback path preserved when sort-fields KV is missing. - All existing tests pass unchanged (they use empty KVs or unparseable sort-fields strings, both of which exercise the fallback path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-prefix support `LegacyInputAdapter::try_open` now takes `target_prefix_len: u32` chosen by the caller, matching the merge plan's consensus prefix length. The adapter slices the consolidated batch at every transition of the first N sort columns (composite key, via `RowConverter` over all N fields) and emits one output row group per slice, stamping the output with `qh.rg_partition_prefix_len = target_prefix_len`. With `target_prefix_len = 0` the adapter takes the original single-RG passthrough path with no prefix-alignment claim. A sort column that is named in `qh.sort_fields` but missing from the file's arrow schema is treated as implicitly null at every row per SS-3. A constantly-null column trivially satisfies alignment on that column (null == null) and contributes no transitions, so the split boundaries are driven by the columns that are present. This matches the merge engine's compaction-time treatment of missing columns and keeps a legacy file with an evolved schema usable as a prefix-aligned input. `PrefixUnresolvable` now fires only on cases where the file doesn't advertise enough sort *names* to honor the request: - `qh.sort_fields` absent or unparseable - `qh.sort_fields` declares fewer sort columns than `target_prefix_len` A column missing from the arrow schema no longer counts as unresolvable; the adapter materialises a `NullArray` of the batch's length in that slot and proceeds. Tests: - `test_target_prefix_len_zero_passes_through_as_single_rg` — explicit N=0 fallback, no prefix KV stamped. - `test_target_prefix_len_two_splits_by_metric_and_service` — composite prefix (`metric_name`, `service`) → 4 RGs, KV declares prefix_len=2. - `test_target_prefix_len_one_without_sort_fields_returns_unresolvable` — no `qh.sort_fields` KV → `PrefixUnresolvable`. - `test_target_prefix_len_exceeds_declared_sort_cols_returns_unresolvable` — sort schema declares 2 cols, caller asks 3 → `PrefixUnresolvable`. - `test_missing_prefix_col_treated_as_null_satisfies_alignment` — sort schema declares `metric_name|env|-timestamp_secs` but `env` is absent from the arrow schema → no error, only metric_name transitions split RGs, KV still stamps prefix_len=2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
536389b to
fdfd01f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Slice 2 of 3 from the former PR #6410. Stacked on #6424 (engine), underneath #6426 (hardening).
Generalizes the legacy adapter from "single-RG passthrough" to "produce N-col-prefix-aligned multi-RG output on caller request." The streaming engine in #6424 reads the adapter's output through the prefix-aware fast path.
What's in here
Two commits:
synthesize prefix-aligned row groups— when the legacy file carries aqh.sort_fieldsKV and the caller wants prefix alignment, the adapter consolidates the input, slices the consolidated batch at first-sort-col transitions, and writes one RG per slice. Stampsqh.rg_partition_prefix_len = 1on the output. Single-metric files produce one RG (vacuously aligned) with the KV still set so the file looks identical to a metric-aligned new-format file.parameterize on target_prefix_len with composite-prefix support— caller chooses the target (0,1,2, ...). For0, single-RG passthrough; forN > 0, composite N-col prefix via Arrow'sRowConverterso dictionary / utf8 / primitive types are handled uniformly. NewLegacyAdapterError::PrefixUnresolvable { target, reason }when the file doesn't advertise enough sort-column names. SS-3 producer-side: columns named inqh.sort_fieldsbut absent from the arrow schema are treated as null at every row (NullArray in that slot during slice computation), which trivially satisfies alignment.Test plan
cargo test -p quickwit-parquet-engine --all-features— 483 tests pass at this slice's HEAD.test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg,test_legacy_input_single_metric_yields_one_rg_with_prefix_kv,test_target_prefix_len_zero_passes_through_as_single_rg,test_target_prefix_len_one_without_sort_fields_returns_unresolvable,test_target_prefix_len_exceeds_declared_sort_cols_returns_unresolvable,test_target_prefix_len_two_splits_by_metric_and_service,test_missing_prefix_col_treated_as_null_satisfies_alignment.Out of scope
🤖 Generated with Claude Code