Skip to content

feat(legacy-adapter): prefix-aware output with caller-supplied target_prefix_len#6425

Open
g-talbot wants to merge 2 commits into
gtt/streaming-merge-engine-multi-rg-1-enginefrom
gtt/streaming-merge-engine-multi-rg-2-adapter
Open

feat(legacy-adapter): prefix-aware output with caller-supplied target_prefix_len#6425
g-talbot wants to merge 2 commits into
gtt/streaming-merge-engine-multi-rg-1-enginefrom
gtt/streaming-merge-engine-multi-rg-2-adapter

Conversation

@g-talbot
Copy link
Copy Markdown
Contributor

Slice 2 of 3 from the former PR #6410. Stacked on #6424 (engine), underneath #6426 (hardening).

Generalizes the legacy adapter from "single-RG passthrough" to "produce N-col-prefix-aligned multi-RG output on caller request." The streaming engine in #6424 reads the adapter's output through the prefix-aware fast path.

What's in here

Two commits:

  1. synthesize prefix-aligned row groups — when the legacy file carries a qh.sort_fields KV and the caller wants prefix alignment, the adapter consolidates the input, slices the consolidated batch at first-sort-col transitions, and writes one RG per slice. Stamps qh.rg_partition_prefix_len = 1 on the output. Single-metric files produce one RG (vacuously aligned) with the KV still set so the file looks identical to a metric-aligned new-format file.

  2. parameterize on target_prefix_len with composite-prefix support — caller chooses the target (0, 1, 2, ...). For 0, single-RG passthrough; for N > 0, composite N-col prefix via Arrow's RowConverter so dictionary / utf8 / primitive types are handled uniformly. New LegacyAdapterError::PrefixUnresolvable { target, reason } when the file doesn't advertise enough sort-column names. SS-3 producer-side: columns named in qh.sort_fields but absent from the arrow schema are treated as null at every row (NullArray in that slot during slice computation), which trivially satisfies alignment.

Test plan

  • cargo test -p quickwit-parquet-engine --all-features — 483 tests pass at this slice's HEAD.
  • New adapter tests: test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg, test_legacy_input_single_metric_yields_one_rg_with_prefix_kv, test_target_prefix_len_zero_passes_through_as_single_rg, test_target_prefix_len_one_without_sort_fields_returns_unresolvable, test_target_prefix_len_exceeds_declared_sort_cols_returns_unresolvable, test_target_prefix_len_two_splits_by_metric_and_service, test_missing_prefix_col_treated_as_null_satisfies_alignment.

Out of scope

🤖 Generated with Claude Code

@g-talbot g-talbot requested a review from a team as a code owner May 13, 2026 15:23
@g-talbot g-talbot force-pushed the gtt/streaming-merge-engine-multi-rg-2-adapter branch from 93bb993 to bd66d6c Compare May 14, 2026 13:48
@g-talbot g-talbot force-pushed the gtt/streaming-merge-engine-multi-rg-1-engine branch from 0c3ae7c to bc10992 Compare May 14, 2026 13:48
@g-talbot g-talbot force-pushed the gtt/streaming-merge-engine-multi-rg-2-adapter branch from bd66d6c to 5c7d823 Compare May 14, 2026 14:23
@g-talbot g-talbot force-pushed the gtt/streaming-merge-engine-multi-rg-1-engine branch from bc10992 to 2085c8d Compare May 14, 2026 14:23
@g-talbot g-talbot force-pushed the gtt/streaming-merge-engine-multi-rg-2-adapter branch from 5c7d823 to dfe8bb8 Compare May 14, 2026 14:49
@g-talbot g-talbot force-pushed the gtt/streaming-merge-engine-multi-rg-1-engine branch 2 times, most recently from ceed395 to 7b8317b Compare May 14, 2026 17:16
@g-talbot g-talbot force-pushed the gtt/streaming-merge-engine-multi-rg-2-adapter branch from dfe8bb8 to c6e88ba Compare May 14, 2026 17:16
@g-talbot g-talbot force-pushed the gtt/streaming-merge-engine-multi-rg-1-engine branch from 7b8317b to a5dfd72 Compare May 14, 2026 18:10
@g-talbot g-talbot force-pushed the gtt/streaming-merge-engine-multi-rg-2-adapter branch from c6e88ba to 536389b Compare May 14, 2026 18:10
g-talbot and others added 2 commits May 14, 2026 15:54
The legacy adapter previously consolidated multi-RG legacy inputs
into a single oversized row group and left `rg_partition_prefix_len`
at the original's (typically `0`). The streaming merge engine then
sent these single-RG/prefix=0 inputs through the new sub-region
splitting path — correct, but it forfeits the prefix-aware fast path
for outputs derived from legacy inputs and gives up the row-group
pruning that prefix alignment enables.

After consolidating, the adapter now slices the resulting record
batch at first-sort-col transitions (typically `metric_name`) and
emits one parquet row group per slice, stamping the re-encoded file
with `qh.rg_partition_prefix_len = 1`. The merge engine then reads
it through the prefix-aware fast path: one region per metric_name,
the existing duplicate-prefix invariant on read validates uniqueness.

Fallback: if the original file has no `qh.sort_fields` KV, the
sort-fields string fails to parse, the first column can't be
resolved in the arrow schema, or the consolidated batch is empty,
the adapter reverts to a single-RG re-encode without claiming any
prefix alignment. That input still works — the engine's
prefix_len=0 sub-region splitting path picks it up. This keeps the
adapter robust for files written by very early versions of the
indexer that may pre-date the standard KV layout.

Implementation: `reencode_prefix_aligned` replaces
`reencode_as_single_row_group` and either dispatches to the new
multi-RG writer or to the legacy single-RG writer based on whether
the first sort col is resolvable. `RowConverter` handles the
prefix-value equality check uniformly across dictionary, utf8, and
primitive types. The KV injection helper replaces (rather than
appends) any existing `qh.rg_partition_prefix_len` so re-runs and
files mistakenly carrying a stale value still land at the freshly
synthesized prefix.

Tests:
- `test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg`
  — 3 metrics × 40 rows, multi-RG input → 3 prefix-aligned output
  RGs and `qh.rg_partition_prefix_len = 1` KV.
- `test_legacy_input_single_metric_yields_one_rg_with_prefix_kv` —
  one metric → one RG, prefix KV still stamped (vacuously aligned).
- `test_legacy_input_without_sort_fields_falls_back_to_single_rg` —
  fallback path preserved when sort-fields KV is missing.
- All existing tests pass unchanged (they use empty KVs or
  unparseable sort-fields strings, both of which exercise the
  fallback path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-prefix support

`LegacyInputAdapter::try_open` now takes `target_prefix_len: u32`
chosen by the caller, matching the merge plan's consensus prefix
length. The adapter slices the consolidated batch at every transition
of the first N sort columns (composite key, via `RowConverter` over
all N fields) and emits one output row group per slice, stamping the
output with `qh.rg_partition_prefix_len = target_prefix_len`. With
`target_prefix_len = 0` the adapter takes the original single-RG
passthrough path with no prefix-alignment claim.

A sort column that is named in `qh.sort_fields` but missing from the
file's arrow schema is treated as implicitly null at every row per
SS-3. A constantly-null column trivially satisfies alignment on that
column (null == null) and contributes no transitions, so the split
boundaries are driven by the columns that are present. This matches
the merge engine's compaction-time treatment of missing columns and
keeps a legacy file with an evolved schema usable as a prefix-aligned
input.

`PrefixUnresolvable` now fires only on cases where the file doesn't
advertise enough sort *names* to honor the request:
- `qh.sort_fields` absent or unparseable
- `qh.sort_fields` declares fewer sort columns than `target_prefix_len`

A column missing from the arrow schema no longer counts as
unresolvable; the adapter materialises a `NullArray` of the batch's
length in that slot and proceeds.

Tests:
- `test_target_prefix_len_zero_passes_through_as_single_rg` — explicit
  N=0 fallback, no prefix KV stamped.
- `test_target_prefix_len_two_splits_by_metric_and_service` — composite
  prefix (`metric_name`, `service`) → 4 RGs, KV declares prefix_len=2.
- `test_target_prefix_len_one_without_sort_fields_returns_unresolvable`
  — no `qh.sort_fields` KV → `PrefixUnresolvable`.
- `test_target_prefix_len_exceeds_declared_sort_cols_returns_unresolvable`
  — sort schema declares 2 cols, caller asks 3 → `PrefixUnresolvable`.
- `test_missing_prefix_col_treated_as_null_satisfies_alignment` —
  sort schema declares `metric_name|env|-timestamp_secs` but `env`
  is absent from the arrow schema → no error, only metric_name
  transitions split RGs, KV still stamps prefix_len=2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot force-pushed the gtt/streaming-merge-engine-multi-rg-2-adapter branch from 536389b to fdfd01f Compare May 14, 2026 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant