chunker: range-tighten clip + opt-in row-group splitter (no recall move; correctness only) #11
Merged
arnav2 merged 1 commit intoMay 20, 2026
Two chunker-side changes; neither moves retrieval recall. Explicit
calibration on the full 912 SpreadsheetBench v0.1 instances with both
BGE-small and text-embedding-3-large shows 0 instance flips for either
change. Shipping for correctness, not recall.
A. _tight_content_bbox + clip in _block_to_chunk
Walks `block.cell_range`, finds the bbox of cells that have a
`raw_value` set or a non-empty `display_value`, and clips the chunk's
claimed `(top_left_cell, bottom_right_cell)` to that bbox. Fixes the
over-claim pathology where a sheet with styled-empty cells across XFD
width produces a chunk claiming `A1:XFD4` despite the actual data
sitting in a 5×3 corner. The renderer still iterates the original
range, so the narrowed claim is always a superset of the cells that
contributed to render_text, and the invariant is preserved.
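As a rough illustration of the clip described above, here is a minimal sketch. It is not the code in this PR: the `Cell` dataclass, the sparse `(row, col)` dictionary, and the helper names are hypothetical stand-ins, and treating whitespace-only `display_value` as empty is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical stand-in for the real cell model.
    raw_value: object = None
    display_value: str = ""


def col_letters(col: int) -> str:
    """Convert a 1-based column index to A1 letters (1 -> A, 16384 -> XFD)."""
    letters = ""
    while col > 0:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def to_a1(bbox) -> str:
    """Format an inclusive (min_row, min_col, max_row, max_col) bbox as A1."""
    r0, c0, r1, c1 = bbox
    return f"{col_letters(c0)}{r0}:{col_letters(c1)}{r1}"


def tight_content_bbox(cells, block_range):
    """Return the bbox of content-bearing cells inside block_range, or None.

    cells: sparse dict mapping (row, col) -> Cell.
    block_range: inclusive (min_row, min_col, max_row, max_col).
    """
    r0, c0, r1, c1 = block_range
    hits = [
        (r, c)
        for (r, c), cell in cells.items()
        if r0 <= r <= r1 and c0 <= c <= c1
        # Assumption: whitespace-only display text does not count as content.
        and (cell.raw_value is not None or cell.display_value.strip())
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))


# Data in a 5-column by 3-row corner, plus styled-empty cells out to XFD.
cells = {(r, c): Cell(raw_value=r * c) for r in range(1, 4) for c in range(1, 6)}
cells[(4, 16384)] = Cell()      # styled but empty
cells[(2, 9000)] = Cell(display_value="   ")  # whitespace only

over_claimed = (1, 1, 4, 16384)
clipped = tight_content_bbox(cells, over_claimed)
print(to_a1(over_claimed), "->", to_a1(clipped))  # A1:XFD4 -> A1:E3
```

The sketch scans a sparse dictionary rather than walking every coordinate of the range, which is one way to avoid visiting all 16,384 columns; whether the real walk does this is not stated here.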
Bench (full 912 / text-embedding-3-large):
recall_text@5: 0.750 → 0.750 (no change)
recall_geometric@5: 0.482 → 0.482 (no change)
recall_text@5_in_scope: 0.838 → 0.838 (no change)
mean parse_ms: 156 → 174 (+18 ms; bbox walk)
net instance flips: 0 miss→hit, 0 hit→miss
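Flat aggregate recall alone would not rule out offsetting changes (one instance gained, another lost), which is why the flip count is reported separately. A minimal sketch of that comparison follows, assuming per-instance hit/miss results are available as dictionaries keyed by instance id; the function name and data shape are hypothetical, not the harness API.

```python
def count_flips(before, after):
    """Count per-instance changes between two bench runs.

    before/after: dict mapping instance id -> bool (True if hit at k).
    Returns (miss_to_hit, hit_to_miss).
    """
    miss_to_hit = sum(1 for key in before if not before[key] and after[key])
    hit_to_miss = sum(1 for key in before if before[key] and not after[key])
    return miss_to_hit, hit_to_miss


# Illustrative data only: recall is 2/4 in both runs, yet two instances moved.
before = {"a": True, "b": True, "c": False, "d": False}
offsetting = {"a": True, "b": False, "c": True, "d": False}

print(count_flips(before, before))      # (0, 0)
print(count_flips(before, offsetting))  # (1, 1)
```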
Why no recall change despite fixing a real over-claim: the over-
claims happen on sheets with empty-XFD blocks; the GT cells on
those sheets are in OTHER blocks (proper data regions). So the
over-claim was already a false negative for geom scoring, not a
false positive that needed correction. The pathology lives in the
dead zone of the retrieval metric. Citation UIs that highlight the
chunk's claimed range still benefit — that's the actual value here.
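The argument above can be checked with a toy overlap test. This sketch assumes geometric scoring counts a hit when a chunk's claimed range intersects the GT range; that definition, the function name, and the coordinates are illustrative assumptions, not taken from the bench harness.

```python
def ranges_overlap(a, b):
    """True if two inclusive (min_row, min_col, max_row, max_col) ranges intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


gt_cells = (10, 1, 12, 3)         # ground truth sits in a proper data region
data_block = (9, 1, 15, 6)        # the block that actually holds the GT cells
over_claimed = (1, 1, 4, 16384)   # A1:XFD4 before the clip
clipped = (1, 1, 3, 5)            # A1:E3 after the clip

# The empty-XFD chunk misses the GT cells before and after the clip,
# so clipping it cannot flip this instance in either direction.
print(ranges_overlap(over_claimed, gt_cells))  # False
print(ranges_overlap(clipped, gt_cells))       # False
print(ranges_overlap(data_block, gt_cells))    # True
```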
B. _split_block_by_rows + KS_CHUNK_BUDGET_CHARS env var
When a block's render_text exceeds a configurable budget, split it
into row-group sub-blocks with tight A1 ranges and non-overlapping
coverage. Default budget is 100,000 chars — effectively OFF for any
reasonable workbook on this corpus. Calibration on the 50-sample
run showed every smaller budget (2k, 4k, 8k) was net-zero or
net-negative on retrieval recall because the embedding cannot
discriminate between same-shape row-group children. Lower it via
`KS_CHUNK_BUDGET_CHARS=2000` only if your downstream consumer has
a strict per-chunk token economy that demands fragmentation; bench
any such change against your own corpus first.
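For readers who want a feel for the splitting behaviour, here is a minimal greedy sketch under stated assumptions. Only the `KS_CHUNK_BUDGET_CHARS` name and the 100,000 default come from this PR; the helper names, the per-row text input, and the one-character newline cost are hypothetical and may differ from `_split_block_by_rows`.

```python
import os

DEFAULT_BUDGET_CHARS = 100_000


def chunk_budget() -> int:
    """Read the budget from the environment, falling back to the default."""
    return int(os.environ.get("KS_CHUNK_BUDGET_CHARS", str(DEFAULT_BUDGET_CHARS)))


def split_rows_by_budget(row_texts, first_row, budget):
    """Group consecutive rows so each group's rendered text fits the budget.

    row_texts: rendered text for each row of the block, in order.
    first_row: 1-based sheet row of row_texts[0].
    Returns inclusive (start_row, end_row) pairs that do not overlap and
    together cover every input row. A single row larger than the budget
    is kept as its own group rather than being cut mid-row.
    """
    groups = []
    start = first_row
    used = 0
    for offset, text in enumerate(row_texts):
        row = first_row + offset
        cost = len(text) + 1  # assumed: one newline per rendered row
        if used and used + cost > budget:
            groups.append((start, row - 1))
            start, used = row, 0
        used += cost
    if row_texts:
        groups.append((start, first_row + len(row_texts) - 1))
    return groups


rows = ["x" * 9] * 5  # five rows, 10 chars each including the newline
print(split_rows_by_budget(rows, first_row=2, budget=25))
# [(2, 3), (4, 5), (6, 6)]
print(split_rows_by_budget(rows, first_row=2, budget=DEFAULT_BUDGET_CHARS))
# [(2, 6)]
```

With the default budget the block comes back as a single group, which matches the "effectively OFF" behaviour described above.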
Tests: tests/test_chunker_range_tighten.py (3 cases) + tests/test_chunker_size_cap.py
(5 cases) — 8 added, 1071 → 1079 total passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Two chunker-side changes. Neither moves retrieval recall on the full 912 SpreadsheetBench v0.1 with either BGE-small or text-embedding-3-large. I'm proposing this for correctness / citation-grade output, not for the bench number. Be honest about that going in.
Stacked on #10 (recall-90 harness PR) since the bench instrumentation it adds is what I used to measure the (lack of) impact below.
What changed
A · Range-tightening clip in `_block_to_chunk`
`_tight_content_bbox(block, sheet)` walks the cells inside `block.cell_range` and returns the bbox of cells whose `raw_value` is non-None or whose `display_value` is non-empty (ignoring whitespace). The chunk's claimed `(top_left_cell, bottom_right_cell)` is then clipped to that bbox before emission. The renderer continues iterating the original `block.cell_range`, so the narrowed claim is always a superset of the cells that contributed to `render_text`, and the invariant is preserved.
Concrete pathology this fixes: on the SpreadsheetBench corpus, several sheets carry styled-empty cells stretching across the full XFD width (16,384 columns). The segmenter sees them, and the chunker dutifully emits a chunk claiming `A1:XFD4` despite zero actual data outside the upper-left corner. Without the clip, citation UIs would highlight the entire sheet width as the "source" of any retrieved chunk.
B · `_split_block_by_rows` + `KS_CHUNK_BUDGET_CHARS` env var
Row-group splitter for oversize blocks. Each child has a tight, non-overlapping A1 range over its data rows; together the siblings cover exactly the parent's rows. Default is `KS_CHUNK_BUDGET_CHARS=100000`, which is effectively OFF for any reasonable workbook. Available behind the env var for downstream consumers with a strict per-chunk token economy.
Empirical numbers (full 912 / text-embedding-3-large)
recall_text@5
recall_text@5_in_scope
recall_geometric@5
recall_geometric@5_in_scope
mean_parse_ms
Why no retrieval delta despite a real bug fix
The over-claims happen on sheets where the GT cells live in other blocks. So the over-claim was already a false negative for geom scoring, not a false positive that needed correction; the metric just didn't see the lie. Citation UIs that highlight the chunk's claimed range are the actual beneficiaries. Without a citation-accuracy metric we can't surface that gain numerically; that metric is follow-up work (see #issue-when-i-file-it).
Type of change
`A1:XFD4`-style ranges)
Test plan
`make test` passes: 1071 → 1079 tests
text-embedding-3-large: 0 instance flips, recall unchanged
`tests/test_chunker_range_tighten.py`: 3 cases including a corpus-fixture invariant
`tests/test_chunker_size_cap.py`: 5 cases including the env-var-driven split + default-no-split
Notes for reviewers
🤖 Generated with Claude Code