renderer(tier-1): row anchors + number-format expansion + merged-cell propagation by arnav2 · Pull Request #12 · knowledgestack/excel-parser

arnav2 · 2026-05-20T07:46:44Z

Summary

Three surgical changes to text_renderer.render_block(). Each addresses a distinct way the parser/chunker was losing signal between the workbook and the chunk's render_text. Stacked on #11. Together they move the parser-quality metric by +0.6 pp on the full 912-instance SpreadsheetBench v0.1 run and lift text@5 by +1.0 pp and text@5_in_scope by +0.9 pp, the first measurable parser+chunker improvement of this session.

What changed

1. Row-number anchors (`r<N>` prefix per data row)

The renderer already emitted a block header with the A1 range, but data rows had no row number. Downstream consumers — especially the agent on ks-backend — couldn't compute cell coordinates from chunk text. Now:

[Sheet1!A1:D10] (table)
     | A    | B   | C   | D    |
     |------|-----|-----|------|
r1   | Name | Q1  | Q2  | Q3   |
r2   | Wgt  | 100 | 150 | 200  |

Row prefix width is sized to the largest row number in the block so the grid stays aligned regardless of block depth.
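The prefix sizing can be illustrated with a minimal sketch. It is not the code in text_renderer.render_block(); `render_rows`, its input shape, and the lack of column padding are assumptions made for illustration only.

```python
def render_rows(rows, first_row):
    """Prefix each data row with an r<N> anchor padded to a common width.

    rows: list of lists of already-stringified cell values (hypothetical shape).
    first_row: 1-indexed sheet row number of rows[0].
    """
    last_row = first_row + len(rows) - 1
    # The widest anchor belongs to the largest row number, e.g. "r10" -> 3 chars.
    width = len(f"r{last_row}")
    lines = []
    for offset, cells in enumerate(rows):
        anchor = f"r{first_row + offset}".ljust(width)
        lines.append(f"{anchor} | " + " | ".join(cells) + " |")
    return lines


lines = render_rows([["Name", "Q1"], ["Wgt", "100"]], first_row=9)
print("\n".join(lines))
# r9  | Name | Q1 |
# r10 | Wgt | 100 |
```

Padding every anchor to the width of the last row number keeps the first pipe in the same column when a block crosses a digit boundary (r9 to r10, r99 to r100).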

2. Number-format-aware rendering

When a cell's number_format produces a meaningfully different displayed string from the raw value, we now emit BOTH:

| raw value | number_format | rendered |
|-----------|---------------|----------|
| `0.06` | `0%` | `0.06 [6%]` |
| `1272` | `#,##0.00` | `1272 [1,272.00]` |
| `46022` | `yyyy-mm-dd` | already handled (date) |

Substring-match retrieval can hit either form — the question may quote either, and answer.xlsx often uses the display form even though input.xlsx keeps the raw.

Trivial diffs (1272 → "1272.00", "1272.0") are NOT expanded — they add no retrieval-relevant tokens, only noise.
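A rough sketch of the expand-or-suppress decision follows. The helper names are hypothetical. `display_string` covers only three format codes and is not a real Excel format engine. The triviality test (the displayed string parses back to the same number) is an assumed reading of "trivial diff", not necessarily the rule the renderer uses.

```python
def display_string(value, number_format):
    """Format value for a tiny, hypothetical subset of Excel number formats."""
    if number_format == "0%":
        return f"{value * 100:.0f}%"
    if number_format == "#,##0.00":
        return f"{value:,.2f}"
    if number_format == "0.00":
        return f"{value:.2f}"
    return str(value)  # "General" and anything unrecognised


def is_trivial_diff(value, shown):
    """True when shown is the same number with only trailing zeros changed."""
    try:
        return float(shown) == float(value)
    except ValueError:
        return False  # "%" or thousands separators: a genuinely new token


def render_cell(value, number_format):
    """Emit 'raw [displayed]' only when the displayed form adds new tokens."""
    raw = str(value)
    shown = display_string(value, number_format)
    if shown == raw or is_trivial_diff(value, shown):
        return raw
    return f"{raw} [{shown}]"


print(render_cell(0.06, "0%"))        # 0.06 [6%]
print(render_cell(1272, "#,##0.00"))  # 1272 [1,272.00]
print(render_cell(1272, "0.00"))      # 1272
```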

3. Merged-cell value propagation

Slave cells in a merged region used to render blank (openpyxl returns None for them). Questions that referenced the cell by a slave coordinate could never match. Now:

r1   | Total | ← Total | ← Total |  (master + 2 slaves)

The merged region's visible value now appears at every position it covers in Excel, not just the top-left cell.
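The propagation step can be sketched without any workbook library. Here the sheet is a plain dict and merged regions are bound tuples; both shapes, and the name `propagate_merged`, are assumptions for illustration. With openpyxl the bounds would presumably be read from `ws.merged_cells.ranges`.

```python
def propagate_merged(cells, merged_regions):
    """Return a copy of cells with the master value copied into merged cells.

    cells: dict mapping (row, col) -> value, None at non-master positions.
    merged_regions: iterable of (min_row, min_col, max_row, max_col),
        1-indexed and inclusive (hypothetical representation).
    """
    filled = dict(cells)
    for min_row, min_col, max_row, max_col in merged_regions:
        master = cells.get((min_row, min_col))
        if master is None:
            continue  # empty merged region: nothing to propagate
        for row in range(min_row, max_row + 1):
            for col in range(min_col, max_col + 1):
                if (row, col) != (min_row, min_col):
                    # Mark the copy so readers can tell it from a real value.
                    filled[(row, col)] = f"← {master}"
    return filled


cells = {(1, 1): "Total", (1, 2): None, (1, 3): None, (1, 4): "Notes"}
filled = propagate_merged(cells, [(1, 1, 1, 3)])
row = " | ".join(str(filled[(1, c)]) for c in range(1, 5))
print(row)  # Total | ← Total | ← Total | Notes
```

Because the arrow marker is only added to propagated cells, a sheet with no merged regions renders exactly as before.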

Bench on full 912 with `text-embedding-3-large`

| Metric | Before Tier-1 | After Tier-1 | Δ |
|--------|---------------|--------------|---|
| Parser-quality (rank IS NOT None, in-scope) | 0.843 | 0.849 | +0.006 (+4 instances surfaced) |
| recall_text@5 | 0.750 | 0.760 | +0.010 |
| recall_text@5_in_scope | 0.838 | 0.847 | +0.009 |
| recall_text@3 | 0.746 | 0.755 | +0.009 |
| recall_text@1 | 0.638 | 0.641 | +0.003 |
| recall_geometric@5 | 0.482 | 0.484 | +0.002 (noise) |
| mean parse_ms | 156 | 184 | +27 ms |
| Per-instance flips | - | - | 6 miss→hit, 0 hit→miss on text@5 |

The +0.9 pp text@5_in_scope move is modest but clean — zero regressions, no parser internals reshaped, all gains come from making the chunk text more faithful to what's visually in the workbook.
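For readers who want to reproduce the per-instance flip count, one way to compute it from two runs is sketched below. The function names, the dict-of-ranks input, and the convention that ranks are 1-indexed with None meaning "not surfaced" are assumptions. The sample data is made up and is not taken from the benchmark.

```python
def hit_at_k(rank, k):
    """A hit means the answer chunk was ranked and landed in the top k."""
    return rank is not None and rank <= k


def count_flips(before, after, k=5):
    """Count miss->hit and hit->miss transitions between two runs.

    before / after: dicts mapping instance id -> 1-indexed rank, or None
    when the answer was not surfaced in any chunk (hypothetical shape).
    """
    gained = lost = 0
    for instance_id, old_rank in before.items():
        was_hit = hit_at_k(old_rank, k)
        is_hit = hit_at_k(after.get(instance_id), k)
        if is_hit and not was_hit:
            gained += 1
        elif was_hit and not is_hit:
            lost += 1
    return gained, lost


# Made-up ranks for three instances, for illustration only.
before = {"a": 7, "b": None, "c": 2}
after = {"a": 3, "b": 4, "c": 1}
print(count_flips(before, after))  # (2, 0)
```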

Type of change

✨ New feature (row anchors, format expansion, merged-cell propagation)
🚀 Performance (slight regression: +27ms parse for the format/merge lookups)
🧪 Parser edge case / new regression test (7 new cases)

Test plan

make test passes — 1079 → 1086 (+7 new)
Full 912 SpreadsheetBench v0.1 with text-embedding-3-large: +6 instances flip miss→hit on text@5, 0 regressions
tests/test_renderer_tier1.py covers all 3 changes + boundary cases (trivial diff suppression, no spurious markers on unmerged sheets)

Notes for reviewers

Stacks on #11 (chunker: range-tighten clip + opt-in row-group splitter; correctness only, no recall move). Review that PR first.
No new dependencies. All changes are deterministic, no env vars, no opt-in flags. The renderer just emits more faithful chunks.
render_text gets ~5-10% longer because of the row anchors, format-expansion brackets, and merged-cell ← markers. That stays within the embedder's context window for any reasonable block and doesn't trigger any chunk-size cap.

🤖 Generated with Claude Code

… propagation

Three changes to text_renderer.render_block(), each addressing a distinct way the parser/chunker was losing signal between the workbook and the chunk's render_text. Together they move the parser-quality metric (rank IS NOT None: answer surfaced in some chunk's text) by +0.6 pp on the full 912 SpreadsheetBench v0.1, and lift text@5 by +1.0 pp / text@5_in_scope by +0.9 pp on the same run.

1. Row-number anchors

Every data row of the markdown grid now carries an `r<N>` prefix where N is the sheet row (1-indexed):

[Sheet1!A1:D10] (table)
     | A    | B   | C   | D    |
     |------|-----|-----|------|
r1   | Name | Q1  | Q2  | Q3   |
r2   | Wgt  | 100 | 150 | 200  |

A downstream LLM consuming the chunk can now compute cell coordinates deterministically: the block header gives the A1 range; per-row anchors close the gap to (row, col). Citation-grade output for the agent-side use cases on ks-backend.

2. Number-format-aware rendering

When a cell's number_format produces a meaningfully-different displayed string (0.06 → "6%", 1272 → "1,272.00", 46022 → date), we now emit both:

r2   | 0.06 [6%] | 1272 [1,272.00] |

Substring-search retrieval hits either form: the question may quote the raw or the displayed, and answer.xlsx may use the display form even though input.xlsx keeps the raw.

Trivial diffs (1272 → "1272.00", "1272.0") are NOT expanded: no information added, only noise.

3. Merged-cell value propagation

Slave cells in a merged region currently render blank because openpyxl returns None for them. That kills text-match retrieval whenever a question references the cell by a slave coordinate. Renderer now looks up the master and emits the master's value at each slave with a `← ` propagation marker:

r1   | Total | ← Total | ← Total |

The merged region's visible value now appears at every position it appears in Excel, not just the top-left.

Bench (full 912 / text-embedding-3-large):
parser-quality (rank IS NOT None): 0.843 → 0.849 (+4 instances)
recall_text@5: 0.750 → 0.760 (+0.010)
recall_text@5_in_scope: 0.838 → 0.847 (+0.009)
recall_geometric@5: 0.482 → 0.484 (no real change)
mean parse_ms: 156 → 184 (+27 ms)
per-instance: 6 miss→hit, 0 hit→miss

Tests: tests/test_renderer_tier1.py, 7 cases (row anchor presence + correct sheet-row indexing, percent/decimal format expansion, trivial-diff suppression, merged-cell propagation + sanity). 1079 → 1086 total passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

arnav2 wants to merge 1 commit into chunker/range-tighten-and-size-cap from chunker/render-anchors-formats-merges
