From 421783f7d6c93bf23bec859736b191d47ba4e511 Mon Sep 17 00:00:00 2001
From: Arnav Goel <arnavgoel2@gmail.com>
Date: Tue, 19 May 2026 17:35:11 -0400
Subject: [PATCH 1/4] fix(packaging): nest everything under ks_xlsx_parser, add
 CI + bench infra
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Frank's report on the v0.2.0 PyPI release was correct: the wheel was
missing pipeline.py and api.py (top-level modules under src/, which
setuptools packages.find skips), AND it was leaking 13 generic top-level
packages (models, utils, parsers, ...) into site-packages. Confirmed by
building the wheel and inspecting top_level.txt + the file list.

Packaging
- Move 13 packages + pipeline.py + api.py + py.typed under
  src/ks_xlsx_parser/ via git mv (history preserved).
- Mechanical rewrite of 47 files: `from pipeline import` →
  `from ks_xlsx_parser.pipeline import`, etc. Dead `xlsx_parser.` refs
  in docstrings and examples updated to `ks_xlsx_parser.`.
- pyproject.toml: packages.find constrained to ks_xlsx_parser*, py.typed
  declared as package data, `xlsx-parser-api` console script fixed.
- Drop PYTHONPATH=src from Makefile — package is now importable.

Regression guard
- scripts/verify_wheel.py builds the wheel, installs in a clean venv,
  asserts top_level.txt = ['ks_xlsx_parser'] and the public import
  surface resolves. Wired into a new wheel-check CI job and the release
  workflow. The matrix `test` job uses an editable install, which
  structurally cannot catch a broken wheel — this is the gap that let
  v0.2.0 ship.

CI (ks-backend-style)
- Split ci.yml into test/lint/typecheck/wheel-check jobs, uv-backed
  installs, lint+typecheck non-blocking with TODO until the cleanup PR
  lands (114 → 45 ruff findings after safe autofix; 59 mypy unchanged).
- Add wheel verification step to release.yml before PyPI publish.

Benchmark + accuracy tracking
- Dockerfile.bench reproduces the SpreadsheetBench retrieval benchmark.
  Pre-warms BAAI/bge-small-en-v1.5; entrypoint ensures corpus, runs
  eval, appends to history, triages.
- .github/workflows/benchmark.yml: 60-instance sample on PRs, full
  912-instance corpus weekly + on manual dispatch. Uploads reports as
  artifact and posts recall summary to the job step summary.
- scripts/append_bench_history.py appends one row per run to
  tests/benchmarks/reports/history.jsonl tagged with the git commit
  and prints the row-over-row delta. Goal documented:
  text recall@5 > 0.90 (currently 0.704).

Recall failure triage (turns "recall is low" into a ranked worklist)
- eval_retrieval.py --emit-failures dumps every recall@5 miss with the
  top-8 ranked chunks + ground-truth values + a failure_bucket label
  (answer_absent_from_chunks / present_but_ranked_low / wrong_sheet /
  geometric_no_overlap / no_chunks / parse_error / unparseable_ground_truth).
- summary.json gains a failure_buckets histogram per parser; summary.md
  prints the bucket table.
- scripts/triage_recall.py reads failures.ndjson, ranks buckets by
  count, prints 3 exemplar failures per bucket so a human or agent can
  see WHY each miss happened without re-running the benchmark.
- docs/recall-investigation.md documents the diagnosis framework and
  three named hypotheses (chunk-size dilution, formula-expression
  rendering, range-bookkeeping drift) for the next investigation pass.
- .claude/skills/recall-failure-triage/SKILL.md — agent guide.

Misc
- New `bench` optional-dependency group (sentence-transformers, numpy).
- make install-dev alias (Frank pattern-matched this verb off ks-backend).
- New make targets: wheel-check, bench-track, docker-bench.

Verification
- 1041 tests pass with no PYTHONPATH=src.
- Wheel built + verified in clean venv.
- mypy: 59 pre-existing findings, no new ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .github/workflows/benchmark.yml               |  85 +++++++++
 .github/workflows/ci.yml                      |  71 ++++++--
 .github/workflows/release.yml                 |   6 +
 CHANGELOG.md                                  |  46 +++++
 Dockerfile.bench                              |  49 +++++
 Makefile                                      |  37 +++-
 docs/recall-investigation.md                  | 145 +++++++++++++++
 examples/demo.py                              |   6 +-
 examples/generate_examples.py                 |   2 +-
 examples/stress_test/stress_test_runner.py    |   2 +-
 pyproject.toml                                |  12 +-
 scripts/append_bench_history.py               | 101 +++++++++++
 scripts/eval_retrieval.py                     | 168 +++++++++++++++++-
 scripts/run_bench.sh                          |  56 ++++++
 scripts/run_enterprise_metrics.py             |   3 +-
 scripts/track_corpus_metrics.py               |   1 -
 scripts/triage_recall.py                      | 114 ++++++++++++
 scripts/verify_wheel.py                       |  81 +++++++++
 src/ks_xlsx_parser/__init__.py                |  10 +-
 src/{ => ks_xlsx_parser}/analysis/__init__.py |   0
 .../analysis/light_block_detector.py          |   6 +-
 .../analysis/llm_artifacts.py                 |  16 +-
 .../analysis/pattern_splitter.py              |   9 +-
 .../analysis/table_assembler.py               |   8 +-
 .../analysis/table_grouper.py                 |   8 +-
 .../analysis/template_extractor.py            |   8 +-
 .../analysis/tree_builder.py                  |   9 +-
 .../annotation/__init__.py                    |   0
 .../annotation/block_splitter.py              |   7 +-
 .../annotation/cell_annotator.py              |   6 +-
 src/{ => ks_xlsx_parser}/api.py               |  11 +-
 src/{ => ks_xlsx_parser}/charts/__init__.py   |   0
 .../charts/chart_extractor.py                 |   4 +-
 src/{ => ks_xlsx_parser}/chunking/__init__.py |   0
 src/{ => ks_xlsx_parser}/chunking/chunker.py  |  16 +-
 .../chunking/segmenter.py                     |   8 +-
 .../comparison/__init__.py                    |   0
 .../comparison/template_comparator.py         |   4 +-
 src/{ => ks_xlsx_parser}/export/__init__.py   |   0
 .../export/model_exporter.py                  |   6 +-
 src/{ => ks_xlsx_parser}/formula/__init__.py  |   0
 .../formula/dependency_builder.py             |   9 +-
 .../formula/formula_parser.py                 |   4 +-
 src/{ => ks_xlsx_parser}/models/__init__.py   |   4 +-
 src/{ => ks_xlsx_parser}/models/block.py      |   0
 src/{ => ks_xlsx_parser}/models/cell.py       |   0
 src/{ => ks_xlsx_parser}/models/chart.py      |   0
 src/{ => ks_xlsx_parser}/models/common.py     |   0
 src/{ => ks_xlsx_parser}/models/dependency.py |   0
 src/{ => ks_xlsx_parser}/models/shape.py      |   2 +-
 src/{ => ks_xlsx_parser}/models/sheet.py      |   2 -
 src/{ => ks_xlsx_parser}/models/table.py      |   0
 .../models/table_structure.py                 |   0
 src/{ => ks_xlsx_parser}/models/template.py   |   0
 src/{ => ks_xlsx_parser}/models/tree.py       |   0
 src/{ => ks_xlsx_parser}/models/workbook.py   |   2 +-
 src/{ => ks_xlsx_parser}/parsers/__init__.py  |   0
 .../parsers/calamine_core.py                  |   1 -
 .../parsers/cell_parser.py                    |   4 +-
 .../parsers/sheet_parser.py                   |  14 +-
 .../parsers/table_parser.py                   |   4 +-
 .../parsers/workbook_parser.py                |  13 +-
 src/{ => ks_xlsx_parser}/pipeline.py          |  44 ++---
 src/{ => ks_xlsx_parser}/py.typed             |   0
 .../rendering/__init__.py                     |   0
 .../rendering/html_renderer.py                |   8 +-
 .../rendering/text_renderer.py                |   8 +-
 src/{ => ks_xlsx_parser}/storage/__init__.py  |   0
 .../storage/serializer.py                     |   4 +-
 src/{ => ks_xlsx_parser}/utils/__init__.py    |   0
 .../utils/logging_config.py                   |   2 +-
 .../verification/__init__.py                  |   0
 .../verification/stage_verifier.py            |  32 ++--
 tests/benchmarks/_driver.py                   |   4 +-
 tests/benchmarks/_runner.py                   |   2 +-
 tests/benchmarks/adapters/docling_adapter.py  |   2 +-
 tests/benchmarks/adapters/ks_adapter.py       |   4 +-
 tests/conftest.py                             |   2 -
 tests/helpers/invariant_checker.py            |   2 +-
 tests/test_charts.py                          |   7 +-
 tests/test_corpus_robustness.py               |  12 +-
 tests/test_formula_handling.py                |  15 +-
 tests/test_formula_parser.py                  |   5 +-
 tests/test_llm_artifacts.py                   |   7 +-
 tests/test_models.py                          |   4 +-
 tests/test_multi_table_layout.py              |   9 +-
 tests/test_parsers.py                         |   3 +-
 tests/test_pipeline.py                        |   4 +-
 tests/test_rendering.py                       |  18 +-
 tests/test_segmentation.py                    |   7 +-
 tests/test_stage_verification.py              |   5 +-
 tests/test_structural_invariants.py           |   9 +-
 92 files changed, 1146 insertions(+), 253 deletions(-)
 create mode 100644 .github/workflows/benchmark.yml
 create mode 100644 Dockerfile.bench
 create mode 100644 docs/recall-investigation.md
 create mode 100755 scripts/append_bench_history.py
 create mode 100755 scripts/run_bench.sh
 create mode 100755 scripts/triage_recall.py
 create mode 100755 scripts/verify_wheel.py
 rename src/{ => ks_xlsx_parser}/analysis/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/analysis/light_block_detector.py (96%)
 rename src/{ => ks_xlsx_parser}/analysis/llm_artifacts.py (98%)
 rename src/{ => ks_xlsx_parser}/analysis/pattern_splitter.py (96%)
 rename src/{ => ks_xlsx_parser}/analysis/table_assembler.py (96%)
 rename src/{ => ks_xlsx_parser}/analysis/table_grouper.py (97%)
 rename src/{ => ks_xlsx_parser}/analysis/template_extractor.py (95%)
 rename src/{ => ks_xlsx_parser}/analysis/tree_builder.py (96%)
 rename src/{ => ks_xlsx_parser}/annotation/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/annotation/block_splitter.py (97%)
 rename src/{ => ks_xlsx_parser}/annotation/cell_annotator.py (98%)
 rename src/{ => ks_xlsx_parser}/api.py (97%)
 rename src/{ => ks_xlsx_parser}/charts/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/charts/chart_extractor.py (99%)
 rename src/{ => ks_xlsx_parser}/chunking/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/chunking/chunker.py (95%)
 rename src/{ => ks_xlsx_parser}/chunking/segmenter.py (98%)
 rename src/{ => ks_xlsx_parser}/comparison/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/comparison/template_comparator.py (99%)
 rename src/{ => ks_xlsx_parser}/export/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/export/model_exporter.py (98%)
 rename src/{ => ks_xlsx_parser}/formula/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/formula/dependency_builder.py (93%)
 rename src/{ => ks_xlsx_parser}/formula/formula_parser.py (98%)
 rename src/{ => ks_xlsx_parser}/models/__init__.py (94%)
 rename src/{ => ks_xlsx_parser}/models/block.py (100%)
 rename src/{ => ks_xlsx_parser}/models/cell.py (100%)
 rename src/{ => ks_xlsx_parser}/models/chart.py (100%)
 rename src/{ => ks_xlsx_parser}/models/common.py (100%)
 rename src/{ => ks_xlsx_parser}/models/dependency.py (100%)
 rename src/{ => ks_xlsx_parser}/models/shape.py (97%)
 rename src/{ => ks_xlsx_parser}/models/sheet.py (99%)
 rename src/{ => ks_xlsx_parser}/models/table.py (100%)
 rename src/{ => ks_xlsx_parser}/models/table_structure.py (100%)
 rename src/{ => ks_xlsx_parser}/models/template.py (100%)
 rename src/{ => ks_xlsx_parser}/models/tree.py (100%)
 rename src/{ => ks_xlsx_parser}/models/workbook.py (99%)
 rename src/{ => ks_xlsx_parser}/parsers/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/parsers/calamine_core.py (99%)
 rename src/{ => ks_xlsx_parser}/parsers/cell_parser.py (99%)
 rename src/{ => ks_xlsx_parser}/parsers/sheet_parser.py (98%)
 rename src/{ => ks_xlsx_parser}/parsers/table_parser.py (97%)
 rename src/{ => ks_xlsx_parser}/parsers/workbook_parser.py (97%)
 rename src/{ => ks_xlsx_parser}/pipeline.py (90%)
 rename src/{ => ks_xlsx_parser}/py.typed (100%)
 rename src/{ => ks_xlsx_parser}/rendering/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/rendering/html_renderer.py (96%)
 rename src/{ => ks_xlsx_parser}/rendering/text_renderer.py (97%)
 rename src/{ => ks_xlsx_parser}/storage/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/storage/serializer.py (98%)
 rename src/{ => ks_xlsx_parser}/utils/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/utils/logging_config.py (97%)
 rename src/{ => ks_xlsx_parser}/verification/__init__.py (100%)
 rename src/{ => ks_xlsx_parser}/verification/stage_verifier.py (97%)

diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
new file mode 100644
index 0000000..026b535
--- /dev/null
+++ b/.github/workflows/benchmark.yml
@@ -0,0 +1,85 @@
+name: Benchmark
+
+# Tracks ks-xlsx-parser retrieval recall on SpreadsheetBench over time.
+# The headline goal: text recall@5 > 0.90 (currently ~0.70).
+#
+#   * Pull requests run a fast SAMPLE (60 instances) as a regression smoke
+#     test — keeps the signal without a 40-minute wait.
+#   * The weekly schedule + manual dispatch run the FULL 912-instance
+#     corpus and publish the recall trend.
+
+on:
+  pull_request:
+    branches: [main]
+    paths:
+      - "src/**"
+      - "scripts/eval_retrieval.py"
+      - "scripts/triage_recall.py"
+      - "Dockerfile.bench"
+      - ".github/workflows/benchmark.yml"
+  schedule:
+    - cron: "0 6 * * 1" # Mondays 06:00 UTC
+  workflow_dispatch:
+    inputs:
+      sample:
+        description: "Instances to sample (0 = full 912 corpus)"
+        default: "0"
+
+concurrency:
+  group: benchmark-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  benchmark:
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - uses: actions/checkout@v4
+
+      # PRs use a 60-instance sample; scheduled/dispatch runs use the full
+      # corpus (or whatever the dispatch input requests).
+      - name: Resolve sample size
+        id: cfg
+        run: |
+          if [ "${{ github.event_name }}" = "pull_request" ]; then
+            echo "sample=60" >> "$GITHUB_OUTPUT"
+          else
+            echo "sample=${{ github.event.inputs.sample || 0 }}" >> "$GITHUB_OUTPUT"
+          fi
+
+      - name: Cache SpreadsheetBench corpus
+        uses: actions/cache@v4
+        with:
+          path: data/corpora/spreadsheetbench
+          key: spreadsheetbench-912-v0.1
+
+      - name: Build benchmark image
+        run: docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
+
+      - name: Run benchmark
+        run: |
+          mkdir -p tests/benchmarks/reports data
+          docker run --rm \
+            -e BENCH_SAMPLE=${{ steps.cfg.outputs.sample }} \
+            -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
+            -v "$PWD/data:/app/data" \
+            ks-xlsx-parser-bench | tee bench.log
+
+      - name: Publish recall to job summary
+        if: always()
+        run: |
+          {
+            echo '## ks-xlsx-parser retrieval benchmark'
+            echo ''
+            echo '```'
+            tail -n 40 bench.log || true
+            echo '```'
+          } >> "$GITHUB_STEP_SUMMARY"
+
+      - name: Upload benchmark reports
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: benchmark-reports-${{ github.run_number }}
+          path: tests/benchmarks/reports/
+          if-no-files-found: warn
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 1168a01..ba7b9e2 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -23,22 +23,20 @@ jobs:
     steps:
       - uses: actions/checkout@v4
 
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          version: "latest"
+          enable-cache: true
+          cache-dependency-glob: "uv.lock"
+
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
-          cache: pip
-          cache-dependency-path: pyproject.toml
 
       - name: Install
-        run: |
-          python -m pip install --upgrade pip
-          pip install -e ".[dev,api]"
-
-      - name: Ruff lint
-        run: |
-          pip install ruff
-          ruff check src/ tests/ scripts/ || true  # non-blocking until cleanup PR lands
+        run: uv pip install --system -e ".[dev,api]"
 
       - name: Run test suite
         run: make test-ci
@@ -50,3 +48,56 @@ jobs:
           name: junit-${{ matrix.os }}-py${{ matrix.python-version }}
           path: reports/junit.xml
           if-no-files-found: ignore
+
+  # Builds the wheel and proves it installs + imports in a CLEAN venv.
+  # The matrix `test` job runs against an editable install, which exposes the
+  # whole src/ tree on sys.path and therefore CANNOT catch a broken wheel —
+  # exactly how the v0.2.0 "pipeline.py missing from wheel" bug shipped.
+  # This job is the regression guard. Keep it required.
+  wheel-check:
+    name: wheel install smoke test
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Build wheel
+        run: |
+          python -m pip install --upgrade pip build
+          python -m build --wheel
+
+      - name: Verify wheel installs and imports in a clean venv
+        run: python scripts/verify_wheel.py
+
+  lint:
+    name: ruff lint
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: Ruff lint
+        run: |
+          python -m pip install --upgrade pip ruff
+          # TODO(cleanup): 45 pre-existing findings (E402/B905/SIM*). Drop the
+          # `|| true` once the lint-cleanup PR lands so this job gates merges.
+          ruff check src/ tests/ scripts/ || true
+
+  typecheck:
+    name: mypy
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: mypy
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e ".[dev,api]"
+          # TODO(cleanup): 59 pre-existing findings. Drop `|| true` once typed.
+          python -m mypy src/ks_xlsx_parser || true
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
index 5cf43b6..1da9bcc 100644
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -37,6 +37,12 @@ jobs:
       - name: Build wheel + sdist
         run: python -m build
 
+      # Gate the release on a clean-venv install of the freshly built wheel.
+      # Prevents shipping a wheel that drops modules or leaks top-level
+      # packages (the v0.2.0 packaging regression).
+      - name: Verify wheel
+        run: python scripts/verify_wheel.py
+
       - name: Upload distribution artifacts
         uses: actions/upload-artifact@v4
         with:
diff --git a/CHANGELOG.md b/CHANGELOG.md
index d7527b2..99d5832 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -45,6 +45,52 @@ Template for a new release (copy this block, fill in, move Unreleased items in):
 
 ## [Unreleased]
 
+### ⚠️ BREAKING (Fixed — see also #ks-xlsx-parser channel report)
+- Repository layout flattened on `src/` was leaking 13 generic top-level
+  packages (`models`, `utils`, `parsers`, …) into installed wheels and
+  silently dropping `pipeline.py` and `api.py` (setuptools `packages.find`
+  only finds *packages*, not top-level modules). Users hitting
+  `from ks_xlsx_parser.pipeline import ...` on 0.2.0 from PyPI got
+  `ModuleNotFoundError`. **All modules now live under
+  `src/ks_xlsx_parser/`**; the wheel's `top_level.txt` contains only
+  `ks_xlsx_parser`. Imports inside the package switched from
+  `from pipeline import` to `from ks_xlsx_parser.pipeline import`.
+  Downstream code that imported the leaked generics
+  (`from models import …`) MUST migrate to `from ks_xlsx_parser.models …`.
+
+### Added
+- `scripts/verify_wheel.py` — builds the wheel, installs it in a fresh
+  venv, and asserts the public import surface resolves. Wired into a
+  new `wheel-check` job in `.github/workflows/ci.yml` and a `Verify wheel`
+  step in `release.yml`. Regression guard for the packaging bug above.
+- `scripts/triage_recall.py` + `scripts/append_bench_history.py` — turn
+  `failures.ndjson` into a ranked bucket histogram with exemplar
+  failures, and append each benchmark run to
+  `tests/benchmarks/reports/history.jsonl` so recall is tracked
+  commit-over-commit. Goal: text recall@5 > 0.90.
+- `eval_retrieval.py --emit-failures` — dumps top-8 ranked chunks per
+  miss with a `failure_bucket` (answer_absent_from_chunks /
+  present_but_ranked_low / wrong_sheet / geometric_no_overlap / …) for
+  triage. Summary JSON gains a `failure_buckets` histogram.
+- `Dockerfile.bench` + `.github/workflows/benchmark.yml` — reproducible
+  benchmark image; PR sample run (60 instances), weekly full corpus run.
+- `make install-dev` alias and `make wheel-check` / `make bench-track`
+  / `make docker-bench` targets.
+- New `bench` optional-dependency group (`sentence-transformers`,
+  `numpy`) — only the benchmark needs these.
+- `docs/recall-investigation.md` documenting the diagnosis framework and
+  three named hypotheses (chunk-size dilution, formula-expression
+  rendering, range-bookkeeping drift).
+- `.claude/skills/recall-failure-triage/SKILL.md` — agent skill that
+  consumes the bucket output and proposes ranked fixes.
+
+### Changed
+- Dropped `PYTHONPATH=src` from Makefile benchmark targets — the
+  package is now properly installable so callers don't need it.
+- `pyproject.toml`: `packages.find` constrained to `ks_xlsx_parser*`,
+  `py.typed` declared as package data, `xlsx-parser-api` console script
+  updated to `ks_xlsx_parser.api:main`.
+
 ### ⚠️ BREAKING
 - Retired the in-tree `testBench/` corpus. The 1054-workbook stress dataset
   and `make testbench*` targets are gone — benchmarks now run against the
diff --git a/Dockerfile.bench b/Dockerfile.bench
new file mode 100644
index 0000000..cf2755f
--- /dev/null
+++ b/Dockerfile.bench
@@ -0,0 +1,49 @@
+# Benchmark image for ks-xlsx-parser.
+#
+# Builds once, then on each run downloads SpreadsheetBench (if not cached),
+# parses the corpus, embeds chunks with a small sentence-transformer, and
+# emits a recall@k report + failure-bucket triage. The output lands in
+# tests/benchmarks/reports/ — mount that path as a volume to persist results.
+#
+# Usage:
+#   docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
+#   docker run --rm \
+#       -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
+#       -v "$PWD/data:/app/data" \
+#       ks-xlsx-parser-bench
+#
+#   # Quick sanity run on 20 instances:
+#   docker run --rm -e BENCH_SAMPLE=20 ks-xlsx-parser-bench
+
+FROM python:3.12-slim
+
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1 \
+    PIP_DISABLE_PIP_VERSION_CHECK=1
+
+WORKDIR /app
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+        curl unzip ca-certificates git \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install deps first to keep layers cacheable across code edits.
+COPY pyproject.toml README.md ./
+COPY src ./src
+RUN pip install -e ".[dev,bench]"
+
+# Pre-warm the embedding model so the first ``docker run`` doesn't pay the
+# ~80 MB download. Same model name eval_retrieval.py defaults to.
+RUN python -c "from sentence_transformers import SentenceTransformer; \
+SentenceTransformer('BAAI/bge-small-en-v1.5')"
+
+COPY scripts ./scripts
+COPY tests ./tests
+COPY Makefile ./
+
+ENV BENCH_SAMPLE=0 \
+    BENCH_PARSERS=ks \
+    BENCH_TIMEOUT=120
+
+ENTRYPOINT ["bash", "scripts/run_bench.sh"]
diff --git a/Makefile b/Makefile
index 9bedb5b..5331c52 100644
--- a/Makefile
+++ b/Makefile
@@ -1,4 +1,5 @@
-.PHONY: help install test test-ci lint format typecheck clean corpus-download bench-robust bench-retrieval bench
+.PHONY: help install install-dev test test-ci lint format typecheck wheel-check clean \
+	corpus-download bench-robust bench-retrieval bench bench-track docker-bench
 
 PYTHON ?= python
 PKG_VERSION := $(shell $(PYTHON) -c "import tomllib, pathlib; print(tomllib.loads(pathlib.Path('pyproject.toml').read_text())['project']['version'])")
@@ -7,22 +8,29 @@ help:
 	@echo "ks-xlsx-parser — common targets"
 	@echo ""
 	@echo "  make install         Install package and dev deps (editable)"
+	@echo "  make install-dev     Alias for install (matches ks-backend)"
 	@echo "  make test            Run the default test suite"
 	@echo "  make test-ci         Run the suite with verbose output for CI"
 	@echo ""
 	@echo "  make lint            Ruff lint"
 	@echo "  make format          Ruff format"
 	@echo "  make typecheck       mypy"
+	@echo "  make wheel-check     Build wheel + verify it imports in a clean venv"
 	@echo ""
 	@echo "  make corpus-download Fetch SpreadsheetBench for benchmark runs"
 	@echo ""
 	@echo "  make bench-robust    Robustness on SpreadsheetBench (ks vs docling, ~20 min)"
 	@echo "  make bench-retrieval Retrieval recall on SpreadsheetBench (ks vs docling, ~40 min)"
 	@echo "  make bench           Run both benchmarks back-to-back"
+	@echo "  make bench-track     Run retrieval bench + append metrics to history"
+	@echo "  make docker-bench    Build + run the benchmark Docker image"
 
 install:
 	$(PYTHON) -m pip install -e ".[dev,api]"
 
+# Alias — junior devs pattern-match off ks-backend's `make install-dev`.
+install-dev: install
+
 test:
 	$(PYTHON) -m pytest tests/ -v --tb=short -W ignore::UserWarning
 
@@ -36,7 +44,15 @@ format:
 	$(PYTHON) -m ruff format src/ tests/ scripts/
 
 typecheck:
-	$(PYTHON) -m mypy src/xlsx_parser
+	$(PYTHON) -m mypy src/ks_xlsx_parser
+
+# Build the wheel and prove it imports outside the editable source tree.
+# This is the regression guard for the v0.2.0 packaging bug (pipeline.py
+# missing from the wheel because it was a top-level module, not a package).
+wheel-check:
+	rm -rf dist build
+	$(PYTHON) -m build --wheel
+	$(PYTHON) scripts/verify_wheel.py
 
 clean:
 	rm -rf build/ dist/ *.egg-info src/*.egg-info .pytest_cache .ruff_cache .mypy_cache
@@ -47,13 +63,26 @@ corpus-download:
 
 bench-robust:
 	@test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1)
-	PYTHONPATH=src $(PYTHON) -m tests.benchmarks.vs_hucre \
+	$(PYTHON) -m tests.benchmarks.vs_hucre \
 		--corpus data/corpora/spreadsheetbench --parsers ks,docling \
 		--per-file-timeout 120 \
 		--out tests/benchmarks/reports/spreadsheetbench
 
 bench-retrieval:
 	@test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1)
-	PYTHONPATH=src $(PYTHON) scripts/eval_retrieval.py --parsers ks,docling
+	$(PYTHON) scripts/eval_retrieval.py --parsers ks,docling
 
 bench: bench-robust bench-retrieval
+
+# Run the retrieval benchmark and append a row to history.jsonl so
+# accuracy can be tracked commit-over-commit. Goal: text recall@5 > 0.90.
+bench-track:
+	@test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1)
+	$(PYTHON) scripts/eval_retrieval.py --parsers ks --emit-failures \
+		--out tests/benchmarks/reports/retrieval
+	$(PYTHON) scripts/append_bench_history.py
+	$(PYTHON) scripts/triage_recall.py tests/benchmarks/reports/retrieval
+
+docker-bench:
+	docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
+	docker run --rm -v "$(PWD)/tests/benchmarks/reports:/app/tests/benchmarks/reports" ks-xlsx-parser-bench
diff --git a/docs/recall-investigation.md b/docs/recall-investigation.md
new file mode 100644
index 0000000..d3576c8
--- /dev/null
+++ b/docs/recall-investigation.md
@@ -0,0 +1,145 @@
+# Retrieval-recall investigation — getting ks-xlsx-parser to >0.90
+
+## Where we are (v0.2.0 on SpreadsheetBench, 912 instances)
+
+| Metric                 | ks-xlsx-parser | docling 2.93 |
+|------------------------|----------------|--------------|
+| Parse success          | 99.945%        | not run at scale |
+| Recall@1 (text-match)  | 0.580          | 0.579 |
+| Recall@3 (text-match)  | 0.697          | 0.670 |
+| Recall@5 (text-match)  | **0.704**      | 0.686 |
+| Recall@5 (geometric)   | 0.369          | 0.000 (no A1 anchors) |
+| Mean parse time        | 251 ms         | 265 ms |
+
+Recall@5 = 0.704 means **~30% of questions miss** with k=5. To reach 0.90
+we need to roughly cut the miss rate to a third. A single recall number
+hides which lever to pull, which is why this branch ships failure
+bucketing (`scripts/eval_retrieval.py --emit-failures` +
+`scripts/triage_recall.py`).
+
+## The diagnosis framework — why the bucket histogram is the answer
+
+Every recall@5 miss falls into one of these buckets. The fix is
+completely different per bucket, and only one or two will dominate. The
+job of the investigator is to read the histogram FIRST, then commit to
+fixing the biggest one.
+
+| Bucket | What it means | Where to look |
+|---|---|---|
+| `answer_absent_from_chunks` | Answer value is in NO chunk. Cell was dropped or garbled. | `parsers/cell_parser.py`, `rendering/text_renderer.py::_cell_render_value` |
+| `present_but_ranked_low`    | A chunk DOES contain the answer but ranked >5. Chunk is too large/heterogeneous; the embedding is diluted. | `chunking/chunker.py` (no token cap), `analysis/table_assembler.py` (over-merging) |
+| `wrong_sheet`               | Answer sheet was never chunked. Sheet enumeration missed it. | `parsers/workbook_parser.py` sheet loop |
+| `geometric_no_overlap`      | No chunk's A1 range overlaps ground truth. Range bookkeeping drifts during merge/split. | `annotation/block_splitter.py`, `analysis/pattern_splitter.py` |
+| `no_chunks` / `parse_error` | Upstream parser failure. | The parse exception — fix the crash. |
+
+## A priori hypotheses (to be confirmed by the next benchmark run)
+
+### H1 — `present_but_ranked_low` is the biggest bucket
+
+There is no per-chunk token cap in `chunking/chunker.py` (`CHARS_PER_TOKEN`
+is only used to *report* `token_count`, never to split). On
+SpreadsheetBench many input files are single-sheet ledgers where the
+block-assembler collapses the whole sheet into one chunk. The
+sentence-transformer query embedding then has to compete against ~2k
+tokens of mostly irrelevant text; the relevant ~5 tokens get washed out.
+
+If H1 is right, the histogram will show `present_but_ranked_low` ≫ the
+others, and recall@1 (0.580) will be much worse than recall@5 (0.704)
+— exactly what we observe (Δ = 12.4 pp, vs typical Δ ≈ 5–6 pp when
+chunks are right-sized).
+
+**Fix**: hard cap chunks at ~512 tokens by row-splitting tables and add
+a "row group" sub-chunk for tall tables. This is a 1–2 day surgical
+change in `chunking/chunker.py`.
+
+### H2 — `answer_absent_from_chunks` dominates the geometric gap
+
+`parsers/workbook_parser.py` loads both `data_only=False` (formula
+expressions) and `data_only=True` (computed values). But what flows into
+`render_text` is whichever `_cell_render_value` picks. If the cell is a
+formula like `=SUM(B2:B10)`, `display_value` may be the *expression*
+when the workbook was saved without cached values (LibreOffice and some
+generated files do this). Those answer cells become unfindable by text
+match even though the data IS in the spreadsheet.
+
+**Diagnostic**: count failure rows where every `top_chunks[*].text`
+matches the formula expression pattern (`=`, function name) but not the
+expected numeric value. The bucket emits the top-8 chunks for inspection.
+
+**Fix**: when the cached value is missing for a formula cell, evaluate it
+with our own formula engine (`formula/formula_parser.py` already exists)
+or use python-calamine's value-only pass as the source of truth for
+render text — never the formula source.
+
+### H3 — `geometric_no_overlap` is high because block ranges over-extend
+
+Geometric recall@5 = 0.369 means **only ~37% of the time** does the
+chunk a parser surfaces actually cover the ground-truth answer cell —
+even when the text match works. The block-detection pipeline merges
+sparse blocks (`analysis/light_block_detector.py`) and groups by
+similarity (`analysis/table_grouper.py`). Each merge widens the
+top-left/bottom-right anchors. If the anchors are widened past the
+sheet's true content, downstream citation overlays in ks-backend will
+highlight whitespace, and the geometric metric registers the chunk as
+"not overlapping" because its claimed range is so large it's not useful.
+
+**Fix**: after every merge/split, clip `cell_range` to the tight bounding
+box of the cells that actually contributed text. Add an invariant test
+that `block.cell_range` ⊆ `bounding_box(block.cells)`.
+
+## How to confirm — the next benchmark run
+
+1. `make corpus-download` (one-time, ~hundreds of MB).
+2. `make bench-track` — runs the full benchmark, appends to
+   `tests/benchmarks/reports/history.jsonl`, prints the bucket triage.
+3. Read the histogram. Pick the biggest bucket. Open 3–5 example
+   failures with `python scripts/triage_recall.py <reports-dir>
+   --bucket <name> --examples 5`. Each row shows:
+   * the natural-language question
+   * the ground-truth answer cell + values
+   * the top-8 ranked chunks we produced (sheet, A1 range, text snippet)
+   * whether each chunk contains the answer
+4. Pattern-match across 5 examples — what's the common parser behaviour?
+   That tells you the fix.
+5. Implement, re-run `make bench-track`. The script prints the delta
+   row-over-row so improvement is visible immediately.
+
+## How to use the Docker image (CI + reproducibility)
+
+```bash
+# Build once
+docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
+
+# Quick smoke (60 instances, < 2 min)
+docker run --rm -e BENCH_SAMPLE=60 ks-xlsx-parser-bench
+
+# Full corpus, persist reports + corpus cache
+docker run --rm \
+  -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
+  -v "$PWD/data:/app/data" \
+  ks-xlsx-parser-bench
+```
+
+The `Benchmark` GitHub workflow:
+* Runs a 60-instance smoke on every PR that touches `src/` or the
+  benchmark scripts.
+* Runs the full 912-instance corpus weekly (Monday 06:00 UTC) and on
+  manual dispatch.
+* Uploads `tests/benchmarks/reports/*` as a build artifact and posts the
+  recall summary to the job step summary.
+
+## Goal & cadence
+
+Target: **text recall@5 ≥ 0.90** by end of the current quarter.
+
+Track in `tests/benchmarks/reports/history.jsonl` (commit-over-commit
+row append). Refuse merges that drop recall@5 by ≥ 2 pp on the sample
+run (planned gate; today the PR job is reporting-only).
+
+## See also
+
+* `scripts/eval_retrieval.py` — the benchmark itself.
+* `scripts/triage_recall.py` — bucket histogram + example dump.
+* `scripts/append_bench_history.py` — history.jsonl row writer.
+* `.claude/skills/recall-failure-triage/SKILL.md` — agent guide.
+* `Dockerfile.bench` — reproducible benchmark image.
diff --git a/examples/demo.py b/examples/demo.py
index dbf6118..de59151 100644
--- a/examples/demo.py
+++ b/examples/demo.py
@@ -12,8 +12,8 @@
 # Add src to path for development
 sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
 
-from xlsx_parser.pipeline import parse_workbook
-from xlsx_parser.utils.logging_config import configure_logging
+from ks_xlsx_parser.pipeline import parse_workbook
+from ks_xlsx_parser.utils.logging_config import configure_logging
 
 EXAMPLES_DIR = Path(__file__).parent / "fixtures"
 
@@ -136,7 +136,7 @@ def demo_engineering_calcs():
     print(f"  Named Ranges: {[nr.name for nr in wb.named_ranges]}")
 
     # Show dependency chain for Design Moment (C15)
-    from xlsx_parser.models import CellCoord
+    from ks_xlsx_parser.models import CellCoord
     upstream = wb.dependency_graph.get_upstream(
         "Beam Design", CellCoord(row=15, col=3), max_depth=3
     )
diff --git a/examples/generate_examples.py b/examples/generate_examples.py
index 8d25e01..245caa2 100644
--- a/examples/generate_examples.py
+++ b/examples/generate_examples.py
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Generate example Excel workbooks for demonstrating the xlsx_parser.
+Generate example Excel workbooks for demonstrating the ks_xlsx_parser.
 
 Creates several representative workbooks in the examples/ folder
 that showcase the parser's capabilities across different Excel features.
diff --git a/examples/stress_test/stress_test_runner.py b/examples/stress_test/stress_test_runner.py
index 2551765..98128fa 100644
--- a/examples/stress_test/stress_test_runner.py
+++ b/examples/stress_test/stress_test_runner.py
@@ -18,7 +18,7 @@
 PROJECT_ROOT = Path(__file__).parent.parent.parent
 sys.path.insert(0, str(PROJECT_ROOT / "src"))
 
-from xlsx_parser.pipeline import parse_workbook
+from ks_xlsx_parser.pipeline import parse_workbook
 
 STRESS_DIR = Path(__file__).parent
 
diff --git a/pyproject.toml b/pyproject.toml
index a425853..bbaa8ac 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -46,7 +46,7 @@ Repository = "https://github.com/knowledgestack/ks-xlsx-parser"
 Documentation = "https://github.com/knowledgestack/ks-xlsx-parser#readme"
 
 [project.scripts]
-xlsx-parser-api = "api:main"
+xlsx-parser-api = "ks_xlsx_parser.api:main"
 
 [project.optional-dependencies]
 api = [
@@ -64,6 +64,12 @@ dev = [
     "ruff>=0.6.0",
     "mypy>=1.0",
 ]
+# Retrieval-recall benchmark (scripts/eval_retrieval.py). Heavy — only the
+# benchmark Docker image and `make bench-retrieval` need these.
+bench = [
+    "sentence-transformers>=2.2.0",
+    "numpy>=1.24.0",
+]
 
 [tool.pytest.ini_options]
 testpaths = ["tests"]
@@ -80,6 +86,10 @@ addopts = "-m 'not corpus'"
 
 [tool.setuptools.packages.find]
 where = ["src"]
+include = ["ks_xlsx_parser*"]
+
+[tool.setuptools.package-data]
+ks_xlsx_parser = ["py.typed"]
 
 [tool.ruff]
 line-length = 110
diff --git a/scripts/append_bench_history.py b/scripts/append_bench_history.py
new file mode 100755
index 0000000..01def1f
--- /dev/null
+++ b/scripts/append_bench_history.py
@@ -0,0 +1,101 @@
+#!/usr/bin/env python3
+"""Append the latest retrieval-benchmark run to a commit-over-commit history.
+
+``eval_retrieval.py`` writes a timestamped ``summary.json`` per run. This
+script picks the most recent one, flattens the headline metrics, tags it
+with the current git commit, and appends one JSON line to
+``tests/benchmarks/reports/history.jsonl``.
+
+That history file is what makes "is recall improving over time?" answerable
+— plot it, diff it in CI, or just ``tail`` it. Goal: text recall@5 > 0.90.
+
+Usage:
+    python scripts/append_bench_history.py
+    python scripts/append_bench_history.py --reports-dir tests/benchmarks/reports/retrieval
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import subprocess
+from datetime import UTC, datetime
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+HISTORY = ROOT / "tests" / "benchmarks" / "reports" / "history.jsonl"
+
+
+def git_commit() -> str:
+    try:
+        return subprocess.check_output(
+            ["git", "rev-parse", "--short", "HEAD"], cwd=ROOT, text=True
+        ).strip()
+    except Exception:
+        return "unknown"
+
+
+def latest_summary(reports_dir: Path) -> Path:
+    summaries = sorted(reports_dir.glob("*/summary.json"))
+    if not summaries:
+        raise SystemExit(
+            f"no summary.json under {reports_dir} — run `make bench-retrieval` first"
+        )
+    return summaries[-1]
+
+
+def main(argv: list[str] | None = None) -> int:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--reports-dir", type=Path,
+                    default=ROOT / "tests" / "benchmarks" / "reports" / "retrieval")
+    ap.add_argument("--parser", default="ks-xlsx-parser",
+                    help="which parser's metrics to record")
+    args = ap.parse_args(argv)
+
+    summary_path = latest_summary(args.reports_dir)
+    summary = json.loads(summary_path.read_text())
+    metrics = summary.get(args.parser)
+    if metrics is None:
+        raise SystemExit(
+            f"parser {args.parser!r} not in {summary_path}; "
+            f"have: {list(summary)}"
+        )
+
+    row = {
+        "timestamp": datetime.now(UTC).isoformat(),
+        "commit": git_commit(),
+        "parser": args.parser,
+        "run": summary_path.parent.name,
+        "instances": metrics.get("instances"),
+        "recall_text@1": metrics.get("recall_text@1"),
+        "recall_text@3": metrics.get("recall_text@3"),
+        "recall_text@5": metrics.get("recall_text@5"),
+        "recall_geometric@5": metrics.get("recall_geometric@5"),
+        "table_fragmentation_rate": metrics.get("table_fragmentation_rate"),
+        "mean_parse_ms": metrics.get("mean_parse_ms"),
+        "errors": metrics.get("errors"),
+        "failure_buckets": metrics.get("failure_buckets"),
+    }
+
+    HISTORY.parent.mkdir(parents=True, exist_ok=True)
+    with HISTORY.open("a") as f:
+        f.write(json.dumps(row, separators=(",", ":")) + "\n")
+
+    print(f"appended to {HISTORY.relative_to(ROOT)}:")
+    print(f"  commit {row['commit']}  recall_text@5={row['recall_text@5']}  "
+          f"recall_text@1={row['recall_text@1']}")
+
+    # Show the trend if there's history to compare against.
+    rows = [json.loads(ln) for ln in HISTORY.read_text().splitlines() if ln.strip()]
+    if len(rows) >= 2:
+        prev, cur = rows[-2], rows[-1]
+        for k in ("recall_text@5", "recall_text@1"):
+            p, c = prev.get(k), cur.get(k)
+            if isinstance(p, int | float) and isinstance(c, int | float):
+                delta = c - p
+                arrow = "▲" if delta > 0 else ("▼" if delta < 0 else "—")
+                print(f"  {k}: {p:.4f} → {c:.4f}  {arrow} {delta:+.4f}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/eval_retrieval.py b/scripts/eval_retrieval.py
index 64c1743..3d4cdff 100644
--- a/scripts/eval_retrieval.py
+++ b/scripts/eval_retrieval.py
@@ -34,13 +34,16 @@
 import signal
 import sys
 import time
+from collections.abc import Iterable
 from dataclasses import dataclass, field
 from pathlib import Path
-from typing import Any, Iterable
+from typing import Any
 
 REPO_ROOT = Path(__file__).resolve().parent.parent
+# Keep ``import scripts.X`` style imports working when invoked as
+# ``python scripts/eval_retrieval.py``. We no longer need ``src`` on the
+# path — ks_xlsx_parser is a properly-installed package now.
 sys.path.insert(0, str(REPO_ROOT))
-sys.path.insert(0, str(REPO_ROOT / "src"))
 
 
 def _normalize_value_for_match(s: str) -> set[str]:
@@ -254,7 +257,7 @@ def parse_position_spec(
 
 
 def extract_chunks_ks(path: Path) -> list[Chunk]:
-    from pipeline import parse_workbook
+    from ks_xlsx_parser.pipeline import parse_workbook
 
     result = parse_workbook(path=str(path))
     out: list[Chunk] = []
@@ -442,6 +445,62 @@ class InstanceResult:
     extra: dict[str, Any] = field(default_factory=dict)
 
 
+# ────────────────────────────────────────────────────────── failure triage
+#
+# recall@5 sitting at ~0.70 means ~30% of questions miss. A single recall
+# number can't tell you WHY — and the fix is completely different per cause:
+#
+#   * answer_absent_from_chunks → the answer value is in NO chunk at all.
+#     The parser dropped/garbled the cell. Fix the EXTRACTION pipeline.
+#   * present_but_ranked_low    → a chunk DOES contain the answer, but the
+#     embedding model ranked it past k. Fix CHUNKING (smaller/cleaner
+#     chunks rank better) or the embedding step — NOT the parser.
+#   * wrong_sheet               → answer is on a sheet we produced no chunk
+#     for, or mis-attributed the sheet. Fix sheet enumeration.
+#   * geometric_no_overlap      → no chunk's A1 range overlaps ground truth
+#     even though text may match. Fix RANGE bookkeeping (top_left /
+#     bottom_right anchors drift during merge/split).
+#   * no_chunks / parse_error   → upstream parser failure.
+#
+# Splitting the miss population into these buckets turns "recall is low"
+# into a ranked, actionable worklist. This is the single most useful thing
+# for an agent iterating on recall — see scripts/triage_recall.py.
+
+FAILURE_BUCKETS = (
+    "answer_absent_from_chunks",
+    "present_but_ranked_low",
+    "wrong_sheet",
+    "geometric_no_overlap",
+    "no_chunks",
+    "parse_error",
+    "unparseable_ground_truth",
+)
+
+
+def classify_text_failure(rec: dict[str, Any]) -> str | None:
+    """Bucket a single result record for the TEXT-match recall@5 metric.
+
+    Returns None if the instance was a recall@5 hit (rank ≤ 5) or was not
+    scoreable (no ground-truth values to match against).
+    """
+    if rec.get("error"):
+        return "no_chunks" if rec.get("n_chunks", 0) == 0 else "parse_error"
+    if not rec.get("had_answer_values", True):
+        return None  # not scoreable for text-match — exclude, don't penalise
+    if rec.get("n_chunks", 0) == 0:
+        return "no_chunks"
+    rank = rec.get("rank_of_text_match")
+    if rank is not None and rank <= 5:
+        return None  # hit
+    if rank is None:
+        # Not in any chunk. Distinguish "value never extracted" from
+        # "value extracted but on a sheet we never chunked".
+        if rec.get("answer_on_unchunked_sheet"):
+            return "wrong_sheet"
+        return "answer_absent_from_chunks"
+    return "present_but_ranked_low"  # rank > 5
+
+
 def score_instance(
     *,
     parser_name: str,
@@ -452,8 +511,10 @@ def score_instance(
     answer_position: str,
     default_sheet: str | None,
     answer_cell_values: list[str],
+    answer_regions: list[tuple[str | None, tuple[int, int, int, int]]] | None = None,
     model,
     per_parser_timeout_s: float = 60.0,
+    emit_failures: bool = False,
 ) -> InstanceResult:
     import numpy as np
 
@@ -547,6 +608,38 @@ def score_instance(
                 rank_text = r
                 break
 
+    # Did the parser produce ANY chunk on the sheet(s) the answer lives on?
+    # If not, a text miss is a sheet-enumeration bug, not a cell-drop bug.
+    answer_on_unchunked_sheet = False
+    if answer_regions:
+        gt_sheets = {s for s, _ in answer_regions if s}
+        chunk_sheets = {c.sheet for c in chunks if c.sheet}
+        if gt_sheets and chunk_sheets and not (gt_sheets & chunk_sheets):
+            answer_on_unchunked_sheet = True
+
+    extra: dict[str, Any] = {
+        "answer_on_unchunked_sheet": answer_on_unchunked_sheet,
+        "had_answer_values": bool(answer_cell_values),
+    }
+    if emit_failures:
+        # Dump the top-8 ranked chunks so a human/agent can eyeball WHY the
+        # answer was missed without re-running the (expensive) benchmark.
+        top: list[dict[str, Any]] = []
+        for r, idx in enumerate(ranking[:8], start=1):
+            c = chunks[idx]
+            top.append({
+                "rank": r,
+                "sheet": c.sheet,
+                "range": (
+                    f"{c.top_left}:{c.bottom_right}"
+                    if c.top_left else None
+                ),
+                "contains_answer": _matches_chunk_text(answer_cell_values, c.text or ""),
+                "text": (c.text or "")[:280],
+            })
+        extra["top_chunks"] = top
+        extra["answer_values"] = answer_cell_values[:20]
+
     return InstanceResult(
         instance_id=inst_id,
         parser=parser_name,
@@ -558,6 +651,7 @@ def score_instance(
         chunks_overlapping_data=len(overlap_idxs),
         rank_of_first_overlap=rank_overlap,
         rank_of_text_match=rank_text,
+        extra=extra,
     )
 
 
@@ -650,6 +744,21 @@ def _recall_at(k: int, key: str) -> float:
 
         parse_times = [r.parse_ms for r in recs if not r.error]
 
+        # Failure-bucket histogram for the text-match recall@5 metric.
+        # Only counts instances that HAD ground-truth values to match
+        # against (others aren't scoreable). See classify_text_failure.
+        buckets = dict.fromkeys(FAILURE_BUCKETS, 0)
+        for r in recs:
+            bucket = classify_text_failure({
+                "error": r.error,
+                "n_chunks": r.n_chunks,
+                "rank_of_text_match": r.rank_of_text_match,
+                "had_answer_values": r.extra.get("had_answer_values", True),
+                "answer_on_unchunked_sheet": r.extra.get("answer_on_unchunked_sheet"),
+            })
+            if bucket is not None:
+                buckets[bucket] += 1
+
         summary[parser] = {
             "instances": total,
             "ok": ok,
@@ -667,6 +776,9 @@ def _recall_at(k: int, key: str) -> float:
             if parse_times else None,
             "p50_parse_ms": round(sorted(parse_times)[len(parse_times) // 2], 2)
             if parse_times else None,
+            # Why the text-match recall@5 misses happen, bucketed. The
+            # biggest bucket is the highest-leverage thing to fix next.
+            "failure_buckets": buckets,
         }
 
     return summary
@@ -701,6 +813,10 @@ def main(argv: list[str] | None = None) -> int:
     parser.add_argument("--test-case", type=int, default=1,
                         help="Which of the (typically 3) test cases per instance "
                              "to score on. We use one to keep eval costs bounded.")
+    parser.add_argument("--emit-failures", action="store_true",
+                        help="Also write failures.ndjson — one row per "
+                             "recall@5 miss with the top-8 ranked chunks and "
+                             "ground-truth values, for failure triage.")
     parser.add_argument("--per-parser-timeout", type=float, default=60.0,
                         help="Wall-clock seconds before a parser is "
                              "considered hung on a single file (docling can "
@@ -736,6 +852,7 @@ def main(argv: list[str] | None = None) -> int:
     ndjson_path = out_dir / "results.ndjson"
 
     results: list[InstanceResult] = []
+    failure_rows: list[dict[str, Any]] = []
     n = len(instances) * len(parser_fns)
     done = 0
 
@@ -786,10 +903,20 @@ def main(argv: list[str] | None = None) -> int:
                     answer_position=answer_pos,
                     default_sheet=default_sheet,
                     answer_cell_values=answer_values,
+                    answer_regions=answer_regions,
                     model=model,
                     per_parser_timeout_s=args.per_parser_timeout,
+                    emit_failures=args.emit_failures,
                 )
                 results.append(res)
+                bucket = classify_text_failure({
+                    "error": res.error,
+                    "n_chunks": res.n_chunks,
+                    "rank_of_text_match": res.rank_of_text_match,
+                    "had_answer_values": res.extra.get("had_answer_values", True),
+                    "answer_on_unchunked_sheet": res.extra.get(
+                        "answer_on_unchunked_sheet"),
+                })
                 f.write(json.dumps({
                     "instance_id": res.instance_id,
                     "parser": res.parser,
@@ -801,8 +928,22 @@ def main(argv: list[str] | None = None) -> int:
                     "chunks_overlapping_data": res.chunks_overlapping_data,
                     "rank_of_first_overlap": res.rank_of_first_overlap,
                     "rank_of_text_match": res.rank_of_text_match,
+                    "failure_bucket": bucket,
                     "error": res.error,
                 }, separators=(",", ":")) + "\n")
+                if args.emit_failures and bucket is not None:
+                    failure_rows.append({
+                        "instance_id": res.instance_id,
+                        "parser": res.parser,
+                        "failure_bucket": bucket,
+                        "instruction": instr,
+                        "answer_position": answer_pos,
+                        "answer_values": res.extra.get("answer_values", []),
+                        "rank_of_text_match": res.rank_of_text_match,
+                        "n_chunks": res.n_chunks,
+                        "top_chunks": res.extra.get("top_chunks", []),
+                        "error": res.error,
+                    })
                 done += 1
                 if done % 10 == 0:
                     sys.stderr.write(f"\r[{done}/{n}] ")
@@ -810,6 +951,13 @@ def main(argv: list[str] | None = None) -> int:
 
     sys.stderr.write(f"\nWrote {ndjson_path}\n")
 
+    if args.emit_failures:
+        fail_path = out_dir / "failures.ndjson"
+        with fail_path.open("w") as ff:
+            for row in failure_rows:
+                ff.write(json.dumps(row, separators=(",", ":")) + "\n")
+        sys.stderr.write(f"Wrote {fail_path} ({len(failure_rows)} failure rows)\n")
+
     summary = aggregate(results)
     summary_path = out_dir / "summary.json"
     summary_path.write_text(json.dumps(summary, indent=2))
@@ -853,6 +1001,20 @@ def main(argv: list[str] | None = None) -> int:
                     "(sheet, range) per chunk — docling does not, so its geometric "
                     "recall is structurally 0.")
     md_lines.append("")
+    md_lines.append("## Failure buckets (text-match recall@5 misses)")
+    md_lines.append("")
+    md_lines.append("Why each miss happened — the biggest bucket is the highest-")
+    md_lines.append("leverage fix. `answer_absent_from_chunks` → fix extraction; ")
+    md_lines.append("`present_but_ranked_low` → fix chunking/embedding.")
+    md_lines.append("")
+    md_lines.append("| Bucket | " + " | ".join(parsers) + " |")
+    md_lines.append("|---|" + "|".join(["---"] * len(parsers)) + "|")
+    for b in FAILURE_BUCKETS:
+        row = [b]
+        for p in parsers:
+            row.append(str(summary[p].get("failure_buckets", {}).get(b, 0)))
+        md_lines.append("| " + " | ".join(row) + " |")
+    md_lines.append("")
     md_lines.append("**Text-match** = the answer cell's actual string value appears "
                     "as a substring of the chunk's text. Parser-agnostic; this is "
                     "the apples-to-apples retrieval comparison.")
diff --git a/scripts/run_bench.sh b/scripts/run_bench.sh
new file mode 100755
index 0000000..05a77fb
--- /dev/null
+++ b/scripts/run_bench.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Entrypoint for the benchmark Docker image (Dockerfile.bench).
+#
+# Ensures the SpreadsheetBench corpus is present, runs the retrieval-recall
+# benchmark for ks-xlsx-parser, appends the result to history.jsonl, and
+# prints a failure-bucket triage so accuracy can be tracked over time.
+#
+# Env vars:
+#   BENCH_SAMPLE   parse only N random instances (0 / unset = full 912)
+#   BENCH_PARSERS  comma list passed to eval_retrieval.py (default: ks)
+#   BENCH_TIMEOUT  per-file parser timeout in seconds (default: 120)
+set -euo pipefail
+
+cd "$(dirname "$0")/.."
+
+CORPUS_DIR="data/corpora/spreadsheetbench"
+SAMPLE="${BENCH_SAMPLE:-0}"
+PARSERS="${BENCH_PARSERS:-ks}"
+TIMEOUT="${BENCH_TIMEOUT:-120}"
+
+if [ ! -d "$CORPUS_DIR" ]; then
+  echo "→ Downloading SpreadsheetBench corpus ..."
+  mkdir -p "$CORPUS_DIR"
+  curl -L --fail --retry 3 --connect-timeout 20 \
+    -o /tmp/sb.tar.gz \
+    "https://raw.githubusercontent.com/RUCKBReasoning/SpreadsheetBench/main/data/spreadsheetbench_912_v0.1.tar.gz"
+  tar -xzf /tmp/sb.tar.gz -C "$CORPUS_DIR"
+  rm -f /tmp/sb.tar.gz
+fi
+
+# eval_retrieval.py expects the dataset.json + spreadsheet/ dir. Find it.
+CORPUS_ARG="$CORPUS_DIR"
+if [ -d "$CORPUS_DIR/all_data_912_v0.1" ]; then
+  CORPUS_ARG="$CORPUS_DIR/all_data_912_v0.1"
+fi
+
+SAMPLE_ARG=()
+if [ "$SAMPLE" != "0" ]; then
+  SAMPLE_ARG=(--sample "$SAMPLE")
+  echo "→ Sampling $SAMPLE instances"
+fi
+
+echo "→ Running retrieval benchmark (parsers=$PARSERS) ..."
+python scripts/eval_retrieval.py \
+  --corpus "$CORPUS_ARG" \
+  --parsers "$PARSERS" \
+  --emit-failures \
+  --per-parser-timeout "$TIMEOUT" \
+  --out tests/benchmarks/reports/retrieval \
+  "${SAMPLE_ARG[@]}"
+
+echo "→ Appending to history.jsonl ..."
+python scripts/append_bench_history.py
+
+echo
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval
diff --git a/scripts/run_enterprise_metrics.py b/scripts/run_enterprise_metrics.py
index 1d6821a..6366fbd 100644
--- a/scripts/run_enterprise_metrics.py
+++ b/scripts/run_enterprise_metrics.py
@@ -7,9 +7,8 @@
 
 import json
 
-from xlsx_parser import parse_workbook
-
 from scripts.generate_enterprise_fixtures import generate_all
+from xlsx_parser import parse_workbook
 
 
 class EnterpriseScorecard:
diff --git a/scripts/track_corpus_metrics.py b/scripts/track_corpus_metrics.py
index 779995b..a0cb244 100644
--- a/scripts/track_corpus_metrics.py
+++ b/scripts/track_corpus_metrics.py
@@ -6,7 +6,6 @@
 from datetime import datetime
 from pathlib import Path
 
-
 ROOT = Path(__file__).resolve().parent.parent
 METRICS_DIR = ROOT / "metrics" / "corpus"
 
diff --git a/scripts/triage_recall.py b/scripts/triage_recall.py
new file mode 100755
index 0000000..41a766a
--- /dev/null
+++ b/scripts/triage_recall.py
@@ -0,0 +1,114 @@
+#!/usr/bin/env python3
+"""Triage retrieval-recall failures into a ranked, actionable worklist.
+
+Reads a ``failures.ndjson`` produced by
+``eval_retrieval.py --emit-failures`` and prints:
+
+  1. A histogram of failure buckets (biggest = highest leverage).
+  2. For each bucket, a few concrete example failures with the
+     ground-truth answer and the top ranked chunks the parser produced.
+
+The point: turn "recall@5 is 0.70" into "N misses are answer_absent_from_chunks
+— the parser is dropping cells; here are 5 examples to reproduce."
+
+Usage:
+    python scripts/triage_recall.py tests/benchmarks/reports/retrieval/<stamp>/failures.ndjson
+    python scripts/triage_recall.py <dir>           # finds the latest run
+    python scripts/triage_recall.py <file> --bucket answer_absent_from_chunks --examples 10
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from collections import Counter
+from pathlib import Path
+
+# Ordered worst→least-actionable. Used to sort the worklist.
+BUCKET_GUIDANCE = {
+    "parse_error": "Parser raised on the file. Reproduce with parse_workbook(path) and fix the crash.",
+    "no_chunks": "Parser produced zero chunks. Sheet/region detection collapsed — check chunking/segmenter.py.",
+    "answer_absent_from_chunks": "Answer value is in NO chunk. EXTRACTION gap — the cell was dropped or garbled. Highest leverage.",
+    "wrong_sheet": "Answer sheet was never chunked. Sheet enumeration bug — check workbook_parser.py sheet loop.",
+    "geometric_no_overlap": "No chunk's A1 range overlaps ground truth. RANGE bookkeeping drift in merge/split.",
+    "present_but_ranked_low": "A chunk DOES contain the answer but ranked >5. Not a parser bug — fix chunk granularity/embedding.",
+    "unparseable_ground_truth": "Could not parse the dataset's answer_position. Benchmark-harness issue, not the parser.",
+}
+
+
+def find_failures_file(arg: Path) -> Path:
+    if arg.is_file():
+        return arg
+    if arg.is_dir():
+        direct = arg / "failures.ndjson"
+        if direct.exists():
+            return direct
+        runs = sorted(p for p in arg.glob("*/failures.ndjson"))
+        if runs:
+            return runs[-1]
+    sys.exit(f"ERROR: no failures.ndjson found at {arg}")
+
+
+def load(path: Path) -> list[dict]:
+    rows = []
+    for line in path.read_text().splitlines():
+        line = line.strip()
+        if line:
+            rows.append(json.loads(line))
+    return rows
+
+
+def main(argv: list[str] | None = None) -> int:
+    ap = argparse.ArgumentParser(description=__doc__,
+                                 formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("path", type=Path,
+                    help="failures.ndjson, or a dir containing/parenting one")
+    ap.add_argument("--bucket", help="only show examples for this bucket")
+    ap.add_argument("--examples", type=int, default=3,
+                    help="example failures to print per bucket (default 3)")
+    args = ap.parse_args(argv)
+
+    path = find_failures_file(args.path)
+    rows = load(path)
+    if not rows:
+        print(f"{path} is empty — no failures recorded (or run was a no-op).")
+        return 0
+
+    print(f"# Recall failure triage — {path}")
+    print(f"# {len(rows)} total failure rows\n")
+
+    hist = Counter(r.get("failure_bucket") for r in rows)
+    print("## Bucket histogram (ranked by count — fix the top one first)\n")
+    width = max(len(b or "?") for b in hist)
+    for bucket, count in hist.most_common():
+        pct = 100.0 * count / len(rows)
+        print(f"  {str(bucket):<{width}}  {count:5d}  ({pct:5.1f}%)")
+        print(f"  {'':<{width}}         → {BUCKET_GUIDANCE.get(bucket, '')}")
+    print()
+
+    buckets = [args.bucket] if args.bucket else [b for b, _ in hist.most_common()]
+    for bucket in buckets:
+        examples = [r for r in rows if r.get("failure_bucket") == bucket][:args.examples]
+        if not examples:
+            continue
+        print(f"## Examples — {bucket}\n")
+        for r in examples:
+            print(f"  instance {r.get('instance_id')}  ({r.get('parser')})")
+            print(f"    Q: {(r.get('instruction') or '')[:160]}")
+            print(f"    answer_position: {r.get('answer_position')}")
+            print(f"    ground-truth values: {r.get('answer_values')}")
+            print(f"    n_chunks={r.get('n_chunks')} "
+                  f"rank_of_text_match={r.get('rank_of_text_match')}")
+            if r.get("error"):
+                print(f"    ERROR: {r['error']}")
+            for c in (r.get("top_chunks") or [])[:4]:
+                mark = "✓" if c.get("contains_answer") else " "
+                snippet = (c.get("text") or "").replace("\n", " ")[:120]
+                print(f"    [{mark}] #{c.get('rank')} {c.get('sheet')} "
+                      f"{c.get('range')}  {snippet}")
+            print()
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/scripts/verify_wheel.py b/scripts/verify_wheel.py
new file mode 100755
index 0000000..26801f3
--- /dev/null
+++ b/scripts/verify_wheel.py
@@ -0,0 +1,81 @@
+#!/usr/bin/env python3
+"""Verify the built wheel is installable and importable in a clean venv.
+
+This is the regression guard for the v0.2.0 packaging bug: ``pipeline.py``
+and ``api.py`` were top-level modules under ``src/`` and ``setuptools``
+``packages.find`` only picks up *packages*, so they were silently dropped
+from the wheel — ``from ks_xlsx_parser.pipeline import ...`` failed for
+every installed user. The flat layout also leaked 13 generic top-level
+packages (``models``, ``utils``, ``parsers`` ...) into ``site-packages``.
+
+Run after ``python -m build --wheel``. Exits non-zero on any problem so it
+can gate CI and ``make wheel-check``.
+"""
+from __future__ import annotations
+
+import subprocess
+import sys
+import tempfile
+import venv
+import zipfile
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+
+# Imports a real downstream consumer relies on. Keep in sync with the
+# public surface in ks_xlsx_parser/__init__.py.
+SMOKE_IMPORTS = [
+    "from ks_xlsx_parser import parse_workbook, ParseResult",
+    "from ks_xlsx_parser.pipeline import parse_workbook",
+    "from ks_xlsx_parser.verification import StageVerifier",
+    "from ks_xlsx_parser.analysis.table_assembler import TableAssembler",
+    "from ks_xlsx_parser.models.workbook import WorkbookDTO",
+]
+
+
+def find_wheel() -> Path:
+    wheels = sorted((ROOT / "dist").glob("*.whl"))
+    if not wheels:
+        sys.exit("ERROR: no wheel in dist/ — run `python -m build --wheel` first")
+    return wheels[-1]
+
+
+def check_wheel_contents(wheel: Path) -> None:
+    """Fail loudly if the wheel pollutes the global namespace or drops modules."""
+    with zipfile.ZipFile(wheel) as zf:
+        names = zf.namelist()
+        top_level = next((n for n in names if n.endswith("top_level.txt")), None)
+        if top_level:
+            packages = zf.read(top_level).decode().split()
+            if packages != ["ks_xlsx_parser"]:
+                sys.exit(
+                    f"ERROR: wheel exposes top-level packages {packages}; "
+                    "expected only ['ks_xlsx_parser']. The flat src/ layout leaked."
+                )
+    required = ["ks_xlsx_parser/pipeline.py", "ks_xlsx_parser/api.py"]
+    for req in required:
+        if not any(n == req for n in names):
+            sys.exit(f"ERROR: wheel is missing {req}")
+    print(f"wheel contents OK ({len(names)} entries, top-level: ks_xlsx_parser)")
+
+
+def check_install_and_import(wheel: Path) -> None:
+    with tempfile.TemporaryDirectory() as tmp:
+        env_dir = Path(tmp) / "venv"
+        venv.create(env_dir, with_pip=True)
+        py = env_dir / ("Scripts" if sys.platform == "win32" else "bin") / "python"
+        subprocess.run([str(py), "-m", "pip", "install", "-q", str(wheel)], check=True)
+        script = "; ".join(SMOKE_IMPORTS) + "; print('clean-venv import OK')"
+        subprocess.run([str(py), "-c", script], check=True)
+
+
+def main() -> None:
+    wheel = find_wheel()
+    print(f"verifying {wheel.name}")
+    check_wheel_contents(wheel)
+    check_install_and_import(wheel)
+    print("wheel verification PASSED")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/src/ks_xlsx_parser/__init__.py b/src/ks_xlsx_parser/__init__.py
index 184452b..cabcd96 100644
--- a/src/ks_xlsx_parser/__init__.py
+++ b/src/ks_xlsx_parser/__init__.py
@@ -1,25 +1,27 @@
 """
 ks_xlsx_parser — public API entry point for the ks-xlsx-parser package.
 
-The source tree is flat: top-level modules at ``src/`` (``pipeline``,
+All modules live under the ``ks_xlsx_parser`` package (``pipeline``,
 ``models``, ``analysis``, ``verification``, etc.). This module re-exports
 the stable, user-facing names so callers can do::
 
     from ks_xlsx_parser import parse_workbook, ParseResult
 
-regardless of internal layout.
+and submodules remain importable directly::
+
+    from ks_xlsx_parser.pipeline import parse_workbook
 """
 from __future__ import annotations
 
 __version__ = "0.2.0"
 
-from pipeline import (  # noqa: F401
+from .pipeline import (  # noqa: F401
     ParseResult,
     compare_workbooks,
     export_importer,
     parse_workbook,
 )
-from verification import (  # noqa: F401
+from .verification import (  # noqa: F401
     ExcellentStage,
     StageVerifier,
     VerificationReport,
diff --git a/src/analysis/__init__.py b/src/ks_xlsx_parser/analysis/__init__.py
similarity index 100%
rename from src/analysis/__init__.py
rename to src/ks_xlsx_parser/analysis/__init__.py
diff --git a/src/analysis/light_block_detector.py b/src/ks_xlsx_parser/analysis/light_block_detector.py
similarity index 96%
rename from src/analysis/light_block_detector.py
rename to src/ks_xlsx_parser/analysis/light_block_detector.py
index 9816cd0..7fc2d11 100644
--- a/src/analysis/light_block_detector.py
+++ b/src/ks_xlsx_parser/analysis/light_block_detector.py
@@ -10,9 +10,9 @@
 
 import logging
 
-from models.block import BlockDTO
-from models.common import BlockType, CellCoord, CellRange
-from models.table_structure import TableRegion, TableRegionRole, TableStructure
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange
+from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/llm_artifacts.py b/src/ks_xlsx_parser/analysis/llm_artifacts.py
similarity index 98%
rename from src/analysis/llm_artifacts.py
rename to src/ks_xlsx_parser/analysis/llm_artifacts.py
index 01d0777..2122c45 100644
--- a/src/analysis/llm_artifacts.py
+++ b/src/ks_xlsx_parser/analysis/llm_artifacts.py
@@ -11,24 +11,22 @@
 from __future__ import annotations
 
 import logging
-import re
 from collections import Counter
 
 from pydantic import Field
 
-from models.block import BlockDTO
-from models.chart import ChartDTO
-from models.common import (
-    BlockType,
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.chart import ChartDTO
+from ks_xlsx_parser.models.common import (
     CellCoord,
     SheetPurpose,
     StableModel,
     compute_hash,
 )
-from models.dependency import DependencyGraph
-from models.sheet import SheetDTO
-from models.table import TableDTO
-from models.workbook import KpiDTO, SheetSummaryDTO
+from ks_xlsx_parser.models.dependency import DependencyGraph
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table import TableDTO
+from ks_xlsx_parser.models.workbook import KpiDTO, SheetSummaryDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/pattern_splitter.py b/src/ks_xlsx_parser/analysis/pattern_splitter.py
similarity index 96%
rename from src/analysis/pattern_splitter.py
rename to src/ks_xlsx_parser/analysis/pattern_splitter.py
index 92ac82f..e065801 100644
--- a/src/analysis/pattern_splitter.py
+++ b/src/ks_xlsx_parser/analysis/pattern_splitter.py
@@ -10,12 +10,11 @@
 
 import logging
 import math
-from collections import defaultdict
 
-from models.block import BlockDTO
-from models.common import BlockType, CellAnnotation, CellCoord, CellRange
-from models.sheet import SheetDTO
-from models.table_structure import TableRegion, TableRegionRole, TableStructure
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellAnnotation, CellCoord, CellRange
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/table_assembler.py b/src/ks_xlsx_parser/analysis/table_assembler.py
similarity index 96%
rename from src/analysis/table_assembler.py
rename to src/ks_xlsx_parser/analysis/table_assembler.py
index 279603d..28bc3a3 100644
--- a/src/analysis/table_assembler.py
+++ b/src/ks_xlsx_parser/analysis/table_assembler.py
@@ -9,10 +9,10 @@
 
 import logging
 
-from models.block import BlockDTO
-from models.common import BlockType, CellCoord, CellRange, compute_hash
-from models.sheet import SheetDTO
-from models.table_structure import TableRegion, TableRegionRole, TableStructure
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/table_grouper.py b/src/ks_xlsx_parser/analysis/table_grouper.py
similarity index 97%
rename from src/analysis/table_grouper.py
rename to src/ks_xlsx_parser/analysis/table_grouper.py
index 94fe862..597da57 100644
--- a/src/analysis/table_grouper.py
+++ b/src/ks_xlsx_parser/analysis/table_grouper.py
@@ -11,10 +11,10 @@
 import logging
 from collections import defaultdict
 
-from models.block import BlockDTO
-from models.common import BlockType, CellCoord, CellRange, compute_hash
-from models.sheet import SheetDTO
-from models.table_structure import TableStructure
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table_structure import TableStructure
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/template_extractor.py b/src/ks_xlsx_parser/analysis/template_extractor.py
similarity index 95%
rename from src/analysis/template_extractor.py
rename to src/ks_xlsx_parser/analysis/template_extractor.py
index 27a8025..54dedd9 100644
--- a/src/analysis/template_extractor.py
+++ b/src/ks_xlsx_parser/analysis/template_extractor.py
@@ -10,15 +10,15 @@
 
 import logging
 
-from models.common import CellAnnotation, CellCoord
-from models.sheet import SheetDTO
-from models.template import (
+from ks_xlsx_parser.models.common import CellAnnotation, CellCoord
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.template import (
     DOFType,
     TemplateCellSpec,
     TemplateConstraint,
     TemplateNode,
 )
-from models.tree import TreeNode, TreeNodeType
+from ks_xlsx_parser.models.tree import TreeNode, TreeNodeType
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/tree_builder.py b/src/ks_xlsx_parser/analysis/tree_builder.py
similarity index 96%
rename from src/analysis/tree_builder.py
rename to src/ks_xlsx_parser/analysis/tree_builder.py
index 0e5192f..22a2ed1 100644
--- a/src/analysis/tree_builder.py
+++ b/src/ks_xlsx_parser/analysis/tree_builder.py
@@ -10,11 +10,10 @@
 
 import logging
 
-from models.block import BlockDTO
-from models.common import CellCoord, CellRange, compute_hash
-from models.sheet import SheetDTO
-from models.table_structure import TableStructure
-from models.tree import TreeNode, TreeNodeType
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table_structure import TableStructure
+from ks_xlsx_parser.models.tree import TreeNode, TreeNodeType
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/annotation/__init__.py b/src/ks_xlsx_parser/annotation/__init__.py
similarity index 100%
rename from src/annotation/__init__.py
rename to src/ks_xlsx_parser/annotation/__init__.py
diff --git a/src/annotation/block_splitter.py b/src/ks_xlsx_parser/annotation/block_splitter.py
similarity index 97%
rename from src/annotation/block_splitter.py
rename to src/ks_xlsx_parser/annotation/block_splitter.py
index 0ec332f..2d45209 100644
--- a/src/annotation/block_splitter.py
+++ b/src/ks_xlsx_parser/annotation/block_splitter.py
@@ -8,11 +8,10 @@
 from __future__ import annotations
 
 import logging
-from collections import defaultdict
 
-from models.block import BlockDTO
-from models.common import BlockType, CellAnnotation, CellCoord, CellRange
-from models.sheet import SheetDTO
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellAnnotation, CellCoord, CellRange
+from ks_xlsx_parser.models.sheet import SheetDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/annotation/cell_annotator.py b/src/ks_xlsx_parser/annotation/cell_annotator.py
similarity index 98%
rename from src/annotation/cell_annotator.py
rename to src/ks_xlsx_parser/annotation/cell_annotator.py
index 5f80bdc..af82bac 100644
--- a/src/annotation/cell_annotator.py
+++ b/src/ks_xlsx_parser/annotation/cell_annotator.py
@@ -11,9 +11,9 @@
 import logging
 from collections import defaultdict
 
-from models.cell import CellDTO
-from models.common import CellAnnotation
-from models.sheet import SheetDTO
+from ks_xlsx_parser.models.cell import CellDTO
+from ks_xlsx_parser.models.common import CellAnnotation
+from ks_xlsx_parser.models.sheet import SheetDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/api.py b/src/ks_xlsx_parser/api.py
similarity index 97%
rename from src/api.py
rename to src/ks_xlsx_parser/api.py
index 36469c5..82a367d 100644
--- a/src/api.py
+++ b/src/ks_xlsx_parser/api.py
@@ -2,7 +2,7 @@
 FastAPI application for uploading and parsing Excel files.
 
 Run with:
-    uvicorn xlsx_parser.api:app --reload --port 8080
+    uvicorn ks_xlsx_parser.api:app --reload --port 8080
 
 Or use the xlsx-parser-api script (runs on port 8080 by default):
     xlsx-parser-api
@@ -12,16 +12,11 @@
 
 from __future__ import annotations
 
-import json
-import time
-import traceback
-from pathlib import Path
-
 from fastapi import FastAPI, File, UploadFile
 from fastapi.responses import HTMLResponse, JSONResponse
 
-from pipeline import parse_workbook
-from verification import StageVerifier
+from ks_xlsx_parser.pipeline import parse_workbook
+from ks_xlsx_parser.verification import StageVerifier
 
 app = FastAPI(
     title="ks-xlsx-parser API",
diff --git a/src/charts/__init__.py b/src/ks_xlsx_parser/charts/__init__.py
similarity index 100%
rename from src/charts/__init__.py
rename to src/ks_xlsx_parser/charts/__init__.py
diff --git a/src/charts/chart_extractor.py b/src/ks_xlsx_parser/charts/chart_extractor.py
similarity index 99%
rename from src/charts/chart_extractor.py
rename to src/ks_xlsx_parser/charts/chart_extractor.py
index 4c4cdca..017feb9 100644
--- a/src/charts/chart_extractor.py
+++ b/src/ks_xlsx_parser/charts/chart_extractor.py
@@ -21,8 +21,8 @@
 from pathlib import Path
 from xml.etree import ElementTree as ET
 
-from models.chart import ChartAnchor, ChartAxis, ChartDTO, ChartSeries
-from models.common import ChartType
+from ks_xlsx_parser.models.chart import ChartAnchor, ChartAxis, ChartDTO, ChartSeries
+from ks_xlsx_parser.models.common import ChartType
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/chunking/__init__.py b/src/ks_xlsx_parser/chunking/__init__.py
similarity index 100%
rename from src/chunking/__init__.py
rename to src/ks_xlsx_parser/chunking/__init__.py
diff --git a/src/chunking/chunker.py b/src/ks_xlsx_parser/chunking/chunker.py
similarity index 95%
rename from src/chunking/chunker.py
rename to src/ks_xlsx_parser/chunking/chunker.py
index bb4d4f9..7057e78 100644
--- a/src/chunking/chunker.py
+++ b/src/ks_xlsx_parser/chunking/chunker.py
@@ -13,14 +13,12 @@
 
 import logging
 
-from models.block import BlockDTO, ChunkDTO, DependencySummary
-from models.common import CellCoord, EdgeType
-from models.dependency import DependencyGraph
-from models.sheet import SheetDTO
-from models.table import TableDTO
-from models.workbook import NamedRangeDTO, WorkbookDTO
-from rendering.html_renderer import HtmlRenderer
-from rendering.text_renderer import TextRenderer
+from ks_xlsx_parser.models.block import BlockDTO, ChunkDTO, DependencySummary
+from ks_xlsx_parser.models.common import CellCoord, EdgeType
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.workbook import WorkbookDTO
+from ks_xlsx_parser.rendering.html_renderer import HtmlRenderer
+from ks_xlsx_parser.rendering.text_renderer import TextRenderer
 
 logger = logging.getLogger(__name__)
 
@@ -179,7 +177,7 @@ def _chart_to_chunk(self, chart) -> ChunkDTO:
         top_left = "A1"
         bottom_right = "A1"
         if chart.anchor:
-            from models.common import col_number_to_letter
+            from ks_xlsx_parser.models.common import col_number_to_letter
             top_left = f"{col_number_to_letter(chart.anchor.from_col + 1)}{chart.anchor.from_row + 1}"
             if chart.anchor.to_col is not None and chart.anchor.to_row is not None:
                 bottom_right = f"{col_number_to_letter(chart.anchor.to_col + 1)}{chart.anchor.to_row + 1}"
diff --git a/src/chunking/segmenter.py b/src/ks_xlsx_parser/chunking/segmenter.py
similarity index 98%
rename from src/chunking/segmenter.py
rename to src/ks_xlsx_parser/chunking/segmenter.py
index 1b7898a..bde7b2a 100644
--- a/src/chunking/segmenter.py
+++ b/src/ks_xlsx_parser/chunking/segmenter.py
@@ -22,10 +22,10 @@
 
 import logging
 
-from models.block import BlockDTO
-from models.common import BlockType, CellCoord, CellRange
-from models.sheet import SheetDTO
-from models.table import TableDTO
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table import TableDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/comparison/__init__.py b/src/ks_xlsx_parser/comparison/__init__.py
similarity index 100%
rename from src/comparison/__init__.py
rename to src/ks_xlsx_parser/comparison/__init__.py
diff --git a/src/comparison/template_comparator.py b/src/ks_xlsx_parser/comparison/template_comparator.py
similarity index 99%
rename from src/comparison/template_comparator.py
rename to src/ks_xlsx_parser/comparison/template_comparator.py
index e7e043e..0c4fd64 100644
--- a/src/comparison/template_comparator.py
+++ b/src/ks_xlsx_parser/comparison/template_comparator.py
@@ -11,8 +11,8 @@
 import logging
 from collections import defaultdict
 
-from models.common import CellCoord, compute_hash
-from models.template import (
+from ks_xlsx_parser.models.common import CellCoord
+from ks_xlsx_parser.models.template import (
     DOFConflict,
     DOFType,
     GeneralizedTemplate,
diff --git a/src/export/__init__.py b/src/ks_xlsx_parser/export/__init__.py
similarity index 100%
rename from src/export/__init__.py
rename to src/ks_xlsx_parser/export/__init__.py
diff --git a/src/export/model_exporter.py b/src/ks_xlsx_parser/export/model_exporter.py
similarity index 98%
rename from src/export/model_exporter.py
rename to src/ks_xlsx_parser/export/model_exporter.py
index d47dc7d..0784aac 100644
--- a/src/export/model_exporter.py
+++ b/src/ks_xlsx_parser/export/model_exporter.py
@@ -12,8 +12,8 @@
 from pathlib import Path
 from typing import Any
 
-from models.common import CellCoord
-from models.template import DOFType, GeneralizedTemplate, TemplateNode
+from ks_xlsx_parser.models.common import CellCoord
+from ks_xlsx_parser.models.template import DOFType, GeneralizedTemplate
 
 logger = logging.getLogger(__name__)
 
@@ -205,7 +205,7 @@ def export_code(
             Constants: {template.total_constants} | DOFs: {template.total_dofs} | Formulas: {template.total_formulas}
             """
 
-            from xlsx_parser.export.model_exporter import SpreadsheetImporter
+            from ks_xlsx_parser.export.model_exporter import SpreadsheetImporter
             from xlsx_parser import parse_workbook
 
 
diff --git a/src/formula/__init__.py b/src/ks_xlsx_parser/formula/__init__.py
similarity index 100%
rename from src/formula/__init__.py
rename to src/ks_xlsx_parser/formula/__init__.py
diff --git a/src/formula/dependency_builder.py b/src/ks_xlsx_parser/formula/dependency_builder.py
similarity index 93%
rename from src/formula/dependency_builder.py
rename to src/ks_xlsx_parser/formula/dependency_builder.py
index 8bd85b0..19785af 100644
--- a/src/formula/dependency_builder.py
+++ b/src/ks_xlsx_parser/formula/dependency_builder.py
@@ -11,13 +11,14 @@
 import logging
 from typing import TYPE_CHECKING
 
-from models.common import CellCoord, EdgeType
-from models.dependency import DependencyEdgeDTO, DependencyGraph
+from ks_xlsx_parser.models.common import CellCoord, EdgeType
+from ks_xlsx_parser.models.dependency import DependencyEdgeDTO, DependencyGraph
+
 from .formula_parser import FormulaParser
 
 if TYPE_CHECKING:
-    from models.sheet import SheetDTO
-    from models.workbook import NamedRangeDTO
+    from ks_xlsx_parser.models.sheet import SheetDTO
+    from ks_xlsx_parser.models.workbook import NamedRangeDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/formula/formula_parser.py b/src/ks_xlsx_parser/formula/formula_parser.py
similarity index 98%
rename from src/formula/formula_parser.py
rename to src/ks_xlsx_parser/formula/formula_parser.py
index f850b91..941f7a4 100644
--- a/src/formula/formula_parser.py
+++ b/src/ks_xlsx_parser/formula/formula_parser.py
@@ -10,9 +10,9 @@
 from __future__ import annotations
 
 import re
-from dataclasses import dataclass, field
+from dataclasses import dataclass
 
-from models.common import CellCoord, CellRange, col_letter_to_number
+from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number
 
 # Optional Rust fast-path: ks_xlsx_core.scan_formula returns reference tuples
 # in the same emit order as the Python regex pipeline. When available, we
diff --git a/src/models/__init__.py b/src/ks_xlsx_parser/models/__init__.py
similarity index 94%
rename from src/models/__init__.py
rename to src/ks_xlsx_parser/models/__init__.py
index 42814ea..5c91780 100644
--- a/src/models/__init__.py
+++ b/src/ks_xlsx_parser/models/__init__.py
@@ -1,8 +1,8 @@
 """
-Data models (DTOs) for the xlsx_parser.
+Data models (DTOs) for the ks_xlsx_parser.
 
 Re-exports all model classes for convenient importing:
-    from xlsx_parser.models import WorkbookDTO, SheetDTO, CellDTO, ...
+    from ks_xlsx_parser.models import WorkbookDTO, SheetDTO, CellDTO, ...
 """
 
 from .block import BlockDTO, ChunkDTO, DependencySummary
diff --git a/src/models/block.py b/src/ks_xlsx_parser/models/block.py
similarity index 100%
rename from src/models/block.py
rename to src/ks_xlsx_parser/models/block.py
diff --git a/src/models/cell.py b/src/ks_xlsx_parser/models/cell.py
similarity index 100%
rename from src/models/cell.py
rename to src/ks_xlsx_parser/models/cell.py
diff --git a/src/models/chart.py b/src/ks_xlsx_parser/models/chart.py
similarity index 100%
rename from src/models/chart.py
rename to src/ks_xlsx_parser/models/chart.py
diff --git a/src/models/common.py b/src/ks_xlsx_parser/models/common.py
similarity index 100%
rename from src/models/common.py
rename to src/ks_xlsx_parser/models/common.py
diff --git a/src/models/dependency.py b/src/ks_xlsx_parser/models/dependency.py
similarity index 100%
rename from src/models/dependency.py
rename to src/ks_xlsx_parser/models/dependency.py
diff --git a/src/models/shape.py b/src/ks_xlsx_parser/models/shape.py
similarity index 97%
rename from src/models/shape.py
rename to src/ks_xlsx_parser/models/shape.py
index 178ebaf..a0f94f9 100644
--- a/src/models/shape.py
+++ b/src/ks_xlsx_parser/models/shape.py
@@ -9,7 +9,7 @@
 
 from pydantic import Field
 
-from .common import CellRange, StableModel, compute_hash
+from .common import StableModel, compute_hash
 
 
 class ShapeAnchor(StableModel):
diff --git a/src/models/sheet.py b/src/ks_xlsx_parser/models/sheet.py
similarity index 99%
rename from src/models/sheet.py
rename to src/ks_xlsx_parser/models/sheet.py
index bfc2758..68c8cb1 100644
--- a/src/models/sheet.py
+++ b/src/ks_xlsx_parser/models/sheet.py
@@ -8,8 +8,6 @@
 
 from __future__ import annotations
 
-from typing import Any
-
 from pydantic import Field
 
 from .cell import CellDTO
diff --git a/src/models/table.py b/src/ks_xlsx_parser/models/table.py
similarity index 100%
rename from src/models/table.py
rename to src/ks_xlsx_parser/models/table.py
diff --git a/src/models/table_structure.py b/src/ks_xlsx_parser/models/table_structure.py
similarity index 100%
rename from src/models/table_structure.py
rename to src/ks_xlsx_parser/models/table_structure.py
diff --git a/src/models/template.py b/src/ks_xlsx_parser/models/template.py
similarity index 100%
rename from src/models/template.py
rename to src/ks_xlsx_parser/models/template.py
diff --git a/src/models/tree.py b/src/ks_xlsx_parser/models/tree.py
similarity index 100%
rename from src/models/tree.py
rename to src/ks_xlsx_parser/models/tree.py
diff --git a/src/models/workbook.py b/src/ks_xlsx_parser/models/workbook.py
similarity index 99%
rename from src/models/workbook.py
rename to src/ks_xlsx_parser/models/workbook.py
index 772c934..39a428b 100644
--- a/src/models/workbook.py
+++ b/src/ks_xlsx_parser/models/workbook.py
@@ -32,7 +32,7 @@
 from .sheet import SheetDTO
 from .table import TableDTO
 from .table_structure import TableStructure
-from .template import GeneralizedTemplate, TemplateNode
+from .template import TemplateNode
 from .tree import TreeNode
 
 
diff --git a/src/parsers/__init__.py b/src/ks_xlsx_parser/parsers/__init__.py
similarity index 100%
rename from src/parsers/__init__.py
rename to src/ks_xlsx_parser/parsers/__init__.py
diff --git a/src/parsers/calamine_core.py b/src/ks_xlsx_parser/parsers/calamine_core.py
similarity index 99%
rename from src/parsers/calamine_core.py
rename to src/ks_xlsx_parser/parsers/calamine_core.py
index 830df5d..e6ac959 100644
--- a/src/parsers/calamine_core.py
+++ b/src/ks_xlsx_parser/parsers/calamine_core.py
@@ -17,7 +17,6 @@
 """
 from __future__ import annotations
 
-import io
 import logging
 import os
 import tempfile
diff --git a/src/parsers/cell_parser.py b/src/ks_xlsx_parser/parsers/cell_parser.py
similarity index 99%
rename from src/parsers/cell_parser.py
rename to src/ks_xlsx_parser/parsers/cell_parser.py
index cc22092..bf1857c 100644
--- a/src/parsers/cell_parser.py
+++ b/src/ks_xlsx_parser/parsers/cell_parser.py
@@ -17,7 +17,7 @@
 from openpyxl.cell.cell import Cell as OpenpyxlCell
 from openpyxl.cell.cell import MergedCell as OpenpyxlMergedCell
 
-from models.cell import (
+from ks_xlsx_parser.models.cell import (
     AlignmentStyle,
     BorderSide,
     BorderStyle,
@@ -26,7 +26,7 @@
     FillStyle,
     FontStyle,
 )
-from models.common import CellCoord, RichTextRun
+from ks_xlsx_parser.models.common import CellCoord, RichTextRun
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/parsers/sheet_parser.py b/src/ks_xlsx_parser/parsers/sheet_parser.py
similarity index 98%
rename from src/parsers/sheet_parser.py
rename to src/ks_xlsx_parser/parsers/sheet_parser.py
index 5883a92..d49a28d 100644
--- a/src/parsers/sheet_parser.py
+++ b/src/ks_xlsx_parser/parsers/sheet_parser.py
@@ -18,21 +18,22 @@
 from lxml import etree
 from openpyxl.worksheet.worksheet import Worksheet as OpenpyxlWorksheet
 
-from models.cell import CellDTO
-from models.common import CellCoord, CellRange, ParseError, Severity, col_letter_to_number
+from ks_xlsx_parser.models.cell import CellDTO
+from ks_xlsx_parser.models.common import CellCoord, CellRange, ParseError, Severity, col_letter_to_number
 
 # OOXML namespace for spreadsheetML
 _OOXML_NS = {"s": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}
 
 # Regex to split an A1-style cell reference like "B1" into ("B", "1")
 _CELL_REF_RE = re.compile(r"^([A-Z]+)(\d+)$")
-from models.sheet import (
+from ks_xlsx_parser.models.sheet import (
     ConditionalFormatRule,
     DataValidationRule,
     MergedRegion,
     SheetDTO,
     SheetProperties,
 )
+
 from .cell_parser import CellParser
 
 logger = logging.getLogger(__name__)
@@ -282,7 +283,7 @@ def _extract_dimensions(self, sheet: SheetDTO) -> None:
 
         for col_letter, cd in self._ws.column_dimensions.items():
             if cd.width:
-                from models.common import col_letter_to_number
+                from ks_xlsx_parser.models.common import col_letter_to_number
                 col_num = col_letter_to_number(col_letter)
                 sheet.col_widths[col_num] = cd.width * 7.5  # chars to points approx
 
@@ -294,7 +295,7 @@ def _extract_hidden(self, sheet: SheetDTO) -> None:
 
         for col_letter, cd in self._ws.column_dimensions.items():
             if cd.hidden:
-                from models.common import col_letter_to_number
+                from ks_xlsx_parser.models.common import col_letter_to_number
                 col_num = col_letter_to_number(col_letter)
                 sheet.hidden_cols.add(col_num)
 
@@ -347,11 +348,10 @@ def _extract_autofilter(self, sheet: SheetDTO) -> None:
         if not af or not af.ref:
             return
         try:
-            from models.common import FilterCriteria
+            from ks_xlsx_parser.models.common import FilterCriteria
             ref = str(af.ref)
             parts = ref.split(":")
             if len(parts) == 2:
-                from models.common import col_letter_to_number as c2n
                 start_match = _CELL_REF_RE.match(parts[0])
                 end_match = _CELL_REF_RE.match(parts[1])
                 if start_match and end_match:
diff --git a/src/parsers/table_parser.py b/src/ks_xlsx_parser/parsers/table_parser.py
similarity index 97%
rename from src/parsers/table_parser.py
rename to src/ks_xlsx_parser/parsers/table_parser.py
index d2eb2eb..bdc0d55 100644
--- a/src/parsers/table_parser.py
+++ b/src/ks_xlsx_parser/parsers/table_parser.py
@@ -12,8 +12,8 @@
 
 from openpyxl.worksheet.worksheet import Worksheet as OpenpyxlWorksheet
 
-from models.common import CellCoord, CellRange, col_letter_to_number
-from models.table import TableColumn, TableDTO
+from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number
+from ks_xlsx_parser.models.table import TableColumn, TableDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/parsers/workbook_parser.py b/src/ks_xlsx_parser/parsers/workbook_parser.py
similarity index 97%
rename from src/parsers/workbook_parser.py
rename to src/ks_xlsx_parser/parsers/workbook_parser.py
index bee0452..57e0c93 100644
--- a/src/parsers/workbook_parser.py
+++ b/src/ks_xlsx_parser/parsers/workbook_parser.py
@@ -11,7 +11,6 @@
 import io
 import logging
 import time
-from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
 from pathlib import Path
 
 import xxhash
@@ -21,7 +20,7 @@
 # both values AND formulas in one read; python-calamine is kept as a fallback
 # for environments where the Rust crate hasn't been built. Both are bypassed
 # cleanly if absent — correctness always falls back to openpyxl.
-from parsers import calamine_core as _calamine_core
+from ks_xlsx_parser.parsers import calamine_core as _calamine_core
 
 try:
     from python_calamine import CalamineWorkbook
@@ -30,14 +29,14 @@
     CalamineWorkbook = None  # type: ignore[assignment]
     _HAS_PYCALAMINE = False
 
-from models.common import ParseError, Severity
-from models.common import CalculationMode, DateSystem
-from models.workbook import (
+from ks_xlsx_parser.models.common import CalculationMode, DateSystem, ParseError, Severity
+from ks_xlsx_parser.models.workbook import (
     ExternalLink,
     NamedRangeDTO,
     WorkbookDTO,
     WorkbookProperties,
 )
+
 from .sheet_parser import SheetParser
 from .table_parser import TableParser
 
@@ -234,7 +233,7 @@ def parse(self) -> WorkbookDTO:
 
         # Extract charts via OOXML parsing
         try:
-            from charts.chart_extractor import ChartExtractor
+            from ks_xlsx_parser.charts.chart_extractor import ChartExtractor
             chart_extractor = ChartExtractor(
                 self._path or self._content, wb_formula.sheetnames
             )
@@ -253,7 +252,7 @@ def parse(self) -> WorkbookDTO:
         # accounts for ~25% of the full-mode wall clock).
         if self._build_dep_graph:
             try:
-                from formula.dependency_builder import DependencyBuilder
+                from ks_xlsx_parser.formula.dependency_builder import DependencyBuilder
                 dep_builder = DependencyBuilder(result.sheets, result.named_ranges)
                 result.dependency_graph = dep_builder.build()
             except Exception as e:
diff --git a/src/pipeline.py b/src/ks_xlsx_parser/pipeline.py
similarity index 90%
rename from src/pipeline.py
rename to src/ks_xlsx_parser/pipeline.py
index 9b902c5..1451074 100644
--- a/src/pipeline.py
+++ b/src/ks_xlsx_parser/pipeline.py
@@ -1,7 +1,7 @@
 """
 End-to-end parsing pipeline.
 
-This is the primary public API for the xlsx_parser. It orchestrates
+This is the primary public API for the ks_xlsx_parser. It orchestrates
 the full 11-stage Excellent Algorithm:
   Stage 0: Sheet Chunking (adaptive gaps + style boundaries)
   Stage 1: Cell Annotation (feature-based scorer)
@@ -15,7 +15,7 @@
   Render + Store
 
 Usage:
-    from xlsx_parser.pipeline import parse_workbook
+    from ks_xlsx_parser.pipeline import parse_workbook
     result = parse_workbook("path/to/workbook.xlsx")
     print(result.workbook.total_cells)
     for chunk in result.chunks:
@@ -32,25 +32,25 @@
 
 ParseMode = Literal["full", "fast"]
 
-from analysis.light_block_detector import LightBlockDetector
-from models.common import CellCoord, CellRange, col_letter_to_number
-from analysis.pattern_splitter import PatternSplitter
-from analysis.table_assembler import TableAssembler
-from analysis.table_grouper import TableGrouper
-from analysis.template_extractor import TemplateExtractor
-from analysis.tree_builder import TreeBuilder
-from annotation.block_splitter import BlockSplitter
-from annotation.cell_annotator import CellAnnotator
-from chunking.chunker import ChunkBuilder
-from comparison.template_comparator import TemplateComparator
-from export.model_exporter import ModelExporter
-from models.block import ChunkDTO
-from models.table_structure import TableStructure
-from models.template import GeneralizedTemplate, TemplateNode
-from models.tree import TreeNode
-from models.workbook import WorkbookDTO
-from parsers.workbook_parser import WorkbookParser
-from storage.serializer import WorkbookSerializer
+from ks_xlsx_parser.analysis.light_block_detector import LightBlockDetector
+from ks_xlsx_parser.analysis.pattern_splitter import PatternSplitter
+from ks_xlsx_parser.analysis.table_assembler import TableAssembler
+from ks_xlsx_parser.analysis.table_grouper import TableGrouper
+from ks_xlsx_parser.analysis.template_extractor import TemplateExtractor
+from ks_xlsx_parser.analysis.tree_builder import TreeBuilder
+from ks_xlsx_parser.annotation.block_splitter import BlockSplitter
+from ks_xlsx_parser.annotation.cell_annotator import CellAnnotator
+from ks_xlsx_parser.chunking.chunker import ChunkBuilder
+from ks_xlsx_parser.comparison.template_comparator import TemplateComparator
+from ks_xlsx_parser.export.model_exporter import ModelExporter
+from ks_xlsx_parser.models.block import ChunkDTO
+from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number
+from ks_xlsx_parser.models.table_structure import TableStructure
+from ks_xlsx_parser.models.template import GeneralizedTemplate, TemplateNode
+from ks_xlsx_parser.models.tree import TreeNode
+from ks_xlsx_parser.models.workbook import WorkbookDTO
+from ks_xlsx_parser.parsers.workbook_parser import WorkbookParser
+from ks_xlsx_parser.storage.serializer import WorkbookSerializer
 
 logger = logging.getLogger(__name__)
 
@@ -261,7 +261,7 @@ def parse_workbook(
         annotator.annotate()
 
         # Stage 0+2: Segment then split by annotation
-        from chunking.segmenter import LayoutSegmenter
+        from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
         sheet_tables = [t for t in workbook.tables if t.sheet_name == sheet.sheet_name]
         sheet_named = [
             nr.name for nr in workbook.named_ranges
diff --git a/src/py.typed b/src/ks_xlsx_parser/py.typed
similarity index 100%
rename from src/py.typed
rename to src/ks_xlsx_parser/py.typed
diff --git a/src/rendering/__init__.py b/src/ks_xlsx_parser/rendering/__init__.py
similarity index 100%
rename from src/rendering/__init__.py
rename to src/ks_xlsx_parser/rendering/__init__.py
diff --git a/src/rendering/html_renderer.py b/src/ks_xlsx_parser/rendering/html_renderer.py
similarity index 96%
rename from src/rendering/html_renderer.py
rename to src/ks_xlsx_parser/rendering/html_renderer.py
index c7c1a1c..50a8b21 100644
--- a/src/rendering/html_renderer.py
+++ b/src/ks_xlsx_parser/rendering/html_renderer.py
@@ -14,10 +14,10 @@
 import html
 import logging
 
-from models.block import BlockDTO
-from models.cell import CellDTO, CellStyle
-from models.common import BlockType, CellCoord, col_number_to_letter
-from models.sheet import SheetDTO
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.cell import CellStyle
+from ks_xlsx_parser.models.common import BlockType, col_number_to_letter
+from ks_xlsx_parser.models.sheet import SheetDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/rendering/text_renderer.py b/src/ks_xlsx_parser/rendering/text_renderer.py
similarity index 97%
rename from src/rendering/text_renderer.py
rename to src/ks_xlsx_parser/rendering/text_renderer.py
index 76b9b76..e2c15f4 100644
--- a/src/rendering/text_renderer.py
+++ b/src/ks_xlsx_parser/rendering/text_renderer.py
@@ -11,10 +11,10 @@
 import datetime as _dt
 import logging
 
-from models.block import BlockDTO
-from models.common import BlockType, col_number_to_letter
-from models.chart import ChartDTO
-from models.sheet import SheetDTO
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.chart import ChartDTO
+from ks_xlsx_parser.models.common import BlockType, col_number_to_letter
+from ks_xlsx_parser.models.sheet import SheetDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/storage/__init__.py b/src/ks_xlsx_parser/storage/__init__.py
similarity index 100%
rename from src/storage/__init__.py
rename to src/ks_xlsx_parser/storage/__init__.py
diff --git a/src/storage/serializer.py b/src/ks_xlsx_parser/storage/serializer.py
similarity index 98%
rename from src/storage/serializer.py
rename to src/ks_xlsx_parser/storage/serializer.py
index 5ced8f9..66add02 100644
--- a/src/storage/serializer.py
+++ b/src/ks_xlsx_parser/storage/serializer.py
@@ -15,8 +15,8 @@
 import logging
 from typing import Any
 
-from models.block import ChunkDTO
-from models.workbook import WorkbookDTO
+from ks_xlsx_parser.models.block import ChunkDTO
+from ks_xlsx_parser.models.workbook import WorkbookDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/utils/__init__.py b/src/ks_xlsx_parser/utils/__init__.py
similarity index 100%
rename from src/utils/__init__.py
rename to src/ks_xlsx_parser/utils/__init__.py
diff --git a/src/utils/logging_config.py b/src/ks_xlsx_parser/utils/logging_config.py
similarity index 97%
rename from src/utils/logging_config.py
rename to src/ks_xlsx_parser/utils/logging_config.py
index ca23108..94ceb54 100644
--- a/src/utils/logging_config.py
+++ b/src/ks_xlsx_parser/utils/logging_config.py
@@ -1,5 +1,5 @@
 """
-Structured logging configuration for the xlsx_parser.
+Structured logging configuration for the ks_xlsx_parser.
 
 Provides a JSON-structured logging format with workbook/sheet/block context
 fields for observability and debugging in production.
diff --git a/src/verification/__init__.py b/src/ks_xlsx_parser/verification/__init__.py
similarity index 100%
rename from src/verification/__init__.py
rename to src/ks_xlsx_parser/verification/__init__.py
diff --git a/src/verification/stage_verifier.py b/src/ks_xlsx_parser/verification/stage_verifier.py
similarity index 97%
rename from src/verification/stage_verifier.py
rename to src/ks_xlsx_parser/verification/stage_verifier.py
index 3fb315b..88fb4f8 100644
--- a/src/verification/stage_verifier.py
+++ b/src/ks_xlsx_parser/verification/stage_verifier.py
@@ -16,23 +16,21 @@
 
 from pydantic import Field
 
-from analysis.light_block_detector import LightBlockDetector
-from analysis.pattern_splitter import PatternSplitter
-from analysis.table_assembler import TableAssembler
-from analysis.table_grouper import TableGrouper
-from analysis.template_extractor import TemplateExtractor
-from analysis.tree_builder import TreeBuilder
-from annotation.block_splitter import BlockSplitter
-from annotation.cell_annotator import CellAnnotator
-from chunking.segmenter import LayoutSegmenter
-from comparison.template_comparator import TemplateComparator
-from export.model_exporter import ModelExporter
-from models.block import BlockDTO
-from models.common import BlockType, CellAnnotation, StableModel
-from models.table_structure import TableStructure
-from models.template import TemplateNode
-from models.tree import TreeNode
-from parsers.workbook_parser import WorkbookParser
+from ks_xlsx_parser.analysis.light_block_detector import LightBlockDetector
+from ks_xlsx_parser.analysis.pattern_splitter import PatternSplitter
+from ks_xlsx_parser.analysis.table_assembler import TableAssembler
+from ks_xlsx_parser.analysis.table_grouper import TableGrouper
+from ks_xlsx_parser.analysis.template_extractor import TemplateExtractor
+from ks_xlsx_parser.analysis.tree_builder import TreeBuilder
+from ks_xlsx_parser.annotation.block_splitter import BlockSplitter
+from ks_xlsx_parser.annotation.cell_annotator import CellAnnotator
+from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellAnnotation, StableModel
+from ks_xlsx_parser.models.table_structure import TableStructure
+from ks_xlsx_parser.models.template import TemplateNode
+from ks_xlsx_parser.models.tree import TreeNode
+from ks_xlsx_parser.parsers.workbook_parser import WorkbookParser
 
 logger = logging.getLogger(__name__)
 
diff --git a/tests/benchmarks/_driver.py b/tests/benchmarks/_driver.py
index a66acd5..94ee633 100644
--- a/tests/benchmarks/_driver.py
+++ b/tests/benchmarks/_driver.py
@@ -18,10 +18,8 @@
 import subprocess
 import sys
 from collections import defaultdict
-from dataclasses import asdict
 from datetime import UTC, datetime
 from pathlib import Path
-from typing import Iterable
 
 from ._runner import Runner
 from ._schema import CSV_FIELDS, BenchmarkRecord, record_to_csv_row, validate_record
@@ -344,7 +342,7 @@ def generate_drift(out_dir: Path) -> None:
                 short = Path(fname).name
                 lines.append(f"| {short} | {va} | {vb} | {va - vb:+d} |")
         if n_drift == 0:
-            lines.append(f"| _no drift_ | | | |")
+            lines.append("| _no drift_ | | | |")
         elif n_drift > 50:
             lines.append(f"| … {n_drift - 50} more rows truncated | | | |")
         lines.append("")
diff --git a/tests/benchmarks/_runner.py b/tests/benchmarks/_runner.py
index 2bdb95c..a5febc9 100644
--- a/tests/benchmarks/_runner.py
+++ b/tests/benchmarks/_runner.py
@@ -18,7 +18,7 @@
 import time
 from dataclasses import dataclass
 from pathlib import Path
-from typing import Any, Iterator
+from typing import Any
 
 HERE = Path(__file__).resolve().parent
 REPO_ROOT = HERE.parent.parent
diff --git a/tests/benchmarks/adapters/docling_adapter.py b/tests/benchmarks/adapters/docling_adapter.py
index 17d4b02..bd573f5 100644
--- a/tests/benchmarks/adapters/docling_adapter.py
+++ b/tests/benchmarks/adapters/docling_adapter.py
@@ -41,7 +41,7 @@
 sys.path.insert(0, str(_HERE.parent.parent.parent))  # repo root
 
 from tests.benchmarks._mem import peak_rss_mb  # noqa: E402
-from tests.benchmarks._schema import BenchmarkRecord, SCHEMA_VERSION  # noqa: E402
+from tests.benchmarks._schema import SCHEMA_VERSION, BenchmarkRecord  # noqa: E402
 
 PARSER_NAME = "docling"
 MAX_ERR_LEN = 500
diff --git a/tests/benchmarks/adapters/ks_adapter.py b/tests/benchmarks/adapters/ks_adapter.py
index 917df29..754b2a6 100644
--- a/tests/benchmarks/adapters/ks_adapter.py
+++ b/tests/benchmarks/adapters/ks_adapter.py
@@ -28,10 +28,10 @@
 sys.path.insert(0, str(_HERE.parent.parent.parent))  # repo root
 sys.path.insert(0, str(_HERE.parent.parent.parent / "src"))  # src layout
 
-from tests.benchmarks._mem import peak_rss_mb  # noqa: E402
-from tests.benchmarks._schema import BenchmarkRecord, SCHEMA_VERSION  # noqa: E402
 from ks_xlsx_parser import __version__ as KS_VERSION  # noqa: E402
 from ks_xlsx_parser import parse_workbook  # noqa: E402
+from tests.benchmarks._mem import peak_rss_mb  # noqa: E402
+from tests.benchmarks._schema import SCHEMA_VERSION, BenchmarkRecord  # noqa: E402
 
 MAX_ERR_LEN = 500
 
diff --git a/tests/conftest.py b/tests/conftest.py
index d422b9b..1417e30 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -8,7 +8,6 @@
 
 
 
-import os
 import tempfile
 from pathlib import Path
 
@@ -17,7 +16,6 @@
 from openpyxl.chart import BarChart, Reference
 from openpyxl.formatting.rule import CellIsRule, ColorScaleRule, IconSetRule
 from openpyxl.styles import Alignment, Border, Font, PatternFill, Side
-from openpyxl.utils import get_column_letter
 from openpyxl.worksheet.datavalidation import DataValidation
 from openpyxl.worksheet.table import Table, TableStyleInfo
 
diff --git a/tests/helpers/invariant_checker.py b/tests/helpers/invariant_checker.py
index a077486..89efd1c 100644
--- a/tests/helpers/invariant_checker.py
+++ b/tests/helpers/invariant_checker.py
@@ -9,7 +9,7 @@
 
 import re
 
-from models.common import EdgeType
+from ks_xlsx_parser.models.common import EdgeType
 
 
 def check_invariants(workbook) -> list[str]:
diff --git a/tests/test_charts.py b/tests/test_charts.py
index 021e793..802e6a6 100644
--- a/tests/test_charts.py
+++ b/tests/test_charts.py
@@ -5,11 +5,10 @@
 type, series, axes, title, and position anchor.
 """
 
-import pytest
 
-from charts.chart_extractor import ChartExtractor
-from models import ChartType
-from parsers import WorkbookParser
+from ks_xlsx_parser.charts.chart_extractor import ChartExtractor
+from ks_xlsx_parser.models import ChartType
+from ks_xlsx_parser.parsers import WorkbookParser
 
 
 class TestChartExtraction:
diff --git a/tests/test_corpus_robustness.py b/tests/test_corpus_robustness.py
index 2253eea..4263c89 100644
--- a/tests/test_corpus_robustness.py
+++ b/tests/test_corpus_robustness.py
@@ -1,5 +1,5 @@
 """
-Corpus robustness tests for the xlsx_parser.
+Corpus robustness tests for the ks_xlsx_parser.
 
 Tests the parser against large sets of real-world .xlsx files to catch
 crashes, invariant violations, and regression. Corpus tests are skipped
@@ -15,17 +15,15 @@
 
 import pytest
 
-from pipeline import parse_workbook
-from models.common import Severity
-
-from tests.helpers.invariant_checker import check_invariants
+from ks_xlsx_parser.models.common import Severity
+from ks_xlsx_parser.pipeline import parse_workbook
 from tests.helpers.corpus_downloader import (
-    download_euses_corpus,
     download_enron_corpus,
+    download_euses_corpus,
     download_github_xlsx_samples,
     get_corpus_files,
 )
-
+from tests.helpers.invariant_checker import check_invariants
 
 CORPUS_DIR = Path(__file__).parent / "fixtures" / "corpus"
 
diff --git a/tests/test_formula_handling.py b/tests/test_formula_handling.py
index b3f8342..8402667 100644
--- a/tests/test_formula_handling.py
+++ b/tests/test_formula_handling.py
@@ -11,14 +11,11 @@
 - Renders formula cells in chunk output
 """
 
-import pytest
-
-from formula.dependency_builder import DependencyBuilder
-from formula.formula_parser import FormulaParser, ParsedReference
-from models.common import BlockType, CellCoord, EdgeType
-from parsers import WorkbookParser
-from pipeline import parse_workbook
 
+from ks_xlsx_parser.formula.formula_parser import FormulaParser
+from ks_xlsx_parser.models.common import CellCoord, EdgeType
+from ks_xlsx_parser.parsers import WorkbookParser
+from ks_xlsx_parser.pipeline import parse_workbook
 
 # ---------------------------------------------------------------------------
 # Helpers
@@ -705,7 +702,7 @@ class TestFormulaInBlocks:
     def test_block_formula_count(self, simple_formulas):
         result = parse_workbook(path=simple_formulas)
         # The block should report formula_count > 0
-        from chunking.segmenter import LayoutSegmenter
+        from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
         sheet = result.workbook.sheets[0]
         tables = [t for t in result.workbook.tables if t.sheet_name == sheet.sheet_name]
         segmenter = LayoutSegmenter(sheet, tables=tables)
@@ -715,7 +712,7 @@ def test_block_formula_count(self, simple_formulas):
 
     def test_cross_sheet_block_has_formulas(self, cross_sheet_formulas):
         result = parse_workbook(path=cross_sheet_formulas)
-        from chunking.segmenter import LayoutSegmenter
+        from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
         summary = [s for s in result.workbook.sheets if s.sheet_name == "Summary"][0]
         tables = [t for t in result.workbook.tables if t.sheet_name == summary.sheet_name]
         segmenter = LayoutSegmenter(summary, tables=tables)
diff --git a/tests/test_formula_parser.py b/tests/test_formula_parser.py
index 784e171..03c3959 100644
--- a/tests/test_formula_parser.py
+++ b/tests/test_formula_parser.py
@@ -5,10 +5,9 @@
 and circular reference detection.
 """
 
-import pytest
 
-from formula.formula_parser import FormulaParser
-from models import CellCoord, DependencyGraph, DependencyEdgeDTO, EdgeType
+from ks_xlsx_parser.formula.formula_parser import FormulaParser
+from ks_xlsx_parser.models import CellCoord, DependencyEdgeDTO, DependencyGraph, EdgeType
 
 
 class TestFormulaParser:
diff --git a/tests/test_llm_artifacts.py b/tests/test_llm_artifacts.py
index c3247eb..87f6335 100644
--- a/tests/test_llm_artifacts.py
+++ b/tests/test_llm_artifacts.py
@@ -5,16 +5,15 @@
 and reading-order linearization using programmatic fixtures.
 """
 
-import pytest
 
-from analysis import (
+from ks_xlsx_parser.analysis import (
     EntityIndexBuilder,
     KpiCatalogBuilder,
     ReadingOrderLinearizer,
     SheetSummaryAnalyzer,
 )
-from models.common import SheetPurpose
-from parsers import WorkbookParser
+from ks_xlsx_parser.models.common import SheetPurpose
+from ks_xlsx_parser.parsers import WorkbookParser
 
 
 class TestSheetSummaryAnalyzer:
diff --git a/tests/test_models.py b/tests/test_models.py
index bddec00..5393eea 100644
--- a/tests/test_models.py
+++ b/tests/test_models.py
@@ -4,12 +4,12 @@
 Covers CellCoord, CellRange, hashing, and serialization.
 """
 
-from models import (
+from ks_xlsx_parser.models import (
     CellCoord,
     CellRange,
-    compute_hash,
     col_letter_to_number,
     col_number_to_letter,
+    compute_hash,
 )
 
 
diff --git a/tests/test_multi_table_layout.py b/tests/test_multi_table_layout.py
index 7681a52..cca2057 100644
--- a/tests/test_multi_table_layout.py
+++ b/tests/test_multi_table_layout.py
@@ -8,11 +8,10 @@
 
 import pytest
 
-from chunking.segmenter import LayoutSegmenter
-from models import BlockType
-from parsers import WorkbookParser
-from pipeline import parse_workbook
-
+from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
+from ks_xlsx_parser.models import BlockType
+from ks_xlsx_parser.parsers import WorkbookParser
+from ks_xlsx_parser.pipeline import parse_workbook
 
 # ---------------------------------------------------------------------------
 # Helpers
diff --git a/tests/test_parsers.py b/tests/test_parsers.py
index 850a0bb..fe1e911 100644
--- a/tests/test_parsers.py
+++ b/tests/test_parsers.py
@@ -6,9 +6,8 @@
 tables, comments, data validations, and sheet properties.
 """
 
-import pytest
 
-from parsers import WorkbookParser
+from ks_xlsx_parser.parsers import WorkbookParser
 
 
 class TestSimpleWorkbook:
diff --git a/tests/test_pipeline.py b/tests/test_pipeline.py
index 534d125..7f1ab0e 100644
--- a/tests/test_pipeline.py
+++ b/tests/test_pipeline.py
@@ -7,9 +7,7 @@
 
 import json
 
-import pytest
-
-from pipeline import parse_workbook
+from ks_xlsx_parser.pipeline import parse_workbook
 
 
 class TestEndToEndPipeline:
diff --git a/tests/test_rendering.py b/tests/test_rendering.py
index e24072a..3a757ec 100644
--- a/tests/test_rendering.py
+++ b/tests/test_rendering.py
@@ -5,12 +5,11 @@
 headers, and coordinate annotations.
 """
 
-import pytest
 
-from chunking.segmenter import LayoutSegmenter
-from parsers import WorkbookParser
-from rendering.html_renderer import HtmlRenderer
-from rendering.text_renderer import TextRenderer
+from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
+from ks_xlsx_parser.parsers import WorkbookParser
+from ks_xlsx_parser.rendering.html_renderer import HtmlRenderer
+from ks_xlsx_parser.rendering.text_renderer import TextRenderer
 
 
 class TestHtmlRendering:
@@ -112,11 +111,10 @@ def test_numeric_cells_render_raw_not_display_formatted(self):
         sci-notation fallback (``1.272000e+03``) once the ``[=]`` formula
         marker pushed the rendered string past col_width — this test
         guards against that regression."""
-        from models.sheet import SheetDTO
-        from models.cell import CellDTO
-        from models.common import CellCoord, CellRange
-        from models.block import BlockDTO
-        from models.common import BlockType
+        from ks_xlsx_parser.models.block import BlockDTO
+        from ks_xlsx_parser.models.cell import CellDTO
+        from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange
+        from ks_xlsx_parser.models.sheet import SheetDTO
 
         coord = CellCoord(row=1, col=1)
         cell = CellDTO(
diff --git a/tests/test_segmentation.py b/tests/test_segmentation.py
index 8cb3cfc..45f13b0 100644
--- a/tests/test_segmentation.py
+++ b/tests/test_segmentation.py
@@ -5,11 +5,10 @@
 assumption blocks, result blocks, and text headers.
 """
 
-import pytest
 
-from chunking.segmenter import LayoutSegmenter
-from models import BlockType
-from parsers import WorkbookParser
+from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
+from ks_xlsx_parser.models import BlockType
+from ks_xlsx_parser.parsers import WorkbookParser
 
 
 class TestSegmentation:
diff --git a/tests/test_stage_verification.py b/tests/test_stage_verification.py
index 574be13..75dc8bc 100644
--- a/tests/test_stage_verification.py
+++ b/tests/test_stage_verification.py
@@ -8,9 +8,7 @@
 
 import json
 
-import pytest
-
-from verification import (
+from ks_xlsx_parser.verification import (
     ExcellentStage,
     ImplementationStatus,
     StageResult,
@@ -18,7 +16,6 @@
     VerificationReport,
 )
 
-
 # ---------------------------------------------------------------------------
 # Model tests
 # ---------------------------------------------------------------------------
diff --git a/tests/test_structural_invariants.py b/tests/test_structural_invariants.py
index 612d11a..bbe1460 100644
--- a/tests/test_structural_invariants.py
+++ b/tests/test_structural_invariants.py
@@ -1,5 +1,5 @@
 """
-Structural invariant tests for the xlsx_parser.
+Structural invariant tests for the ks_xlsx_parser.
 
 Tests properties that must always hold for any valid parse output,
 regardless of the input file: merge structure, used range bounds,
@@ -13,11 +13,8 @@
 
 import pytest
 
-from pipeline import parse_workbook
-from models.common import EdgeType
-
-from tests.helpers.invariant_checker import check_invariants
-
+from ks_xlsx_parser.models.common import EdgeType
+from ks_xlsx_parser.pipeline import parse_workbook
 
 # ---------------------------------------------------------------------------
 # Invariant tests on programmatic fixtures

From 43e2109b249c83c510d5aa355771e91a582b3888 Mon Sep 17 00:00:00 2001
From: Arnav Goel <arnavgoel2@gmail.com>
Date: Tue, 19 May 2026 17:39:16 -0400
Subject: [PATCH 2/4] docs(bench): add benchmark-local-setup.md walkthrough
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Verified locally on a real 60-instance SpreadsheetBench sample:
  recall_text@1 = 0.550, recall_text@5 = 0.633, geometric@5 = 0.283
  failure_buckets = {answer_absent_from_chunks: 11, ...all others 0}

Producer side of --emit-failures runs end-to-end on the real corpus.
The doc captures the exact loop (install → corpus → eval → triage →
history) plus Docker parity and troubleshooting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/benchmark-local-setup.md | 235 ++++++++++++++++++++++++++++++++++
 1 file changed, 235 insertions(+)
 create mode 100644 docs/benchmark-local-setup.md

diff --git a/docs/benchmark-local-setup.md b/docs/benchmark-local-setup.md
new file mode 100644
index 0000000..ab39803
--- /dev/null
+++ b/docs/benchmark-local-setup.md
@@ -0,0 +1,235 @@
+# Running the retrieval benchmark locally
+
+This is the loop you'll use whenever you want to know "is my parser
+change actually moving recall?" — the same pipeline CI runs, just on
+your laptop. Goal: text recall@5 > 0.90 (currently ~0.70).
+
+The full SpreadsheetBench corpus is 912 instances and takes ~30–45 min
+on a Mac. For an iteration loop you'll mostly use `--sample 60` (≈ 3 min
+after the first embedding-model load).
+
+## TL;DR
+
+```bash
+# One-time
+make install-dev              # installs the parser + dev deps
+pip install -e ".[bench]"     # sentence-transformers + numpy (~500 MB)
+make corpus-download          # downloads SpreadsheetBench → data/corpora/
+                              # (91 MB tarball, 2,726 .xlsx after extract)
+
+# Each time you want to score
+python scripts/eval_retrieval.py \
+  --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \
+  --parsers ks \
+  --sample 60 \
+  --emit-failures
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval
+python scripts/append_bench_history.py
+```
+
+Output: `tests/benchmarks/reports/retrieval/<UTC stamp>/` with
+`results.ndjson`, `failures.ndjson`, `summary.json`, `summary.md`.
+
+## Step-by-step
+
+### 1. Install bench deps
+
+The benchmark uses `sentence-transformers` (≈ 500 MB with torch) for
+embeddings. They're a separate optional group so the parser package
+itself stays lean:
+
+```bash
+pip install -e ".[bench]"
+```
+
+First run also downloads the embedding model (`BAAI/bge-small-en-v1.5`,
+≈ 130 MB) into `~/.cache/huggingface/`.
+
+### 2. Download the corpus
+
+```bash
+make corpus-download
+```
+
+This is the same `scripts/download_corpora.sh` CI uses. It pulls
+SpreadsheetBench v0.1 plus a few legacy XLSX samples; only
+SpreadsheetBench matters for retrieval scoring.
+
+Layout you should end up with:
+
+```
+data/corpora/spreadsheetbench/
+  all_data_912_v0.1/
+    dataset.json                       # 912 (question, answer_position) tuples
+    spreadsheet/
+      <instance-id>/
+        1_<instance-id>_input.xlsx     # test case 1 input
+        1_<instance-id>_answer.xlsx    # test case 1 ground-truth output
+        2_..., 3_...                   # additional test cases per instance
+```
+
+`data/` is gitignored — never commit corpus files.
+
+### 3. Run the benchmark
+
+A typical iteration cycle uses a sample for speed:
+
+```bash
+python scripts/eval_retrieval.py \
+  --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \
+  --parsers ks \
+  --sample 60 \
+  --emit-failures
+```
+
+Flags worth knowing:
+
+| Flag                  | What it does                                                |
+|-----------------------|-------------------------------------------------------------|
+| `--parsers ks,docling`| Score one or both parsers. Docling is heavy; skip unless comparing. |
+| `--sample N`          | Random N-instance subset (seeded). Omit for the full 912.   |
+| `--seed 1337`         | Random seed for `--sample`. Stays stable across runs.       |
+| `--emit-failures`     | Also write `failures.ndjson` with top-8 chunks per miss.    |
+| `--test-case 1`       | Which of the (usually 3) test cases per instance to score.  |
+| `--per-parser-timeout`| Wall-clock seconds before a hung parse is killed. Default 60. |
+
+For a full run, drop `--sample` and add `--per-parser-timeout 120`:
+
+```bash
+python scripts/eval_retrieval.py \
+  --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \
+  --parsers ks \
+  --emit-failures \
+  --per-parser-timeout 120
+```
+
+`make bench-retrieval` is the same thing with the canonical defaults.
+
+### 4. Read the triage report
+
+```bash
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval
+```
+
+This auto-finds the most recent run and prints:
+
+* **Bucket histogram** ranked by count. The top bucket is the
+  highest-leverage thing to fix next.
+* **3 example failures** per bucket showing the question, the
+  ground-truth answer cell + values, and the top-8 chunks the parser
+  produced (with a ✓ next to chunks that contain the answer).
+
+Drill into one bucket:
+
+```bash
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval \
+  --bucket answer_absent_from_chunks --examples 10
+```
+
+Five buckets and what they mean:
+
+| Bucket | Root cause | Fix lives in |
+|---|---|---|
+| `answer_absent_from_chunks` | Answer value in NO chunk. Cell dropped or rendered as formula. | `parsers/cell_parser.py`, `rendering/text_renderer.py` |
+| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. Chunk too big/heterogeneous. | `chunking/chunker.py` |
+| `wrong_sheet` | Answer sheet never chunked. | `parsers/workbook_parser.py` |
+| `geometric_no_overlap` | Text matches but the chunk's A1 range doesn't overlap GT. | `annotation/block_splitter.py`, `analysis/pattern_splitter.py` |
+| `no_chunks` / `parse_error` | Upstream parser failure. | The exception. |
+
+See `docs/recall-investigation.md` for the named hypotheses behind each
+bucket and `.claude/skills/recall-failure-triage/SKILL.md` for the
+agent-driven loop.
+
+### 5. Append to history.jsonl
+
+```bash
+python scripts/append_bench_history.py
+```
+
+Appends one row per benchmark run to
+`tests/benchmarks/reports/history.jsonl` tagged with the current git
+commit, and prints the row-over-row delta on the headline metrics:
+
+```
+appended to tests/benchmarks/reports/history.jsonl:
+  commit 421783f  recall_text@5=0.704  recall_text@1=0.580
+  recall_text@5: 0.6800 → 0.7040  ▲ +0.0240
+```
+
+That's how "is recall improving?" gets answered. Goal: `recall_text@5 > 0.90`.
+
+`make bench-track` chains eval + history-append + triage in one go.
+
+## Docker path (matches CI exactly)
+
+When you want to make sure local results aren't drifting from CI:
+
+```bash
+docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
+
+# Quick sanity (60 instances, ~3 min after image load):
+docker run --rm \
+  -e BENCH_SAMPLE=60 \
+  -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
+  -v "$PWD/data:/app/data" \
+  ks-xlsx-parser-bench
+
+# Full corpus:
+docker run --rm \
+  -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
+  -v "$PWD/data:/app/data" \
+  ks-xlsx-parser-bench
+```
+
+The image pre-warms the embedding model at build time so the first
+`docker run` doesn't pay the 130 MB download.
+
+Environment knobs (also work in CI dispatch):
+
+| Env var          | Default | What it does                              |
+|------------------|---------|-------------------------------------------|
+| `BENCH_SAMPLE`   | `0`     | Sample N instances (0 = full 912)         |
+| `BENCH_PARSERS`  | `ks`    | Comma list (e.g. `ks,docling`)            |
+| `BENCH_TIMEOUT`  | `120`   | Per-file parse timeout in seconds         |
+
+## Adding a new failure bucket
+
+If you find a recall failure mode that doesn't fit any of the existing
+six buckets, add it instead of stuffing it into `answer_absent_from_chunks`:
+
+1. Append the new bucket name to `FAILURE_BUCKETS` in
+   `scripts/eval_retrieval.py`.
+2. Update `classify_text_failure` so it can return the new name. Keep
+   the predicate cheap — it runs once per scored instance.
+3. Add the bucket + a one-line root cause + fix location to the table
+   in `docs/recall-investigation.md` and the SKILL file.
+4. Re-run `make bench-track`; confirm the histogram shows the new
+   bucket and counts make sense.
+
+## Troubleshooting
+
+* `ModuleNotFoundError: No module named 'sentence_transformers'`
+  — you skipped `pip install -e ".[bench]"`.
+* `dataset.json not found in ...` — your `--corpus` is pointing at
+  `data/corpora/spreadsheetbench`, not the nested
+  `data/corpora/spreadsheetbench/all_data_912_v0.1`. The benchmark
+  expects the leaf directory that contains `dataset.json`.
+* `FileNotFoundError: 1_<id>_input.xlsx` — the corpus tarball didn't
+  fully extract. Delete `data/corpora/spreadsheetbench/` and re-run
+  `make corpus-download`.
+* `recall_text@5 = 0.0` on a sample of 5 — small samples have huge
+  variance because the benchmark seeds. Bump to `--sample 60` minimum
+  before trusting the number; use `--sample 0` (full corpus) for a
+  decision-grade comparison.
+* MPS / CUDA torch errors on first sentence-transformers import —
+  re-install with `pip install --upgrade torch torchvision`. The
+  benchmark runs fine on CPU.
+
+## See also
+
+* `scripts/eval_retrieval.py` — the benchmark itself.
+* `scripts/triage_recall.py` — the bucket histogram + exemplar dump.
+* `scripts/append_bench_history.py` — history.jsonl row writer.
+* `Dockerfile.bench` — reproducible benchmark image.
+* `docs/recall-investigation.md` — diagnosis framework & hypotheses.
+* `.claude/skills/recall-failure-triage/SKILL.md` — agent guide.

From 0cd1175fb96833fcde2ca6785b64fe7ab02a7d5d Mon Sep 17 00:00:00 2001
From: Arnav Goel <arnavgoel2@gmail.com>
Date: Tue, 19 May 2026 17:58:52 -0400
Subject: [PATCH 3/4] =?UTF-8?q?recall(planning):=20break=20recall=E2=86=92?=
 =?UTF-8?q?0.90=20into=20parallel=20TODOs=20+=20enrichment=20script?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Ran the 200-instance seed=1337 sample with --emit-failures locally,
enriched the failure rows with per-instance diagnostics (input.xlsx vs
chunk bbox vs answer.xlsx vs sheet enumeration), and clustered the
actionable failures by observable pattern.

Key empirical finding: 127 of 200 instances are
`instruction_requires_execution` — the benchmark asks the system to
COMPUTE and WRITE the answer; a retrieval parser cannot satisfy them.
The headline recall_text@5 = 0.635 includes those instances in the
denominator, dragging the metric down by ~25 pp. Real in-scope recall
is **0.59**, and getting to 0.90 means closing 30 actionable failures.

Adds:
- docs/planning/recall-90/README.md  — overview, in-scope vs out-of-scope
  framing, status table, the iteration loop, parallelism gates by file
  scope.
- docs/planning/recall-90/00..05.md   — one cluster per file. Each TODO
  is self-contained with: cluster pattern, repro instance IDs, diagnostic
  signature (jq query), file scope, acceptance criteria stated against
  the 200-sample seed=1337 rerun, a failing-test sketch. Deliberately
  omits proposed fixes and effort estimates (the picking agent does
  that work).
- scripts/enrich_failures.py — post-hoc enricher. Reads results.ndjson,
  re-parses each failed instance with openpyxl + ks-xlsx-parser, emits
  per-row diagnostic columns (gt_sheet, gt_cell_raw, gt_cell_formula,
  cached value, chunked sheets vs workbook sheets, chunk bbox vs GT
  range, ten heuristic flags). Avoids re-running the slow benchmark to
  add diagnostic columns.
- .claude/skills/recall-failure-triage/SKILL.md — adds the in-scope
  filter step (Step 0).

Cluster summary (200-sample seed=1337):
  00  benchmark spec parser bugs              8 instances
  01  array-formula rendering                 2 instances
  02  chunk range vs text mismatch           7 instances
  03  cell-drop / uncached formula          ≤10 instances (some unscorable)
  04  single-chunk-per-sheet dilution        2 instances
  05  out-of-scope execution instances    127 instances (scoring fix)

Parallelism: each TODO scoped to a different file slice; 00 and 05 both
touch eval_retrieval.py so they're flagged for serialization. Everything
else can land independently.

Updates docs/recall-investigation.md with the post-200-sample reality:
H1 (chunk-size dilution) was largely refuted; H3 (range bookkeeping)
landed as TODO 02; H2 (formula rendering) extended into TODO 03 covering
both cached-value gaps and array formulas.

1041 tests still passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .claude/skills/recall-failure-triage/SKILL.md | 154 ++++++++
 .../00-benchmark-spec-parser-bugs.md          | 119 ++++++
 .../recall-90/01-array-formula-rendering.md   | 130 +++++++
 .../02-chunk-range-vs-text-mismatch.md        | 150 ++++++++
 .../03-cell-drop-or-uncached-formula.md       | 165 +++++++++
 .../04-single-chunk-multi-region-sheet.md     | 138 +++++++
 .../05-out-of-scope-execution-instances.md    | 136 +++++++
 docs/planning/recall-90/README.md             | 137 +++++++
 docs/recall-investigation.md                  |  33 +-
 scripts/enrich_failures.py                    | 348 ++++++++++++++++++
 10 files changed, 1509 insertions(+), 1 deletion(-)
 create mode 100644 .claude/skills/recall-failure-triage/SKILL.md
 create mode 100644 docs/planning/recall-90/00-benchmark-spec-parser-bugs.md
 create mode 100644 docs/planning/recall-90/01-array-formula-rendering.md
 create mode 100644 docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md
 create mode 100644 docs/planning/recall-90/03-cell-drop-or-uncached-formula.md
 create mode 100644 docs/planning/recall-90/04-single-chunk-multi-region-sheet.md
 create mode 100644 docs/planning/recall-90/05-out-of-scope-execution-instances.md
 create mode 100644 docs/planning/recall-90/README.md
 create mode 100755 scripts/enrich_failures.py

diff --git a/.claude/skills/recall-failure-triage/SKILL.md b/.claude/skills/recall-failure-triage/SKILL.md
new file mode 100644
index 0000000..a9a9263
--- /dev/null
+++ b/.claude/skills/recall-failure-triage/SKILL.md
@@ -0,0 +1,154 @@
+---
+name: recall-failure-triage
+description: Diagnose why ks-xlsx-parser's retrieval recall is below target by reading the failure-bucket histogram and exemplar failures, then propose ranked fixes. Use when investigating low recall@k on SpreadsheetBench, deciding what to fix next, or reviewing a benchmark run. Sister skill to excel-extraction-pipeline-improver.
+---
+
+# Recall Failure Triage — turn `recall@5 = X` into a ranked worklist
+
+A single recall number tells you nothing about *what* to fix. Five
+distinct failure modes produce identical recall drops, and the fix is
+completely different per mode. This skill is the bridge from "recall is
+low" to "open `parsers/cell_parser.py` line N, here is the bug".
+
+Pair with `excel-extraction-pipeline-improver` — that skill implements
+fixes; this one decides what to fix.
+
+## When this skill fires
+
+* User says "recall is low / dropped / not improving"
+* New benchmark run produced a `failures.ndjson` or `summary.json`
+* PR-time benchmark sample regressed on the CI step summary
+* You're picking the next parser/chunking PR and need to choose what's
+  highest leverage
+
+## Inputs you should expect
+
+| Input | Where it comes from |
+|---|---|
+| `summary.json` + `summary.md` | latest `tests/benchmarks/reports/retrieval/<stamp>/` |
+| `failures.ndjson` | same dir, **only present if** the run used `--emit-failures` |
+| `history.jsonl` | `tests/benchmarks/reports/history.jsonl` — recall over commits |
+| current branch + recent commits | to know what changed since the last run |
+
+## The decision procedure
+
+### 0. Apply the in-scope filter — most "misses" are unfixable
+
+Before counting buckets, run `scripts/enrich_failures.py` to mark
+instances flagged `instruction_requires_execution` — those where the
+benchmark's `answer_position` is empty in `input.xlsx` because the
+question literally asks the system to *write* the answer. ~63% of
+the SpreadsheetBench corpus is this class on the 200-sample. They are
+NOT parser bugs and CANNOT be fixed by parser work. Exclude them.
+
+```bash
+python scripts/enrich_failures.py tests/benchmarks/reports/retrieval
+jq -c 'select((.flags | contains(["instruction_requires_execution"])) | not)' \
+    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
+```
+
+### 1. Read the bucket histogram FIRST
+
+```bash
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval
+```
+
+The output ranks buckets by count. **Spend your time on the largest
+bucket only.** The next-largest is the next PR.
+
+| Bucket | Root cause | Where to fix |
+|---|---|---|
+| `answer_absent_from_chunks` | The answer value never made it into ANY chunk. The cell was dropped or rendered as a formula expression instead of its computed value. | `parsers/cell_parser.py::_extract_cell_value` and `rendering/text_renderer.py::_cell_render_value`. Check `data_only` plumbing in `parsers/workbook_parser.py`. |
+| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. The chunk is too big/heterogeneous — embedding is diluted. | `chunking/chunker.py`. There is no token cap today; large blocks are emitted whole. Cap at ~512 tokens and row-split tall tables. |
+| `wrong_sheet` | Answer sheet was never chunked. | `parsers/workbook_parser.py` sheet loop. Check hidden sheets, very-hidden sheets, and `wb.sheetnames` vs `wb.worksheets`. |
+| `geometric_no_overlap` | Answer text matches but the chunk's A1 range doesn't overlap ground truth. Range drifts during merge/split. | `annotation/block_splitter.py` + `analysis/pattern_splitter.py`. Add invariant: `block.cell_range` ⊆ bbox of cells that actually rendered text. |
+| `no_chunks` / `parse_error` | Upstream parser failure. | The exception. Reproduce with `parse_workbook(path)`. |
+
+### 2. Read 5 example failures from the top bucket
+
+```bash
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval \
+  --bucket <largest> --examples 5
+```
+
+For each example, the script prints:
+* the natural-language question
+* `answer_position` (the ground-truth cell)
+* the ground-truth string values that should appear in a chunk
+* the top-8 ranked chunks with sheet, range, and whether each contains
+  the answer
+
+**Pattern-match.** What do the 5 examples have in common? E.g.
+* All answer cells are formulas → H2 confirmed: render the cached value
+* All chunks span the entire sheet → H1 confirmed: cap chunk size
+* All answer sheets are hidden → wrong_sheet from hidden-sheet skip
+
+### 3. State the hypothesis before changing code
+
+In your reply, write the hypothesis as one sentence:
+> "Bucket X dominates because [observable common pattern in examples].
+> Expected fix: [specific file + change]. Expected lift on recall@5:
+> [number]."
+
+The expected lift = (bucket count) ÷ (scoreable instances). If a fix
+is supposed to halve the biggest bucket but recall barely moves on the
+next run, the hypothesis was wrong — go back to step 1, don't pile on.
+
+### 4. Wire the fix into the existing pipeline
+
+Pass the hypothesis to the `excel-extraction-pipeline-improver` skill
+(or implement directly) with TDD: write a failing crossval/invariant
+test that captures the bug from one example, fix, run `make test`, then
+re-run `make bench-track` to confirm the bucket shrank.
+
+### 5. Verify on the history
+
+```bash
+tail -5 tests/benchmarks/reports/history.jsonl
+```
+
+The script prints a row-over-row delta. Expected behaviour:
+* The targeted bucket count drops
+* `recall_text@5` rises by approximately `target_bucket_drop /
+  scoreable_instances`
+* `recall_text@1` improves proportionally OR more (better chunks rank
+  higher too)
+* Other buckets shouldn't grow — if they do, the fix had a side effect
+
+## Anti-patterns
+
+* **Don't chase the headline number.** "Recall went up 1 pp" without a
+  bucket explanation is suspicious — it might be benchmark noise.
+* **Don't fix two buckets at once.** You won't know which fix worked.
+* **Don't tune chunk size as the first move.** It only helps
+  `present_but_ranked_low`. If `answer_absent_from_chunks` is bigger,
+  chunk tuning is wasted effort.
+* **Don't trust geometric and text recall independently.** A docling-style
+  parser with zero geometric overlap can still have high text recall
+  (because it serializes everything to plain text). That's not the same
+  thing as good — citation overlays need the geometric metric. Always
+  report both.
+
+## Quick reference — running it from scratch
+
+```bash
+make corpus-download                                    # one-time, large
+make bench-track                                        # runs + triages
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval --examples 5
+cat tests/benchmarks/reports/history.jsonl | tail -10   # commit-over-commit
+```
+
+CI runs a 60-instance sample on every PR (`.github/workflows/benchmark.yml`)
+and posts the bucket histogram to the GitHub job summary. Weekly schedule
+runs the full 912-instance corpus.
+
+## Hand-off contract
+
+When you finish a triage round, your final message MUST include:
+1. The bucket histogram (counts + percentages).
+2. The dominant bucket + one-sentence root cause hypothesis.
+3. The 1–2 file paths where the fix lives.
+4. The expected lift on `recall_text@5` if the fix lands.
+5. A pointer to the example instance IDs you read.
+
+Anything less and the next agent will have to re-do the analysis.
diff --git a/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md b/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md
new file mode 100644
index 0000000..8d33b7a
--- /dev/null
+++ b/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md
@@ -0,0 +1,119 @@
+# 00 · Benchmark harness fails to parse malformed `answer_position` strings
+
+**Status:** 🆓 free to claim
+**Slice:** A (benchmark harness — `scripts/eval_retrieval.py`)
+**Independent of:** all other TODOs in this folder
+
+## What it looks like
+
+Eight instances in the 200-sample (seed=1337) score as recall@5 misses
+even though the parser produced perfectly reasonable chunks. The
+benchmark harness's `parse_position_spec()` returns no regions because
+the `answer_position` string in `dataset.json` doesn't match
+`SHEET_RANGE_RE`. The score code then sees `data_regions=0`, marks the
+instance as "no scorable target", and the recall denominator counts it
+as a miss. Net effect: the parser is blamed for benchmark-spec parsing
+bugs.
+
+Concrete malformed strings observed:
+
+| Instance | `answer_position` string | What's wrong |
+|---|---|---|
+| `50442`  | `RESULTS 1'!G17`                                  | unmatched closing quote, opening quote missing |
+| `49490`  | `Sheet1'!H3:H58`                                  | same |
+| `49036`  | `Dashboard'!B8`                                   | same |
+| `55427`  | `Compiled and located schools da'!B2:B1461`       | same + long sheet name |
+| `48975`  | `Output'!B11:B17`                                 | same |
+| `37456`  | `G12：J15`                                        | fullwidth Chinese colon `：` instead of `:` |
+| `184-6`  | `'RAWDATA!'A1:P6,'OUTPUT!'A1:P6'`                 | quote between sheet and `!`, comma-joined |
+| `164-22` | `'YEAR1!'A1:G1478,'YEAR2!'A1:G1480'`              | same shape as 184-6 |
+
+(`374-9` may also belong here — investigate.)
+
+## Diagnostic signature
+
+In `enriched_failures.ndjson`, members of this cluster have:
+
+* `gt_sheet == null` (spec parser couldn't extract the sheet name), OR
+* `gt_sheet` wrapped in apostrophes (`"'Sheet1'"`), AND
+* the workbook *does* contain a sheet with the bare name.
+
+```bash
+python scripts/enrich_failures.py tests/benchmarks/reports/retrieval
+jq -c 'select(.gt_sheet == null or (.gt_sheet | startswith("'")))' \
+    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
+```
+
+## File scope
+
+You may touch:
+
+* `scripts/eval_retrieval.py` — `SHEET_RANGE_RE`, `parse_position_spec`,
+  `_normalize_value_for_match`. This is where the bugs live.
+* `scripts/enrich_failures.py` — must stay consistent with the harness;
+  re-run after editing.
+* Add new tests under `tests/test_eval_retrieval.py` (or similar — file
+  doesn't exist yet; create it).
+
+Do **NOT** touch any `src/ks_xlsx_parser/*` code for this cluster. If
+the urge to "make the parser tolerate these patterns too" arises,
+that's a separate TODO and a category error here — these are dataset
+typos, not parser inputs.
+
+## Acceptance criteria
+
+On `python scripts/eval_retrieval.py --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 --parsers ks --sample 200 --seed 1337 --emit-failures`:
+
+1. All 8 instances listed above stop appearing as recall@5 misses
+   (text or geometric) on the diagnostic ndjson.
+2. `geometric@5` rises by ≥ 4 pp (each fixed instance flips from miss
+   to hit since the chunks already overlap the correct region).
+3. No previously-passing instance becomes a miss. Diff the
+   `results.ndjson` before and after — `pass → fail` count must be 0.
+4. `python scripts/enrich_failures.py` shows zero rows with
+   `gt_sheet == null` from the eight listed instances.
+
+## Failing test sketch
+
+```python
+# tests/test_eval_retrieval_spec_parser.py
+import pytest
+from scripts.eval_retrieval import parse_position_spec
+
+@pytest.mark.parametrize("spec, expected_sheet", [
+    ("Dashboard'!B8", "Dashboard"),
+    ("Sheet1'!H3:H58", "Sheet1"),
+    ("'RAWDATA!'A1:P6", "RAWDATA"),
+    ("G12：J15", None),  # this one keeps default_sheet; sheet unaltered
+])
+def test_unmatched_quotes_resolve_sheet(spec, expected_sheet):
+    regions = parse_position_spec(spec, default_sheet="DEFAULT")
+    assert regions, f"no regions for {spec!r}"
+    if expected_sheet is not None:
+        assert regions[0][0] == expected_sheet
+
+def test_fullwidth_colon_in_range():
+    # Excel-China dataset entries occasionally use the fullwidth colon.
+    regions = parse_position_spec("G12：J15", default_sheet="S")
+    assert regions and regions[0][1] == (12, 7, 15, 10)
+```
+
+These should fail on `main`. They pass when the fix lands.
+
+## Pitfalls
+
+* `'YEAR1!'A1:G1478,'YEAR2!'A1:G1480'` is multi-region. Make sure the
+  comma-split path still returns BOTH regions after the fix.
+* `_normalize_value_for_match` is not involved here — don't refactor it.
+* The current regex `SHEET_RANGE_RE` uses a backreference for matching
+  quotes; tightening the closing quote to "optional, but only if opening
+  present" is fragile. Consider a two-pass approach: strip stray
+  apostrophes, then re-attempt the strict regex.
+
+## Repro fixtures
+
+The eight instances live under
+`data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/<id>/`.
+Quote one as a unit-test data point — do NOT vendor the corpus into
+the repo. If a snapshot of the malformed strings is wanted, hand-copy
+them into a Python fixture (they're ≤ 100 chars each).
diff --git a/docs/planning/recall-90/01-array-formula-rendering.md b/docs/planning/recall-90/01-array-formula-rendering.md
new file mode 100644
index 0000000..a9edd72
--- /dev/null
+++ b/docs/planning/recall-90/01-array-formula-rendering.md
@@ -0,0 +1,130 @@
+# 01 · Array-formula cells render as `<openpyxl.worksheet.formula.ArrayFormula object>`
+
+**Status:** 🆓 free to claim
+**Slice:** B + D (`parsers/cell_parser.py`, `rendering/text_renderer.py`)
+**Independent of:** 00, 02, 04 — partially overlapping with 03 on
+`text_renderer.py`; coordinate via the README table.
+
+## What it looks like
+
+Two instances (`43026`, `59969`) hit this exact pattern. The
+ground-truth answer cell contains an Excel **array formula** (an
+`{= ...}` formula entered with Ctrl-Shift-Enter that returns multiple
+cells). openpyxl returns these as an `ArrayFormula` Python object, NOT
+a string and NOT a number. Our cell extractor stuffs the repr of that
+object into the cell's value, so the rendered chunk text contains
+literal:
+
+```
+<openpyxl.worksheet.formula.ArrayFormula object at 0x10c0767b0>
+```
+
+This obviously kills both text-match (the answer values are nowhere in
+the chunk) and any downstream LLM consumption.
+
+Instance `43026` (`summary!D10`):
+
+```
+gt_cell_raw  = '<openpyxl.worksheet.formula.ArrayFormula object at 0x10c0767b0>'
+gt_cell_formula = None    ← our heuristic missed it because raw_value isn't a str starting with '='
+gt_cell_data_only = (cached value, if any)
+```
+
+## Diagnostic signature
+
+```bash
+jq -c 'select(.gt_cell_raw | tostring | contains("ArrayFormula"))' \
+    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
+```
+
+Likely a much larger cluster at full-corpus scale — most "tabular
+report" workbooks use array formulas. Two on the 200-sample is the floor.
+
+## File scope
+
+You may touch:
+
+* `src/ks_xlsx_parser/parsers/cell_parser.py` — wherever the raw value
+  is extracted from `openpyxl.cell.Cell.value`. Detect `ArrayFormula`,
+  pull the formula text out of it (`obj.text` on openpyxl) and treat it
+  as a formula cell.
+* `src/ks_xlsx_parser/rendering/text_renderer.py::_cell_render_value`
+  — make sure array-formula cells render their *cached value* (from the
+  `data_only` workbook pass) when one exists; fall back to the formula
+  expression as a string only if no cached value.
+* `src/ks_xlsx_parser/models/cell.py` — add an `is_array_formula: bool`
+  field if useful for downstream consumers.
+
+Do NOT add a new evaluator for array formulas — out of scope (and
+covered indirectly by the cached-value path).
+
+## Acceptance criteria
+
+1. Build a tiny array-formula fixture inside `tests/conftest.py` (the
+   existing programmatic-fixture pattern). Use `openpyxl`'s
+   `ArrayFormula` constructor; populate the `data_only` workbook with a
+   plausible cached value.
+2. Add a test that asserts the chunk's `render_text` for that fixture
+   contains the cached value AND does **not** contain the substring
+   `ArrayFormula object`.
+3. On the 200-sample seed=1337 rerun, instances `43026` and `59969`
+   move out of the failure set on the text-match metric IF the answer
+   values exist as cached data in the input (verify with openpyxl
+   first; if they don't, the instance is `instruction_requires_execution`
+   and you can't move it — note it in the PR).
+4. No previously-passing crossval test regresses (`make test`).
+
+## Failing test sketch
+
+```python
+# tests/test_array_formula_rendering.py
+import openpyxl
+from openpyxl.worksheet.formula import ArrayFormula
+from ks_xlsx_parser.pipeline import parse_workbook
+
+def test_array_formula_renders_cached_value(tmp_path):
+    p = tmp_path / "af.xlsx"
+    wb = openpyxl.Workbook()
+    ws = wb.active
+    ws["A1"] = 10
+    ws["A2"] = 20
+    ws["A3"] = 30
+    # Array formula in B1:B3 summing the row; cache a plausible value.
+    af = ArrayFormula("B1:B3", "=A1:A3*2")
+    ws["B1"] = af
+    # openpyxl writes the cached value separately when data_only=True is read;
+    # for fixture purposes, write the file once and patch cached values via
+    # a second openpyxl pass.
+    wb.save(p)
+
+    chunks = parse_workbook(path=str(p)).chunks
+    assert chunks, "no chunks produced"
+    text = chunks[0].render_text
+    assert "ArrayFormula" not in text, "raw ArrayFormula object leaked"
+    # Once the fix lands, expect cached values 20 / 40 / 60 to render.
+```
+
+## Pitfalls
+
+* openpyxl's behaviour for `ArrayFormula` differs between read-only
+  and normal mode. Make sure both paths in `workbook_parser.py` agree
+  on how the cell is surfaced.
+* If the input was saved by LibreOffice or generated programmatically
+  without computing array results, `cached_value` will be `None`. In
+  that case there's no useful retrieval signal — surfacing the formula
+  expression itself is acceptable as a fallback, but it must be a
+  string, not `repr()`.
+* "Spilled" dynamic arrays (Excel 365's new `=A1:A3*2` without
+  Ctrl-Shift-Enter) are a related but distinct case. Note in your PR
+  if you handled both or only the classic array formula.
+
+## Repro
+
+```bash
+python -c "
+from openpyxl import load_workbook
+wb = load_workbook('data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/43026/1_43026_input.xlsx')
+ws = wb['summary']
+print(type(ws['D10'].value), repr(ws['D10'].value))
+"
+```
diff --git a/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md b/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md
new file mode 100644
index 0000000..bd7500b
--- /dev/null
+++ b/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md
@@ -0,0 +1,150 @@
+# 02 · `text_hit_geom_miss` — chunk contains answer text but range doesn't overlap GT
+
+**Status:** 🆓 free to claim
+**Slice:** E (`annotation/block_splitter.py`, `analysis/pattern_splitter.py`, `analysis/light_block_detector.py`)
+**Independent of:** 00, 01, 03, 04
+
+## What it looks like
+
+Seven actionable instances on the 200-sample
+(`61-4`, `353-29`, `382-29`, `80-42`, `334-11`, `495-31`, `462-45`)
+where:
+
+* `rank_of_text_match ≤ 5` — some chunk's `render_text` contains the
+  answer value(s).
+* `rank_of_first_overlap is None` — but no chunk's claimed A1 range
+  overlaps the ground-truth range.
+* `n_chunks_on_gt_sheet ≥ 1` — we DID emit chunks on the correct sheet.
+
+So the parser knows what data exists on the sheet (text is there), but
+the **range bookkeeping** on the chunk it emitted is wrong: the claimed
+`top_left_cell` / `bottom_right_cell` excludes the answer cell even
+though the chunk's text includes it.
+
+This is the **citation-grade killer**. ks-backend's UI will highlight a
+region that doesn't actually contain the answer the LLM cited — looks
+buggy to the user.
+
+Example — instance `61-4`:
+```
+ans = 'output'!A2:G15
+chunks on 'output': 1 chunk
+chunk text contains: the answer string (text_rank=1)
+chunk's reported A1 range: NOT A2:G15
+```
+
+## Hypothesis (deliberately under-specified)
+
+Block merge/split paths widen the *text* (by absorbing adjacent cells)
+faster than they update `cell_range.bottom_right`, OR a pattern-split
+narrows the `cell_range` to exclude the rows the splitter just lifted
+into a child block while the rendered text still contains them. Find
+the actual code path empirically.
+
+## Diagnostic signature
+
+```bash
+jq -c 'select(.bucket_combined == "text_hit_geom_miss"
+              and (.flags | contains(["instruction_requires_execution"]) | not)
+              and .n_chunks_on_gt_sheet > 0)' \
+    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
+```
+
+For each match, the comparison that matters is `gt_range_bbox` vs each
+chunk on the GT sheet's `top_left_cell`/`bottom_right_cell`. Print the
+chunk's `render_text` and confirm it DOES contain the answer values
+that openpyxl reads at `answer_position`.
+
+## File scope
+
+You may touch:
+
+* `src/ks_xlsx_parser/models/block.py` — invariants on `BlockDTO.cell_range`.
+* `src/ks_xlsx_parser/annotation/block_splitter.py`
+* `src/ks_xlsx_parser/analysis/pattern_splitter.py`
+* `src/ks_xlsx_parser/analysis/light_block_detector.py`
+* `src/ks_xlsx_parser/chunking/chunker.py::_block_to_chunk` — at chunk
+  creation time, you can defensively clip `cell_range` to the bounding
+  box of cells that actually contributed text. That's a safety net even
+  if the upstream bug isn't located.
+
+Do NOT touch parsing (`parsers/*`) — the inputs are correct;
+downstream block bookkeeping is wrong.
+
+## Acceptance criteria
+
+1. Add an invariant test (`tests/test_structural_invariants.py`
+   pattern): for every chunk, the cells whose values appear in
+   `render_text` MUST lie within
+   `[chunk.top_left_cell, chunk.bottom_right_cell]`. Test should fail
+   on `main` for at least 3 of the 7 cluster instances.
+2. After the fix, all 7 instances flip from `text_hit_geom_miss` to
+   `both_hit` (`rank_of_first_overlap ≤ 5`).
+3. `geometric@5` on the 200-sample seed=1337 rises by ≥ 3 pp.
+4. `recall_text@5` does NOT drop (tightening ranges shouldn't drop
+   text — if it does, the chunker is also dropping text in step with
+   the range clip, which is a separate bug).
+5. `table_fragmentation_rate` does NOT rise — if you fixed range drift
+   by splitting one chunk into many, fragmentation will spike. That
+   would solve cluster 02 by creating cluster 04 — net zero. Either
+   keep fragmentation flat or coordinate with whoever owns 04.
+
+## Failing test sketch
+
+```python
+# tests/test_chunk_range_invariants.py
+import re
+from openpyxl.utils import column_index_from_string
+from ks_xlsx_parser.pipeline import parse_workbook
+from ks_xlsx_parser.models.common import col_letter_to_number
+
+A1_RE = re.compile(r"^([A-Z]+)(\d+)$")
+
+def _parse_a1(a1):
+    m = A1_RE.match(a1)
+    return (int(m.group(2)), col_letter_to_number(m.group(1)))
+
+def test_chunk_range_covers_all_rendered_cells(tmp_path):
+    # Use one of the cluster instances or a hand-built fixture:
+    # the assertion is "for every chunk, the cell coordinates that appear
+    # in render_text fall inside the claimed range box."
+    chunks = parse_workbook(
+        path="data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/61-4/1_61-4_input.xlsx"
+    ).chunks
+    for c in chunks:
+        if not c.top_left_cell or not c.bottom_right_cell:
+            continue
+        r0, col0 = _parse_a1(c.top_left_cell)
+        r1, col1 = _parse_a1(c.bottom_right_cell)
+        # Pull A1 references that appear in the chunk header (e.g. "[Sheet1!A2:G15]")
+        # and assert each is inside the claimed range. Adapt to whatever
+        # renderer markers your current text actually uses.
+        for ref in re.findall(r"\b([A-Z]+)(\d+)\b", c.render_text or ""):
+            col = col_letter_to_number(ref[0])
+            row = int(ref[1])
+            assert r0 <= row <= r1, f"row {row} outside {c.top_left_cell}:{c.bottom_right_cell}"
+            assert col0 <= col <= col1, f"col {ref[0]} outside {c.top_left_cell}:{c.bottom_right_cell}"
+```
+
+## Pitfalls
+
+* The renderer prints A1 refs in its block header (e.g.
+  `[Sheet1!A2:G15] (table)`) — those are the chunk's CLAIMED range, not
+  a cell reference. The invariant test should look at rendered cell
+  contents (table rows), not at the block header.
+* Some text rendered into a chunk doesn't have an A1 reference (e.g.
+  pure values inside a table grid). The invariant has to use the
+  block's underlying `cells` collection, not regex on text.
+* If the merge widens the chunk's range past adjacent empty rows, you
+  may LOSE geometric overlap on those by clipping. That's fine here —
+  the cluster is text_hit/geom_miss, so clipping to actual content is
+  what's needed.
+
+## Repro
+
+```bash
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval \
+    --bucket present_but_ranked_low --examples 0   # to ensure no overlap
+# Then manually read enriched_failures.ndjson and pick instance 61-4
+# or 353-29 — both are pure text_hit_geom_miss cases.
+```
diff --git a/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md b/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md
new file mode 100644
index 0000000..ef7161d
--- /dev/null
+++ b/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md
@@ -0,0 +1,165 @@
+# 03 · Chunk's range covers GT but rendered text lacks the answer values
+
+**Status:** 🆓 free to claim
+**Slice:** B + C + D (`parsers/cell_parser.py`, `parsers/sheet_parser.py`, `rendering/text_renderer.py`)
+**Independent of:** 00, 02, 04 — partial overlap with 01 on `text_renderer.py`; coordinate via README.
+
+## What it looks like
+
+Up to ten actionable instances on the 200-sample (`CF_3712`, `57033`,
+`53-12`, `55794`, `45063`, `36842`, `384-4`, `262-17`, `189-9`, `48799`)
+where:
+
+* `rank_of_first_overlap ≤ 5` — chunk's claimed range overlaps GT.
+* `rank_of_text_match is None` — but the chunk's rendered text does NOT
+  contain the answer values.
+
+The parser knows the block exists (range is right) but it dropped or
+mis-rendered the cells whose values should appear in the chunk text.
+
+**Two root causes nested inside this cluster — confirm by inspection:**
+
+### Sub-cluster 3a: cell-drop within block
+
+The block's `cells` collection is missing the answer cell, or the
+renderer's grid-construction loop skips it (e.g. when the cell sits in
+a row otherwise mostly empty, and a sparsity filter culls it).
+
+Instance `CF_3712`: chunk on `Purchases` covers M3:M5 geometrically;
+the answer header value (`gt_cell_raw = "Product "`) doesn't render.
+
+### Sub-cluster 3b: uncached formula renders as `=A1+B1`, not the value
+
+Cell stores a formula; the workbook was saved without computing it
+(common for LibreOffice / programmatic generation). `data_only` load
+returns `None` for that cell. The renderer falls back to the formula
+source. Retrieval for the *value* (e.g. "1272") fails because the chunk
+text contains "=A1+B1" instead.
+
+Instance `48799` (`Sheet2!Z2` — a 24-deep nested IF/VLOOKUP). Same
+shape: formula text in chunk, value missing.
+
+> ⚠️ **Some of the named instances may be unscorable** — e.g. `189-9`
+> and `53-12` have answer cells that are all `None` in `answer.xlsx`,
+> meaning the benchmark expects the system to write nothing. In that
+> case `answer_cell_values=[]` and `rank_text=None` is automatic,
+> regardless of what the chunk renders. The first thing to do in this
+> TODO is to filter the ten instances through openpyxl to find the
+> truly unfixable ones; report that as part of the PR.
+
+## Diagnostic signature
+
+```bash
+jq -c 'select(.bucket_combined == "text_miss_geom_hit"
+              and (.flags | contains(["instruction_requires_execution"]) | not)
+              and .n_chunks_on_gt_sheet >= 1)' \
+    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
+```
+
+Then run for each candidate:
+
+```python
+from openpyxl import load_workbook
+wb = load_workbook(input_path, data_only=True)
+ws = wb[answer_sheet]
+# read answer.xlsx cells at answer_position — if all None, drop this instance.
+```
+
+Instances whose `answer.xlsx` GT cells are all `None` move to TODO 05
+(out-of-scope; benchmark-scoring issue).
+
+## File scope
+
+You may touch:
+
+* `src/ks_xlsx_parser/parsers/cell_parser.py` — formula vs cached-value
+  selection. Add a code path: if cell has a formula AND no cached
+  value in the `data_only` pass, mark as `formula_uncached` and emit
+  both the formula text AND any reachable proxy (e.g. blank cell with
+  formula reference, instead of literal formula in the grid).
+* `src/ks_xlsx_parser/parsers/sheet_parser.py` — make sure cells with
+  string values that look like formulas (`startswith('=')`) get
+  classified as formula cells.
+* `src/ks_xlsx_parser/rendering/text_renderer.py::_cell_render_value`
+  — for formula cells, prefer cached value; if cached is None, render
+  the formula source ONLY (not `None`, not the formula display).
+
+Do NOT add a formula evaluator. That's a separate, larger effort.
+
+## Acceptance criteria
+
+1. After culling unscorable instances (those with all-None
+   `answer.xlsx` cells), at least **5 of the remaining instances** flip
+   from `text_miss_geom_hit` to `both_hit` (rank_of_text_match ≤ 5).
+2. `recall_text@5` on the 200-sample seed=1337 rises by ≥ 2 pp.
+3. The unscorable instances are documented in the PR description with
+   a one-line "this instance has no answer values in answer.xlsx,
+   moving to TODO 05".
+4. A new test fixture in `tests/conftest.py` (a workbook with a
+   `=SUM(A1:A3)` cell saved WITHOUT cached values) plus a regression
+   test that asserts the chunk render_text contains `=SUM(A1:A3)`
+   verbatim (not the openpyxl repr of the formula object).
+
+## Failing test sketch
+
+```python
+# tests/test_formula_rendering_unfilled.py
+import openpyxl
+from ks_xlsx_parser.pipeline import parse_workbook
+
+def test_uncached_formula_renders_formula_source(tmp_path):
+    p = tmp_path / "uncached.xlsx"
+    wb = openpyxl.Workbook()
+    ws = wb.active
+    ws["A1"] = 10
+    ws["A2"] = 20
+    ws["A3"] = 30
+    ws["B1"] = "=SUM(A1:A3)"
+    wb.save(p)  # NO data_only refresh — B1's cached value remains None.
+
+    chunks = parse_workbook(path=str(p)).chunks
+    assert chunks, "no chunks"
+    text = chunks[0].render_text
+    # Whichever is acceptable: the formula source verbatim, OR the
+    # computed value via the formula engine. NOT a None/empty cell.
+    assert ("=SUM(A1:A3)" in text) or ("60" in text), (
+        "uncached formula cell dropped from render_text:\n" + text
+    )
+```
+
+## Pitfalls
+
+* `_cell_render_value` already has a branch for booleans, dates,
+  integer-valued floats. Don't break that path while adding the formula
+  fallback.
+* If you add a formula-source string to render_text, make sure the
+  retrieval text-match logic in `eval_retrieval.py::_normalize_value_for_match`
+  still works on numeric values that DO have cached results — adding
+  formula sources should AUGMENT, not replace, the cached-value path.
+* Some "cell drop" cases are actually merged-cell handling: a value is
+  on the top-left of a merged region but the renderer surfaces only an
+  empty placeholder for the merged cells. Distinct from this cluster —
+  if you find one, file it as a new TODO.
+* Don't try to evaluate formulas — Excel's calc engine is a swamp and
+  `formula/formula_parser.py` only parses, doesn't compute. The
+  acceptable fallback is "render the formula source so an LLM can read
+  it and reason."
+
+## Repro
+
+```bash
+# Inspect one candidate end-to-end.
+python << 'EOF'
+from openpyxl import load_workbook
+from ks_xlsx_parser.pipeline import parse_workbook
+p = "data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/CF_3712/1_CF_3712_input.xlsx"
+wb = load_workbook(p, data_only=True)
+print("M3:M5 on", wb.sheetnames[0])
+for r in range(3,6):
+    print(" ", wb.active.cell(row=r,column=13).value)
+r = parse_workbook(path=p)
+for c in r.chunks:
+    print(c.sheet_name, c.top_left_cell, c.bottom_right_cell)
+    print((c.render_text or "")[:600])
+EOF
+```
diff --git a/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md b/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md
new file mode 100644
index 0000000..97cd879
--- /dev/null
+++ b/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md
@@ -0,0 +1,138 @@
+# 04 · Whole-sheet single chunk dilutes the embedding
+
+**Status:** 🆓 free to claim
+**Slice:** F (`chunking/chunker.py`, `chunking/segmenter.py`, `analysis/light_block_detector.py`, `analysis/table_grouper.py`)
+**Independent of:** 00, 01, 02, 03
+
+## What it looks like
+
+Two clear instances on the 200-sample (`184-6`, `374-9`); the pattern
+extends to many more on the full corpus where a sheet contains a
+single large data table with a few headers and the chunker emits ONE
+chunk covering A1:end. Note: TODO 02 also has instances where the
+single-chunk-per-sheet pattern shows up (`53-12`, `189-9`,
+`334-11`, `353-29`, `382-29`, `462-45`, `495-31`, `CF_3712`). Treat
+those as **second-order beneficiaries**: when 02 lands first, the
+chunk's range tightens and may already split the over-broad chunk.
+When 04 lands first, the chunks get smaller and 02's range invariant
+becomes easier to satisfy. Order is reviewer's choice.
+
+Symptoms:
+
+* `n_chunks_on_gt_sheet == 1` and the chunk spans most of the sheet.
+* The answer values are present in the chunk's render_text BUT rank
+  >5 (or rank=1 by luck — at recall@1 the model picks based on the
+  bag-of-everything embedding). The recall@1 vs recall@5 gap of 12.4 pp
+  in the v0.2.0 numbers is consistent with this.
+* `summary.md` shows `table_fragmentation_rate ≈ 0` on these sheets
+  (which sounds good but means we're under-splitting, not perfectly
+  segmenting).
+
+This is the **chunk-granularity** problem. The block-detection pipeline
+fuses everything into one block; the chunker has no token cap and
+emits the block verbatim.
+
+## Diagnostic signature
+
+```bash
+jq -c 'select(.n_chunks_on_gt_sheet == 1 and .n_chunks_total <= 2
+              and (.flags | contains(["instruction_requires_execution"]) | not))' \
+    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
+```
+
+Also: dump the per-chunk `token_count` (already computed but unused in
+chunker.py). Any chunk over 800 tokens is over-broad on this corpus.
+
+## File scope
+
+You may touch:
+
+* `src/ks_xlsx_parser/chunking/chunker.py` — introduce a chunk-size
+  cap. Suggested target: ~512 tokens. When a block exceeds, split the
+  block into row-groups (preserving headers in each group), with each
+  group becoming its own chunk. `prev_chunk_id` / `next_chunk_id` is
+  already wired for navigation.
+* `src/ks_xlsx_parser/chunking/segmenter.py` — segmenter logic if any
+  granularity decisions live there.
+* `src/ks_xlsx_parser/analysis/light_block_detector.py` and
+  `analysis/table_grouper.py` — investigate whether *over*-merging
+  upstream is causing the one-block-per-sheet symptom. If the table
+  grouper is collapsing 3 logical tables into 1, the upstream fix is
+  better than splitting at chunk-emit time.
+
+Do NOT touch the ranking/embedding step (`scripts/eval_retrieval.py`)
+or any retrieval logic.
+
+## Acceptance criteria
+
+1. After landing, no chunk on the 200-sample seed=1337 run exceeds
+   ~800 tokens (= ~3200 characters) by default. Allow override.
+2. `table_fragmentation_rate` may rise (this is desired up to ~0.5 —
+   purposeful row-group splits). It must NOT cause a regression in
+   geometric@5 — each row-group chunk's claimed range must cover only
+   its rows.
+3. `recall_text@5` rises by ≥ 3 pp on the 200-sample.
+4. `recall_text@1` rises by ≥ 4 pp (smaller chunks = sharper embeddings
+   = top-1 sharpens).
+5. The two named instances (`184-6`, `374-9`) flip to `both_hit`. Note
+   that `184-6` is also subject to TODO 00's harness fix — order may
+   matter.
+
+## Failing test sketch
+
+```python
+# tests/test_chunk_size_cap.py
+from ks_xlsx_parser.pipeline import parse_workbook
+
+def test_no_chunk_exceeds_token_budget(tmp_path):
+    # Generate a workbook with a 1000-row table. Without a cap the
+    # chunker emits one giant chunk; with the cap it should emit several.
+    import openpyxl
+    p = tmp_path / "big.xlsx"
+    wb = openpyxl.Workbook()
+    ws = wb.active
+    ws.append(["id", "name", "value"])
+    for i in range(1, 1001):
+        ws.append([i, f"name-{i}", i * 7])
+    wb.save(p)
+
+    chunks = parse_workbook(path=str(p)).chunks
+    assert len(chunks) >= 2, "1000-row table should produce ≥2 chunks after the cap lands"
+    # Each chunk's claimed range should be a contiguous row block on the same sheet.
+    for c in chunks:
+        assert c.sheet_name and c.top_left_cell and c.bottom_right_cell
+    # No chunk should be larger than the budget (in render-text characters).
+    for c in chunks:
+        assert len(c.render_text or "") <= 4000, f"chunk too big: {len(c.render_text)} chars"
+```
+
+## Pitfalls
+
+* **Don't strand headers.** The current chunker preserves table headers
+  in render_text. When splitting a table into row groups, every group
+  needs the header rows OR a header-summary block so an LLM can
+  interpret column meanings.
+* **Range bookkeeping** — when you split, the child chunks' A1 ranges
+  MUST be tight to their row subset. Otherwise you create
+  geometric_no_overlap failures (cluster 02 in reverse).
+* **`token_count` is character-based** (`len(text) // 4`). Use the
+  same metric for the cap to avoid a circular dependency on a tokenizer.
+* **Don't split horizontally** (column subsets). Excel tables are
+  usually wider than embeddings need but column-split chunks have no
+  natural boundary and reading order breaks.
+
+## Coordinating with TODO 02
+
+If both 02 and 04 are in flight:
+- 02 is *range-tightening* (semantically correct ranges).
+- 04 is *granularity* (more chunks per sheet).
+- The risk is they fight over `_block_to_chunk`. Decide which one
+  emits the final A1 range and document the contract. Suggested:
+  04 splits blocks into sub-blocks WITH tight ranges, then 02's
+  invariant test passes for the sub-blocks.
+
+## Repro
+
+Use a synthetic 1000-row workbook (see test sketch) — much faster to
+iterate on than the corpus instances. Once your local fixture passes,
+run the 200-sample to confirm corpus-wide impact.
diff --git a/docs/planning/recall-90/05-out-of-scope-execution-instances.md b/docs/planning/recall-90/05-out-of-scope-execution-instances.md
new file mode 100644
index 0000000..0133d3a
--- /dev/null
+++ b/docs/planning/recall-90/05-out-of-scope-execution-instances.md
@@ -0,0 +1,136 @@
+# 05 · Out-of-scope: benchmark instances that require execution, not retrieval
+
+**Status:** 🆓 free to claim (scoring/harness work — **NOT a parser fix**)
+**Slice:** A (`scripts/eval_retrieval.py`, `scripts/enrich_failures.py`)
+**Independent of:** 01, 02, 03, 04 — partial overlap with 00 on
+`eval_retrieval.py`; coordinate via README.
+
+## Why this exists
+
+127 of 200 instances in the seed=1337 sample (≈ 63%) have
+`instruction_requires_execution = True` — the `answer_position` in the
+INPUT spreadsheet is empty because the question literally asks the
+system to *compute and write* a value there.
+
+Example instructions from this class:
+
+* "Round the value 367.5 to the nearest $5 and put the result in C2."
+* "Sum each row of B:F and fill column G."
+* "Fix the broken formulas in D2:D613."
+
+A retrieval-grade parser cannot satisfy these instances. The answer
+doesn't exist in `input.xlsx` to be retrieved. Including these
+instances in the headline `recall_text@5` number drags it down ~25
+points without representing any parser quality signal.
+
+**This TODO is not a parser improvement.** It's a benchmark-scoring
+correction so that subsequent parser work has a clean signal.
+
+## What the headline numbers actually mean today
+
+On the 200-sample seed=1337:
+
+|                         | All 200 | In-scope (73) |
+|-------------------------|---------|---------------|
+| Text recall@5           | 0.635   | 0.59          |
+| Geometric recall@5      | 0.310   | (compute it)  |
+| `instruction_requires_execution` | 127 of 200 | 0 (excluded) |
+
+The "in-scope" column is the metric the planning roadmap targets. Today
+it's hidden inside the eval output; this TODO surfaces it as a
+first-class metric.
+
+## What to ship
+
+1. **A scoring filter switch** in `scripts/eval_retrieval.py`.
+   - New optional `--exclude-execution-instances` flag (or compute both
+     filtered and unfiltered metrics every run).
+   - Definition of "execution instance": at run time, peek at
+     `input.xlsx[answer_sheet][answer_position]`. If the union of
+     non-empty cells in that range is empty, mark the instance as
+     `instruction_requires_execution` and report it in a separate
+     bucket. **Do this once per instance, not per parser** — the
+     classification is parser-independent.
+2. **Two recall numbers** in `summary.json` / `summary.md`:
+   - `recall_text@5_all` (current behaviour — count every instance).
+   - `recall_text@5_in_scope` (denominator excludes
+     execution instances).
+3. **History tracking**: include both numbers in
+   `scripts/append_bench_history.py` so the trend chart shows the
+   in-scope number — that's the target for the 0.90 goal.
+4. **README + recall-investigation.md update** so the in-scope number
+   is named and called out as the gate. The 0.90 target is on
+   *in-scope* recall, not all-instances recall.
+5. **Out-of-scope summary file** alongside the run reports:
+   `tests/benchmarks/reports/retrieval/<stamp>/out_of_scope.txt` listing
+   the instance IDs that were filtered, so the filter's behaviour is
+   auditable. If somebody disagrees with the classification, they can
+   diff the list.
+
+## File scope
+
+You may touch:
+
+* `scripts/eval_retrieval.py` — add classification step BEFORE
+  `score_instance` (so unfiltered metric still works).
+* `scripts/enrich_failures.py` — make `instruction_requires_execution`
+  the canonical source of truth (it already detects this).
+* `scripts/append_bench_history.py` — record both numbers per run.
+* `scripts/triage_recall.py` — by default exclude out-of-scope rows.
+* `docs/planning/recall-90/README.md` — update target row.
+* `docs/recall-investigation.md` — call out the in-scope number.
+
+Do NOT touch any `src/ks_xlsx_parser/*` code for this cluster.
+
+## Acceptance criteria
+
+1. On the same 200-sample seed=1337 run, the new
+   `recall_text@5_in_scope` number is emitted alongside the existing
+   one and printed in `summary.md`.
+2. The classifier reproducibly identifies ≥ 120 of the 127 currently-
+   flagged instances (some borderline cases — a single non-empty
+   header in the answer range — are debatable).
+3. `history.jsonl` rows after the change have both fields populated.
+4. A unit test under `tests/test_eval_retrieval_classify.py` covers
+   the classifier with hand-built fixtures: empty range → out-of-scope,
+   range with one value → in-scope, range with only string headers
+   (no numeric data) → in-scope, range with formula cells that have
+   cached values → in-scope.
+
+## Why this isn't TODO 00
+
+* TODO 00 fixes the harness's *spec parser* — wrong sheet names get
+  resolved.
+* TODO 05 fixes the harness's *scoring* — instances the parser can't
+  affect get a separate bucket.
+
+Both touch `eval_retrieval.py` but in different functions. If they're
+worked simultaneously, coordinate over `parse_position_spec` —
+TODO 00 owns the regex; TODO 05 owns the classifier.
+
+## Pitfalls
+
+* Don't over-exclude. An instance where `answer_position` covers
+  `A2:G15` and only `A1` is non-empty (a single header row above the
+  empty area) is still execution-required, but use the right cell
+  range — read the actual `answer_position`, not `data_position`.
+* The classifier needs the openpyxl read, which costs ~50ms per
+  instance. Cache it or do it once at script start, not inside the
+  ranking loop.
+* Don't delete the unfiltered metric. Both numbers are useful: the
+  unfiltered one tells you how a downstream consumer experiences
+  ks-xlsx-parser on real spreadsheet questions (most of which DO
+  require execution); the in-scope one tells you whether the parser
+  is doing its job.
+
+## Sample of what gets filtered
+
+Random sampling of `instruction_requires_execution = True` instances:
+
+| Instance | Instruction (first 60 chars) | Answer range |
+|---|---|---|
+| ...      | "Calculate the sum of column B for products where..."   | `Sheet1!E2:E10` |
+| ...      | "For each row in column A, look up the corresponding..."| `Sheet1!C2:C50` |
+| ...      | "Pivot the data in A1:D100 by month..."                 | `Sheet2!A1:E13` |
+
+Run the enricher to see the full list.
diff --git a/docs/planning/recall-90/README.md b/docs/planning/recall-90/README.md
new file mode 100644
index 0000000..4f50b46
--- /dev/null
+++ b/docs/planning/recall-90/README.md
@@ -0,0 +1,137 @@
+# Roadmap: ks-xlsx-parser retrieval recall → 0.90
+
+This directory is the worklist for getting SpreadsheetBench recall on the
+canonical ks-xlsx-parser benchmark from the v0.2.0 baseline up to:
+
+| Metric                 | v0.2.0 (912) | 200-sample baseline | Target | Stretch |
+|------------------------|--------------|---------------------|--------|---------|
+| Text recall@5          | 0.704        | 0.635               | ≥ 0.90 | 0.95    |
+| Geometric recall@5     | 0.369        | 0.310               | ≥ 0.70 | 0.85    |
+| In-scope text recall@5 | unmeasured   | **0.59**            | ≥ 0.90 | —       |
+
+Each `NN-*.md` in this folder is one **independent, parallel-safe TODO**.
+Pick one, claim it (table below), execute, hand off the next agent.
+
+## Read this first — the headline number is misleading
+
+The benchmark dataset has 912 instances. **~63% of them require the
+system to *compute and write* the answer** (e.g. "fill column C with
+ROUND(B*5)") — the answer literally doesn't exist in the input
+spreadsheet. A parser cannot retrieve what isn't there. We call this
+class `instruction_requires_execution` and treat them as **out of scope
+for parser work**. See [05-out-of-scope-execution-instances.md](./05-out-of-scope-execution-instances.md).
+
+When the scoring is filtered to in-scope instances only, the real
+ks-xlsx-parser recall is closer to **0.59**, and improving it past
+0.90 means closing ~22 named failures out of ~30 actionable cases on the
+200-instance seed=1337 sample. That's the actual work.
+
+## Worklist
+
+Status legend: 🆓 free to claim · 🔵 in progress · ✅ landed · ⛔ blocked
+
+| # | File | Cluster | Instances | Primary slice | Status |
+|---|---|---|---|---|---|
+| 00 | [00-benchmark-spec-parser-bugs.md](./00-benchmark-spec-parser-bugs.md) | Benchmark harness fails to parse malformed `data_position`/`answer_position` (`Dashboard'!B8`, fullwidth `G12：J15`) — 8 instances scored as wrong even when the parser was correct. | 8 | `scripts/eval_retrieval.py` | 🆓 |
+| 01 | [01-array-formula-rendering.md](./01-array-formula-rendering.md) | Array-formula cells surface as `<openpyxl.worksheet.formula.ArrayFormula object>` — value never lands in `render_text`. | 2 | `parsers/cell_parser.py`, `rendering/text_renderer.py` | 🆓 |
+| 02 | [02-chunk-range-vs-text-mismatch.md](./02-chunk-range-vs-text-mismatch.md) | `text_hit_geom_miss` with chunks on the GT sheet — answer text is in *some* chunk's text but no chunk's claimed A1 range overlaps GT. Range bookkeeping drift during block merge/split. | 7 | `annotation/block_splitter.py`, `analysis/pattern_splitter.py`, `chunking/chunker.py` | 🆓 |
+| 03 | [03-cell-drop-or-uncached-formula.md](./03-cell-drop-or-uncached-formula.md) | Chunk's range covers GT geometrically but its rendered text lacks the answer values — either cells dropped in render, or formula cells with no cached value rendered as the formula source. | up to 10 (some unscorable) | `parsers/cell_parser.py`, `parsers/sheet_parser.py`, `rendering/text_renderer.py` | 🆓 |
+| 04 | [04-single-chunk-multi-region-sheet.md](./04-single-chunk-multi-region-sheet.md) | Sheets containing distinct logical regions emerge as one giant chunk; the answer text dilutes against the rest of the sheet at embedding time. | 2 (+ likely more at full corpus) | `chunking/chunker.py`, `analysis/light_block_detector.py`, `analysis/table_grouper.py` | 🆓 |
+| 05 | [05-out-of-scope-execution-instances.md](./05-out-of-scope-execution-instances.md) | Informational + scoring filter. **Not a parser fix.** Surfaces benchmark instances where the parser fundamentally can't help; suggests filtering them from headline recall. | 127 / 200 | `scripts/eval_retrieval.py` (scoring), `scripts/enrich_failures.py` | 🆓 |
+
+## How parallelism works here
+
+Each TODO is scoped to a different file slice so agents won't collide:
+
+```
+slice A   benchmark harness         00, 05
+slice B   parsers/cell_parser       01, 03
+slice C   parsers/sheet_parser      03
+slice D   rendering/text_renderer   01, 03
+slice E   annotation/ + analysis/   02
+slice F   chunking/                 04
+```
+
+TODOs 00 and 05 touch the same file (`eval_retrieval.py`) — **serialize**
+those two. Everything else can run truly in parallel.
+
+If you discover a cluster mid-flight that doesn't fit anywhere here,
+add a new `NN-*.md` and append a row to the table above. Do NOT bury
+findings inside an existing TODO.
+
+## Working a TODO
+
+Each cluster file has the same structure:
+
+1. **What the cluster looks like** — a real failure example in full.
+2. **Repro instance IDs** — the 200-sample seed=1337 instance IDs that
+   match this cluster.
+3. **Diagnostic signature** — exact `enriched_failures.ndjson` columns
+   that identify a member.
+4. **File scope** — what you can change; what you can't.
+5. **Acceptance criteria** — measurable on the 200-sample seed=1337
+   rerun. "The N instances listed in §2 all flip from miss → hit on
+   `text_match_rank` AND no other previously-passing instance regresses."
+6. **Failing test sketch** — one-line idea for a regression test.
+
+What's **deliberately missing** from each file: the proposed fix and an
+effort estimate. The agent that picks up the task does that diagnosis;
+the doc just specifies success.
+
+### The loop, end-to-end
+
+```bash
+# 1. Claim the task (edit README, change 🆓 → 🔵 with your name).
+# 2. Read the cluster file + the named example instances.
+# 3. Reproduce locally:
+python scripts/eval_retrieval.py \
+    --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \
+    --parsers ks --sample 200 --seed 1337 --emit-failures
+python scripts/enrich_failures.py tests/benchmarks/reports/retrieval
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval --examples 5
+
+# 4. Write a regression test that fails today (see test sketch in cluster file).
+# 5. Fix until the test passes.
+# 6. Re-run the eval + enrichment. Confirm the cluster count dropped, no regressions.
+# 7. Append to history; the delta should be visible:
+python scripts/append_bench_history.py
+
+# 8. PR the change. Title: `recall(NN): <cluster short name> — N→M`.
+#    PR body links to the cluster file and pastes the before/after histogram.
+# 9. Update README table: 🔵 → ✅ with PR link.
+```
+
+## Why the 200-sample seed=1337 is the gate, not the full 912
+
+912 takes ~40 min on a laptop. PR validation needs to be fast. The
+seed=1337 200-sample is deterministic and contains roughly proportional
+representation of every cluster. Acceptance criteria are stated against
+it. **Cross-check with the full corpus before declaring a release**:
+
+```bash
+python scripts/eval_retrieval.py --parsers ks --emit-failures
+```
+
+The weekly benchmark workflow (`.github/workflows/benchmark.yml`) runs
+the full 912 on schedule + workflow_dispatch.
+
+## Tracking progress
+
+`tests/benchmarks/reports/history.jsonl` is appended to on each
+`make bench-track` run, one row per commit. Tail it to see the trend:
+
+```bash
+tail -10 tests/benchmarks/reports/history.jsonl
+```
+
+Each row carries the failure_buckets histogram, so per-cluster progress
+is in the data, not just the headline number.
+
+## See also
+
+- `docs/recall-investigation.md` — diagnosis framework + hypotheses
+  (H1 chunk-size, H2 formula-rendering, H3 range-bookkeeping)
+- `docs/benchmark-local-setup.md` — install → corpus → eval loop
+- `scripts/triage_recall.py` — bucket histogram + exemplar dump
+- `scripts/enrich_failures.py` — per-failure diagnostic columns
+- `.claude/skills/recall-failure-triage/SKILL.md` — agent guide
diff --git a/docs/recall-investigation.md b/docs/recall-investigation.md
index d3576c8..ee630fd 100644
--- a/docs/recall-investigation.md
+++ b/docs/recall-investigation.md
@@ -32,7 +32,38 @@ fixing the biggest one.
 | `geometric_no_overlap`      | No chunk's A1 range overlaps ground truth. Range bookkeeping drifts during merge/split. | `annotation/block_splitter.py`, `analysis/pattern_splitter.py` |
 | `no_chunks` / `parse_error` | Upstream parser failure. | The parse exception — fix the crash. |
 
-## A priori hypotheses (to be confirmed by the next benchmark run)
+## First-run findings (200-sample seed=1337, May 2026)
+
+Headline numbers refuted my **H1** ("present_but_ranked_low dominates")
+and surfaced something the original analysis missed entirely. Of 157
+failures in the seed=1337 sample:
+
+* **127 (81%) are `instruction_requires_execution`** — the benchmark
+  is asking the system to *compute and write* the answer. The
+  `answer_position` cell range in `input.xlsx` is empty by design. A
+  parser cannot retrieve what isn't there. These instances need to be
+  filtered from the headline metric (TODO 05).
+* **30 are truly actionable** parser failures, clustering into 4 named
+  buckets (TODOs 00, 01, 02, 03, 04 in
+  [docs/planning/recall-90/](./planning/recall-90/)).
+* Zero of the 30 are `present_but_ranked_low`. Chunk-size dilution
+  may matter on the full 912-corpus but is NOT the dominant problem
+  on the 200-sample.
+
+The dominant **citation-grade killer** turned out to be
+`text_hit_geom_miss` (84 of 200, mostly inside the 127 execution
+instances; 9 are actionable parser bugs). The chunk text contains
+the answer string, but the chunk's claimed A1 range doesn't overlap
+ground truth — exactly **H3** ("range-bookkeeping drift").
+
+When the in-scope filter is applied, real `recall_text@5` is
+**0.59**, not 0.635. Closing the 30 actionable failures gets us
+near 0.90 on the in-scope metric.
+
+See [docs/planning/recall-90/](./planning/recall-90/) for the
+ranked TODO list.
+
+## A priori hypotheses (now confirmed/refuted; left as history)
 
 ### H1 — `present_but_ranked_low` is the biggest bucket
 
diff --git a/scripts/enrich_failures.py b/scripts/enrich_failures.py
new file mode 100755
index 0000000..a9e4452
--- /dev/null
+++ b/scripts/enrich_failures.py
@@ -0,0 +1,348 @@
+#!/usr/bin/env python3
+"""Enrich a benchmark run with per-instance diagnostics for failure clustering.
+
+`eval_retrieval.py --emit-failures` only sees *text-match* misses. For
+citation-grade scoring (geometric recall@5 = 0.283) we also need to
+classify instances where the answer text is in some chunk but no chunk's
+A1 range covers the ground truth — those don't show up in failures.ndjson
+at all.
+
+This script re-parses each instance's input.xlsx with both ks-xlsx-parser
+and openpyxl, then emits one row per FAILED instance (text-miss OR
+geometric-miss) with diagnostic columns chosen so post-hoc clustering is
+easy:
+
+    instance_id           question id
+    bucket_combined       both_miss / text_hit_geom_miss / text_miss_geom_hit
+    answer_position       the GT spec from dataset.json
+    gt_sheet              ground-truth sheet name (default: answer_sheet)
+    gt_cell_raw           openpyxl raw value at the first GT cell
+    gt_cell_formula       formula string if any
+    gt_cell_data_only     cached computed value if any
+    gt_in_workbook_sheets is gt_sheet in wb.sheetnames?
+    gt_in_chunked_sheets  did the parser produce any chunk on gt_sheet?
+    n_workbook_sheets     total sheets in the workbook (incl hidden)
+    n_chunked_sheets      distinct sheets we emitted chunks for
+    n_workbook_cells_in_gt   non-empty openpyxl cells in the GT range
+    chunks_on_gt_sheet    how many chunks we emitted for gt_sheet
+    chunk_bbox_on_gt_sheet bbox (min_r,min_c,max_r,max_c) over all chunks on gt_sheet
+    gt_range_bbox         GT range as (r0,c0,r1,c1)
+    text_match_rank       rank of first chunk whose text contains a GT value
+    geom_match_rank       rank of first chunk whose A1 range overlaps GT
+
+Usage:
+    python scripts/enrich_failures.py <run-dir-or-results.ndjson>
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+from typing import Any
+
+# Hoist eval_retrieval helpers so we share the *exact* normalization
+# logic — clustering by "answer present in chunk" must be apples-to-apples.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from scripts.eval_retrieval import (  # noqa: E402
+    _matches_chunk_text,
+    parse_a1,
+    parse_position_spec,
+    parse_range,
+)
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+
+
+def find_run(arg: Path) -> Path:
+    if arg.is_file():
+        return arg.parent
+    if arg.is_dir():
+        if (arg / "results.ndjson").exists():
+            return arg
+        runs = sorted(p for p in arg.glob("*/results.ndjson"))
+        if runs:
+            return runs[-1].parent
+    sys.exit(f"ERROR: no results.ndjson at {arg}")
+
+
+def overlaps(box1: tuple[int, int, int, int], box2: tuple[int, int, int, int]) -> bool:
+    r0a, c0a, r1a, c1a = box1
+    r0b, c0b, r1b, c1b = box2
+    return not (r1a < r0b or r0a > r1b or c1a < c0b or c0a > c1b)
+
+
+def chunk_bbox(chunks) -> tuple[int, int, int, int] | None:
+    boxes = []
+    for c in chunks:
+        tl = parse_a1(c.top_left_cell) if c.top_left_cell else None
+        br = parse_a1(c.bottom_right_cell) if c.bottom_right_cell else None
+        if tl and br:
+            boxes.append((tl[0], tl[1], br[0], br[1]))
+    if not boxes:
+        return None
+    return (
+        min(b[0] for b in boxes),
+        min(b[1] for b in boxes),
+        max(b[2] for b in boxes),
+        max(b[3] for b in boxes),
+    )
+
+
+def enrich(run_dir: Path, corpus: Path, out_path: Path) -> None:
+    from openpyxl import load_workbook
+
+    from ks_xlsx_parser.pipeline import parse_workbook
+
+    # Load dataset.json once — we need question text + the original
+    # answer_sheet attribution for instances where the rank scoring
+    # already considers the spec parse-resolved.
+    dataset = {str(d["id"]): d for d in
+               json.loads((corpus / "dataset.json").read_text())}
+
+    results_path = run_dir / "results.ndjson"
+    rows: list[dict[str, Any]] = []
+    for line in results_path.read_text().splitlines():
+        if line.strip():
+            rows.append(json.loads(line))
+
+    out_rows: list[dict[str, Any]] = []
+    n_failed = 0
+    for rec in rows:
+        if rec.get("error"):
+            continue
+        text_rank = rec.get("rank_of_text_match")
+        geom_rank = rec.get("rank_of_first_overlap")
+        text_hit = text_rank is not None and text_rank <= 5
+        geom_hit = geom_rank is not None and geom_rank <= 5
+        if text_hit and geom_hit:
+            continue  # not a failure either way
+        n_failed += 1
+
+        inst_id = rec["instance_id"]
+        meta = dataset.get(inst_id, {})
+        instruction = meta.get("instruction", "")
+        answer_sheet = meta.get("answer_sheet") or None
+        if answer_sheet and "," in answer_sheet:
+            answer_sheet = answer_sheet.split(",")[0].strip()
+        answer_position = rec.get("answer_position") or meta.get("answer_position") or ""
+        data_position = meta.get("data_position") or ""
+
+        # Re-parse input.xlsx — cheap (50 ms median).
+        inst_dir = corpus / "spreadsheet" / inst_id
+        input_path = inst_dir / f"1_{inst_id}_input.xlsx"
+        if not input_path.exists():
+            continue
+
+        try:
+            result = parse_workbook(path=str(input_path))
+            chunks = list(result.chunks)
+        except Exception as exc:
+            out_rows.append({
+                "instance_id": inst_id,
+                "bucket_combined": "parse_error",
+                "error": f"{type(exc).__name__}: {exc}",
+            })
+            continue
+
+        chunked_sheets = sorted({c.sheet_name for c in chunks if c.sheet_name})
+
+        # openpyxl view — both formula and data_only passes so we can
+        # tell "formula uncached" from "cell genuinely empty".
+        try:
+            wb_f = load_workbook(str(input_path), data_only=False, read_only=False)
+            wb_d = load_workbook(str(input_path), data_only=True, read_only=False)
+            wb_sheets = list(wb_f.sheetnames)
+            hidden_sheets = [s for s in wb_sheets
+                             if getattr(wb_f[s], "sheet_state", "visible") != "visible"]
+        except Exception as exc:
+            wb_sheets = []
+            hidden_sheets = []
+            wb_f = wb_d = None
+
+        # Geometric overlap is scored against data_position (the input data
+        # region the question is asking about); falls back to answer_position
+        # when the dataset didn't fill data_position in (561 of 912 instances).
+        # This mirrors eval_retrieval.py's geom_spec = data_pos or answer_pos.
+        geom_spec = data_position or answer_position
+        regions = parse_position_spec(geom_spec, answer_sheet)
+        gt_sheet = regions[0][0] if regions else answer_sheet
+        gt_range_bbox = regions[0][1] if regions else None
+
+        # Separately track whether the *answer* region is empty in input.xlsx.
+        # If so, the question is "compute X and write here" — the parser
+        # cannot possibly contain the answer text, so this is a benchmark
+        # construct, not a parser bug. We flag it as instruction_requires_execution.
+        answer_regions = parse_position_spec(answer_position, answer_sheet)
+        answer_sheet_resolved = (answer_regions[0][0]
+                                 if answer_regions else answer_sheet)
+        answer_range_bbox = answer_regions[0][1] if answer_regions else None
+        n_input_cells_in_answer_range = 0
+        if (wb_d and answer_sheet_resolved and answer_range_bbox
+                and answer_sheet_resolved in wb_d.sheetnames):
+            try:
+                ws = wb_d[answer_sheet_resolved]
+                r0, c0, r1, c1 = answer_range_bbox
+                for row in ws.iter_rows(min_row=r0, max_row=r1, min_col=c0,
+                                        max_col=c1, values_only=True):
+                    for v in row:
+                        if v is not None and str(v).strip():
+                            n_input_cells_in_answer_range += 1
+            except Exception:
+                pass
+
+        gt_cell_raw = None
+        gt_cell_formula = None
+        gt_cell_data_only = None
+        n_workbook_cells_in_gt = 0
+        if wb_f and gt_sheet and gt_sheet in wb_f.sheetnames and gt_range_bbox:
+            ws_f = wb_f[gt_sheet]
+            ws_d = wb_d[gt_sheet]
+            r0, c0, r1, c1 = gt_range_bbox
+            # First cell only — enough to know "formula vs. value".
+            try:
+                tl_cell_f = ws_f.cell(row=r0, column=c0)
+                tl_cell_d = ws_d.cell(row=r0, column=c0)
+                gt_cell_raw = tl_cell_f.value
+                if isinstance(gt_cell_raw, str) and gt_cell_raw.startswith("="):
+                    gt_cell_formula = gt_cell_raw
+                gt_cell_data_only = tl_cell_d.value
+            except Exception:
+                pass
+            # Count non-empty cells across the range
+            try:
+                for row in ws_d.iter_rows(min_row=r0, max_row=r1, min_col=c0,
+                                          max_col=c1, values_only=True):
+                    for v in row:
+                        if v is not None and str(v).strip():
+                            n_workbook_cells_in_gt += 1
+            except Exception:
+                pass
+
+        chunks_on_gt = [c for c in chunks if gt_sheet and c.sheet_name == gt_sheet]
+        gt_chunk_bbox = chunk_bbox(chunks_on_gt)
+
+        if not text_hit and not geom_hit:
+            bucket = "both_miss"
+        elif text_hit and not geom_hit:
+            bucket = "text_hit_geom_miss"
+        elif geom_hit and not text_hit:
+            bucket = "text_miss_geom_hit"
+        else:
+            bucket = "other"
+
+        sheet_chunked = (gt_sheet in chunked_sheets) if gt_sheet else None
+        sheet_in_wb = (gt_sheet in wb_sheets) if (gt_sheet and wb_sheets) else None
+        sheet_hidden = (gt_sheet in hidden_sheets) if gt_sheet else False
+
+        # Pre-named signal heuristics — informational only; clustering is
+        # still done by reading. These flags help spot patterns FAST.
+        flags: list[str] = []
+        if sheet_in_wb is False:
+            flags.append("gt_sheet_missing_from_workbook")
+        elif sheet_chunked is False:
+            flags.append("gt_sheet_present_but_not_chunked")
+        if sheet_hidden:
+            flags.append("gt_sheet_hidden")
+        if gt_cell_formula and gt_cell_data_only in (None, ""):
+            flags.append("gt_cell_uncached_formula")
+        elif gt_cell_formula:
+            flags.append("gt_cell_is_formula")
+        if (gt_range_bbox and gt_chunk_bbox and
+                not overlaps(gt_range_bbox, gt_chunk_bbox)):
+            flags.append("gt_range_outside_chunk_bbox")
+        if (gt_range_bbox and gt_chunk_bbox and
+                overlaps(gt_range_bbox, gt_chunk_bbox)):
+            # Inside the chunk bbox but no individual chunk overlaps?
+            # That's "the parser saw the right area but split it wrong".
+            any_overlap = False
+            for c in chunks_on_gt:
+                tl = parse_a1(c.top_left_cell) if c.top_left_cell else None
+                br = parse_a1(c.bottom_right_cell) if c.bottom_right_cell else None
+                if tl and br and overlaps(
+                        gt_range_bbox, (tl[0], tl[1], br[0], br[1])):
+                    any_overlap = True
+                    break
+            if not any_overlap:
+                flags.append("gt_inside_bbox_but_no_chunk_overlap")
+        if (n_workbook_cells_in_gt == 0 and gt_range_bbox):
+            flags.append("gt_range_empty_in_workbook")
+        # The big one: if answer_position is empty in input, the benchmark
+        # is asking the system to WRITE the answer. Not a parser bug.
+        if (answer_range_bbox and n_input_cells_in_answer_range == 0):
+            flags.append("instruction_requires_execution")
+        # cell rendered but truncated to a sub-range?
+        if (gt_chunk_bbox and gt_range_bbox and
+                gt_chunk_bbox[2] < gt_range_bbox[2]):
+            flags.append("chunk_bbox_rows_truncated")
+        if (gt_chunk_bbox and gt_range_bbox and
+                gt_chunk_bbox[3] < gt_range_bbox[3]):
+            flags.append("chunk_bbox_cols_truncated")
+
+        out_rows.append({
+            "instance_id": inst_id,
+            "bucket_combined": bucket,
+            "instruction": instruction[:200],
+            "answer_position": answer_position,
+            "answer_sheet": answer_sheet,
+            "gt_sheet": gt_sheet,
+            "gt_range_bbox": list(gt_range_bbox) if gt_range_bbox else None,
+            "gt_cell_raw": str(gt_cell_raw)[:120] if gt_cell_raw is not None else None,
+            "gt_cell_formula": gt_cell_formula,
+            "gt_cell_data_only":
+                str(gt_cell_data_only)[:120] if gt_cell_data_only is not None else None,
+            "n_workbook_sheets": len(wb_sheets),
+            "n_chunked_sheets": len(chunked_sheets),
+            "wb_sheets": wb_sheets,
+            "hidden_sheets": hidden_sheets,
+            "chunked_sheets": chunked_sheets,
+            "n_chunks_total": len(chunks),
+            "n_chunks_on_gt_sheet": len(chunks_on_gt),
+            "n_workbook_cells_in_gt": n_workbook_cells_in_gt,
+            "chunk_bbox_on_gt_sheet": list(gt_chunk_bbox) if gt_chunk_bbox else None,
+            "rank_of_text_match": text_rank,
+            "rank_of_first_overlap": geom_rank,
+            "flags": flags,
+            "data_position": data_position,
+            "answer_range_bbox": list(answer_range_bbox) if answer_range_bbox else None,
+            "n_input_cells_in_answer_range": n_input_cells_in_answer_range,
+        })
+
+    out_path.write_text("\n".join(json.dumps(r, separators=(",", ":")) for r in out_rows) + "\n")
+    print(f"Examined {len(rows)} instances, {n_failed} failed (text OR geom).")
+    print(f"Wrote {len(out_rows)} enriched rows to {out_path}")
+
+    # Quick histogram for sanity
+    from collections import Counter
+    bc = Counter(r["bucket_combined"] for r in out_rows)
+    fc = Counter()
+    for r in out_rows:
+        for f in r.get("flags", []):
+            fc[f] += 1
+    print("\nCombined bucket counts:")
+    for b, n in bc.most_common():
+        print(f"  {b:<30s} {n}")
+    print("\nDiagnostic flags (rows can have multiple):")
+    for f, n in fc.most_common():
+        print(f"  {f:<40s} {n}")
+
+
+def main(argv: list[str] | None = None) -> int:
+    ap = argparse.ArgumentParser(description=__doc__,
+                                 formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("path", type=Path,
+                    help="run dir, results.ndjson, or parent reports dir")
+    ap.add_argument("--corpus", type=Path,
+                    default=REPO_ROOT / "data/corpora/spreadsheetbench/all_data_912_v0.1")
+    ap.add_argument("--out", type=Path, default=None,
+                    help="output path (default: <run>/enriched_failures.ndjson)")
+    args = ap.parse_args(argv)
+
+    run_dir = find_run(args.path)
+    out = args.out or (run_dir / "enriched_failures.ndjson")
+    enrich(run_dir, args.corpus, out)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())

From 186d468839cf581926f2bb7ee6aa474e17668e19 Mon Sep 17 00:00:00 2001
From: Arnav Goel <arnavgoel2@gmail.com>
Date: Tue, 19 May 2026 18:09:12 -0400
Subject: [PATCH 4/4] chore: gitignore docs/planning/ (internal recall
 worklist)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The recall→0.90 cluster TODO files were coordination notes for internal
agents, not material the open-source community needs. Keep them locally
(disk-side) but stop tracking them so they don't leak via the public
repo. docs/recall-investigation.md + docs/benchmark-local-setup.md
stay — those have value for external contributors reproducing benchmark
results.

Removed two now-broken cross-references from recall-investigation.md
that pointed at the planning dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../00-benchmark-spec-parser-bugs.md          | 119 -------------
 .../recall-90/01-array-formula-rendering.md   | 130 --------------
 .../02-chunk-range-vs-text-mismatch.md        | 150 ----------------
 .../03-cell-drop-or-uncached-formula.md       | 165 ------------------
 .../04-single-chunk-multi-region-sheet.md     | 138 ---------------
 .../05-out-of-scope-execution-instances.md    | 136 ---------------
 docs/planning/recall-90/README.md             | 137 ---------------
 7 files changed, 975 deletions(-)
 delete mode 100644 docs/planning/recall-90/00-benchmark-spec-parser-bugs.md
 delete mode 100644 docs/planning/recall-90/01-array-formula-rendering.md
 delete mode 100644 docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md
 delete mode 100644 docs/planning/recall-90/03-cell-drop-or-uncached-formula.md
 delete mode 100644 docs/planning/recall-90/04-single-chunk-multi-region-sheet.md
 delete mode 100644 docs/planning/recall-90/05-out-of-scope-execution-instances.md
 delete mode 100644 docs/planning/recall-90/README.md

diff --git a/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md b/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md
deleted file mode 100644
index 8d33b7a..0000000
--- a/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md
+++ /dev/null
@@ -1,119 +0,0 @@
-# 00 · Benchmark harness fails to parse malformed `answer_position` strings
-
-**Status:** 🆓 free to claim
-**Slice:** A (benchmark harness — `scripts/eval_retrieval.py`)
-**Independent of:** all other TODOs in this folder
-
-## What it looks like
-
-Eight instances in the 200-sample (seed=1337) score as recall@5 misses
-even though the parser produced perfectly reasonable chunks. The
-benchmark harness's `parse_position_spec()` returns no regions because
-the `answer_position` string in `dataset.json` doesn't match
-`SHEET_RANGE_RE`. The score code then sees `data_regions=0`, marks the
-instance as "no scorable target", and the recall denominator counts it
-as a miss. Net effect: the parser is blamed for benchmark-spec parsing
-bugs.
-
-Concrete malformed strings observed:
-
-| Instance | `answer_position` string | What's wrong |
-|---|---|---|
-| `50442`  | `RESULTS 1'!G17`                                  | unmatched closing quote, opening quote missing |
-| `49490`  | `Sheet1'!H3:H58`                                  | same |
-| `49036`  | `Dashboard'!B8`                                   | same |
-| `55427`  | `Compiled and located schools da'!B2:B1461`       | same + long sheet name |
-| `48975`  | `Output'!B11:B17`                                 | same |
-| `37456`  | `G12：J15`                                        | fullwidth Chinese colon `：` instead of `:` |
-| `184-6`  | `'RAWDATA!'A1:P6,'OUTPUT!'A1:P6'`                 | quote between sheet and `!`, comma-joined |
-| `164-22` | `'YEAR1!'A1:G1478,'YEAR2!'A1:G1480'`              | same shape as 184-6 |
-
-(`374-9` may also belong here — investigate.)
-
-## Diagnostic signature
-
-In `enriched_failures.ndjson`, members of this cluster have:
-
-* `gt_sheet == null` (spec parser couldn't extract the sheet name), OR
-* `gt_sheet` wrapped in apostrophes (`"'Sheet1'"`), AND
-* the workbook *does* contain a sheet with the bare name.
-
-```bash
-python scripts/enrich_failures.py tests/benchmarks/reports/retrieval
-jq -c 'select(.gt_sheet == null or (.gt_sheet | startswith("'")))' \
-    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
-```
-
-## File scope
-
-You may touch:
-
-* `scripts/eval_retrieval.py` — `SHEET_RANGE_RE`, `parse_position_spec`,
-  `_normalize_value_for_match`. This is where the bugs live.
-* `scripts/enrich_failures.py` — must stay consistent with the harness;
-  re-run after editing.
-* Add new tests under `tests/test_eval_retrieval.py` (or similar — file
-  doesn't exist yet; create it).
-
-Do **NOT** touch any `src/ks_xlsx_parser/*` code for this cluster. If
-the urge to "make the parser tolerate these patterns too" arises,
-that's a separate TODO and a category error here — these are dataset
-typos, not parser inputs.
-
-## Acceptance criteria
-
-On `python scripts/eval_retrieval.py --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 --parsers ks --sample 200 --seed 1337 --emit-failures`:
-
-1. All 8 instances listed above stop appearing as recall@5 misses
-   (text or geometric) on the diagnostic ndjson.
-2. `geometric@5` rises by ≥ 4 pp (each fixed instance flips from miss
-   to hit since the chunks already overlap the correct region).
-3. No previously-passing instance becomes a miss. Diff the
-   `results.ndjson` before and after — `pass → fail` count must be 0.
-4. `python scripts/enrich_failures.py` shows zero rows with
-   `gt_sheet == null` from the eight listed instances.
-
-## Failing test sketch
-
-```python
-# tests/test_eval_retrieval_spec_parser.py
-import pytest
-from scripts.eval_retrieval import parse_position_spec
-
-@pytest.mark.parametrize("spec, expected_sheet", [
-    ("Dashboard'!B8", "Dashboard"),
-    ("Sheet1'!H3:H58", "Sheet1"),
-    ("'RAWDATA!'A1:P6", "RAWDATA"),
-    ("G12：J15", None),  # this one keeps default_sheet; sheet unaltered
-])
-def test_unmatched_quotes_resolve_sheet(spec, expected_sheet):
-    regions = parse_position_spec(spec, default_sheet="DEFAULT")
-    assert regions, f"no regions for {spec!r}"
-    if expected_sheet is not None:
-        assert regions[0][0] == expected_sheet
-
-def test_fullwidth_colon_in_range():
-    # Excel-China dataset entries occasionally use the fullwidth colon.
-    regions = parse_position_spec("G12：J15", default_sheet="S")
-    assert regions and regions[0][1] == (12, 7, 15, 10)
-```
-
-These should fail on `main`. They pass when the fix lands.
-
-## Pitfalls
-
-* `'YEAR1!'A1:G1478,'YEAR2!'A1:G1480'` is multi-region. Make sure the
-  comma-split path still returns BOTH regions after the fix.
-* `_normalize_value_for_match` is not involved here — don't refactor it.
-* The current regex `SHEET_RANGE_RE` uses a backreference for matching
-  quotes; tightening the closing quote to "optional, but only if opening
-  present" is fragile. Consider a two-pass approach: strip stray
-  apostrophes, then re-attempt the strict regex.
-
-## Repro fixtures
-
-The eight instances live under
-`data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/<id>/`.
-Quote one as a unit-test data point — do NOT vendor the corpus into
-the repo. If a snapshot of the malformed strings is wanted, hand-copy
-them into a Python fixture (they're ≤ 100 chars each).
diff --git a/docs/planning/recall-90/01-array-formula-rendering.md b/docs/planning/recall-90/01-array-formula-rendering.md
deleted file mode 100644
index a9edd72..0000000
--- a/docs/planning/recall-90/01-array-formula-rendering.md
+++ /dev/null
@@ -1,130 +0,0 @@
-# 01 · Array-formula cells render as `<openpyxl.worksheet.formula.ArrayFormula object>`
-
-**Status:** 🆓 free to claim
-**Slice:** B + D (`parsers/cell_parser.py`, `rendering/text_renderer.py`)
-**Independent of:** 00, 02, 04 — partially overlapping with 03 on
-`text_renderer.py`; coordinate via the README table.
-
-## What it looks like
-
-Two instances (`43026`, `59969`) hit this exact pattern. The
-ground-truth answer cell contains an Excel **array formula** (an
-`{= ...}` formula entered with Ctrl-Shift-Enter that returns multiple
-cells). openpyxl returns these as an `ArrayFormula` Python object, NOT
-a string and NOT a number. Our cell extractor stuffs the repr of that
-object into the cell's value, so the rendered chunk text contains
-literal:
-
-```
-<openpyxl.worksheet.formula.ArrayFormula object at 0x10c0767b0>
-```
-
-This obviously kills both text-match (the answer values are nowhere in
-the chunk) and any downstream LLM consumption.
-
-Instance `43026` (`summary!D10`):
-
-```
-gt_cell_raw  = '<openpyxl.worksheet.formula.ArrayFormula object at 0x10c0767b0>'
-gt_cell_formula = None    ← our heuristic missed it because raw_value isn't a str starting with '='
-gt_cell_data_only = (cached value, if any)
-```
-
-## Diagnostic signature
-
-```bash
-jq -c 'select(.gt_cell_raw | tostring | contains("ArrayFormula"))' \
-    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
-```
-
-Likely a much larger cluster at full-corpus scale — most "tabular
-report" workbooks use array formulas. Two on the 200-sample is the floor.
-
-## File scope
-
-You may touch:
-
-* `src/ks_xlsx_parser/parsers/cell_parser.py` — wherever the raw value
-  is extracted from `openpyxl.cell.Cell.value`. Detect `ArrayFormula`,
-  pull the formula text out of it (`obj.text` on openpyxl) and treat it
-  as a formula cell.
-* `src/ks_xlsx_parser/rendering/text_renderer.py::_cell_render_value`
-  — make sure array-formula cells render their *cached value* (from the
-  `data_only` workbook pass) when one exists; fall back to the formula
-  expression as a string only if no cached value.
-* `src/ks_xlsx_parser/models/cell.py` — add an `is_array_formula: bool`
-  field if useful for downstream consumers.
-
-Do NOT add a new evaluator for array formulas — out of scope (and
-covered indirectly by the cached-value path).
-
-## Acceptance criteria
-
-1. Build a tiny array-formula fixture inside `tests/conftest.py` (the
-   existing programmatic-fixture pattern). Use `openpyxl`'s
-   `ArrayFormula` constructor; populate the `data_only` workbook with a
-   plausible cached value.
-2. Add a test that asserts the chunk's `render_text` for that fixture
-   contains the cached value AND does **not** contain the substring
-   `ArrayFormula object`.
-3. On the 200-sample seed=1337 rerun, instances `43026` and `59969`
-   move out of the failure set on the text-match metric IF the answer
-   values exist as cached data in the input (verify with openpyxl
-   first; if they don't, the instance is `instruction_requires_execution`
-   and you can't move it — note it in the PR).
-4. No previously-passing crossval test regresses (`make test`).
-
-## Failing test sketch
-
-```python
-# tests/test_array_formula_rendering.py
-import openpyxl
-from openpyxl.worksheet.formula import ArrayFormula
-from ks_xlsx_parser.pipeline import parse_workbook
-
-def test_array_formula_renders_cached_value(tmp_path):
-    p = tmp_path / "af.xlsx"
-    wb = openpyxl.Workbook()
-    ws = wb.active
-    ws["A1"] = 10
-    ws["A2"] = 20
-    ws["A3"] = 30
-    # Array formula in B1:B3 summing the row; cache a plausible value.
-    af = ArrayFormula("B1:B3", "=A1:A3*2")
-    ws["B1"] = af
-    # openpyxl writes the cached value separately when data_only=True is read;
-    # for fixture purposes, write the file once and patch cached values via
-    # a second openpyxl pass.
-    wb.save(p)
-
-    chunks = parse_workbook(path=str(p)).chunks
-    assert chunks, "no chunks produced"
-    text = chunks[0].render_text
-    assert "ArrayFormula" not in text, "raw ArrayFormula object leaked"
-    # Once the fix lands, expect cached values 20 / 40 / 60 to render.
-```
-
-## Pitfalls
-
-* openpyxl's behaviour for `ArrayFormula` differs between read-only
-  and normal mode. Make sure both paths in `workbook_parser.py` agree
-  on how the cell is surfaced.
-* If the input was saved by LibreOffice or generated programmatically
-  without computing array results, `cached_value` will be `None`. In
-  that case there's no useful retrieval signal — surfacing the formula
-  expression itself is acceptable as a fallback, but it must be a
-  string, not `repr()`.
-* "Spilled" dynamic arrays (Excel 365's new `=A1:A3*2` without
-  Ctrl-Shift-Enter) are a related but distinct case. Note in your PR
-  if you handled both or only the classic array formula.
-
-## Repro
-
-```bash
-python -c "
-from openpyxl import load_workbook
-wb = load_workbook('data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/43026/1_43026_input.xlsx')
-ws = wb['summary']
-print(type(ws['D10'].value), repr(ws['D10'].value))
-"
-```
diff --git a/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md b/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md
deleted file mode 100644
index bd7500b..0000000
--- a/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md
+++ /dev/null
@@ -1,150 +0,0 @@
-# 02 · `text_hit_geom_miss` — chunk contains answer text but range doesn't overlap GT
-
-**Status:** 🆓 free to claim
-**Slice:** E (`annotation/block_splitter.py`, `analysis/pattern_splitter.py`, `analysis/light_block_detector.py`)
-**Independent of:** 00, 01, 03, 04
-
-## What it looks like
-
-Seven actionable instances on the 200-sample
-(`61-4`, `353-29`, `382-29`, `80-42`, `334-11`, `495-31`, `462-45`)
-where:
-
-* `rank_of_text_match ≤ 5` — some chunk's `render_text` contains the
-  answer value(s).
-* `rank_of_first_overlap is None` — but no chunk's claimed A1 range
-  overlaps the ground-truth range.
-* `n_chunks_on_gt_sheet ≥ 1` — we DID emit chunks on the correct sheet.
-
-So the parser knows what data exists on the sheet (text is there), but
-the **range bookkeeping** on the chunk it emitted is wrong: the claimed
-`top_left_cell` / `bottom_right_cell` excludes the answer cell even
-though the chunk's text includes it.
-
-This is the **citation-grade killer**. ks-backend's UI will highlight a
-region that doesn't actually contain the answer the LLM cited — looks
-buggy to the user.
-
-Example — instance `61-4`:
-```
-ans = 'output'!A2:G15
-chunks on 'output': 1 chunk
-chunk text contains: the answer string (text_rank=1)
-chunk's reported A1 range: NOT A2:G15
-```
-
-## Hypothesis (deliberately under-specified)
-
-Block merge/split paths widen the *text* (by absorbing adjacent cells)
-faster than they update `cell_range.bottom_right`, OR a pattern-split
-narrows the `cell_range` to exclude the rows the splitter just lifted
-into a child block while the rendered text still contains them. Find
-the actual code path empirically.
-
-## Diagnostic signature
-
-```bash
-jq -c 'select(.bucket_combined == "text_hit_geom_miss"
-              and (.flags | contains(["instruction_requires_execution"]) | not)
-              and .n_chunks_on_gt_sheet > 0)' \
-    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
-```
-
-For each match, the comparison that matters is `gt_range_bbox` vs each
-chunk on the GT sheet's `top_left_cell`/`bottom_right_cell`. Print the
-chunk's `render_text` and confirm it DOES contain the answer values
-that openpyxl reads at `answer_position`.
-
-## File scope
-
-You may touch:
-
-* `src/ks_xlsx_parser/models/block.py` — invariants on `BlockDTO.cell_range`.
-* `src/ks_xlsx_parser/annotation/block_splitter.py`
-* `src/ks_xlsx_parser/analysis/pattern_splitter.py`
-* `src/ks_xlsx_parser/analysis/light_block_detector.py`
-* `src/ks_xlsx_parser/chunking/chunker.py::_block_to_chunk` — at chunk
-  creation time, you can defensively clip `cell_range` to the bounding
-  box of cells that actually contributed text. That's a safety net even
-  if the upstream bug isn't located.
-
-Do NOT touch parsing (`parsers/*`) — the inputs are correct;
-downstream block bookkeeping is wrong.
-
-## Acceptance criteria
-
-1. Add an invariant test (`tests/test_structural_invariants.py`
-   pattern): for every chunk, the cells whose values appear in
-   `render_text` MUST lie within
-   `[chunk.top_left_cell, chunk.bottom_right_cell]`. Test should fail
-   on `main` for at least 3 of the 7 cluster instances.
-2. After the fix, all 7 instances flip from `text_hit_geom_miss` to
-   `both_hit` (`rank_of_first_overlap ≤ 5`).
-3. `geometric@5` on the 200-sample seed=1337 rises by ≥ 3 pp.
-4. `recall_text@5` does NOT drop (tightening ranges shouldn't drop
-   text — if it does, the chunker is also dropping text in step with
-   the range clip, which is a separate bug).
-5. `table_fragmentation_rate` does NOT rise — if you fixed range drift
-   by splitting one chunk into many, fragmentation will spike. That
-   would solve cluster 02 by creating cluster 04 — net zero. Either
-   keep fragmentation flat or coordinate with whoever owns 04.
-
-## Failing test sketch
-
-```python
-# tests/test_chunk_range_invariants.py
-import re
-from openpyxl.utils import column_index_from_string
-from ks_xlsx_parser.pipeline import parse_workbook
-from ks_xlsx_parser.models.common import col_letter_to_number
-
-A1_RE = re.compile(r"^([A-Z]+)(\d+)$")
-
-def _parse_a1(a1):
-    m = A1_RE.match(a1)
-    return (int(m.group(2)), col_letter_to_number(m.group(1)))
-
-def test_chunk_range_covers_all_rendered_cells(tmp_path):
-    # Use one of the cluster instances or a hand-built fixture:
-    # the assertion is "for every chunk, the cell coordinates that appear
-    # in render_text fall inside the claimed range box."
-    chunks = parse_workbook(
-        path="data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/61-4/1_61-4_input.xlsx"
-    ).chunks
-    for c in chunks:
-        if not c.top_left_cell or not c.bottom_right_cell:
-            continue
-        r0, col0 = _parse_a1(c.top_left_cell)
-        r1, col1 = _parse_a1(c.bottom_right_cell)
-        # Pull A1 references that appear in the chunk header (e.g. "[Sheet1!A2:G15]")
-        # and assert each is inside the claimed range. Adapt to whatever
-        # renderer markers your current text actually uses.
-        for ref in re.findall(r"\b([A-Z]+)(\d+)\b", c.render_text or ""):
-            col = col_letter_to_number(ref[0])
-            row = int(ref[1])
-            assert r0 <= row <= r1, f"row {row} outside {c.top_left_cell}:{c.bottom_right_cell}"
-            assert col0 <= col <= col1, f"col {ref[0]} outside {c.top_left_cell}:{c.bottom_right_cell}"
-```
-
-## Pitfalls
-
-* The renderer prints A1 refs in its block header (e.g.
-  `[Sheet1!A2:G15] (table)`) — those are the chunk's CLAIMED range, not
-  a cell reference. The invariant test should look at rendered cell
-  contents (table rows), not at the block header.
-* Some text rendered into a chunk doesn't have an A1 reference (e.g.
-  pure values inside a table grid). The invariant has to use the
-  block's underlying `cells` collection, not regex on text.
-* If the merge widens the chunk's range past adjacent empty rows, you
-  may LOSE geometric overlap on those by clipping. That's fine here —
-  the cluster is text_hit/geom_miss, so clipping to actual content is
-  what's needed.
-
-## Repro
-
-```bash
-python scripts/triage_recall.py tests/benchmarks/reports/retrieval \
-    --bucket present_but_ranked_low --examples 0   # to ensure no overlap
-# Then manually read enriched_failures.ndjson and pick instance 61-4
-# or 353-29 — both are pure text_hit_geom_miss cases.
-```
diff --git a/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md b/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md
deleted file mode 100644
index ef7161d..0000000
--- a/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md
+++ /dev/null
@@ -1,165 +0,0 @@
-# 03 · Chunk's range covers GT but rendered text lacks the answer values
-
-**Status:** 🆓 free to claim
-**Slice:** B + C + D (`parsers/cell_parser.py`, `parsers/sheet_parser.py`, `rendering/text_renderer.py`)
-**Independent of:** 00, 02, 04 — partial overlap with 01 on `text_renderer.py`; coordinate via README.
-
-## What it looks like
-
-Up to ten actionable instances on the 200-sample (`CF_3712`, `57033`,
-`53-12`, `55794`, `45063`, `36842`, `384-4`, `262-17`, `189-9`, `48799`)
-where:
-
-* `rank_of_first_overlap ≤ 5` — chunk's claimed range overlaps GT.
-* `rank_of_text_match is None` — but the chunk's rendered text does NOT
-  contain the answer values.
-
-The parser knows the block exists (range is right) but it dropped or
-mis-rendered the cells whose values should appear in the chunk text.
-
-**Two root causes nested inside this cluster — confirm by inspection:**
-
-### Sub-cluster 3a: cell-drop within block
-
-The block's `cells` collection is missing the answer cell, or the
-renderer's grid-construction loop skips it (e.g. when the cell sits in
-a row otherwise mostly empty, and a sparsity filter culls it).
-
-Instance `CF_3712`: chunk on `Purchases` covers M3:M5 geometrically;
-the answer header value (`gt_cell_raw = "Product "`) doesn't render.
-
-### Sub-cluster 3b: uncached formula renders as `=A1+B1`, not the value
-
-Cell stores a formula; the workbook was saved without computing it
-(common for LibreOffice / programmatic generation). `data_only` load
-returns `None` for that cell. The renderer falls back to the formula
-source. Retrieval for the *value* (e.g. "1272") fails because the chunk
-text contains "=A1+B1" instead.
-
-Instance `48799` (`Sheet2!Z2` — a 24-deep nested IF/VLOOKUP). Same
-shape: formula text in chunk, value missing.
-
-> ⚠️ **Some of the named instances may be unscorable** — e.g. `189-9`
-> and `53-12` have answer cells that are all `None` in `answer.xlsx`,
-> meaning the benchmark expects the system to write nothing. In that
-> case `answer_cell_values=[]` and `rank_text=None` is automatic,
-> regardless of what the chunk renders. The first thing to do in this
-> TODO is to filter the ten instances through openpyxl to find the
-> truly unfixable ones; report that as part of the PR.
-
-## Diagnostic signature
-
-```bash
-jq -c 'select(.bucket_combined == "text_miss_geom_hit"
-              and (.flags | contains(["instruction_requires_execution"]) | not)
-              and .n_chunks_on_gt_sheet >= 1)' \
-    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
-```
-
-Then run for each candidate:
-
-```python
-from openpyxl import load_workbook
-wb = load_workbook(input_path, data_only=True)
-ws = wb[answer_sheet]
-# read answer.xlsx cells at answer_position — if all None, drop this instance.
-```
-
-Instances whose `answer.xlsx` GT cells are all `None` move to TODO 05
-(out-of-scope; benchmark-scoring issue).
-
-## File scope
-
-You may touch:
-
-* `src/ks_xlsx_parser/parsers/cell_parser.py` — formula vs cached-value
-  selection. Add a code path: if cell has a formula AND no cached
-  value in the `data_only` pass, mark as `formula_uncached` and emit
-  both the formula text AND any reachable proxy (e.g. blank cell with
-  formula reference, instead of literal formula in the grid).
-* `src/ks_xlsx_parser/parsers/sheet_parser.py` — make sure cells with
-  string values that look like formulas (`startswith('=')`) get
-  classified as formula cells.
-* `src/ks_xlsx_parser/rendering/text_renderer.py::_cell_render_value`
-  — for formula cells, prefer cached value; if cached is None, render
-  the formula source ONLY (not `None`, not the formula display).
-
-Do NOT add a formula evaluator. That's a separate, larger effort.
-
-## Acceptance criteria
-
-1. After culling unscorable instances (those with all-None
-   `answer.xlsx` cells), at least **5 of the remaining instances** flip
-   from `text_miss_geom_hit` to `both_hit` (rank_of_text_match ≤ 5).
-2. `recall_text@5` on the 200-sample seed=1337 rises by ≥ 2 pp.
-3. The unscorable instances are documented in the PR description with
-   a one-line "this instance has no answer values in answer.xlsx,
-   moving to TODO 05".
-4. A new test fixture in `tests/conftest.py` (a workbook with a
-   `=SUM(A1:A3)` cell saved WITHOUT cached values) plus a regression
-   test that asserts the chunk render_text contains `=SUM(A1:A3)`
-   verbatim (not the openpyxl repr of the formula object).
-
-## Failing test sketch
-
-```python
-# tests/test_formula_rendering_unfilled.py
-import openpyxl
-from ks_xlsx_parser.pipeline import parse_workbook
-
-def test_uncached_formula_renders_formula_source(tmp_path):
-    p = tmp_path / "uncached.xlsx"
-    wb = openpyxl.Workbook()
-    ws = wb.active
-    ws["A1"] = 10
-    ws["A2"] = 20
-    ws["A3"] = 30
-    ws["B1"] = "=SUM(A1:A3)"
-    wb.save(p)  # NO data_only refresh — B1's cached value remains None.
-
-    chunks = parse_workbook(path=str(p)).chunks
-    assert chunks, "no chunks"
-    text = chunks[0].render_text
-    # Whichever is acceptable: the formula source verbatim, OR the
-    # computed value via the formula engine. NOT a None/empty cell.
-    assert ("=SUM(A1:A3)" in text) or ("60" in text), (
-        "uncached formula cell dropped from render_text:\n" + text
-    )
-```
-
-## Pitfalls
-
-* `_cell_render_value` already has a branch for booleans, dates,
-  integer-valued floats. Don't break that path while adding the formula
-  fallback.
-* If you add a formula-source string to render_text, make sure the
-  retrieval text-match logic in `eval_retrieval.py::_normalize_value_for_match`
-  still works on numeric values that DO have cached results — adding
-  formula sources should AUGMENT, not replace, the cached-value path.
-* Some "cell drop" cases are actually merged-cell handling: a value is
-  on the top-left of a merged region but the renderer surfaces only an
-  empty placeholder for the merged cells. Distinct from this cluster —
-  if you find one, file it as a new TODO.
-* Don't try to evaluate formulas — Excel's calc engine is a swamp and
-  `formula/formula_parser.py` only parses, doesn't compute. The
-  acceptable fallback is "render the formula source so an LLM can read
-  it and reason."
-
-## Repro
-
-```bash
-# Inspect one candidate end-to-end.
-python << 'EOF'
-from openpyxl import load_workbook
-from ks_xlsx_parser.pipeline import parse_workbook
-p = "data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/CF_3712/1_CF_3712_input.xlsx"
-wb = load_workbook(p, data_only=True)
-print("M3:M5 on", wb.sheetnames[0])
-for r in range(3,6):
-    print(" ", wb.active.cell(row=r,column=13).value)
-r = parse_workbook(path=p)
-for c in r.chunks:
-    print(c.sheet_name, c.top_left_cell, c.bottom_right_cell)
-    print((c.render_text or "")[:600])
-EOF
-```
diff --git a/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md b/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md
deleted file mode 100644
index 97cd879..0000000
--- a/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md
+++ /dev/null
@@ -1,138 +0,0 @@
-# 04 · Whole-sheet single chunk dilutes the embedding
-
-**Status:** 🆓 free to claim
-**Slice:** F (`chunking/chunker.py`, `chunking/segmenter.py`, `analysis/light_block_detector.py`, `analysis/table_grouper.py`)
-**Independent of:** 00, 01, 02, 03
-
-## What it looks like
-
-Two clear instances on the 200-sample (`184-6`, `374-9`); the pattern
-extends to many more on the full corpus where a sheet contains a
-single large data table with a few headers and the chunker emits ONE
-chunk covering A1:end. Note: TODO 02 also has instances where the
-single-chunk-per-sheet pattern shows up (`53-12`, `189-9`,
-`334-11`, `353-29`, `382-29`, `462-45`, `495-31`, `CF_3712`). Treat
-those as **second-order beneficiaries**: when 02 lands first, the
-chunk's range tightens and may already split the over-broad chunk.
-When 04 lands first, the chunks get smaller and 02's range invariant
-becomes easier to satisfy. Order is reviewer's choice.
-
-Symptoms:
-
-* `n_chunks_on_gt_sheet == 1` and the chunk spans most of the sheet.
-* The answer values are present in the chunk's render_text BUT rank
-  >5 (or rank=1 by luck — at recall@1 the model picks based on the
-  bag-of-everything embedding). The recall@1 vs recall@5 gap of 12.4 pp
-  in the v0.2.0 numbers is consistent with this.
-* `summary.md` shows `table_fragmentation_rate ≈ 0` on these sheets
-  (which sounds good but means we're under-splitting, not perfectly
-  segmenting).
-
-This is the **chunk-granularity** problem. The block-detection pipeline
-fuses everything into one block; the chunker has no token cap and
-emits the block verbatim.
-
-## Diagnostic signature
-
-```bash
-jq -c 'select(.n_chunks_on_gt_sheet == 1 and .n_chunks_total <= 2
-              and (.flags | contains(["instruction_requires_execution"]) | not))' \
-    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
-```
-
-Also: dump the per-chunk `token_count` (already computed but unused in
-chunker.py). Any chunk over 800 tokens is over-broad on this corpus.
-
-## File scope
-
-You may touch:
-
-* `src/ks_xlsx_parser/chunking/chunker.py` — introduce a chunk-size
-  cap. Suggested target: ~512 tokens. When a block exceeds, split the
-  block into row-groups (preserving headers in each group), with each
-  group becoming its own chunk. `prev_chunk_id` / `next_chunk_id` is
-  already wired for navigation.
-* `src/ks_xlsx_parser/chunking/segmenter.py` — segmenter logic if any
-  granularity decisions live there.
-* `src/ks_xlsx_parser/analysis/light_block_detector.py` and
-  `analysis/table_grouper.py` — investigate whether *over*-merging
-  upstream is causing the one-block-per-sheet symptom. If the table
-  grouper is collapsing 3 logical tables into 1, the upstream fix is
-  better than splitting at chunk-emit time.
-
-Do NOT touch the ranking/embedding step (`scripts/eval_retrieval.py`)
-or any retrieval logic.
-
-## Acceptance criteria
-
-1. After landing, no chunk on the 200-sample seed=1337 run exceeds
-   ~800 tokens (= ~3200 characters) by default. Allow override.
-2. `table_fragmentation_rate` may rise (this is desired up to ~0.5 —
-   purposeful row-group splits). It must NOT cause a regression in
-   geometric@5 — each row-group chunk's claimed range must cover only
-   its rows.
-3. `recall_text@5` rises by ≥ 3 pp on the 200-sample.
-4. `recall_text@1` rises by ≥ 4 pp (smaller chunks = sharper embeddings
-   = top-1 sharpens).
-5. The two named instances (`184-6`, `374-9`) flip to `both_hit`. Note
-   that `184-6` is also subject to TODO 00's harness fix — order may
-   matter.
-
-## Failing test sketch
-
-```python
-# tests/test_chunk_size_cap.py
-from ks_xlsx_parser.pipeline import parse_workbook
-
-def test_no_chunk_exceeds_token_budget(tmp_path):
-    # Generate a workbook with a 1000-row table. Without a cap the
-    # chunker emits one giant chunk; with the cap it should emit several.
-    import openpyxl
-    p = tmp_path / "big.xlsx"
-    wb = openpyxl.Workbook()
-    ws = wb.active
-    ws.append(["id", "name", "value"])
-    for i in range(1, 1001):
-        ws.append([i, f"name-{i}", i * 7])
-    wb.save(p)
-
-    chunks = parse_workbook(path=str(p)).chunks
-    assert len(chunks) >= 2, "1000-row table should produce ≥2 chunks after the cap lands"
-    # Each chunk's claimed range should be a contiguous row block on the same sheet.
-    for c in chunks:
-        assert c.sheet_name and c.top_left_cell and c.bottom_right_cell
-    # No chunk should be larger than the budget (in render-text characters).
-    for c in chunks:
-        assert len(c.render_text or "") <= 4000, f"chunk too big: {len(c.render_text)} chars"
-```
-
-## Pitfalls
-
-* **Don't strand headers.** The current chunker preserves table headers
-  in render_text. When splitting a table into row groups, every group
-  needs the header rows OR a header-summary block so an LLM can
-  interpret column meanings.
-* **Range bookkeeping** — when you split, the child chunks' A1 ranges
-  MUST be tight to their row subset. Otherwise you create
-  geometric_no_overlap failures (cluster 02 in reverse).
-* **`token_count` is character-based** (`len(text) // 4`). Use the
-  same metric for the cap to avoid a circular dependency on a tokenizer.
-* **Don't split horizontally** (column subsets). Excel tables are
-  usually wider than embeddings need but column-split chunks have no
-  natural boundary and reading order breaks.
-
-## Coordinating with TODO 02
-
-If both 02 and 04 are in flight:
-- 02 is *range-tightening* (semantically correct ranges).
-- 04 is *granularity* (more chunks per sheet).
-- The risk is they fight over `_block_to_chunk`. Decide which one
-  emits the final A1 range and document the contract. Suggested:
-  04 splits blocks into sub-blocks WITH tight ranges, then 02's
-  invariant test passes for the sub-blocks.
-
-## Repro
-
-Use a synthetic 1000-row workbook (see test sketch) — much faster to
-iterate on than the corpus instances. Once your local fixture passes,
-run the 200-sample to confirm corpus-wide impact.
diff --git a/docs/planning/recall-90/05-out-of-scope-execution-instances.md b/docs/planning/recall-90/05-out-of-scope-execution-instances.md
deleted file mode 100644
index 0133d3a..0000000
--- a/docs/planning/recall-90/05-out-of-scope-execution-instances.md
+++ /dev/null
@@ -1,136 +0,0 @@
-# 05 · Out-of-scope: benchmark instances that require execution, not retrieval
-
-**Status:** 🆓 free to claim (scoring/harness work — **NOT a parser fix**)
-**Slice:** A (`scripts/eval_retrieval.py`, `scripts/enrich_failures.py`)
-**Independent of:** 01, 02, 03, 04 — partial overlap with 00 on
-`eval_retrieval.py`; coordinate via README.
-
-## Why this exists
-
-127 of 200 instances in the seed=1337 sample (≈ 63%) have
-`instruction_requires_execution = True` — the `answer_position` in the
-INPUT spreadsheet is empty because the question literally asks the
-system to *compute and write* a value there.
-
-Example instructions from this class:
-
-* "Round the value 367.5 to the nearest $5 and put the result in C2."
-* "Sum each row of B:F and fill column G."
-* "Fix the broken formulas in D2:D613."
-
-A retrieval-grade parser cannot satisfy these instances. The answer
-doesn't exist in `input.xlsx` to be retrieved. Including these
-instances in the headline `recall_text@5` number drags it down ~25
-points without representing any parser quality signal.
-
-**This TODO is not a parser improvement.** It's a benchmark-scoring
-correction so that subsequent parser work has a clean signal.
-
-## What the headline numbers actually mean today
-
-On the 200-sample seed=1337:
-
-|                         | All 200 | In-scope (73) |
-|-------------------------|---------|---------------|
-| Text recall@5           | 0.635   | 0.59          |
-| Geometric recall@5      | 0.310   | (compute it)  |
-| `instruction_requires_execution` | 127 of 200 | 0 (excluded) |
-
-The "in-scope" column is the metric the planning roadmap targets. Today
-it's hidden inside the eval output; this TODO surfaces it as a
-first-class metric.
-
-## What to ship
-
-1. **A scoring filter switch** in `scripts/eval_retrieval.py`.
-   - New optional `--exclude-execution-instances` flag (or compute both
-     filtered and unfiltered metrics every run).
-   - Definition of "execution instance": at run time, peek at
-     `input.xlsx[answer_sheet][answer_position]`. If the union of
-     non-empty cells in that range is empty, mark the instance as
-     `instruction_requires_execution` and report it in a separate
-     bucket. **Do this once per instance, not per parser** — the
-     classification is parser-independent.
-2. **Two recall numbers** in `summary.json` / `summary.md`:
-   - `recall_text@5_all` (current behaviour — count every instance).
-   - `recall_text@5_in_scope` (denominator excludes
-     execution instances).
-3. **History tracking**: include both numbers in
-   `scripts/append_bench_history.py` so the trend chart shows the
-   in-scope number — that's the target for the 0.90 goal.
-4. **README + recall-investigation.md update** so the in-scope number
-   is named and called out as the gate. The 0.90 target is on
-   *in-scope* recall, not all-instances recall.
-5. **Out-of-scope summary file** alongside the run reports:
-   `tests/benchmarks/reports/retrieval/<stamp>/out_of_scope.txt` listing
-   the instance IDs that were filtered, so the filter's behaviour is
-   auditable. If somebody disagrees with the classification, they can
-   diff the list.
-
-## File scope
-
-You may touch:
-
-* `scripts/eval_retrieval.py` — add classification step BEFORE
-  `score_instance` (so unfiltered metric still works).
-* `scripts/enrich_failures.py` — make `instruction_requires_execution`
-  the canonical source of truth (it already detects this).
-* `scripts/append_bench_history.py` — record both numbers per run.
-* `scripts/triage_recall.py` — by default exclude out-of-scope rows.
-* `docs/planning/recall-90/README.md` — update target row.
-* `docs/recall-investigation.md` — call out the in-scope number.
-
-Do NOT touch any `src/ks_xlsx_parser/*` code for this cluster.
-
-## Acceptance criteria
-
-1. On the same 200-sample seed=1337 run, the new
-   `recall_text@5_in_scope` number is emitted alongside the existing
-   one and printed in `summary.md`.
-2. The classifier reproducibly identifies ≥ 120 of the 127 currently-
-   flagged instances (some borderline cases — a single non-empty
-   header in the answer range — are debatable).
-3. `history.jsonl` rows after the change have both fields populated.
-4. A unit test under `tests/test_eval_retrieval_classify.py` covers
-   the classifier with hand-built fixtures: empty range → out-of-scope,
-   range with one value → in-scope, range with only string headers
-   (no numeric data) → in-scope, range with formula cells that have
-   cached values → in-scope.
-
-## Why this isn't TODO 00
-
-* TODO 00 fixes the harness's *spec parser* — wrong sheet names get
-  resolved.
-* TODO 05 fixes the harness's *scoring* — instances the parser can't
-  affect get a separate bucket.
-
-Both touch `eval_retrieval.py` but in different functions. If they're
-worked simultaneously, coordinate over `parse_position_spec` —
-TODO 00 owns the regex; TODO 05 owns the classifier.
-
-## Pitfalls
-
-* Don't over-exclude. An instance where `answer_position` covers
-  `A2:G15` and only `A1` is non-empty (a single header row above the
-  empty area) is still execution-required, but use the right cell
-  range — read the actual `answer_position`, not `data_position`.
-* The classifier needs the openpyxl read, which costs ~50ms per
-  instance. Cache it or do it once at script start, not inside the
-  ranking loop.
-* Don't delete the unfiltered metric. Both numbers are useful: the
-  unfiltered one tells you how a downstream consumer experiences
-  ks-xlsx-parser on real spreadsheet questions (most of which DO
-  require execution); the in-scope one tells you whether the parser
-  is doing its job.
-
-## Sample of what gets filtered
-
-Random sampling of `instruction_requires_execution = True` instances:
-
-| Instance | Instruction (first 60 chars) | Answer range |
-|---|---|---|
-| ...      | "Calculate the sum of column B for products where..."   | `Sheet1!E2:E10` |
-| ...      | "For each row in column A, look up the corresponding..."| `Sheet1!C2:C50` |
-| ...      | "Pivot the data in A1:D100 by month..."                 | `Sheet2!A1:E13` |
-
-Run the enricher to see the full list.
diff --git a/docs/planning/recall-90/README.md b/docs/planning/recall-90/README.md
deleted file mode 100644
index 4f50b46..0000000
--- a/docs/planning/recall-90/README.md
+++ /dev/null
@@ -1,137 +0,0 @@
-# Roadmap: ks-xlsx-parser retrieval recall → 0.90
-
-This directory is the worklist for getting SpreadsheetBench recall on the
-canonical ks-xlsx-parser benchmark from the v0.2.0 baseline up to:
-
-| Metric                 | v0.2.0 (912) | 200-sample baseline | Target | Stretch |
-|------------------------|--------------|---------------------|--------|---------|
-| Text recall@5          | 0.704        | 0.635               | ≥ 0.90 | 0.95    |
-| Geometric recall@5     | 0.369        | 0.310               | ≥ 0.70 | 0.85    |
-| In-scope text recall@5 | unmeasured   | **0.59**            | ≥ 0.90 | —       |
-
-Each `NN-*.md` in this folder is one **independent, parallel-safe TODO**.
-Pick one, claim it (table below), execute, hand off the next agent.
-
-## Read this first — the headline number is misleading
-
-The benchmark dataset has 912 instances. **~63% of them require the
-system to *compute and write* the answer** (e.g. "fill column C with
-ROUND(B*5)") — the answer literally doesn't exist in the input
-spreadsheet. A parser cannot retrieve what isn't there. We call this
-class `instruction_requires_execution` and treat them as **out of scope
-for parser work**. See [05-out-of-scope-execution-instances.md](./05-out-of-scope-execution-instances.md).
-
-When the scoring is filtered to in-scope instances only, the real
-ks-xlsx-parser recall is closer to **0.59**, and improving it past
-0.90 means closing ~22 named failures out of ~30 actionable cases on the
-200-instance seed=1337 sample. That's the actual work.
-
-## Worklist
-
-Status legend: 🆓 free to claim · 🔵 in progress · ✅ landed · ⛔ blocked
-
-| # | File | Cluster | Instances | Primary slice | Status |
-|---|---|---|---|---|---|
-| 00 | [00-benchmark-spec-parser-bugs.md](./00-benchmark-spec-parser-bugs.md) | Benchmark harness fails to parse malformed `data_position`/`answer_position` (`Dashboard'!B8`, fullwidth `G12：J15`) — 8 instances scored as wrong even when the parser was correct. | 8 | `scripts/eval_retrieval.py` | 🆓 |
-| 01 | [01-array-formula-rendering.md](./01-array-formula-rendering.md) | Array-formula cells surface as `<openpyxl.worksheet.formula.ArrayFormula object>` — value never lands in `render_text`. | 2 | `parsers/cell_parser.py`, `rendering/text_renderer.py` | 🆓 |
-| 02 | [02-chunk-range-vs-text-mismatch.md](./02-chunk-range-vs-text-mismatch.md) | `text_hit_geom_miss` with chunks on the GT sheet — answer text is in *some* chunk's text but no chunk's claimed A1 range overlaps GT. Range bookkeeping drift during block merge/split. | 7 | `annotation/block_splitter.py`, `analysis/pattern_splitter.py`, `chunking/chunker.py` | 🆓 |
-| 03 | [03-cell-drop-or-uncached-formula.md](./03-cell-drop-or-uncached-formula.md) | Chunk's range covers GT geometrically but its rendered text lacks the answer values — either cells dropped in render, or formula cells with no cached value rendered as the formula source. | up to 10 (some unscorable) | `parsers/cell_parser.py`, `parsers/sheet_parser.py`, `rendering/text_renderer.py` | 🆓 |
-| 04 | [04-single-chunk-multi-region-sheet.md](./04-single-chunk-multi-region-sheet.md) | Sheets containing distinct logical regions emerge as one giant chunk; the answer text dilutes against the rest of the sheet at embedding time. | 2 (+ likely more at full corpus) | `chunking/chunker.py`, `analysis/light_block_detector.py`, `analysis/table_grouper.py` | 🆓 |
-| 05 | [05-out-of-scope-execution-instances.md](./05-out-of-scope-execution-instances.md) | Informational + scoring filter. **Not a parser fix.** Surfaces benchmark instances where the parser fundamentally can't help; suggests filtering them from headline recall. | 127 / 200 | `scripts/eval_retrieval.py` (scoring), `scripts/enrich_failures.py` | 🆓 |
-
-## How parallelism works here
-
-Each TODO is scoped to a different file slice so agents won't collide:
-
-```
-slice A   benchmark harness         00, 05
-slice B   parsers/cell_parser       01, 03
-slice C   parsers/sheet_parser      03
-slice D   rendering/text_renderer   01, 03
-slice E   annotation/ + analysis/   02
-slice F   chunking/                 04
-```
-
-TODOs 00 and 05 touch the same file (`eval_retrieval.py`) — **serialize**
-those two. Everything else can run truly in parallel.
-
-If you discover a cluster mid-flight that doesn't fit anywhere here,
-add a new `NN-*.md` and append a row to the table above. Do NOT bury
-findings inside an existing TODO.
-
-## Working a TODO
-
-Each cluster file has the same structure:
-
-1. **What the cluster looks like** — a real failure example in full.
-2. **Repro instance IDs** — the 200-sample seed=1337 instance IDs that
-   match this cluster.
-3. **Diagnostic signature** — exact `enriched_failures.ndjson` columns
-   that identify a member.
-4. **File scope** — what you can change; what you can't.
-5. **Acceptance criteria** — measurable on the 200-sample seed=1337
-   rerun. "The N instances listed in §2 all flip from miss → hit on
-   `text_match_rank` AND no other previously-passing instance regresses."
-6. **Failing test sketch** — one-line idea for a regression test.
-
-What's **deliberately missing** from each file: the proposed fix and an
-effort estimate. The agent that picks up the task does that diagnosis;
-the doc just specifies success.
-
-### The loop, end-to-end
-
-```bash
-# 1. Claim the task (edit README, change 🆓 → 🔵 with your name).
-# 2. Read the cluster file + the named example instances.
-# 3. Reproduce locally:
-python scripts/eval_retrieval.py \
-    --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \
-    --parsers ks --sample 200 --seed 1337 --emit-failures
-python scripts/enrich_failures.py tests/benchmarks/reports/retrieval
-python scripts/triage_recall.py tests/benchmarks/reports/retrieval --examples 5
-
-# 4. Write a regression test that fails today (see test sketch in cluster file).
-# 5. Fix until the test passes.
-# 6. Re-run the eval + enrichment. Confirm the cluster count dropped, no regressions.
-# 7. Append to history; the delta should be visible:
-python scripts/append_bench_history.py
-
-# 8. PR the change. Title: `recall(NN): <cluster short name> — N→M`.
-#    PR body links to the cluster file and pastes the before/after histogram.
-# 9. Update README table: 🔵 → ✅ with PR link.
-```
-
-## Why the 200-sample seed=1337 is the gate, not the full 912
-
-912 takes ~40 min on a laptop. PR validation needs to be fast. The
-seed=1337 200-sample is deterministic and contains roughly proportional
-representation of every cluster. Acceptance criteria are stated against
-it. **Cross-check with the full corpus before declaring a release**:
-
-```bash
-python scripts/eval_retrieval.py --parsers ks --emit-failures
-```
-
-The weekly benchmark workflow (`.github/workflows/benchmark.yml`) runs
-the full 912 on schedule + workflow_dispatch.
-
-## Tracking progress
-
-`tests/benchmarks/reports/history.jsonl` is appended to on each
-`make bench-track` run, one row per commit. Tail it to see the trend:
-
-```bash
-tail -10 tests/benchmarks/reports/history.jsonl
-```
-
-Each row carries the failure_buckets histogram, so per-cluster progress
-is in the data, not just the headline number.
-
-## See also
-
-- `docs/recall-investigation.md` — diagnosis framework + hypotheses
-  (H1 chunk-size, H2 formula-rendering, H3 range-bookkeeping)
-- `docs/benchmark-local-setup.md` — install → corpus → eval loop
-- `scripts/triage_recall.py` — bucket histogram + exemplar dump
-- `scripts/enrich_failures.py` — per-failure diagnostic columns
-- `.claude/skills/recall-failure-triage/SKILL.md` — agent guide