knowledgestack · arnav2 · May 19, 2026 · May 19, 2026 · May 19, 2026 · May 19, 2026
diff --git a/.claude/skills/recall-failure-triage/SKILL.md b/.claude/skills/recall-failure-triage/SKILL.md
@@ -0,0 +1,154 @@
+---
+name: recall-failure-triage
+description: Diagnose why ks-xlsx-parser's retrieval recall is below target by reading the failure-bucket histogram and exemplar failures, then propose ranked fixes. Use when investigating low recall@k on SpreadsheetBench, deciding what to fix next, or reviewing a benchmark run. Sister skill to excel-extraction-pipeline-improver.
+---
+
+# Recall Failure Triage — turn `recall@5 = X` into a ranked worklist
+
+A single recall number tells you nothing about *what* to fix. Five
+distinct failure modes produce identical recall drops, and the fix is
+completely different per mode. This skill is the bridge from "recall is
+low" to "open `parsers/cell_parser.py` line N, here is the bug".
+
+Pair with `excel-extraction-pipeline-improver` — that skill implements
+fixes; this one decides what to fix.
+
+## When this skill fires
+
+* User says "recall is low / dropped / not improving"
+* New benchmark run produced a `failures.ndjson` or `summary.json`
+* PR-time benchmark sample regressed on the CI step summary
+* You're picking the next parser/chunking PR and need to choose what's
+  highest leverage
+
+## Inputs you should expect
+
+| Input | Where it comes from |
+|---|---|
+| `summary.json` + `summary.md` | latest `tests/benchmarks/reports/retrieval/<stamp>/` |
+| `failures.ndjson` | same dir, **only present if** the run used `--emit-failures` |
+| `history.jsonl` | `tests/benchmarks/reports/history.jsonl` — recall over commits |
+| current branch + recent commits | to know what changed since the last run |
+
+## The decision procedure
+
+### 0. Apply the in-scope filter — most "misses" are unfixable
+
+Before counting buckets, run `scripts/enrich_failures.py` to mark
+instances flagged `instruction_requires_execution` — those where the
+benchmark's `answer_position` is empty in `input.xlsx` because the
+question literally asks the system to *write* the answer. ~63% of
+the SpreadsheetBench corpus is this class on the 200-sample. They are
+NOT parser bugs and CANNOT be fixed by parser work. Exclude them.
+
+```bash
+python scripts/enrich_failures.py tests/benchmarks/reports/retrieval
+jq -c 'select((.flags | contains(["instruction_requires_execution"])) | not)' \
+    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
+```
+
+### 1. Read the bucket histogram FIRST
+
+```bash
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval
+```
+
+The output ranks buckets by count. **Spend your time on the largest
+bucket only.** The next-largest is the next PR.
+
+| Bucket | Root cause | Where to fix |
+|---|---|---|
+| `answer_absent_from_chunks` | The answer value never made it into ANY chunk. The cell was dropped or rendered as a formula expression instead of its computed value. | `parsers/cell_parser.py::_extract_cell_value` and `rendering/text_renderer.py::_cell_render_value`. Check `data_only` plumbing in `parsers/workbook_parser.py`. |
+| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. The chunk is too big/heterogeneous — embedding is diluted. | `chunking/chunker.py`. There is no token cap today; large blocks are emitted whole. Cap at ~512 tokens and row-split tall tables. |
+| `wrong_sheet` | Answer sheet was never chunked. | `parsers/workbook_parser.py` sheet loop. Check hidden sheets, very-hidden sheets, and `wb.sheetnames` vs `wb.worksheets`. |
+| `geometric_no_overlap` | Answer text matches but the chunk's A1 range doesn't overlap ground truth. Range drifts during merge/split. | `annotation/block_splitter.py` + `analysis/pattern_splitter.py`. Add invariant: `block.cell_range` ⊆ bbox of cells that actually rendered text. |
+| `no_chunks` / `parse_error` | Upstream parser failure. | The exception. Reproduce with `parse_workbook(path)`. |
+
+### 2. Read 5 example failures from the top bucket
+
+```bash
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval \
+  --bucket <largest> --examples 5
+```
+
+For each example, the script prints:
+* the natural-language question
+* `answer_position` (the ground-truth cell)
+* the ground-truth string values that should appear in a chunk
+* the top-8 ranked chunks with sheet, range, and whether each contains
+  the answer
+
+**Pattern-match.** What do the 5 examples have in common? E.g.
+* All answer cells are formulas → H2 confirmed: render the cached value
+* All chunks span the entire sheet → H1 confirmed: cap chunk size
+* All answer sheets are hidden → wrong_sheet from hidden-sheet skip
+
+### 3. State the hypothesis before changing code
+
+In your reply, write the hypothesis as one sentence:
+> "Bucket X dominates because [observable common pattern in examples].
+> Expected fix: [specific file + change]. Expected lift on recall@5:
+> [number]."
+
+The expected lift = (bucket count) ÷ (scoreable instances). If a fix
+is supposed to halve the biggest bucket but recall barely moves on the
+next run, the hypothesis was wrong — go back to step 1, don't pile on.
+
+### 4. Wire the fix into the existing pipeline
+
+Pass the hypothesis to the `excel-extraction-pipeline-improver` skill
+(or implement directly) with TDD: write a failing crossval/invariant
+test that captures the bug from one example, fix, run `make test`, then
+re-run `make bench-track` to confirm the bucket shrank.
+
+### 5. Verify on the history
+
+```bash
+tail -5 tests/benchmarks/reports/history.jsonl
+```
+
+The script prints a row-over-row delta. Expected behaviour:
+* The targeted bucket count drops
+* `recall_text@5` rises by approximately `target_bucket_drop /
+  scoreable_instances`
+* `recall_text@1` improves proportionally OR more (better chunks rank
+  higher too)
+* Other buckets shouldn't grow — if they do, the fix had a side effect
+
+## Anti-patterns
+
+* **Don't chase the headline number.** "Recall went up 1 pp" without a
+  bucket explanation is suspicious — it might be benchmark noise.
+* **Don't fix two buckets at once.** You won't know which fix worked.
+* **Don't tune chunk size as the first move.** It only helps
+  `present_but_ranked_low`. If `answer_absent_from_chunks` is bigger,
+  chunk tuning is wasted effort.
+* **Don't trust geometric and text recall independently.** A docling-style
+  parser with zero geometric overlap can still have high text recall
+  (because it serializes everything to plain text). That's not the same
+  thing as good — citation overlays need the geometric metric. Always
+  report both.
+
+## Quick reference — running it from scratch
+
+```bash
+make corpus-download                                    # one-time, large
+make bench-track                                        # runs + triages
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval --examples 5
+cat tests/benchmarks/reports/history.jsonl | tail -10   # commit-over-commit
+```
+
+CI runs a 60-instance sample on every PR (`.github/workflows/benchmark.yml`)
+and posts the bucket histogram to the GitHub job summary. Weekly schedule
+runs the full 912-instance corpus.
+
+## Hand-off contract
+
+When you finish a triage round, your final message MUST include:
+1. The bucket histogram (counts + percentages).
+2. The dominant bucket + one-sentence root cause hypothesis.
+3. The 1–2 file paths where the fix lives.
+4. The expected lift on `recall_text@5` if the fix lands.
+5. A pointer to the example instance IDs you read.
+
+Anything less and the next agent will have to re-do the analysis.
diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
@@ -0,0 +1,85 @@
+name: Benchmark
+
+# Tracks ks-xlsx-parser retrieval recall on SpreadsheetBench over time.
+# The headline goal: text recall@5 > 0.90 (currently ~0.70).
+#
+#   * Pull requests run a fast SAMPLE (60 instances) as a regression smoke
+#     test — keeps the signal without a 40-minute wait.
+#   * The weekly schedule + manual dispatch run the FULL 912-instance
+#     corpus and publish the recall trend.
+
+on:
+  pull_request:
+    branches: [main]
+    paths:
+      - "src/**"
+      - "scripts/eval_retrieval.py"
+      - "scripts/triage_recall.py"
+      - "Dockerfile.bench"
+      - ".github/workflows/benchmark.yml"
+  schedule:
+    - cron: "0 6 * * 1" # Mondays 06:00 UTC
+  workflow_dispatch:
+    inputs:
+      sample:
+        description: "Instances to sample (0 = full 912 corpus)"
+        default: "0"
+
+concurrency:
+  group: benchmark-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  benchmark:
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - uses: actions/checkout@v4
+
+      # PRs use a 60-instance sample; scheduled/dispatch runs use the full
+      # corpus (or whatever the dispatch input requests).
+      - name: Resolve sample size
+        id: cfg
+        run: |
+          if [ "${{ github.event_name }}" = "pull_request" ]; then
+            echo "sample=60" >> "$GITHUB_OUTPUT"
+          else
+            echo "sample=${{ github.event.inputs.sample || 0 }}" >> "$GITHUB_OUTPUT"
+          fi
+
+      - name: Cache SpreadsheetBench corpus
+        uses: actions/cache@v4
+        with:
+          path: data/corpora/spreadsheetbench
+          key: spreadsheetbench-912-v0.1
+
+      - name: Build benchmark image
+        run: docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
+
+      - name: Run benchmark
+        run: |
+          mkdir -p tests/benchmarks/reports data
+          docker run --rm \
+            -e BENCH_SAMPLE=${{ steps.cfg.outputs.sample }} \
+            -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
+            -v "$PWD/data:/app/data" \
+            ks-xlsx-parser-bench | tee bench.log
+
+      - name: Publish recall to job summary
+        if: always()
+        run: |
+          {
+            echo '## ks-xlsx-parser retrieval benchmark'
+            echo ''
+            echo '```'
+            tail -n 40 bench.log || true
+            echo '```'
+          } >> "$GITHUB_STEP_SUMMARY"
+
+      - name: Upload benchmark reports
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: benchmark-reports-${{ github.run_number }}
+          path: tests/benchmarks/reports/
+          if-no-files-found: warn
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -23,22 +23,20 @@ jobs:
     steps:
       - uses: actions/checkout@v4
 
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          version: "latest"
+          enable-cache: true
+          cache-dependency-glob: "uv.lock"
+
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
-          cache: pip
-          cache-dependency-path: pyproject.toml
 
       - name: Install
-        run: |
-          python -m pip install --upgrade pip
-          pip install -e ".[dev,api]"
-
-      - name: Ruff lint
-        run: |
-          pip install ruff
-          ruff check src/ tests/ scripts/ || true  # non-blocking until cleanup PR lands
+        run: uv pip install --system -e ".[dev,api]"
 
       - name: Run test suite
         run: make test-ci
@@ -50,3 +48,56 @@ jobs:
           name: junit-${{ matrix.os }}-py${{ matrix.python-version }}
           path: reports/junit.xml
           if-no-files-found: ignore
+
+  # Builds the wheel and proves it installs + imports in a CLEAN venv.
+  # The matrix `test` job runs against an editable install, which exposes the
+  # whole src/ tree on sys.path and therefore CANNOT catch a broken wheel —
+  # exactly how the v0.2.0 "pipeline.py missing from wheel" bug shipped.
+  # This job is the regression guard. Keep it required.
+  wheel-check:
+    name: wheel install smoke test
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Build wheel
+        run: |
+          python -m pip install --upgrade pip build
+          python -m build --wheel
+
+      - name: Verify wheel installs and imports in a clean venv
+        run: python scripts/verify_wheel.py
+
+  lint:
+    name: ruff lint
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: Ruff lint
+        run: |
+          python -m pip install --upgrade pip ruff
+          # TODO(cleanup): 45 pre-existing findings (E402/B905/SIM*). Drop the
+          # `|| true` once the lint-cleanup PR lands so this job gates merges.
+          ruff check src/ tests/ scripts/ || true
+
+  typecheck:
+    name: mypy
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: mypy
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e ".[dev,api]"
+          # TODO(cleanup): 59 pre-existing findings. Drop `|| true` once typed.
+          python -m mypy src/ks_xlsx_parser || true
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -37,6 +37,12 @@ jobs:
       - name: Build wheel + sdist
         run: python -m build
 
+      # Gate the release on a clean-venv install of the freshly built wheel.
+      # Prevents shipping a wheel that drops modules or leaks top-level
+      # packages (the v0.2.0 packaging regression).
+      - name: Verify wheel
+        run: python scripts/verify_wheel.py
+
       - name: Upload distribution artifacts
         uses: actions/upload-artifact@v4
         with:

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -45,6 +45,52 @@ Template for a new release (copy this block, fill in, move Unreleased items in):
 
 ## [Unreleased]
 
+### ⚠️ BREAKING (Fixed — see also #ks-xlsx-parser channel report)
+- Repository layout flattened on `src/` was leaking 13 generic top-level
+  packages (`models`, `utils`, `parsers`, …) into installed wheels and
+  silently dropping `pipeline.py` and `api.py` (setuptools `packages.find`
+  only finds *packages*, not top-level modules). Users hitting
+  `from ks_xlsx_parser.pipeline import ...` on 0.2.0 from PyPI got
+  `ModuleNotFoundError`. **All modules now live under
+  `src/ks_xlsx_parser/`**; the wheel's `top_level.txt` contains only
+  `ks_xlsx_parser`. Imports inside the package switched from
+  `from pipeline import` to `from ks_xlsx_parser.pipeline import`.
+  Downstream code that imported the leaked generics
+  (`from models import …`) MUST migrate to `from ks_xlsx_parser.models …`.
+
+### Added
+- `scripts/verify_wheel.py` — builds the wheel, installs it in a fresh
+  venv, and asserts the public import surface resolves. Wired into a
+  new `wheel-check` job in `.github/workflows/ci.yml` and a `Verify wheel`
+  step in `release.yml`. Regression guard for the packaging bug above.
+- `scripts/triage_recall.py` + `scripts/append_bench_history.py` — turn
+  `failures.ndjson` into a ranked bucket histogram with exemplar
+  failures, and append each benchmark run to
+  `tests/benchmarks/reports/history.jsonl` so recall is tracked
+  commit-over-commit. Goal: text recall@5 > 0.90.
+- `eval_retrieval.py --emit-failures` — dumps top-8 ranked chunks per
+  miss with a `failure_bucket` (answer_absent_from_chunks /
+  present_but_ranked_low / wrong_sheet / geometric_no_overlap / …) for
+  triage. Summary JSON gains a `failure_buckets` histogram.
+- `Dockerfile.bench` + `.github/workflows/benchmark.yml` — reproducible
+  benchmark image; PR sample run (60 instances), weekly full corpus run.
+- `make install-dev` alias and `make wheel-check` / `make bench-track`
+  / `make docker-bench` targets.
+- New `bench` optional-dependency group (`sentence-transformers`,
+  `numpy`) — only the benchmark needs these.
+- `docs/recall-investigation.md` documenting the diagnosis framework and
+  three named hypotheses (chunk-size dilution, formula-expression
+  rendering, range-bookkeeping drift).
+- `.claude/skills/recall-failure-triage/SKILL.md` — agent skill that
+  consumes the bucket output and proposes ranked fixes.
+
+### Changed
+- Dropped `PYTHONPATH=src` from Makefile benchmark targets — the
+  package is now properly installable so callers don't need it.
+- `pyproject.toml`: `packages.find` constrained to `ks_xlsx_parser*`,
+  `py.typed` declared as package data, `xlsx-parser-api` console script
+  updated to `ks_xlsx_parser.api:main`.
+
 ### ⚠️ BREAKING
 - Retired the in-tree `testBench/` corpus. The 1054-workbook stress dataset
   and `make testbench*` targets are gone — benchmarks now run against the