Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
154 changes: 154 additions & 0 deletions .claude/skills/recall-failure-triage/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@
---
name: recall-failure-triage
description: Diagnose why ks-xlsx-parser's retrieval recall is below target by reading the failure-bucket histogram and exemplar failures, then propose ranked fixes. Use when investigating low recall@k on SpreadsheetBench, deciding what to fix next, or reviewing a benchmark run. Sister skill to excel-extraction-pipeline-improver.
---

# Recall Failure Triage — turn `recall@5 = X` into a ranked worklist

A single recall number tells you nothing about *what* to fix. Five
distinct failure modes produce identical recall drops, and the fix is
completely different per mode. This skill is the bridge from "recall is
low" to "open `parsers/cell_parser.py` line N, here is the bug".

Pair with `excel-extraction-pipeline-improver` — that skill implements
fixes; this one decides what to fix.

## When this skill fires

* User says "recall is low / dropped / not improving"
* New benchmark run produced a `failures.ndjson` or `summary.json`
* PR-time benchmark sample regressed on the CI step summary
* You're picking the next parser/chunking PR and need to choose what's
highest leverage

## Inputs you should expect

| Input | Where it comes from |
|---|---|
| `summary.json` + `summary.md` | latest `tests/benchmarks/reports/retrieval/<stamp>/` |
| `failures.ndjson` | same dir, **only present if** the run used `--emit-failures` |
| `history.jsonl` | `tests/benchmarks/reports/history.jsonl` — recall over commits |
| current branch + recent commits | to know what changed since the last run |

## The decision procedure

### 0. Apply the in-scope filter — most "misses" are unfixable

Before counting buckets, run `scripts/enrich_failures.py` to mark
instances flagged `instruction_requires_execution` — those where the
benchmark's `answer_position` is empty in `input.xlsx` because the
question literally asks the system to *write* the answer. ~63% of
the SpreadsheetBench corpus is this class on the 200-sample. They are
NOT parser bugs and CANNOT be fixed by parser work. Exclude them.

```bash
python scripts/enrich_failures.py tests/benchmarks/reports/retrieval
jq -c 'select((.flags | contains(["instruction_requires_execution"])) | not)' \
tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
```

### 1. Read the bucket histogram FIRST

```bash
python scripts/triage_recall.py tests/benchmarks/reports/retrieval
```

The output ranks buckets by count. **Spend your time on the largest
bucket only.** The next-largest is the next PR.

| Bucket | Root cause | Where to fix |
|---|---|---|
| `answer_absent_from_chunks` | The answer value never made it into ANY chunk. The cell was dropped or rendered as a formula expression instead of its computed value. | `parsers/cell_parser.py::_extract_cell_value` and `rendering/text_renderer.py::_cell_render_value`. Check `data_only` plumbing in `parsers/workbook_parser.py`. |
| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. The chunk is too big/heterogeneous — embedding is diluted. | `chunking/chunker.py`. There is no token cap today; large blocks are emitted whole. Cap at ~512 tokens and row-split tall tables. |
| `wrong_sheet` | Answer sheet was never chunked. | `parsers/workbook_parser.py` sheet loop. Check hidden sheets, very-hidden sheets, and `wb.sheetnames` vs `wb.worksheets`. |
| `geometric_no_overlap` | Answer text matches but the chunk's A1 range doesn't overlap ground truth. Range drifts during merge/split. | `annotation/block_splitter.py` + `analysis/pattern_splitter.py`. Add invariant: `block.cell_range` ⊆ bbox of cells that actually rendered text. |
| `no_chunks` / `parse_error` | Upstream parser failure. | The exception. Reproduce with `parse_workbook(path)`. |

### 2. Read 5 example failures from the top bucket

```bash
python scripts/triage_recall.py tests/benchmarks/reports/retrieval \
--bucket <largest> --examples 5
```

For each example, the script prints:
* the natural-language question
* `answer_position` (the ground-truth cell)
* the ground-truth string values that should appear in a chunk
* the top-8 ranked chunks with sheet, range, and whether each contains
the answer

**Pattern-match.** What do the 5 examples have in common? E.g.
* All answer cells are formulas → H2 confirmed: render the cached value
* All chunks span the entire sheet → H1 confirmed: cap chunk size
* All answer sheets are hidden → wrong_sheet from hidden-sheet skip

### 3. State the hypothesis before changing code

In your reply, write the hypothesis as one sentence:
> "Bucket X dominates because [observable common pattern in examples].
> Expected fix: [specific file + change]. Expected lift on recall@5:
> [number]."

The expected lift = (bucket count) ÷ (scoreable instances). If a fix
is supposed to halve the biggest bucket but recall barely moves on the
next run, the hypothesis was wrong — go back to step 1, don't pile on.

### 4. Wire the fix into the existing pipeline

Pass the hypothesis to the `excel-extraction-pipeline-improver` skill
(or implement directly) with TDD: write a failing crossval/invariant
test that captures the bug from one example, fix, run `make test`, then
re-run `make bench-track` to confirm the bucket shrank.

### 5. Verify on the history

```bash
tail -5 tests/benchmarks/reports/history.jsonl
```

The script prints a row-over-row delta. Expected behaviour:
* The targeted bucket count drops
* `recall_text@5` rises by approximately `target_bucket_drop /
scoreable_instances`
* `recall_text@1` improves proportionally OR more (better chunks rank
higher too)
* Other buckets shouldn't grow — if they do, the fix had a side effect

## Anti-patterns

* **Don't chase the headline number.** "Recall went up 1 pp" without a
bucket explanation is suspicious — it might be benchmark noise.
* **Don't fix two buckets at once.** You won't know which fix worked.
* **Don't tune chunk size as the first move.** It only helps
`present_but_ranked_low`. If `answer_absent_from_chunks` is bigger,
chunk tuning is wasted effort.
* **Don't trust geometric and text recall independently.** A docling-style
parser with zero geometric overlap can still have high text recall
(because it serializes everything to plain text). That's not the same
thing as good — citation overlays need the geometric metric. Always
report both.

## Quick reference — running it from scratch

```bash
make corpus-download # one-time, large
make bench-track # runs + triages
python scripts/triage_recall.py tests/benchmarks/reports/retrieval --examples 5
cat tests/benchmarks/reports/history.jsonl | tail -10 # commit-over-commit
```

CI runs a 60-instance sample on every PR (`.github/workflows/benchmark.yml`)
and posts the bucket histogram to the GitHub job summary. Weekly schedule
runs the full 912-instance corpus.

## Hand-off contract

When you finish a triage round, your final message MUST include:
1. The bucket histogram (counts + percentages).
2. The dominant bucket + one-sentence root cause hypothesis.
3. The 1–2 file paths where the fix lives.
4. The expected lift on `recall_text@5` if the fix lands.
5. A pointer to the example instance IDs you read.

Anything less and the next agent will have to re-do the analysis.
85 changes: 85 additions & 0 deletions .github/workflows/benchmark.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
name: Benchmark

# Tracks ks-xlsx-parser retrieval recall on SpreadsheetBench over time.
# The headline goal: text recall@5 > 0.90 (currently ~0.70).
#
# * Pull requests run a fast SAMPLE (60 instances) as a regression smoke
# test — keeps the signal without a 40-minute wait.
# * The weekly schedule + manual dispatch run the FULL 912-instance
# corpus and publish the recall trend.

on:
pull_request:
branches: [main]
paths:
- "src/**"
- "scripts/eval_retrieval.py"
- "scripts/triage_recall.py"
- "Dockerfile.bench"
- ".github/workflows/benchmark.yml"
schedule:
- cron: "0 6 * * 1" # Mondays 06:00 UTC
workflow_dispatch:
inputs:
sample:
description: "Instances to sample (0 = full 912 corpus)"
default: "0"

concurrency:
group: benchmark-${{ github.ref }}
cancel-in-progress: true

jobs:
benchmark:
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- uses: actions/checkout@v4

# PRs use a 60-instance sample; scheduled/dispatch runs use the full
# corpus (or whatever the dispatch input requests).
- name: Resolve sample size
id: cfg
run: |
if [ "${{ github.event_name }}" = "pull_request" ]; then
echo "sample=60" >> "$GITHUB_OUTPUT"
else
echo "sample=${{ github.event.inputs.sample || 0 }}" >> "$GITHUB_OUTPUT"
fi

- name: Cache SpreadsheetBench corpus
uses: actions/cache@v4
with:
path: data/corpora/spreadsheetbench
key: spreadsheetbench-912-v0.1

- name: Build benchmark image
run: docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .

- name: Run benchmark
run: |
mkdir -p tests/benchmarks/reports data
docker run --rm \
-e BENCH_SAMPLE=${{ steps.cfg.outputs.sample }} \
-v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
-v "$PWD/data:/app/data" \
ks-xlsx-parser-bench | tee bench.log

- name: Publish recall to job summary
if: always()
run: |
{
echo '## ks-xlsx-parser retrieval benchmark'
echo ''
echo '```'
tail -n 40 bench.log || true
echo '```'
} >> "$GITHUB_STEP_SUMMARY"

- name: Upload benchmark reports
if: always()
uses: actions/upload-artifact@v4
with:
name: benchmark-reports-${{ github.run_number }}
path: tests/benchmarks/reports/
if-no-files-found: warn
71 changes: 61 additions & 10 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -23,22 +23,20 @@ jobs:
steps:
- uses: actions/checkout@v4

- name: Install uv
uses: astral-sh/setup-uv@v5
with:
version: "latest"
enable-cache: true
cache-dependency-glob: "uv.lock"

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: pip
cache-dependency-path: pyproject.toml

- name: Install
run: |
python -m pip install --upgrade pip
pip install -e ".[dev,api]"

- name: Ruff lint
run: |
pip install ruff
ruff check src/ tests/ scripts/ || true # non-blocking until cleanup PR lands
run: uv pip install --system -e ".[dev,api]"

- name: Run test suite
run: make test-ci
Expand All @@ -50,3 +48,56 @@ jobs:
name: junit-${{ matrix.os }}-py${{ matrix.python-version }}
path: reports/junit.xml
if-no-files-found: ignore

# Builds the wheel and proves it installs + imports in a CLEAN venv.
# The matrix `test` job runs against an editable install, which exposes the
# whole src/ tree on sys.path and therefore CANNOT catch a broken wheel —
# exactly how the v0.2.0 "pipeline.py missing from wheel" bug shipped.
# This job is the regression guard. Keep it required.
wheel-check:
name: wheel install smoke test
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Build wheel
run: |
python -m pip install --upgrade pip build
python -m build --wheel

- name: Verify wheel installs and imports in a clean venv
run: python scripts/verify_wheel.py

lint:
name: ruff lint
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Ruff lint
run: |
python -m pip install --upgrade pip ruff
# TODO(cleanup): 45 pre-existing findings (E402/B905/SIM*). Drop the
# `|| true` once the lint-cleanup PR lands so this job gates merges.
ruff check src/ tests/ scripts/ || true

typecheck:
name: mypy
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: mypy
run: |
python -m pip install --upgrade pip
python -m pip install -e ".[dev,api]"
# TODO(cleanup): 59 pre-existing findings. Drop `|| true` once typed.
python -m mypy src/ks_xlsx_parser || true
6 changes: 6 additions & 0 deletions .github/workflows/release.yml
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,12 @@ jobs:
- name: Build wheel + sdist
run: python -m build

# Gate the release on a clean-venv install of the freshly built wheel.
# Prevents shipping a wheel that drops modules or leaks top-level
# packages (the v0.2.0 packaging regression).
- name: Verify wheel
run: python scripts/verify_wheel.py

- name: Upload distribution artifacts
uses: actions/upload-artifact@v4
with:
Expand Down
46 changes: 46 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,52 @@ Template for a new release (copy this block, fill in, move Unreleased items in):

## [Unreleased]

### ⚠️ BREAKING (Fixed — see also #ks-xlsx-parser channel report)
- Repository layout flattened on `src/` was leaking 13 generic top-level
packages (`models`, `utils`, `parsers`, …) into installed wheels and
silently dropping `pipeline.py` and `api.py` (setuptools `packages.find`
only finds *packages*, not top-level modules). Users hitting
`from ks_xlsx_parser.pipeline import ...` on 0.2.0 from PyPI got
`ModuleNotFoundError`. **All modules now live under
`src/ks_xlsx_parser/`**; the wheel's `top_level.txt` contains only
`ks_xlsx_parser`. Imports inside the package switched from
`from pipeline import` to `from ks_xlsx_parser.pipeline import`.
Downstream code that imported the leaked generics
(`from models import …`) MUST migrate to `from ks_xlsx_parser.models …`.

### Added
- `scripts/verify_wheel.py` — builds the wheel, installs it in a fresh
venv, and asserts the public import surface resolves. Wired into a
new `wheel-check` job in `.github/workflows/ci.yml` and a `Verify wheel`
step in `release.yml`. Regression guard for the packaging bug above.
- `scripts/triage_recall.py` + `scripts/append_bench_history.py` — turn
`failures.ndjson` into a ranked bucket histogram with exemplar
failures, and append each benchmark run to
`tests/benchmarks/reports/history.jsonl` so recall is tracked
commit-over-commit. Goal: text recall@5 > 0.90.
- `eval_retrieval.py --emit-failures` — dumps top-8 ranked chunks per
miss with a `failure_bucket` (answer_absent_from_chunks /
present_but_ranked_low / wrong_sheet / geometric_no_overlap / …) for
triage. Summary JSON gains a `failure_buckets` histogram.
- `Dockerfile.bench` + `.github/workflows/benchmark.yml` — reproducible
benchmark image; PR sample run (60 instances), weekly full corpus run.
- `make install-dev` alias and `make wheel-check` / `make bench-track`
/ `make docker-bench` targets.
- New `bench` optional-dependency group (`sentence-transformers`,
`numpy`) — only the benchmark needs these.
- `docs/recall-investigation.md` documenting the diagnosis framework and
three named hypotheses (chunk-size dilution, formula-expression
rendering, range-bookkeeping drift).
- `.claude/skills/recall-failure-triage/SKILL.md` — agent skill that
consumes the bucket output and proposes ranked fixes.

### Changed
- Dropped `PYTHONPATH=src` from Makefile benchmark targets — the
package is now properly installable so callers don't need it.
- `pyproject.toml`: `packages.find` constrained to `ks_xlsx_parser*`,
`py.typed` declared as package data, `xlsx-parser-api` console script
updated to `ks_xlsx_parser.api:main`.

### ⚠️ BREAKING
- Retired the in-tree `testBench/` corpus. The 1054-workbook stress dataset
and `make testbench*` targets are gone — benchmarks now run against the
Expand Down
Loading
Loading