From 421783f7d6c93bf23bec859736b191d47ba4e511 Mon Sep 17 00:00:00 2001 From: Arnav Goel Date: Tue, 19 May 2026 17:35:11 -0400 Subject: [PATCH 1/4] fix(packaging): nest everything under ks_xlsx_parser, add CI + bench infra MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Frank's report on the v0.2.0 PyPI release was correct: the wheel was missing pipeline.py and api.py (top-level modules under src/, which setuptools packages.find skips), AND it was leaking 13 generic top-level packages (models, utils, parsers, ...) into site-packages. Confirmed by building the wheel and inspecting top_level.txt + the file list. Packaging - Move 13 packages + pipeline.py + api.py + py.typed under src/ks_xlsx_parser/ via git mv (history preserved). - Mechanical rewrite of 47 files: `from pipeline import` → `from ks_xlsx_parser.pipeline import`, etc. Dead `xlsx_parser.` refs in docstrings and examples updated to `ks_xlsx_parser.`. - pyproject.toml: packages.find constrained to ks_xlsx_parser*, py.typed declared as package data, `xlsx-parser-api` console script fixed. - Drop PYTHONPATH=src from Makefile — package is now importable. Regression guard - scripts/verify_wheel.py builds the wheel, installs in a clean venv, asserts top_level.txt = ['ks_xlsx_parser'] and the public import surface resolves. Wired into a new wheel-check CI job and the release workflow. The matrix `test` job uses an editable install, which structurally cannot catch a broken wheel — this is the gap that let v0.2.0 ship. CI (ks-backend-style) - Split ci.yml into test/lint/typecheck/wheel-check jobs, uv-backed installs, lint+typecheck non-blocking with TODO until the cleanup PR lands (114 → 45 ruff findings after safe autofix; 59 mypy unchanged). - Add wheel verification step to release.yml before PyPI publish. Benchmark + accuracy tracking - Dockerfile.bench reproduces the SpreadsheetBench retrieval benchmark. Pre-warms BAAI/bge-small-en-v1.5; entrypoint ensures corpus, runs eval, appends to history, triages. - .github/workflows/benchmark.yml: 60-instance sample on PRs, full 912-instance corpus weekly + on manual dispatch. Uploads reports as artifact and posts recall summary to the job step summary. - scripts/append_bench_history.py appends one row per run to tests/benchmarks/reports/history.jsonl tagged with the git commit and prints the row-over-row delta. Goal documented: text recall@5 > 0.90 (currently 0.704). Recall failure triage (turns "recall is low" into a ranked worklist) - eval_retrieval.py --emit-failures dumps every recall@5 miss with the top-8 ranked chunks + ground-truth values + a failure_bucket label (answer_absent_from_chunks / present_but_ranked_low / wrong_sheet / geometric_no_overlap / no_chunks / parse_error / unparseable_ground_truth). - summary.json gains a failure_buckets histogram per parser; summary.md prints the bucket table. - scripts/triage_recall.py reads failures.ndjson, ranks buckets by count, prints 3 exemplar failures per bucket so a human or agent can see WHY each miss happened without re-running the benchmark. - docs/recall-investigation.md documents the diagnosis framework and three named hypotheses (chunk-size dilution, formula-expression rendering, range-bookkeeping drift) for the next investigation pass. - .claude/skills/recall-failure-triage/SKILL.md — agent guide. Misc - New `bench` optional-dependency group (sentence-transformers, numpy). - make install-dev alias (Frank pattern-matched this verb off ks-backend). - New make targets: wheel-check, bench-track, docker-bench. Verification - 1041 tests pass with no PYTHONPATH=src. - Wheel built + verified in clean venv. - mypy: 59 pre-existing findings, no new ones. Co-Authored-By: Claude Opus 4.7 (1M context) --- .github/workflows/benchmark.yml | 85 +++++++++ .github/workflows/ci.yml | 71 ++++++-- .github/workflows/release.yml | 6 + CHANGELOG.md | 46 +++++ Dockerfile.bench | 49 +++++ Makefile | 37 +++- docs/recall-investigation.md | 145 +++++++++++++++ examples/demo.py | 6 +- examples/generate_examples.py | 2 +- examples/stress_test/stress_test_runner.py | 2 +- pyproject.toml | 12 +- scripts/append_bench_history.py | 101 +++++++++++ scripts/eval_retrieval.py | 168 +++++++++++++++++- scripts/run_bench.sh | 56 ++++++ scripts/run_enterprise_metrics.py | 3 +- scripts/track_corpus_metrics.py | 1 - scripts/triage_recall.py | 114 ++++++++++++ scripts/verify_wheel.py | 81 +++++++++ src/ks_xlsx_parser/__init__.py | 10 +- src/{ => ks_xlsx_parser}/analysis/__init__.py | 0 .../analysis/light_block_detector.py | 6 +- .../analysis/llm_artifacts.py | 16 +- .../analysis/pattern_splitter.py | 9 +- .../analysis/table_assembler.py | 8 +- .../analysis/table_grouper.py | 8 +- .../analysis/template_extractor.py | 8 +- .../analysis/tree_builder.py | 9 +- .../annotation/__init__.py | 0 .../annotation/block_splitter.py | 7 +- .../annotation/cell_annotator.py | 6 +- src/{ => ks_xlsx_parser}/api.py | 11 +- src/{ => ks_xlsx_parser}/charts/__init__.py | 0 .../charts/chart_extractor.py | 4 +- src/{ => ks_xlsx_parser}/chunking/__init__.py | 0 src/{ => ks_xlsx_parser}/chunking/chunker.py | 16 +- .../chunking/segmenter.py | 8 +- .../comparison/__init__.py | 0 .../comparison/template_comparator.py | 4 +- src/{ => ks_xlsx_parser}/export/__init__.py | 0 .../export/model_exporter.py | 6 +- src/{ => ks_xlsx_parser}/formula/__init__.py | 0 .../formula/dependency_builder.py | 9 +- .../formula/formula_parser.py | 4 +- src/{ => ks_xlsx_parser}/models/__init__.py | 4 +- src/{ => ks_xlsx_parser}/models/block.py | 0 src/{ => ks_xlsx_parser}/models/cell.py | 0 src/{ => ks_xlsx_parser}/models/chart.py | 0 src/{ => ks_xlsx_parser}/models/common.py | 0 src/{ => ks_xlsx_parser}/models/dependency.py | 0 src/{ => ks_xlsx_parser}/models/shape.py | 2 +- src/{ => ks_xlsx_parser}/models/sheet.py | 2 - src/{ => ks_xlsx_parser}/models/table.py | 0 .../models/table_structure.py | 0 src/{ => ks_xlsx_parser}/models/template.py | 0 src/{ => ks_xlsx_parser}/models/tree.py | 0 src/{ => ks_xlsx_parser}/models/workbook.py | 2 +- src/{ => ks_xlsx_parser}/parsers/__init__.py | 0 .../parsers/calamine_core.py | 1 - .../parsers/cell_parser.py | 4 +- .../parsers/sheet_parser.py | 14 +- .../parsers/table_parser.py | 4 +- .../parsers/workbook_parser.py | 13 +- src/{ => ks_xlsx_parser}/pipeline.py | 44 ++--- src/{ => ks_xlsx_parser}/py.typed | 0 .../rendering/__init__.py | 0 .../rendering/html_renderer.py | 8 +- .../rendering/text_renderer.py | 8 +- src/{ => ks_xlsx_parser}/storage/__init__.py | 0 .../storage/serializer.py | 4 +- src/{ => ks_xlsx_parser}/utils/__init__.py | 0 .../utils/logging_config.py | 2 +- .../verification/__init__.py | 0 .../verification/stage_verifier.py | 32 ++-- tests/benchmarks/_driver.py | 4 +- tests/benchmarks/_runner.py | 2 +- tests/benchmarks/adapters/docling_adapter.py | 2 +- tests/benchmarks/adapters/ks_adapter.py | 4 +- tests/conftest.py | 2 - tests/helpers/invariant_checker.py | 2 +- tests/test_charts.py | 7 +- tests/test_corpus_robustness.py | 12 +- tests/test_formula_handling.py | 15 +- tests/test_formula_parser.py | 5 +- tests/test_llm_artifacts.py | 7 +- tests/test_models.py | 4 +- tests/test_multi_table_layout.py | 9 +- tests/test_parsers.py | 3 +- tests/test_pipeline.py | 4 +- tests/test_rendering.py | 18 +- tests/test_segmentation.py | 7 +- tests/test_stage_verification.py | 5 +- tests/test_structural_invariants.py | 9 +- 92 files changed, 1146 insertions(+), 253 deletions(-) create mode 100644 .github/workflows/benchmark.yml create mode 100644 Dockerfile.bench create mode 100644 docs/recall-investigation.md create mode 100755 scripts/append_bench_history.py create mode 100755 scripts/run_bench.sh create mode 100755 scripts/triage_recall.py create mode 100755 scripts/verify_wheel.py rename src/{ => ks_xlsx_parser}/analysis/__init__.py (100%) rename src/{ => ks_xlsx_parser}/analysis/light_block_detector.py (96%) rename src/{ => ks_xlsx_parser}/analysis/llm_artifacts.py (98%) rename src/{ => ks_xlsx_parser}/analysis/pattern_splitter.py (96%) rename src/{ => ks_xlsx_parser}/analysis/table_assembler.py (96%) rename src/{ => ks_xlsx_parser}/analysis/table_grouper.py (97%) rename src/{ => ks_xlsx_parser}/analysis/template_extractor.py (95%) rename src/{ => ks_xlsx_parser}/analysis/tree_builder.py (96%) rename src/{ => ks_xlsx_parser}/annotation/__init__.py (100%) rename src/{ => ks_xlsx_parser}/annotation/block_splitter.py (97%) rename src/{ => ks_xlsx_parser}/annotation/cell_annotator.py (98%) rename src/{ => ks_xlsx_parser}/api.py (97%) rename src/{ => ks_xlsx_parser}/charts/__init__.py (100%) rename src/{ => ks_xlsx_parser}/charts/chart_extractor.py (99%) rename src/{ => ks_xlsx_parser}/chunking/__init__.py (100%) rename src/{ => ks_xlsx_parser}/chunking/chunker.py (95%) rename src/{ => ks_xlsx_parser}/chunking/segmenter.py (98%) rename src/{ => ks_xlsx_parser}/comparison/__init__.py (100%) rename src/{ => ks_xlsx_parser}/comparison/template_comparator.py (99%) rename src/{ => ks_xlsx_parser}/export/__init__.py (100%) rename src/{ => ks_xlsx_parser}/export/model_exporter.py (98%) rename src/{ => ks_xlsx_parser}/formula/__init__.py (100%) rename src/{ => ks_xlsx_parser}/formula/dependency_builder.py (93%) rename src/{ => ks_xlsx_parser}/formula/formula_parser.py (98%) rename src/{ => ks_xlsx_parser}/models/__init__.py (94%) rename src/{ => ks_xlsx_parser}/models/block.py (100%) rename src/{ => ks_xlsx_parser}/models/cell.py (100%) rename src/{ => ks_xlsx_parser}/models/chart.py (100%) rename src/{ => ks_xlsx_parser}/models/common.py (100%) rename src/{ => ks_xlsx_parser}/models/dependency.py (100%) rename src/{ => ks_xlsx_parser}/models/shape.py (97%) rename src/{ => ks_xlsx_parser}/models/sheet.py (99%) rename src/{ => ks_xlsx_parser}/models/table.py (100%) rename src/{ => ks_xlsx_parser}/models/table_structure.py (100%) rename src/{ => ks_xlsx_parser}/models/template.py (100%) rename src/{ => ks_xlsx_parser}/models/tree.py (100%) rename src/{ => ks_xlsx_parser}/models/workbook.py (99%) rename src/{ => ks_xlsx_parser}/parsers/__init__.py (100%) rename src/{ => ks_xlsx_parser}/parsers/calamine_core.py (99%) rename src/{ => ks_xlsx_parser}/parsers/cell_parser.py (99%) rename src/{ => ks_xlsx_parser}/parsers/sheet_parser.py (98%) rename src/{ => ks_xlsx_parser}/parsers/table_parser.py (97%) rename src/{ => ks_xlsx_parser}/parsers/workbook_parser.py (97%) rename src/{ => ks_xlsx_parser}/pipeline.py (90%) rename src/{ => ks_xlsx_parser}/py.typed (100%) rename src/{ => ks_xlsx_parser}/rendering/__init__.py (100%) rename src/{ => ks_xlsx_parser}/rendering/html_renderer.py (96%) rename src/{ => ks_xlsx_parser}/rendering/text_renderer.py (97%) rename src/{ => ks_xlsx_parser}/storage/__init__.py (100%) rename src/{ => ks_xlsx_parser}/storage/serializer.py (98%) rename src/{ => ks_xlsx_parser}/utils/__init__.py (100%) rename src/{ => ks_xlsx_parser}/utils/logging_config.py (97%) rename src/{ => ks_xlsx_parser}/verification/__init__.py (100%) rename src/{ => ks_xlsx_parser}/verification/stage_verifier.py (97%) diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml new file mode 100644 index 0000000..026b535 --- /dev/null +++ b/.github/workflows/benchmark.yml @@ -0,0 +1,85 @@ +name: Benchmark + +# Tracks ks-xlsx-parser retrieval recall on SpreadsheetBench over time. +# The headline goal: text recall@5 > 0.90 (currently ~0.70). +# +# * Pull requests run a fast SAMPLE (60 instances) as a regression smoke +# test — keeps the signal without a 40-minute wait. +# * The weekly schedule + manual dispatch run the FULL 912-instance +# corpus and publish the recall trend. + +on: + pull_request: + branches: [main] + paths: + - "src/**" + - "scripts/eval_retrieval.py" + - "scripts/triage_recall.py" + - "Dockerfile.bench" + - ".github/workflows/benchmark.yml" + schedule: + - cron: "0 6 * * 1" # Mondays 06:00 UTC + workflow_dispatch: + inputs: + sample: + description: "Instances to sample (0 = full 912 corpus)" + default: "0" + +concurrency: + group: benchmark-${{ github.ref }} + cancel-in-progress: true + +jobs: + benchmark: + runs-on: ubuntu-latest + timeout-minutes: 90 + steps: + - uses: actions/checkout@v4 + + # PRs use a 60-instance sample; scheduled/dispatch runs use the full + # corpus (or whatever the dispatch input requests). + - name: Resolve sample size + id: cfg + run: | + if [ "${{ github.event_name }}" = "pull_request" ]; then + echo "sample=60" >> "$GITHUB_OUTPUT" + else + echo "sample=${{ github.event.inputs.sample || 0 }}" >> "$GITHUB_OUTPUT" + fi + + - name: Cache SpreadsheetBench corpus + uses: actions/cache@v4 + with: + path: data/corpora/spreadsheetbench + key: spreadsheetbench-912-v0.1 + + - name: Build benchmark image + run: docker build -f Dockerfile.bench -t ks-xlsx-parser-bench . + + - name: Run benchmark + run: | + mkdir -p tests/benchmarks/reports data + docker run --rm \ + -e BENCH_SAMPLE=${{ steps.cfg.outputs.sample }} \ + -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \ + -v "$PWD/data:/app/data" \ + ks-xlsx-parser-bench | tee bench.log + + - name: Publish recall to job summary + if: always() + run: | + { + echo '## ks-xlsx-parser retrieval benchmark' + echo '' + echo '```' + tail -n 40 bench.log || true + echo '```' + } >> "$GITHUB_STEP_SUMMARY" + + - name: Upload benchmark reports + if: always() + uses: actions/upload-artifact@v4 + with: + name: benchmark-reports-${{ github.run_number }} + path: tests/benchmarks/reports/ + if-no-files-found: warn diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 1168a01..ba7b9e2 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -23,22 +23,20 @@ jobs: steps: - uses: actions/checkout@v4 + - name: Install uv + uses: astral-sh/setup-uv@v5 + with: + version: "latest" + enable-cache: true + cache-dependency-glob: "uv.lock" + - name: Set up Python ${{ matrix.python-version }} uses: actions/setup-python@v5 with: python-version: ${{ matrix.python-version }} - cache: pip - cache-dependency-path: pyproject.toml - name: Install - run: | - python -m pip install --upgrade pip - pip install -e ".[dev,api]" - - - name: Ruff lint - run: | - pip install ruff - ruff check src/ tests/ scripts/ || true # non-blocking until cleanup PR lands + run: uv pip install --system -e ".[dev,api]" - name: Run test suite run: make test-ci @@ -50,3 +48,56 @@ jobs: name: junit-${{ matrix.os }}-py${{ matrix.python-version }} path: reports/junit.xml if-no-files-found: ignore + + # Builds the wheel and proves it installs + imports in a CLEAN venv. + # The matrix `test` job runs against an editable install, which exposes the + # whole src/ tree on sys.path and therefore CANNOT catch a broken wheel — + # exactly how the v0.2.0 "pipeline.py missing from wheel" bug shipped. + # This job is the regression guard. Keep it required. + wheel-check: + name: wheel install smoke test + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + + - name: Build wheel + run: | + python -m pip install --upgrade pip build + python -m build --wheel + + - name: Verify wheel installs and imports in a clean venv + run: python scripts/verify_wheel.py + + lint: + name: ruff lint + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + - name: Ruff lint + run: | + python -m pip install --upgrade pip ruff + # TODO(cleanup): 45 pre-existing findings (E402/B905/SIM*). Drop the + # `|| true` once the lint-cleanup PR lands so this job gates merges. + ruff check src/ tests/ scripts/ || true + + typecheck: + name: mypy + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + - name: mypy + run: | + python -m pip install --upgrade pip + python -m pip install -e ".[dev,api]" + # TODO(cleanup): 59 pre-existing findings. Drop `|| true` once typed. + python -m mypy src/ks_xlsx_parser || true diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index 5cf43b6..1da9bcc 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -37,6 +37,12 @@ jobs: - name: Build wheel + sdist run: python -m build + # Gate the release on a clean-venv install of the freshly built wheel. + # Prevents shipping a wheel that drops modules or leaks top-level + # packages (the v0.2.0 packaging regression). + - name: Verify wheel + run: python scripts/verify_wheel.py + - name: Upload distribution artifacts uses: actions/upload-artifact@v4 with: diff --git a/CHANGELOG.md b/CHANGELOG.md index d7527b2..99d5832 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -45,6 +45,52 @@ Template for a new release (copy this block, fill in, move Unreleased items in): ## [Unreleased] +### ⚠️ BREAKING (Fixed — see also #ks-xlsx-parser channel report) +- Repository layout flattened on `src/` was leaking 13 generic top-level + packages (`models`, `utils`, `parsers`, …) into installed wheels and + silently dropping `pipeline.py` and `api.py` (setuptools `packages.find` + only finds *packages*, not top-level modules). Users hitting + `from ks_xlsx_parser.pipeline import ...` on 0.2.0 from PyPI got + `ModuleNotFoundError`. **All modules now live under + `src/ks_xlsx_parser/`**; the wheel's `top_level.txt` contains only + `ks_xlsx_parser`. Imports inside the package switched from + `from pipeline import` to `from ks_xlsx_parser.pipeline import`. + Downstream code that imported the leaked generics + (`from models import …`) MUST migrate to `from ks_xlsx_parser.models …`. + +### Added +- `scripts/verify_wheel.py` — builds the wheel, installs it in a fresh + venv, and asserts the public import surface resolves. Wired into a + new `wheel-check` job in `.github/workflows/ci.yml` and a `Verify wheel` + step in `release.yml`. Regression guard for the packaging bug above. +- `scripts/triage_recall.py` + `scripts/append_bench_history.py` — turn + `failures.ndjson` into a ranked bucket histogram with exemplar + failures, and append each benchmark run to + `tests/benchmarks/reports/history.jsonl` so recall is tracked + commit-over-commit. Goal: text recall@5 > 0.90. +- `eval_retrieval.py --emit-failures` — dumps top-8 ranked chunks per + miss with a `failure_bucket` (answer_absent_from_chunks / + present_but_ranked_low / wrong_sheet / geometric_no_overlap / …) for + triage. Summary JSON gains a `failure_buckets` histogram. +- `Dockerfile.bench` + `.github/workflows/benchmark.yml` — reproducible + benchmark image; PR sample run (60 instances), weekly full corpus run. +- `make install-dev` alias and `make wheel-check` / `make bench-track` + / `make docker-bench` targets. +- New `bench` optional-dependency group (`sentence-transformers`, + `numpy`) — only the benchmark needs these. +- `docs/recall-investigation.md` documenting the diagnosis framework and + three named hypotheses (chunk-size dilution, formula-expression + rendering, range-bookkeeping drift). +- `.claude/skills/recall-failure-triage/SKILL.md` — agent skill that + consumes the bucket output and proposes ranked fixes. + +### Changed +- Dropped `PYTHONPATH=src` from Makefile benchmark targets — the + package is now properly installable so callers don't need it. +- `pyproject.toml`: `packages.find` constrained to `ks_xlsx_parser*`, + `py.typed` declared as package data, `xlsx-parser-api` console script + updated to `ks_xlsx_parser.api:main`. + ### ⚠️ BREAKING - Retired the in-tree `testBench/` corpus. The 1054-workbook stress dataset and `make testbench*` targets are gone — benchmarks now run against the diff --git a/Dockerfile.bench b/Dockerfile.bench new file mode 100644 index 0000000..cf2755f --- /dev/null +++ b/Dockerfile.bench @@ -0,0 +1,49 @@ +# Benchmark image for ks-xlsx-parser. +# +# Builds once, then on each run downloads SpreadsheetBench (if not cached), +# parses the corpus, embeds chunks with a small sentence-transformer, and +# emits a recall@k report + failure-bucket triage. The output lands in +# tests/benchmarks/reports/ — mount that path as a volume to persist results. +# +# Usage: +# docker build -f Dockerfile.bench -t ks-xlsx-parser-bench . +# docker run --rm \ +# -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \ +# -v "$PWD/data:/app/data" \ +# ks-xlsx-parser-bench +# +# # Quick sanity run on 20 instances: +# docker run --rm -e BENCH_SAMPLE=20 ks-xlsx-parser-bench + +FROM python:3.12-slim + +ENV PYTHONDONTWRITEBYTECODE=1 \ + PYTHONUNBUFFERED=1 \ + PIP_NO_CACHE_DIR=1 \ + PIP_DISABLE_PIP_VERSION_CHECK=1 + +WORKDIR /app + +RUN apt-get update && apt-get install -y --no-install-recommends \ + curl unzip ca-certificates git \ + && rm -rf /var/lib/apt/lists/* + +# Install deps first to keep layers cacheable across code edits. +COPY pyproject.toml README.md ./ +COPY src ./src +RUN pip install -e ".[dev,bench]" + +# Pre-warm the embedding model so the first ``docker run`` doesn't pay the +# ~80 MB download. Same model name eval_retrieval.py defaults to. +RUN python -c "from sentence_transformers import SentenceTransformer; \ +SentenceTransformer('BAAI/bge-small-en-v1.5')" + +COPY scripts ./scripts +COPY tests ./tests +COPY Makefile ./ + +ENV BENCH_SAMPLE=0 \ + BENCH_PARSERS=ks \ + BENCH_TIMEOUT=120 + +ENTRYPOINT ["bash", "scripts/run_bench.sh"] diff --git a/Makefile b/Makefile index 9bedb5b..5331c52 100644 --- a/Makefile +++ b/Makefile @@ -1,4 +1,5 @@ -.PHONY: help install test test-ci lint format typecheck clean corpus-download bench-robust bench-retrieval bench +.PHONY: help install install-dev test test-ci lint format typecheck wheel-check clean \ + corpus-download bench-robust bench-retrieval bench bench-track docker-bench PYTHON ?= python PKG_VERSION := $(shell $(PYTHON) -c "import tomllib, pathlib; print(tomllib.loads(pathlib.Path('pyproject.toml').read_text())['project']['version'])") @@ -7,22 +8,29 @@ help: @echo "ks-xlsx-parser — common targets" @echo "" @echo " make install Install package and dev deps (editable)" + @echo " make install-dev Alias for install (matches ks-backend)" @echo " make test Run the default test suite" @echo " make test-ci Run the suite with verbose output for CI" @echo "" @echo " make lint Ruff lint" @echo " make format Ruff format" @echo " make typecheck mypy" + @echo " make wheel-check Build wheel + verify it imports in a clean venv" @echo "" @echo " make corpus-download Fetch SpreadsheetBench for benchmark runs" @echo "" @echo " make bench-robust Robustness on SpreadsheetBench (ks vs docling, ~20 min)" @echo " make bench-retrieval Retrieval recall on SpreadsheetBench (ks vs docling, ~40 min)" @echo " make bench Run both benchmarks back-to-back" + @echo " make bench-track Run retrieval bench + append metrics to history" + @echo " make docker-bench Build + run the benchmark Docker image" install: $(PYTHON) -m pip install -e ".[dev,api]" +# Alias — junior devs pattern-match off ks-backend's `make install-dev`. +install-dev: install + test: $(PYTHON) -m pytest tests/ -v --tb=short -W ignore::UserWarning @@ -36,7 +44,15 @@ format: $(PYTHON) -m ruff format src/ tests/ scripts/ typecheck: - $(PYTHON) -m mypy src/xlsx_parser + $(PYTHON) -m mypy src/ks_xlsx_parser + +# Build the wheel and prove it imports outside the editable source tree. +# This is the regression guard for the v0.2.0 packaging bug (pipeline.py +# missing from the wheel because it was a top-level module, not a package). +wheel-check: + rm -rf dist build + $(PYTHON) -m build --wheel + $(PYTHON) scripts/verify_wheel.py clean: rm -rf build/ dist/ *.egg-info src/*.egg-info .pytest_cache .ruff_cache .mypy_cache @@ -47,13 +63,26 @@ corpus-download: bench-robust: @test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1) - PYTHONPATH=src $(PYTHON) -m tests.benchmarks.vs_hucre \ + $(PYTHON) -m tests.benchmarks.vs_hucre \ --corpus data/corpora/spreadsheetbench --parsers ks,docling \ --per-file-timeout 120 \ --out tests/benchmarks/reports/spreadsheetbench bench-retrieval: @test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1) - PYTHONPATH=src $(PYTHON) scripts/eval_retrieval.py --parsers ks,docling + $(PYTHON) scripts/eval_retrieval.py --parsers ks,docling bench: bench-robust bench-retrieval + +# Run the retrieval benchmark and append a row to history.jsonl so +# accuracy can be tracked commit-over-commit. Goal: text recall@5 > 0.90. +bench-track: + @test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1) + $(PYTHON) scripts/eval_retrieval.py --parsers ks --emit-failures \ + --out tests/benchmarks/reports/retrieval + $(PYTHON) scripts/append_bench_history.py + $(PYTHON) scripts/triage_recall.py tests/benchmarks/reports/retrieval + +docker-bench: + docker build -f Dockerfile.bench -t ks-xlsx-parser-bench . + docker run --rm -v "$(PWD)/tests/benchmarks/reports:/app/tests/benchmarks/reports" ks-xlsx-parser-bench diff --git a/docs/recall-investigation.md b/docs/recall-investigation.md new file mode 100644 index 0000000..d3576c8 --- /dev/null +++ b/docs/recall-investigation.md @@ -0,0 +1,145 @@ +# Retrieval-recall investigation — getting ks-xlsx-parser to >0.90 + +## Where we are (v0.2.0 on SpreadsheetBench, 912 instances) + +| Metric | ks-xlsx-parser | docling 2.93 | +|------------------------|----------------|--------------| +| Parse success | 99.945% | not run at scale | +| Recall@1 (text-match) | 0.580 | 0.579 | +| Recall@3 (text-match) | 0.697 | 0.670 | +| Recall@5 (text-match) | **0.704** | 0.686 | +| Recall@5 (geometric) | 0.369 | 0.000 (no A1 anchors) | +| Mean parse time | 251 ms | 265 ms | + +Recall@5 = 0.704 means **~30% of questions miss** with k=5. To reach 0.90 +we need to roughly cut the miss rate to a third. A single recall number +hides which lever to pull, which is why this branch ships failure +bucketing (`scripts/eval_retrieval.py --emit-failures` + +`scripts/triage_recall.py`). + +## The diagnosis framework — why the bucket histogram is the answer + +Every recall@5 miss falls into one of these buckets. The fix is +completely different per bucket, and only one or two will dominate. The +job of the investigator is to read the histogram FIRST, then commit to +fixing the biggest one. + +| Bucket | What it means | Where to look | +|---|---|---| +| `answer_absent_from_chunks` | Answer value is in NO chunk. Cell was dropped or garbled. | `parsers/cell_parser.py`, `rendering/text_renderer.py::_cell_render_value` | +| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. Chunk is too large/heterogeneous; the embedding is diluted. | `chunking/chunker.py` (no token cap), `analysis/table_assembler.py` (over-merging) | +| `wrong_sheet` | Answer sheet was never chunked. Sheet enumeration missed it. | `parsers/workbook_parser.py` sheet loop | +| `geometric_no_overlap` | No chunk's A1 range overlaps ground truth. Range bookkeeping drifts during merge/split. | `annotation/block_splitter.py`, `analysis/pattern_splitter.py` | +| `no_chunks` / `parse_error` | Upstream parser failure. | The parse exception — fix the crash. | + +## A priori hypotheses (to be confirmed by the next benchmark run) + +### H1 — `present_but_ranked_low` is the biggest bucket + +There is no per-chunk token cap in `chunking/chunker.py` (`CHARS_PER_TOKEN` +is only used to *report* `token_count`, never to split). On +SpreadsheetBench many input files are single-sheet ledgers where the +block-assembler collapses the whole sheet into one chunk. The +sentence-transformer query embedding then has to compete against ~2k +tokens of mostly irrelevant text; the relevant ~5 tokens get washed out. + +If H1 is right, the histogram will show `present_but_ranked_low` ≫ the +others, and recall@1 (0.580) will be much worse than recall@5 (0.704) +— exactly what we observe (Δ = 12.4 pp, vs typical Δ ≈ 5–6 pp when +chunks are right-sized). + +**Fix**: hard cap chunks at ~512 tokens by row-splitting tables and add +a "row group" sub-chunk for tall tables. This is a 1–2 day surgical +change in `chunking/chunker.py`. + +### H2 — `answer_absent_from_chunks` dominates the geometric gap + +`parsers/workbook_parser.py` loads both `data_only=False` (formula +expressions) and `data_only=True` (computed values). But what flows into +`render_text` is whichever `_cell_render_value` picks. If the cell is a +formula like `=SUM(B2:B10)`, `display_value` may be the *expression* +when the workbook was saved without cached values (LibreOffice and some +generated files do this). Those answer cells become unfindable by text +match even though the data IS in the spreadsheet. + +**Diagnostic**: count failure rows where every `top_chunks[*].text` +matches the formula expression pattern (`=`, function name) but not the +expected numeric value. The bucket emits the top-8 chunks for inspection. + +**Fix**: when the cached value is missing for a formula cell, evaluate it +with our own formula engine (`formula/formula_parser.py` already exists) +or use python-calamine's value-only pass as the source of truth for +render text — never the formula source. + +### H3 — `geometric_no_overlap` is high because block ranges over-extend + +Geometric recall@5 = 0.369 means **only ~37% of the time** does the +chunk a parser surfaces actually cover the ground-truth answer cell — +even when the text match works. The block-detection pipeline merges +sparse blocks (`analysis/light_block_detector.py`) and groups by +similarity (`analysis/table_grouper.py`). Each merge widens the +top-left/bottom-right anchors. If the anchors are widened past the +sheet's true content, downstream citation overlays in ks-backend will +highlight whitespace, and the geometric metric registers the chunk as +"not overlapping" because its claimed range is so large it's not useful. + +**Fix**: after every merge/split, clip `cell_range` to the tight bounding +box of the cells that actually contributed text. Add an invariant test +that `block.cell_range` ⊆ `bounding_box(block.cells)`. + +## How to confirm — the next benchmark run + +1. `make corpus-download` (one-time, ~hundreds of MB). +2. `make bench-track` — runs the full benchmark, appends to + `tests/benchmarks/reports/history.jsonl`, prints the bucket triage. +3. Read the histogram. Pick the biggest bucket. Open 3–5 example + failures with `python scripts/triage_recall.py + --bucket --examples 5`. Each row shows: + * the natural-language question + * the ground-truth answer cell + values + * the top-8 ranked chunks we produced (sheet, A1 range, text snippet) + * whether each chunk contains the answer +4. Pattern-match across 5 examples — what's the common parser behaviour? + That tells you the fix. +5. Implement, re-run `make bench-track`. The script prints the delta + row-over-row so improvement is visible immediately. + +## How to use the Docker image (CI + reproducibility) + +```bash +# Build once +docker build -f Dockerfile.bench -t ks-xlsx-parser-bench . + +# Quick smoke (60 instances, < 2 min) +docker run --rm -e BENCH_SAMPLE=60 ks-xlsx-parser-bench + +# Full corpus, persist reports + corpus cache +docker run --rm \ + -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \ + -v "$PWD/data:/app/data" \ + ks-xlsx-parser-bench +``` + +The `Benchmark` GitHub workflow: +* Runs a 60-instance smoke on every PR that touches `src/` or the + benchmark scripts. +* Runs the full 912-instance corpus weekly (Monday 06:00 UTC) and on + manual dispatch. +* Uploads `tests/benchmarks/reports/*` as a build artifact and posts the + recall summary to the job step summary. + +## Goal & cadence + +Target: **text recall@5 ≥ 0.90** by end of the current quarter. + +Track in `tests/benchmarks/reports/history.jsonl` (commit-over-commit +row append). Refuse merges that drop recall@5 by ≥ 2 pp on the sample +run (planned gate; today the PR job is reporting-only). + +## See also + +* `scripts/eval_retrieval.py` — the benchmark itself. +* `scripts/triage_recall.py` — bucket histogram + example dump. +* `scripts/append_bench_history.py` — history.jsonl row writer. +* `.claude/skills/recall-failure-triage/SKILL.md` — agent guide. +* `Dockerfile.bench` — reproducible benchmark image. diff --git a/examples/demo.py b/examples/demo.py index dbf6118..de59151 100644 --- a/examples/demo.py +++ b/examples/demo.py @@ -12,8 +12,8 @@ # Add src to path for development sys.path.insert(0, str(Path(__file__).parent.parent / "src")) -from xlsx_parser.pipeline import parse_workbook -from xlsx_parser.utils.logging_config import configure_logging +from ks_xlsx_parser.pipeline import parse_workbook +from ks_xlsx_parser.utils.logging_config import configure_logging EXAMPLES_DIR = Path(__file__).parent / "fixtures" @@ -136,7 +136,7 @@ def demo_engineering_calcs(): print(f" Named Ranges: {[nr.name for nr in wb.named_ranges]}") # Show dependency chain for Design Moment (C15) - from xlsx_parser.models import CellCoord + from ks_xlsx_parser.models import CellCoord upstream = wb.dependency_graph.get_upstream( "Beam Design", CellCoord(row=15, col=3), max_depth=3 ) diff --git a/examples/generate_examples.py b/examples/generate_examples.py index 8d25e01..245caa2 100644 --- a/examples/generate_examples.py +++ b/examples/generate_examples.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 """ -Generate example Excel workbooks for demonstrating the xlsx_parser. +Generate example Excel workbooks for demonstrating the ks_xlsx_parser. Creates several representative workbooks in the examples/ folder that showcase the parser's capabilities across different Excel features. diff --git a/examples/stress_test/stress_test_runner.py b/examples/stress_test/stress_test_runner.py index 2551765..98128fa 100644 --- a/examples/stress_test/stress_test_runner.py +++ b/examples/stress_test/stress_test_runner.py @@ -18,7 +18,7 @@ PROJECT_ROOT = Path(__file__).parent.parent.parent sys.path.insert(0, str(PROJECT_ROOT / "src")) -from xlsx_parser.pipeline import parse_workbook +from ks_xlsx_parser.pipeline import parse_workbook STRESS_DIR = Path(__file__).parent diff --git a/pyproject.toml b/pyproject.toml index a425853..bbaa8ac 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -46,7 +46,7 @@ Repository = "https://github.com/knowledgestack/ks-xlsx-parser" Documentation = "https://github.com/knowledgestack/ks-xlsx-parser#readme" [project.scripts] -xlsx-parser-api = "api:main" +xlsx-parser-api = "ks_xlsx_parser.api:main" [project.optional-dependencies] api = [ @@ -64,6 +64,12 @@ dev = [ "ruff>=0.6.0", "mypy>=1.0", ] +# Retrieval-recall benchmark (scripts/eval_retrieval.py). Heavy — only the +# benchmark Docker image and `make bench-retrieval` need these. +bench = [ + "sentence-transformers>=2.2.0", + "numpy>=1.24.0", +] [tool.pytest.ini_options] testpaths = ["tests"] @@ -80,6 +86,10 @@ addopts = "-m 'not corpus'" [tool.setuptools.packages.find] where = ["src"] +include = ["ks_xlsx_parser*"] + +[tool.setuptools.package-data] +ks_xlsx_parser = ["py.typed"] [tool.ruff] line-length = 110 diff --git a/scripts/append_bench_history.py b/scripts/append_bench_history.py new file mode 100755 index 0000000..01def1f --- /dev/null +++ b/scripts/append_bench_history.py @@ -0,0 +1,101 @@ +#!/usr/bin/env python3 +"""Append the latest retrieval-benchmark run to a commit-over-commit history. + +``eval_retrieval.py`` writes a timestamped ``summary.json`` per run. This +script picks the most recent one, flattens the headline metrics, tags it +with the current git commit, and appends one JSON line to +``tests/benchmarks/reports/history.jsonl``. + +That history file is what makes "is recall improving over time?" answerable +— plot it, diff it in CI, or just ``tail`` it. Goal: text recall@5 > 0.90. + +Usage: + python scripts/append_bench_history.py + python scripts/append_bench_history.py --reports-dir tests/benchmarks/reports/retrieval +""" +from __future__ import annotations + +import argparse +import json +import subprocess +from datetime import UTC, datetime +from pathlib import Path + +ROOT = Path(__file__).resolve().parent.parent +HISTORY = ROOT / "tests" / "benchmarks" / "reports" / "history.jsonl" + + +def git_commit() -> str: + try: + return subprocess.check_output( + ["git", "rev-parse", "--short", "HEAD"], cwd=ROOT, text=True + ).strip() + except Exception: + return "unknown" + + +def latest_summary(reports_dir: Path) -> Path: + summaries = sorted(reports_dir.glob("*/summary.json")) + if not summaries: + raise SystemExit( + f"no summary.json under {reports_dir} — run `make bench-retrieval` first" + ) + return summaries[-1] + + +def main(argv: list[str] | None = None) -> int: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--reports-dir", type=Path, + default=ROOT / "tests" / "benchmarks" / "reports" / "retrieval") + ap.add_argument("--parser", default="ks-xlsx-parser", + help="which parser's metrics to record") + args = ap.parse_args(argv) + + summary_path = latest_summary(args.reports_dir) + summary = json.loads(summary_path.read_text()) + metrics = summary.get(args.parser) + if metrics is None: + raise SystemExit( + f"parser {args.parser!r} not in {summary_path}; " + f"have: {list(summary)}" + ) + + row = { + "timestamp": datetime.now(UTC).isoformat(), + "commit": git_commit(), + "parser": args.parser, + "run": summary_path.parent.name, + "instances": metrics.get("instances"), + "recall_text@1": metrics.get("recall_text@1"), + "recall_text@3": metrics.get("recall_text@3"), + "recall_text@5": metrics.get("recall_text@5"), + "recall_geometric@5": metrics.get("recall_geometric@5"), + "table_fragmentation_rate": metrics.get("table_fragmentation_rate"), + "mean_parse_ms": metrics.get("mean_parse_ms"), + "errors": metrics.get("errors"), + "failure_buckets": metrics.get("failure_buckets"), + } + + HISTORY.parent.mkdir(parents=True, exist_ok=True) + with HISTORY.open("a") as f: + f.write(json.dumps(row, separators=(",", ":")) + "\n") + + print(f"appended to {HISTORY.relative_to(ROOT)}:") + print(f" commit {row['commit']} recall_text@5={row['recall_text@5']} " + f"recall_text@1={row['recall_text@1']}") + + # Show the trend if there's history to compare against. + rows = [json.loads(ln) for ln in HISTORY.read_text().splitlines() if ln.strip()] + if len(rows) >= 2: + prev, cur = rows[-2], rows[-1] + for k in ("recall_text@5", "recall_text@1"): + p, c = prev.get(k), cur.get(k) + if isinstance(p, int | float) and isinstance(c, int | float): + delta = c - p + arrow = "▲" if delta > 0 else ("▼" if delta < 0 else "—") + print(f" {k}: {p:.4f} → {c:.4f} {arrow} {delta:+.4f}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/eval_retrieval.py b/scripts/eval_retrieval.py index 64c1743..3d4cdff 100644 --- a/scripts/eval_retrieval.py +++ b/scripts/eval_retrieval.py @@ -34,13 +34,16 @@ import signal import sys import time +from collections.abc import Iterable from dataclasses import dataclass, field from pathlib import Path -from typing import Any, Iterable +from typing import Any REPO_ROOT = Path(__file__).resolve().parent.parent +# Keep ``import scripts.X`` style imports working when invoked as +# ``python scripts/eval_retrieval.py``. We no longer need ``src`` on the +# path — ks_xlsx_parser is a properly-installed package now. sys.path.insert(0, str(REPO_ROOT)) -sys.path.insert(0, str(REPO_ROOT / "src")) def _normalize_value_for_match(s: str) -> set[str]: @@ -254,7 +257,7 @@ def parse_position_spec( def extract_chunks_ks(path: Path) -> list[Chunk]: - from pipeline import parse_workbook + from ks_xlsx_parser.pipeline import parse_workbook result = parse_workbook(path=str(path)) out: list[Chunk] = [] @@ -442,6 +445,62 @@ class InstanceResult: extra: dict[str, Any] = field(default_factory=dict) +# ────────────────────────────────────────────────────────── failure triage +# +# recall@5 sitting at ~0.70 means ~30% of questions miss. A single recall +# number can't tell you WHY — and the fix is completely different per cause: +# +# * answer_absent_from_chunks → the answer value is in NO chunk at all. +# The parser dropped/garbled the cell. Fix the EXTRACTION pipeline. +# * present_but_ranked_low → a chunk DOES contain the answer, but the +# embedding model ranked it past k. Fix CHUNKING (smaller/cleaner +# chunks rank better) or the embedding step — NOT the parser. +# * wrong_sheet → answer is on a sheet we produced no chunk +# for, or mis-attributed the sheet. Fix sheet enumeration. +# * geometric_no_overlap → no chunk's A1 range overlaps ground truth +# even though text may match. Fix RANGE bookkeeping (top_left / +# bottom_right anchors drift during merge/split). +# * no_chunks / parse_error → upstream parser failure. +# +# Splitting the miss population into these buckets turns "recall is low" +# into a ranked, actionable worklist. This is the single most useful thing +# for an agent iterating on recall — see scripts/triage_recall.py. + +FAILURE_BUCKETS = ( + "answer_absent_from_chunks", + "present_but_ranked_low", + "wrong_sheet", + "geometric_no_overlap", + "no_chunks", + "parse_error", + "unparseable_ground_truth", +) + + +def classify_text_failure(rec: dict[str, Any]) -> str | None: + """Bucket a single result record for the TEXT-match recall@5 metric. + + Returns None if the instance was a recall@5 hit (rank ≤ 5) or was not + scoreable (no ground-truth values to match against). + """ + if rec.get("error"): + return "no_chunks" if rec.get("n_chunks", 0) == 0 else "parse_error" + if not rec.get("had_answer_values", True): + return None # not scoreable for text-match — exclude, don't penalise + if rec.get("n_chunks", 0) == 0: + return "no_chunks" + rank = rec.get("rank_of_text_match") + if rank is not None and rank <= 5: + return None # hit + if rank is None: + # Not in any chunk. Distinguish "value never extracted" from + # "value extracted but on a sheet we never chunked". + if rec.get("answer_on_unchunked_sheet"): + return "wrong_sheet" + return "answer_absent_from_chunks" + return "present_but_ranked_low" # rank > 5 + + def score_instance( *, parser_name: str, @@ -452,8 +511,10 @@ def score_instance( answer_position: str, default_sheet: str | None, answer_cell_values: list[str], + answer_regions: list[tuple[str | None, tuple[int, int, int, int]]] | None = None, model, per_parser_timeout_s: float = 60.0, + emit_failures: bool = False, ) -> InstanceResult: import numpy as np @@ -547,6 +608,38 @@ def score_instance( rank_text = r break + # Did the parser produce ANY chunk on the sheet(s) the answer lives on? + # If not, a text miss is a sheet-enumeration bug, not a cell-drop bug. + answer_on_unchunked_sheet = False + if answer_regions: + gt_sheets = {s for s, _ in answer_regions if s} + chunk_sheets = {c.sheet for c in chunks if c.sheet} + if gt_sheets and chunk_sheets and not (gt_sheets & chunk_sheets): + answer_on_unchunked_sheet = True + + extra: dict[str, Any] = { + "answer_on_unchunked_sheet": answer_on_unchunked_sheet, + "had_answer_values": bool(answer_cell_values), + } + if emit_failures: + # Dump the top-8 ranked chunks so a human/agent can eyeball WHY the + # answer was missed without re-running the (expensive) benchmark. + top: list[dict[str, Any]] = [] + for r, idx in enumerate(ranking[:8], start=1): + c = chunks[idx] + top.append({ + "rank": r, + "sheet": c.sheet, + "range": ( + f"{c.top_left}:{c.bottom_right}" + if c.top_left else None + ), + "contains_answer": _matches_chunk_text(answer_cell_values, c.text or ""), + "text": (c.text or "")[:280], + }) + extra["top_chunks"] = top + extra["answer_values"] = answer_cell_values[:20] + return InstanceResult( instance_id=inst_id, parser=parser_name, @@ -558,6 +651,7 @@ def score_instance( chunks_overlapping_data=len(overlap_idxs), rank_of_first_overlap=rank_overlap, rank_of_text_match=rank_text, + extra=extra, ) @@ -650,6 +744,21 @@ def _recall_at(k: int, key: str) -> float: parse_times = [r.parse_ms for r in recs if not r.error] + # Failure-bucket histogram for the text-match recall@5 metric. + # Only counts instances that HAD ground-truth values to match + # against (others aren't scoreable). See classify_text_failure. + buckets = dict.fromkeys(FAILURE_BUCKETS, 0) + for r in recs: + bucket = classify_text_failure({ + "error": r.error, + "n_chunks": r.n_chunks, + "rank_of_text_match": r.rank_of_text_match, + "had_answer_values": r.extra.get("had_answer_values", True), + "answer_on_unchunked_sheet": r.extra.get("answer_on_unchunked_sheet"), + }) + if bucket is not None: + buckets[bucket] += 1 + summary[parser] = { "instances": total, "ok": ok, @@ -667,6 +776,9 @@ def _recall_at(k: int, key: str) -> float: if parse_times else None, "p50_parse_ms": round(sorted(parse_times)[len(parse_times) // 2], 2) if parse_times else None, + # Why the text-match recall@5 misses happen, bucketed. The + # biggest bucket is the highest-leverage thing to fix next. + "failure_buckets": buckets, } return summary @@ -701,6 +813,10 @@ def main(argv: list[str] | None = None) -> int: parser.add_argument("--test-case", type=int, default=1, help="Which of the (typically 3) test cases per instance " "to score on. We use one to keep eval costs bounded.") + parser.add_argument("--emit-failures", action="store_true", + help="Also write failures.ndjson — one row per " + "recall@5 miss with the top-8 ranked chunks and " + "ground-truth values, for failure triage.") parser.add_argument("--per-parser-timeout", type=float, default=60.0, help="Wall-clock seconds before a parser is " "considered hung on a single file (docling can " @@ -736,6 +852,7 @@ def main(argv: list[str] | None = None) -> int: ndjson_path = out_dir / "results.ndjson" results: list[InstanceResult] = [] + failure_rows: list[dict[str, Any]] = [] n = len(instances) * len(parser_fns) done = 0 @@ -786,10 +903,20 @@ def main(argv: list[str] | None = None) -> int: answer_position=answer_pos, default_sheet=default_sheet, answer_cell_values=answer_values, + answer_regions=answer_regions, model=model, per_parser_timeout_s=args.per_parser_timeout, + emit_failures=args.emit_failures, ) results.append(res) + bucket = classify_text_failure({ + "error": res.error, + "n_chunks": res.n_chunks, + "rank_of_text_match": res.rank_of_text_match, + "had_answer_values": res.extra.get("had_answer_values", True), + "answer_on_unchunked_sheet": res.extra.get( + "answer_on_unchunked_sheet"), + }) f.write(json.dumps({ "instance_id": res.instance_id, "parser": res.parser, @@ -801,8 +928,22 @@ def main(argv: list[str] | None = None) -> int: "chunks_overlapping_data": res.chunks_overlapping_data, "rank_of_first_overlap": res.rank_of_first_overlap, "rank_of_text_match": res.rank_of_text_match, + "failure_bucket": bucket, "error": res.error, }, separators=(",", ":")) + "\n") + if args.emit_failures and bucket is not None: + failure_rows.append({ + "instance_id": res.instance_id, + "parser": res.parser, + "failure_bucket": bucket, + "instruction": instr, + "answer_position": answer_pos, + "answer_values": res.extra.get("answer_values", []), + "rank_of_text_match": res.rank_of_text_match, + "n_chunks": res.n_chunks, + "top_chunks": res.extra.get("top_chunks", []), + "error": res.error, + }) done += 1 if done % 10 == 0: sys.stderr.write(f"\r[{done}/{n}] ") @@ -810,6 +951,13 @@ def main(argv: list[str] | None = None) -> int: sys.stderr.write(f"\nWrote {ndjson_path}\n") + if args.emit_failures: + fail_path = out_dir / "failures.ndjson" + with fail_path.open("w") as ff: + for row in failure_rows: + ff.write(json.dumps(row, separators=(",", ":")) + "\n") + sys.stderr.write(f"Wrote {fail_path} ({len(failure_rows)} failure rows)\n") + summary = aggregate(results) summary_path = out_dir / "summary.json" summary_path.write_text(json.dumps(summary, indent=2)) @@ -853,6 +1001,20 @@ def main(argv: list[str] | None = None) -> int: "(sheet, range) per chunk — docling does not, so its geometric " "recall is structurally 0.") md_lines.append("") + md_lines.append("## Failure buckets (text-match recall@5 misses)") + md_lines.append("") + md_lines.append("Why each miss happened — the biggest bucket is the highest-") + md_lines.append("leverage fix. `answer_absent_from_chunks` → fix extraction; ") + md_lines.append("`present_but_ranked_low` → fix chunking/embedding.") + md_lines.append("") + md_lines.append("| Bucket | " + " | ".join(parsers) + " |") + md_lines.append("|---|" + "|".join(["---"] * len(parsers)) + "|") + for b in FAILURE_BUCKETS: + row = [b] + for p in parsers: + row.append(str(summary[p].get("failure_buckets", {}).get(b, 0))) + md_lines.append("| " + " | ".join(row) + " |") + md_lines.append("") md_lines.append("**Text-match** = the answer cell's actual string value appears " "as a substring of the chunk's text. Parser-agnostic; this is " "the apples-to-apples retrieval comparison.") diff --git a/scripts/run_bench.sh b/scripts/run_bench.sh new file mode 100755 index 0000000..05a77fb --- /dev/null +++ b/scripts/run_bench.sh @@ -0,0 +1,56 @@ +#!/usr/bin/env bash +# Entrypoint for the benchmark Docker image (Dockerfile.bench). +# +# Ensures the SpreadsheetBench corpus is present, runs the retrieval-recall +# benchmark for ks-xlsx-parser, appends the result to history.jsonl, and +# prints a failure-bucket triage so accuracy can be tracked over time. +# +# Env vars: +# BENCH_SAMPLE parse only N random instances (0 / unset = full 912) +# BENCH_PARSERS comma list passed to eval_retrieval.py (default: ks) +# BENCH_TIMEOUT per-file parser timeout in seconds (default: 120) +set -euo pipefail + +cd "$(dirname "$0")/.." + +CORPUS_DIR="data/corpora/spreadsheetbench" +SAMPLE="${BENCH_SAMPLE:-0}" +PARSERS="${BENCH_PARSERS:-ks}" +TIMEOUT="${BENCH_TIMEOUT:-120}" + +if [ ! -d "$CORPUS_DIR" ]; then + echo "→ Downloading SpreadsheetBench corpus ..." + mkdir -p "$CORPUS_DIR" + curl -L --fail --retry 3 --connect-timeout 20 \ + -o /tmp/sb.tar.gz \ + "https://raw.githubusercontent.com/RUCKBReasoning/SpreadsheetBench/main/data/spreadsheetbench_912_v0.1.tar.gz" + tar -xzf /tmp/sb.tar.gz -C "$CORPUS_DIR" + rm -f /tmp/sb.tar.gz +fi + +# eval_retrieval.py expects the dataset.json + spreadsheet/ dir. Find it. +CORPUS_ARG="$CORPUS_DIR" +if [ -d "$CORPUS_DIR/all_data_912_v0.1" ]; then + CORPUS_ARG="$CORPUS_DIR/all_data_912_v0.1" +fi + +SAMPLE_ARG=() +if [ "$SAMPLE" != "0" ]; then + SAMPLE_ARG=(--sample "$SAMPLE") + echo "→ Sampling $SAMPLE instances" +fi + +echo "→ Running retrieval benchmark (parsers=$PARSERS) ..." +python scripts/eval_retrieval.py \ + --corpus "$CORPUS_ARG" \ + --parsers "$PARSERS" \ + --emit-failures \ + --per-parser-timeout "$TIMEOUT" \ + --out tests/benchmarks/reports/retrieval \ + "${SAMPLE_ARG[@]}" + +echo "→ Appending to history.jsonl ..." +python scripts/append_bench_history.py + +echo +python scripts/triage_recall.py tests/benchmarks/reports/retrieval diff --git a/scripts/run_enterprise_metrics.py b/scripts/run_enterprise_metrics.py index 1d6821a..6366fbd 100644 --- a/scripts/run_enterprise_metrics.py +++ b/scripts/run_enterprise_metrics.py @@ -7,9 +7,8 @@ import json -from xlsx_parser import parse_workbook - from scripts.generate_enterprise_fixtures import generate_all +from xlsx_parser import parse_workbook class EnterpriseScorecard: diff --git a/scripts/track_corpus_metrics.py b/scripts/track_corpus_metrics.py index 779995b..a0cb244 100644 --- a/scripts/track_corpus_metrics.py +++ b/scripts/track_corpus_metrics.py @@ -6,7 +6,6 @@ from datetime import datetime from pathlib import Path - ROOT = Path(__file__).resolve().parent.parent METRICS_DIR = ROOT / "metrics" / "corpus" diff --git a/scripts/triage_recall.py b/scripts/triage_recall.py new file mode 100755 index 0000000..41a766a --- /dev/null +++ b/scripts/triage_recall.py @@ -0,0 +1,114 @@ +#!/usr/bin/env python3 +"""Triage retrieval-recall failures into a ranked, actionable worklist. + +Reads a ``failures.ndjson`` produced by +``eval_retrieval.py --emit-failures`` and prints: + + 1. A histogram of failure buckets (biggest = highest leverage). + 2. For each bucket, a few concrete example failures with the + ground-truth answer and the top ranked chunks the parser produced. + +The point: turn "recall@5 is 0.70" into "N misses are answer_absent_from_chunks +— the parser is dropping cells; here are 5 examples to reproduce." + +Usage: + python scripts/triage_recall.py tests/benchmarks/reports/retrieval//failures.ndjson + python scripts/triage_recall.py # finds the latest run + python scripts/triage_recall.py --bucket answer_absent_from_chunks --examples 10 +""" +from __future__ import annotations + +import argparse +import json +import sys +from collections import Counter +from pathlib import Path + +# Ordered worst→least-actionable. Used to sort the worklist. +BUCKET_GUIDANCE = { + "parse_error": "Parser raised on the file. Reproduce with parse_workbook(path) and fix the crash.", + "no_chunks": "Parser produced zero chunks. Sheet/region detection collapsed — check chunking/segmenter.py.", + "answer_absent_from_chunks": "Answer value is in NO chunk. EXTRACTION gap — the cell was dropped or garbled. Highest leverage.", + "wrong_sheet": "Answer sheet was never chunked. Sheet enumeration bug — check workbook_parser.py sheet loop.", + "geometric_no_overlap": "No chunk's A1 range overlaps ground truth. RANGE bookkeeping drift in merge/split.", + "present_but_ranked_low": "A chunk DOES contain the answer but ranked >5. Not a parser bug — fix chunk granularity/embedding.", + "unparseable_ground_truth": "Could not parse the dataset's answer_position. Benchmark-harness issue, not the parser.", +} + + +def find_failures_file(arg: Path) -> Path: + if arg.is_file(): + return arg + if arg.is_dir(): + direct = arg / "failures.ndjson" + if direct.exists(): + return direct + runs = sorted(p for p in arg.glob("*/failures.ndjson")) + if runs: + return runs[-1] + sys.exit(f"ERROR: no failures.ndjson found at {arg}") + + +def load(path: Path) -> list[dict]: + rows = [] + for line in path.read_text().splitlines(): + line = line.strip() + if line: + rows.append(json.loads(line)) + return rows + + +def main(argv: list[str] | None = None) -> int: + ap = argparse.ArgumentParser(description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter) + ap.add_argument("path", type=Path, + help="failures.ndjson, or a dir containing/parenting one") + ap.add_argument("--bucket", help="only show examples for this bucket") + ap.add_argument("--examples", type=int, default=3, + help="example failures to print per bucket (default 3)") + args = ap.parse_args(argv) + + path = find_failures_file(args.path) + rows = load(path) + if not rows: + print(f"{path} is empty — no failures recorded (or run was a no-op).") + return 0 + + print(f"# Recall failure triage — {path}") + print(f"# {len(rows)} total failure rows\n") + + hist = Counter(r.get("failure_bucket") for r in rows) + print("## Bucket histogram (ranked by count — fix the top one first)\n") + width = max(len(b or "?") for b in hist) + for bucket, count in hist.most_common(): + pct = 100.0 * count / len(rows) + print(f" {str(bucket):<{width}} {count:5d} ({pct:5.1f}%)") + print(f" {'':<{width}} → {BUCKET_GUIDANCE.get(bucket, '')}") + print() + + buckets = [args.bucket] if args.bucket else [b for b, _ in hist.most_common()] + for bucket in buckets: + examples = [r for r in rows if r.get("failure_bucket") == bucket][:args.examples] + if not examples: + continue + print(f"## Examples — {bucket}\n") + for r in examples: + print(f" instance {r.get('instance_id')} ({r.get('parser')})") + print(f" Q: {(r.get('instruction') or '')[:160]}") + print(f" answer_position: {r.get('answer_position')}") + print(f" ground-truth values: {r.get('answer_values')}") + print(f" n_chunks={r.get('n_chunks')} " + f"rank_of_text_match={r.get('rank_of_text_match')}") + if r.get("error"): + print(f" ERROR: {r['error']}") + for c in (r.get("top_chunks") or [])[:4]: + mark = "✓" if c.get("contains_answer") else " " + snippet = (c.get("text") or "").replace("\n", " ")[:120] + print(f" [{mark}] #{c.get('rank')} {c.get('sheet')} " + f"{c.get('range')} {snippet}") + print() + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/verify_wheel.py b/scripts/verify_wheel.py new file mode 100755 index 0000000..26801f3 --- /dev/null +++ b/scripts/verify_wheel.py @@ -0,0 +1,81 @@ +#!/usr/bin/env python3 +"""Verify the built wheel is installable and importable in a clean venv. + +This is the regression guard for the v0.2.0 packaging bug: ``pipeline.py`` +and ``api.py`` were top-level modules under ``src/`` and ``setuptools`` +``packages.find`` only picks up *packages*, so they were silently dropped +from the wheel — ``from ks_xlsx_parser.pipeline import ...`` failed for +every installed user. The flat layout also leaked 13 generic top-level +packages (``models``, ``utils``, ``parsers`` ...) into ``site-packages``. + +Run after ``python -m build --wheel``. Exits non-zero on any problem so it +can gate CI and ``make wheel-check``. +""" +from __future__ import annotations + +import subprocess +import sys +import tempfile +import venv +import zipfile +from pathlib import Path + +ROOT = Path(__file__).resolve().parent.parent + +# Imports a real downstream consumer relies on. Keep in sync with the +# public surface in ks_xlsx_parser/__init__.py. +SMOKE_IMPORTS = [ + "from ks_xlsx_parser import parse_workbook, ParseResult", + "from ks_xlsx_parser.pipeline import parse_workbook", + "from ks_xlsx_parser.verification import StageVerifier", + "from ks_xlsx_parser.analysis.table_assembler import TableAssembler", + "from ks_xlsx_parser.models.workbook import WorkbookDTO", +] + + +def find_wheel() -> Path: + wheels = sorted((ROOT / "dist").glob("*.whl")) + if not wheels: + sys.exit("ERROR: no wheel in dist/ — run `python -m build --wheel` first") + return wheels[-1] + + +def check_wheel_contents(wheel: Path) -> None: + """Fail loudly if the wheel pollutes the global namespace or drops modules.""" + with zipfile.ZipFile(wheel) as zf: + names = zf.namelist() + top_level = next((n for n in names if n.endswith("top_level.txt")), None) + if top_level: + packages = zf.read(top_level).decode().split() + if packages != ["ks_xlsx_parser"]: + sys.exit( + f"ERROR: wheel exposes top-level packages {packages}; " + "expected only ['ks_xlsx_parser']. The flat src/ layout leaked." + ) + required = ["ks_xlsx_parser/pipeline.py", "ks_xlsx_parser/api.py"] + for req in required: + if not any(n == req for n in names): + sys.exit(f"ERROR: wheel is missing {req}") + print(f"wheel contents OK ({len(names)} entries, top-level: ks_xlsx_parser)") + + +def check_install_and_import(wheel: Path) -> None: + with tempfile.TemporaryDirectory() as tmp: + env_dir = Path(tmp) / "venv" + venv.create(env_dir, with_pip=True) + py = env_dir / ("Scripts" if sys.platform == "win32" else "bin") / "python" + subprocess.run([str(py), "-m", "pip", "install", "-q", str(wheel)], check=True) + script = "; ".join(SMOKE_IMPORTS) + "; print('clean-venv import OK')" + subprocess.run([str(py), "-c", script], check=True) + + +def main() -> None: + wheel = find_wheel() + print(f"verifying {wheel.name}") + check_wheel_contents(wheel) + check_install_and_import(wheel) + print("wheel verification PASSED") + + +if __name__ == "__main__": + main() diff --git a/src/ks_xlsx_parser/__init__.py b/src/ks_xlsx_parser/__init__.py index 184452b..cabcd96 100644 --- a/src/ks_xlsx_parser/__init__.py +++ b/src/ks_xlsx_parser/__init__.py @@ -1,25 +1,27 @@ """ ks_xlsx_parser — public API entry point for the ks-xlsx-parser package. -The source tree is flat: top-level modules at ``src/`` (``pipeline``, +All modules live under the ``ks_xlsx_parser`` package (``pipeline``, ``models``, ``analysis``, ``verification``, etc.). This module re-exports the stable, user-facing names so callers can do:: from ks_xlsx_parser import parse_workbook, ParseResult -regardless of internal layout. +and submodules remain importable directly:: + + from ks_xlsx_parser.pipeline import parse_workbook """ from __future__ import annotations __version__ = "0.2.0" -from pipeline import ( # noqa: F401 +from .pipeline import ( # noqa: F401 ParseResult, compare_workbooks, export_importer, parse_workbook, ) -from verification import ( # noqa: F401 +from .verification import ( # noqa: F401 ExcellentStage, StageVerifier, VerificationReport, diff --git a/src/analysis/__init__.py b/src/ks_xlsx_parser/analysis/__init__.py similarity index 100% rename from src/analysis/__init__.py rename to src/ks_xlsx_parser/analysis/__init__.py diff --git a/src/analysis/light_block_detector.py b/src/ks_xlsx_parser/analysis/light_block_detector.py similarity index 96% rename from src/analysis/light_block_detector.py rename to src/ks_xlsx_parser/analysis/light_block_detector.py index 9816cd0..7fc2d11 100644 --- a/src/analysis/light_block_detector.py +++ b/src/ks_xlsx_parser/analysis/light_block_detector.py @@ -10,9 +10,9 @@ import logging -from models.block import BlockDTO -from models.common import BlockType, CellCoord, CellRange -from models.table_structure import TableRegion, TableRegionRole, TableStructure +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange +from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure logger = logging.getLogger(__name__) diff --git a/src/analysis/llm_artifacts.py b/src/ks_xlsx_parser/analysis/llm_artifacts.py similarity index 98% rename from src/analysis/llm_artifacts.py rename to src/ks_xlsx_parser/analysis/llm_artifacts.py index 01d0777..2122c45 100644 --- a/src/analysis/llm_artifacts.py +++ b/src/ks_xlsx_parser/analysis/llm_artifacts.py @@ -11,24 +11,22 @@ from __future__ import annotations import logging -import re from collections import Counter from pydantic import Field -from models.block import BlockDTO -from models.chart import ChartDTO -from models.common import ( - BlockType, +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.chart import ChartDTO +from ks_xlsx_parser.models.common import ( CellCoord, SheetPurpose, StableModel, compute_hash, ) -from models.dependency import DependencyGraph -from models.sheet import SheetDTO -from models.table import TableDTO -from models.workbook import KpiDTO, SheetSummaryDTO +from ks_xlsx_parser.models.dependency import DependencyGraph +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table import TableDTO +from ks_xlsx_parser.models.workbook import KpiDTO, SheetSummaryDTO logger = logging.getLogger(__name__) diff --git a/src/analysis/pattern_splitter.py b/src/ks_xlsx_parser/analysis/pattern_splitter.py similarity index 96% rename from src/analysis/pattern_splitter.py rename to src/ks_xlsx_parser/analysis/pattern_splitter.py index 92ac82f..e065801 100644 --- a/src/analysis/pattern_splitter.py +++ b/src/ks_xlsx_parser/analysis/pattern_splitter.py @@ -10,12 +10,11 @@ import logging import math -from collections import defaultdict -from models.block import BlockDTO -from models.common import BlockType, CellAnnotation, CellCoord, CellRange -from models.sheet import SheetDTO -from models.table_structure import TableRegion, TableRegionRole, TableStructure +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellAnnotation, CellCoord, CellRange +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure logger = logging.getLogger(__name__) diff --git a/src/analysis/table_assembler.py b/src/ks_xlsx_parser/analysis/table_assembler.py similarity index 96% rename from src/analysis/table_assembler.py rename to src/ks_xlsx_parser/analysis/table_assembler.py index 279603d..28bc3a3 100644 --- a/src/analysis/table_assembler.py +++ b/src/ks_xlsx_parser/analysis/table_assembler.py @@ -9,10 +9,10 @@ import logging -from models.block import BlockDTO -from models.common import BlockType, CellCoord, CellRange, compute_hash -from models.sheet import SheetDTO -from models.table_structure import TableRegion, TableRegionRole, TableStructure +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure logger = logging.getLogger(__name__) diff --git a/src/analysis/table_grouper.py b/src/ks_xlsx_parser/analysis/table_grouper.py similarity index 97% rename from src/analysis/table_grouper.py rename to src/ks_xlsx_parser/analysis/table_grouper.py index 94fe862..597da57 100644 --- a/src/analysis/table_grouper.py +++ b/src/ks_xlsx_parser/analysis/table_grouper.py @@ -11,10 +11,10 @@ import logging from collections import defaultdict -from models.block import BlockDTO -from models.common import BlockType, CellCoord, CellRange, compute_hash -from models.sheet import SheetDTO -from models.table_structure import TableStructure +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table_structure import TableStructure logger = logging.getLogger(__name__) diff --git a/src/analysis/template_extractor.py b/src/ks_xlsx_parser/analysis/template_extractor.py similarity index 95% rename from src/analysis/template_extractor.py rename to src/ks_xlsx_parser/analysis/template_extractor.py index 27a8025..54dedd9 100644 --- a/src/analysis/template_extractor.py +++ b/src/ks_xlsx_parser/analysis/template_extractor.py @@ -10,15 +10,15 @@ import logging -from models.common import CellAnnotation, CellCoord -from models.sheet import SheetDTO -from models.template import ( +from ks_xlsx_parser.models.common import CellAnnotation, CellCoord +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.template import ( DOFType, TemplateCellSpec, TemplateConstraint, TemplateNode, ) -from models.tree import TreeNode, TreeNodeType +from ks_xlsx_parser.models.tree import TreeNode, TreeNodeType logger = logging.getLogger(__name__) diff --git a/src/analysis/tree_builder.py b/src/ks_xlsx_parser/analysis/tree_builder.py similarity index 96% rename from src/analysis/tree_builder.py rename to src/ks_xlsx_parser/analysis/tree_builder.py index 0e5192f..22a2ed1 100644 --- a/src/analysis/tree_builder.py +++ b/src/ks_xlsx_parser/analysis/tree_builder.py @@ -10,11 +10,10 @@ import logging -from models.block import BlockDTO -from models.common import CellCoord, CellRange, compute_hash -from models.sheet import SheetDTO -from models.table_structure import TableStructure -from models.tree import TreeNode, TreeNodeType +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table_structure import TableStructure +from ks_xlsx_parser.models.tree import TreeNode, TreeNodeType logger = logging.getLogger(__name__) diff --git a/src/annotation/__init__.py b/src/ks_xlsx_parser/annotation/__init__.py similarity index 100% rename from src/annotation/__init__.py rename to src/ks_xlsx_parser/annotation/__init__.py diff --git a/src/annotation/block_splitter.py b/src/ks_xlsx_parser/annotation/block_splitter.py similarity index 97% rename from src/annotation/block_splitter.py rename to src/ks_xlsx_parser/annotation/block_splitter.py index 0ec332f..2d45209 100644 --- a/src/annotation/block_splitter.py +++ b/src/ks_xlsx_parser/annotation/block_splitter.py @@ -8,11 +8,10 @@ from __future__ import annotations import logging -from collections import defaultdict -from models.block import BlockDTO -from models.common import BlockType, CellAnnotation, CellCoord, CellRange -from models.sheet import SheetDTO +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellAnnotation, CellCoord, CellRange +from ks_xlsx_parser.models.sheet import SheetDTO logger = logging.getLogger(__name__) diff --git a/src/annotation/cell_annotator.py b/src/ks_xlsx_parser/annotation/cell_annotator.py similarity index 98% rename from src/annotation/cell_annotator.py rename to src/ks_xlsx_parser/annotation/cell_annotator.py index 5f80bdc..af82bac 100644 --- a/src/annotation/cell_annotator.py +++ b/src/ks_xlsx_parser/annotation/cell_annotator.py @@ -11,9 +11,9 @@ import logging from collections import defaultdict -from models.cell import CellDTO -from models.common import CellAnnotation -from models.sheet import SheetDTO +from ks_xlsx_parser.models.cell import CellDTO +from ks_xlsx_parser.models.common import CellAnnotation +from ks_xlsx_parser.models.sheet import SheetDTO logger = logging.getLogger(__name__) diff --git a/src/api.py b/src/ks_xlsx_parser/api.py similarity index 97% rename from src/api.py rename to src/ks_xlsx_parser/api.py index 36469c5..82a367d 100644 --- a/src/api.py +++ b/src/ks_xlsx_parser/api.py @@ -2,7 +2,7 @@ FastAPI application for uploading and parsing Excel files. Run with: - uvicorn xlsx_parser.api:app --reload --port 8080 + uvicorn ks_xlsx_parser.api:app --reload --port 8080 Or use the xlsx-parser-api script (runs on port 8080 by default): xlsx-parser-api @@ -12,16 +12,11 @@ from __future__ import annotations -import json -import time -import traceback -from pathlib import Path - from fastapi import FastAPI, File, UploadFile from fastapi.responses import HTMLResponse, JSONResponse -from pipeline import parse_workbook -from verification import StageVerifier +from ks_xlsx_parser.pipeline import parse_workbook +from ks_xlsx_parser.verification import StageVerifier app = FastAPI( title="ks-xlsx-parser API", diff --git a/src/charts/__init__.py b/src/ks_xlsx_parser/charts/__init__.py similarity index 100% rename from src/charts/__init__.py rename to src/ks_xlsx_parser/charts/__init__.py diff --git a/src/charts/chart_extractor.py b/src/ks_xlsx_parser/charts/chart_extractor.py similarity index 99% rename from src/charts/chart_extractor.py rename to src/ks_xlsx_parser/charts/chart_extractor.py index 4c4cdca..017feb9 100644 --- a/src/charts/chart_extractor.py +++ b/src/ks_xlsx_parser/charts/chart_extractor.py @@ -21,8 +21,8 @@ from pathlib import Path from xml.etree import ElementTree as ET -from models.chart import ChartAnchor, ChartAxis, ChartDTO, ChartSeries -from models.common import ChartType +from ks_xlsx_parser.models.chart import ChartAnchor, ChartAxis, ChartDTO, ChartSeries +from ks_xlsx_parser.models.common import ChartType logger = logging.getLogger(__name__) diff --git a/src/chunking/__init__.py b/src/ks_xlsx_parser/chunking/__init__.py similarity index 100% rename from src/chunking/__init__.py rename to src/ks_xlsx_parser/chunking/__init__.py diff --git a/src/chunking/chunker.py b/src/ks_xlsx_parser/chunking/chunker.py similarity index 95% rename from src/chunking/chunker.py rename to src/ks_xlsx_parser/chunking/chunker.py index bb4d4f9..7057e78 100644 --- a/src/chunking/chunker.py +++ b/src/ks_xlsx_parser/chunking/chunker.py @@ -13,14 +13,12 @@ import logging -from models.block import BlockDTO, ChunkDTO, DependencySummary -from models.common import CellCoord, EdgeType -from models.dependency import DependencyGraph -from models.sheet import SheetDTO -from models.table import TableDTO -from models.workbook import NamedRangeDTO, WorkbookDTO -from rendering.html_renderer import HtmlRenderer -from rendering.text_renderer import TextRenderer +from ks_xlsx_parser.models.block import BlockDTO, ChunkDTO, DependencySummary +from ks_xlsx_parser.models.common import CellCoord, EdgeType +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.workbook import WorkbookDTO +from ks_xlsx_parser.rendering.html_renderer import HtmlRenderer +from ks_xlsx_parser.rendering.text_renderer import TextRenderer logger = logging.getLogger(__name__) @@ -179,7 +177,7 @@ def _chart_to_chunk(self, chart) -> ChunkDTO: top_left = "A1" bottom_right = "A1" if chart.anchor: - from models.common import col_number_to_letter + from ks_xlsx_parser.models.common import col_number_to_letter top_left = f"{col_number_to_letter(chart.anchor.from_col + 1)}{chart.anchor.from_row + 1}" if chart.anchor.to_col is not None and chart.anchor.to_row is not None: bottom_right = f"{col_number_to_letter(chart.anchor.to_col + 1)}{chart.anchor.to_row + 1}" diff --git a/src/chunking/segmenter.py b/src/ks_xlsx_parser/chunking/segmenter.py similarity index 98% rename from src/chunking/segmenter.py rename to src/ks_xlsx_parser/chunking/segmenter.py index 1b7898a..bde7b2a 100644 --- a/src/chunking/segmenter.py +++ b/src/ks_xlsx_parser/chunking/segmenter.py @@ -22,10 +22,10 @@ import logging -from models.block import BlockDTO -from models.common import BlockType, CellCoord, CellRange -from models.sheet import SheetDTO -from models.table import TableDTO +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table import TableDTO logger = logging.getLogger(__name__) diff --git a/src/comparison/__init__.py b/src/ks_xlsx_parser/comparison/__init__.py similarity index 100% rename from src/comparison/__init__.py rename to src/ks_xlsx_parser/comparison/__init__.py diff --git a/src/comparison/template_comparator.py b/src/ks_xlsx_parser/comparison/template_comparator.py similarity index 99% rename from src/comparison/template_comparator.py rename to src/ks_xlsx_parser/comparison/template_comparator.py index e7e043e..0c4fd64 100644 --- a/src/comparison/template_comparator.py +++ b/src/ks_xlsx_parser/comparison/template_comparator.py @@ -11,8 +11,8 @@ import logging from collections import defaultdict -from models.common import CellCoord, compute_hash -from models.template import ( +from ks_xlsx_parser.models.common import CellCoord +from ks_xlsx_parser.models.template import ( DOFConflict, DOFType, GeneralizedTemplate, diff --git a/src/export/__init__.py b/src/ks_xlsx_parser/export/__init__.py similarity index 100% rename from src/export/__init__.py rename to src/ks_xlsx_parser/export/__init__.py diff --git a/src/export/model_exporter.py b/src/ks_xlsx_parser/export/model_exporter.py similarity index 98% rename from src/export/model_exporter.py rename to src/ks_xlsx_parser/export/model_exporter.py index d47dc7d..0784aac 100644 --- a/src/export/model_exporter.py +++ b/src/ks_xlsx_parser/export/model_exporter.py @@ -12,8 +12,8 @@ from pathlib import Path from typing import Any -from models.common import CellCoord -from models.template import DOFType, GeneralizedTemplate, TemplateNode +from ks_xlsx_parser.models.common import CellCoord +from ks_xlsx_parser.models.template import DOFType, GeneralizedTemplate logger = logging.getLogger(__name__) @@ -205,7 +205,7 @@ def export_code( Constants: {template.total_constants} | DOFs: {template.total_dofs} | Formulas: {template.total_formulas} """ - from xlsx_parser.export.model_exporter import SpreadsheetImporter + from ks_xlsx_parser.export.model_exporter import SpreadsheetImporter from xlsx_parser import parse_workbook diff --git a/src/formula/__init__.py b/src/ks_xlsx_parser/formula/__init__.py similarity index 100% rename from src/formula/__init__.py rename to src/ks_xlsx_parser/formula/__init__.py diff --git a/src/formula/dependency_builder.py b/src/ks_xlsx_parser/formula/dependency_builder.py similarity index 93% rename from src/formula/dependency_builder.py rename to src/ks_xlsx_parser/formula/dependency_builder.py index 8bd85b0..19785af 100644 --- a/src/formula/dependency_builder.py +++ b/src/ks_xlsx_parser/formula/dependency_builder.py @@ -11,13 +11,14 @@ import logging from typing import TYPE_CHECKING -from models.common import CellCoord, EdgeType -from models.dependency import DependencyEdgeDTO, DependencyGraph +from ks_xlsx_parser.models.common import CellCoord, EdgeType +from ks_xlsx_parser.models.dependency import DependencyEdgeDTO, DependencyGraph + from .formula_parser import FormulaParser if TYPE_CHECKING: - from models.sheet import SheetDTO - from models.workbook import NamedRangeDTO + from ks_xlsx_parser.models.sheet import SheetDTO + from ks_xlsx_parser.models.workbook import NamedRangeDTO logger = logging.getLogger(__name__) diff --git a/src/formula/formula_parser.py b/src/ks_xlsx_parser/formula/formula_parser.py similarity index 98% rename from src/formula/formula_parser.py rename to src/ks_xlsx_parser/formula/formula_parser.py index f850b91..941f7a4 100644 --- a/src/formula/formula_parser.py +++ b/src/ks_xlsx_parser/formula/formula_parser.py @@ -10,9 +10,9 @@ from __future__ import annotations import re -from dataclasses import dataclass, field +from dataclasses import dataclass -from models.common import CellCoord, CellRange, col_letter_to_number +from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number # Optional Rust fast-path: ks_xlsx_core.scan_formula returns reference tuples # in the same emit order as the Python regex pipeline. When available, we diff --git a/src/models/__init__.py b/src/ks_xlsx_parser/models/__init__.py similarity index 94% rename from src/models/__init__.py rename to src/ks_xlsx_parser/models/__init__.py index 42814ea..5c91780 100644 --- a/src/models/__init__.py +++ b/src/ks_xlsx_parser/models/__init__.py @@ -1,8 +1,8 @@ """ -Data models (DTOs) for the xlsx_parser. +Data models (DTOs) for the ks_xlsx_parser. Re-exports all model classes for convenient importing: - from xlsx_parser.models import WorkbookDTO, SheetDTO, CellDTO, ... + from ks_xlsx_parser.models import WorkbookDTO, SheetDTO, CellDTO, ... """ from .block import BlockDTO, ChunkDTO, DependencySummary diff --git a/src/models/block.py b/src/ks_xlsx_parser/models/block.py similarity index 100% rename from src/models/block.py rename to src/ks_xlsx_parser/models/block.py diff --git a/src/models/cell.py b/src/ks_xlsx_parser/models/cell.py similarity index 100% rename from src/models/cell.py rename to src/ks_xlsx_parser/models/cell.py diff --git a/src/models/chart.py b/src/ks_xlsx_parser/models/chart.py similarity index 100% rename from src/models/chart.py rename to src/ks_xlsx_parser/models/chart.py diff --git a/src/models/common.py b/src/ks_xlsx_parser/models/common.py similarity index 100% rename from src/models/common.py rename to src/ks_xlsx_parser/models/common.py diff --git a/src/models/dependency.py b/src/ks_xlsx_parser/models/dependency.py similarity index 100% rename from src/models/dependency.py rename to src/ks_xlsx_parser/models/dependency.py diff --git a/src/models/shape.py b/src/ks_xlsx_parser/models/shape.py similarity index 97% rename from src/models/shape.py rename to src/ks_xlsx_parser/models/shape.py index 178ebaf..a0f94f9 100644 --- a/src/models/shape.py +++ b/src/ks_xlsx_parser/models/shape.py @@ -9,7 +9,7 @@ from pydantic import Field -from .common import CellRange, StableModel, compute_hash +from .common import StableModel, compute_hash class ShapeAnchor(StableModel): diff --git a/src/models/sheet.py b/src/ks_xlsx_parser/models/sheet.py similarity index 99% rename from src/models/sheet.py rename to src/ks_xlsx_parser/models/sheet.py index bfc2758..68c8cb1 100644 --- a/src/models/sheet.py +++ b/src/ks_xlsx_parser/models/sheet.py @@ -8,8 +8,6 @@ from __future__ import annotations -from typing import Any - from pydantic import Field from .cell import CellDTO diff --git a/src/models/table.py b/src/ks_xlsx_parser/models/table.py similarity index 100% rename from src/models/table.py rename to src/ks_xlsx_parser/models/table.py diff --git a/src/models/table_structure.py b/src/ks_xlsx_parser/models/table_structure.py similarity index 100% rename from src/models/table_structure.py rename to src/ks_xlsx_parser/models/table_structure.py diff --git a/src/models/template.py b/src/ks_xlsx_parser/models/template.py similarity index 100% rename from src/models/template.py rename to src/ks_xlsx_parser/models/template.py diff --git a/src/models/tree.py b/src/ks_xlsx_parser/models/tree.py similarity index 100% rename from src/models/tree.py rename to src/ks_xlsx_parser/models/tree.py diff --git a/src/models/workbook.py b/src/ks_xlsx_parser/models/workbook.py similarity index 99% rename from src/models/workbook.py rename to src/ks_xlsx_parser/models/workbook.py index 772c934..39a428b 100644 --- a/src/models/workbook.py +++ b/src/ks_xlsx_parser/models/workbook.py @@ -32,7 +32,7 @@ from .sheet import SheetDTO from .table import TableDTO from .table_structure import TableStructure -from .template import GeneralizedTemplate, TemplateNode +from .template import TemplateNode from .tree import TreeNode diff --git a/src/parsers/__init__.py b/src/ks_xlsx_parser/parsers/__init__.py similarity index 100% rename from src/parsers/__init__.py rename to src/ks_xlsx_parser/parsers/__init__.py diff --git a/src/parsers/calamine_core.py b/src/ks_xlsx_parser/parsers/calamine_core.py similarity index 99% rename from src/parsers/calamine_core.py rename to src/ks_xlsx_parser/parsers/calamine_core.py index 830df5d..e6ac959 100644 --- a/src/parsers/calamine_core.py +++ b/src/ks_xlsx_parser/parsers/calamine_core.py @@ -17,7 +17,6 @@ """ from __future__ import annotations -import io import logging import os import tempfile diff --git a/src/parsers/cell_parser.py b/src/ks_xlsx_parser/parsers/cell_parser.py similarity index 99% rename from src/parsers/cell_parser.py rename to src/ks_xlsx_parser/parsers/cell_parser.py index cc22092..bf1857c 100644 --- a/src/parsers/cell_parser.py +++ b/src/ks_xlsx_parser/parsers/cell_parser.py @@ -17,7 +17,7 @@ from openpyxl.cell.cell import Cell as OpenpyxlCell from openpyxl.cell.cell import MergedCell as OpenpyxlMergedCell -from models.cell import ( +from ks_xlsx_parser.models.cell import ( AlignmentStyle, BorderSide, BorderStyle, @@ -26,7 +26,7 @@ FillStyle, FontStyle, ) -from models.common import CellCoord, RichTextRun +from ks_xlsx_parser.models.common import CellCoord, RichTextRun logger = logging.getLogger(__name__) diff --git a/src/parsers/sheet_parser.py b/src/ks_xlsx_parser/parsers/sheet_parser.py similarity index 98% rename from src/parsers/sheet_parser.py rename to src/ks_xlsx_parser/parsers/sheet_parser.py index 5883a92..d49a28d 100644 --- a/src/parsers/sheet_parser.py +++ b/src/ks_xlsx_parser/parsers/sheet_parser.py @@ -18,21 +18,22 @@ from lxml import etree from openpyxl.worksheet.worksheet import Worksheet as OpenpyxlWorksheet -from models.cell import CellDTO -from models.common import CellCoord, CellRange, ParseError, Severity, col_letter_to_number +from ks_xlsx_parser.models.cell import CellDTO +from ks_xlsx_parser.models.common import CellCoord, CellRange, ParseError, Severity, col_letter_to_number # OOXML namespace for spreadsheetML _OOXML_NS = {"s": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"} # Regex to split an A1-style cell reference like "B1" into ("B", "1") _CELL_REF_RE = re.compile(r"^([A-Z]+)(\d+)$") -from models.sheet import ( +from ks_xlsx_parser.models.sheet import ( ConditionalFormatRule, DataValidationRule, MergedRegion, SheetDTO, SheetProperties, ) + from .cell_parser import CellParser logger = logging.getLogger(__name__) @@ -282,7 +283,7 @@ def _extract_dimensions(self, sheet: SheetDTO) -> None: for col_letter, cd in self._ws.column_dimensions.items(): if cd.width: - from models.common import col_letter_to_number + from ks_xlsx_parser.models.common import col_letter_to_number col_num = col_letter_to_number(col_letter) sheet.col_widths[col_num] = cd.width * 7.5 # chars to points approx @@ -294,7 +295,7 @@ def _extract_hidden(self, sheet: SheetDTO) -> None: for col_letter, cd in self._ws.column_dimensions.items(): if cd.hidden: - from models.common import col_letter_to_number + from ks_xlsx_parser.models.common import col_letter_to_number col_num = col_letter_to_number(col_letter) sheet.hidden_cols.add(col_num) @@ -347,11 +348,10 @@ def _extract_autofilter(self, sheet: SheetDTO) -> None: if not af or not af.ref: return try: - from models.common import FilterCriteria + from ks_xlsx_parser.models.common import FilterCriteria ref = str(af.ref) parts = ref.split(":") if len(parts) == 2: - from models.common import col_letter_to_number as c2n start_match = _CELL_REF_RE.match(parts[0]) end_match = _CELL_REF_RE.match(parts[1]) if start_match and end_match: diff --git a/src/parsers/table_parser.py b/src/ks_xlsx_parser/parsers/table_parser.py similarity index 97% rename from src/parsers/table_parser.py rename to src/ks_xlsx_parser/parsers/table_parser.py index d2eb2eb..bdc0d55 100644 --- a/src/parsers/table_parser.py +++ b/src/ks_xlsx_parser/parsers/table_parser.py @@ -12,8 +12,8 @@ from openpyxl.worksheet.worksheet import Worksheet as OpenpyxlWorksheet -from models.common import CellCoord, CellRange, col_letter_to_number -from models.table import TableColumn, TableDTO +from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number +from ks_xlsx_parser.models.table import TableColumn, TableDTO logger = logging.getLogger(__name__) diff --git a/src/parsers/workbook_parser.py b/src/ks_xlsx_parser/parsers/workbook_parser.py similarity index 97% rename from src/parsers/workbook_parser.py rename to src/ks_xlsx_parser/parsers/workbook_parser.py index bee0452..57e0c93 100644 --- a/src/parsers/workbook_parser.py +++ b/src/ks_xlsx_parser/parsers/workbook_parser.py @@ -11,7 +11,6 @@ import io import logging import time -from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor from pathlib import Path import xxhash @@ -21,7 +20,7 @@ # both values AND formulas in one read; python-calamine is kept as a fallback # for environments where the Rust crate hasn't been built. Both are bypassed # cleanly if absent — correctness always falls back to openpyxl. -from parsers import calamine_core as _calamine_core +from ks_xlsx_parser.parsers import calamine_core as _calamine_core try: from python_calamine import CalamineWorkbook @@ -30,14 +29,14 @@ CalamineWorkbook = None # type: ignore[assignment] _HAS_PYCALAMINE = False -from models.common import ParseError, Severity -from models.common import CalculationMode, DateSystem -from models.workbook import ( +from ks_xlsx_parser.models.common import CalculationMode, DateSystem, ParseError, Severity +from ks_xlsx_parser.models.workbook import ( ExternalLink, NamedRangeDTO, WorkbookDTO, WorkbookProperties, ) + from .sheet_parser import SheetParser from .table_parser import TableParser @@ -234,7 +233,7 @@ def parse(self) -> WorkbookDTO: # Extract charts via OOXML parsing try: - from charts.chart_extractor import ChartExtractor + from ks_xlsx_parser.charts.chart_extractor import ChartExtractor chart_extractor = ChartExtractor( self._path or self._content, wb_formula.sheetnames ) @@ -253,7 +252,7 @@ def parse(self) -> WorkbookDTO: # accounts for ~25% of the full-mode wall clock). if self._build_dep_graph: try: - from formula.dependency_builder import DependencyBuilder + from ks_xlsx_parser.formula.dependency_builder import DependencyBuilder dep_builder = DependencyBuilder(result.sheets, result.named_ranges) result.dependency_graph = dep_builder.build() except Exception as e: diff --git a/src/pipeline.py b/src/ks_xlsx_parser/pipeline.py similarity index 90% rename from src/pipeline.py rename to src/ks_xlsx_parser/pipeline.py index 9b902c5..1451074 100644 --- a/src/pipeline.py +++ b/src/ks_xlsx_parser/pipeline.py @@ -1,7 +1,7 @@ """ End-to-end parsing pipeline. -This is the primary public API for the xlsx_parser. It orchestrates +This is the primary public API for the ks_xlsx_parser. It orchestrates the full 11-stage Excellent Algorithm: Stage 0: Sheet Chunking (adaptive gaps + style boundaries) Stage 1: Cell Annotation (feature-based scorer) @@ -15,7 +15,7 @@ Render + Store Usage: - from xlsx_parser.pipeline import parse_workbook + from ks_xlsx_parser.pipeline import parse_workbook result = parse_workbook("path/to/workbook.xlsx") print(result.workbook.total_cells) for chunk in result.chunks: @@ -32,25 +32,25 @@ ParseMode = Literal["full", "fast"] -from analysis.light_block_detector import LightBlockDetector -from models.common import CellCoord, CellRange, col_letter_to_number -from analysis.pattern_splitter import PatternSplitter -from analysis.table_assembler import TableAssembler -from analysis.table_grouper import TableGrouper -from analysis.template_extractor import TemplateExtractor -from analysis.tree_builder import TreeBuilder -from annotation.block_splitter import BlockSplitter -from annotation.cell_annotator import CellAnnotator -from chunking.chunker import ChunkBuilder -from comparison.template_comparator import TemplateComparator -from export.model_exporter import ModelExporter -from models.block import ChunkDTO -from models.table_structure import TableStructure -from models.template import GeneralizedTemplate, TemplateNode -from models.tree import TreeNode -from models.workbook import WorkbookDTO -from parsers.workbook_parser import WorkbookParser -from storage.serializer import WorkbookSerializer +from ks_xlsx_parser.analysis.light_block_detector import LightBlockDetector +from ks_xlsx_parser.analysis.pattern_splitter import PatternSplitter +from ks_xlsx_parser.analysis.table_assembler import TableAssembler +from ks_xlsx_parser.analysis.table_grouper import TableGrouper +from ks_xlsx_parser.analysis.template_extractor import TemplateExtractor +from ks_xlsx_parser.analysis.tree_builder import TreeBuilder +from ks_xlsx_parser.annotation.block_splitter import BlockSplitter +from ks_xlsx_parser.annotation.cell_annotator import CellAnnotator +from ks_xlsx_parser.chunking.chunker import ChunkBuilder +from ks_xlsx_parser.comparison.template_comparator import TemplateComparator +from ks_xlsx_parser.export.model_exporter import ModelExporter +from ks_xlsx_parser.models.block import ChunkDTO +from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number +from ks_xlsx_parser.models.table_structure import TableStructure +from ks_xlsx_parser.models.template import GeneralizedTemplate, TemplateNode +from ks_xlsx_parser.models.tree import TreeNode +from ks_xlsx_parser.models.workbook import WorkbookDTO +from ks_xlsx_parser.parsers.workbook_parser import WorkbookParser +from ks_xlsx_parser.storage.serializer import WorkbookSerializer logger = logging.getLogger(__name__) @@ -261,7 +261,7 @@ def parse_workbook( annotator.annotate() # Stage 0+2: Segment then split by annotation - from chunking.segmenter import LayoutSegmenter + from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter sheet_tables = [t for t in workbook.tables if t.sheet_name == sheet.sheet_name] sheet_named = [ nr.name for nr in workbook.named_ranges diff --git a/src/py.typed b/src/ks_xlsx_parser/py.typed similarity index 100% rename from src/py.typed rename to src/ks_xlsx_parser/py.typed diff --git a/src/rendering/__init__.py b/src/ks_xlsx_parser/rendering/__init__.py similarity index 100% rename from src/rendering/__init__.py rename to src/ks_xlsx_parser/rendering/__init__.py diff --git a/src/rendering/html_renderer.py b/src/ks_xlsx_parser/rendering/html_renderer.py similarity index 96% rename from src/rendering/html_renderer.py rename to src/ks_xlsx_parser/rendering/html_renderer.py index c7c1a1c..50a8b21 100644 --- a/src/rendering/html_renderer.py +++ b/src/ks_xlsx_parser/rendering/html_renderer.py @@ -14,10 +14,10 @@ import html import logging -from models.block import BlockDTO -from models.cell import CellDTO, CellStyle -from models.common import BlockType, CellCoord, col_number_to_letter -from models.sheet import SheetDTO +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.cell import CellStyle +from ks_xlsx_parser.models.common import BlockType, col_number_to_letter +from ks_xlsx_parser.models.sheet import SheetDTO logger = logging.getLogger(__name__) diff --git a/src/rendering/text_renderer.py b/src/ks_xlsx_parser/rendering/text_renderer.py similarity index 97% rename from src/rendering/text_renderer.py rename to src/ks_xlsx_parser/rendering/text_renderer.py index 76b9b76..e2c15f4 100644 --- a/src/rendering/text_renderer.py +++ b/src/ks_xlsx_parser/rendering/text_renderer.py @@ -11,10 +11,10 @@ import datetime as _dt import logging -from models.block import BlockDTO -from models.common import BlockType, col_number_to_letter -from models.chart import ChartDTO -from models.sheet import SheetDTO +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.chart import ChartDTO +from ks_xlsx_parser.models.common import BlockType, col_number_to_letter +from ks_xlsx_parser.models.sheet import SheetDTO logger = logging.getLogger(__name__) diff --git a/src/storage/__init__.py b/src/ks_xlsx_parser/storage/__init__.py similarity index 100% rename from src/storage/__init__.py rename to src/ks_xlsx_parser/storage/__init__.py diff --git a/src/storage/serializer.py b/src/ks_xlsx_parser/storage/serializer.py similarity index 98% rename from src/storage/serializer.py rename to src/ks_xlsx_parser/storage/serializer.py index 5ced8f9..66add02 100644 --- a/src/storage/serializer.py +++ b/src/ks_xlsx_parser/storage/serializer.py @@ -15,8 +15,8 @@ import logging from typing import Any -from models.block import ChunkDTO -from models.workbook import WorkbookDTO +from ks_xlsx_parser.models.block import ChunkDTO +from ks_xlsx_parser.models.workbook import WorkbookDTO logger = logging.getLogger(__name__) diff --git a/src/utils/__init__.py b/src/ks_xlsx_parser/utils/__init__.py similarity index 100% rename from src/utils/__init__.py rename to src/ks_xlsx_parser/utils/__init__.py diff --git a/src/utils/logging_config.py b/src/ks_xlsx_parser/utils/logging_config.py similarity index 97% rename from src/utils/logging_config.py rename to src/ks_xlsx_parser/utils/logging_config.py index ca23108..94ceb54 100644 --- a/src/utils/logging_config.py +++ b/src/ks_xlsx_parser/utils/logging_config.py @@ -1,5 +1,5 @@ """ -Structured logging configuration for the xlsx_parser. +Structured logging configuration for the ks_xlsx_parser. Provides a JSON-structured logging format with workbook/sheet/block context fields for observability and debugging in production. diff --git a/src/verification/__init__.py b/src/ks_xlsx_parser/verification/__init__.py similarity index 100% rename from src/verification/__init__.py rename to src/ks_xlsx_parser/verification/__init__.py diff --git a/src/verification/stage_verifier.py b/src/ks_xlsx_parser/verification/stage_verifier.py similarity index 97% rename from src/verification/stage_verifier.py rename to src/ks_xlsx_parser/verification/stage_verifier.py index 3fb315b..88fb4f8 100644 --- a/src/verification/stage_verifier.py +++ b/src/ks_xlsx_parser/verification/stage_verifier.py @@ -16,23 +16,21 @@ from pydantic import Field -from analysis.light_block_detector import LightBlockDetector -from analysis.pattern_splitter import PatternSplitter -from analysis.table_assembler import TableAssembler -from analysis.table_grouper import TableGrouper -from analysis.template_extractor import TemplateExtractor -from analysis.tree_builder import TreeBuilder -from annotation.block_splitter import BlockSplitter -from annotation.cell_annotator import CellAnnotator -from chunking.segmenter import LayoutSegmenter -from comparison.template_comparator import TemplateComparator -from export.model_exporter import ModelExporter -from models.block import BlockDTO -from models.common import BlockType, CellAnnotation, StableModel -from models.table_structure import TableStructure -from models.template import TemplateNode -from models.tree import TreeNode -from parsers.workbook_parser import WorkbookParser +from ks_xlsx_parser.analysis.light_block_detector import LightBlockDetector +from ks_xlsx_parser.analysis.pattern_splitter import PatternSplitter +from ks_xlsx_parser.analysis.table_assembler import TableAssembler +from ks_xlsx_parser.analysis.table_grouper import TableGrouper +from ks_xlsx_parser.analysis.template_extractor import TemplateExtractor +from ks_xlsx_parser.analysis.tree_builder import TreeBuilder +from ks_xlsx_parser.annotation.block_splitter import BlockSplitter +from ks_xlsx_parser.annotation.cell_annotator import CellAnnotator +from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellAnnotation, StableModel +from ks_xlsx_parser.models.table_structure import TableStructure +from ks_xlsx_parser.models.template import TemplateNode +from ks_xlsx_parser.models.tree import TreeNode +from ks_xlsx_parser.parsers.workbook_parser import WorkbookParser logger = logging.getLogger(__name__) diff --git a/tests/benchmarks/_driver.py b/tests/benchmarks/_driver.py index a66acd5..94ee633 100644 --- a/tests/benchmarks/_driver.py +++ b/tests/benchmarks/_driver.py @@ -18,10 +18,8 @@ import subprocess import sys from collections import defaultdict -from dataclasses import asdict from datetime import UTC, datetime from pathlib import Path -from typing import Iterable from ._runner import Runner from ._schema import CSV_FIELDS, BenchmarkRecord, record_to_csv_row, validate_record @@ -344,7 +342,7 @@ def generate_drift(out_dir: Path) -> None: short = Path(fname).name lines.append(f"| {short} | {va} | {vb} | {va - vb:+d} |") if n_drift == 0: - lines.append(f"| _no drift_ | | | |") + lines.append("| _no drift_ | | | |") elif n_drift > 50: lines.append(f"| … {n_drift - 50} more rows truncated | | | |") lines.append("") diff --git a/tests/benchmarks/_runner.py b/tests/benchmarks/_runner.py index 2bdb95c..a5febc9 100644 --- a/tests/benchmarks/_runner.py +++ b/tests/benchmarks/_runner.py @@ -18,7 +18,7 @@ import time from dataclasses import dataclass from pathlib import Path -from typing import Any, Iterator +from typing import Any HERE = Path(__file__).resolve().parent REPO_ROOT = HERE.parent.parent diff --git a/tests/benchmarks/adapters/docling_adapter.py b/tests/benchmarks/adapters/docling_adapter.py index 17d4b02..bd573f5 100644 --- a/tests/benchmarks/adapters/docling_adapter.py +++ b/tests/benchmarks/adapters/docling_adapter.py @@ -41,7 +41,7 @@ sys.path.insert(0, str(_HERE.parent.parent.parent)) # repo root from tests.benchmarks._mem import peak_rss_mb # noqa: E402 -from tests.benchmarks._schema import BenchmarkRecord, SCHEMA_VERSION # noqa: E402 +from tests.benchmarks._schema import SCHEMA_VERSION, BenchmarkRecord # noqa: E402 PARSER_NAME = "docling" MAX_ERR_LEN = 500 diff --git a/tests/benchmarks/adapters/ks_adapter.py b/tests/benchmarks/adapters/ks_adapter.py index 917df29..754b2a6 100644 --- a/tests/benchmarks/adapters/ks_adapter.py +++ b/tests/benchmarks/adapters/ks_adapter.py @@ -28,10 +28,10 @@ sys.path.insert(0, str(_HERE.parent.parent.parent)) # repo root sys.path.insert(0, str(_HERE.parent.parent.parent / "src")) # src layout -from tests.benchmarks._mem import peak_rss_mb # noqa: E402 -from tests.benchmarks._schema import BenchmarkRecord, SCHEMA_VERSION # noqa: E402 from ks_xlsx_parser import __version__ as KS_VERSION # noqa: E402 from ks_xlsx_parser import parse_workbook # noqa: E402 +from tests.benchmarks._mem import peak_rss_mb # noqa: E402 +from tests.benchmarks._schema import SCHEMA_VERSION, BenchmarkRecord # noqa: E402 MAX_ERR_LEN = 500 diff --git a/tests/conftest.py b/tests/conftest.py index d422b9b..1417e30 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -8,7 +8,6 @@ -import os import tempfile from pathlib import Path @@ -17,7 +16,6 @@ from openpyxl.chart import BarChart, Reference from openpyxl.formatting.rule import CellIsRule, ColorScaleRule, IconSetRule from openpyxl.styles import Alignment, Border, Font, PatternFill, Side -from openpyxl.utils import get_column_letter from openpyxl.worksheet.datavalidation import DataValidation from openpyxl.worksheet.table import Table, TableStyleInfo diff --git a/tests/helpers/invariant_checker.py b/tests/helpers/invariant_checker.py index a077486..89efd1c 100644 --- a/tests/helpers/invariant_checker.py +++ b/tests/helpers/invariant_checker.py @@ -9,7 +9,7 @@ import re -from models.common import EdgeType +from ks_xlsx_parser.models.common import EdgeType def check_invariants(workbook) -> list[str]: diff --git a/tests/test_charts.py b/tests/test_charts.py index 021e793..802e6a6 100644 --- a/tests/test_charts.py +++ b/tests/test_charts.py @@ -5,11 +5,10 @@ type, series, axes, title, and position anchor. """ -import pytest -from charts.chart_extractor import ChartExtractor -from models import ChartType -from parsers import WorkbookParser +from ks_xlsx_parser.charts.chart_extractor import ChartExtractor +from ks_xlsx_parser.models import ChartType +from ks_xlsx_parser.parsers import WorkbookParser class TestChartExtraction: diff --git a/tests/test_corpus_robustness.py b/tests/test_corpus_robustness.py index 2253eea..4263c89 100644 --- a/tests/test_corpus_robustness.py +++ b/tests/test_corpus_robustness.py @@ -1,5 +1,5 @@ """ -Corpus robustness tests for the xlsx_parser. +Corpus robustness tests for the ks_xlsx_parser. Tests the parser against large sets of real-world .xlsx files to catch crashes, invariant violations, and regression. Corpus tests are skipped @@ -15,17 +15,15 @@ import pytest -from pipeline import parse_workbook -from models.common import Severity - -from tests.helpers.invariant_checker import check_invariants +from ks_xlsx_parser.models.common import Severity +from ks_xlsx_parser.pipeline import parse_workbook from tests.helpers.corpus_downloader import ( - download_euses_corpus, download_enron_corpus, + download_euses_corpus, download_github_xlsx_samples, get_corpus_files, ) - +from tests.helpers.invariant_checker import check_invariants CORPUS_DIR = Path(__file__).parent / "fixtures" / "corpus" diff --git a/tests/test_formula_handling.py b/tests/test_formula_handling.py index b3f8342..8402667 100644 --- a/tests/test_formula_handling.py +++ b/tests/test_formula_handling.py @@ -11,14 +11,11 @@ - Renders formula cells in chunk output """ -import pytest - -from formula.dependency_builder import DependencyBuilder -from formula.formula_parser import FormulaParser, ParsedReference -from models.common import BlockType, CellCoord, EdgeType -from parsers import WorkbookParser -from pipeline import parse_workbook +from ks_xlsx_parser.formula.formula_parser import FormulaParser +from ks_xlsx_parser.models.common import CellCoord, EdgeType +from ks_xlsx_parser.parsers import WorkbookParser +from ks_xlsx_parser.pipeline import parse_workbook # --------------------------------------------------------------------------- # Helpers @@ -705,7 +702,7 @@ class TestFormulaInBlocks: def test_block_formula_count(self, simple_formulas): result = parse_workbook(path=simple_formulas) # The block should report formula_count > 0 - from chunking.segmenter import LayoutSegmenter + from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter sheet = result.workbook.sheets[0] tables = [t for t in result.workbook.tables if t.sheet_name == sheet.sheet_name] segmenter = LayoutSegmenter(sheet, tables=tables) @@ -715,7 +712,7 @@ def test_block_formula_count(self, simple_formulas): def test_cross_sheet_block_has_formulas(self, cross_sheet_formulas): result = parse_workbook(path=cross_sheet_formulas) - from chunking.segmenter import LayoutSegmenter + from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter summary = [s for s in result.workbook.sheets if s.sheet_name == "Summary"][0] tables = [t for t in result.workbook.tables if t.sheet_name == summary.sheet_name] segmenter = LayoutSegmenter(summary, tables=tables) diff --git a/tests/test_formula_parser.py b/tests/test_formula_parser.py index 784e171..03c3959 100644 --- a/tests/test_formula_parser.py +++ b/tests/test_formula_parser.py @@ -5,10 +5,9 @@ and circular reference detection. """ -import pytest -from formula.formula_parser import FormulaParser -from models import CellCoord, DependencyGraph, DependencyEdgeDTO, EdgeType +from ks_xlsx_parser.formula.formula_parser import FormulaParser +from ks_xlsx_parser.models import CellCoord, DependencyEdgeDTO, DependencyGraph, EdgeType class TestFormulaParser: diff --git a/tests/test_llm_artifacts.py b/tests/test_llm_artifacts.py index c3247eb..87f6335 100644 --- a/tests/test_llm_artifacts.py +++ b/tests/test_llm_artifacts.py @@ -5,16 +5,15 @@ and reading-order linearization using programmatic fixtures. """ -import pytest -from analysis import ( +from ks_xlsx_parser.analysis import ( EntityIndexBuilder, KpiCatalogBuilder, ReadingOrderLinearizer, SheetSummaryAnalyzer, ) -from models.common import SheetPurpose -from parsers import WorkbookParser +from ks_xlsx_parser.models.common import SheetPurpose +from ks_xlsx_parser.parsers import WorkbookParser class TestSheetSummaryAnalyzer: diff --git a/tests/test_models.py b/tests/test_models.py index bddec00..5393eea 100644 --- a/tests/test_models.py +++ b/tests/test_models.py @@ -4,12 +4,12 @@ Covers CellCoord, CellRange, hashing, and serialization. """ -from models import ( +from ks_xlsx_parser.models import ( CellCoord, CellRange, - compute_hash, col_letter_to_number, col_number_to_letter, + compute_hash, ) diff --git a/tests/test_multi_table_layout.py b/tests/test_multi_table_layout.py index 7681a52..cca2057 100644 --- a/tests/test_multi_table_layout.py +++ b/tests/test_multi_table_layout.py @@ -8,11 +8,10 @@ import pytest -from chunking.segmenter import LayoutSegmenter -from models import BlockType -from parsers import WorkbookParser -from pipeline import parse_workbook - +from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter +from ks_xlsx_parser.models import BlockType +from ks_xlsx_parser.parsers import WorkbookParser +from ks_xlsx_parser.pipeline import parse_workbook # --------------------------------------------------------------------------- # Helpers diff --git a/tests/test_parsers.py b/tests/test_parsers.py index 850a0bb..fe1e911 100644 --- a/tests/test_parsers.py +++ b/tests/test_parsers.py @@ -6,9 +6,8 @@ tables, comments, data validations, and sheet properties. """ -import pytest -from parsers import WorkbookParser +from ks_xlsx_parser.parsers import WorkbookParser class TestSimpleWorkbook: diff --git a/tests/test_pipeline.py b/tests/test_pipeline.py index 534d125..7f1ab0e 100644 --- a/tests/test_pipeline.py +++ b/tests/test_pipeline.py @@ -7,9 +7,7 @@ import json -import pytest - -from pipeline import parse_workbook +from ks_xlsx_parser.pipeline import parse_workbook class TestEndToEndPipeline: diff --git a/tests/test_rendering.py b/tests/test_rendering.py index e24072a..3a757ec 100644 --- a/tests/test_rendering.py +++ b/tests/test_rendering.py @@ -5,12 +5,11 @@ headers, and coordinate annotations. """ -import pytest -from chunking.segmenter import LayoutSegmenter -from parsers import WorkbookParser -from rendering.html_renderer import HtmlRenderer -from rendering.text_renderer import TextRenderer +from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter +from ks_xlsx_parser.parsers import WorkbookParser +from ks_xlsx_parser.rendering.html_renderer import HtmlRenderer +from ks_xlsx_parser.rendering.text_renderer import TextRenderer class TestHtmlRendering: @@ -112,11 +111,10 @@ def test_numeric_cells_render_raw_not_display_formatted(self): sci-notation fallback (``1.272000e+03``) once the ``[=]`` formula marker pushed the rendered string past col_width — this test guards against that regression.""" - from models.sheet import SheetDTO - from models.cell import CellDTO - from models.common import CellCoord, CellRange - from models.block import BlockDTO - from models.common import BlockType + from ks_xlsx_parser.models.block import BlockDTO + from ks_xlsx_parser.models.cell import CellDTO + from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange + from ks_xlsx_parser.models.sheet import SheetDTO coord = CellCoord(row=1, col=1) cell = CellDTO( diff --git a/tests/test_segmentation.py b/tests/test_segmentation.py index 8cb3cfc..45f13b0 100644 --- a/tests/test_segmentation.py +++ b/tests/test_segmentation.py @@ -5,11 +5,10 @@ assumption blocks, result blocks, and text headers. """ -import pytest -from chunking.segmenter import LayoutSegmenter -from models import BlockType -from parsers import WorkbookParser +from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter +from ks_xlsx_parser.models import BlockType +from ks_xlsx_parser.parsers import WorkbookParser class TestSegmentation: diff --git a/tests/test_stage_verification.py b/tests/test_stage_verification.py index 574be13..75dc8bc 100644 --- a/tests/test_stage_verification.py +++ b/tests/test_stage_verification.py @@ -8,9 +8,7 @@ import json -import pytest - -from verification import ( +from ks_xlsx_parser.verification import ( ExcellentStage, ImplementationStatus, StageResult, @@ -18,7 +16,6 @@ VerificationReport, ) - # --------------------------------------------------------------------------- # Model tests # --------------------------------------------------------------------------- diff --git a/tests/test_structural_invariants.py b/tests/test_structural_invariants.py index 612d11a..bbe1460 100644 --- a/tests/test_structural_invariants.py +++ b/tests/test_structural_invariants.py @@ -1,5 +1,5 @@ """ -Structural invariant tests for the xlsx_parser. +Structural invariant tests for the ks_xlsx_parser. Tests properties that must always hold for any valid parse output, regardless of the input file: merge structure, used range bounds, @@ -13,11 +13,8 @@ import pytest -from pipeline import parse_workbook -from models.common import EdgeType - -from tests.helpers.invariant_checker import check_invariants - +from ks_xlsx_parser.models.common import EdgeType +from ks_xlsx_parser.pipeline import parse_workbook # --------------------------------------------------------------------------- # Invariant tests on programmatic fixtures From 43e2109b249c83c510d5aa355771e91a582b3888 Mon Sep 17 00:00:00 2001 From: Arnav Goel Date: Tue, 19 May 2026 17:39:16 -0400 Subject: [PATCH 2/4] docs(bench): add benchmark-local-setup.md walkthrough MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Verified locally on a real 60-instance SpreadsheetBench sample: recall_text@1 = 0.550, recall_text@5 = 0.633, geometric@5 = 0.283 failure_buckets = {answer_absent_from_chunks: 11, ...all others 0} Producer side of --emit-failures runs end-to-end on the real corpus. The doc captures the exact loop (install → corpus → eval → triage → history) plus Docker parity and troubleshooting. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/benchmark-local-setup.md | 235 ++++++++++++++++++++++++++++++++++ 1 file changed, 235 insertions(+) create mode 100644 docs/benchmark-local-setup.md diff --git a/docs/benchmark-local-setup.md b/docs/benchmark-local-setup.md new file mode 100644 index 0000000..ab39803 --- /dev/null +++ b/docs/benchmark-local-setup.md @@ -0,0 +1,235 @@ +# Running the retrieval benchmark locally + +This is the loop you'll use whenever you want to know "is my parser +change actually moving recall?" — the same pipeline CI runs, just on +your laptop. Goal: text recall@5 > 0.90 (currently ~0.70). + +The full SpreadsheetBench corpus is 912 instances and takes ~30–45 min +on a Mac. For an iteration loop you'll mostly use `--sample 60` (≈ 3 min +after the first embedding-model load). + +## TL;DR + +```bash +# One-time +make install-dev # installs the parser + dev deps +pip install -e ".[bench]" # sentence-transformers + numpy (~500 MB) +make corpus-download # downloads SpreadsheetBench → data/corpora/ + # (91 MB tarball, 2,726 .xlsx after extract) + +# Each time you want to score +python scripts/eval_retrieval.py \ + --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \ + --parsers ks \ + --sample 60 \ + --emit-failures +python scripts/triage_recall.py tests/benchmarks/reports/retrieval +python scripts/append_bench_history.py +``` + +Output: `tests/benchmarks/reports/retrieval//` with +`results.ndjson`, `failures.ndjson`, `summary.json`, `summary.md`. + +## Step-by-step + +### 1. Install bench deps + +The benchmark uses `sentence-transformers` (≈ 500 MB with torch) for +embeddings. They're a separate optional group so the parser package +itself stays lean: + +```bash +pip install -e ".[bench]" +``` + +First run also downloads the embedding model (`BAAI/bge-small-en-v1.5`, +≈ 130 MB) into `~/.cache/huggingface/`. + +### 2. Download the corpus + +```bash +make corpus-download +``` + +This is the same `scripts/download_corpora.sh` CI uses. It pulls +SpreadsheetBench v0.1 plus a few legacy XLSX samples; only +SpreadsheetBench matters for retrieval scoring. + +Layout you should end up with: + +``` +data/corpora/spreadsheetbench/ + all_data_912_v0.1/ + dataset.json # 912 (question, answer_position) tuples + spreadsheet/ + / + 1__input.xlsx # test case 1 input + 1__answer.xlsx # test case 1 ground-truth output + 2_..., 3_... # additional test cases per instance +``` + +`data/` is gitignored — never commit corpus files. + +### 3. Run the benchmark + +A typical iteration cycle uses a sample for speed: + +```bash +python scripts/eval_retrieval.py \ + --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \ + --parsers ks \ + --sample 60 \ + --emit-failures +``` + +Flags worth knowing: + +| Flag | What it does | +|-----------------------|-------------------------------------------------------------| +| `--parsers ks,docling`| Score one or both parsers. Docling is heavy; skip unless comparing. | +| `--sample N` | Random N-instance subset (seeded). Omit for the full 912. | +| `--seed 1337` | Random seed for `--sample`. Stays stable across runs. | +| `--emit-failures` | Also write `failures.ndjson` with top-8 chunks per miss. | +| `--test-case 1` | Which of the (usually 3) test cases per instance to score. | +| `--per-parser-timeout`| Wall-clock seconds before a hung parse is killed. Default 60. | + +For a full run, drop `--sample` and add `--per-parser-timeout 120`: + +```bash +python scripts/eval_retrieval.py \ + --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \ + --parsers ks \ + --emit-failures \ + --per-parser-timeout 120 +``` + +`make bench-retrieval` is the same thing with the canonical defaults. + +### 4. Read the triage report + +```bash +python scripts/triage_recall.py tests/benchmarks/reports/retrieval +``` + +This auto-finds the most recent run and prints: + +* **Bucket histogram** ranked by count. The top bucket is the + highest-leverage thing to fix next. +* **3 example failures** per bucket showing the question, the + ground-truth answer cell + values, and the top-8 chunks the parser + produced (with a ✓ next to chunks that contain the answer). + +Drill into one bucket: + +```bash +python scripts/triage_recall.py tests/benchmarks/reports/retrieval \ + --bucket answer_absent_from_chunks --examples 10 +``` + +Five buckets and what they mean: + +| Bucket | Root cause | Fix lives in | +|---|---|---| +| `answer_absent_from_chunks` | Answer value in NO chunk. Cell dropped or rendered as formula. | `parsers/cell_parser.py`, `rendering/text_renderer.py` | +| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. Chunk too big/heterogeneous. | `chunking/chunker.py` | +| `wrong_sheet` | Answer sheet never chunked. | `parsers/workbook_parser.py` | +| `geometric_no_overlap` | Text matches but the chunk's A1 range doesn't overlap GT. | `annotation/block_splitter.py`, `analysis/pattern_splitter.py` | +| `no_chunks` / `parse_error` | Upstream parser failure. | The exception. | + +See `docs/recall-investigation.md` for the named hypotheses behind each +bucket and `.claude/skills/recall-failure-triage/SKILL.md` for the +agent-driven loop. + +### 5. Append to history.jsonl + +```bash +python scripts/append_bench_history.py +``` + +Appends one row per benchmark run to +`tests/benchmarks/reports/history.jsonl` tagged with the current git +commit, and prints the row-over-row delta on the headline metrics: + +``` +appended to tests/benchmarks/reports/history.jsonl: + commit 421783f recall_text@5=0.704 recall_text@1=0.580 + recall_text@5: 0.6800 → 0.7040 ▲ +0.0240 +``` + +That's how "is recall improving?" gets answered. Goal: `recall_text@5 > 0.90`. + +`make bench-track` chains eval + history-append + triage in one go. + +## Docker path (matches CI exactly) + +When you want to make sure local results aren't drifting from CI: + +```bash +docker build -f Dockerfile.bench -t ks-xlsx-parser-bench . + +# Quick sanity (60 instances, ~3 min after image load): +docker run --rm \ + -e BENCH_SAMPLE=60 \ + -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \ + -v "$PWD/data:/app/data" \ + ks-xlsx-parser-bench + +# Full corpus: +docker run --rm \ + -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \ + -v "$PWD/data:/app/data" \ + ks-xlsx-parser-bench +``` + +The image pre-warms the embedding model at build time so the first +`docker run` doesn't pay the 130 MB download. + +Environment knobs (also work in CI dispatch): + +| Env var | Default | What it does | +|------------------|---------|-------------------------------------------| +| `BENCH_SAMPLE` | `0` | Sample N instances (0 = full 912) | +| `BENCH_PARSERS` | `ks` | Comma list (e.g. `ks,docling`) | +| `BENCH_TIMEOUT` | `120` | Per-file parse timeout in seconds | + +## Adding a new failure bucket + +If you find a recall failure mode that doesn't fit any of the existing +six buckets, add it instead of stuffing it into `answer_absent_from_chunks`: + +1. Append the new bucket name to `FAILURE_BUCKETS` in + `scripts/eval_retrieval.py`. +2. Update `classify_text_failure` so it can return the new name. Keep + the predicate cheap — it runs once per scored instance. +3. Add the bucket + a one-line root cause + fix location to the table + in `docs/recall-investigation.md` and the SKILL file. +4. Re-run `make bench-track`; confirm the histogram shows the new + bucket and counts make sense. + +## Troubleshooting + +* `ModuleNotFoundError: No module named 'sentence_transformers'` + — you skipped `pip install -e ".[bench]"`. +* `dataset.json not found in ...` — your `--corpus` is pointing at + `data/corpora/spreadsheetbench`, not the nested + `data/corpora/spreadsheetbench/all_data_912_v0.1`. The benchmark + expects the leaf directory that contains `dataset.json`. +* `FileNotFoundError: 1__input.xlsx` — the corpus tarball didn't + fully extract. Delete `data/corpora/spreadsheetbench/` and re-run + `make corpus-download`. +* `recall_text@5 = 0.0` on a sample of 5 — small samples have huge + variance because the benchmark seeds. Bump to `--sample 60` minimum + before trusting the number; use `--sample 0` (full corpus) for a + decision-grade comparison. +* MPS / CUDA torch errors on first sentence-transformers import — + re-install with `pip install --upgrade torch torchvision`. The + benchmark runs fine on CPU. + +## See also + +* `scripts/eval_retrieval.py` — the benchmark itself. +* `scripts/triage_recall.py` — the bucket histogram + exemplar dump. +* `scripts/append_bench_history.py` — history.jsonl row writer. +* `Dockerfile.bench` — reproducible benchmark image. +* `docs/recall-investigation.md` — diagnosis framework & hypotheses. +* `.claude/skills/recall-failure-triage/SKILL.md` — agent guide. From 0cd1175fb96833fcde2ca6785b64fe7ab02a7d5d Mon Sep 17 00:00:00 2001 From: Arnav Goel Date: Tue, 19 May 2026 17:58:52 -0400 Subject: [PATCH 3/4] =?UTF-8?q?recall(planning):=20break=20recall=E2=86=92?= =?UTF-8?q?0.90=20into=20parallel=20TODOs=20+=20enrichment=20script?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Ran the 200-instance seed=1337 sample with --emit-failures locally, enriched the failure rows with per-instance diagnostics (input.xlsx vs chunk bbox vs answer.xlsx vs sheet enumeration), and clustered the actionable failures by observable pattern. Key empirical finding: 127 of 200 instances are `instruction_requires_execution` — the benchmark asks the system to COMPUTE and WRITE the answer; a retrieval parser cannot satisfy them. The headline recall_text@5 = 0.635 includes those instances in the denominator, dragging the metric down by ~25 pp. Real in-scope recall is **0.59**, and getting to 0.90 means closing 30 actionable failures. Adds: - docs/planning/recall-90/README.md — overview, in-scope vs out-of-scope framing, status table, the iteration loop, parallelism gates by file scope. - docs/planning/recall-90/00..05.md — one cluster per file. Each TODO is self-contained with: cluster pattern, repro instance IDs, diagnostic signature (jq query), file scope, acceptance criteria stated against the 200-sample seed=1337 rerun, a failing-test sketch. Deliberately omits proposed fixes and effort estimates (the picking agent does that work). - scripts/enrich_failures.py — post-hoc enricher. Reads results.ndjson, re-parses each failed instance with openpyxl + ks-xlsx-parser, emits per-row diagnostic columns (gt_sheet, gt_cell_raw, gt_cell_formula, cached value, chunked sheets vs workbook sheets, chunk bbox vs GT range, ten heuristic flags). Avoids re-running the slow benchmark to add diagnostic columns. - .claude/skills/recall-failure-triage/SKILL.md — adds the in-scope filter step (Step 0). Cluster summary (200-sample seed=1337): 00 benchmark spec parser bugs 8 instances 01 array-formula rendering 2 instances 02 chunk range vs text mismatch 7 instances 03 cell-drop / uncached formula ≤10 instances (some unscorable) 04 single-chunk-per-sheet dilution 2 instances 05 out-of-scope execution instances 127 instances (scoring fix) Parallelism: each TODO scoped to a different file slice; 00 and 05 both touch eval_retrieval.py so they're flagged for serialization. Everything else can land independently. Updates docs/recall-investigation.md with the post-200-sample reality: H1 (chunk-size dilution) was largely refuted; H3 (range bookkeeping) landed as TODO 02; H2 (formula rendering) extended into TODO 03 covering both cached-value gaps and array formulas. 1041 tests still passing. Co-Authored-By: Claude Opus 4.7 (1M context) --- .claude/skills/recall-failure-triage/SKILL.md | 154 ++++++++ .../00-benchmark-spec-parser-bugs.md | 119 ++++++ .../recall-90/01-array-formula-rendering.md | 130 +++++++ .../02-chunk-range-vs-text-mismatch.md | 150 ++++++++ .../03-cell-drop-or-uncached-formula.md | 165 +++++++++ .../04-single-chunk-multi-region-sheet.md | 138 +++++++ .../05-out-of-scope-execution-instances.md | 136 +++++++ docs/planning/recall-90/README.md | 137 +++++++ docs/recall-investigation.md | 33 +- scripts/enrich_failures.py | 348 ++++++++++++++++++ 10 files changed, 1509 insertions(+), 1 deletion(-) create mode 100644 .claude/skills/recall-failure-triage/SKILL.md create mode 100644 docs/planning/recall-90/00-benchmark-spec-parser-bugs.md create mode 100644 docs/planning/recall-90/01-array-formula-rendering.md create mode 100644 docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md create mode 100644 docs/planning/recall-90/03-cell-drop-or-uncached-formula.md create mode 100644 docs/planning/recall-90/04-single-chunk-multi-region-sheet.md create mode 100644 docs/planning/recall-90/05-out-of-scope-execution-instances.md create mode 100644 docs/planning/recall-90/README.md create mode 100755 scripts/enrich_failures.py diff --git a/.claude/skills/recall-failure-triage/SKILL.md b/.claude/skills/recall-failure-triage/SKILL.md new file mode 100644 index 0000000..a9a9263 --- /dev/null +++ b/.claude/skills/recall-failure-triage/SKILL.md @@ -0,0 +1,154 @@ +--- +name: recall-failure-triage +description: Diagnose why ks-xlsx-parser's retrieval recall is below target by reading the failure-bucket histogram and exemplar failures, then propose ranked fixes. Use when investigating low recall@k on SpreadsheetBench, deciding what to fix next, or reviewing a benchmark run. Sister skill to excel-extraction-pipeline-improver. +--- + +# Recall Failure Triage — turn `recall@5 = X` into a ranked worklist + +A single recall number tells you nothing about *what* to fix. Five +distinct failure modes produce identical recall drops, and the fix is +completely different per mode. This skill is the bridge from "recall is +low" to "open `parsers/cell_parser.py` line N, here is the bug". + +Pair with `excel-extraction-pipeline-improver` — that skill implements +fixes; this one decides what to fix. + +## When this skill fires + +* User says "recall is low / dropped / not improving" +* New benchmark run produced a `failures.ndjson` or `summary.json` +* PR-time benchmark sample regressed on the CI step summary +* You're picking the next parser/chunking PR and need to choose what's + highest leverage + +## Inputs you should expect + +| Input | Where it comes from | +|---|---| +| `summary.json` + `summary.md` | latest `tests/benchmarks/reports/retrieval//` | +| `failures.ndjson` | same dir, **only present if** the run used `--emit-failures` | +| `history.jsonl` | `tests/benchmarks/reports/history.jsonl` — recall over commits | +| current branch + recent commits | to know what changed since the last run | + +## The decision procedure + +### 0. Apply the in-scope filter — most "misses" are unfixable + +Before counting buckets, run `scripts/enrich_failures.py` to mark +instances flagged `instruction_requires_execution` — those where the +benchmark's `answer_position` is empty in `input.xlsx` because the +question literally asks the system to *write* the answer. ~63% of +the SpreadsheetBench corpus is this class on the 200-sample. They are +NOT parser bugs and CANNOT be fixed by parser work. Exclude them. + +```bash +python scripts/enrich_failures.py tests/benchmarks/reports/retrieval +jq -c 'select((.flags | contains(["instruction_requires_execution"])) | not)' \ + tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson +``` + +### 1. Read the bucket histogram FIRST + +```bash +python scripts/triage_recall.py tests/benchmarks/reports/retrieval +``` + +The output ranks buckets by count. **Spend your time on the largest +bucket only.** The next-largest is the next PR. + +| Bucket | Root cause | Where to fix | +|---|---|---| +| `answer_absent_from_chunks` | The answer value never made it into ANY chunk. The cell was dropped or rendered as a formula expression instead of its computed value. | `parsers/cell_parser.py::_extract_cell_value` and `rendering/text_renderer.py::_cell_render_value`. Check `data_only` plumbing in `parsers/workbook_parser.py`. | +| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. The chunk is too big/heterogeneous — embedding is diluted. | `chunking/chunker.py`. There is no token cap today; large blocks are emitted whole. Cap at ~512 tokens and row-split tall tables. | +| `wrong_sheet` | Answer sheet was never chunked. | `parsers/workbook_parser.py` sheet loop. Check hidden sheets, very-hidden sheets, and `wb.sheetnames` vs `wb.worksheets`. | +| `geometric_no_overlap` | Answer text matches but the chunk's A1 range doesn't overlap ground truth. Range drifts during merge/split. | `annotation/block_splitter.py` + `analysis/pattern_splitter.py`. Add invariant: `block.cell_range` ⊆ bbox of cells that actually rendered text. | +| `no_chunks` / `parse_error` | Upstream parser failure. | The exception. Reproduce with `parse_workbook(path)`. | + +### 2. Read 5 example failures from the top bucket + +```bash +python scripts/triage_recall.py tests/benchmarks/reports/retrieval \ + --bucket --examples 5 +``` + +For each example, the script prints: +* the natural-language question +* `answer_position` (the ground-truth cell) +* the ground-truth string values that should appear in a chunk +* the top-8 ranked chunks with sheet, range, and whether each contains + the answer + +**Pattern-match.** What do the 5 examples have in common? E.g. +* All answer cells are formulas → H2 confirmed: render the cached value +* All chunks span the entire sheet → H1 confirmed: cap chunk size +* All answer sheets are hidden → wrong_sheet from hidden-sheet skip + +### 3. State the hypothesis before changing code + +In your reply, write the hypothesis as one sentence: +> "Bucket X dominates because [observable common pattern in examples]. +> Expected fix: [specific file + change]. Expected lift on recall@5: +> [number]." + +The expected lift = (bucket count) ÷ (scoreable instances). If a fix +is supposed to halve the biggest bucket but recall barely moves on the +next run, the hypothesis was wrong — go back to step 1, don't pile on. + +### 4. Wire the fix into the existing pipeline + +Pass the hypothesis to the `excel-extraction-pipeline-improver` skill +(or implement directly) with TDD: write a failing crossval/invariant +test that captures the bug from one example, fix, run `make test`, then +re-run `make bench-track` to confirm the bucket shrank. + +### 5. Verify on the history + +```bash +tail -5 tests/benchmarks/reports/history.jsonl +``` + +The script prints a row-over-row delta. Expected behaviour: +* The targeted bucket count drops +* `recall_text@5` rises by approximately `target_bucket_drop / + scoreable_instances` +* `recall_text@1` improves proportionally OR more (better chunks rank + higher too) +* Other buckets shouldn't grow — if they do, the fix had a side effect + +## Anti-patterns + +* **Don't chase the headline number.** "Recall went up 1 pp" without a + bucket explanation is suspicious — it might be benchmark noise. +* **Don't fix two buckets at once.** You won't know which fix worked. +* **Don't tune chunk size as the first move.** It only helps + `present_but_ranked_low`. If `answer_absent_from_chunks` is bigger, + chunk tuning is wasted effort. +* **Don't trust geometric and text recall independently.** A docling-style + parser with zero geometric overlap can still have high text recall + (because it serializes everything to plain text). That's not the same + thing as good — citation overlays need the geometric metric. Always + report both. + +## Quick reference — running it from scratch + +```bash +make corpus-download # one-time, large +make bench-track # runs + triages +python scripts/triage_recall.py tests/benchmarks/reports/retrieval --examples 5 +cat tests/benchmarks/reports/history.jsonl | tail -10 # commit-over-commit +``` + +CI runs a 60-instance sample on every PR (`.github/workflows/benchmark.yml`) +and posts the bucket histogram to the GitHub job summary. Weekly schedule +runs the full 912-instance corpus. + +## Hand-off contract + +When you finish a triage round, your final message MUST include: +1. The bucket histogram (counts + percentages). +2. The dominant bucket + one-sentence root cause hypothesis. +3. The 1–2 file paths where the fix lives. +4. The expected lift on `recall_text@5` if the fix lands. +5. A pointer to the example instance IDs you read. + +Anything less and the next agent will have to re-do the analysis. diff --git a/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md b/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md new file mode 100644 index 0000000..8d33b7a --- /dev/null +++ b/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md @@ -0,0 +1,119 @@ +# 00 · Benchmark harness fails to parse malformed `answer_position` strings + +**Status:** 🆓 free to claim +**Slice:** A (benchmark harness — `scripts/eval_retrieval.py`) +**Independent of:** all other TODOs in this folder + +## What it looks like + +Eight instances in the 200-sample (seed=1337) score as recall@5 misses +even though the parser produced perfectly reasonable chunks. The +benchmark harness's `parse_position_spec()` returns no regions because +the `answer_position` string in `dataset.json` doesn't match +`SHEET_RANGE_RE`. The score code then sees `data_regions=0`, marks the +instance as "no scorable target", and the recall denominator counts it +as a miss. Net effect: the parser is blamed for benchmark-spec parsing +bugs. + +Concrete malformed strings observed: + +| Instance | `answer_position` string | What's wrong | +|---|---|---| +| `50442` | `RESULTS 1'!G17` | unmatched closing quote, opening quote missing | +| `49490` | `Sheet1'!H3:H58` | same | +| `49036` | `Dashboard'!B8` | same | +| `55427` | `Compiled and located schools da'!B2:B1461` | same + long sheet name | +| `48975` | `Output'!B11:B17` | same | +| `37456` | `G12:J15` | fullwidth Chinese colon `:` instead of `:` | +| `184-6` | `'RAWDATA!'A1:P6,'OUTPUT!'A1:P6'` | quote between sheet and `!`, comma-joined | +| `164-22` | `'YEAR1!'A1:G1478,'YEAR2!'A1:G1480'` | same shape as 184-6 | + +(`374-9` may also belong here — investigate.) + +## Diagnostic signature + +In `enriched_failures.ndjson`, members of this cluster have: + +* `gt_sheet == null` (spec parser couldn't extract the sheet name), OR +* `gt_sheet` wrapped in apostrophes (`"'Sheet1'"`), AND +* the workbook *does* contain a sheet with the bare name. + +```bash +python scripts/enrich_failures.py tests/benchmarks/reports/retrieval +jq -c 'select(.gt_sheet == null or (.gt_sheet | startswith("'")))' \ + tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson +``` + +## File scope + +You may touch: + +* `scripts/eval_retrieval.py` — `SHEET_RANGE_RE`, `parse_position_spec`, + `_normalize_value_for_match`. This is where the bugs live. +* `scripts/enrich_failures.py` — must stay consistent with the harness; + re-run after editing. +* Add new tests under `tests/test_eval_retrieval.py` (or similar — file + doesn't exist yet; create it). + +Do **NOT** touch any `src/ks_xlsx_parser/*` code for this cluster. If +the urge to "make the parser tolerate these patterns too" arises, +that's a separate TODO and a category error here — these are dataset +typos, not parser inputs. + +## Acceptance criteria + +On `python scripts/eval_retrieval.py --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 --parsers ks --sample 200 --seed 1337 --emit-failures`: + +1. All 8 instances listed above stop appearing as recall@5 misses + (text or geometric) on the diagnostic ndjson. +2. `geometric@5` rises by ≥ 4 pp (each fixed instance flips from miss + to hit since the chunks already overlap the correct region). +3. No previously-passing instance becomes a miss. Diff the + `results.ndjson` before and after — `pass → fail` count must be 0. +4. `python scripts/enrich_failures.py` shows zero rows with + `gt_sheet == null` from the eight listed instances. + +## Failing test sketch + +```python +# tests/test_eval_retrieval_spec_parser.py +import pytest +from scripts.eval_retrieval import parse_position_spec + +@pytest.mark.parametrize("spec, expected_sheet", [ + ("Dashboard'!B8", "Dashboard"), + ("Sheet1'!H3:H58", "Sheet1"), + ("'RAWDATA!'A1:P6", "RAWDATA"), + ("G12:J15", None), # this one keeps default_sheet; sheet unaltered +]) +def test_unmatched_quotes_resolve_sheet(spec, expected_sheet): + regions = parse_position_spec(spec, default_sheet="DEFAULT") + assert regions, f"no regions for {spec!r}" + if expected_sheet is not None: + assert regions[0][0] == expected_sheet + +def test_fullwidth_colon_in_range(): + # Excel-China dataset entries occasionally use the fullwidth colon. + regions = parse_position_spec("G12:J15", default_sheet="S") + assert regions and regions[0][1] == (12, 7, 15, 10) +``` + +These should fail on `main`. They pass when the fix lands. + +## Pitfalls + +* `'YEAR1!'A1:G1478,'YEAR2!'A1:G1480'` is multi-region. Make sure the + comma-split path still returns BOTH regions after the fix. +* `_normalize_value_for_match` is not involved here — don't refactor it. +* The current regex `SHEET_RANGE_RE` uses a backreference for matching + quotes; tightening the closing quote to "optional, but only if opening + present" is fragile. Consider a two-pass approach: strip stray + apostrophes, then re-attempt the strict regex. + +## Repro fixtures + +The eight instances live under +`data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet//`. +Quote one as a unit-test data point — do NOT vendor the corpus into +the repo. If a snapshot of the malformed strings is wanted, hand-copy +them into a Python fixture (they're ≤ 100 chars each). diff --git a/docs/planning/recall-90/01-array-formula-rendering.md b/docs/planning/recall-90/01-array-formula-rendering.md new file mode 100644 index 0000000..a9edd72 --- /dev/null +++ b/docs/planning/recall-90/01-array-formula-rendering.md @@ -0,0 +1,130 @@ +# 01 · Array-formula cells render as `` + +**Status:** 🆓 free to claim +**Slice:** B + D (`parsers/cell_parser.py`, `rendering/text_renderer.py`) +**Independent of:** 00, 02, 04 — partially overlapping with 03 on +`text_renderer.py`; coordinate via the README table. + +## What it looks like + +Two instances (`43026`, `59969`) hit this exact pattern. The +ground-truth answer cell contains an Excel **array formula** (an +`{= ...}` formula entered with Ctrl-Shift-Enter that returns multiple +cells). openpyxl returns these as an `ArrayFormula` Python object, NOT +a string and NOT a number. Our cell extractor stuffs the repr of that +object into the cell's value, so the rendered chunk text contains +literal: + +``` + +``` + +This obviously kills both text-match (the answer values are nowhere in +the chunk) and any downstream LLM consumption. + +Instance `43026` (`summary!D10`): + +``` +gt_cell_raw = '' +gt_cell_formula = None ← our heuristic missed it because raw_value isn't a str starting with '=' +gt_cell_data_only = (cached value, if any) +``` + +## Diagnostic signature + +```bash +jq -c 'select(.gt_cell_raw | tostring | contains("ArrayFormula"))' \ + tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson +``` + +Likely a much larger cluster at full-corpus scale — most "tabular +report" workbooks use array formulas. Two on the 200-sample is the floor. + +## File scope + +You may touch: + +* `src/ks_xlsx_parser/parsers/cell_parser.py` — wherever the raw value + is extracted from `openpyxl.cell.Cell.value`. Detect `ArrayFormula`, + pull the formula text out of it (`obj.text` on openpyxl) and treat it + as a formula cell. +* `src/ks_xlsx_parser/rendering/text_renderer.py::_cell_render_value` + — make sure array-formula cells render their *cached value* (from the + `data_only` workbook pass) when one exists; fall back to the formula + expression as a string only if no cached value. +* `src/ks_xlsx_parser/models/cell.py` — add an `is_array_formula: bool` + field if useful for downstream consumers. + +Do NOT add a new evaluator for array formulas — out of scope (and +covered indirectly by the cached-value path). + +## Acceptance criteria + +1. Build a tiny array-formula fixture inside `tests/conftest.py` (the + existing programmatic-fixture pattern). Use `openpyxl`'s + `ArrayFormula` constructor; populate the `data_only` workbook with a + plausible cached value. +2. Add a test that asserts the chunk's `render_text` for that fixture + contains the cached value AND does **not** contain the substring + `ArrayFormula object`. +3. On the 200-sample seed=1337 rerun, instances `43026` and `59969` + move out of the failure set on the text-match metric IF the answer + values exist as cached data in the input (verify with openpyxl + first; if they don't, the instance is `instruction_requires_execution` + and you can't move it — note it in the PR). +4. No previously-passing crossval test regresses (`make test`). + +## Failing test sketch + +```python +# tests/test_array_formula_rendering.py +import openpyxl +from openpyxl.worksheet.formula import ArrayFormula +from ks_xlsx_parser.pipeline import parse_workbook + +def test_array_formula_renders_cached_value(tmp_path): + p = tmp_path / "af.xlsx" + wb = openpyxl.Workbook() + ws = wb.active + ws["A1"] = 10 + ws["A2"] = 20 + ws["A3"] = 30 + # Array formula in B1:B3 summing the row; cache a plausible value. + af = ArrayFormula("B1:B3", "=A1:A3*2") + ws["B1"] = af + # openpyxl writes the cached value separately when data_only=True is read; + # for fixture purposes, write the file once and patch cached values via + # a second openpyxl pass. + wb.save(p) + + chunks = parse_workbook(path=str(p)).chunks + assert chunks, "no chunks produced" + text = chunks[0].render_text + assert "ArrayFormula" not in text, "raw ArrayFormula object leaked" + # Once the fix lands, expect cached values 20 / 40 / 60 to render. +``` + +## Pitfalls + +* openpyxl's behaviour for `ArrayFormula` differs between read-only + and normal mode. Make sure both paths in `workbook_parser.py` agree + on how the cell is surfaced. +* If the input was saved by LibreOffice or generated programmatically + without computing array results, `cached_value` will be `None`. In + that case there's no useful retrieval signal — surfacing the formula + expression itself is acceptable as a fallback, but it must be a + string, not `repr()`. +* "Spilled" dynamic arrays (Excel 365's new `=A1:A3*2` without + Ctrl-Shift-Enter) are a related but distinct case. Note in your PR + if you handled both or only the classic array formula. + +## Repro + +```bash +python -c " +from openpyxl import load_workbook +wb = load_workbook('data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/43026/1_43026_input.xlsx') +ws = wb['summary'] +print(type(ws['D10'].value), repr(ws['D10'].value)) +" +``` diff --git a/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md b/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md new file mode 100644 index 0000000..bd7500b --- /dev/null +++ b/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md @@ -0,0 +1,150 @@ +# 02 · `text_hit_geom_miss` — chunk contains answer text but range doesn't overlap GT + +**Status:** 🆓 free to claim +**Slice:** E (`annotation/block_splitter.py`, `analysis/pattern_splitter.py`, `analysis/light_block_detector.py`) +**Independent of:** 00, 01, 03, 04 + +## What it looks like + +Seven actionable instances on the 200-sample +(`61-4`, `353-29`, `382-29`, `80-42`, `334-11`, `495-31`, `462-45`) +where: + +* `rank_of_text_match ≤ 5` — some chunk's `render_text` contains the + answer value(s). +* `rank_of_first_overlap is None` — but no chunk's claimed A1 range + overlaps the ground-truth range. +* `n_chunks_on_gt_sheet ≥ 1` — we DID emit chunks on the correct sheet. + +So the parser knows what data exists on the sheet (text is there), but +the **range bookkeeping** on the chunk it emitted is wrong: the claimed +`top_left_cell` / `bottom_right_cell` excludes the answer cell even +though the chunk's text includes it. + +This is the **citation-grade killer**. ks-backend's UI will highlight a +region that doesn't actually contain the answer the LLM cited — looks +buggy to the user. + +Example — instance `61-4`: +``` +ans = 'output'!A2:G15 +chunks on 'output': 1 chunk +chunk text contains: the answer string (text_rank=1) +chunk's reported A1 range: NOT A2:G15 +``` + +## Hypothesis (deliberately under-specified) + +Block merge/split paths widen the *text* (by absorbing adjacent cells) +faster than they update `cell_range.bottom_right`, OR a pattern-split +narrows the `cell_range` to exclude the rows the splitter just lifted +into a child block while the rendered text still contains them. Find +the actual code path empirically. + +## Diagnostic signature + +```bash +jq -c 'select(.bucket_combined == "text_hit_geom_miss" + and (.flags | contains(["instruction_requires_execution"]) | not) + and .n_chunks_on_gt_sheet > 0)' \ + tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson +``` + +For each match, the comparison that matters is `gt_range_bbox` vs each +chunk on the GT sheet's `top_left_cell`/`bottom_right_cell`. Print the +chunk's `render_text` and confirm it DOES contain the answer values +that openpyxl reads at `answer_position`. + +## File scope + +You may touch: + +* `src/ks_xlsx_parser/models/block.py` — invariants on `BlockDTO.cell_range`. +* `src/ks_xlsx_parser/annotation/block_splitter.py` +* `src/ks_xlsx_parser/analysis/pattern_splitter.py` +* `src/ks_xlsx_parser/analysis/light_block_detector.py` +* `src/ks_xlsx_parser/chunking/chunker.py::_block_to_chunk` — at chunk + creation time, you can defensively clip `cell_range` to the bounding + box of cells that actually contributed text. That's a safety net even + if the upstream bug isn't located. + +Do NOT touch parsing (`parsers/*`) — the inputs are correct; +downstream block bookkeeping is wrong. + +## Acceptance criteria + +1. Add an invariant test (`tests/test_structural_invariants.py` + pattern): for every chunk, the cells whose values appear in + `render_text` MUST lie within + `[chunk.top_left_cell, chunk.bottom_right_cell]`. Test should fail + on `main` for at least 3 of the 7 cluster instances. +2. After the fix, all 7 instances flip from `text_hit_geom_miss` to + `both_hit` (`rank_of_first_overlap ≤ 5`). +3. `geometric@5` on the 200-sample seed=1337 rises by ≥ 3 pp. +4. `recall_text@5` does NOT drop (tightening ranges shouldn't drop + text — if it does, the chunker is also dropping text in step with + the range clip, which is a separate bug). +5. `table_fragmentation_rate` does NOT rise — if you fixed range drift + by splitting one chunk into many, fragmentation will spike. That + would solve cluster 02 by creating cluster 04 — net zero. Either + keep fragmentation flat or coordinate with whoever owns 04. + +## Failing test sketch + +```python +# tests/test_chunk_range_invariants.py +import re +from openpyxl.utils import column_index_from_string +from ks_xlsx_parser.pipeline import parse_workbook +from ks_xlsx_parser.models.common import col_letter_to_number + +A1_RE = re.compile(r"^([A-Z]+)(\d+)$") + +def _parse_a1(a1): + m = A1_RE.match(a1) + return (int(m.group(2)), col_letter_to_number(m.group(1))) + +def test_chunk_range_covers_all_rendered_cells(tmp_path): + # Use one of the cluster instances or a hand-built fixture: + # the assertion is "for every chunk, the cell coordinates that appear + # in render_text fall inside the claimed range box." + chunks = parse_workbook( + path="data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/61-4/1_61-4_input.xlsx" + ).chunks + for c in chunks: + if not c.top_left_cell or not c.bottom_right_cell: + continue + r0, col0 = _parse_a1(c.top_left_cell) + r1, col1 = _parse_a1(c.bottom_right_cell) + # Pull A1 references that appear in the chunk header (e.g. "[Sheet1!A2:G15]") + # and assert each is inside the claimed range. Adapt to whatever + # renderer markers your current text actually uses. + for ref in re.findall(r"\b([A-Z]+)(\d+)\b", c.render_text or ""): + col = col_letter_to_number(ref[0]) + row = int(ref[1]) + assert r0 <= row <= r1, f"row {row} outside {c.top_left_cell}:{c.bottom_right_cell}" + assert col0 <= col <= col1, f"col {ref[0]} outside {c.top_left_cell}:{c.bottom_right_cell}" +``` + +## Pitfalls + +* The renderer prints A1 refs in its block header (e.g. + `[Sheet1!A2:G15] (table)`) — those are the chunk's CLAIMED range, not + a cell reference. The invariant test should look at rendered cell + contents (table rows), not at the block header. +* Some text rendered into a chunk doesn't have an A1 reference (e.g. + pure values inside a table grid). The invariant has to use the + block's underlying `cells` collection, not regex on text. +* If the merge widens the chunk's range past adjacent empty rows, you + may LOSE geometric overlap on those by clipping. That's fine here — + the cluster is text_hit/geom_miss, so clipping to actual content is + what's needed. + +## Repro + +```bash +python scripts/triage_recall.py tests/benchmarks/reports/retrieval \ + --bucket present_but_ranked_low --examples 0 # to ensure no overlap +# Then manually read enriched_failures.ndjson and pick instance 61-4 +# or 353-29 — both are pure text_hit_geom_miss cases. +``` diff --git a/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md b/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md new file mode 100644 index 0000000..ef7161d --- /dev/null +++ b/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md @@ -0,0 +1,165 @@ +# 03 · Chunk's range covers GT but rendered text lacks the answer values + +**Status:** 🆓 free to claim +**Slice:** B + C + D (`parsers/cell_parser.py`, `parsers/sheet_parser.py`, `rendering/text_renderer.py`) +**Independent of:** 00, 02, 04 — partial overlap with 01 on `text_renderer.py`; coordinate via README. + +## What it looks like + +Up to ten actionable instances on the 200-sample (`CF_3712`, `57033`, +`53-12`, `55794`, `45063`, `36842`, `384-4`, `262-17`, `189-9`, `48799`) +where: + +* `rank_of_first_overlap ≤ 5` — chunk's claimed range overlaps GT. +* `rank_of_text_match is None` — but the chunk's rendered text does NOT + contain the answer values. + +The parser knows the block exists (range is right) but it dropped or +mis-rendered the cells whose values should appear in the chunk text. + +**Two root causes nested inside this cluster — confirm by inspection:** + +### Sub-cluster 3a: cell-drop within block + +The block's `cells` collection is missing the answer cell, or the +renderer's grid-construction loop skips it (e.g. when the cell sits in +a row otherwise mostly empty, and a sparsity filter culls it). + +Instance `CF_3712`: chunk on `Purchases` covers M3:M5 geometrically; +the answer header value (`gt_cell_raw = "Product "`) doesn't render. + +### Sub-cluster 3b: uncached formula renders as `=A1+B1`, not the value + +Cell stores a formula; the workbook was saved without computing it +(common for LibreOffice / programmatic generation). `data_only` load +returns `None` for that cell. The renderer falls back to the formula +source. Retrieval for the *value* (e.g. "1272") fails because the chunk +text contains "=A1+B1" instead. + +Instance `48799` (`Sheet2!Z2` — a 24-deep nested IF/VLOOKUP). Same +shape: formula text in chunk, value missing. + +> ⚠️ **Some of the named instances may be unscorable** — e.g. `189-9` +> and `53-12` have answer cells that are all `None` in `answer.xlsx`, +> meaning the benchmark expects the system to write nothing. In that +> case `answer_cell_values=[]` and `rank_text=None` is automatic, +> regardless of what the chunk renders. The first thing to do in this +> TODO is to filter the ten instances through openpyxl to find the +> truly unfixable ones; report that as part of the PR. + +## Diagnostic signature + +```bash +jq -c 'select(.bucket_combined == "text_miss_geom_hit" + and (.flags | contains(["instruction_requires_execution"]) | not) + and .n_chunks_on_gt_sheet >= 1)' \ + tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson +``` + +Then run for each candidate: + +```python +from openpyxl import load_workbook +wb = load_workbook(input_path, data_only=True) +ws = wb[answer_sheet] +# read answer.xlsx cells at answer_position — if all None, drop this instance. +``` + +Instances whose `answer.xlsx` GT cells are all `None` move to TODO 05 +(out-of-scope; benchmark-scoring issue). + +## File scope + +You may touch: + +* `src/ks_xlsx_parser/parsers/cell_parser.py` — formula vs cached-value + selection. Add a code path: if cell has a formula AND no cached + value in the `data_only` pass, mark as `formula_uncached` and emit + both the formula text AND any reachable proxy (e.g. blank cell with + formula reference, instead of literal formula in the grid). +* `src/ks_xlsx_parser/parsers/sheet_parser.py` — make sure cells with + string values that look like formulas (`startswith('=')`) get + classified as formula cells. +* `src/ks_xlsx_parser/rendering/text_renderer.py::_cell_render_value` + — for formula cells, prefer cached value; if cached is None, render + the formula source ONLY (not `None`, not the formula display). + +Do NOT add a formula evaluator. That's a separate, larger effort. + +## Acceptance criteria + +1. After culling unscorable instances (those with all-None + `answer.xlsx` cells), at least **5 of the remaining instances** flip + from `text_miss_geom_hit` to `both_hit` (rank_of_text_match ≤ 5). +2. `recall_text@5` on the 200-sample seed=1337 rises by ≥ 2 pp. +3. The unscorable instances are documented in the PR description with + a one-line "this instance has no answer values in answer.xlsx, + moving to TODO 05". +4. A new test fixture in `tests/conftest.py` (a workbook with a + `=SUM(A1:A3)` cell saved WITHOUT cached values) plus a regression + test that asserts the chunk render_text contains `=SUM(A1:A3)` + verbatim (not the openpyxl repr of the formula object). + +## Failing test sketch + +```python +# tests/test_formula_rendering_unfilled.py +import openpyxl +from ks_xlsx_parser.pipeline import parse_workbook + +def test_uncached_formula_renders_formula_source(tmp_path): + p = tmp_path / "uncached.xlsx" + wb = openpyxl.Workbook() + ws = wb.active + ws["A1"] = 10 + ws["A2"] = 20 + ws["A3"] = 30 + ws["B1"] = "=SUM(A1:A3)" + wb.save(p) # NO data_only refresh — B1's cached value remains None. + + chunks = parse_workbook(path=str(p)).chunks + assert chunks, "no chunks" + text = chunks[0].render_text + # Whichever is acceptable: the formula source verbatim, OR the + # computed value via the formula engine. NOT a None/empty cell. + assert ("=SUM(A1:A3)" in text) or ("60" in text), ( + "uncached formula cell dropped from render_text:\n" + text + ) +``` + +## Pitfalls + +* `_cell_render_value` already has a branch for booleans, dates, + integer-valued floats. Don't break that path while adding the formula + fallback. +* If you add a formula-source string to render_text, make sure the + retrieval text-match logic in `eval_retrieval.py::_normalize_value_for_match` + still works on numeric values that DO have cached results — adding + formula sources should AUGMENT, not replace, the cached-value path. +* Some "cell drop" cases are actually merged-cell handling: a value is + on the top-left of a merged region but the renderer surfaces only an + empty placeholder for the merged cells. Distinct from this cluster — + if you find one, file it as a new TODO. +* Don't try to evaluate formulas — Excel's calc engine is a swamp and + `formula/formula_parser.py` only parses, doesn't compute. The + acceptable fallback is "render the formula source so an LLM can read + it and reason." + +## Repro + +```bash +# Inspect one candidate end-to-end. +python << 'EOF' +from openpyxl import load_workbook +from ks_xlsx_parser.pipeline import parse_workbook +p = "data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/CF_3712/1_CF_3712_input.xlsx" +wb = load_workbook(p, data_only=True) +print("M3:M5 on", wb.sheetnames[0]) +for r in range(3,6): + print(" ", wb.active.cell(row=r,column=13).value) +r = parse_workbook(path=p) +for c in r.chunks: + print(c.sheet_name, c.top_left_cell, c.bottom_right_cell) + print((c.render_text or "")[:600]) +EOF +``` diff --git a/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md b/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md new file mode 100644 index 0000000..97cd879 --- /dev/null +++ b/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md @@ -0,0 +1,138 @@ +# 04 · Whole-sheet single chunk dilutes the embedding + +**Status:** 🆓 free to claim +**Slice:** F (`chunking/chunker.py`, `chunking/segmenter.py`, `analysis/light_block_detector.py`, `analysis/table_grouper.py`) +**Independent of:** 00, 01, 02, 03 + +## What it looks like + +Two clear instances on the 200-sample (`184-6`, `374-9`); the pattern +extends to many more on the full corpus where a sheet contains a +single large data table with a few headers and the chunker emits ONE +chunk covering A1:end. Note: TODO 02 also has instances where the +single-chunk-per-sheet pattern shows up (`53-12`, `189-9`, +`334-11`, `353-29`, `382-29`, `462-45`, `495-31`, `CF_3712`). Treat +those as **second-order beneficiaries**: when 02 lands first, the +chunk's range tightens and may already split the over-broad chunk. +When 04 lands first, the chunks get smaller and 02's range invariant +becomes easier to satisfy. Order is reviewer's choice. + +Symptoms: + +* `n_chunks_on_gt_sheet == 1` and the chunk spans most of the sheet. +* The answer values are present in the chunk's render_text BUT rank + >5 (or rank=1 by luck — at recall@1 the model picks based on the + bag-of-everything embedding). The recall@1 vs recall@5 gap of 12.4 pp + in the v0.2.0 numbers is consistent with this. +* `summary.md` shows `table_fragmentation_rate ≈ 0` on these sheets + (which sounds good but means we're under-splitting, not perfectly + segmenting). + +This is the **chunk-granularity** problem. The block-detection pipeline +fuses everything into one block; the chunker has no token cap and +emits the block verbatim. + +## Diagnostic signature + +```bash +jq -c 'select(.n_chunks_on_gt_sheet == 1 and .n_chunks_total <= 2 + and (.flags | contains(["instruction_requires_execution"]) | not))' \ + tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson +``` + +Also: dump the per-chunk `token_count` (already computed but unused in +chunker.py). Any chunk over 800 tokens is over-broad on this corpus. + +## File scope + +You may touch: + +* `src/ks_xlsx_parser/chunking/chunker.py` — introduce a chunk-size + cap. Suggested target: ~512 tokens. When a block exceeds, split the + block into row-groups (preserving headers in each group), with each + group becoming its own chunk. `prev_chunk_id` / `next_chunk_id` is + already wired for navigation. +* `src/ks_xlsx_parser/chunking/segmenter.py` — segmenter logic if any + granularity decisions live there. +* `src/ks_xlsx_parser/analysis/light_block_detector.py` and + `analysis/table_grouper.py` — investigate whether *over*-merging + upstream is causing the one-block-per-sheet symptom. If the table + grouper is collapsing 3 logical tables into 1, the upstream fix is + better than splitting at chunk-emit time. + +Do NOT touch the ranking/embedding step (`scripts/eval_retrieval.py`) +or any retrieval logic. + +## Acceptance criteria + +1. After landing, no chunk on the 200-sample seed=1337 run exceeds + ~800 tokens (= ~3200 characters) by default. Allow override. +2. `table_fragmentation_rate` may rise (this is desired up to ~0.5 — + purposeful row-group splits). It must NOT cause a regression in + geometric@5 — each row-group chunk's claimed range must cover only + its rows. +3. `recall_text@5` rises by ≥ 3 pp on the 200-sample. +4. `recall_text@1` rises by ≥ 4 pp (smaller chunks = sharper embeddings + = top-1 sharpens). +5. The two named instances (`184-6`, `374-9`) flip to `both_hit`. Note + that `184-6` is also subject to TODO 00's harness fix — order may + matter. + +## Failing test sketch + +```python +# tests/test_chunk_size_cap.py +from ks_xlsx_parser.pipeline import parse_workbook + +def test_no_chunk_exceeds_token_budget(tmp_path): + # Generate a workbook with a 1000-row table. Without a cap the + # chunker emits one giant chunk; with the cap it should emit several. + import openpyxl + p = tmp_path / "big.xlsx" + wb = openpyxl.Workbook() + ws = wb.active + ws.append(["id", "name", "value"]) + for i in range(1, 1001): + ws.append([i, f"name-{i}", i * 7]) + wb.save(p) + + chunks = parse_workbook(path=str(p)).chunks + assert len(chunks) >= 2, "1000-row table should produce ≥2 chunks after the cap lands" + # Each chunk's claimed range should be a contiguous row block on the same sheet. + for c in chunks: + assert c.sheet_name and c.top_left_cell and c.bottom_right_cell + # No chunk should be larger than the budget (in render-text characters). + for c in chunks: + assert len(c.render_text or "") <= 4000, f"chunk too big: {len(c.render_text)} chars" +``` + +## Pitfalls + +* **Don't strand headers.** The current chunker preserves table headers + in render_text. When splitting a table into row groups, every group + needs the header rows OR a header-summary block so an LLM can + interpret column meanings. +* **Range bookkeeping** — when you split, the child chunks' A1 ranges + MUST be tight to their row subset. Otherwise you create + geometric_no_overlap failures (cluster 02 in reverse). +* **`token_count` is character-based** (`len(text) // 4`). Use the + same metric for the cap to avoid a circular dependency on a tokenizer. +* **Don't split horizontally** (column subsets). Excel tables are + usually wider than embeddings need but column-split chunks have no + natural boundary and reading order breaks. + +## Coordinating with TODO 02 + +If both 02 and 04 are in flight: +- 02 is *range-tightening* (semantically correct ranges). +- 04 is *granularity* (more chunks per sheet). +- The risk is they fight over `_block_to_chunk`. Decide which one + emits the final A1 range and document the contract. Suggested: + 04 splits blocks into sub-blocks WITH tight ranges, then 02's + invariant test passes for the sub-blocks. + +## Repro + +Use a synthetic 1000-row workbook (see test sketch) — much faster to +iterate on than the corpus instances. Once your local fixture passes, +run the 200-sample to confirm corpus-wide impact. diff --git a/docs/planning/recall-90/05-out-of-scope-execution-instances.md b/docs/planning/recall-90/05-out-of-scope-execution-instances.md new file mode 100644 index 0000000..0133d3a --- /dev/null +++ b/docs/planning/recall-90/05-out-of-scope-execution-instances.md @@ -0,0 +1,136 @@ +# 05 · Out-of-scope: benchmark instances that require execution, not retrieval + +**Status:** 🆓 free to claim (scoring/harness work — **NOT a parser fix**) +**Slice:** A (`scripts/eval_retrieval.py`, `scripts/enrich_failures.py`) +**Independent of:** 01, 02, 03, 04 — partial overlap with 00 on +`eval_retrieval.py`; coordinate via README. + +## Why this exists + +127 of 200 instances in the seed=1337 sample (≈ 63%) have +`instruction_requires_execution = True` — the `answer_position` in the +INPUT spreadsheet is empty because the question literally asks the +system to *compute and write* a value there. + +Example instructions from this class: + +* "Round the value 367.5 to the nearest $5 and put the result in C2." +* "Sum each row of B:F and fill column G." +* "Fix the broken formulas in D2:D613." + +A retrieval-grade parser cannot satisfy these instances. The answer +doesn't exist in `input.xlsx` to be retrieved. Including these +instances in the headline `recall_text@5` number drags it down ~25 +points without representing any parser quality signal. + +**This TODO is not a parser improvement.** It's a benchmark-scoring +correction so that subsequent parser work has a clean signal. + +## What the headline numbers actually mean today + +On the 200-sample seed=1337: + +| | All 200 | In-scope (73) | +|-------------------------|---------|---------------| +| Text recall@5 | 0.635 | 0.59 | +| Geometric recall@5 | 0.310 | (compute it) | +| `instruction_requires_execution` | 127 of 200 | 0 (excluded) | + +The "in-scope" column is the metric the planning roadmap targets. Today +it's hidden inside the eval output; this TODO surfaces it as a +first-class metric. + +## What to ship + +1. **A scoring filter switch** in `scripts/eval_retrieval.py`. + - New optional `--exclude-execution-instances` flag (or compute both + filtered and unfiltered metrics every run). + - Definition of "execution instance": at run time, peek at + `input.xlsx[answer_sheet][answer_position]`. If the union of + non-empty cells in that range is empty, mark the instance as + `instruction_requires_execution` and report it in a separate + bucket. **Do this once per instance, not per parser** — the + classification is parser-independent. +2. **Two recall numbers** in `summary.json` / `summary.md`: + - `recall_text@5_all` (current behaviour — count every instance). + - `recall_text@5_in_scope` (denominator excludes + execution instances). +3. **History tracking**: include both numbers in + `scripts/append_bench_history.py` so the trend chart shows the + in-scope number — that's the target for the 0.90 goal. +4. **README + recall-investigation.md update** so the in-scope number + is named and called out as the gate. The 0.90 target is on + *in-scope* recall, not all-instances recall. +5. **Out-of-scope summary file** alongside the run reports: + `tests/benchmarks/reports/retrieval//out_of_scope.txt` listing + the instance IDs that were filtered, so the filter's behaviour is + auditable. If somebody disagrees with the classification, they can + diff the list. + +## File scope + +You may touch: + +* `scripts/eval_retrieval.py` — add classification step BEFORE + `score_instance` (so unfiltered metric still works). +* `scripts/enrich_failures.py` — make `instruction_requires_execution` + the canonical source of truth (it already detects this). +* `scripts/append_bench_history.py` — record both numbers per run. +* `scripts/triage_recall.py` — by default exclude out-of-scope rows. +* `docs/planning/recall-90/README.md` — update target row. +* `docs/recall-investigation.md` — call out the in-scope number. + +Do NOT touch any `src/ks_xlsx_parser/*` code for this cluster. + +## Acceptance criteria + +1. On the same 200-sample seed=1337 run, the new + `recall_text@5_in_scope` number is emitted alongside the existing + one and printed in `summary.md`. +2. The classifier reproducibly identifies ≥ 120 of the 127 currently- + flagged instances (some borderline cases — a single non-empty + header in the answer range — are debatable). +3. `history.jsonl` rows after the change have both fields populated. +4. A unit test under `tests/test_eval_retrieval_classify.py` covers + the classifier with hand-built fixtures: empty range → out-of-scope, + range with one value → in-scope, range with only string headers + (no numeric data) → in-scope, range with formula cells that have + cached values → in-scope. + +## Why this isn't TODO 00 + +* TODO 00 fixes the harness's *spec parser* — wrong sheet names get + resolved. +* TODO 05 fixes the harness's *scoring* — instances the parser can't + affect get a separate bucket. + +Both touch `eval_retrieval.py` but in different functions. If they're +worked simultaneously, coordinate over `parse_position_spec` — +TODO 00 owns the regex; TODO 05 owns the classifier. + +## Pitfalls + +* Don't over-exclude. An instance where `answer_position` covers + `A2:G15` and only `A1` is non-empty (a single header row above the + empty area) is still execution-required, but use the right cell + range — read the actual `answer_position`, not `data_position`. +* The classifier needs the openpyxl read, which costs ~50ms per + instance. Cache it or do it once at script start, not inside the + ranking loop. +* Don't delete the unfiltered metric. Both numbers are useful: the + unfiltered one tells you how a downstream consumer experiences + ks-xlsx-parser on real spreadsheet questions (most of which DO + require execution); the in-scope one tells you whether the parser + is doing its job. + +## Sample of what gets filtered + +Random sampling of `instruction_requires_execution = True` instances: + +| Instance | Instruction (first 60 chars) | Answer range | +|---|---|---| +| ... | "Calculate the sum of column B for products where..." | `Sheet1!E2:E10` | +| ... | "For each row in column A, look up the corresponding..."| `Sheet1!C2:C50` | +| ... | "Pivot the data in A1:D100 by month..." | `Sheet2!A1:E13` | + +Run the enricher to see the full list. diff --git a/docs/planning/recall-90/README.md b/docs/planning/recall-90/README.md new file mode 100644 index 0000000..4f50b46 --- /dev/null +++ b/docs/planning/recall-90/README.md @@ -0,0 +1,137 @@ +# Roadmap: ks-xlsx-parser retrieval recall → 0.90 + +This directory is the worklist for getting SpreadsheetBench recall on the +canonical ks-xlsx-parser benchmark from the v0.2.0 baseline up to: + +| Metric | v0.2.0 (912) | 200-sample baseline | Target | Stretch | +|------------------------|--------------|---------------------|--------|---------| +| Text recall@5 | 0.704 | 0.635 | ≥ 0.90 | 0.95 | +| Geometric recall@5 | 0.369 | 0.310 | ≥ 0.70 | 0.85 | +| In-scope text recall@5 | unmeasured | **0.59** | ≥ 0.90 | — | + +Each `NN-*.md` in this folder is one **independent, parallel-safe TODO**. +Pick one, claim it (table below), execute, hand off the next agent. + +## Read this first — the headline number is misleading + +The benchmark dataset has 912 instances. **~63% of them require the +system to *compute and write* the answer** (e.g. "fill column C with +ROUND(B*5)") — the answer literally doesn't exist in the input +spreadsheet. A parser cannot retrieve what isn't there. We call this +class `instruction_requires_execution` and treat them as **out of scope +for parser work**. See [05-out-of-scope-execution-instances.md](./05-out-of-scope-execution-instances.md). + +When the scoring is filtered to in-scope instances only, the real +ks-xlsx-parser recall is closer to **0.59**, and improving it past +0.90 means closing ~22 named failures out of ~30 actionable cases on the +200-instance seed=1337 sample. That's the actual work. + +## Worklist + +Status legend: 🆓 free to claim · 🔵 in progress · ✅ landed · ⛔ blocked + +| # | File | Cluster | Instances | Primary slice | Status | +|---|---|---|---|---|---| +| 00 | [00-benchmark-spec-parser-bugs.md](./00-benchmark-spec-parser-bugs.md) | Benchmark harness fails to parse malformed `data_position`/`answer_position` (`Dashboard'!B8`, fullwidth `G12:J15`) — 8 instances scored as wrong even when the parser was correct. | 8 | `scripts/eval_retrieval.py` | 🆓 | +| 01 | [01-array-formula-rendering.md](./01-array-formula-rendering.md) | Array-formula cells surface as `` — value never lands in `render_text`. | 2 | `parsers/cell_parser.py`, `rendering/text_renderer.py` | 🆓 | +| 02 | [02-chunk-range-vs-text-mismatch.md](./02-chunk-range-vs-text-mismatch.md) | `text_hit_geom_miss` with chunks on the GT sheet — answer text is in *some* chunk's text but no chunk's claimed A1 range overlaps GT. Range bookkeeping drift during block merge/split. | 7 | `annotation/block_splitter.py`, `analysis/pattern_splitter.py`, `chunking/chunker.py` | 🆓 | +| 03 | [03-cell-drop-or-uncached-formula.md](./03-cell-drop-or-uncached-formula.md) | Chunk's range covers GT geometrically but its rendered text lacks the answer values — either cells dropped in render, or formula cells with no cached value rendered as the formula source. | up to 10 (some unscorable) | `parsers/cell_parser.py`, `parsers/sheet_parser.py`, `rendering/text_renderer.py` | 🆓 | +| 04 | [04-single-chunk-multi-region-sheet.md](./04-single-chunk-multi-region-sheet.md) | Sheets containing distinct logical regions emerge as one giant chunk; the answer text dilutes against the rest of the sheet at embedding time. | 2 (+ likely more at full corpus) | `chunking/chunker.py`, `analysis/light_block_detector.py`, `analysis/table_grouper.py` | 🆓 | +| 05 | [05-out-of-scope-execution-instances.md](./05-out-of-scope-execution-instances.md) | Informational + scoring filter. **Not a parser fix.** Surfaces benchmark instances where the parser fundamentally can't help; suggests filtering them from headline recall. | 127 / 200 | `scripts/eval_retrieval.py` (scoring), `scripts/enrich_failures.py` | 🆓 | + +## How parallelism works here + +Each TODO is scoped to a different file slice so agents won't collide: + +``` +slice A benchmark harness 00, 05 +slice B parsers/cell_parser 01, 03 +slice C parsers/sheet_parser 03 +slice D rendering/text_renderer 01, 03 +slice E annotation/ + analysis/ 02 +slice F chunking/ 04 +``` + +TODOs 00 and 05 touch the same file (`eval_retrieval.py`) — **serialize** +those two. Everything else can run truly in parallel. + +If you discover a cluster mid-flight that doesn't fit anywhere here, +add a new `NN-*.md` and append a row to the table above. Do NOT bury +findings inside an existing TODO. + +## Working a TODO + +Each cluster file has the same structure: + +1. **What the cluster looks like** — a real failure example in full. +2. **Repro instance IDs** — the 200-sample seed=1337 instance IDs that + match this cluster. +3. **Diagnostic signature** — exact `enriched_failures.ndjson` columns + that identify a member. +4. **File scope** — what you can change; what you can't. +5. **Acceptance criteria** — measurable on the 200-sample seed=1337 + rerun. "The N instances listed in §2 all flip from miss → hit on + `text_match_rank` AND no other previously-passing instance regresses." +6. **Failing test sketch** — one-line idea for a regression test. + +What's **deliberately missing** from each file: the proposed fix and an +effort estimate. The agent that picks up the task does that diagnosis; +the doc just specifies success. + +### The loop, end-to-end + +```bash +# 1. Claim the task (edit README, change 🆓 → 🔵 with your name). +# 2. Read the cluster file + the named example instances. +# 3. Reproduce locally: +python scripts/eval_retrieval.py \ + --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \ + --parsers ks --sample 200 --seed 1337 --emit-failures +python scripts/enrich_failures.py tests/benchmarks/reports/retrieval +python scripts/triage_recall.py tests/benchmarks/reports/retrieval --examples 5 + +# 4. Write a regression test that fails today (see test sketch in cluster file). +# 5. Fix until the test passes. +# 6. Re-run the eval + enrichment. Confirm the cluster count dropped, no regressions. +# 7. Append to history; the delta should be visible: +python scripts/append_bench_history.py + +# 8. PR the change. Title: `recall(NN): — N→M`. +# PR body links to the cluster file and pastes the before/after histogram. +# 9. Update README table: 🔵 → ✅ with PR link. +``` + +## Why the 200-sample seed=1337 is the gate, not the full 912 + +912 takes ~40 min on a laptop. PR validation needs to be fast. The +seed=1337 200-sample is deterministic and contains roughly proportional +representation of every cluster. Acceptance criteria are stated against +it. **Cross-check with the full corpus before declaring a release**: + +```bash +python scripts/eval_retrieval.py --parsers ks --emit-failures +``` + +The weekly benchmark workflow (`.github/workflows/benchmark.yml`) runs +the full 912 on schedule + workflow_dispatch. + +## Tracking progress + +`tests/benchmarks/reports/history.jsonl` is appended to on each +`make bench-track` run, one row per commit. Tail it to see the trend: + +```bash +tail -10 tests/benchmarks/reports/history.jsonl +``` + +Each row carries the failure_buckets histogram, so per-cluster progress +is in the data, not just the headline number. + +## See also + +- `docs/recall-investigation.md` — diagnosis framework + hypotheses + (H1 chunk-size, H2 formula-rendering, H3 range-bookkeeping) +- `docs/benchmark-local-setup.md` — install → corpus → eval loop +- `scripts/triage_recall.py` — bucket histogram + exemplar dump +- `scripts/enrich_failures.py` — per-failure diagnostic columns +- `.claude/skills/recall-failure-triage/SKILL.md` — agent guide diff --git a/docs/recall-investigation.md b/docs/recall-investigation.md index d3576c8..ee630fd 100644 --- a/docs/recall-investigation.md +++ b/docs/recall-investigation.md @@ -32,7 +32,38 @@ fixing the biggest one. | `geometric_no_overlap` | No chunk's A1 range overlaps ground truth. Range bookkeeping drifts during merge/split. | `annotation/block_splitter.py`, `analysis/pattern_splitter.py` | | `no_chunks` / `parse_error` | Upstream parser failure. | The parse exception — fix the crash. | -## A priori hypotheses (to be confirmed by the next benchmark run) +## First-run findings (200-sample seed=1337, May 2026) + +Headline numbers refuted my **H1** ("present_but_ranked_low dominates") +and surfaced something the original analysis missed entirely. Of 157 +failures in the seed=1337 sample: + +* **127 (81%) are `instruction_requires_execution`** — the benchmark + is asking the system to *compute and write* the answer. The + `answer_position` cell range in `input.xlsx` is empty by design. A + parser cannot retrieve what isn't there. These instances need to be + filtered from the headline metric (TODO 05). +* **30 are truly actionable** parser failures, clustering into 4 named + buckets (TODOs 00, 01, 02, 03, 04 in + [docs/planning/recall-90/](./planning/recall-90/)). +* Zero of the 30 are `present_but_ranked_low`. Chunk-size dilution + may matter on the full 912-corpus but is NOT the dominant problem + on the 200-sample. + +The dominant **citation-grade killer** turned out to be +`text_hit_geom_miss` (84 of 200, mostly inside the 127 execution +instances; 9 are actionable parser bugs). The chunk text contains +the answer string, but the chunk's claimed A1 range doesn't overlap +ground truth — exactly **H3** ("range-bookkeeping drift"). + +When the in-scope filter is applied, real `recall_text@5` is +**0.59**, not 0.635. Closing the 30 actionable failures gets us +near 0.90 on the in-scope metric. + +See [docs/planning/recall-90/](./planning/recall-90/) for the +ranked TODO list. + +## A priori hypotheses (now confirmed/refuted; left as history) ### H1 — `present_but_ranked_low` is the biggest bucket diff --git a/scripts/enrich_failures.py b/scripts/enrich_failures.py new file mode 100755 index 0000000..a9e4452 --- /dev/null +++ b/scripts/enrich_failures.py @@ -0,0 +1,348 @@ +#!/usr/bin/env python3 +"""Enrich a benchmark run with per-instance diagnostics for failure clustering. + +`eval_retrieval.py --emit-failures` only sees *text-match* misses. For +citation-grade scoring (geometric recall@5 = 0.283) we also need to +classify instances where the answer text is in some chunk but no chunk's +A1 range covers the ground truth — those don't show up in failures.ndjson +at all. + +This script re-parses each instance's input.xlsx with both ks-xlsx-parser +and openpyxl, then emits one row per FAILED instance (text-miss OR +geometric-miss) with diagnostic columns chosen so post-hoc clustering is +easy: + + instance_id question id + bucket_combined both_miss / text_hit_geom_miss / text_miss_geom_hit + answer_position the GT spec from dataset.json + gt_sheet ground-truth sheet name (default: answer_sheet) + gt_cell_raw openpyxl raw value at the first GT cell + gt_cell_formula formula string if any + gt_cell_data_only cached computed value if any + gt_in_workbook_sheets is gt_sheet in wb.sheetnames? + gt_in_chunked_sheets did the parser produce any chunk on gt_sheet? + n_workbook_sheets total sheets in the workbook (incl hidden) + n_chunked_sheets distinct sheets we emitted chunks for + n_workbook_cells_in_gt non-empty openpyxl cells in the GT range + chunks_on_gt_sheet how many chunks we emitted for gt_sheet + chunk_bbox_on_gt_sheet bbox (min_r,min_c,max_r,max_c) over all chunks on gt_sheet + gt_range_bbox GT range as (r0,c0,r1,c1) + text_match_rank rank of first chunk whose text contains a GT value + geom_match_rank rank of first chunk whose A1 range overlaps GT + +Usage: + python scripts/enrich_failures.py +""" +from __future__ import annotations + +import argparse +import json +import sys +from pathlib import Path +from typing import Any + +# Hoist eval_retrieval helpers so we share the *exact* normalization +# logic — clustering by "answer present in chunk" must be apples-to-apples. +sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) +from scripts.eval_retrieval import ( # noqa: E402 + _matches_chunk_text, + parse_a1, + parse_position_spec, + parse_range, +) + +REPO_ROOT = Path(__file__).resolve().parent.parent + + +def find_run(arg: Path) -> Path: + if arg.is_file(): + return arg.parent + if arg.is_dir(): + if (arg / "results.ndjson").exists(): + return arg + runs = sorted(p for p in arg.glob("*/results.ndjson")) + if runs: + return runs[-1].parent + sys.exit(f"ERROR: no results.ndjson at {arg}") + + +def overlaps(box1: tuple[int, int, int, int], box2: tuple[int, int, int, int]) -> bool: + r0a, c0a, r1a, c1a = box1 + r0b, c0b, r1b, c1b = box2 + return not (r1a < r0b or r0a > r1b or c1a < c0b or c0a > c1b) + + +def chunk_bbox(chunks) -> tuple[int, int, int, int] | None: + boxes = [] + for c in chunks: + tl = parse_a1(c.top_left_cell) if c.top_left_cell else None + br = parse_a1(c.bottom_right_cell) if c.bottom_right_cell else None + if tl and br: + boxes.append((tl[0], tl[1], br[0], br[1])) + if not boxes: + return None + return ( + min(b[0] for b in boxes), + min(b[1] for b in boxes), + max(b[2] for b in boxes), + max(b[3] for b in boxes), + ) + + +def enrich(run_dir: Path, corpus: Path, out_path: Path) -> None: + from openpyxl import load_workbook + + from ks_xlsx_parser.pipeline import parse_workbook + + # Load dataset.json once — we need question text + the original + # answer_sheet attribution for instances where the rank scoring + # already considers the spec parse-resolved. + dataset = {str(d["id"]): d for d in + json.loads((corpus / "dataset.json").read_text())} + + results_path = run_dir / "results.ndjson" + rows: list[dict[str, Any]] = [] + for line in results_path.read_text().splitlines(): + if line.strip(): + rows.append(json.loads(line)) + + out_rows: list[dict[str, Any]] = [] + n_failed = 0 + for rec in rows: + if rec.get("error"): + continue + text_rank = rec.get("rank_of_text_match") + geom_rank = rec.get("rank_of_first_overlap") + text_hit = text_rank is not None and text_rank <= 5 + geom_hit = geom_rank is not None and geom_rank <= 5 + if text_hit and geom_hit: + continue # not a failure either way + n_failed += 1 + + inst_id = rec["instance_id"] + meta = dataset.get(inst_id, {}) + instruction = meta.get("instruction", "") + answer_sheet = meta.get("answer_sheet") or None + if answer_sheet and "," in answer_sheet: + answer_sheet = answer_sheet.split(",")[0].strip() + answer_position = rec.get("answer_position") or meta.get("answer_position") or "" + data_position = meta.get("data_position") or "" + + # Re-parse input.xlsx — cheap (50 ms median). + inst_dir = corpus / "spreadsheet" / inst_id + input_path = inst_dir / f"1_{inst_id}_input.xlsx" + if not input_path.exists(): + continue + + try: + result = parse_workbook(path=str(input_path)) + chunks = list(result.chunks) + except Exception as exc: + out_rows.append({ + "instance_id": inst_id, + "bucket_combined": "parse_error", + "error": f"{type(exc).__name__}: {exc}", + }) + continue + + chunked_sheets = sorted({c.sheet_name for c in chunks if c.sheet_name}) + + # openpyxl view — both formula and data_only passes so we can + # tell "formula uncached" from "cell genuinely empty". + try: + wb_f = load_workbook(str(input_path), data_only=False, read_only=False) + wb_d = load_workbook(str(input_path), data_only=True, read_only=False) + wb_sheets = list(wb_f.sheetnames) + hidden_sheets = [s for s in wb_sheets + if getattr(wb_f[s], "sheet_state", "visible") != "visible"] + except Exception as exc: + wb_sheets = [] + hidden_sheets = [] + wb_f = wb_d = None + + # Geometric overlap is scored against data_position (the input data + # region the question is asking about); falls back to answer_position + # when the dataset didn't fill data_position in (561 of 912 instances). + # This mirrors eval_retrieval.py's geom_spec = data_pos or answer_pos. + geom_spec = data_position or answer_position + regions = parse_position_spec(geom_spec, answer_sheet) + gt_sheet = regions[0][0] if regions else answer_sheet + gt_range_bbox = regions[0][1] if regions else None + + # Separately track whether the *answer* region is empty in input.xlsx. + # If so, the question is "compute X and write here" — the parser + # cannot possibly contain the answer text, so this is a benchmark + # construct, not a parser bug. We flag it as instruction_requires_execution. + answer_regions = parse_position_spec(answer_position, answer_sheet) + answer_sheet_resolved = (answer_regions[0][0] + if answer_regions else answer_sheet) + answer_range_bbox = answer_regions[0][1] if answer_regions else None + n_input_cells_in_answer_range = 0 + if (wb_d and answer_sheet_resolved and answer_range_bbox + and answer_sheet_resolved in wb_d.sheetnames): + try: + ws = wb_d[answer_sheet_resolved] + r0, c0, r1, c1 = answer_range_bbox + for row in ws.iter_rows(min_row=r0, max_row=r1, min_col=c0, + max_col=c1, values_only=True): + for v in row: + if v is not None and str(v).strip(): + n_input_cells_in_answer_range += 1 + except Exception: + pass + + gt_cell_raw = None + gt_cell_formula = None + gt_cell_data_only = None + n_workbook_cells_in_gt = 0 + if wb_f and gt_sheet and gt_sheet in wb_f.sheetnames and gt_range_bbox: + ws_f = wb_f[gt_sheet] + ws_d = wb_d[gt_sheet] + r0, c0, r1, c1 = gt_range_bbox + # First cell only — enough to know "formula vs. value". + try: + tl_cell_f = ws_f.cell(row=r0, column=c0) + tl_cell_d = ws_d.cell(row=r0, column=c0) + gt_cell_raw = tl_cell_f.value + if isinstance(gt_cell_raw, str) and gt_cell_raw.startswith("="): + gt_cell_formula = gt_cell_raw + gt_cell_data_only = tl_cell_d.value + except Exception: + pass + # Count non-empty cells across the range + try: + for row in ws_d.iter_rows(min_row=r0, max_row=r1, min_col=c0, + max_col=c1, values_only=True): + for v in row: + if v is not None and str(v).strip(): + n_workbook_cells_in_gt += 1 + except Exception: + pass + + chunks_on_gt = [c for c in chunks if gt_sheet and c.sheet_name == gt_sheet] + gt_chunk_bbox = chunk_bbox(chunks_on_gt) + + if not text_hit and not geom_hit: + bucket = "both_miss" + elif text_hit and not geom_hit: + bucket = "text_hit_geom_miss" + elif geom_hit and not text_hit: + bucket = "text_miss_geom_hit" + else: + bucket = "other" + + sheet_chunked = (gt_sheet in chunked_sheets) if gt_sheet else None + sheet_in_wb = (gt_sheet in wb_sheets) if (gt_sheet and wb_sheets) else None + sheet_hidden = (gt_sheet in hidden_sheets) if gt_sheet else False + + # Pre-named signal heuristics — informational only; clustering is + # still done by reading. These flags help spot patterns FAST. + flags: list[str] = [] + if sheet_in_wb is False: + flags.append("gt_sheet_missing_from_workbook") + elif sheet_chunked is False: + flags.append("gt_sheet_present_but_not_chunked") + if sheet_hidden: + flags.append("gt_sheet_hidden") + if gt_cell_formula and gt_cell_data_only in (None, ""): + flags.append("gt_cell_uncached_formula") + elif gt_cell_formula: + flags.append("gt_cell_is_formula") + if (gt_range_bbox and gt_chunk_bbox and + not overlaps(gt_range_bbox, gt_chunk_bbox)): + flags.append("gt_range_outside_chunk_bbox") + if (gt_range_bbox and gt_chunk_bbox and + overlaps(gt_range_bbox, gt_chunk_bbox)): + # Inside the chunk bbox but no individual chunk overlaps? + # That's "the parser saw the right area but split it wrong". + any_overlap = False + for c in chunks_on_gt: + tl = parse_a1(c.top_left_cell) if c.top_left_cell else None + br = parse_a1(c.bottom_right_cell) if c.bottom_right_cell else None + if tl and br and overlaps( + gt_range_bbox, (tl[0], tl[1], br[0], br[1])): + any_overlap = True + break + if not any_overlap: + flags.append("gt_inside_bbox_but_no_chunk_overlap") + if (n_workbook_cells_in_gt == 0 and gt_range_bbox): + flags.append("gt_range_empty_in_workbook") + # The big one: if answer_position is empty in input, the benchmark + # is asking the system to WRITE the answer. Not a parser bug. + if (answer_range_bbox and n_input_cells_in_answer_range == 0): + flags.append("instruction_requires_execution") + # cell rendered but truncated to a sub-range? + if (gt_chunk_bbox and gt_range_bbox and + gt_chunk_bbox[2] < gt_range_bbox[2]): + flags.append("chunk_bbox_rows_truncated") + if (gt_chunk_bbox and gt_range_bbox and + gt_chunk_bbox[3] < gt_range_bbox[3]): + flags.append("chunk_bbox_cols_truncated") + + out_rows.append({ + "instance_id": inst_id, + "bucket_combined": bucket, + "instruction": instruction[:200], + "answer_position": answer_position, + "answer_sheet": answer_sheet, + "gt_sheet": gt_sheet, + "gt_range_bbox": list(gt_range_bbox) if gt_range_bbox else None, + "gt_cell_raw": str(gt_cell_raw)[:120] if gt_cell_raw is not None else None, + "gt_cell_formula": gt_cell_formula, + "gt_cell_data_only": + str(gt_cell_data_only)[:120] if gt_cell_data_only is not None else None, + "n_workbook_sheets": len(wb_sheets), + "n_chunked_sheets": len(chunked_sheets), + "wb_sheets": wb_sheets, + "hidden_sheets": hidden_sheets, + "chunked_sheets": chunked_sheets, + "n_chunks_total": len(chunks), + "n_chunks_on_gt_sheet": len(chunks_on_gt), + "n_workbook_cells_in_gt": n_workbook_cells_in_gt, + "chunk_bbox_on_gt_sheet": list(gt_chunk_bbox) if gt_chunk_bbox else None, + "rank_of_text_match": text_rank, + "rank_of_first_overlap": geom_rank, + "flags": flags, + "data_position": data_position, + "answer_range_bbox": list(answer_range_bbox) if answer_range_bbox else None, + "n_input_cells_in_answer_range": n_input_cells_in_answer_range, + }) + + out_path.write_text("\n".join(json.dumps(r, separators=(",", ":")) for r in out_rows) + "\n") + print(f"Examined {len(rows)} instances, {n_failed} failed (text OR geom).") + print(f"Wrote {len(out_rows)} enriched rows to {out_path}") + + # Quick histogram for sanity + from collections import Counter + bc = Counter(r["bucket_combined"] for r in out_rows) + fc = Counter() + for r in out_rows: + for f in r.get("flags", []): + fc[f] += 1 + print("\nCombined bucket counts:") + for b, n in bc.most_common(): + print(f" {b:<30s} {n}") + print("\nDiagnostic flags (rows can have multiple):") + for f, n in fc.most_common(): + print(f" {f:<40s} {n}") + + +def main(argv: list[str] | None = None) -> int: + ap = argparse.ArgumentParser(description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter) + ap.add_argument("path", type=Path, + help="run dir, results.ndjson, or parent reports dir") + ap.add_argument("--corpus", type=Path, + default=REPO_ROOT / "data/corpora/spreadsheetbench/all_data_912_v0.1") + ap.add_argument("--out", type=Path, default=None, + help="output path (default: /enriched_failures.ndjson)") + args = ap.parse_args(argv) + + run_dir = find_run(args.path) + out = args.out or (run_dir / "enriched_failures.ndjson") + enrich(run_dir, args.corpus, out) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) From 186d468839cf581926f2bb7ee6aa474e17668e19 Mon Sep 17 00:00:00 2001 From: Arnav Goel Date: Tue, 19 May 2026 18:09:12 -0400 Subject: [PATCH 4/4] chore: gitignore docs/planning/ (internal recall worklist) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The recall→0.90 cluster TODO files were coordination notes for internal agents, not material the open-source community needs. Keep them locally (disk-side) but stop tracking them so they don't leak via the public repo. docs/recall-investigation.md + docs/benchmark-local-setup.md stay — those have value for external contributors reproducing benchmark results. Removed two now-broken cross-references from recall-investigation.md that pointed at the planning dir. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../00-benchmark-spec-parser-bugs.md | 119 ------------- .../recall-90/01-array-formula-rendering.md | 130 -------------- .../02-chunk-range-vs-text-mismatch.md | 150 ---------------- .../03-cell-drop-or-uncached-formula.md | 165 ------------------ .../04-single-chunk-multi-region-sheet.md | 138 --------------- .../05-out-of-scope-execution-instances.md | 136 --------------- docs/planning/recall-90/README.md | 137 --------------- 7 files changed, 975 deletions(-) delete mode 100644 docs/planning/recall-90/00-benchmark-spec-parser-bugs.md delete mode 100644 docs/planning/recall-90/01-array-formula-rendering.md delete mode 100644 docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md delete mode 100644 docs/planning/recall-90/03-cell-drop-or-uncached-formula.md delete mode 100644 docs/planning/recall-90/04-single-chunk-multi-region-sheet.md delete mode 100644 docs/planning/recall-90/05-out-of-scope-execution-instances.md delete mode 100644 docs/planning/recall-90/README.md diff --git a/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md b/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md deleted file mode 100644 index 8d33b7a..0000000 --- a/docs/planning/recall-90/00-benchmark-spec-parser-bugs.md +++ /dev/null @@ -1,119 +0,0 @@ -# 00 · Benchmark harness fails to parse malformed `answer_position` strings - -**Status:** 🆓 free to claim -**Slice:** A (benchmark harness — `scripts/eval_retrieval.py`) -**Independent of:** all other TODOs in this folder - -## What it looks like - -Eight instances in the 200-sample (seed=1337) score as recall@5 misses -even though the parser produced perfectly reasonable chunks. The -benchmark harness's `parse_position_spec()` returns no regions because -the `answer_position` string in `dataset.json` doesn't match -`SHEET_RANGE_RE`. The score code then sees `data_regions=0`, marks the -instance as "no scorable target", and the recall denominator counts it -as a miss. Net effect: the parser is blamed for benchmark-spec parsing -bugs. - -Concrete malformed strings observed: - -| Instance | `answer_position` string | What's wrong | -|---|---|---| -| `50442` | `RESULTS 1'!G17` | unmatched closing quote, opening quote missing | -| `49490` | `Sheet1'!H3:H58` | same | -| `49036` | `Dashboard'!B8` | same | -| `55427` | `Compiled and located schools da'!B2:B1461` | same + long sheet name | -| `48975` | `Output'!B11:B17` | same | -| `37456` | `G12:J15` | fullwidth Chinese colon `:` instead of `:` | -| `184-6` | `'RAWDATA!'A1:P6,'OUTPUT!'A1:P6'` | quote between sheet and `!`, comma-joined | -| `164-22` | `'YEAR1!'A1:G1478,'YEAR2!'A1:G1480'` | same shape as 184-6 | - -(`374-9` may also belong here — investigate.) - -## Diagnostic signature - -In `enriched_failures.ndjson`, members of this cluster have: - -* `gt_sheet == null` (spec parser couldn't extract the sheet name), OR -* `gt_sheet` wrapped in apostrophes (`"'Sheet1'"`), AND -* the workbook *does* contain a sheet with the bare name. - -```bash -python scripts/enrich_failures.py tests/benchmarks/reports/retrieval -jq -c 'select(.gt_sheet == null or (.gt_sheet | startswith("'")))' \ - tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson -``` - -## File scope - -You may touch: - -* `scripts/eval_retrieval.py` — `SHEET_RANGE_RE`, `parse_position_spec`, - `_normalize_value_for_match`. This is where the bugs live. -* `scripts/enrich_failures.py` — must stay consistent with the harness; - re-run after editing. -* Add new tests under `tests/test_eval_retrieval.py` (or similar — file - doesn't exist yet; create it). - -Do **NOT** touch any `src/ks_xlsx_parser/*` code for this cluster. If -the urge to "make the parser tolerate these patterns too" arises, -that's a separate TODO and a category error here — these are dataset -typos, not parser inputs. - -## Acceptance criteria - -On `python scripts/eval_retrieval.py --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 --parsers ks --sample 200 --seed 1337 --emit-failures`: - -1. All 8 instances listed above stop appearing as recall@5 misses - (text or geometric) on the diagnostic ndjson. -2. `geometric@5` rises by ≥ 4 pp (each fixed instance flips from miss - to hit since the chunks already overlap the correct region). -3. No previously-passing instance becomes a miss. Diff the - `results.ndjson` before and after — `pass → fail` count must be 0. -4. `python scripts/enrich_failures.py` shows zero rows with - `gt_sheet == null` from the eight listed instances. - -## Failing test sketch - -```python -# tests/test_eval_retrieval_spec_parser.py -import pytest -from scripts.eval_retrieval import parse_position_spec - -@pytest.mark.parametrize("spec, expected_sheet", [ - ("Dashboard'!B8", "Dashboard"), - ("Sheet1'!H3:H58", "Sheet1"), - ("'RAWDATA!'A1:P6", "RAWDATA"), - ("G12:J15", None), # this one keeps default_sheet; sheet unaltered -]) -def test_unmatched_quotes_resolve_sheet(spec, expected_sheet): - regions = parse_position_spec(spec, default_sheet="DEFAULT") - assert regions, f"no regions for {spec!r}" - if expected_sheet is not None: - assert regions[0][0] == expected_sheet - -def test_fullwidth_colon_in_range(): - # Excel-China dataset entries occasionally use the fullwidth colon. - regions = parse_position_spec("G12:J15", default_sheet="S") - assert regions and regions[0][1] == (12, 7, 15, 10) -``` - -These should fail on `main`. They pass when the fix lands. - -## Pitfalls - -* `'YEAR1!'A1:G1478,'YEAR2!'A1:G1480'` is multi-region. Make sure the - comma-split path still returns BOTH regions after the fix. -* `_normalize_value_for_match` is not involved here — don't refactor it. -* The current regex `SHEET_RANGE_RE` uses a backreference for matching - quotes; tightening the closing quote to "optional, but only if opening - present" is fragile. Consider a two-pass approach: strip stray - apostrophes, then re-attempt the strict regex. - -## Repro fixtures - -The eight instances live under -`data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet//`. -Quote one as a unit-test data point — do NOT vendor the corpus into -the repo. If a snapshot of the malformed strings is wanted, hand-copy -them into a Python fixture (they're ≤ 100 chars each). diff --git a/docs/planning/recall-90/01-array-formula-rendering.md b/docs/planning/recall-90/01-array-formula-rendering.md deleted file mode 100644 index a9edd72..0000000 --- a/docs/planning/recall-90/01-array-formula-rendering.md +++ /dev/null @@ -1,130 +0,0 @@ -# 01 · Array-formula cells render as `` - -**Status:** 🆓 free to claim -**Slice:** B + D (`parsers/cell_parser.py`, `rendering/text_renderer.py`) -**Independent of:** 00, 02, 04 — partially overlapping with 03 on -`text_renderer.py`; coordinate via the README table. - -## What it looks like - -Two instances (`43026`, `59969`) hit this exact pattern. The -ground-truth answer cell contains an Excel **array formula** (an -`{= ...}` formula entered with Ctrl-Shift-Enter that returns multiple -cells). openpyxl returns these as an `ArrayFormula` Python object, NOT -a string and NOT a number. Our cell extractor stuffs the repr of that -object into the cell's value, so the rendered chunk text contains -literal: - -``` - -``` - -This obviously kills both text-match (the answer values are nowhere in -the chunk) and any downstream LLM consumption. - -Instance `43026` (`summary!D10`): - -``` -gt_cell_raw = '' -gt_cell_formula = None ← our heuristic missed it because raw_value isn't a str starting with '=' -gt_cell_data_only = (cached value, if any) -``` - -## Diagnostic signature - -```bash -jq -c 'select(.gt_cell_raw | tostring | contains("ArrayFormula"))' \ - tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson -``` - -Likely a much larger cluster at full-corpus scale — most "tabular -report" workbooks use array formulas. Two on the 200-sample is the floor. - -## File scope - -You may touch: - -* `src/ks_xlsx_parser/parsers/cell_parser.py` — wherever the raw value - is extracted from `openpyxl.cell.Cell.value`. Detect `ArrayFormula`, - pull the formula text out of it (`obj.text` on openpyxl) and treat it - as a formula cell. -* `src/ks_xlsx_parser/rendering/text_renderer.py::_cell_render_value` - — make sure array-formula cells render their *cached value* (from the - `data_only` workbook pass) when one exists; fall back to the formula - expression as a string only if no cached value. -* `src/ks_xlsx_parser/models/cell.py` — add an `is_array_formula: bool` - field if useful for downstream consumers. - -Do NOT add a new evaluator for array formulas — out of scope (and -covered indirectly by the cached-value path). - -## Acceptance criteria - -1. Build a tiny array-formula fixture inside `tests/conftest.py` (the - existing programmatic-fixture pattern). Use `openpyxl`'s - `ArrayFormula` constructor; populate the `data_only` workbook with a - plausible cached value. -2. Add a test that asserts the chunk's `render_text` for that fixture - contains the cached value AND does **not** contain the substring - `ArrayFormula object`. -3. On the 200-sample seed=1337 rerun, instances `43026` and `59969` - move out of the failure set on the text-match metric IF the answer - values exist as cached data in the input (verify with openpyxl - first; if they don't, the instance is `instruction_requires_execution` - and you can't move it — note it in the PR). -4. No previously-passing crossval test regresses (`make test`). - -## Failing test sketch - -```python -# tests/test_array_formula_rendering.py -import openpyxl -from openpyxl.worksheet.formula import ArrayFormula -from ks_xlsx_parser.pipeline import parse_workbook - -def test_array_formula_renders_cached_value(tmp_path): - p = tmp_path / "af.xlsx" - wb = openpyxl.Workbook() - ws = wb.active - ws["A1"] = 10 - ws["A2"] = 20 - ws["A3"] = 30 - # Array formula in B1:B3 summing the row; cache a plausible value. - af = ArrayFormula("B1:B3", "=A1:A3*2") - ws["B1"] = af - # openpyxl writes the cached value separately when data_only=True is read; - # for fixture purposes, write the file once and patch cached values via - # a second openpyxl pass. - wb.save(p) - - chunks = parse_workbook(path=str(p)).chunks - assert chunks, "no chunks produced" - text = chunks[0].render_text - assert "ArrayFormula" not in text, "raw ArrayFormula object leaked" - # Once the fix lands, expect cached values 20 / 40 / 60 to render. -``` - -## Pitfalls - -* openpyxl's behaviour for `ArrayFormula` differs between read-only - and normal mode. Make sure both paths in `workbook_parser.py` agree - on how the cell is surfaced. -* If the input was saved by LibreOffice or generated programmatically - without computing array results, `cached_value` will be `None`. In - that case there's no useful retrieval signal — surfacing the formula - expression itself is acceptable as a fallback, but it must be a - string, not `repr()`. -* "Spilled" dynamic arrays (Excel 365's new `=A1:A3*2` without - Ctrl-Shift-Enter) are a related but distinct case. Note in your PR - if you handled both or only the classic array formula. - -## Repro - -```bash -python -c " -from openpyxl import load_workbook -wb = load_workbook('data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/43026/1_43026_input.xlsx') -ws = wb['summary'] -print(type(ws['D10'].value), repr(ws['D10'].value)) -" -``` diff --git a/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md b/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md deleted file mode 100644 index bd7500b..0000000 --- a/docs/planning/recall-90/02-chunk-range-vs-text-mismatch.md +++ /dev/null @@ -1,150 +0,0 @@ -# 02 · `text_hit_geom_miss` — chunk contains answer text but range doesn't overlap GT - -**Status:** 🆓 free to claim -**Slice:** E (`annotation/block_splitter.py`, `analysis/pattern_splitter.py`, `analysis/light_block_detector.py`) -**Independent of:** 00, 01, 03, 04 - -## What it looks like - -Seven actionable instances on the 200-sample -(`61-4`, `353-29`, `382-29`, `80-42`, `334-11`, `495-31`, `462-45`) -where: - -* `rank_of_text_match ≤ 5` — some chunk's `render_text` contains the - answer value(s). -* `rank_of_first_overlap is None` — but no chunk's claimed A1 range - overlaps the ground-truth range. -* `n_chunks_on_gt_sheet ≥ 1` — we DID emit chunks on the correct sheet. - -So the parser knows what data exists on the sheet (text is there), but -the **range bookkeeping** on the chunk it emitted is wrong: the claimed -`top_left_cell` / `bottom_right_cell` excludes the answer cell even -though the chunk's text includes it. - -This is the **citation-grade killer**. ks-backend's UI will highlight a -region that doesn't actually contain the answer the LLM cited — looks -buggy to the user. - -Example — instance `61-4`: -``` -ans = 'output'!A2:G15 -chunks on 'output': 1 chunk -chunk text contains: the answer string (text_rank=1) -chunk's reported A1 range: NOT A2:G15 -``` - -## Hypothesis (deliberately under-specified) - -Block merge/split paths widen the *text* (by absorbing adjacent cells) -faster than they update `cell_range.bottom_right`, OR a pattern-split -narrows the `cell_range` to exclude the rows the splitter just lifted -into a child block while the rendered text still contains them. Find -the actual code path empirically. - -## Diagnostic signature - -```bash -jq -c 'select(.bucket_combined == "text_hit_geom_miss" - and (.flags | contains(["instruction_requires_execution"]) | not) - and .n_chunks_on_gt_sheet > 0)' \ - tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson -``` - -For each match, the comparison that matters is `gt_range_bbox` vs each -chunk on the GT sheet's `top_left_cell`/`bottom_right_cell`. Print the -chunk's `render_text` and confirm it DOES contain the answer values -that openpyxl reads at `answer_position`. - -## File scope - -You may touch: - -* `src/ks_xlsx_parser/models/block.py` — invariants on `BlockDTO.cell_range`. -* `src/ks_xlsx_parser/annotation/block_splitter.py` -* `src/ks_xlsx_parser/analysis/pattern_splitter.py` -* `src/ks_xlsx_parser/analysis/light_block_detector.py` -* `src/ks_xlsx_parser/chunking/chunker.py::_block_to_chunk` — at chunk - creation time, you can defensively clip `cell_range` to the bounding - box of cells that actually contributed text. That's a safety net even - if the upstream bug isn't located. - -Do NOT touch parsing (`parsers/*`) — the inputs are correct; -downstream block bookkeeping is wrong. - -## Acceptance criteria - -1. Add an invariant test (`tests/test_structural_invariants.py` - pattern): for every chunk, the cells whose values appear in - `render_text` MUST lie within - `[chunk.top_left_cell, chunk.bottom_right_cell]`. Test should fail - on `main` for at least 3 of the 7 cluster instances. -2. After the fix, all 7 instances flip from `text_hit_geom_miss` to - `both_hit` (`rank_of_first_overlap ≤ 5`). -3. `geometric@5` on the 200-sample seed=1337 rises by ≥ 3 pp. -4. `recall_text@5` does NOT drop (tightening ranges shouldn't drop - text — if it does, the chunker is also dropping text in step with - the range clip, which is a separate bug). -5. `table_fragmentation_rate` does NOT rise — if you fixed range drift - by splitting one chunk into many, fragmentation will spike. That - would solve cluster 02 by creating cluster 04 — net zero. Either - keep fragmentation flat or coordinate with whoever owns 04. - -## Failing test sketch - -```python -# tests/test_chunk_range_invariants.py -import re -from openpyxl.utils import column_index_from_string -from ks_xlsx_parser.pipeline import parse_workbook -from ks_xlsx_parser.models.common import col_letter_to_number - -A1_RE = re.compile(r"^([A-Z]+)(\d+)$") - -def _parse_a1(a1): - m = A1_RE.match(a1) - return (int(m.group(2)), col_letter_to_number(m.group(1))) - -def test_chunk_range_covers_all_rendered_cells(tmp_path): - # Use one of the cluster instances or a hand-built fixture: - # the assertion is "for every chunk, the cell coordinates that appear - # in render_text fall inside the claimed range box." - chunks = parse_workbook( - path="data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/61-4/1_61-4_input.xlsx" - ).chunks - for c in chunks: - if not c.top_left_cell or not c.bottom_right_cell: - continue - r0, col0 = _parse_a1(c.top_left_cell) - r1, col1 = _parse_a1(c.bottom_right_cell) - # Pull A1 references that appear in the chunk header (e.g. "[Sheet1!A2:G15]") - # and assert each is inside the claimed range. Adapt to whatever - # renderer markers your current text actually uses. - for ref in re.findall(r"\b([A-Z]+)(\d+)\b", c.render_text or ""): - col = col_letter_to_number(ref[0]) - row = int(ref[1]) - assert r0 <= row <= r1, f"row {row} outside {c.top_left_cell}:{c.bottom_right_cell}" - assert col0 <= col <= col1, f"col {ref[0]} outside {c.top_left_cell}:{c.bottom_right_cell}" -``` - -## Pitfalls - -* The renderer prints A1 refs in its block header (e.g. - `[Sheet1!A2:G15] (table)`) — those are the chunk's CLAIMED range, not - a cell reference. The invariant test should look at rendered cell - contents (table rows), not at the block header. -* Some text rendered into a chunk doesn't have an A1 reference (e.g. - pure values inside a table grid). The invariant has to use the - block's underlying `cells` collection, not regex on text. -* If the merge widens the chunk's range past adjacent empty rows, you - may LOSE geometric overlap on those by clipping. That's fine here — - the cluster is text_hit/geom_miss, so clipping to actual content is - what's needed. - -## Repro - -```bash -python scripts/triage_recall.py tests/benchmarks/reports/retrieval \ - --bucket present_but_ranked_low --examples 0 # to ensure no overlap -# Then manually read enriched_failures.ndjson and pick instance 61-4 -# or 353-29 — both are pure text_hit_geom_miss cases. -``` diff --git a/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md b/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md deleted file mode 100644 index ef7161d..0000000 --- a/docs/planning/recall-90/03-cell-drop-or-uncached-formula.md +++ /dev/null @@ -1,165 +0,0 @@ -# 03 · Chunk's range covers GT but rendered text lacks the answer values - -**Status:** 🆓 free to claim -**Slice:** B + C + D (`parsers/cell_parser.py`, `parsers/sheet_parser.py`, `rendering/text_renderer.py`) -**Independent of:** 00, 02, 04 — partial overlap with 01 on `text_renderer.py`; coordinate via README. - -## What it looks like - -Up to ten actionable instances on the 200-sample (`CF_3712`, `57033`, -`53-12`, `55794`, `45063`, `36842`, `384-4`, `262-17`, `189-9`, `48799`) -where: - -* `rank_of_first_overlap ≤ 5` — chunk's claimed range overlaps GT. -* `rank_of_text_match is None` — but the chunk's rendered text does NOT - contain the answer values. - -The parser knows the block exists (range is right) but it dropped or -mis-rendered the cells whose values should appear in the chunk text. - -**Two root causes nested inside this cluster — confirm by inspection:** - -### Sub-cluster 3a: cell-drop within block - -The block's `cells` collection is missing the answer cell, or the -renderer's grid-construction loop skips it (e.g. when the cell sits in -a row otherwise mostly empty, and a sparsity filter culls it). - -Instance `CF_3712`: chunk on `Purchases` covers M3:M5 geometrically; -the answer header value (`gt_cell_raw = "Product "`) doesn't render. - -### Sub-cluster 3b: uncached formula renders as `=A1+B1`, not the value - -Cell stores a formula; the workbook was saved without computing it -(common for LibreOffice / programmatic generation). `data_only` load -returns `None` for that cell. The renderer falls back to the formula -source. Retrieval for the *value* (e.g. "1272") fails because the chunk -text contains "=A1+B1" instead. - -Instance `48799` (`Sheet2!Z2` — a 24-deep nested IF/VLOOKUP). Same -shape: formula text in chunk, value missing. - -> ⚠️ **Some of the named instances may be unscorable** — e.g. `189-9` -> and `53-12` have answer cells that are all `None` in `answer.xlsx`, -> meaning the benchmark expects the system to write nothing. In that -> case `answer_cell_values=[]` and `rank_text=None` is automatic, -> regardless of what the chunk renders. The first thing to do in this -> TODO is to filter the ten instances through openpyxl to find the -> truly unfixable ones; report that as part of the PR. - -## Diagnostic signature - -```bash -jq -c 'select(.bucket_combined == "text_miss_geom_hit" - and (.flags | contains(["instruction_requires_execution"]) | not) - and .n_chunks_on_gt_sheet >= 1)' \ - tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson -``` - -Then run for each candidate: - -```python -from openpyxl import load_workbook -wb = load_workbook(input_path, data_only=True) -ws = wb[answer_sheet] -# read answer.xlsx cells at answer_position — if all None, drop this instance. -``` - -Instances whose `answer.xlsx` GT cells are all `None` move to TODO 05 -(out-of-scope; benchmark-scoring issue). - -## File scope - -You may touch: - -* `src/ks_xlsx_parser/parsers/cell_parser.py` — formula vs cached-value - selection. Add a code path: if cell has a formula AND no cached - value in the `data_only` pass, mark as `formula_uncached` and emit - both the formula text AND any reachable proxy (e.g. blank cell with - formula reference, instead of literal formula in the grid). -* `src/ks_xlsx_parser/parsers/sheet_parser.py` — make sure cells with - string values that look like formulas (`startswith('=')`) get - classified as formula cells. -* `src/ks_xlsx_parser/rendering/text_renderer.py::_cell_render_value` - — for formula cells, prefer cached value; if cached is None, render - the formula source ONLY (not `None`, not the formula display). - -Do NOT add a formula evaluator. That's a separate, larger effort. - -## Acceptance criteria - -1. After culling unscorable instances (those with all-None - `answer.xlsx` cells), at least **5 of the remaining instances** flip - from `text_miss_geom_hit` to `both_hit` (rank_of_text_match ≤ 5). -2. `recall_text@5` on the 200-sample seed=1337 rises by ≥ 2 pp. -3. The unscorable instances are documented in the PR description with - a one-line "this instance has no answer values in answer.xlsx, - moving to TODO 05". -4. A new test fixture in `tests/conftest.py` (a workbook with a - `=SUM(A1:A3)` cell saved WITHOUT cached values) plus a regression - test that asserts the chunk render_text contains `=SUM(A1:A3)` - verbatim (not the openpyxl repr of the formula object). - -## Failing test sketch - -```python -# tests/test_formula_rendering_unfilled.py -import openpyxl -from ks_xlsx_parser.pipeline import parse_workbook - -def test_uncached_formula_renders_formula_source(tmp_path): - p = tmp_path / "uncached.xlsx" - wb = openpyxl.Workbook() - ws = wb.active - ws["A1"] = 10 - ws["A2"] = 20 - ws["A3"] = 30 - ws["B1"] = "=SUM(A1:A3)" - wb.save(p) # NO data_only refresh — B1's cached value remains None. - - chunks = parse_workbook(path=str(p)).chunks - assert chunks, "no chunks" - text = chunks[0].render_text - # Whichever is acceptable: the formula source verbatim, OR the - # computed value via the formula engine. NOT a None/empty cell. - assert ("=SUM(A1:A3)" in text) or ("60" in text), ( - "uncached formula cell dropped from render_text:\n" + text - ) -``` - -## Pitfalls - -* `_cell_render_value` already has a branch for booleans, dates, - integer-valued floats. Don't break that path while adding the formula - fallback. -* If you add a formula-source string to render_text, make sure the - retrieval text-match logic in `eval_retrieval.py::_normalize_value_for_match` - still works on numeric values that DO have cached results — adding - formula sources should AUGMENT, not replace, the cached-value path. -* Some "cell drop" cases are actually merged-cell handling: a value is - on the top-left of a merged region but the renderer surfaces only an - empty placeholder for the merged cells. Distinct from this cluster — - if you find one, file it as a new TODO. -* Don't try to evaluate formulas — Excel's calc engine is a swamp and - `formula/formula_parser.py` only parses, doesn't compute. The - acceptable fallback is "render the formula source so an LLM can read - it and reason." - -## Repro - -```bash -# Inspect one candidate end-to-end. -python << 'EOF' -from openpyxl import load_workbook -from ks_xlsx_parser.pipeline import parse_workbook -p = "data/corpora/spreadsheetbench/all_data_912_v0.1/spreadsheet/CF_3712/1_CF_3712_input.xlsx" -wb = load_workbook(p, data_only=True) -print("M3:M5 on", wb.sheetnames[0]) -for r in range(3,6): - print(" ", wb.active.cell(row=r,column=13).value) -r = parse_workbook(path=p) -for c in r.chunks: - print(c.sheet_name, c.top_left_cell, c.bottom_right_cell) - print((c.render_text or "")[:600]) -EOF -``` diff --git a/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md b/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md deleted file mode 100644 index 97cd879..0000000 --- a/docs/planning/recall-90/04-single-chunk-multi-region-sheet.md +++ /dev/null @@ -1,138 +0,0 @@ -# 04 · Whole-sheet single chunk dilutes the embedding - -**Status:** 🆓 free to claim -**Slice:** F (`chunking/chunker.py`, `chunking/segmenter.py`, `analysis/light_block_detector.py`, `analysis/table_grouper.py`) -**Independent of:** 00, 01, 02, 03 - -## What it looks like - -Two clear instances on the 200-sample (`184-6`, `374-9`); the pattern -extends to many more on the full corpus where a sheet contains a -single large data table with a few headers and the chunker emits ONE -chunk covering A1:end. Note: TODO 02 also has instances where the -single-chunk-per-sheet pattern shows up (`53-12`, `189-9`, -`334-11`, `353-29`, `382-29`, `462-45`, `495-31`, `CF_3712`). Treat -those as **second-order beneficiaries**: when 02 lands first, the -chunk's range tightens and may already split the over-broad chunk. -When 04 lands first, the chunks get smaller and 02's range invariant -becomes easier to satisfy. Order is reviewer's choice. - -Symptoms: - -* `n_chunks_on_gt_sheet == 1` and the chunk spans most of the sheet. -* The answer values are present in the chunk's render_text BUT rank - >5 (or rank=1 by luck — at recall@1 the model picks based on the - bag-of-everything embedding). The recall@1 vs recall@5 gap of 12.4 pp - in the v0.2.0 numbers is consistent with this. -* `summary.md` shows `table_fragmentation_rate ≈ 0` on these sheets - (which sounds good but means we're under-splitting, not perfectly - segmenting). - -This is the **chunk-granularity** problem. The block-detection pipeline -fuses everything into one block; the chunker has no token cap and -emits the block verbatim. - -## Diagnostic signature - -```bash -jq -c 'select(.n_chunks_on_gt_sheet == 1 and .n_chunks_total <= 2 - and (.flags | contains(["instruction_requires_execution"]) | not))' \ - tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson -``` - -Also: dump the per-chunk `token_count` (already computed but unused in -chunker.py). Any chunk over 800 tokens is over-broad on this corpus. - -## File scope - -You may touch: - -* `src/ks_xlsx_parser/chunking/chunker.py` — introduce a chunk-size - cap. Suggested target: ~512 tokens. When a block exceeds, split the - block into row-groups (preserving headers in each group), with each - group becoming its own chunk. `prev_chunk_id` / `next_chunk_id` is - already wired for navigation. -* `src/ks_xlsx_parser/chunking/segmenter.py` — segmenter logic if any - granularity decisions live there. -* `src/ks_xlsx_parser/analysis/light_block_detector.py` and - `analysis/table_grouper.py` — investigate whether *over*-merging - upstream is causing the one-block-per-sheet symptom. If the table - grouper is collapsing 3 logical tables into 1, the upstream fix is - better than splitting at chunk-emit time. - -Do NOT touch the ranking/embedding step (`scripts/eval_retrieval.py`) -or any retrieval logic. - -## Acceptance criteria - -1. After landing, no chunk on the 200-sample seed=1337 run exceeds - ~800 tokens (= ~3200 characters) by default. Allow override. -2. `table_fragmentation_rate` may rise (this is desired up to ~0.5 — - purposeful row-group splits). It must NOT cause a regression in - geometric@5 — each row-group chunk's claimed range must cover only - its rows. -3. `recall_text@5` rises by ≥ 3 pp on the 200-sample. -4. `recall_text@1` rises by ≥ 4 pp (smaller chunks = sharper embeddings - = top-1 sharpens). -5. The two named instances (`184-6`, `374-9`) flip to `both_hit`. Note - that `184-6` is also subject to TODO 00's harness fix — order may - matter. - -## Failing test sketch - -```python -# tests/test_chunk_size_cap.py -from ks_xlsx_parser.pipeline import parse_workbook - -def test_no_chunk_exceeds_token_budget(tmp_path): - # Generate a workbook with a 1000-row table. Without a cap the - # chunker emits one giant chunk; with the cap it should emit several. - import openpyxl - p = tmp_path / "big.xlsx" - wb = openpyxl.Workbook() - ws = wb.active - ws.append(["id", "name", "value"]) - for i in range(1, 1001): - ws.append([i, f"name-{i}", i * 7]) - wb.save(p) - - chunks = parse_workbook(path=str(p)).chunks - assert len(chunks) >= 2, "1000-row table should produce ≥2 chunks after the cap lands" - # Each chunk's claimed range should be a contiguous row block on the same sheet. - for c in chunks: - assert c.sheet_name and c.top_left_cell and c.bottom_right_cell - # No chunk should be larger than the budget (in render-text characters). - for c in chunks: - assert len(c.render_text or "") <= 4000, f"chunk too big: {len(c.render_text)} chars" -``` - -## Pitfalls - -* **Don't strand headers.** The current chunker preserves table headers - in render_text. When splitting a table into row groups, every group - needs the header rows OR a header-summary block so an LLM can - interpret column meanings. -* **Range bookkeeping** — when you split, the child chunks' A1 ranges - MUST be tight to their row subset. Otherwise you create - geometric_no_overlap failures (cluster 02 in reverse). -* **`token_count` is character-based** (`len(text) // 4`). Use the - same metric for the cap to avoid a circular dependency on a tokenizer. -* **Don't split horizontally** (column subsets). Excel tables are - usually wider than embeddings need but column-split chunks have no - natural boundary and reading order breaks. - -## Coordinating with TODO 02 - -If both 02 and 04 are in flight: -- 02 is *range-tightening* (semantically correct ranges). -- 04 is *granularity* (more chunks per sheet). -- The risk is they fight over `_block_to_chunk`. Decide which one - emits the final A1 range and document the contract. Suggested: - 04 splits blocks into sub-blocks WITH tight ranges, then 02's - invariant test passes for the sub-blocks. - -## Repro - -Use a synthetic 1000-row workbook (see test sketch) — much faster to -iterate on than the corpus instances. Once your local fixture passes, -run the 200-sample to confirm corpus-wide impact. diff --git a/docs/planning/recall-90/05-out-of-scope-execution-instances.md b/docs/planning/recall-90/05-out-of-scope-execution-instances.md deleted file mode 100644 index 0133d3a..0000000 --- a/docs/planning/recall-90/05-out-of-scope-execution-instances.md +++ /dev/null @@ -1,136 +0,0 @@ -# 05 · Out-of-scope: benchmark instances that require execution, not retrieval - -**Status:** 🆓 free to claim (scoring/harness work — **NOT a parser fix**) -**Slice:** A (`scripts/eval_retrieval.py`, `scripts/enrich_failures.py`) -**Independent of:** 01, 02, 03, 04 — partial overlap with 00 on -`eval_retrieval.py`; coordinate via README. - -## Why this exists - -127 of 200 instances in the seed=1337 sample (≈ 63%) have -`instruction_requires_execution = True` — the `answer_position` in the -INPUT spreadsheet is empty because the question literally asks the -system to *compute and write* a value there. - -Example instructions from this class: - -* "Round the value 367.5 to the nearest $5 and put the result in C2." -* "Sum each row of B:F and fill column G." -* "Fix the broken formulas in D2:D613." - -A retrieval-grade parser cannot satisfy these instances. The answer -doesn't exist in `input.xlsx` to be retrieved. Including these -instances in the headline `recall_text@5` number drags it down ~25 -points without representing any parser quality signal. - -**This TODO is not a parser improvement.** It's a benchmark-scoring -correction so that subsequent parser work has a clean signal. - -## What the headline numbers actually mean today - -On the 200-sample seed=1337: - -| | All 200 | In-scope (73) | -|-------------------------|---------|---------------| -| Text recall@5 | 0.635 | 0.59 | -| Geometric recall@5 | 0.310 | (compute it) | -| `instruction_requires_execution` | 127 of 200 | 0 (excluded) | - -The "in-scope" column is the metric the planning roadmap targets. Today -it's hidden inside the eval output; this TODO surfaces it as a -first-class metric. - -## What to ship - -1. **A scoring filter switch** in `scripts/eval_retrieval.py`. - - New optional `--exclude-execution-instances` flag (or compute both - filtered and unfiltered metrics every run). - - Definition of "execution instance": at run time, peek at - `input.xlsx[answer_sheet][answer_position]`. If the union of - non-empty cells in that range is empty, mark the instance as - `instruction_requires_execution` and report it in a separate - bucket. **Do this once per instance, not per parser** — the - classification is parser-independent. -2. **Two recall numbers** in `summary.json` / `summary.md`: - - `recall_text@5_all` (current behaviour — count every instance). - - `recall_text@5_in_scope` (denominator excludes - execution instances). -3. **History tracking**: include both numbers in - `scripts/append_bench_history.py` so the trend chart shows the - in-scope number — that's the target for the 0.90 goal. -4. **README + recall-investigation.md update** so the in-scope number - is named and called out as the gate. The 0.90 target is on - *in-scope* recall, not all-instances recall. -5. **Out-of-scope summary file** alongside the run reports: - `tests/benchmarks/reports/retrieval//out_of_scope.txt` listing - the instance IDs that were filtered, so the filter's behaviour is - auditable. If somebody disagrees with the classification, they can - diff the list. - -## File scope - -You may touch: - -* `scripts/eval_retrieval.py` — add classification step BEFORE - `score_instance` (so unfiltered metric still works). -* `scripts/enrich_failures.py` — make `instruction_requires_execution` - the canonical source of truth (it already detects this). -* `scripts/append_bench_history.py` — record both numbers per run. -* `scripts/triage_recall.py` — by default exclude out-of-scope rows. -* `docs/planning/recall-90/README.md` — update target row. -* `docs/recall-investigation.md` — call out the in-scope number. - -Do NOT touch any `src/ks_xlsx_parser/*` code for this cluster. - -## Acceptance criteria - -1. On the same 200-sample seed=1337 run, the new - `recall_text@5_in_scope` number is emitted alongside the existing - one and printed in `summary.md`. -2. The classifier reproducibly identifies ≥ 120 of the 127 currently- - flagged instances (some borderline cases — a single non-empty - header in the answer range — are debatable). -3. `history.jsonl` rows after the change have both fields populated. -4. A unit test under `tests/test_eval_retrieval_classify.py` covers - the classifier with hand-built fixtures: empty range → out-of-scope, - range with one value → in-scope, range with only string headers - (no numeric data) → in-scope, range with formula cells that have - cached values → in-scope. - -## Why this isn't TODO 00 - -* TODO 00 fixes the harness's *spec parser* — wrong sheet names get - resolved. -* TODO 05 fixes the harness's *scoring* — instances the parser can't - affect get a separate bucket. - -Both touch `eval_retrieval.py` but in different functions. If they're -worked simultaneously, coordinate over `parse_position_spec` — -TODO 00 owns the regex; TODO 05 owns the classifier. - -## Pitfalls - -* Don't over-exclude. An instance where `answer_position` covers - `A2:G15` and only `A1` is non-empty (a single header row above the - empty area) is still execution-required, but use the right cell - range — read the actual `answer_position`, not `data_position`. -* The classifier needs the openpyxl read, which costs ~50ms per - instance. Cache it or do it once at script start, not inside the - ranking loop. -* Don't delete the unfiltered metric. Both numbers are useful: the - unfiltered one tells you how a downstream consumer experiences - ks-xlsx-parser on real spreadsheet questions (most of which DO - require execution); the in-scope one tells you whether the parser - is doing its job. - -## Sample of what gets filtered - -Random sampling of `instruction_requires_execution = True` instances: - -| Instance | Instruction (first 60 chars) | Answer range | -|---|---|---| -| ... | "Calculate the sum of column B for products where..." | `Sheet1!E2:E10` | -| ... | "For each row in column A, look up the corresponding..."| `Sheet1!C2:C50` | -| ... | "Pivot the data in A1:D100 by month..." | `Sheet2!A1:E13` | - -Run the enricher to see the full list. diff --git a/docs/planning/recall-90/README.md b/docs/planning/recall-90/README.md deleted file mode 100644 index 4f50b46..0000000 --- a/docs/planning/recall-90/README.md +++ /dev/null @@ -1,137 +0,0 @@ -# Roadmap: ks-xlsx-parser retrieval recall → 0.90 - -This directory is the worklist for getting SpreadsheetBench recall on the -canonical ks-xlsx-parser benchmark from the v0.2.0 baseline up to: - -| Metric | v0.2.0 (912) | 200-sample baseline | Target | Stretch | -|------------------------|--------------|---------------------|--------|---------| -| Text recall@5 | 0.704 | 0.635 | ≥ 0.90 | 0.95 | -| Geometric recall@5 | 0.369 | 0.310 | ≥ 0.70 | 0.85 | -| In-scope text recall@5 | unmeasured | **0.59** | ≥ 0.90 | — | - -Each `NN-*.md` in this folder is one **independent, parallel-safe TODO**. -Pick one, claim it (table below), execute, hand off the next agent. - -## Read this first — the headline number is misleading - -The benchmark dataset has 912 instances. **~63% of them require the -system to *compute and write* the answer** (e.g. "fill column C with -ROUND(B*5)") — the answer literally doesn't exist in the input -spreadsheet. A parser cannot retrieve what isn't there. We call this -class `instruction_requires_execution` and treat them as **out of scope -for parser work**. See [05-out-of-scope-execution-instances.md](./05-out-of-scope-execution-instances.md). - -When the scoring is filtered to in-scope instances only, the real -ks-xlsx-parser recall is closer to **0.59**, and improving it past -0.90 means closing ~22 named failures out of ~30 actionable cases on the -200-instance seed=1337 sample. That's the actual work. - -## Worklist - -Status legend: 🆓 free to claim · 🔵 in progress · ✅ landed · ⛔ blocked - -| # | File | Cluster | Instances | Primary slice | Status | -|---|---|---|---|---|---| -| 00 | [00-benchmark-spec-parser-bugs.md](./00-benchmark-spec-parser-bugs.md) | Benchmark harness fails to parse malformed `data_position`/`answer_position` (`Dashboard'!B8`, fullwidth `G12:J15`) — 8 instances scored as wrong even when the parser was correct. | 8 | `scripts/eval_retrieval.py` | 🆓 | -| 01 | [01-array-formula-rendering.md](./01-array-formula-rendering.md) | Array-formula cells surface as `` — value never lands in `render_text`. | 2 | `parsers/cell_parser.py`, `rendering/text_renderer.py` | 🆓 | -| 02 | [02-chunk-range-vs-text-mismatch.md](./02-chunk-range-vs-text-mismatch.md) | `text_hit_geom_miss` with chunks on the GT sheet — answer text is in *some* chunk's text but no chunk's claimed A1 range overlaps GT. Range bookkeeping drift during block merge/split. | 7 | `annotation/block_splitter.py`, `analysis/pattern_splitter.py`, `chunking/chunker.py` | 🆓 | -| 03 | [03-cell-drop-or-uncached-formula.md](./03-cell-drop-or-uncached-formula.md) | Chunk's range covers GT geometrically but its rendered text lacks the answer values — either cells dropped in render, or formula cells with no cached value rendered as the formula source. | up to 10 (some unscorable) | `parsers/cell_parser.py`, `parsers/sheet_parser.py`, `rendering/text_renderer.py` | 🆓 | -| 04 | [04-single-chunk-multi-region-sheet.md](./04-single-chunk-multi-region-sheet.md) | Sheets containing distinct logical regions emerge as one giant chunk; the answer text dilutes against the rest of the sheet at embedding time. | 2 (+ likely more at full corpus) | `chunking/chunker.py`, `analysis/light_block_detector.py`, `analysis/table_grouper.py` | 🆓 | -| 05 | [05-out-of-scope-execution-instances.md](./05-out-of-scope-execution-instances.md) | Informational + scoring filter. **Not a parser fix.** Surfaces benchmark instances where the parser fundamentally can't help; suggests filtering them from headline recall. | 127 / 200 | `scripts/eval_retrieval.py` (scoring), `scripts/enrich_failures.py` | 🆓 | - -## How parallelism works here - -Each TODO is scoped to a different file slice so agents won't collide: - -``` -slice A benchmark harness 00, 05 -slice B parsers/cell_parser 01, 03 -slice C parsers/sheet_parser 03 -slice D rendering/text_renderer 01, 03 -slice E annotation/ + analysis/ 02 -slice F chunking/ 04 -``` - -TODOs 00 and 05 touch the same file (`eval_retrieval.py`) — **serialize** -those two. Everything else can run truly in parallel. - -If you discover a cluster mid-flight that doesn't fit anywhere here, -add a new `NN-*.md` and append a row to the table above. Do NOT bury -findings inside an existing TODO. - -## Working a TODO - -Each cluster file has the same structure: - -1. **What the cluster looks like** — a real failure example in full. -2. **Repro instance IDs** — the 200-sample seed=1337 instance IDs that - match this cluster. -3. **Diagnostic signature** — exact `enriched_failures.ndjson` columns - that identify a member. -4. **File scope** — what you can change; what you can't. -5. **Acceptance criteria** — measurable on the 200-sample seed=1337 - rerun. "The N instances listed in §2 all flip from miss → hit on - `text_match_rank` AND no other previously-passing instance regresses." -6. **Failing test sketch** — one-line idea for a regression test. - -What's **deliberately missing** from each file: the proposed fix and an -effort estimate. The agent that picks up the task does that diagnosis; -the doc just specifies success. - -### The loop, end-to-end - -```bash -# 1. Claim the task (edit README, change 🆓 → 🔵 with your name). -# 2. Read the cluster file + the named example instances. -# 3. Reproduce locally: -python scripts/eval_retrieval.py \ - --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \ - --parsers ks --sample 200 --seed 1337 --emit-failures -python scripts/enrich_failures.py tests/benchmarks/reports/retrieval -python scripts/triage_recall.py tests/benchmarks/reports/retrieval --examples 5 - -# 4. Write a regression test that fails today (see test sketch in cluster file). -# 5. Fix until the test passes. -# 6. Re-run the eval + enrichment. Confirm the cluster count dropped, no regressions. -# 7. Append to history; the delta should be visible: -python scripts/append_bench_history.py - -# 8. PR the change. Title: `recall(NN): — N→M`. -# PR body links to the cluster file and pastes the before/after histogram. -# 9. Update README table: 🔵 → ✅ with PR link. -``` - -## Why the 200-sample seed=1337 is the gate, not the full 912 - -912 takes ~40 min on a laptop. PR validation needs to be fast. The -seed=1337 200-sample is deterministic and contains roughly proportional -representation of every cluster. Acceptance criteria are stated against -it. **Cross-check with the full corpus before declaring a release**: - -```bash -python scripts/eval_retrieval.py --parsers ks --emit-failures -``` - -The weekly benchmark workflow (`.github/workflows/benchmark.yml`) runs -the full 912 on schedule + workflow_dispatch. - -## Tracking progress - -`tests/benchmarks/reports/history.jsonl` is appended to on each -`make bench-track` run, one row per commit. Tail it to see the trend: - -```bash -tail -10 tests/benchmarks/reports/history.jsonl -``` - -Each row carries the failure_buckets histogram, so per-cluster progress -is in the data, not just the headline number. - -## See also - -- `docs/recall-investigation.md` — diagnosis framework + hypotheses - (H1 chunk-size, H2 formula-rendering, H3 range-bookkeeping) -- `docs/benchmark-local-setup.md` — install → corpus → eval loop -- `scripts/triage_recall.py` — bucket histogram + exemplar dump -- `scripts/enrich_failures.py` — per-failure diagnostic columns -- `.claude/skills/recall-failure-triage/SKILL.md` — agent guide