diff --git a/.claude/skills/recall-failure-triage/SKILL.md b/.claude/skills/recall-failure-triage/SKILL.md new file mode 100644 index 0000000..a9a9263 --- /dev/null +++ b/.claude/skills/recall-failure-triage/SKILL.md @@ -0,0 +1,154 @@ +--- +name: recall-failure-triage +description: Diagnose why ks-xlsx-parser's retrieval recall is below target by reading the failure-bucket histogram and exemplar failures, then propose ranked fixes. Use when investigating low recall@k on SpreadsheetBench, deciding what to fix next, or reviewing a benchmark run. Sister skill to excel-extraction-pipeline-improver. +--- + +# Recall Failure Triage — turn `recall@5 = X` into a ranked worklist + +A single recall number tells you nothing about *what* to fix. Five +distinct failure modes produce identical recall drops, and the fix is +completely different per mode. This skill is the bridge from "recall is +low" to "open `parsers/cell_parser.py` line N, here is the bug". + +Pair with `excel-extraction-pipeline-improver` — that skill implements +fixes; this one decides what to fix. + +## When this skill fires + +* User says "recall is low / dropped / not improving" +* New benchmark run produced a `failures.ndjson` or `summary.json` +* PR-time benchmark sample regressed on the CI step summary +* You're picking the next parser/chunking PR and need to choose what's + highest leverage + +## Inputs you should expect + +| Input | Where it comes from | +|---|---| +| `summary.json` + `summary.md` | latest `tests/benchmarks/reports/retrieval//` | +| `failures.ndjson` | same dir, **only present if** the run used `--emit-failures` | +| `history.jsonl` | `tests/benchmarks/reports/history.jsonl` — recall over commits | +| current branch + recent commits | to know what changed since the last run | + +## The decision procedure + +### 0. 
Apply the in-scope filter — most "misses" are unfixable + +Before counting buckets, run `scripts/enrich_failures.py` to mark +instances flagged `instruction_requires_execution` — those where the +benchmark's `answer_position` is empty in `input.xlsx` because the +question literally asks the system to *write* the answer. Roughly 63% of +the instances in the 200-instance sample fall into this class. They are +NOT parser bugs and CANNOT be fixed by parser work. Exclude them. + +```bash +python scripts/enrich_failures.py tests/benchmarks/reports/retrieval +jq -c 'select((.flags | contains(["instruction_requires_execution"])) | not)' \ + tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson +``` + +### 1. Read the bucket histogram FIRST + +```bash +python scripts/triage_recall.py tests/benchmarks/reports/retrieval +``` + +The output ranks buckets by count. **Spend your time on the largest +bucket only.** The next-largest is the next PR. + +| Bucket | Root cause | Where to fix | +|---|---|---| +| `answer_absent_from_chunks` | The answer value never made it into ANY chunk. The cell was dropped or rendered as a formula expression instead of its computed value. | `parsers/cell_parser.py::_extract_cell_value` and `rendering/text_renderer.py::_cell_render_value`. Check `data_only` plumbing in `parsers/workbook_parser.py`. | +| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. The chunk is too big/heterogeneous — embedding is diluted. | `chunking/chunker.py`. There is no token cap today; large blocks are emitted whole. Cap at ~512 tokens and row-split tall tables. | +| `wrong_sheet` | Answer sheet was never chunked. | `parsers/workbook_parser.py` sheet loop. Check hidden sheets, very-hidden sheets, and `wb.sheetnames` vs `wb.worksheets`. | +| `geometric_no_overlap` | Answer text matches but the chunk's A1 range doesn't overlap ground truth. Range drifts during merge/split. | `annotation/block_splitter.py` + `analysis/pattern_splitter.py`. 
Add invariant: `block.cell_range` ⊆ bbox of cells that actually rendered text. | +| `no_chunks` / `parse_error` | Upstream parser failure. | The exception. Reproduce with `parse_workbook(path)`. | + +### 2. Read 5 example failures from the top bucket + +```bash +python scripts/triage_recall.py tests/benchmarks/reports/retrieval \ + --bucket --examples 5 +``` + +For each example, the script prints: +* the natural-language question +* `answer_position` (the ground-truth cell) +* the ground-truth string values that should appear in a chunk +* the top-8 ranked chunks with sheet, range, and whether each contains + the answer + +**Pattern-match.** What do the 5 examples have in common? E.g. +* All answer cells are formulas → H2 confirmed: render the cached value +* All chunks span the entire sheet → H1 confirmed: cap chunk size +* All answer sheets are hidden → wrong_sheet from hidden-sheet skip + +### 3. State the hypothesis before changing code + +In your reply, write the hypothesis as one sentence: +> "Bucket X dominates because [observable common pattern in examples]. +> Expected fix: [specific file + change]. Expected lift on recall@5: +> [number]." + +The expected lift = (bucket count) ÷ (scoreable instances). If a fix +is supposed to halve the biggest bucket but recall barely moves on the +next run, the hypothesis was wrong — go back to step 1, don't pile on. + +### 4. Wire the fix into the existing pipeline + +Pass the hypothesis to the `excel-extraction-pipeline-improver` skill +(or implement directly) with TDD: write a failing crossval/invariant +test that captures the bug from one example, fix, run `make test`, then +re-run `make bench-track` to confirm the bucket shrank. + +### 5. Verify on the history + +```bash +tail -5 tests/benchmarks/reports/history.jsonl +``` + +`scripts/append_bench_history.py` (run by `make bench-track`) prints the row-over-row delta. 
Expected behaviour: +* The targeted bucket count drops +* `recall_text@5` rises by approximately `target_bucket_drop / + scoreable_instances` +* `recall_text@1` improves proportionally OR more (better chunks rank + higher too) +* Other buckets shouldn't grow — if they do, the fix had a side effect + +## Anti-patterns + +* **Don't chase the headline number.** "Recall went up 1 pp" without a + bucket explanation is suspicious — it might be benchmark noise. +* **Don't fix two buckets at once.** You won't know which fix worked. +* **Don't tune chunk size as the first move.** It only helps + `present_but_ranked_low`. If `answer_absent_from_chunks` is bigger, + chunk tuning is wasted effort. +* **Don't trust geometric and text recall independently.** A docling-style + parser with zero geometric overlap can still have high text recall + (because it serializes everything to plain text). That's not the same + thing as good — citation overlays need the geometric metric. Always + report both. + +## Quick reference — running it from scratch + +```bash +make corpus-download # one-time, large +make bench-track # runs + triages +python scripts/triage_recall.py tests/benchmarks/reports/retrieval --examples 5 +cat tests/benchmarks/reports/history.jsonl | tail -10 # commit-over-commit +``` + +CI runs a 60-instance sample on every PR (`.github/workflows/benchmark.yml`) +and posts the bucket histogram to the GitHub job summary. Weekly schedule +runs the full 912-instance corpus. + +## Hand-off contract + +When you finish a triage round, your final message MUST include: +1. The bucket histogram (counts + percentages). +2. The dominant bucket + one-sentence root cause hypothesis. +3. The 1–2 file paths where the fix lives. +4. The expected lift on `recall_text@5` if the fix lands. +5. A pointer to the example instance IDs you read. + +Anything less and the next agent will have to re-do the analysis. 
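The expected-lift rule from step 3 of the skill, (bucket count) ÷ (scoreable instances), can be sketched in a few lines. This is a minimal illustration only: `expected_lift`, `rank_buckets`, and the `fraction_fixed` parameter are hypothetical names that do not exist in `scripts/triage_recall.py`, and the histogram in the usage note is made up.

```python
from typing import Dict, List, Tuple


def expected_lift(bucket_count: int, scoreable_instances: int,
                  fraction_fixed: float = 1.0) -> float:
    """Upper bound on the recall@5 gain if `fraction_fixed` of a bucket's
    misses turn into hits. Hypothetical helper, not part of the repo."""
    if scoreable_instances <= 0:
        raise ValueError("scoreable_instances must be positive")
    if not 0.0 <= fraction_fixed <= 1.0:
        raise ValueError("fraction_fixed must lie in [0, 1]")
    return fraction_fixed * bucket_count / scoreable_instances


def rank_buckets(histogram: Dict[str, int],
                 scoreable_instances: int) -> List[Tuple[str, float]]:
    """Order buckets by the lift a complete fix could deliver, largest first."""
    lifts = [(name, expected_lift(count, scoreable_instances))
             for name, count in histogram.items()]
    return sorted(lifts, key=lambda item: item[1], reverse=True)
```

For example, with an illustrative histogram `{"geometric_no_overlap": 9, "wrong_sheet": 3}` over 60 scoreable instances, `rank_buckets` puts `geometric_no_overlap` first with a ceiling of 0.15. If the next run moves `recall_text@5` by far less than the stated number, that is the signal to revisit the hypothesis.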
diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml new file mode 100644 index 0000000..026b535 --- /dev/null +++ b/.github/workflows/benchmark.yml @@ -0,0 +1,85 @@ +name: Benchmark + +# Tracks ks-xlsx-parser retrieval recall on SpreadsheetBench over time. +# The headline goal: text recall@5 > 0.90 (currently ~0.70). +# +# * Pull requests run a fast SAMPLE (60 instances) as a regression smoke +# test — keeps the signal without a 40-minute wait. +# * The weekly schedule + manual dispatch run the FULL 912-instance +# corpus and publish the recall trend. + +on: + pull_request: + branches: [main] + paths: + - "src/**" + - "scripts/eval_retrieval.py" + - "scripts/triage_recall.py" + - "Dockerfile.bench" + - ".github/workflows/benchmark.yml" + schedule: + - cron: "0 6 * * 1" # Mondays 06:00 UTC + workflow_dispatch: + inputs: + sample: + description: "Instances to sample (0 = full 912 corpus)" + default: "0" + +concurrency: + group: benchmark-${{ github.ref }} + cancel-in-progress: true + +jobs: + benchmark: + runs-on: ubuntu-latest + timeout-minutes: 90 + steps: + - uses: actions/checkout@v4 + + # PRs use a 60-instance sample; scheduled/dispatch runs use the full + # corpus (or whatever the dispatch input requests). + - name: Resolve sample size + id: cfg + run: | + if [ "${{ github.event_name }}" = "pull_request" ]; then + echo "sample=60" >> "$GITHUB_OUTPUT" + else + echo "sample=${{ github.event.inputs.sample || 0 }}" >> "$GITHUB_OUTPUT" + fi + + - name: Cache SpreadsheetBench corpus + uses: actions/cache@v4 + with: + path: data/corpora/spreadsheetbench + key: spreadsheetbench-912-v0.1 + + - name: Build benchmark image + run: docker build -f Dockerfile.bench -t ks-xlsx-parser-bench . 
+ + - name: Run benchmark + run: | + mkdir -p tests/benchmarks/reports data + docker run --rm \ + -e BENCH_SAMPLE=${{ steps.cfg.outputs.sample }} \ + -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \ + -v "$PWD/data:/app/data" \ + ks-xlsx-parser-bench | tee bench.log + + - name: Publish recall to job summary + if: always() + run: | + { + echo '## ks-xlsx-parser retrieval benchmark' + echo '' + echo '```' + tail -n 40 bench.log || true + echo '```' + } >> "$GITHUB_STEP_SUMMARY" + + - name: Upload benchmark reports + if: always() + uses: actions/upload-artifact@v4 + with: + name: benchmark-reports-${{ github.run_number }} + path: tests/benchmarks/reports/ + if-no-files-found: warn diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 1168a01..ba7b9e2 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -23,22 +23,20 @@ jobs: steps: - uses: actions/checkout@v4 + - name: Install uv + uses: astral-sh/setup-uv@v5 + with: + version: "latest" + enable-cache: true + cache-dependency-glob: "uv.lock" + - name: Set up Python ${{ matrix.python-version }} uses: actions/setup-python@v5 with: python-version: ${{ matrix.python-version }} - cache: pip - cache-dependency-path: pyproject.toml - name: Install - run: | - python -m pip install --upgrade pip - pip install -e ".[dev,api]" - - - name: Ruff lint - run: | - pip install ruff - ruff check src/ tests/ scripts/ || true # non-blocking until cleanup PR lands + run: uv pip install --system -e ".[dev,api]" - name: Run test suite run: make test-ci @@ -50,3 +48,56 @@ jobs: name: junit-${{ matrix.os }}-py${{ matrix.python-version }} path: reports/junit.xml if-no-files-found: ignore + + # Builds the wheel and proves it installs + imports in a CLEAN venv. + # The matrix `test` job runs against an editable install, which exposes the + # whole src/ tree on sys.path and therefore CANNOT catch a broken wheel — + # exactly how the v0.2.0 "pipeline.py missing from wheel" bug shipped. 
+ # This job is the regression guard. Keep it required. + wheel-check: + name: wheel install smoke test + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + + - name: Build wheel + run: | + python -m pip install --upgrade pip build + python -m build --wheel + + - name: Verify wheel installs and imports in a clean venv + run: python scripts/verify_wheel.py + + lint: + name: ruff lint + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + - name: Ruff lint + run: | + python -m pip install --upgrade pip ruff + # TODO(cleanup): 45 pre-existing findings (E402/B905/SIM*). Drop the + # `|| true` once the lint-cleanup PR lands so this job gates merges. + ruff check src/ tests/ scripts/ || true + + typecheck: + name: mypy + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + - name: mypy + run: | + python -m pip install --upgrade pip + python -m pip install -e ".[dev,api]" + # TODO(cleanup): 59 pre-existing findings. Drop `|| true` once typed. + python -m mypy src/ks_xlsx_parser || true diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index 5cf43b6..1da9bcc 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -37,6 +37,12 @@ jobs: - name: Build wheel + sdist run: python -m build + # Gate the release on a clean-venv install of the freshly built wheel. + # Prevents shipping a wheel that drops modules or leaks top-level + # packages (the v0.2.0 packaging regression). 
+ - name: Verify wheel + run: python scripts/verify_wheel.py + - name: Upload distribution artifacts uses: actions/upload-artifact@v4 with: diff --git a/CHANGELOG.md b/CHANGELOG.md index d7527b2..99d5832 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -45,6 +45,52 @@ Template for a new release (copy this block, fill in, move Unreleased items in): ## [Unreleased] +### ⚠️ BREAKING (Fixed — see also #ks-xlsx-parser channel report) +- The repository layout, flattened directly under `src/`, was leaking 13 generic top-level + packages (`models`, `utils`, `parsers`, …) into installed wheels and + silently dropping `pipeline.py` and `api.py` (setuptools `packages.find` + only finds *packages*, not top-level modules). Users hitting + `from ks_xlsx_parser.pipeline import ...` on 0.2.0 from PyPI got + `ModuleNotFoundError`. **All modules now live under + `src/ks_xlsx_parser/`**; the wheel's `top_level.txt` contains only + `ks_xlsx_parser`. Imports inside the package switched from + `from pipeline import` to `from ks_xlsx_parser.pipeline import`. + Downstream code that imported the leaked generics + (`from models import …`) MUST migrate to `from ks_xlsx_parser.models …`. + +### Added +- `scripts/verify_wheel.py` — builds the wheel, installs it in a fresh + venv, and asserts the public import surface resolves. Wired into a + new `wheel-check` job in `.github/workflows/ci.yml` and a `Verify wheel` + step in `release.yml`. Regression guard for the packaging bug above. +- `scripts/triage_recall.py` + `scripts/append_bench_history.py` — turn + `failures.ndjson` into a ranked bucket histogram with exemplar + failures, and append each benchmark run to + `tests/benchmarks/reports/history.jsonl` so recall is tracked + commit-over-commit. Goal: text recall@5 > 0.90. +- `eval_retrieval.py --emit-failures` — dumps top-8 ranked chunks per + miss with a `failure_bucket` (answer_absent_from_chunks / + present_but_ranked_low / wrong_sheet / geometric_no_overlap / …) for + triage. 
Summary JSON gains a `failure_buckets` histogram. +- `Dockerfile.bench` + `.github/workflows/benchmark.yml` — reproducible + benchmark image; PR sample run (60 instances), weekly full corpus run. +- `make install-dev` alias and `make wheel-check` / `make bench-track` + / `make docker-bench` targets. +- New `bench` optional-dependency group (`sentence-transformers`, + `numpy`) — only the benchmark needs these. +- `docs/recall-investigation.md` documenting the diagnosis framework and + three named hypotheses (chunk-size dilution, formula-expression + rendering, range-bookkeeping drift). +- `.claude/skills/recall-failure-triage/SKILL.md` — agent skill that + consumes the bucket output and proposes ranked fixes. + +### Changed +- Dropped `PYTHONPATH=src` from Makefile benchmark targets — the + package is now properly installable so callers don't need it. +- `pyproject.toml`: `packages.find` constrained to `ks_xlsx_parser*`, + `py.typed` declared as package data, `xlsx-parser-api` console script + updated to `ks_xlsx_parser.api:main`. + ### ⚠️ BREAKING - Retired the in-tree `testBench/` corpus. The 1054-workbook stress dataset and `make testbench*` targets are gone — benchmarks now run against the diff --git a/Dockerfile.bench b/Dockerfile.bench new file mode 100644 index 0000000..cf2755f --- /dev/null +++ b/Dockerfile.bench @@ -0,0 +1,49 @@ +# Benchmark image for ks-xlsx-parser. +# +# Builds once, then on each run downloads SpreadsheetBench (if not cached), +# parses the corpus, embeds chunks with a small sentence-transformer, and +# emits a recall@k report + failure-bucket triage. The output lands in +# tests/benchmarks/reports/ — mount that path as a volume to persist results. +# +# Usage: +# docker build -f Dockerfile.bench -t ks-xlsx-parser-bench . 
+# docker run --rm \ +# -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \ +# -v "$PWD/data:/app/data" \ +# ks-xlsx-parser-bench +# +# # Quick sanity run on 20 instances: +# docker run --rm -e BENCH_SAMPLE=20 ks-xlsx-parser-bench + +FROM python:3.12-slim + +ENV PYTHONDONTWRITEBYTECODE=1 \ + PYTHONUNBUFFERED=1 \ + PIP_NO_CACHE_DIR=1 \ + PIP_DISABLE_PIP_VERSION_CHECK=1 + +WORKDIR /app + +RUN apt-get update && apt-get install -y --no-install-recommends \ + curl unzip ca-certificates git \ + && rm -rf /var/lib/apt/lists/* + +# Install deps first to keep layers cacheable across code edits. +COPY pyproject.toml README.md ./ +COPY src ./src +RUN pip install -e ".[dev,bench]" + +# Pre-warm the embedding model so the first ``docker run`` doesn't pay the +# ~130 MB download. Same model name eval_retrieval.py defaults to. +RUN python -c "from sentence_transformers import SentenceTransformer; \ +SentenceTransformer('BAAI/bge-small-en-v1.5')" + +COPY scripts ./scripts +COPY tests ./tests +COPY Makefile ./ + +ENV BENCH_SAMPLE=0 \ + BENCH_PARSERS=ks \ + BENCH_TIMEOUT=120 + +ENTRYPOINT ["bash", "scripts/run_bench.sh"] diff --git a/Makefile b/Makefile index 9bedb5b..5331c52 100644 --- a/Makefile +++ b/Makefile @@ -1,4 +1,5 @@ -.PHONY: help install test test-ci lint format typecheck clean corpus-download bench-robust bench-retrieval bench +.PHONY: help install install-dev test test-ci lint format typecheck wheel-check clean \ + corpus-download bench-robust bench-retrieval bench bench-track docker-bench PYTHON ?= python PKG_VERSION := $(shell $(PYTHON) -c "import tomllib, pathlib; print(tomllib.loads(pathlib.Path('pyproject.toml').read_text())['project']['version'])") @@ -7,22 +8,29 @@ help: @echo "ks-xlsx-parser — common targets" @echo "" @echo " make install Install package and dev deps (editable)" + @echo " make install-dev Alias for install (matches ks-backend)" @echo " make test Run the default test suite" @echo " make test-ci Run the suite with verbose output for 
CI" @echo "" @echo " make lint Ruff lint" @echo " make format Ruff format" @echo " make typecheck mypy" + @echo " make wheel-check Build wheel + verify it imports in a clean venv" @echo "" @echo " make corpus-download Fetch SpreadsheetBench for benchmark runs" @echo "" @echo " make bench-robust Robustness on SpreadsheetBench (ks vs docling, ~20 min)" @echo " make bench-retrieval Retrieval recall on SpreadsheetBench (ks vs docling, ~40 min)" @echo " make bench Run both benchmarks back-to-back" + @echo " make bench-track Run retrieval bench + append metrics to history" + @echo " make docker-bench Build + run the benchmark Docker image" install: $(PYTHON) -m pip install -e ".[dev,api]" +# Alias — junior devs pattern-match off ks-backend's `make install-dev`. +install-dev: install + test: $(PYTHON) -m pytest tests/ -v --tb=short -W ignore::UserWarning @@ -36,7 +44,15 @@ format: $(PYTHON) -m ruff format src/ tests/ scripts/ typecheck: - $(PYTHON) -m mypy src/xlsx_parser + $(PYTHON) -m mypy src/ks_xlsx_parser + +# Build the wheel and prove it imports outside the editable source tree. +# This is the regression guard for the v0.2.0 packaging bug (pipeline.py +# missing from the wheel because it was a top-level module, not a package). +wheel-check: + rm -rf dist build + $(PYTHON) -m build --wheel + $(PYTHON) scripts/verify_wheel.py clean: rm -rf build/ dist/ *.egg-info src/*.egg-info .pytest_cache .ruff_cache .mypy_cache @@ -47,13 +63,26 @@ corpus-download: bench-robust: @test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1) - PYTHONPATH=src $(PYTHON) -m tests.benchmarks.vs_hucre \ + $(PYTHON) -m tests.benchmarks.vs_hucre \ --corpus data/corpora/spreadsheetbench --parsers ks,docling \ --per-file-timeout 120 \ --out tests/benchmarks/reports/spreadsheetbench bench-retrieval: @test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." 
&& exit 1) - PYTHONPATH=src $(PYTHON) scripts/eval_retrieval.py --parsers ks,docling + $(PYTHON) scripts/eval_retrieval.py --parsers ks,docling bench: bench-robust bench-retrieval + +# Run the retrieval benchmark and append a row to history.jsonl so +# accuracy can be tracked commit-over-commit. Goal: text recall@5 > 0.90. +bench-track: + @test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1) + $(PYTHON) scripts/eval_retrieval.py --parsers ks --emit-failures \ + --out tests/benchmarks/reports/retrieval + $(PYTHON) scripts/append_bench_history.py + $(PYTHON) scripts/triage_recall.py tests/benchmarks/reports/retrieval + +docker-bench: + docker build -f Dockerfile.bench -t ks-xlsx-parser-bench . + docker run --rm -v "$(PWD)/tests/benchmarks/reports:/app/tests/benchmarks/reports" ks-xlsx-parser-bench diff --git a/docs/benchmark-local-setup.md b/docs/benchmark-local-setup.md new file mode 100644 index 0000000..ab39803 --- /dev/null +++ b/docs/benchmark-local-setup.md @@ -0,0 +1,235 @@ +# Running the retrieval benchmark locally + +This is the loop you'll use whenever you want to know "is my parser +change actually moving recall?" — the same pipeline CI runs, just on +your laptop. Goal: text recall@5 > 0.90 (currently ~0.70). + +The full SpreadsheetBench corpus is 912 instances and takes ~30–45 min +on a Mac. For an iteration loop you'll mostly use `--sample 60` (≈ 3 min +after the first embedding-model load). 
+ +## TL;DR + +```bash +# One-time +make install-dev # installs the parser + dev deps +pip install -e ".[bench]" # sentence-transformers + numpy (~500 MB) +make corpus-download # downloads SpreadsheetBench → data/corpora/ + # (91 MB tarball, 2,726 .xlsx after extract) + +# Each time you want to score +python scripts/eval_retrieval.py \ + --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \ + --parsers ks \ + --sample 60 \ + --emit-failures +python scripts/triage_recall.py tests/benchmarks/reports/retrieval +python scripts/append_bench_history.py +``` + +Output: `tests/benchmarks/reports/retrieval//` with +`results.ndjson`, `failures.ndjson`, `summary.json`, `summary.md`. + +## Step-by-step + +### 1. Install bench deps + +The benchmark uses `sentence-transformers` (≈ 500 MB with torch) for +embeddings. They're a separate optional group so the parser package +itself stays lean: + +```bash +pip install -e ".[bench]" +``` + +First run also downloads the embedding model (`BAAI/bge-small-en-v1.5`, +≈ 130 MB) into `~/.cache/huggingface/`. + +### 2. Download the corpus + +```bash +make corpus-download +``` + +This is the same `scripts/download_corpora.sh` CI uses. It pulls +SpreadsheetBench v0.1 plus a few legacy XLSX samples; only +SpreadsheetBench matters for retrieval scoring. + +Layout you should end up with: + +``` +data/corpora/spreadsheetbench/ + all_data_912_v0.1/ + dataset.json # 912 (question, answer_position) tuples + spreadsheet/ + / + 1__input.xlsx # test case 1 input + 1__answer.xlsx # test case 1 ground-truth output + 2_..., 3_... # additional test cases per instance +``` + +`data/` is gitignored — never commit corpus files. + +### 3. 
Run the benchmark + +A typical iteration cycle uses a sample for speed: + +```bash +python scripts/eval_retrieval.py \ + --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \ + --parsers ks \ + --sample 60 \ + --emit-failures +``` + +Flags worth knowing: + +| Flag | What it does | +|-----------------------|-------------------------------------------------------------| +| `--parsers ks,docling`| Score one or both parsers. Docling is heavy; skip unless comparing. | +| `--sample N` | Random N-instance subset (seeded). Omit for the full 912. | +| `--seed 1337` | Random seed for `--sample`. Stays stable across runs. | +| `--emit-failures` | Also write `failures.ndjson` with top-8 chunks per miss. | +| `--test-case 1` | Which of the (usually 3) test cases per instance to score. | +| `--per-parser-timeout`| Wall-clock seconds before a hung parse is killed. Default 60. | + +For a full run, drop `--sample` and add `--per-parser-timeout 120`: + +```bash +python scripts/eval_retrieval.py \ + --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \ + --parsers ks \ + --emit-failures \ + --per-parser-timeout 120 +``` + +`make bench-retrieval` is the same thing with the canonical defaults. + +### 4. Read the triage report + +```bash +python scripts/triage_recall.py tests/benchmarks/reports/retrieval +``` + +This auto-finds the most recent run and prints: + +* **Bucket histogram** ranked by count. The top bucket is the + highest-leverage thing to fix next. +* **3 example failures** per bucket showing the question, the + ground-truth answer cell + values, and the top-8 chunks the parser + produced (with a ✓ next to chunks that contain the answer). + +Drill into one bucket: + +```bash +python scripts/triage_recall.py tests/benchmarks/reports/retrieval \ + --bucket answer_absent_from_chunks --examples 10 +``` + +Five buckets and what they mean: + +| Bucket | Root cause | Fix lives in | +|---|---|---| +| `answer_absent_from_chunks` | Answer value in NO chunk. 
Cell dropped or rendered as formula. | `parsers/cell_parser.py`, `rendering/text_renderer.py` | +| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. Chunk too big/heterogeneous. | `chunking/chunker.py` | +| `wrong_sheet` | Answer sheet never chunked. | `parsers/workbook_parser.py` | +| `geometric_no_overlap` | Text matches but the chunk's A1 range doesn't overlap GT. | `annotation/block_splitter.py`, `analysis/pattern_splitter.py` | +| `no_chunks` / `parse_error` | Upstream parser failure. | The exception. | + +See `docs/recall-investigation.md` for the named hypotheses behind each +bucket and `.claude/skills/recall-failure-triage/SKILL.md` for the +agent-driven loop. + +### 5. Append to history.jsonl + +```bash +python scripts/append_bench_history.py +``` + +Appends one row per benchmark run to +`tests/benchmarks/reports/history.jsonl` tagged with the current git +commit, and prints the row-over-row delta on the headline metrics: + +``` +appended to tests/benchmarks/reports/history.jsonl: + commit 421783f recall_text@5=0.704 recall_text@1=0.580 + recall_text@5: 0.6800 → 0.7040 ▲ +0.0240 +``` + +That's how "is recall improving?" gets answered. Goal: `recall_text@5 > 0.90`. + +`make bench-track` chains eval + history-append + triage in one go. + +## Docker path (matches CI exactly) + +When you want to make sure local results aren't drifting from CI: + +```bash +docker build -f Dockerfile.bench -t ks-xlsx-parser-bench . + +# Quick sanity (60 instances, ~3 min after image load): +docker run --rm \ + -e BENCH_SAMPLE=60 \ + -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \ + -v "$PWD/data:/app/data" \ + ks-xlsx-parser-bench + +# Full corpus: +docker run --rm \ + -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \ + -v "$PWD/data:/app/data" \ + ks-xlsx-parser-bench +``` + +The image pre-warms the embedding model at build time so the first +`docker run` doesn't pay the 130 MB download. 
+ +Environment knobs (also work in CI dispatch): + +| Env var | Default | What it does | +|------------------|---------|-------------------------------------------| +| `BENCH_SAMPLE` | `0` | Sample N instances (0 = full 912) | +| `BENCH_PARSERS` | `ks` | Comma list (e.g. `ks,docling`) | +| `BENCH_TIMEOUT` | `120` | Per-file parse timeout in seconds | + +## Adding a new failure bucket + +If you find a recall failure mode that doesn't fit any of the existing +six buckets, add it instead of stuffing it into `answer_absent_from_chunks`: + +1. Append the new bucket name to `FAILURE_BUCKETS` in + `scripts/eval_retrieval.py`. +2. Update `classify_text_failure` so it can return the new name. Keep + the predicate cheap — it runs once per scored instance. +3. Add the bucket + a one-line root cause + fix location to the table + in `docs/recall-investigation.md` and the SKILL file. +4. Re-run `make bench-track`; confirm the histogram shows the new + bucket and counts make sense. + +## Troubleshooting + +* `ModuleNotFoundError: No module named 'sentence_transformers'` + — you skipped `pip install -e ".[bench]"`. +* `dataset.json not found in ...` — your `--corpus` is pointing at + `data/corpora/spreadsheetbench`, not the nested + `data/corpora/spreadsheetbench/all_data_912_v0.1`. The benchmark + expects the leaf directory that contains `dataset.json`. +* `FileNotFoundError: 1__input.xlsx` — the corpus tarball didn't + fully extract. Delete `data/corpora/spreadsheetbench/` and re-run + `make corpus-download`. +* `recall_text@5 = 0.0` on a sample of 5 — small samples have huge + variance, even with a fixed seed. Bump to `--sample 60` minimum + before trusting the number; use `--sample 0` (full corpus) for a + decision-grade comparison. +* MPS / CUDA torch errors on first sentence-transformers import — + re-install with `pip install --upgrade torch torchvision`. The + benchmark runs fine on CPU. + +## See also + +* `scripts/eval_retrieval.py` — the benchmark itself. 
+* `scripts/triage_recall.py` — the bucket histogram + exemplar dump. +* `scripts/append_bench_history.py` — history.jsonl row writer. +* `Dockerfile.bench` — reproducible benchmark image. +* `docs/recall-investigation.md` — diagnosis framework & hypotheses. +* `.claude/skills/recall-failure-triage/SKILL.md` — agent guide. diff --git a/docs/recall-investigation.md b/docs/recall-investigation.md new file mode 100644 index 0000000..ee630fd --- /dev/null +++ b/docs/recall-investigation.md @@ -0,0 +1,176 @@ +# Retrieval-recall investigation — getting ks-xlsx-parser to >0.90 + +## Where we are (v0.2.0 on SpreadsheetBench, 912 instances) + +| Metric | ks-xlsx-parser | docling 2.93 | +|------------------------|----------------|--------------| +| Parse success | 99.945% | not run at scale | +| Recall@1 (text-match) | 0.580 | 0.579 | +| Recall@3 (text-match) | 0.697 | 0.670 | +| Recall@5 (text-match) | **0.704** | 0.686 | +| Recall@5 (geometric) | 0.369 | 0.000 (no A1 anchors) | +| Mean parse time | 251 ms | 265 ms | + +Recall@5 = 0.704 means **~30% of questions miss** with k=5. To reach 0.90 +we need to roughly cut the miss rate to a third. A single recall number +hides which lever to pull, which is why this branch ships failure +bucketing (`scripts/eval_retrieval.py --emit-failures` + +`scripts/triage_recall.py`). + +## The diagnosis framework — why the bucket histogram is the answer + +Every recall@5 miss falls into one of these buckets. The fix is +completely different per bucket, and only one or two will dominate. The +job of the investigator is to read the histogram FIRST, then commit to +fixing the biggest one. + +| Bucket | What it means | Where to look | +|---|---|---| +| `answer_absent_from_chunks` | Answer value is in NO chunk. Cell was dropped or garbled. | `parsers/cell_parser.py`, `rendering/text_renderer.py::_cell_render_value` | +| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. 
Chunk is too large/heterogeneous; the embedding is diluted. | `chunking/chunker.py` (no token cap), `analysis/table_assembler.py` (over-merging) | +| `wrong_sheet` | Answer sheet was never chunked. Sheet enumeration missed it. | `parsers/workbook_parser.py` sheet loop | +| `geometric_no_overlap` | No chunk's A1 range overlaps ground truth. Range bookkeeping drifts during merge/split. | `annotation/block_splitter.py`, `analysis/pattern_splitter.py` | +| `no_chunks` / `parse_error` | Upstream parser failure. | The parse exception — fix the crash. | + +## First-run findings (200-sample seed=1337, May 2026) + +Headline numbers refuted my **H1** ("present_but_ranked_low dominates") +and surfaced something the original analysis missed entirely. Of 157 +failures in the seed=1337 sample: + +* **127 (81%) are `instruction_requires_execution`** — the benchmark + is asking the system to *compute and write* the answer. The + `answer_position` cell range in `input.xlsx` is empty by design. A + parser cannot retrieve what isn't there. These instances need to be + filtered from the headline metric (TODO 05). +* **30 are truly actionable** parser failures, clustering into 4 named + buckets (TODOs 00, 01, 02, 03, 04 in + [docs/planning/recall-90/](./planning/recall-90/)). +* Zero of the 30 are `present_but_ranked_low`. Chunk-size dilution + may matter on the full 912-corpus but is NOT the dominant problem + on the 200-sample. + +The dominant **citation-grade killer** turned out to be +`text_hit_geom_miss` (84 of 200, mostly inside the 127 execution +instances; 9 are actionable parser bugs). The chunk text contains +the answer string, but the chunk's claimed A1 range doesn't overlap +ground truth — exactly **H3** ("range-bookkeeping drift"). + +When the in-scope filter is applied, real `recall_text@5` is +**0.59**, not 0.635. Closing the 30 actionable failures gets us +near 0.90 on the in-scope metric. 
+ +See [docs/planning/recall-90/](./planning/recall-90/) for the +ranked TODO list. + +## A priori hypotheses (now confirmed/refuted; left as history) + +### H1 — `present_but_ranked_low` is the biggest bucket + +There is no per-chunk token cap in `chunking/chunker.py` (`CHARS_PER_TOKEN` +is only used to *report* `token_count`, never to split). On +SpreadsheetBench many input files are single-sheet ledgers where the +block-assembler collapses the whole sheet into one chunk. The +sentence-transformer query embedding then has to compete against ~2k +tokens of mostly irrelevant text; the relevant ~5 tokens get washed out. + +If H1 is right, the histogram will show `present_but_ranked_low` ≫ the +others, and recall@1 (0.580) will be much worse than recall@5 (0.704) +— exactly what we observe (Δ = 12.4 pp, vs typical Δ ≈ 5–6 pp when +chunks are right-sized). + +**Fix**: hard cap chunks at ~512 tokens by row-splitting tables and add +a "row group" sub-chunk for tall tables. This is a 1–2 day surgical +change in `chunking/chunker.py`. + +### H2 — `answer_absent_from_chunks` dominates the geometric gap + +`parsers/workbook_parser.py` loads both `data_only=False` (formula +expressions) and `data_only=True` (computed values). But what flows into +`render_text` is whichever `_cell_render_value` picks. If the cell is a +formula like `=SUM(B2:B10)`, `display_value` may be the *expression* +when the workbook was saved without cached values (LibreOffice and some +generated files do this). Those answer cells become unfindable by text +match even though the data IS in the spreadsheet. + +**Diagnostic**: count failure rows where every `top_chunks[*].text` +matches the formula expression pattern (`=`, function name) but not the +expected numeric value. The bucket emits the top-8 chunks for inspection. 
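A minimal sketch of the H2 diagnostic described above, assuming the `failures.ndjson` row shape written by `eval_retrieval.py --emit-failures` (`failure_bucket`, `top_chunks[*].text`, `top_chunks[*].contains_answer`). The regex and the helper names `looks_like_formula` and `count_formula_only_misses` are illustrative assumptions, not existing repo code.

```python
from __future__ import annotations

import json
import re
from pathlib import Path

# Assumption: "=" followed by a function name and "(" is a good-enough signal
# that a chunk rendered a formula expression instead of its computed value.
FORMULA_RE = re.compile(r"=\s*[A-Za-z][A-Za-z0-9_.]*\s*\(")


def looks_like_formula(text: str) -> bool:
    """True if the chunk text contains something shaped like =FUNC(...)."""
    return bool(FORMULA_RE.search(text or ""))


def count_formula_only_misses(rows: list[dict]) -> int:
    """Count answer_absent_from_chunks rows where every dumped chunk shows a
    formula expression and none of them contains the expected answer."""
    n = 0
    for row in rows:
        if row.get("failure_bucket") != "answer_absent_from_chunks":
            continue
        chunks = row.get("top_chunks") or []
        if not chunks:
            continue  # nothing to inspect for this row
        if all(
            looks_like_formula(c.get("text", "")) and not c.get("contains_answer")
            for c in chunks
        ):
            n += 1
    return n


def load_rows(path: Path) -> list[dict]:
    """Read an NDJSON file: one JSON object per non-empty line."""
    return [json.loads(ln) for ln in path.read_text().splitlines() if ln.strip()]
```

A high count relative to the size of the `answer_absent_from_chunks` bucket would support H2. A low count suggests the cells are being dropped for some other reason.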
+ +**Fix**: when the cached value is missing for a formula cell, evaluate it +with our own formula engine (`formula/formula_parser.py` already exists) +or use python-calamine's value-only pass as the source of truth for +render text — never the formula source. + +### H3 — `geometric_no_overlap` is high because block ranges over-extend + +Geometric recall@5 = 0.369 means **only ~37% of the time** does the +chunk a parser surfaces actually cover the ground-truth answer cell — +even when the text match works. The block-detection pipeline merges +sparse blocks (`analysis/light_block_detector.py`) and groups by +similarity (`analysis/table_grouper.py`). Each merge widens the +top-left/bottom-right anchors. If the anchors are widened past the +sheet's true content, downstream citation overlays in ks-backend will +highlight whitespace, and the geometric metric registers the chunk as +"not overlapping" because its claimed range is so large it's not useful. + +**Fix**: after every merge/split, clip `cell_range` to the tight bounding +box of the cells that actually contributed text. Add an invariant test +that `block.cell_range` ⊆ `bounding_box(block.cells)`. + +## How to confirm — the next benchmark run + +1. `make corpus-download` (one-time, ~hundreds of MB). +2. `make bench-track` — runs the full benchmark, appends to + `tests/benchmarks/reports/history.jsonl`, prints the bucket triage. +3. Read the histogram. Pick the biggest bucket. Open 3–5 example + failures with `python scripts/triage_recall.py + --bucket --examples 5`. Each row shows: + * the natural-language question + * the ground-truth answer cell + values + * the top-8 ranked chunks we produced (sheet, A1 range, text snippet) + * whether each chunk contains the answer +4. Pattern-match across 5 examples — what's the common parser behaviour? + That tells you the fix. +5. Implement, re-run `make bench-track`. The script prints the delta + row-over-row so improvement is visible immediately. 
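Steps 3 and 4 above can also be done by hand against `failures.ndjson`. The following is a minimal sketch, assuming the row fields written by `eval_retrieval.py --emit-failures`. `biggest_bucket` and `examples_for` are hypothetical helper names; `scripts/triage_recall.py` remains the supported path.

```python
from __future__ import annotations

import json
from collections import Counter
from pathlib import Path


def load_failures(path: Path) -> list[dict]:
    """Read failures.ndjson: one JSON object per non-empty line."""
    return [json.loads(ln) for ln in path.read_text().splitlines() if ln.strip()]


def biggest_bucket(rows: list[dict]) -> tuple[str, int] | None:
    """Return (bucket, count) for the most common failure_bucket, if any."""
    counts = Counter(r["failure_bucket"] for r in rows if r.get("failure_bucket"))
    return counts.most_common(1)[0] if counts else None


def examples_for(rows: list[dict], bucket: str, n: int = 5) -> list[dict]:
    """First n rows of a bucket, trimmed to the fields worth eyeballing."""
    picked = [r for r in rows if r.get("failure_bucket") == bucket][:n]
    return [
        {
            "instance_id": r.get("instance_id"),
            "instruction": (r.get("instruction") or "")[:200],
            "answer_values": r.get("answer_values", []),
            # (rank, sheet, A1 range, contains_answer) per dumped chunk
            "chunks": [
                (c.get("rank"), c.get("sheet"), c.get("range"), c.get("contains_answer"))
                for c in r.get("top_chunks", [])
            ],
        }
        for r in picked
    ]
```

Reading five trimmed examples side by side is usually enough to see whether the misses share a sheet layout, a cell type, or a range pattern.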
+ +## How to use the Docker image (CI + reproducibility) + +```bash +# Build once +docker build -f Dockerfile.bench -t ks-xlsx-parser-bench . + +# Quick smoke (60 instances, < 2 min) +docker run --rm -e BENCH_SAMPLE=60 ks-xlsx-parser-bench + +# Full corpus, persist reports + corpus cache +docker run --rm \ + -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \ + -v "$PWD/data:/app/data" \ + ks-xlsx-parser-bench +``` + +The `Benchmark` GitHub workflow: +* Runs a 60-instance smoke on every PR that touches `src/` or the + benchmark scripts. +* Runs the full 912-instance corpus weekly (Monday 06:00 UTC) and on + manual dispatch. +* Uploads `tests/benchmarks/reports/*` as a build artifact and posts the + recall summary to the job step summary. + +## Goal & cadence + +Target: **text recall@5 ≥ 0.90** by end of the current quarter. + +Track in `tests/benchmarks/reports/history.jsonl` (commit-over-commit +row append). Refuse merges that drop recall@5 by ≥ 2 pp on the sample +run (planned gate; today the PR job is reporting-only). + +## See also + +* `scripts/eval_retrieval.py` — the benchmark itself. +* `scripts/triage_recall.py` — bucket histogram + example dump. +* `scripts/append_bench_history.py` — history.jsonl row writer. +* `.claude/skills/recall-failure-triage/SKILL.md` — agent guide. +* `Dockerfile.bench` — reproducible benchmark image. 
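The planned merge gate mentioned under "Goal & cadence" (refuse a drop of 2 pp or more in recall@5) could look roughly like the sketch below. It assumes the `history.jsonl` row shape written by `scripts/append_bench_history.py` and that the last two rows come from comparable runs (same parser, same sample size). `should_block` and `gate_exit_code` are hypothetical names; no such gate exists in the repo today.

```python
from __future__ import annotations

import json
from pathlib import Path

MAX_DROP = 0.02  # 2 percentage points, per the planned gate


def recall_drop(prev: dict, cur: dict, key: str = "recall_text@5") -> float | None:
    """Drop in recall from prev to cur (positive = regression), or None if
    either row lacks a numeric value for the key."""
    p, c = prev.get(key), cur.get(key)
    if not isinstance(p, (int, float)) or not isinstance(c, (int, float)):
        return None
    # Round so that e.g. 0.70 -> 0.68 counts as exactly 0.02 despite
    # floating-point representation error.
    return round(p - c, 6)


def should_block(prev: dict, cur: dict, max_drop: float = MAX_DROP) -> bool:
    """True if recall fell by at least max_drop between two history rows."""
    drop = recall_drop(prev, cur)
    return drop is not None and drop >= max_drop


def gate_exit_code(history_path: Path) -> int:
    """Return 1 if the latest row regressed past the threshold, else 0."""
    rows = [json.loads(ln) for ln in history_path.read_text().splitlines() if ln.strip()]
    if len(rows) < 2:
        return 0  # nothing to compare against yet
    return 1 if should_block(rows[-2], rows[-1]) else 0
```

A CI step could call `gate_exit_code(Path("tests/benchmarks/reports/history.jsonl"))` and pass the result to `sys.exit`.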
diff --git a/examples/demo.py b/examples/demo.py index dbf6118..de59151 100644 --- a/examples/demo.py +++ b/examples/demo.py @@ -12,8 +12,8 @@ # Add src to path for development sys.path.insert(0, str(Path(__file__).parent.parent / "src")) -from xlsx_parser.pipeline import parse_workbook -from xlsx_parser.utils.logging_config import configure_logging +from ks_xlsx_parser.pipeline import parse_workbook +from ks_xlsx_parser.utils.logging_config import configure_logging EXAMPLES_DIR = Path(__file__).parent / "fixtures" @@ -136,7 +136,7 @@ def demo_engineering_calcs(): print(f" Named Ranges: {[nr.name for nr in wb.named_ranges]}") # Show dependency chain for Design Moment (C15) - from xlsx_parser.models import CellCoord + from ks_xlsx_parser.models import CellCoord upstream = wb.dependency_graph.get_upstream( "Beam Design", CellCoord(row=15, col=3), max_depth=3 ) diff --git a/examples/generate_examples.py b/examples/generate_examples.py index 8d25e01..245caa2 100644 --- a/examples/generate_examples.py +++ b/examples/generate_examples.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 """ -Generate example Excel workbooks for demonstrating the xlsx_parser. +Generate example Excel workbooks for demonstrating the ks_xlsx_parser. Creates several representative workbooks in the examples/ folder that showcase the parser's capabilities across different Excel features. 
diff --git a/examples/stress_test/stress_test_runner.py b/examples/stress_test/stress_test_runner.py index 2551765..98128fa 100644 --- a/examples/stress_test/stress_test_runner.py +++ b/examples/stress_test/stress_test_runner.py @@ -18,7 +18,7 @@ PROJECT_ROOT = Path(__file__).parent.parent.parent sys.path.insert(0, str(PROJECT_ROOT / "src")) -from xlsx_parser.pipeline import parse_workbook +from ks_xlsx_parser.pipeline import parse_workbook STRESS_DIR = Path(__file__).parent diff --git a/pyproject.toml b/pyproject.toml index a425853..bbaa8ac 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -46,7 +46,7 @@ Repository = "https://github.com/knowledgestack/ks-xlsx-parser" Documentation = "https://github.com/knowledgestack/ks-xlsx-parser#readme" [project.scripts] -xlsx-parser-api = "api:main" +xlsx-parser-api = "ks_xlsx_parser.api:main" [project.optional-dependencies] api = [ @@ -64,6 +64,12 @@ dev = [ "ruff>=0.6.0", "mypy>=1.0", ] +# Retrieval-recall benchmark (scripts/eval_retrieval.py). Heavy — only the +# benchmark Docker image and `make bench-retrieval` need these. +bench = [ + "sentence-transformers>=2.2.0", + "numpy>=1.24.0", +] [tool.pytest.ini_options] testpaths = ["tests"] @@ -80,6 +86,10 @@ addopts = "-m 'not corpus'" [tool.setuptools.packages.find] where = ["src"] +include = ["ks_xlsx_parser*"] + +[tool.setuptools.package-data] +ks_xlsx_parser = ["py.typed"] [tool.ruff] line-length = 110 diff --git a/scripts/append_bench_history.py b/scripts/append_bench_history.py new file mode 100755 index 0000000..01def1f --- /dev/null +++ b/scripts/append_bench_history.py @@ -0,0 +1,101 @@ +#!/usr/bin/env python3 +"""Append the latest retrieval-benchmark run to a commit-over-commit history. + +``eval_retrieval.py`` writes a timestamped ``summary.json`` per run. This +script picks the most recent one, flattens the headline metrics, tags it +with the current git commit, and appends one JSON line to +``tests/benchmarks/reports/history.jsonl``. 
+ +That history file is what makes "is recall improving over time?" answerable +— plot it, diff it in CI, or just ``tail`` it. Goal: text recall@5 > 0.90. + +Usage: + python scripts/append_bench_history.py + python scripts/append_bench_history.py --reports-dir tests/benchmarks/reports/retrieval +""" +from __future__ import annotations + +import argparse +import json +import subprocess +from datetime import UTC, datetime +from pathlib import Path + +ROOT = Path(__file__).resolve().parent.parent +HISTORY = ROOT / "tests" / "benchmarks" / "reports" / "history.jsonl" + + +def git_commit() -> str: + try: + return subprocess.check_output( + ["git", "rev-parse", "--short", "HEAD"], cwd=ROOT, text=True + ).strip() + except Exception: + return "unknown" + + +def latest_summary(reports_dir: Path) -> Path: + summaries = sorted(reports_dir.glob("*/summary.json")) + if not summaries: + raise SystemExit( + f"no summary.json under {reports_dir} — run `make bench-retrieval` first" + ) + return summaries[-1] + + +def main(argv: list[str] | None = None) -> int: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--reports-dir", type=Path, + default=ROOT / "tests" / "benchmarks" / "reports" / "retrieval") + ap.add_argument("--parser", default="ks-xlsx-parser", + help="which parser's metrics to record") + args = ap.parse_args(argv) + + summary_path = latest_summary(args.reports_dir) + summary = json.loads(summary_path.read_text()) + metrics = summary.get(args.parser) + if metrics is None: + raise SystemExit( + f"parser {args.parser!r} not in {summary_path}; " + f"have: {list(summary)}" + ) + + row = { + "timestamp": datetime.now(UTC).isoformat(), + "commit": git_commit(), + "parser": args.parser, + "run": summary_path.parent.name, + "instances": metrics.get("instances"), + "recall_text@1": metrics.get("recall_text@1"), + "recall_text@3": metrics.get("recall_text@3"), + "recall_text@5": metrics.get("recall_text@5"), + "recall_geometric@5": 
metrics.get("recall_geometric@5"), + "table_fragmentation_rate": metrics.get("table_fragmentation_rate"), + "mean_parse_ms": metrics.get("mean_parse_ms"), + "errors": metrics.get("errors"), + "failure_buckets": metrics.get("failure_buckets"), + } + + HISTORY.parent.mkdir(parents=True, exist_ok=True) + with HISTORY.open("a") as f: + f.write(json.dumps(row, separators=(",", ":")) + "\n") + + print(f"appended to {HISTORY.relative_to(ROOT)}:") + print(f" commit {row['commit']} recall_text@5={row['recall_text@5']} " + f"recall_text@1={row['recall_text@1']}") + + # Show the trend if there's history to compare against. + rows = [json.loads(ln) for ln in HISTORY.read_text().splitlines() if ln.strip()] + if len(rows) >= 2: + prev, cur = rows[-2], rows[-1] + for k in ("recall_text@5", "recall_text@1"): + p, c = prev.get(k), cur.get(k) + if isinstance(p, int | float) and isinstance(c, int | float): + delta = c - p + arrow = "▲" if delta > 0 else ("▼" if delta < 0 else "—") + print(f" {k}: {p:.4f} → {c:.4f} {arrow} {delta:+.4f}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/enrich_failures.py b/scripts/enrich_failures.py new file mode 100755 index 0000000..a9e4452 --- /dev/null +++ b/scripts/enrich_failures.py @@ -0,0 +1,348 @@ +#!/usr/bin/env python3 +"""Enrich a benchmark run with per-instance diagnostics for failure clustering. + +`eval_retrieval.py --emit-failures` only sees *text-match* misses. For +citation-grade scoring (geometric recall@5 = 0.283) we also need to +classify instances where the answer text is in some chunk but no chunk's +A1 range covers the ground truth — those don't show up in failures.ndjson +at all. 
+ +This script re-parses each instance's input.xlsx with both ks-xlsx-parser +and openpyxl, then emits one row per FAILED instance (text-miss OR +geometric-miss) with diagnostic columns chosen so post-hoc clustering is +easy: + + instance_id question id + bucket_combined both_miss / text_hit_geom_miss / text_miss_geom_hit + answer_position the GT spec from dataset.json + gt_sheet ground-truth sheet name (default: answer_sheet) + gt_cell_raw openpyxl raw value at the first GT cell + gt_cell_formula formula string if any + gt_cell_data_only cached computed value if any + gt_in_workbook_sheets is gt_sheet in wb.sheetnames? + gt_in_chunked_sheets did the parser produce any chunk on gt_sheet? + n_workbook_sheets total sheets in the workbook (incl hidden) + n_chunked_sheets distinct sheets we emitted chunks for + n_workbook_cells_in_gt non-empty openpyxl cells in the GT range + chunks_on_gt_sheet how many chunks we emitted for gt_sheet + chunk_bbox_on_gt_sheet bbox (min_r,min_c,max_r,max_c) over all chunks on gt_sheet + gt_range_bbox GT range as (r0,c0,r1,c1) + text_match_rank rank of first chunk whose text contains a GT value + geom_match_rank rank of first chunk whose A1 range overlaps GT + +Usage: + python scripts/enrich_failures.py +""" +from __future__ import annotations + +import argparse +import json +import sys +from pathlib import Path +from typing import Any + +# Hoist eval_retrieval helpers so we share the *exact* normalization +# logic — clustering by "answer present in chunk" must be apples-to-apples. 
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) +from scripts.eval_retrieval import ( # noqa: E402 + _matches_chunk_text, + parse_a1, + parse_position_spec, + parse_range, +) + +REPO_ROOT = Path(__file__).resolve().parent.parent + + +def find_run(arg: Path) -> Path: + if arg.is_file(): + return arg.parent + if arg.is_dir(): + if (arg / "results.ndjson").exists(): + return arg + runs = sorted(p for p in arg.glob("*/results.ndjson")) + if runs: + return runs[-1].parent + sys.exit(f"ERROR: no results.ndjson at {arg}") + + +def overlaps(box1: tuple[int, int, int, int], box2: tuple[int, int, int, int]) -> bool: + r0a, c0a, r1a, c1a = box1 + r0b, c0b, r1b, c1b = box2 + return not (r1a < r0b or r0a > r1b or c1a < c0b or c0a > c1b) + + +def chunk_bbox(chunks) -> tuple[int, int, int, int] | None: + boxes = [] + for c in chunks: + tl = parse_a1(c.top_left_cell) if c.top_left_cell else None + br = parse_a1(c.bottom_right_cell) if c.bottom_right_cell else None + if tl and br: + boxes.append((tl[0], tl[1], br[0], br[1])) + if not boxes: + return None + return ( + min(b[0] for b in boxes), + min(b[1] for b in boxes), + max(b[2] for b in boxes), + max(b[3] for b in boxes), + ) + + +def enrich(run_dir: Path, corpus: Path, out_path: Path) -> None: + from openpyxl import load_workbook + + from ks_xlsx_parser.pipeline import parse_workbook + + # Load dataset.json once — we need question text + the original + # answer_sheet attribution for instances where the rank scoring + # already considers the spec parse-resolved. 
+ dataset = {str(d["id"]): d for d in + json.loads((corpus / "dataset.json").read_text())} + + results_path = run_dir / "results.ndjson" + rows: list[dict[str, Any]] = [] + for line in results_path.read_text().splitlines(): + if line.strip(): + rows.append(json.loads(line)) + + out_rows: list[dict[str, Any]] = [] + n_failed = 0 + for rec in rows: + if rec.get("error"): + continue + text_rank = rec.get("rank_of_text_match") + geom_rank = rec.get("rank_of_first_overlap") + text_hit = text_rank is not None and text_rank <= 5 + geom_hit = geom_rank is not None and geom_rank <= 5 + if text_hit and geom_hit: + continue # not a failure either way + n_failed += 1 + + inst_id = rec["instance_id"] + meta = dataset.get(inst_id, {}) + instruction = meta.get("instruction", "") + answer_sheet = meta.get("answer_sheet") or None + if answer_sheet and "," in answer_sheet: + answer_sheet = answer_sheet.split(",")[0].strip() + answer_position = rec.get("answer_position") or meta.get("answer_position") or "" + data_position = meta.get("data_position") or "" + + # Re-parse input.xlsx — cheap (50 ms median). + inst_dir = corpus / "spreadsheet" / inst_id + input_path = inst_dir / f"1_{inst_id}_input.xlsx" + if not input_path.exists(): + continue + + try: + result = parse_workbook(path=str(input_path)) + chunks = list(result.chunks) + except Exception as exc: + out_rows.append({ + "instance_id": inst_id, + "bucket_combined": "parse_error", + "error": f"{type(exc).__name__}: {exc}", + }) + continue + + chunked_sheets = sorted({c.sheet_name for c in chunks if c.sheet_name}) + + # openpyxl view — both formula and data_only passes so we can + # tell "formula uncached" from "cell genuinely empty". 
+ try: + wb_f = load_workbook(str(input_path), data_only=False, read_only=False) + wb_d = load_workbook(str(input_path), data_only=True, read_only=False) + wb_sheets = list(wb_f.sheetnames) + hidden_sheets = [s for s in wb_sheets + if getattr(wb_f[s], "sheet_state", "visible") != "visible"] + except Exception as exc: + wb_sheets = [] + hidden_sheets = [] + wb_f = wb_d = None + + # Geometric overlap is scored against data_position (the input data + # region the question is asking about); falls back to answer_position + # when the dataset didn't fill data_position in (561 of 912 instances). + # This mirrors eval_retrieval.py's geom_spec = data_pos or answer_pos. + geom_spec = data_position or answer_position + regions = parse_position_spec(geom_spec, answer_sheet) + gt_sheet = regions[0][0] if regions else answer_sheet + gt_range_bbox = regions[0][1] if regions else None + + # Separately track whether the *answer* region is empty in input.xlsx. + # If so, the question is "compute X and write here" — the parser + # cannot possibly contain the answer text, so this is a benchmark + # construct, not a parser bug. We flag it as instruction_requires_execution. 
+ answer_regions = parse_position_spec(answer_position, answer_sheet) + answer_sheet_resolved = (answer_regions[0][0] + if answer_regions else answer_sheet) + answer_range_bbox = answer_regions[0][1] if answer_regions else None + n_input_cells_in_answer_range = 0 + if (wb_d and answer_sheet_resolved and answer_range_bbox + and answer_sheet_resolved in wb_d.sheetnames): + try: + ws = wb_d[answer_sheet_resolved] + r0, c0, r1, c1 = answer_range_bbox + for row in ws.iter_rows(min_row=r0, max_row=r1, min_col=c0, + max_col=c1, values_only=True): + for v in row: + if v is not None and str(v).strip(): + n_input_cells_in_answer_range += 1 + except Exception: + pass + + gt_cell_raw = None + gt_cell_formula = None + gt_cell_data_only = None + n_workbook_cells_in_gt = 0 + if wb_f and gt_sheet and gt_sheet in wb_f.sheetnames and gt_range_bbox: + ws_f = wb_f[gt_sheet] + ws_d = wb_d[gt_sheet] + r0, c0, r1, c1 = gt_range_bbox + # First cell only — enough to know "formula vs. value". + try: + tl_cell_f = ws_f.cell(row=r0, column=c0) + tl_cell_d = ws_d.cell(row=r0, column=c0) + gt_cell_raw = tl_cell_f.value + if isinstance(gt_cell_raw, str) and gt_cell_raw.startswith("="): + gt_cell_formula = gt_cell_raw + gt_cell_data_only = tl_cell_d.value + except Exception: + pass + # Count non-empty cells across the range + try: + for row in ws_d.iter_rows(min_row=r0, max_row=r1, min_col=c0, + max_col=c1, values_only=True): + for v in row: + if v is not None and str(v).strip(): + n_workbook_cells_in_gt += 1 + except Exception: + pass + + chunks_on_gt = [c for c in chunks if gt_sheet and c.sheet_name == gt_sheet] + gt_chunk_bbox = chunk_bbox(chunks_on_gt) + + if not text_hit and not geom_hit: + bucket = "both_miss" + elif text_hit and not geom_hit: + bucket = "text_hit_geom_miss" + elif geom_hit and not text_hit: + bucket = "text_miss_geom_hit" + else: + bucket = "other" + + sheet_chunked = (gt_sheet in chunked_sheets) if gt_sheet else None + sheet_in_wb = (gt_sheet in wb_sheets) if (gt_sheet 
and wb_sheets) else None + sheet_hidden = (gt_sheet in hidden_sheets) if gt_sheet else False + + # Pre-named signal heuristics — informational only; clustering is + # still done by reading. These flags help spot patterns FAST. + flags: list[str] = [] + if sheet_in_wb is False: + flags.append("gt_sheet_missing_from_workbook") + elif sheet_chunked is False: + flags.append("gt_sheet_present_but_not_chunked") + if sheet_hidden: + flags.append("gt_sheet_hidden") + if gt_cell_formula and gt_cell_data_only in (None, ""): + flags.append("gt_cell_uncached_formula") + elif gt_cell_formula: + flags.append("gt_cell_is_formula") + if (gt_range_bbox and gt_chunk_bbox and + not overlaps(gt_range_bbox, gt_chunk_bbox)): + flags.append("gt_range_outside_chunk_bbox") + if (gt_range_bbox and gt_chunk_bbox and + overlaps(gt_range_bbox, gt_chunk_bbox)): + # Inside the chunk bbox but no individual chunk overlaps? + # That's "the parser saw the right area but split it wrong". + any_overlap = False + for c in chunks_on_gt: + tl = parse_a1(c.top_left_cell) if c.top_left_cell else None + br = parse_a1(c.bottom_right_cell) if c.bottom_right_cell else None + if tl and br and overlaps( + gt_range_bbox, (tl[0], tl[1], br[0], br[1])): + any_overlap = True + break + if not any_overlap: + flags.append("gt_inside_bbox_but_no_chunk_overlap") + if (n_workbook_cells_in_gt == 0 and gt_range_bbox): + flags.append("gt_range_empty_in_workbook") + # The big one: if answer_position is empty in input, the benchmark + # is asking the system to WRITE the answer. Not a parser bug. + if (answer_range_bbox and n_input_cells_in_answer_range == 0): + flags.append("instruction_requires_execution") + # cell rendered but truncated to a sub-range? 
+ if (gt_chunk_bbox and gt_range_bbox and + gt_chunk_bbox[2] < gt_range_bbox[2]): + flags.append("chunk_bbox_rows_truncated") + if (gt_chunk_bbox and gt_range_bbox and + gt_chunk_bbox[3] < gt_range_bbox[3]): + flags.append("chunk_bbox_cols_truncated") + + out_rows.append({ + "instance_id": inst_id, + "bucket_combined": bucket, + "instruction": instruction[:200], + "answer_position": answer_position, + "answer_sheet": answer_sheet, + "gt_sheet": gt_sheet, + "gt_range_bbox": list(gt_range_bbox) if gt_range_bbox else None, + "gt_cell_raw": str(gt_cell_raw)[:120] if gt_cell_raw is not None else None, + "gt_cell_formula": gt_cell_formula, + "gt_cell_data_only": + str(gt_cell_data_only)[:120] if gt_cell_data_only is not None else None, + "n_workbook_sheets": len(wb_sheets), + "n_chunked_sheets": len(chunked_sheets), + "wb_sheets": wb_sheets, + "hidden_sheets": hidden_sheets, + "chunked_sheets": chunked_sheets, + "n_chunks_total": len(chunks), + "n_chunks_on_gt_sheet": len(chunks_on_gt), + "n_workbook_cells_in_gt": n_workbook_cells_in_gt, + "chunk_bbox_on_gt_sheet": list(gt_chunk_bbox) if gt_chunk_bbox else None, + "rank_of_text_match": text_rank, + "rank_of_first_overlap": geom_rank, + "flags": flags, + "data_position": data_position, + "answer_range_bbox": list(answer_range_bbox) if answer_range_bbox else None, + "n_input_cells_in_answer_range": n_input_cells_in_answer_range, + }) + + out_path.write_text("\n".join(json.dumps(r, separators=(",", ":")) for r in out_rows) + "\n") + print(f"Examined {len(rows)} instances, {n_failed} failed (text OR geom).") + print(f"Wrote {len(out_rows)} enriched rows to {out_path}") + + # Quick histogram for sanity + from collections import Counter + bc = Counter(r["bucket_combined"] for r in out_rows) + fc = Counter() + for r in out_rows: + for f in r.get("flags", []): + fc[f] += 1 + print("\nCombined bucket counts:") + for b, n in bc.most_common(): + print(f" {b:<30s} {n}") + print("\nDiagnostic flags (rows can have multiple):") + for 
f, n in fc.most_common(): + print(f" {f:<40s} {n}") + + +def main(argv: list[str] | None = None) -> int: + ap = argparse.ArgumentParser(description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter) + ap.add_argument("path", type=Path, + help="run dir, results.ndjson, or parent reports dir") + ap.add_argument("--corpus", type=Path, + default=REPO_ROOT / "data/corpora/spreadsheetbench/all_data_912_v0.1") + ap.add_argument("--out", type=Path, default=None, + help="output path (default: /enriched_failures.ndjson)") + args = ap.parse_args(argv) + + run_dir = find_run(args.path) + out = args.out or (run_dir / "enriched_failures.ndjson") + enrich(run_dir, args.corpus, out) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/eval_retrieval.py b/scripts/eval_retrieval.py index 64c1743..3d4cdff 100644 --- a/scripts/eval_retrieval.py +++ b/scripts/eval_retrieval.py @@ -34,13 +34,16 @@ import signal import sys import time +from collections.abc import Iterable from dataclasses import dataclass, field from pathlib import Path -from typing import Any, Iterable +from typing import Any REPO_ROOT = Path(__file__).resolve().parent.parent +# Keep ``import scripts.X`` style imports working when invoked as +# ``python scripts/eval_retrieval.py``. We no longer need ``src`` on the +# path — ks_xlsx_parser is a properly-installed package now. 
sys.path.insert(0, str(REPO_ROOT)) -sys.path.insert(0, str(REPO_ROOT / "src")) def _normalize_value_for_match(s: str) -> set[str]: @@ -254,7 +257,7 @@ def parse_position_spec( def extract_chunks_ks(path: Path) -> list[Chunk]: - from pipeline import parse_workbook + from ks_xlsx_parser.pipeline import parse_workbook result = parse_workbook(path=str(path)) out: list[Chunk] = [] @@ -442,6 +445,62 @@ class InstanceResult: extra: dict[str, Any] = field(default_factory=dict) +# ────────────────────────────────────────────────────────── failure triage +# +# recall@5 sitting at ~0.70 means ~30% of questions miss. A single recall +# number can't tell you WHY — and the fix is completely different per cause: +# +# * answer_absent_from_chunks → the answer value is in NO chunk at all. +# The parser dropped/garbled the cell. Fix the EXTRACTION pipeline. +# * present_but_ranked_low → a chunk DOES contain the answer, but the +# embedding model ranked it past k. Fix CHUNKING (smaller/cleaner +# chunks rank better) or the embedding step — NOT the parser. +# * wrong_sheet → answer is on a sheet we produced no chunk +# for, or mis-attributed the sheet. Fix sheet enumeration. +# * geometric_no_overlap → no chunk's A1 range overlaps ground truth +# even though text may match. Fix RANGE bookkeeping (top_left / +# bottom_right anchors drift during merge/split). +# * no_chunks / parse_error → upstream parser failure. +# +# Splitting the miss population into these buckets turns "recall is low" +# into a ranked, actionable worklist. This is the single most useful thing +# for an agent iterating on recall — see scripts/triage_recall.py. + +FAILURE_BUCKETS = ( + "answer_absent_from_chunks", + "present_but_ranked_low", + "wrong_sheet", + "geometric_no_overlap", + "no_chunks", + "parse_error", + "unparseable_ground_truth", +) + + +def classify_text_failure(rec: dict[str, Any]) -> str | None: + """Bucket a single result record for the TEXT-match recall@5 metric. 
+ + Returns None if the instance was a recall@5 hit (rank ≤ 5) or was not + scoreable (no ground-truth values to match against). + """ + if rec.get("error"): + return "no_chunks" if rec.get("n_chunks", 0) == 0 else "parse_error" + if not rec.get("had_answer_values", True): + return None # not scoreable for text-match — exclude, don't penalise + if rec.get("n_chunks", 0) == 0: + return "no_chunks" + rank = rec.get("rank_of_text_match") + if rank is not None and rank <= 5: + return None # hit + if rank is None: + # Not in any chunk. Distinguish "value never extracted" from + # "value extracted but on a sheet we never chunked". + if rec.get("answer_on_unchunked_sheet"): + return "wrong_sheet" + return "answer_absent_from_chunks" + return "present_but_ranked_low" # rank > 5 + + def score_instance( *, parser_name: str, @@ -452,8 +511,10 @@ def score_instance( answer_position: str, default_sheet: str | None, answer_cell_values: list[str], + answer_regions: list[tuple[str | None, tuple[int, int, int, int]]] | None = None, model, per_parser_timeout_s: float = 60.0, + emit_failures: bool = False, ) -> InstanceResult: import numpy as np @@ -547,6 +608,38 @@ def score_instance( rank_text = r break + # Did the parser produce ANY chunk on the sheet(s) the answer lives on? + # If not, a text miss is a sheet-enumeration bug, not a cell-drop bug. + answer_on_unchunked_sheet = False + if answer_regions: + gt_sheets = {s for s, _ in answer_regions if s} + chunk_sheets = {c.sheet for c in chunks if c.sheet} + if gt_sheets and chunk_sheets and not (gt_sheets & chunk_sheets): + answer_on_unchunked_sheet = True + + extra: dict[str, Any] = { + "answer_on_unchunked_sheet": answer_on_unchunked_sheet, + "had_answer_values": bool(answer_cell_values), + } + if emit_failures: + # Dump the top-8 ranked chunks so a human/agent can eyeball WHY the + # answer was missed without re-running the (expensive) benchmark. 
+ top: list[dict[str, Any]] = [] + for r, idx in enumerate(ranking[:8], start=1): + c = chunks[idx] + top.append({ + "rank": r, + "sheet": c.sheet, + "range": ( + f"{c.top_left}:{c.bottom_right}" + if c.top_left else None + ), + "contains_answer": _matches_chunk_text(answer_cell_values, c.text or ""), + "text": (c.text or "")[:280], + }) + extra["top_chunks"] = top + extra["answer_values"] = answer_cell_values[:20] + return InstanceResult( instance_id=inst_id, parser=parser_name, @@ -558,6 +651,7 @@ def score_instance( chunks_overlapping_data=len(overlap_idxs), rank_of_first_overlap=rank_overlap, rank_of_text_match=rank_text, + extra=extra, ) @@ -650,6 +744,21 @@ def _recall_at(k: int, key: str) -> float: parse_times = [r.parse_ms for r in recs if not r.error] + # Failure-bucket histogram for the text-match recall@5 metric. + # Only counts instances that HAD ground-truth values to match + # against (others aren't scoreable). See classify_text_failure. + buckets = dict.fromkeys(FAILURE_BUCKETS, 0) + for r in recs: + bucket = classify_text_failure({ + "error": r.error, + "n_chunks": r.n_chunks, + "rank_of_text_match": r.rank_of_text_match, + "had_answer_values": r.extra.get("had_answer_values", True), + "answer_on_unchunked_sheet": r.extra.get("answer_on_unchunked_sheet"), + }) + if bucket is not None: + buckets[bucket] += 1 + summary[parser] = { "instances": total, "ok": ok, @@ -667,6 +776,9 @@ def _recall_at(k: int, key: str) -> float: if parse_times else None, "p50_parse_ms": round(sorted(parse_times)[len(parse_times) // 2], 2) if parse_times else None, + # Why the text-match recall@5 misses happen, bucketed. The + # biggest bucket is the highest-leverage thing to fix next. + "failure_buckets": buckets, } return summary @@ -701,6 +813,10 @@ def main(argv: list[str] | None = None) -> int: parser.add_argument("--test-case", type=int, default=1, help="Which of the (typically 3) test cases per instance " "to score on. 
We use one to keep eval costs bounded.") + parser.add_argument("--emit-failures", action="store_true", + help="Also write failures.ndjson — one row per " + "recall@5 miss with the top-8 ranked chunks and " + "ground-truth values, for failure triage.") parser.add_argument("--per-parser-timeout", type=float, default=60.0, help="Wall-clock seconds before a parser is " "considered hung on a single file (docling can " @@ -736,6 +852,7 @@ def main(argv: list[str] | None = None) -> int: ndjson_path = out_dir / "results.ndjson" results: list[InstanceResult] = [] + failure_rows: list[dict[str, Any]] = [] n = len(instances) * len(parser_fns) done = 0 @@ -786,10 +903,20 @@ def main(argv: list[str] | None = None) -> int: answer_position=answer_pos, default_sheet=default_sheet, answer_cell_values=answer_values, + answer_regions=answer_regions, model=model, per_parser_timeout_s=args.per_parser_timeout, + emit_failures=args.emit_failures, ) results.append(res) + bucket = classify_text_failure({ + "error": res.error, + "n_chunks": res.n_chunks, + "rank_of_text_match": res.rank_of_text_match, + "had_answer_values": res.extra.get("had_answer_values", True), + "answer_on_unchunked_sheet": res.extra.get( + "answer_on_unchunked_sheet"), + }) f.write(json.dumps({ "instance_id": res.instance_id, "parser": res.parser, @@ -801,8 +928,22 @@ def main(argv: list[str] | None = None) -> int: "chunks_overlapping_data": res.chunks_overlapping_data, "rank_of_first_overlap": res.rank_of_first_overlap, "rank_of_text_match": res.rank_of_text_match, + "failure_bucket": bucket, "error": res.error, }, separators=(",", ":")) + "\n") + if args.emit_failures and bucket is not None: + failure_rows.append({ + "instance_id": res.instance_id, + "parser": res.parser, + "failure_bucket": bucket, + "instruction": instr, + "answer_position": answer_pos, + "answer_values": res.extra.get("answer_values", []), + "rank_of_text_match": res.rank_of_text_match, + "n_chunks": res.n_chunks, + "top_chunks": 
res.extra.get("top_chunks", []), + "error": res.error, + }) done += 1 if done % 10 == 0: sys.stderr.write(f"\r[{done}/{n}] ") @@ -810,6 +951,13 @@ def main(argv: list[str] | None = None) -> int: sys.stderr.write(f"\nWrote {ndjson_path}\n") + if args.emit_failures: + fail_path = out_dir / "failures.ndjson" + with fail_path.open("w") as ff: + for row in failure_rows: + ff.write(json.dumps(row, separators=(",", ":")) + "\n") + sys.stderr.write(f"Wrote {fail_path} ({len(failure_rows)} failure rows)\n") + summary = aggregate(results) summary_path = out_dir / "summary.json" summary_path.write_text(json.dumps(summary, indent=2)) @@ -853,6 +1001,20 @@ def main(argv: list[str] | None = None) -> int: "(sheet, range) per chunk — docling does not, so its geometric " "recall is structurally 0.") md_lines.append("") + md_lines.append("## Failure buckets (text-match recall@5 misses)") + md_lines.append("") + md_lines.append("Why each miss happened — the biggest bucket is the highest-") + md_lines.append("leverage fix. `answer_absent_from_chunks` → fix extraction; ") + md_lines.append("`present_but_ranked_low` → fix chunking/embedding.") + md_lines.append("") + md_lines.append("| Bucket | " + " | ".join(parsers) + " |") + md_lines.append("|---|" + "|".join(["---"] * len(parsers)) + "|") + for b in FAILURE_BUCKETS: + row = [b] + for p in parsers: + row.append(str(summary[p].get("failure_buckets", {}).get(b, 0))) + md_lines.append("| " + " | ".join(row) + " |") + md_lines.append("") md_lines.append("**Text-match** = the answer cell's actual string value appears " "as a substring of the chunk's text. Parser-agnostic; this is " "the apples-to-apples retrieval comparison.") diff --git a/scripts/run_bench.sh b/scripts/run_bench.sh new file mode 100755 index 0000000..05a77fb --- /dev/null +++ b/scripts/run_bench.sh @@ -0,0 +1,56 @@ +#!/usr/bin/env bash +# Entrypoint for the benchmark Docker image (Dockerfile.bench). 
+#
+# Ensures the SpreadsheetBench corpus is present, runs the retrieval-recall
+# benchmark for ks-xlsx-parser, appends the result to history.jsonl, and
+# prints a failure-bucket triage so accuracy can be tracked over time.
+#
+# Env vars:
+#   BENCH_SAMPLE    parse only N random instances (0 / unset = full 912)
+#   BENCH_PARSERS   comma list passed to eval_retrieval.py (default: ks)
+#   BENCH_TIMEOUT   per-file parser timeout in seconds (default: 120)
+set -euo pipefail
+
+cd "$(dirname "$0")/.."
+
+CORPUS_DIR="data/corpora/spreadsheetbench"
+SAMPLE="${BENCH_SAMPLE:-0}"
+PARSERS="${BENCH_PARSERS:-ks}"
+TIMEOUT="${BENCH_TIMEOUT:-120}"
+
+if [ ! -d "$CORPUS_DIR" ]; then
+  echo "→ Downloading SpreadsheetBench corpus ..."
+  mkdir -p "$CORPUS_DIR"
+  curl -L --fail --retry 3 --connect-timeout 20 \
+    -o /tmp/sb.tar.gz \
+    "https://raw.githubusercontent.com/RUCKBReasoning/SpreadsheetBench/main/data/spreadsheetbench_912_v0.1.tar.gz"
+  tar -xzf /tmp/sb.tar.gz -C "$CORPUS_DIR"
+  rm -f /tmp/sb.tar.gz
+fi
+
+# eval_retrieval.py expects the dataset.json + spreadsheet/ dir. Find it.
+CORPUS_ARG="$CORPUS_DIR"
+if [ -d "$CORPUS_DIR/all_data_912_v0.1" ]; then
+  CORPUS_ARG="$CORPUS_DIR/all_data_912_v0.1"
+fi
+
+SAMPLE_ARG=()
+if [ "$SAMPLE" != "0" ]; then
+  SAMPLE_ARG=(--sample "$SAMPLE")
+  echo "→ Sampling $SAMPLE instances"
+fi
+
+echo "→ Running retrieval benchmark (parsers=$PARSERS) ..."
+python scripts/eval_retrieval.py \
+  --corpus "$CORPUS_ARG" \
+  --parsers "$PARSERS" \
+  --emit-failures \
+  --per-parser-timeout "$TIMEOUT" \
+  --out tests/benchmarks/reports/retrieval \
+  ${SAMPLE_ARG[@]+"${SAMPLE_ARG[@]}"}
+
+echo "→ Appending to history.jsonl ..."
+python scripts/append_bench_history.py + +echo +python scripts/triage_recall.py tests/benchmarks/reports/retrieval diff --git a/scripts/run_enterprise_metrics.py b/scripts/run_enterprise_metrics.py index 1d6821a..6366fbd 100644 --- a/scripts/run_enterprise_metrics.py +++ b/scripts/run_enterprise_metrics.py @@ -7,9 +7,8 @@ import json -from xlsx_parser import parse_workbook - from scripts.generate_enterprise_fixtures import generate_all +from xlsx_parser import parse_workbook class EnterpriseScorecard: diff --git a/scripts/track_corpus_metrics.py b/scripts/track_corpus_metrics.py index 779995b..a0cb244 100644 --- a/scripts/track_corpus_metrics.py +++ b/scripts/track_corpus_metrics.py @@ -6,7 +6,6 @@ from datetime import datetime from pathlib import Path - ROOT = Path(__file__).resolve().parent.parent METRICS_DIR = ROOT / "metrics" / "corpus" diff --git a/scripts/triage_recall.py b/scripts/triage_recall.py new file mode 100755 index 0000000..41a766a --- /dev/null +++ b/scripts/triage_recall.py @@ -0,0 +1,114 @@ +#!/usr/bin/env python3 +"""Triage retrieval-recall failures into a ranked, actionable worklist. + +Reads a ``failures.ndjson`` produced by +``eval_retrieval.py --emit-failures`` and prints: + + 1. A histogram of failure buckets (biggest = highest leverage). + 2. For each bucket, a few concrete example failures with the + ground-truth answer and the top ranked chunks the parser produced. + +The point: turn "recall@5 is 0.70" into "N misses are answer_absent_from_chunks +— the parser is dropping cells; here are 5 examples to reproduce." + +Usage: + python scripts/triage_recall.py tests/benchmarks/reports/retrieval//failures.ndjson + python scripts/triage_recall.py tests/benchmarks/reports/retrieval # finds the latest run + python scripts/triage_recall.py tests/benchmarks/reports/retrieval --bucket answer_absent_from_chunks --examples 10 +""" +from __future__ import annotations + +import argparse +import json +import sys +from collections import Counter +from pathlib import Path + +# Ordered worst→least-actionable. 
Looked up per bucket for the guidance line; the histogram itself is ranked by count.
+BUCKET_GUIDANCE = {
+    "parse_error": "Parser raised on the file. Reproduce with parse_workbook(path) and fix the crash.",
+    "no_chunks": "Parser produced zero chunks. Sheet/region detection collapsed — check chunking/segmenter.py.",
+    "answer_absent_from_chunks": "Answer value is in NO chunk. EXTRACTION gap — the cell was dropped or garbled. Highest leverage.",
+    "wrong_sheet": "Answer sheet was never chunked. Sheet enumeration bug — check workbook_parser.py sheet loop.",
+    "geometric_no_overlap": "No chunk's A1 range overlaps ground truth. RANGE bookkeeping drift in merge/split.",
+    "present_but_ranked_low": "A chunk DOES contain the answer but ranked >5. Not a parser bug — fix chunk granularity/embedding.",
+    "unparseable_ground_truth": "Could not parse the dataset's answer_position. Benchmark-harness issue, not the parser.",
+}
+
+
+def find_failures_file(arg: Path) -> Path:
+    if arg.is_file():
+        return arg
+    if arg.is_dir():
+        direct = arg / "failures.ndjson"
+        if direct.exists():
+            return direct
+        runs = sorted(p for p in arg.glob("*/failures.ndjson"))
+        if runs:
+            return runs[-1]
+    sys.exit(f"ERROR: no failures.ndjson found at {arg}")
+
+
+def load(path: Path) -> list[dict]:
+    rows = []
+    for line in path.read_text().splitlines():
+        line = line.strip()
+        if line:
+            rows.append(json.loads(line))
+    return rows
+
+
+def main(argv: list[str] | None = None) -> int:
+    ap = argparse.ArgumentParser(description=__doc__,
+                                 formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("path", type=Path,
+                    help="failures.ndjson, or a dir containing/parenting one")
+    ap.add_argument("--bucket", help="only show examples for this bucket")
+    ap.add_argument("--examples", type=int, default=3,
+                    help="example failures to print per bucket (default 3)")
+    args = ap.parse_args(argv)
+
+    path = find_failures_file(args.path)
+    rows = load(path)
+    if not rows:
+        print(f"{path} is empty — no failures recorded (or run was a no-op).")
+        return 0
+
+    print(f"# Recall failure triage — {path}")
+    print(f"# {len(rows)} total failure rows\n")
+
+    hist = Counter(r.get("failure_bucket") for r in rows)
+    print("## Bucket histogram (ranked by count — fix the top one first)\n")
+    width = max(len(str(b)) for b in hist)
+    for bucket, count in hist.most_common():
+        pct = 100.0 * count / len(rows)
+        print(f" {str(bucket):<{width}} {count:5d} ({pct:5.1f}%)")
+        print(f" {'':<{width}} → {BUCKET_GUIDANCE.get(bucket, '')}")
+    print()
+
+    buckets = [args.bucket] if args.bucket else [b for b, _ in hist.most_common()]
+    for bucket in buckets:
+        examples = [r for r in rows if r.get("failure_bucket") == bucket][:args.examples]
+        if not examples:
+            continue
+        print(f"## Examples — {bucket}\n")
+        for r in examples:
+            print(f" instance {r.get('instance_id')} ({r.get('parser')})")
+            print(f" Q: {(r.get('instruction') or '')[:160]}")
+            print(f" answer_position: {r.get('answer_position')}")
+            print(f" ground-truth values: {r.get('answer_values')}")
+            print(f" n_chunks={r.get('n_chunks')} "
+                  f"rank_of_text_match={r.get('rank_of_text_match')}")
+            if r.get("error"):
+                print(f" ERROR: {r['error']}")
+            for c in (r.get("top_chunks") or [])[:4]:
+                mark = "✓" if c.get("contains_answer") else " "
+                snippet = (c.get("text") or "").replace("\n", " ")[:120]
+                print(f" [{mark}] #{c.get('rank')} {c.get('sheet')} "
+                      f"{c.get('range')} {snippet}")
+            print()
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/scripts/verify_wheel.py b/scripts/verify_wheel.py
new file mode 100755
index 0000000..26801f3
--- /dev/null
+++ b/scripts/verify_wheel.py
@@ -0,0 +1,81 @@
+#!/usr/bin/env python3
+"""Verify the built wheel is installable and importable in a clean venv.
+ +This is the regression guard for the v0.2.0 packaging bug: ``pipeline.py`` +and ``api.py`` were top-level modules under ``src/`` and ``setuptools`` +``packages.find`` only picks up *packages*, so they were silently dropped +from the wheel — ``from ks_xlsx_parser.pipeline import ...`` failed for +every installed user. The flat layout also leaked 13 generic top-level +packages (``models``, ``utils``, ``parsers`` ...) into ``site-packages``. + +Run after ``python -m build --wheel``. Exits non-zero on any problem so it +can gate CI and ``make wheel-check``. +""" +from __future__ import annotations + +import subprocess +import sys +import tempfile +import venv +import zipfile +from pathlib import Path + +ROOT = Path(__file__).resolve().parent.parent + +# Imports a real downstream consumer relies on. Keep in sync with the +# public surface in ks_xlsx_parser/__init__.py. +SMOKE_IMPORTS = [ + "from ks_xlsx_parser import parse_workbook, ParseResult", + "from ks_xlsx_parser.pipeline import parse_workbook", + "from ks_xlsx_parser.verification import StageVerifier", + "from ks_xlsx_parser.analysis.table_assembler import TableAssembler", + "from ks_xlsx_parser.models.workbook import WorkbookDTO", +] + + +def find_wheel() -> Path: + wheels = sorted((ROOT / "dist").glob("*.whl")) + if not wheels: + sys.exit("ERROR: no wheel in dist/ — run `python -m build --wheel` first") + return wheels[-1] + + +def check_wheel_contents(wheel: Path) -> None: + """Fail loudly if the wheel pollutes the global namespace or drops modules.""" + with zipfile.ZipFile(wheel) as zf: + names = zf.namelist() + top_level = next((n for n in names if n.endswith("top_level.txt")), None) + if top_level: + packages = zf.read(top_level).decode().split() + if packages != ["ks_xlsx_parser"]: + sys.exit( + f"ERROR: wheel exposes top-level packages {packages}; " + "expected only ['ks_xlsx_parser']. The flat src/ layout leaked." 
+ ) + required = ["ks_xlsx_parser/pipeline.py", "ks_xlsx_parser/api.py"] + for req in required: + if not any(n == req for n in names): + sys.exit(f"ERROR: wheel is missing {req}") + print(f"wheel contents OK ({len(names)} entries, top-level: ks_xlsx_parser)") + + +def check_install_and_import(wheel: Path) -> None: + with tempfile.TemporaryDirectory() as tmp: + env_dir = Path(tmp) / "venv" + venv.create(env_dir, with_pip=True) + py = env_dir / ("Scripts" if sys.platform == "win32" else "bin") / "python" + subprocess.run([str(py), "-m", "pip", "install", "-q", str(wheel)], check=True) + script = "; ".join(SMOKE_IMPORTS) + "; print('clean-venv import OK')" + subprocess.run([str(py), "-c", script], check=True) + + +def main() -> None: + wheel = find_wheel() + print(f"verifying {wheel.name}") + check_wheel_contents(wheel) + check_install_and_import(wheel) + print("wheel verification PASSED") + + +if __name__ == "__main__": + main() diff --git a/src/ks_xlsx_parser/__init__.py b/src/ks_xlsx_parser/__init__.py index 184452b..cabcd96 100644 --- a/src/ks_xlsx_parser/__init__.py +++ b/src/ks_xlsx_parser/__init__.py @@ -1,25 +1,27 @@ """ ks_xlsx_parser — public API entry point for the ks-xlsx-parser package. -The source tree is flat: top-level modules at ``src/`` (``pipeline``, +All modules live under the ``ks_xlsx_parser`` package (``pipeline``, ``models``, ``analysis``, ``verification``, etc.). This module re-exports the stable, user-facing names so callers can do:: from ks_xlsx_parser import parse_workbook, ParseResult -regardless of internal layout. 
+and submodules remain importable directly:: + + from ks_xlsx_parser.pipeline import parse_workbook """ from __future__ import annotations __version__ = "0.2.0" -from pipeline import ( # noqa: F401 +from .pipeline import ( # noqa: F401 ParseResult, compare_workbooks, export_importer, parse_workbook, ) -from verification import ( # noqa: F401 +from .verification import ( # noqa: F401 ExcellentStage, StageVerifier, VerificationReport, diff --git a/src/analysis/__init__.py b/src/ks_xlsx_parser/analysis/__init__.py similarity index 100% rename from src/analysis/__init__.py rename to src/ks_xlsx_parser/analysis/__init__.py diff --git a/src/analysis/light_block_detector.py b/src/ks_xlsx_parser/analysis/light_block_detector.py similarity index 96% rename from src/analysis/light_block_detector.py rename to src/ks_xlsx_parser/analysis/light_block_detector.py index 9816cd0..7fc2d11 100644 --- a/src/analysis/light_block_detector.py +++ b/src/ks_xlsx_parser/analysis/light_block_detector.py @@ -10,9 +10,9 @@ import logging -from models.block import BlockDTO -from models.common import BlockType, CellCoord, CellRange -from models.table_structure import TableRegion, TableRegionRole, TableStructure +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange +from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure logger = logging.getLogger(__name__) diff --git a/src/analysis/llm_artifacts.py b/src/ks_xlsx_parser/analysis/llm_artifacts.py similarity index 98% rename from src/analysis/llm_artifacts.py rename to src/ks_xlsx_parser/analysis/llm_artifacts.py index 01d0777..2122c45 100644 --- a/src/analysis/llm_artifacts.py +++ b/src/ks_xlsx_parser/analysis/llm_artifacts.py @@ -11,24 +11,22 @@ from __future__ import annotations import logging -import re from collections import Counter from pydantic import Field -from models.block import BlockDTO -from models.chart import ChartDTO -from 
models.common import ( - BlockType, +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.chart import ChartDTO +from ks_xlsx_parser.models.common import ( CellCoord, SheetPurpose, StableModel, compute_hash, ) -from models.dependency import DependencyGraph -from models.sheet import SheetDTO -from models.table import TableDTO -from models.workbook import KpiDTO, SheetSummaryDTO +from ks_xlsx_parser.models.dependency import DependencyGraph +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table import TableDTO +from ks_xlsx_parser.models.workbook import KpiDTO, SheetSummaryDTO logger = logging.getLogger(__name__) diff --git a/src/analysis/pattern_splitter.py b/src/ks_xlsx_parser/analysis/pattern_splitter.py similarity index 96% rename from src/analysis/pattern_splitter.py rename to src/ks_xlsx_parser/analysis/pattern_splitter.py index 92ac82f..e065801 100644 --- a/src/analysis/pattern_splitter.py +++ b/src/ks_xlsx_parser/analysis/pattern_splitter.py @@ -10,12 +10,11 @@ import logging import math -from collections import defaultdict -from models.block import BlockDTO -from models.common import BlockType, CellAnnotation, CellCoord, CellRange -from models.sheet import SheetDTO -from models.table_structure import TableRegion, TableRegionRole, TableStructure +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellAnnotation, CellCoord, CellRange +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure logger = logging.getLogger(__name__) diff --git a/src/analysis/table_assembler.py b/src/ks_xlsx_parser/analysis/table_assembler.py similarity index 96% rename from src/analysis/table_assembler.py rename to src/ks_xlsx_parser/analysis/table_assembler.py index 279603d..28bc3a3 100644 --- a/src/analysis/table_assembler.py +++ b/src/ks_xlsx_parser/analysis/table_assembler.py @@ -9,10 +9,10 
@@ import logging -from models.block import BlockDTO -from models.common import BlockType, CellCoord, CellRange, compute_hash -from models.sheet import SheetDTO -from models.table_structure import TableRegion, TableRegionRole, TableStructure +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure logger = logging.getLogger(__name__) diff --git a/src/analysis/table_grouper.py b/src/ks_xlsx_parser/analysis/table_grouper.py similarity index 97% rename from src/analysis/table_grouper.py rename to src/ks_xlsx_parser/analysis/table_grouper.py index 94fe862..597da57 100644 --- a/src/analysis/table_grouper.py +++ b/src/ks_xlsx_parser/analysis/table_grouper.py @@ -11,10 +11,10 @@ import logging from collections import defaultdict -from models.block import BlockDTO -from models.common import BlockType, CellCoord, CellRange, compute_hash -from models.sheet import SheetDTO -from models.table_structure import TableStructure +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table_structure import TableStructure logger = logging.getLogger(__name__) diff --git a/src/analysis/template_extractor.py b/src/ks_xlsx_parser/analysis/template_extractor.py similarity index 95% rename from src/analysis/template_extractor.py rename to src/ks_xlsx_parser/analysis/template_extractor.py index 27a8025..54dedd9 100644 --- a/src/analysis/template_extractor.py +++ b/src/ks_xlsx_parser/analysis/template_extractor.py @@ -10,15 +10,15 @@ import logging -from models.common import CellAnnotation, CellCoord -from models.sheet import SheetDTO -from models.template import ( +from ks_xlsx_parser.models.common import CellAnnotation, CellCoord +from 
ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.template import ( DOFType, TemplateCellSpec, TemplateConstraint, TemplateNode, ) -from models.tree import TreeNode, TreeNodeType +from ks_xlsx_parser.models.tree import TreeNode, TreeNodeType logger = logging.getLogger(__name__) diff --git a/src/analysis/tree_builder.py b/src/ks_xlsx_parser/analysis/tree_builder.py similarity index 96% rename from src/analysis/tree_builder.py rename to src/ks_xlsx_parser/analysis/tree_builder.py index 0e5192f..22a2ed1 100644 --- a/src/analysis/tree_builder.py +++ b/src/ks_xlsx_parser/analysis/tree_builder.py @@ -10,11 +10,10 @@ import logging -from models.block import BlockDTO -from models.common import CellCoord, CellRange, compute_hash -from models.sheet import SheetDTO -from models.table_structure import TableStructure -from models.tree import TreeNode, TreeNodeType +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table_structure import TableStructure +from ks_xlsx_parser.models.tree import TreeNode, TreeNodeType logger = logging.getLogger(__name__) diff --git a/src/annotation/__init__.py b/src/ks_xlsx_parser/annotation/__init__.py similarity index 100% rename from src/annotation/__init__.py rename to src/ks_xlsx_parser/annotation/__init__.py diff --git a/src/annotation/block_splitter.py b/src/ks_xlsx_parser/annotation/block_splitter.py similarity index 97% rename from src/annotation/block_splitter.py rename to src/ks_xlsx_parser/annotation/block_splitter.py index 0ec332f..2d45209 100644 --- a/src/annotation/block_splitter.py +++ b/src/ks_xlsx_parser/annotation/block_splitter.py @@ -8,11 +8,10 @@ from __future__ import annotations import logging -from collections import defaultdict -from models.block import BlockDTO -from models.common import BlockType, CellAnnotation, CellCoord, CellRange -from models.sheet import SheetDTO +from ks_xlsx_parser.models.block import BlockDTO +from 
ks_xlsx_parser.models.common import BlockType, CellAnnotation, CellCoord, CellRange +from ks_xlsx_parser.models.sheet import SheetDTO logger = logging.getLogger(__name__) diff --git a/src/annotation/cell_annotator.py b/src/ks_xlsx_parser/annotation/cell_annotator.py similarity index 98% rename from src/annotation/cell_annotator.py rename to src/ks_xlsx_parser/annotation/cell_annotator.py index 5f80bdc..af82bac 100644 --- a/src/annotation/cell_annotator.py +++ b/src/ks_xlsx_parser/annotation/cell_annotator.py @@ -11,9 +11,9 @@ import logging from collections import defaultdict -from models.cell import CellDTO -from models.common import CellAnnotation -from models.sheet import SheetDTO +from ks_xlsx_parser.models.cell import CellDTO +from ks_xlsx_parser.models.common import CellAnnotation +from ks_xlsx_parser.models.sheet import SheetDTO logger = logging.getLogger(__name__) diff --git a/src/api.py b/src/ks_xlsx_parser/api.py similarity index 97% rename from src/api.py rename to src/ks_xlsx_parser/api.py index 36469c5..82a367d 100644 --- a/src/api.py +++ b/src/ks_xlsx_parser/api.py @@ -2,7 +2,7 @@ FastAPI application for uploading and parsing Excel files. 
Run with: - uvicorn xlsx_parser.api:app --reload --port 8080 + uvicorn ks_xlsx_parser.api:app --reload --port 8080 Or use the xlsx-parser-api script (runs on port 8080 by default): xlsx-parser-api @@ -12,16 +12,11 @@ from __future__ import annotations -import json -import time -import traceback -from pathlib import Path - from fastapi import FastAPI, File, UploadFile from fastapi.responses import HTMLResponse, JSONResponse -from pipeline import parse_workbook -from verification import StageVerifier +from ks_xlsx_parser.pipeline import parse_workbook +from ks_xlsx_parser.verification import StageVerifier app = FastAPI( title="ks-xlsx-parser API", diff --git a/src/charts/__init__.py b/src/ks_xlsx_parser/charts/__init__.py similarity index 100% rename from src/charts/__init__.py rename to src/ks_xlsx_parser/charts/__init__.py diff --git a/src/charts/chart_extractor.py b/src/ks_xlsx_parser/charts/chart_extractor.py similarity index 99% rename from src/charts/chart_extractor.py rename to src/ks_xlsx_parser/charts/chart_extractor.py index 4c4cdca..017feb9 100644 --- a/src/charts/chart_extractor.py +++ b/src/ks_xlsx_parser/charts/chart_extractor.py @@ -21,8 +21,8 @@ from pathlib import Path from xml.etree import ElementTree as ET -from models.chart import ChartAnchor, ChartAxis, ChartDTO, ChartSeries -from models.common import ChartType +from ks_xlsx_parser.models.chart import ChartAnchor, ChartAxis, ChartDTO, ChartSeries +from ks_xlsx_parser.models.common import ChartType logger = logging.getLogger(__name__) diff --git a/src/chunking/__init__.py b/src/ks_xlsx_parser/chunking/__init__.py similarity index 100% rename from src/chunking/__init__.py rename to src/ks_xlsx_parser/chunking/__init__.py diff --git a/src/chunking/chunker.py b/src/ks_xlsx_parser/chunking/chunker.py similarity index 95% rename from src/chunking/chunker.py rename to src/ks_xlsx_parser/chunking/chunker.py index bb4d4f9..7057e78 100644 --- a/src/chunking/chunker.py +++ 
b/src/ks_xlsx_parser/chunking/chunker.py @@ -13,14 +13,12 @@ import logging -from models.block import BlockDTO, ChunkDTO, DependencySummary -from models.common import CellCoord, EdgeType -from models.dependency import DependencyGraph -from models.sheet import SheetDTO -from models.table import TableDTO -from models.workbook import NamedRangeDTO, WorkbookDTO -from rendering.html_renderer import HtmlRenderer -from rendering.text_renderer import TextRenderer +from ks_xlsx_parser.models.block import BlockDTO, ChunkDTO, DependencySummary +from ks_xlsx_parser.models.common import CellCoord, EdgeType +from ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.workbook import WorkbookDTO +from ks_xlsx_parser.rendering.html_renderer import HtmlRenderer +from ks_xlsx_parser.rendering.text_renderer import TextRenderer logger = logging.getLogger(__name__) @@ -179,7 +177,7 @@ def _chart_to_chunk(self, chart) -> ChunkDTO: top_left = "A1" bottom_right = "A1" if chart.anchor: - from models.common import col_number_to_letter + from ks_xlsx_parser.models.common import col_number_to_letter top_left = f"{col_number_to_letter(chart.anchor.from_col + 1)}{chart.anchor.from_row + 1}" if chart.anchor.to_col is not None and chart.anchor.to_row is not None: bottom_right = f"{col_number_to_letter(chart.anchor.to_col + 1)}{chart.anchor.to_row + 1}" diff --git a/src/chunking/segmenter.py b/src/ks_xlsx_parser/chunking/segmenter.py similarity index 98% rename from src/chunking/segmenter.py rename to src/ks_xlsx_parser/chunking/segmenter.py index 1b7898a..bde7b2a 100644 --- a/src/chunking/segmenter.py +++ b/src/ks_xlsx_parser/chunking/segmenter.py @@ -22,10 +22,10 @@ import logging -from models.block import BlockDTO -from models.common import BlockType, CellCoord, CellRange -from models.sheet import SheetDTO -from models.table import TableDTO +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange +from 
ks_xlsx_parser.models.sheet import SheetDTO +from ks_xlsx_parser.models.table import TableDTO logger = logging.getLogger(__name__) diff --git a/src/comparison/__init__.py b/src/ks_xlsx_parser/comparison/__init__.py similarity index 100% rename from src/comparison/__init__.py rename to src/ks_xlsx_parser/comparison/__init__.py diff --git a/src/comparison/template_comparator.py b/src/ks_xlsx_parser/comparison/template_comparator.py similarity index 99% rename from src/comparison/template_comparator.py rename to src/ks_xlsx_parser/comparison/template_comparator.py index e7e043e..0c4fd64 100644 --- a/src/comparison/template_comparator.py +++ b/src/ks_xlsx_parser/comparison/template_comparator.py @@ -11,8 +11,8 @@ import logging from collections import defaultdict -from models.common import CellCoord, compute_hash -from models.template import ( +from ks_xlsx_parser.models.common import CellCoord +from ks_xlsx_parser.models.template import ( DOFConflict, DOFType, GeneralizedTemplate, diff --git a/src/export/__init__.py b/src/ks_xlsx_parser/export/__init__.py similarity index 100% rename from src/export/__init__.py rename to src/ks_xlsx_parser/export/__init__.py diff --git a/src/export/model_exporter.py b/src/ks_xlsx_parser/export/model_exporter.py similarity index 98% rename from src/export/model_exporter.py rename to src/ks_xlsx_parser/export/model_exporter.py index d47dc7d..0784aac 100644 --- a/src/export/model_exporter.py +++ b/src/ks_xlsx_parser/export/model_exporter.py @@ -12,8 +12,8 @@ from pathlib import Path from typing import Any -from models.common import CellCoord -from models.template import DOFType, GeneralizedTemplate, TemplateNode +from ks_xlsx_parser.models.common import CellCoord +from ks_xlsx_parser.models.template import DOFType, GeneralizedTemplate logger = logging.getLogger(__name__) @@ -205,7 +205,7 @@ def export_code( Constants: {template.total_constants} | DOFs: {template.total_dofs} | Formulas: {template.total_formulas} """ - from 
xlsx_parser.export.model_exporter import SpreadsheetImporter + from ks_xlsx_parser.export.model_exporter import SpreadsheetImporter from xlsx_parser import parse_workbook diff --git a/src/formula/__init__.py b/src/ks_xlsx_parser/formula/__init__.py similarity index 100% rename from src/formula/__init__.py rename to src/ks_xlsx_parser/formula/__init__.py diff --git a/src/formula/dependency_builder.py b/src/ks_xlsx_parser/formula/dependency_builder.py similarity index 93% rename from src/formula/dependency_builder.py rename to src/ks_xlsx_parser/formula/dependency_builder.py index 8bd85b0..19785af 100644 --- a/src/formula/dependency_builder.py +++ b/src/ks_xlsx_parser/formula/dependency_builder.py @@ -11,13 +11,14 @@ import logging from typing import TYPE_CHECKING -from models.common import CellCoord, EdgeType -from models.dependency import DependencyEdgeDTO, DependencyGraph +from ks_xlsx_parser.models.common import CellCoord, EdgeType +from ks_xlsx_parser.models.dependency import DependencyEdgeDTO, DependencyGraph + from .formula_parser import FormulaParser if TYPE_CHECKING: - from models.sheet import SheetDTO - from models.workbook import NamedRangeDTO + from ks_xlsx_parser.models.sheet import SheetDTO + from ks_xlsx_parser.models.workbook import NamedRangeDTO logger = logging.getLogger(__name__) diff --git a/src/formula/formula_parser.py b/src/ks_xlsx_parser/formula/formula_parser.py similarity index 98% rename from src/formula/formula_parser.py rename to src/ks_xlsx_parser/formula/formula_parser.py index f850b91..941f7a4 100644 --- a/src/formula/formula_parser.py +++ b/src/ks_xlsx_parser/formula/formula_parser.py @@ -10,9 +10,9 @@ from __future__ import annotations import re -from dataclasses import dataclass, field +from dataclasses import dataclass -from models.common import CellCoord, CellRange, col_letter_to_number +from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number # Optional Rust fast-path: ks_xlsx_core.scan_formula returns 
reference tuples # in the same emit order as the Python regex pipeline. When available, we diff --git a/src/models/__init__.py b/src/ks_xlsx_parser/models/__init__.py similarity index 94% rename from src/models/__init__.py rename to src/ks_xlsx_parser/models/__init__.py index 42814ea..5c91780 100644 --- a/src/models/__init__.py +++ b/src/ks_xlsx_parser/models/__init__.py @@ -1,8 +1,8 @@ """ -Data models (DTOs) for the xlsx_parser. +Data models (DTOs) for the ks_xlsx_parser. Re-exports all model classes for convenient importing: - from xlsx_parser.models import WorkbookDTO, SheetDTO, CellDTO, ... + from ks_xlsx_parser.models import WorkbookDTO, SheetDTO, CellDTO, ... """ from .block import BlockDTO, ChunkDTO, DependencySummary diff --git a/src/models/block.py b/src/ks_xlsx_parser/models/block.py similarity index 100% rename from src/models/block.py rename to src/ks_xlsx_parser/models/block.py diff --git a/src/models/cell.py b/src/ks_xlsx_parser/models/cell.py similarity index 100% rename from src/models/cell.py rename to src/ks_xlsx_parser/models/cell.py diff --git a/src/models/chart.py b/src/ks_xlsx_parser/models/chart.py similarity index 100% rename from src/models/chart.py rename to src/ks_xlsx_parser/models/chart.py diff --git a/src/models/common.py b/src/ks_xlsx_parser/models/common.py similarity index 100% rename from src/models/common.py rename to src/ks_xlsx_parser/models/common.py diff --git a/src/models/dependency.py b/src/ks_xlsx_parser/models/dependency.py similarity index 100% rename from src/models/dependency.py rename to src/ks_xlsx_parser/models/dependency.py diff --git a/src/models/shape.py b/src/ks_xlsx_parser/models/shape.py similarity index 97% rename from src/models/shape.py rename to src/ks_xlsx_parser/models/shape.py index 178ebaf..a0f94f9 100644 --- a/src/models/shape.py +++ b/src/ks_xlsx_parser/models/shape.py @@ -9,7 +9,7 @@ from pydantic import Field -from .common import CellRange, StableModel, compute_hash +from .common import 
StableModel, compute_hash class ShapeAnchor(StableModel): diff --git a/src/models/sheet.py b/src/ks_xlsx_parser/models/sheet.py similarity index 99% rename from src/models/sheet.py rename to src/ks_xlsx_parser/models/sheet.py index bfc2758..68c8cb1 100644 --- a/src/models/sheet.py +++ b/src/ks_xlsx_parser/models/sheet.py @@ -8,8 +8,6 @@ from __future__ import annotations -from typing import Any - from pydantic import Field from .cell import CellDTO diff --git a/src/models/table.py b/src/ks_xlsx_parser/models/table.py similarity index 100% rename from src/models/table.py rename to src/ks_xlsx_parser/models/table.py diff --git a/src/models/table_structure.py b/src/ks_xlsx_parser/models/table_structure.py similarity index 100% rename from src/models/table_structure.py rename to src/ks_xlsx_parser/models/table_structure.py diff --git a/src/models/template.py b/src/ks_xlsx_parser/models/template.py similarity index 100% rename from src/models/template.py rename to src/ks_xlsx_parser/models/template.py diff --git a/src/models/tree.py b/src/ks_xlsx_parser/models/tree.py similarity index 100% rename from src/models/tree.py rename to src/ks_xlsx_parser/models/tree.py diff --git a/src/models/workbook.py b/src/ks_xlsx_parser/models/workbook.py similarity index 99% rename from src/models/workbook.py rename to src/ks_xlsx_parser/models/workbook.py index 772c934..39a428b 100644 --- a/src/models/workbook.py +++ b/src/ks_xlsx_parser/models/workbook.py @@ -32,7 +32,7 @@ from .sheet import SheetDTO from .table import TableDTO from .table_structure import TableStructure -from .template import GeneralizedTemplate, TemplateNode +from .template import TemplateNode from .tree import TreeNode diff --git a/src/parsers/__init__.py b/src/ks_xlsx_parser/parsers/__init__.py similarity index 100% rename from src/parsers/__init__.py rename to src/ks_xlsx_parser/parsers/__init__.py diff --git a/src/parsers/calamine_core.py b/src/ks_xlsx_parser/parsers/calamine_core.py similarity index 99% rename 
from src/parsers/calamine_core.py rename to src/ks_xlsx_parser/parsers/calamine_core.py index 830df5d..e6ac959 100644 --- a/src/parsers/calamine_core.py +++ b/src/ks_xlsx_parser/parsers/calamine_core.py @@ -17,7 +17,6 @@ """ from __future__ import annotations -import io import logging import os import tempfile diff --git a/src/parsers/cell_parser.py b/src/ks_xlsx_parser/parsers/cell_parser.py similarity index 99% rename from src/parsers/cell_parser.py rename to src/ks_xlsx_parser/parsers/cell_parser.py index cc22092..bf1857c 100644 --- a/src/parsers/cell_parser.py +++ b/src/ks_xlsx_parser/parsers/cell_parser.py @@ -17,7 +17,7 @@ from openpyxl.cell.cell import Cell as OpenpyxlCell from openpyxl.cell.cell import MergedCell as OpenpyxlMergedCell -from models.cell import ( +from ks_xlsx_parser.models.cell import ( AlignmentStyle, BorderSide, BorderStyle, @@ -26,7 +26,7 @@ FillStyle, FontStyle, ) -from models.common import CellCoord, RichTextRun +from ks_xlsx_parser.models.common import CellCoord, RichTextRun logger = logging.getLogger(__name__) diff --git a/src/parsers/sheet_parser.py b/src/ks_xlsx_parser/parsers/sheet_parser.py similarity index 98% rename from src/parsers/sheet_parser.py rename to src/ks_xlsx_parser/parsers/sheet_parser.py index 5883a92..d49a28d 100644 --- a/src/parsers/sheet_parser.py +++ b/src/ks_xlsx_parser/parsers/sheet_parser.py @@ -18,21 +18,22 @@ from lxml import etree from openpyxl.worksheet.worksheet import Worksheet as OpenpyxlWorksheet -from models.cell import CellDTO -from models.common import CellCoord, CellRange, ParseError, Severity, col_letter_to_number +from ks_xlsx_parser.models.cell import CellDTO +from ks_xlsx_parser.models.common import CellCoord, CellRange, ParseError, Severity, col_letter_to_number # OOXML namespace for spreadsheetML _OOXML_NS = {"s": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"} # Regex to split an A1-style cell reference like "B1" into ("B", "1") _CELL_REF_RE = re.compile(r"^([A-Z]+)(\d+)$") 
-from models.sheet import ( +from ks_xlsx_parser.models.sheet import ( ConditionalFormatRule, DataValidationRule, MergedRegion, SheetDTO, SheetProperties, ) + from .cell_parser import CellParser logger = logging.getLogger(__name__) @@ -282,7 +283,7 @@ def _extract_dimensions(self, sheet: SheetDTO) -> None: for col_letter, cd in self._ws.column_dimensions.items(): if cd.width: - from models.common import col_letter_to_number + from ks_xlsx_parser.models.common import col_letter_to_number col_num = col_letter_to_number(col_letter) sheet.col_widths[col_num] = cd.width * 7.5 # chars to points approx @@ -294,7 +295,7 @@ def _extract_hidden(self, sheet: SheetDTO) -> None: for col_letter, cd in self._ws.column_dimensions.items(): if cd.hidden: - from models.common import col_letter_to_number + from ks_xlsx_parser.models.common import col_letter_to_number col_num = col_letter_to_number(col_letter) sheet.hidden_cols.add(col_num) @@ -347,11 +348,10 @@ def _extract_autofilter(self, sheet: SheetDTO) -> None: if not af or not af.ref: return try: - from models.common import FilterCriteria + from ks_xlsx_parser.models.common import FilterCriteria ref = str(af.ref) parts = ref.split(":") if len(parts) == 2: - from models.common import col_letter_to_number as c2n start_match = _CELL_REF_RE.match(parts[0]) end_match = _CELL_REF_RE.match(parts[1]) if start_match and end_match: diff --git a/src/parsers/table_parser.py b/src/ks_xlsx_parser/parsers/table_parser.py similarity index 97% rename from src/parsers/table_parser.py rename to src/ks_xlsx_parser/parsers/table_parser.py index d2eb2eb..bdc0d55 100644 --- a/src/parsers/table_parser.py +++ b/src/ks_xlsx_parser/parsers/table_parser.py @@ -12,8 +12,8 @@ from openpyxl.worksheet.worksheet import Worksheet as OpenpyxlWorksheet -from models.common import CellCoord, CellRange, col_letter_to_number -from models.table import TableColumn, TableDTO +from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number +from 
ks_xlsx_parser.models.table import TableColumn, TableDTO logger = logging.getLogger(__name__) diff --git a/src/parsers/workbook_parser.py b/src/ks_xlsx_parser/parsers/workbook_parser.py similarity index 97% rename from src/parsers/workbook_parser.py rename to src/ks_xlsx_parser/parsers/workbook_parser.py index bee0452..57e0c93 100644 --- a/src/parsers/workbook_parser.py +++ b/src/ks_xlsx_parser/parsers/workbook_parser.py @@ -11,7 +11,6 @@ import io import logging import time -from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor from pathlib import Path import xxhash @@ -21,7 +20,7 @@ # both values AND formulas in one read; python-calamine is kept as a fallback # for environments where the Rust crate hasn't been built. Both are bypassed # cleanly if absent — correctness always falls back to openpyxl. -from parsers import calamine_core as _calamine_core +from ks_xlsx_parser.parsers import calamine_core as _calamine_core try: from python_calamine import CalamineWorkbook @@ -30,14 +29,14 @@ CalamineWorkbook = None # type: ignore[assignment] _HAS_PYCALAMINE = False -from models.common import ParseError, Severity -from models.common import CalculationMode, DateSystem -from models.workbook import ( +from ks_xlsx_parser.models.common import CalculationMode, DateSystem, ParseError, Severity +from ks_xlsx_parser.models.workbook import ( ExternalLink, NamedRangeDTO, WorkbookDTO, WorkbookProperties, ) + from .sheet_parser import SheetParser from .table_parser import TableParser @@ -234,7 +233,7 @@ def parse(self) -> WorkbookDTO: # Extract charts via OOXML parsing try: - from charts.chart_extractor import ChartExtractor + from ks_xlsx_parser.charts.chart_extractor import ChartExtractor chart_extractor = ChartExtractor( self._path or self._content, wb_formula.sheetnames ) @@ -253,7 +252,7 @@ def parse(self) -> WorkbookDTO: # accounts for ~25% of the full-mode wall clock). 
if self._build_dep_graph: try: - from formula.dependency_builder import DependencyBuilder + from ks_xlsx_parser.formula.dependency_builder import DependencyBuilder dep_builder = DependencyBuilder(result.sheets, result.named_ranges) result.dependency_graph = dep_builder.build() except Exception as e: diff --git a/src/pipeline.py b/src/ks_xlsx_parser/pipeline.py similarity index 90% rename from src/pipeline.py rename to src/ks_xlsx_parser/pipeline.py index 9b902c5..1451074 100644 --- a/src/pipeline.py +++ b/src/ks_xlsx_parser/pipeline.py @@ -1,7 +1,7 @@ """ End-to-end parsing pipeline. -This is the primary public API for the xlsx_parser. It orchestrates +This is the primary public API for the ks_xlsx_parser. It orchestrates the full 11-stage Excellent Algorithm: Stage 0: Sheet Chunking (adaptive gaps + style boundaries) Stage 1: Cell Annotation (feature-based scorer) @@ -15,7 +15,7 @@ Render + Store Usage: - from xlsx_parser.pipeline import parse_workbook + from ks_xlsx_parser.pipeline import parse_workbook result = parse_workbook("path/to/workbook.xlsx") print(result.workbook.total_cells) for chunk in result.chunks: @@ -32,25 +32,25 @@ ParseMode = Literal["full", "fast"] -from analysis.light_block_detector import LightBlockDetector -from models.common import CellCoord, CellRange, col_letter_to_number -from analysis.pattern_splitter import PatternSplitter -from analysis.table_assembler import TableAssembler -from analysis.table_grouper import TableGrouper -from analysis.template_extractor import TemplateExtractor -from analysis.tree_builder import TreeBuilder -from annotation.block_splitter import BlockSplitter -from annotation.cell_annotator import CellAnnotator -from chunking.chunker import ChunkBuilder -from comparison.template_comparator import TemplateComparator -from export.model_exporter import ModelExporter -from models.block import ChunkDTO -from models.table_structure import TableStructure -from models.template import GeneralizedTemplate, TemplateNode -from 
models.tree import TreeNode -from models.workbook import WorkbookDTO -from parsers.workbook_parser import WorkbookParser -from storage.serializer import WorkbookSerializer +from ks_xlsx_parser.analysis.light_block_detector import LightBlockDetector +from ks_xlsx_parser.analysis.pattern_splitter import PatternSplitter +from ks_xlsx_parser.analysis.table_assembler import TableAssembler +from ks_xlsx_parser.analysis.table_grouper import TableGrouper +from ks_xlsx_parser.analysis.template_extractor import TemplateExtractor +from ks_xlsx_parser.analysis.tree_builder import TreeBuilder +from ks_xlsx_parser.annotation.block_splitter import BlockSplitter +from ks_xlsx_parser.annotation.cell_annotator import CellAnnotator +from ks_xlsx_parser.chunking.chunker import ChunkBuilder +from ks_xlsx_parser.comparison.template_comparator import TemplateComparator +from ks_xlsx_parser.export.model_exporter import ModelExporter +from ks_xlsx_parser.models.block import ChunkDTO +from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number +from ks_xlsx_parser.models.table_structure import TableStructure +from ks_xlsx_parser.models.template import GeneralizedTemplate, TemplateNode +from ks_xlsx_parser.models.tree import TreeNode +from ks_xlsx_parser.models.workbook import WorkbookDTO +from ks_xlsx_parser.parsers.workbook_parser import WorkbookParser +from ks_xlsx_parser.storage.serializer import WorkbookSerializer logger = logging.getLogger(__name__) @@ -261,7 +261,7 @@ def parse_workbook( annotator.annotate() # Stage 0+2: Segment then split by annotation - from chunking.segmenter import LayoutSegmenter + from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter sheet_tables = [t for t in workbook.tables if t.sheet_name == sheet.sheet_name] sheet_named = [ nr.name for nr in workbook.named_ranges diff --git a/src/py.typed b/src/ks_xlsx_parser/py.typed similarity index 100% rename from src/py.typed rename to src/ks_xlsx_parser/py.typed diff --git 
a/src/rendering/__init__.py b/src/ks_xlsx_parser/rendering/__init__.py
similarity index 100%
rename from src/rendering/__init__.py
rename to src/ks_xlsx_parser/rendering/__init__.py
diff --git a/src/rendering/html_renderer.py b/src/ks_xlsx_parser/rendering/html_renderer.py
similarity index 96%
rename from src/rendering/html_renderer.py
rename to src/ks_xlsx_parser/rendering/html_renderer.py
index c7c1a1c..50a8b21 100644
--- a/src/rendering/html_renderer.py
+++ b/src/ks_xlsx_parser/rendering/html_renderer.py
@@ -14,10 +14,10 @@
 import html
 import logging
 
-from models.block import BlockDTO
-from models.cell import CellDTO, CellStyle
-from models.common import BlockType, CellCoord, col_number_to_letter
-from models.sheet import SheetDTO
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.cell import CellStyle
+from ks_xlsx_parser.models.common import BlockType, col_number_to_letter
+from ks_xlsx_parser.models.sheet import SheetDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/rendering/text_renderer.py b/src/ks_xlsx_parser/rendering/text_renderer.py
similarity index 97%
rename from src/rendering/text_renderer.py
rename to src/ks_xlsx_parser/rendering/text_renderer.py
index 76b9b76..e2c15f4 100644
--- a/src/rendering/text_renderer.py
+++ b/src/ks_xlsx_parser/rendering/text_renderer.py
@@ -11,10 +11,10 @@
 import datetime as _dt
 import logging
 
-from models.block import BlockDTO
-from models.common import BlockType, col_number_to_letter
-from models.chart import ChartDTO
-from models.sheet import SheetDTO
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.chart import ChartDTO
+from ks_xlsx_parser.models.common import BlockType, col_number_to_letter
+from ks_xlsx_parser.models.sheet import SheetDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/storage/__init__.py b/src/ks_xlsx_parser/storage/__init__.py
similarity index 100%
rename from src/storage/__init__.py
rename to src/ks_xlsx_parser/storage/__init__.py
diff --git a/src/storage/serializer.py b/src/ks_xlsx_parser/storage/serializer.py
similarity index 98%
rename from src/storage/serializer.py
rename to src/ks_xlsx_parser/storage/serializer.py
index 5ced8f9..66add02 100644
--- a/src/storage/serializer.py
+++ b/src/ks_xlsx_parser/storage/serializer.py
@@ -15,8 +15,8 @@
 import logging
 from typing import Any
 
-from models.block import ChunkDTO
-from models.workbook import WorkbookDTO
+from ks_xlsx_parser.models.block import ChunkDTO
+from ks_xlsx_parser.models.workbook import WorkbookDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/utils/__init__.py b/src/ks_xlsx_parser/utils/__init__.py
similarity index 100%
rename from src/utils/__init__.py
rename to src/ks_xlsx_parser/utils/__init__.py
diff --git a/src/utils/logging_config.py b/src/ks_xlsx_parser/utils/logging_config.py
similarity index 97%
rename from src/utils/logging_config.py
rename to src/ks_xlsx_parser/utils/logging_config.py
index ca23108..94ceb54 100644
--- a/src/utils/logging_config.py
+++ b/src/ks_xlsx_parser/utils/logging_config.py
@@ -1,5 +1,5 @@
 """
-Structured logging configuration for the xlsx_parser.
+Structured logging configuration for the ks_xlsx_parser.
 
 Provides a JSON-structured logging format with workbook/sheet/block
 context fields for observability and debugging in production.
diff --git a/src/verification/__init__.py b/src/ks_xlsx_parser/verification/__init__.py
similarity index 100%
rename from src/verification/__init__.py
rename to src/ks_xlsx_parser/verification/__init__.py
diff --git a/src/verification/stage_verifier.py b/src/ks_xlsx_parser/verification/stage_verifier.py
similarity index 97%
rename from src/verification/stage_verifier.py
rename to src/ks_xlsx_parser/verification/stage_verifier.py
index 3fb315b..88fb4f8 100644
--- a/src/verification/stage_verifier.py
+++ b/src/ks_xlsx_parser/verification/stage_verifier.py
@@ -16,23 +16,21 @@
 
 from pydantic import Field
 
-from analysis.light_block_detector import LightBlockDetector
-from analysis.pattern_splitter import PatternSplitter
-from analysis.table_assembler import TableAssembler
-from analysis.table_grouper import TableGrouper
-from analysis.template_extractor import TemplateExtractor
-from analysis.tree_builder import TreeBuilder
-from annotation.block_splitter import BlockSplitter
-from annotation.cell_annotator import CellAnnotator
-from chunking.segmenter import LayoutSegmenter
-from comparison.template_comparator import TemplateComparator
-from export.model_exporter import ModelExporter
-from models.block import BlockDTO
-from models.common import BlockType, CellAnnotation, StableModel
-from models.table_structure import TableStructure
-from models.template import TemplateNode
-from models.tree import TreeNode
-from parsers.workbook_parser import WorkbookParser
+from ks_xlsx_parser.analysis.light_block_detector import LightBlockDetector
+from ks_xlsx_parser.analysis.pattern_splitter import PatternSplitter
+from ks_xlsx_parser.analysis.table_assembler import TableAssembler
+from ks_xlsx_parser.analysis.table_grouper import TableGrouper
+from ks_xlsx_parser.analysis.template_extractor import TemplateExtractor
+from ks_xlsx_parser.analysis.tree_builder import TreeBuilder
+from ks_xlsx_parser.annotation.block_splitter import BlockSplitter
+from 
ks_xlsx_parser.annotation.cell_annotator import CellAnnotator +from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter +from ks_xlsx_parser.models.block import BlockDTO +from ks_xlsx_parser.models.common import BlockType, CellAnnotation, StableModel +from ks_xlsx_parser.models.table_structure import TableStructure +from ks_xlsx_parser.models.template import TemplateNode +from ks_xlsx_parser.models.tree import TreeNode +from ks_xlsx_parser.parsers.workbook_parser import WorkbookParser logger = logging.getLogger(__name__) diff --git a/tests/benchmarks/_driver.py b/tests/benchmarks/_driver.py index a66acd5..94ee633 100644 --- a/tests/benchmarks/_driver.py +++ b/tests/benchmarks/_driver.py @@ -18,10 +18,8 @@ import subprocess import sys from collections import defaultdict -from dataclasses import asdict from datetime import UTC, datetime from pathlib import Path -from typing import Iterable from ._runner import Runner from ._schema import CSV_FIELDS, BenchmarkRecord, record_to_csv_row, validate_record @@ -344,7 +342,7 @@ def generate_drift(out_dir: Path) -> None: short = Path(fname).name lines.append(f"| {short} | {va} | {vb} | {va - vb:+d} |") if n_drift == 0: - lines.append(f"| _no drift_ | | | |") + lines.append("| _no drift_ | | | |") elif n_drift > 50: lines.append(f"| … {n_drift - 50} more rows truncated | | | |") lines.append("") diff --git a/tests/benchmarks/_runner.py b/tests/benchmarks/_runner.py index 2bdb95c..a5febc9 100644 --- a/tests/benchmarks/_runner.py +++ b/tests/benchmarks/_runner.py @@ -18,7 +18,7 @@ import time from dataclasses import dataclass from pathlib import Path -from typing import Any, Iterator +from typing import Any HERE = Path(__file__).resolve().parent REPO_ROOT = HERE.parent.parent diff --git a/tests/benchmarks/adapters/docling_adapter.py b/tests/benchmarks/adapters/docling_adapter.py index 17d4b02..bd573f5 100644 --- a/tests/benchmarks/adapters/docling_adapter.py +++ b/tests/benchmarks/adapters/docling_adapter.py @@ -41,7 +41,7 
@@ sys.path.insert(0, str(_HERE.parent.parent.parent)) # repo root from tests.benchmarks._mem import peak_rss_mb # noqa: E402 -from tests.benchmarks._schema import BenchmarkRecord, SCHEMA_VERSION # noqa: E402 +from tests.benchmarks._schema import SCHEMA_VERSION, BenchmarkRecord # noqa: E402 PARSER_NAME = "docling" MAX_ERR_LEN = 500 diff --git a/tests/benchmarks/adapters/ks_adapter.py b/tests/benchmarks/adapters/ks_adapter.py index 917df29..754b2a6 100644 --- a/tests/benchmarks/adapters/ks_adapter.py +++ b/tests/benchmarks/adapters/ks_adapter.py @@ -28,10 +28,10 @@ sys.path.insert(0, str(_HERE.parent.parent.parent)) # repo root sys.path.insert(0, str(_HERE.parent.parent.parent / "src")) # src layout -from tests.benchmarks._mem import peak_rss_mb # noqa: E402 -from tests.benchmarks._schema import BenchmarkRecord, SCHEMA_VERSION # noqa: E402 from ks_xlsx_parser import __version__ as KS_VERSION # noqa: E402 from ks_xlsx_parser import parse_workbook # noqa: E402 +from tests.benchmarks._mem import peak_rss_mb # noqa: E402 +from tests.benchmarks._schema import SCHEMA_VERSION, BenchmarkRecord # noqa: E402 MAX_ERR_LEN = 500 diff --git a/tests/conftest.py b/tests/conftest.py index d422b9b..1417e30 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -8,7 +8,6 @@ -import os import tempfile from pathlib import Path @@ -17,7 +16,6 @@ from openpyxl.chart import BarChart, Reference from openpyxl.formatting.rule import CellIsRule, ColorScaleRule, IconSetRule from openpyxl.styles import Alignment, Border, Font, PatternFill, Side -from openpyxl.utils import get_column_letter from openpyxl.worksheet.datavalidation import DataValidation from openpyxl.worksheet.table import Table, TableStyleInfo diff --git a/tests/helpers/invariant_checker.py b/tests/helpers/invariant_checker.py index a077486..89efd1c 100644 --- a/tests/helpers/invariant_checker.py +++ b/tests/helpers/invariant_checker.py @@ -9,7 +9,7 @@ import re -from models.common import EdgeType +from 
ks_xlsx_parser.models.common import EdgeType def check_invariants(workbook) -> list[str]: diff --git a/tests/test_charts.py b/tests/test_charts.py index 021e793..802e6a6 100644 --- a/tests/test_charts.py +++ b/tests/test_charts.py @@ -5,11 +5,10 @@ type, series, axes, title, and position anchor. """ -import pytest -from charts.chart_extractor import ChartExtractor -from models import ChartType -from parsers import WorkbookParser +from ks_xlsx_parser.charts.chart_extractor import ChartExtractor +from ks_xlsx_parser.models import ChartType +from ks_xlsx_parser.parsers import WorkbookParser class TestChartExtraction: diff --git a/tests/test_corpus_robustness.py b/tests/test_corpus_robustness.py index 2253eea..4263c89 100644 --- a/tests/test_corpus_robustness.py +++ b/tests/test_corpus_robustness.py @@ -1,5 +1,5 @@ """ -Corpus robustness tests for the xlsx_parser. +Corpus robustness tests for the ks_xlsx_parser. Tests the parser against large sets of real-world .xlsx files to catch crashes, invariant violations, and regression. 
Corpus tests are skipped @@ -15,17 +15,15 @@ import pytest -from pipeline import parse_workbook -from models.common import Severity - -from tests.helpers.invariant_checker import check_invariants +from ks_xlsx_parser.models.common import Severity +from ks_xlsx_parser.pipeline import parse_workbook from tests.helpers.corpus_downloader import ( - download_euses_corpus, download_enron_corpus, + download_euses_corpus, download_github_xlsx_samples, get_corpus_files, ) - +from tests.helpers.invariant_checker import check_invariants CORPUS_DIR = Path(__file__).parent / "fixtures" / "corpus" diff --git a/tests/test_formula_handling.py b/tests/test_formula_handling.py index b3f8342..8402667 100644 --- a/tests/test_formula_handling.py +++ b/tests/test_formula_handling.py @@ -11,14 +11,11 @@ - Renders formula cells in chunk output """ -import pytest - -from formula.dependency_builder import DependencyBuilder -from formula.formula_parser import FormulaParser, ParsedReference -from models.common import BlockType, CellCoord, EdgeType -from parsers import WorkbookParser -from pipeline import parse_workbook +from ks_xlsx_parser.formula.formula_parser import FormulaParser +from ks_xlsx_parser.models.common import CellCoord, EdgeType +from ks_xlsx_parser.parsers import WorkbookParser +from ks_xlsx_parser.pipeline import parse_workbook # --------------------------------------------------------------------------- # Helpers @@ -705,7 +702,7 @@ class TestFormulaInBlocks: def test_block_formula_count(self, simple_formulas): result = parse_workbook(path=simple_formulas) # The block should report formula_count > 0 - from chunking.segmenter import LayoutSegmenter + from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter sheet = result.workbook.sheets[0] tables = [t for t in result.workbook.tables if t.sheet_name == sheet.sheet_name] segmenter = LayoutSegmenter(sheet, tables=tables) @@ -715,7 +712,7 @@ def test_block_formula_count(self, simple_formulas): def 
test_cross_sheet_block_has_formulas(self, cross_sheet_formulas): result = parse_workbook(path=cross_sheet_formulas) - from chunking.segmenter import LayoutSegmenter + from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter summary = [s for s in result.workbook.sheets if s.sheet_name == "Summary"][0] tables = [t for t in result.workbook.tables if t.sheet_name == summary.sheet_name] segmenter = LayoutSegmenter(summary, tables=tables) diff --git a/tests/test_formula_parser.py b/tests/test_formula_parser.py index 784e171..03c3959 100644 --- a/tests/test_formula_parser.py +++ b/tests/test_formula_parser.py @@ -5,10 +5,9 @@ and circular reference detection. """ -import pytest -from formula.formula_parser import FormulaParser -from models import CellCoord, DependencyGraph, DependencyEdgeDTO, EdgeType +from ks_xlsx_parser.formula.formula_parser import FormulaParser +from ks_xlsx_parser.models import CellCoord, DependencyEdgeDTO, DependencyGraph, EdgeType class TestFormulaParser: diff --git a/tests/test_llm_artifacts.py b/tests/test_llm_artifacts.py index c3247eb..87f6335 100644 --- a/tests/test_llm_artifacts.py +++ b/tests/test_llm_artifacts.py @@ -5,16 +5,15 @@ and reading-order linearization using programmatic fixtures. """ -import pytest -from analysis import ( +from ks_xlsx_parser.analysis import ( EntityIndexBuilder, KpiCatalogBuilder, ReadingOrderLinearizer, SheetSummaryAnalyzer, ) -from models.common import SheetPurpose -from parsers import WorkbookParser +from ks_xlsx_parser.models.common import SheetPurpose +from ks_xlsx_parser.parsers import WorkbookParser class TestSheetSummaryAnalyzer: diff --git a/tests/test_models.py b/tests/test_models.py index bddec00..5393eea 100644 --- a/tests/test_models.py +++ b/tests/test_models.py @@ -4,12 +4,12 @@ Covers CellCoord, CellRange, hashing, and serialization. 
""" -from models import ( +from ks_xlsx_parser.models import ( CellCoord, CellRange, - compute_hash, col_letter_to_number, col_number_to_letter, + compute_hash, ) diff --git a/tests/test_multi_table_layout.py b/tests/test_multi_table_layout.py index 7681a52..cca2057 100644 --- a/tests/test_multi_table_layout.py +++ b/tests/test_multi_table_layout.py @@ -8,11 +8,10 @@ import pytest -from chunking.segmenter import LayoutSegmenter -from models import BlockType -from parsers import WorkbookParser -from pipeline import parse_workbook - +from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter +from ks_xlsx_parser.models import BlockType +from ks_xlsx_parser.parsers import WorkbookParser +from ks_xlsx_parser.pipeline import parse_workbook # --------------------------------------------------------------------------- # Helpers diff --git a/tests/test_parsers.py b/tests/test_parsers.py index 850a0bb..fe1e911 100644 --- a/tests/test_parsers.py +++ b/tests/test_parsers.py @@ -6,9 +6,8 @@ tables, comments, data validations, and sheet properties. """ -import pytest -from parsers import WorkbookParser +from ks_xlsx_parser.parsers import WorkbookParser class TestSimpleWorkbook: diff --git a/tests/test_pipeline.py b/tests/test_pipeline.py index 534d125..7f1ab0e 100644 --- a/tests/test_pipeline.py +++ b/tests/test_pipeline.py @@ -7,9 +7,7 @@ import json -import pytest - -from pipeline import parse_workbook +from ks_xlsx_parser.pipeline import parse_workbook class TestEndToEndPipeline: diff --git a/tests/test_rendering.py b/tests/test_rendering.py index e24072a..3a757ec 100644 --- a/tests/test_rendering.py +++ b/tests/test_rendering.py @@ -5,12 +5,11 @@ headers, and coordinate annotations. 
""" -import pytest -from chunking.segmenter import LayoutSegmenter -from parsers import WorkbookParser -from rendering.html_renderer import HtmlRenderer -from rendering.text_renderer import TextRenderer +from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter +from ks_xlsx_parser.parsers import WorkbookParser +from ks_xlsx_parser.rendering.html_renderer import HtmlRenderer +from ks_xlsx_parser.rendering.text_renderer import TextRenderer class TestHtmlRendering: @@ -112,11 +111,10 @@ def test_numeric_cells_render_raw_not_display_formatted(self): sci-notation fallback (``1.272000e+03``) once the ``[=]`` formula marker pushed the rendered string past col_width — this test guards against that regression.""" - from models.sheet import SheetDTO - from models.cell import CellDTO - from models.common import CellCoord, CellRange - from models.block import BlockDTO - from models.common import BlockType + from ks_xlsx_parser.models.block import BlockDTO + from ks_xlsx_parser.models.cell import CellDTO + from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange + from ks_xlsx_parser.models.sheet import SheetDTO coord = CellCoord(row=1, col=1) cell = CellDTO( diff --git a/tests/test_segmentation.py b/tests/test_segmentation.py index 8cb3cfc..45f13b0 100644 --- a/tests/test_segmentation.py +++ b/tests/test_segmentation.py @@ -5,11 +5,10 @@ assumption blocks, result blocks, and text headers. 
""" -import pytest -from chunking.segmenter import LayoutSegmenter -from models import BlockType -from parsers import WorkbookParser +from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter +from ks_xlsx_parser.models import BlockType +from ks_xlsx_parser.parsers import WorkbookParser class TestSegmentation: diff --git a/tests/test_stage_verification.py b/tests/test_stage_verification.py index 574be13..75dc8bc 100644 --- a/tests/test_stage_verification.py +++ b/tests/test_stage_verification.py @@ -8,9 +8,7 @@ import json -import pytest - -from verification import ( +from ks_xlsx_parser.verification import ( ExcellentStage, ImplementationStatus, StageResult, @@ -18,7 +16,6 @@ VerificationReport, ) - # --------------------------------------------------------------------------- # Model tests # --------------------------------------------------------------------------- diff --git a/tests/test_structural_invariants.py b/tests/test_structural_invariants.py index 612d11a..bbe1460 100644 --- a/tests/test_structural_invariants.py +++ b/tests/test_structural_invariants.py @@ -1,5 +1,5 @@ """ -Structural invariant tests for the xlsx_parser. +Structural invariant tests for the ks_xlsx_parser. Tests properties that must always hold for any valid parse output, regardless of the input file: merge structure, used range bounds, @@ -13,11 +13,8 @@ import pytest -from pipeline import parse_workbook -from models.common import EdgeType - -from tests.helpers.invariant_checker import check_invariants - +from ks_xlsx_parser.models.common import EdgeType +from ks_xlsx_parser.pipeline import parse_workbook # --------------------------------------------------------------------------- # Invariant tests on programmatic fixtures