diff --git a/.claude/skills/recall-failure-triage/SKILL.md b/.claude/skills/recall-failure-triage/SKILL.md
new file mode 100644
index 0000000..a9a9263
--- /dev/null
+++ b/.claude/skills/recall-failure-triage/SKILL.md
@@ -0,0 +1,154 @@
+---
+name: recall-failure-triage
+description: Diagnose why ks-xlsx-parser's retrieval recall is below target by reading the failure-bucket histogram and exemplar failures, then propose ranked fixes. Use when investigating low recall@k on SpreadsheetBench, deciding what to fix next, or reviewing a benchmark run. Sister skill to excel-extraction-pipeline-improver.
+---
+
+# Recall Failure Triage — turn `recall@5 = X` into a ranked worklist
+
+A single recall number tells you nothing about *what* to fix. Five
+distinct failure modes produce identical recall drops, and the fix is
+completely different per mode. This skill is the bridge from "recall is
+low" to "open `parsers/cell_parser.py` line N, here is the bug".
+
+Pair with `excel-extraction-pipeline-improver` — that skill implements
+fixes; this one decides what to fix.
+
+## When this skill fires
+
+* User says "recall is low / dropped / not improving"
+* New benchmark run produced a `failures.ndjson` or `summary.json`
+* PR-time benchmark sample regressed on the CI step summary
+* You're picking the next parser/chunking PR and need to choose what's
+  highest leverage
+
+## Inputs you should expect
+
+| Input | Where it comes from |
+|---|---|
+| `summary.json` + `summary.md` | latest `tests/benchmarks/reports/retrieval/<stamp>/` |
+| `failures.ndjson` | same dir, **only present if** the run used `--emit-failures` |
+| `history.jsonl` | `tests/benchmarks/reports/history.jsonl` — recall over commits |
+| current branch + recent commits | to know what changed since the last run |
+
+## The decision procedure
+
+### 0. Apply the in-scope filter — most "misses" are unfixable
+
+Before counting buckets, run `scripts/enrich_failures.py` to mark
+instances flagged `instruction_requires_execution` — those where the
+benchmark's `answer_position` is empty in `input.xlsx` because the
+question literally asks the system to *write* the answer. ~63% of
+the SpreadsheetBench corpus is this class on the 200-sample. They are
+NOT parser bugs and CANNOT be fixed by parser work. Exclude them.
+
+```bash
+python scripts/enrich_failures.py tests/benchmarks/reports/retrieval
+jq -c 'select((.flags | contains(["instruction_requires_execution"])) | not)' \
+    tests/benchmarks/reports/retrieval/*/enriched_failures.ndjson
+```
+
+### 1. Read the bucket histogram FIRST
+
+```bash
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval
+```
+
+The output ranks buckets by count. **Spend your time on the largest
+bucket only.** The next-largest is the next PR.
+
+| Bucket | Root cause | Where to fix |
+|---|---|---|
+| `answer_absent_from_chunks` | The answer value never made it into ANY chunk. The cell was dropped or rendered as a formula expression instead of its computed value. | `parsers/cell_parser.py::_extract_cell_value` and `rendering/text_renderer.py::_cell_render_value`. Check `data_only` plumbing in `parsers/workbook_parser.py`. |
+| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. The chunk is too big/heterogeneous — embedding is diluted. | `chunking/chunker.py`. There is no token cap today; large blocks are emitted whole. Cap at ~512 tokens and row-split tall tables. |
+| `wrong_sheet` | Answer sheet was never chunked. | `parsers/workbook_parser.py` sheet loop. Check hidden sheets, very-hidden sheets, and `wb.sheetnames` vs `wb.worksheets`. |
+| `geometric_no_overlap` | Answer text matches but the chunk's A1 range doesn't overlap ground truth. Range drifts during merge/split. | `annotation/block_splitter.py` + `analysis/pattern_splitter.py`. Add invariant: `block.cell_range` ⊆ bbox of cells that actually rendered text. |
+| `no_chunks` / `parse_error` | Upstream parser failure. | The exception. Reproduce with `parse_workbook(path)`. |
+
+### 2. Read 5 example failures from the top bucket
+
+```bash
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval \
+  --bucket <largest> --examples 5
+```
+
+For each example, the script prints:
+* the natural-language question
+* `answer_position` (the ground-truth cell)
+* the ground-truth string values that should appear in a chunk
+* the top-8 ranked chunks with sheet, range, and whether each contains
+  the answer
+
+**Pattern-match.** What do the 5 examples have in common? E.g.
+* All answer cells are formulas → H2 confirmed: render the cached value
+* All chunks span the entire sheet → H1 confirmed: cap chunk size
+* All answer sheets are hidden → wrong_sheet from hidden-sheet skip
+
+### 3. State the hypothesis before changing code
+
+In your reply, write the hypothesis as one sentence:
+> "Bucket X dominates because [observable common pattern in examples].
+> Expected fix: [specific file + change]. Expected lift on recall@5:
+> [number]."
+
+The expected lift = (bucket count) ÷ (scoreable instances). If a fix
+is supposed to halve the biggest bucket but recall barely moves on the
+next run, the hypothesis was wrong — go back to step 1, don't pile on.
+
+### 4. Wire the fix into the existing pipeline
+
+Pass the hypothesis to the `excel-extraction-pipeline-improver` skill
+(or implement directly) with TDD: write a failing crossval/invariant
+test that captures the bug from one example, fix, run `make test`, then
+re-run `make bench-track` to confirm the bucket shrank.
+
+### 5. Verify on the history
+
+```bash
+tail -5 tests/benchmarks/reports/history.jsonl
+```
+
+The script prints a row-over-row delta. Expected behaviour:
+* The targeted bucket count drops
+* `recall_text@5` rises by approximately `target_bucket_drop /
+  scoreable_instances`
+* `recall_text@1` improves proportionally OR more (better chunks rank
+  higher too)
+* Other buckets shouldn't grow — if they do, the fix had a side effect
+
+## Anti-patterns
+
+* **Don't chase the headline number.** "Recall went up 1 pp" without a
+  bucket explanation is suspicious — it might be benchmark noise.
+* **Don't fix two buckets at once.** You won't know which fix worked.
+* **Don't tune chunk size as the first move.** It only helps
+  `present_but_ranked_low`. If `answer_absent_from_chunks` is bigger,
+  chunk tuning is wasted effort.
+* **Don't trust geometric and text recall independently.** A docling-style
+  parser with zero geometric overlap can still have high text recall
+  (because it serializes everything to plain text). That's not the same
+  thing as good — citation overlays need the geometric metric. Always
+  report both.
+
+## Quick reference — running it from scratch
+
+```bash
+make corpus-download                                    # one-time, large
+make bench-track                                        # runs + triages
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval --examples 5
+cat tests/benchmarks/reports/history.jsonl | tail -10   # commit-over-commit
+```
+
+CI runs a 60-instance sample on every PR (`.github/workflows/benchmark.yml`)
+and posts the bucket histogram to the GitHub job summary. Weekly schedule
+runs the full 912-instance corpus.
+
+## Hand-off contract
+
+When you finish a triage round, your final message MUST include:
+1. The bucket histogram (counts + percentages).
+2. The dominant bucket + one-sentence root cause hypothesis.
+3. The 1–2 file paths where the fix lives.
+4. The expected lift on `recall_text@5` if the fix lands.
+5. A pointer to the example instance IDs you read.
+
+Anything less and the next agent will have to re-do the analysis.
diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
new file mode 100644
index 0000000..026b535
--- /dev/null
+++ b/.github/workflows/benchmark.yml
@@ -0,0 +1,85 @@
+name: Benchmark
+
+# Tracks ks-xlsx-parser retrieval recall on SpreadsheetBench over time.
+# The headline goal: text recall@5 > 0.90 (currently ~0.70).
+#
+#   * Pull requests run a fast SAMPLE (60 instances) as a regression smoke
+#     test — keeps the signal without a 40-minute wait.
+#   * The weekly schedule + manual dispatch run the FULL 912-instance
+#     corpus and publish the recall trend.
+
+on:
+  pull_request:
+    branches: [main]
+    paths:
+      - "src/**"
+      - "scripts/eval_retrieval.py"
+      - "scripts/triage_recall.py"
+      - "Dockerfile.bench"
+      - ".github/workflows/benchmark.yml"
+  schedule:
+    - cron: "0 6 * * 1" # Mondays 06:00 UTC
+  workflow_dispatch:
+    inputs:
+      sample:
+        description: "Instances to sample (0 = full 912 corpus)"
+        default: "0"
+
+concurrency:
+  group: benchmark-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  benchmark:
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - uses: actions/checkout@v4
+
+      # PRs use a 60-instance sample; scheduled/dispatch runs use the full
+      # corpus (or whatever the dispatch input requests).
+      - name: Resolve sample size
+        id: cfg
+        run: |
+          if [ "${{ github.event_name }}" = "pull_request" ]; then
+            echo "sample=60" >> "$GITHUB_OUTPUT"
+          else
+            echo "sample=${{ github.event.inputs.sample || 0 }}" >> "$GITHUB_OUTPUT"
+          fi
+
+      - name: Cache SpreadsheetBench corpus
+        uses: actions/cache@v4
+        with:
+          path: data/corpora/spreadsheetbench
+          key: spreadsheetbench-912-v0.1
+
+      - name: Build benchmark image
+        run: docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
+
+      - name: Run benchmark
+        run: |
+          mkdir -p tests/benchmarks/reports data
+          docker run --rm \
+            -e BENCH_SAMPLE=${{ steps.cfg.outputs.sample }} \
+            -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
+            -v "$PWD/data:/app/data" \
+            ks-xlsx-parser-bench | tee bench.log
+
+      - name: Publish recall to job summary
+        if: always()
+        run: |
+          {
+            echo '## ks-xlsx-parser retrieval benchmark'
+            echo ''
+            echo '```'
+            tail -n 40 bench.log || true
+            echo '```'
+          } >> "$GITHUB_STEP_SUMMARY"
+
+      - name: Upload benchmark reports
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: benchmark-reports-${{ github.run_number }}
+          path: tests/benchmarks/reports/
+          if-no-files-found: warn
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 1168a01..ba7b9e2 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -23,22 +23,20 @@ jobs:
     steps:
       - uses: actions/checkout@v4
 
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          version: "latest"
+          enable-cache: true
+          cache-dependency-glob: "uv.lock"
+
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
-          cache: pip
-          cache-dependency-path: pyproject.toml
 
       - name: Install
-        run: |
-          python -m pip install --upgrade pip
-          pip install -e ".[dev,api]"
-
-      - name: Ruff lint
-        run: |
-          pip install ruff
-          ruff check src/ tests/ scripts/ || true  # non-blocking until cleanup PR lands
+        run: uv pip install --system -e ".[dev,api]"
 
       - name: Run test suite
         run: make test-ci
@@ -50,3 +48,56 @@ jobs:
           name: junit-${{ matrix.os }}-py${{ matrix.python-version }}
           path: reports/junit.xml
           if-no-files-found: ignore
+
+  # Builds the wheel and proves it installs + imports in a CLEAN venv.
+  # The matrix `test` job runs against an editable install, which exposes the
+  # whole src/ tree on sys.path and therefore CANNOT catch a broken wheel —
+  # exactly how the v0.2.0 "pipeline.py missing from wheel" bug shipped.
+  # This job is the regression guard. Keep it required.
+  wheel-check:
+    name: wheel install smoke test
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Build wheel
+        run: |
+          python -m pip install --upgrade pip build
+          python -m build --wheel
+
+      - name: Verify wheel installs and imports in a clean venv
+        run: python scripts/verify_wheel.py
+
+  lint:
+    name: ruff lint
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: Ruff lint
+        run: |
+          python -m pip install --upgrade pip ruff
+          # TODO(cleanup): 45 pre-existing findings (E402/B905/SIM*). Drop the
+          # `|| true` once the lint-cleanup PR lands so this job gates merges.
+          ruff check src/ tests/ scripts/ || true
+
+  typecheck:
+    name: mypy
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: mypy
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e ".[dev,api]"
+          # TODO(cleanup): 59 pre-existing findings. Drop `|| true` once typed.
+          python -m mypy src/ks_xlsx_parser || true
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
index 5cf43b6..1da9bcc 100644
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -37,6 +37,12 @@ jobs:
       - name: Build wheel + sdist
         run: python -m build
 
+      # Gate the release on a clean-venv install of the freshly built wheel.
+      # Prevents shipping a wheel that drops modules or leaks top-level
+      # packages (the v0.2.0 packaging regression).
+      - name: Verify wheel
+        run: python scripts/verify_wheel.py
+
       - name: Upload distribution artifacts
         uses: actions/upload-artifact@v4
         with:
diff --git a/CHANGELOG.md b/CHANGELOG.md
index d7527b2..99d5832 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -45,6 +45,52 @@ Template for a new release (copy this block, fill in, move Unreleased items in):
 
 ## [Unreleased]
 
+### ⚠️ BREAKING (Fixed — see also #ks-xlsx-parser channel report)
+- Repository layout flattened on `src/` was leaking 13 generic top-level
+  packages (`models`, `utils`, `parsers`, …) into installed wheels and
+  silently dropping `pipeline.py` and `api.py` (setuptools `packages.find`
+  only finds *packages*, not top-level modules). Users hitting
+  `from ks_xlsx_parser.pipeline import ...` on 0.2.0 from PyPI got
+  `ModuleNotFoundError`. **All modules now live under
+  `src/ks_xlsx_parser/`**; the wheel's `top_level.txt` contains only
+  `ks_xlsx_parser`. Imports inside the package switched from
+  `from pipeline import` to `from ks_xlsx_parser.pipeline import`.
+  Downstream code that imported the leaked generics
+  (`from models import …`) MUST migrate to `from ks_xlsx_parser.models …`.
+
+### Added
+- `scripts/verify_wheel.py` — builds the wheel, installs it in a fresh
+  venv, and asserts the public import surface resolves. Wired into a
+  new `wheel-check` job in `.github/workflows/ci.yml` and a `Verify wheel`
+  step in `release.yml`. Regression guard for the packaging bug above.
+- `scripts/triage_recall.py` + `scripts/append_bench_history.py` — turn
+  `failures.ndjson` into a ranked bucket histogram with exemplar
+  failures, and append each benchmark run to
+  `tests/benchmarks/reports/history.jsonl` so recall is tracked
+  commit-over-commit. Goal: text recall@5 > 0.90.
+- `eval_retrieval.py --emit-failures` — dumps top-8 ranked chunks per
+  miss with a `failure_bucket` (answer_absent_from_chunks /
+  present_but_ranked_low / wrong_sheet / geometric_no_overlap / …) for
+  triage. Summary JSON gains a `failure_buckets` histogram.
+- `Dockerfile.bench` + `.github/workflows/benchmark.yml` — reproducible
+  benchmark image; PR sample run (60 instances), weekly full corpus run.
+- `make install-dev` alias and `make wheel-check` / `make bench-track`
+  / `make docker-bench` targets.
+- New `bench` optional-dependency group (`sentence-transformers`,
+  `numpy`) — only the benchmark needs these.
+- `docs/recall-investigation.md` documenting the diagnosis framework and
+  three named hypotheses (chunk-size dilution, formula-expression
+  rendering, range-bookkeeping drift).
+- `.claude/skills/recall-failure-triage/SKILL.md` — agent skill that
+  consumes the bucket output and proposes ranked fixes.
+
+### Changed
+- Dropped `PYTHONPATH=src` from Makefile benchmark targets — the
+  package is now properly installable so callers don't need it.
+- `pyproject.toml`: `packages.find` constrained to `ks_xlsx_parser*`,
+  `py.typed` declared as package data, `xlsx-parser-api` console script
+  updated to `ks_xlsx_parser.api:main`.
+
 ### ⚠️ BREAKING
 - Retired the in-tree `testBench/` corpus. The 1054-workbook stress dataset
   and `make testbench*` targets are gone — benchmarks now run against the
diff --git a/Dockerfile.bench b/Dockerfile.bench
new file mode 100644
index 0000000..cf2755f
--- /dev/null
+++ b/Dockerfile.bench
@@ -0,0 +1,49 @@
+# Benchmark image for ks-xlsx-parser.
+#
+# Builds once, then on each run downloads SpreadsheetBench (if not cached),
+# parses the corpus, embeds chunks with a small sentence-transformer, and
+# emits a recall@k report + failure-bucket triage. The output lands in
+# tests/benchmarks/reports/ — mount that path as a volume to persist results.
+#
+# Usage:
+#   docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
+#   docker run --rm \
+#       -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
+#       -v "$PWD/data:/app/data" \
+#       ks-xlsx-parser-bench
+#
+#   # Quick sanity run on 20 instances:
+#   docker run --rm -e BENCH_SAMPLE=20 ks-xlsx-parser-bench
+
+FROM python:3.12-slim
+
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1 \
+    PIP_DISABLE_PIP_VERSION_CHECK=1
+
+WORKDIR /app
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+        curl unzip ca-certificates git \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install deps first to keep layers cacheable across code edits.
+COPY pyproject.toml README.md ./
+COPY src ./src
+RUN pip install -e ".[dev,bench]"
+
+# Pre-warm the embedding model so the first ``docker run`` doesn't pay the
+# ~80 MB download. Same model name eval_retrieval.py defaults to.
+RUN python -c "from sentence_transformers import SentenceTransformer; \
+SentenceTransformer('BAAI/bge-small-en-v1.5')"
+
+COPY scripts ./scripts
+COPY tests ./tests
+COPY Makefile ./
+
+ENV BENCH_SAMPLE=0 \
+    BENCH_PARSERS=ks \
+    BENCH_TIMEOUT=120
+
+ENTRYPOINT ["bash", "scripts/run_bench.sh"]
diff --git a/Makefile b/Makefile
index 9bedb5b..5331c52 100644
--- a/Makefile
+++ b/Makefile
@@ -1,4 +1,5 @@
-.PHONY: help install test test-ci lint format typecheck clean corpus-download bench-robust bench-retrieval bench
+.PHONY: help install install-dev test test-ci lint format typecheck wheel-check clean \
+	corpus-download bench-robust bench-retrieval bench bench-track docker-bench
 
 PYTHON ?= python
 PKG_VERSION := $(shell $(PYTHON) -c "import tomllib, pathlib; print(tomllib.loads(pathlib.Path('pyproject.toml').read_text())['project']['version'])")
@@ -7,22 +8,29 @@ help:
 	@echo "ks-xlsx-parser — common targets"
 	@echo ""
 	@echo "  make install         Install package and dev deps (editable)"
+	@echo "  make install-dev     Alias for install (matches ks-backend)"
 	@echo "  make test            Run the default test suite"
 	@echo "  make test-ci         Run the suite with verbose output for CI"
 	@echo ""
 	@echo "  make lint            Ruff lint"
 	@echo "  make format          Ruff format"
 	@echo "  make typecheck       mypy"
+	@echo "  make wheel-check     Build wheel + verify it imports in a clean venv"
 	@echo ""
 	@echo "  make corpus-download Fetch SpreadsheetBench for benchmark runs"
 	@echo ""
 	@echo "  make bench-robust    Robustness on SpreadsheetBench (ks vs docling, ~20 min)"
 	@echo "  make bench-retrieval Retrieval recall on SpreadsheetBench (ks vs docling, ~40 min)"
 	@echo "  make bench           Run both benchmarks back-to-back"
+	@echo "  make bench-track     Run retrieval bench + append metrics to history"
+	@echo "  make docker-bench    Build + run the benchmark Docker image"
 
 install:
 	$(PYTHON) -m pip install -e ".[dev,api]"
 
+# Alias — junior devs pattern-match off ks-backend's `make install-dev`.
+install-dev: install
+
 test:
 	$(PYTHON) -m pytest tests/ -v --tb=short -W ignore::UserWarning
 
@@ -36,7 +44,15 @@ format:
 	$(PYTHON) -m ruff format src/ tests/ scripts/
 
 typecheck:
-	$(PYTHON) -m mypy src/xlsx_parser
+	$(PYTHON) -m mypy src/ks_xlsx_parser
+
+# Build the wheel and prove it imports outside the editable source tree.
+# This is the regression guard for the v0.2.0 packaging bug (pipeline.py
+# missing from the wheel because it was a top-level module, not a package).
+wheel-check:
+	rm -rf dist build
+	$(PYTHON) -m build --wheel
+	$(PYTHON) scripts/verify_wheel.py
 
 clean:
 	rm -rf build/ dist/ *.egg-info src/*.egg-info .pytest_cache .ruff_cache .mypy_cache
@@ -47,13 +63,26 @@ corpus-download:
 
 bench-robust:
 	@test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1)
-	PYTHONPATH=src $(PYTHON) -m tests.benchmarks.vs_hucre \
+	$(PYTHON) -m tests.benchmarks.vs_hucre \
 		--corpus data/corpora/spreadsheetbench --parsers ks,docling \
 		--per-file-timeout 120 \
 		--out tests/benchmarks/reports/spreadsheetbench
 
 bench-retrieval:
 	@test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1)
-	PYTHONPATH=src $(PYTHON) scripts/eval_retrieval.py --parsers ks,docling
+	$(PYTHON) scripts/eval_retrieval.py --parsers ks,docling
 
 bench: bench-robust bench-retrieval
+
+# Run the retrieval benchmark and append a row to history.jsonl so
+# accuracy can be tracked commit-over-commit. Goal: text recall@5 > 0.90.
+bench-track:
+	@test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1)
+	$(PYTHON) scripts/eval_retrieval.py --parsers ks --emit-failures \
+		--out tests/benchmarks/reports/retrieval
+	$(PYTHON) scripts/append_bench_history.py
+	$(PYTHON) scripts/triage_recall.py tests/benchmarks/reports/retrieval
+
+docker-bench:
+	docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
+	docker run --rm -v "$(PWD)/tests/benchmarks/reports:/app/tests/benchmarks/reports" ks-xlsx-parser-bench
diff --git a/docs/benchmark-local-setup.md b/docs/benchmark-local-setup.md
new file mode 100644
index 0000000..ab39803
--- /dev/null
+++ b/docs/benchmark-local-setup.md
@@ -0,0 +1,235 @@
+# Running the retrieval benchmark locally
+
+This is the loop you'll use whenever you want to know "is my parser
+change actually moving recall?" — the same pipeline CI runs, just on
+your laptop. Goal: text recall@5 > 0.90 (currently ~0.70).
+
+The full SpreadsheetBench corpus is 912 instances and takes ~30–45 min
+on a Mac. For an iteration loop you'll mostly use `--sample 60` (≈ 3 min
+after the first embedding-model load).
+
+## TL;DR
+
+```bash
+# One-time
+make install-dev              # installs the parser + dev deps
+pip install -e ".[bench]"     # sentence-transformers + numpy (~500 MB)
+make corpus-download          # downloads SpreadsheetBench → data/corpora/
+                              # (91 MB tarball, 2,726 .xlsx after extract)
+
+# Each time you want to score
+python scripts/eval_retrieval.py \
+  --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \
+  --parsers ks \
+  --sample 60 \
+  --emit-failures
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval
+python scripts/append_bench_history.py
+```
+
+Output: `tests/benchmarks/reports/retrieval/<UTC stamp>/` with
+`results.ndjson`, `failures.ndjson`, `summary.json`, `summary.md`.
+
+## Step-by-step
+
+### 1. Install bench deps
+
+The benchmark uses `sentence-transformers` (≈ 500 MB with torch) for
+embeddings. They're a separate optional group so the parser package
+itself stays lean:
+
+```bash
+pip install -e ".[bench]"
+```
+
+First run also downloads the embedding model (`BAAI/bge-small-en-v1.5`,
+≈ 130 MB) into `~/.cache/huggingface/`.
+
+### 2. Download the corpus
+
+```bash
+make corpus-download
+```
+
+This is the same `scripts/download_corpora.sh` CI uses. It pulls
+SpreadsheetBench v0.1 plus a few legacy XLSX samples; only
+SpreadsheetBench matters for retrieval scoring.
+
+Layout you should end up with:
+
+```
+data/corpora/spreadsheetbench/
+  all_data_912_v0.1/
+    dataset.json                       # 912 (question, answer_position) tuples
+    spreadsheet/
+      <instance-id>/
+        1_<instance-id>_input.xlsx     # test case 1 input
+        1_<instance-id>_answer.xlsx    # test case 1 ground-truth output
+        2_..., 3_...                   # additional test cases per instance
+```
+
+`data/` is gitignored — never commit corpus files.
+
+### 3. Run the benchmark
+
+A typical iteration cycle uses a sample for speed:
+
+```bash
+python scripts/eval_retrieval.py \
+  --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \
+  --parsers ks \
+  --sample 60 \
+  --emit-failures
+```
+
+Flags worth knowing:
+
+| Flag                  | What it does                                                |
+|-----------------------|-------------------------------------------------------------|
+| `--parsers ks,docling`| Score one or both parsers. Docling is heavy; skip unless comparing. |
+| `--sample N`          | Random N-instance subset (seeded). Omit for the full 912.   |
+| `--seed 1337`         | Random seed for `--sample`. Stays stable across runs.       |
+| `--emit-failures`     | Also write `failures.ndjson` with top-8 chunks per miss.    |
+| `--test-case 1`       | Which of the (usually 3) test cases per instance to score.  |
+| `--per-parser-timeout`| Wall-clock seconds before a hung parse is killed. Default 60. |
+
+For a full run, drop `--sample` and add `--per-parser-timeout 120`:
+
+```bash
+python scripts/eval_retrieval.py \
+  --corpus data/corpora/spreadsheetbench/all_data_912_v0.1 \
+  --parsers ks \
+  --emit-failures \
+  --per-parser-timeout 120
+```
+
+`make bench-retrieval` is the same thing with the canonical defaults.
+
+### 4. Read the triage report
+
+```bash
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval
+```
+
+This auto-finds the most recent run and prints:
+
+* **Bucket histogram** ranked by count. The top bucket is the
+  highest-leverage thing to fix next.
+* **3 example failures** per bucket showing the question, the
+  ground-truth answer cell + values, and the top-8 chunks the parser
+  produced (with a ✓ next to chunks that contain the answer).
+
+Drill into one bucket:
+
+```bash
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval \
+  --bucket answer_absent_from_chunks --examples 10
+```
+
+Five buckets and what they mean:
+
+| Bucket | Root cause | Fix lives in |
+|---|---|---|
+| `answer_absent_from_chunks` | Answer value in NO chunk. Cell dropped or rendered as formula. | `parsers/cell_parser.py`, `rendering/text_renderer.py` |
+| `present_but_ranked_low` | A chunk DOES contain the answer but ranked >5. Chunk too big/heterogeneous. | `chunking/chunker.py` |
+| `wrong_sheet` | Answer sheet never chunked. | `parsers/workbook_parser.py` |
+| `geometric_no_overlap` | Text matches but the chunk's A1 range doesn't overlap GT. | `annotation/block_splitter.py`, `analysis/pattern_splitter.py` |
+| `no_chunks` / `parse_error` | Upstream parser failure. | The exception. |
+
+See `docs/recall-investigation.md` for the named hypotheses behind each
+bucket and `.claude/skills/recall-failure-triage/SKILL.md` for the
+agent-driven loop.
+
+### 5. Append to history.jsonl
+
+```bash
+python scripts/append_bench_history.py
+```
+
+Appends one row per benchmark run to
+`tests/benchmarks/reports/history.jsonl` tagged with the current git
+commit, and prints the row-over-row delta on the headline metrics:
+
+```
+appended to tests/benchmarks/reports/history.jsonl:
+  commit 421783f  recall_text@5=0.704  recall_text@1=0.580
+  recall_text@5: 0.6800 → 0.7040  ▲ +0.0240
+```
+
+That's how "is recall improving?" gets answered. Goal: `recall_text@5 > 0.90`.
+
+`make bench-track` chains eval + history-append + triage in one go.
+
+## Docker path (matches CI exactly)
+
+When you want to make sure local results aren't drifting from CI:
+
+```bash
+docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
+
+# Quick sanity (60 instances, ~3 min after image load):
+docker run --rm \
+  -e BENCH_SAMPLE=60 \
+  -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
+  -v "$PWD/data:/app/data" \
+  ks-xlsx-parser-bench
+
+# Full corpus:
+docker run --rm \
+  -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
+  -v "$PWD/data:/app/data" \
+  ks-xlsx-parser-bench
+```
+
+The image pre-warms the embedding model at build time so the first
+`docker run` doesn't pay the 130 MB download.
+
+Environment knobs (also work in CI dispatch):
+
+| Env var          | Default | What it does                              |
+|------------------|---------|-------------------------------------------|
+| `BENCH_SAMPLE`   | `0`     | Sample N instances (0 = full 912)         |
+| `BENCH_PARSERS`  | `ks`    | Comma list (e.g. `ks,docling`)            |
+| `BENCH_TIMEOUT`  | `120`   | Per-file parse timeout in seconds         |
+
+## Adding a new failure bucket
+
+If you find a recall failure mode that doesn't fit any of the existing
+six buckets, add it instead of stuffing it into `answer_absent_from_chunks`:
+
+1. Append the new bucket name to `FAILURE_BUCKETS` in
+   `scripts/eval_retrieval.py`.
+2. Update `classify_text_failure` so it can return the new name. Keep
+   the predicate cheap — it runs once per scored instance.
+3. Add the bucket + a one-line root cause + fix location to the table
+   in `docs/recall-investigation.md` and the SKILL file.
+4. Re-run `make bench-track`; confirm the histogram shows the new
+   bucket and counts make sense.
+
+## Troubleshooting
+
+* `ModuleNotFoundError: No module named 'sentence_transformers'`
+  — you skipped `pip install -e ".[bench]"`.
+* `dataset.json not found in ...` — your `--corpus` is pointing at
+  `data/corpora/spreadsheetbench`, not the nested
+  `data/corpora/spreadsheetbench/all_data_912_v0.1`. The benchmark
+  expects the leaf directory that contains `dataset.json`.
+* `FileNotFoundError: 1_<id>_input.xlsx` — the corpus tarball didn't
+  fully extract. Delete `data/corpora/spreadsheetbench/` and re-run
+  `make corpus-download`.
+* `recall_text@5 = 0.0` on a sample of 5 — small samples have huge
+  variance because the benchmark seeds. Bump to `--sample 60` minimum
+  before trusting the number; use `--sample 0` (full corpus) for a
+  decision-grade comparison.
+* MPS / CUDA torch errors on first sentence-transformers import —
+  re-install with `pip install --upgrade torch torchvision`. The
+  benchmark runs fine on CPU.
+
+## See also
+
+* `scripts/eval_retrieval.py` — the benchmark itself.
+* `scripts/triage_recall.py` — the bucket histogram + exemplar dump.
+* `scripts/append_bench_history.py` — history.jsonl row writer.
+* `Dockerfile.bench` — reproducible benchmark image.
+* `docs/recall-investigation.md` — diagnosis framework & hypotheses.
+* `.claude/skills/recall-failure-triage/SKILL.md` — agent guide.
diff --git a/docs/recall-investigation.md b/docs/recall-investigation.md
new file mode 100644
index 0000000..ee630fd
--- /dev/null
+++ b/docs/recall-investigation.md
@@ -0,0 +1,176 @@
+# Retrieval-recall investigation — getting ks-xlsx-parser to >0.90
+
+## Where we are (v0.2.0 on SpreadsheetBench, 912 instances)
+
+| Metric                 | ks-xlsx-parser | docling 2.93 |
+|------------------------|----------------|--------------|
+| Parse success          | 99.945%        | not run at scale |
+| Recall@1 (text-match)  | 0.580          | 0.579 |
+| Recall@3 (text-match)  | 0.697          | 0.670 |
+| Recall@5 (text-match)  | **0.704**      | 0.686 |
+| Recall@5 (geometric)   | 0.369          | 0.000 (no A1 anchors) |
+| Mean parse time        | 251 ms         | 265 ms |
+
+Recall@5 = 0.704 means **~30% of questions miss** with k=5. To reach 0.90
+we need to roughly cut the miss rate to a third. A single recall number
+hides which lever to pull, which is why this branch ships failure
+bucketing (`scripts/eval_retrieval.py --emit-failures` +
+`scripts/triage_recall.py`).
+
+## The diagnosis framework — why the bucket histogram is the answer
+
+Every recall@5 miss falls into one of these buckets. The fix is
+completely different per bucket, and only one or two will dominate. The
+job of the investigator is to read the histogram FIRST, then commit to
+fixing the biggest one.
+
+| Bucket | What it means | Where to look |
+|---|---|---|
+| `answer_absent_from_chunks` | Answer value is in NO chunk. Cell was dropped or garbled. | `parsers/cell_parser.py`, `rendering/text_renderer.py::_cell_render_value` |
+| `present_but_ranked_low`    | A chunk DOES contain the answer but ranked >5. Chunk is too large/heterogeneous; the embedding is diluted. | `chunking/chunker.py` (no token cap), `analysis/table_assembler.py` (over-merging) |
+| `wrong_sheet`               | Answer sheet was never chunked. Sheet enumeration missed it. | `parsers/workbook_parser.py` sheet loop |
+| `geometric_no_overlap`      | No chunk's A1 range overlaps ground truth. Range bookkeeping drifts during merge/split. | `annotation/block_splitter.py`, `analysis/pattern_splitter.py` |
+| `no_chunks` / `parse_error` | Upstream parser failure. | The parse exception — fix the crash. |
+
+## First-run findings (200-sample seed=1337, May 2026)
+
+Headline numbers refuted my **H1** ("present_but_ranked_low dominates")
+and surfaced something the original analysis missed entirely. Of 157
+failures in the seed=1337 sample:
+
+* **127 (81%) are `instruction_requires_execution`** — the benchmark
+  is asking the system to *compute and write* the answer. The
+  `answer_position` cell range in `input.xlsx` is empty by design. A
+  parser cannot retrieve what isn't there. These instances need to be
+  filtered from the headline metric (TODO 05).
+* **30 are truly actionable** parser failures, clustering into 4 named
+  buckets (TODOs 00, 01, 02, 03, 04 in
+  [docs/planning/recall-90/](./planning/recall-90/)).
+* Zero of the 30 are `present_but_ranked_low`. Chunk-size dilution
+  may matter on the full 912-corpus but is NOT the dominant problem
+  on the 200-sample.
+
+The dominant **citation-grade killer** turned out to be
+`text_hit_geom_miss` (84 of 200, mostly inside the 127 execution
+instances; 9 are actionable parser bugs). The chunk text contains
+the answer string, but the chunk's claimed A1 range doesn't overlap
+ground truth — exactly **H3** ("range-bookkeeping drift").
+
+When the in-scope filter is applied, real `recall_text@5` is
+**0.59**, not 0.635. Closing the 30 actionable failures gets us
+near 0.90 on the in-scope metric.
+
+See [docs/planning/recall-90/](./planning/recall-90/) for the
+ranked TODO list.
+
+## A priori hypotheses (now confirmed/refuted; left as history)
+
+### H1 — `present_but_ranked_low` is the biggest bucket
+
+There is no per-chunk token cap in `chunking/chunker.py` (`CHARS_PER_TOKEN`
+is only used to *report* `token_count`, never to split). On
+SpreadsheetBench many input files are single-sheet ledgers where the
+block-assembler collapses the whole sheet into one chunk. The
+sentence-transformer query embedding then has to compete against ~2k
+tokens of mostly irrelevant text; the relevant ~5 tokens get washed out.
+
+If H1 is right, the histogram will show `present_but_ranked_low` ≫ the
+others, and recall@1 (0.580) will be much worse than recall@5 (0.704)
+— exactly what we observe (Δ = 12.4 pp, vs typical Δ ≈ 5–6 pp when
+chunks are right-sized).
+
+**Fix**: hard cap chunks at ~512 tokens by row-splitting tables and add
+a "row group" sub-chunk for tall tables. This is a 1–2 day surgical
+change in `chunking/chunker.py`.
+
+### H2 — `answer_absent_from_chunks` dominates the geometric gap
+
+`parsers/workbook_parser.py` loads both `data_only=False` (formula
+expressions) and `data_only=True` (computed values). But what flows into
+`render_text` is whichever `_cell_render_value` picks. If the cell is a
+formula like `=SUM(B2:B10)`, `display_value` may be the *expression*
+when the workbook was saved without cached values (LibreOffice and some
+generated files do this). Those answer cells become unfindable by text
+match even though the data IS in the spreadsheet.
+
+**Diagnostic**: count failure rows where every `top_chunks[*].text`
+matches the formula expression pattern (`=`, function name) but not the
+expected numeric value. The bucket emits the top-8 chunks for inspection.
+
+**Fix**: when the cached value is missing for a formula cell, evaluate it
+with our own formula engine (`formula/formula_parser.py` already exists)
+or use python-calamine's value-only pass as the source of truth for
+render text — never the formula source.
+
+### H3 — `geometric_no_overlap` is high because block ranges over-extend
+
+Geometric recall@5 = 0.369 means **only ~37% of the time** does the
+chunk a parser surfaces actually cover the ground-truth answer cell —
+even when the text match works. The block-detection pipeline merges
+sparse blocks (`analysis/light_block_detector.py`) and groups by
+similarity (`analysis/table_grouper.py`). Each merge widens the
+top-left/bottom-right anchors. If the anchors are widened past the
+sheet's true content, downstream citation overlays in ks-backend will
+highlight whitespace, and the geometric metric registers the chunk as
+"not overlapping" because its claimed range is so large it's not useful.
+
+**Fix**: after every merge/split, clip `cell_range` to the tight bounding
+box of the cells that actually contributed text. Add an invariant test
+that `block.cell_range` ⊆ `bounding_box(block.cells)`.
+
+## How to confirm — the next benchmark run
+
+1. `make corpus-download` (one-time, ~hundreds of MB).
+2. `make bench-track` — runs the full benchmark, appends to
+   `tests/benchmarks/reports/history.jsonl`, prints the bucket triage.
+3. Read the histogram. Pick the biggest bucket. Open 3–5 example
+   failures with `python scripts/triage_recall.py <reports-dir>
+   --bucket <name> --examples 5`. Each row shows:
+   * the natural-language question
+   * the ground-truth answer cell + values
+   * the top-8 ranked chunks we produced (sheet, A1 range, text snippet)
+   * whether each chunk contains the answer
+4. Pattern-match across 5 examples — what's the common parser behaviour?
+   That tells you the fix.
+5. Implement, re-run `make bench-track`. The script prints the delta
+   row-over-row so improvement is visible immediately.
+
+## How to use the Docker image (CI + reproducibility)
+
+```bash
+# Build once
+docker build -f Dockerfile.bench -t ks-xlsx-parser-bench .
+
+# Quick smoke (60 instances, < 2 min)
+docker run --rm -e BENCH_SAMPLE=60 ks-xlsx-parser-bench
+
+# Full corpus, persist reports + corpus cache
+docker run --rm \
+  -v "$PWD/tests/benchmarks/reports:/app/tests/benchmarks/reports" \
+  -v "$PWD/data:/app/data" \
+  ks-xlsx-parser-bench
+```
+
+The `Benchmark` GitHub workflow:
+* Runs a 60-instance smoke on every PR that touches `src/` or the
+  benchmark scripts.
+* Runs the full 912-instance corpus weekly (Monday 06:00 UTC) and on
+  manual dispatch.
+* Uploads `tests/benchmarks/reports/*` as a build artifact and posts the
+  recall summary to the job step summary.
+
+## Goal & cadence
+
+Target: **text recall@5 ≥ 0.90** by end of the current quarter.
+
+Track in `tests/benchmarks/reports/history.jsonl` (commit-over-commit
+row append). Refuse merges that drop recall@5 by ≥ 2 pp on the sample
+run (planned gate; today the PR job is reporting-only).
+
+## See also
+
+* `scripts/eval_retrieval.py` — the benchmark itself.
+* `scripts/triage_recall.py` — bucket histogram + example dump.
+* `scripts/append_bench_history.py` — history.jsonl row writer.
+* `.claude/skills/recall-failure-triage/SKILL.md` — agent guide.
+* `Dockerfile.bench` — reproducible benchmark image.
diff --git a/examples/demo.py b/examples/demo.py
index dbf6118..de59151 100644
--- a/examples/demo.py
+++ b/examples/demo.py
@@ -12,8 +12,8 @@
 # Add src to path for development
 sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
 
-from xlsx_parser.pipeline import parse_workbook
-from xlsx_parser.utils.logging_config import configure_logging
+from ks_xlsx_parser.pipeline import parse_workbook
+from ks_xlsx_parser.utils.logging_config import configure_logging
 
 EXAMPLES_DIR = Path(__file__).parent / "fixtures"
 
@@ -136,7 +136,7 @@ def demo_engineering_calcs():
     print(f"  Named Ranges: {[nr.name for nr in wb.named_ranges]}")
 
     # Show dependency chain for Design Moment (C15)
-    from xlsx_parser.models import CellCoord
+    from ks_xlsx_parser.models import CellCoord
     upstream = wb.dependency_graph.get_upstream(
         "Beam Design", CellCoord(row=15, col=3), max_depth=3
     )
diff --git a/examples/generate_examples.py b/examples/generate_examples.py
index 8d25e01..245caa2 100644
--- a/examples/generate_examples.py
+++ b/examples/generate_examples.py
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Generate example Excel workbooks for demonstrating the xlsx_parser.
+Generate example Excel workbooks for demonstrating the ks_xlsx_parser.
 
 Creates several representative workbooks in the examples/ folder
 that showcase the parser's capabilities across different Excel features.
diff --git a/examples/stress_test/stress_test_runner.py b/examples/stress_test/stress_test_runner.py
index 2551765..98128fa 100644
--- a/examples/stress_test/stress_test_runner.py
+++ b/examples/stress_test/stress_test_runner.py
@@ -18,7 +18,7 @@
 PROJECT_ROOT = Path(__file__).parent.parent.parent
 sys.path.insert(0, str(PROJECT_ROOT / "src"))
 
-from xlsx_parser.pipeline import parse_workbook
+from ks_xlsx_parser.pipeline import parse_workbook
 
 STRESS_DIR = Path(__file__).parent
 
diff --git a/pyproject.toml b/pyproject.toml
index a425853..bbaa8ac 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -46,7 +46,7 @@ Repository = "https://github.com/knowledgestack/ks-xlsx-parser"
 Documentation = "https://github.com/knowledgestack/ks-xlsx-parser#readme"
 
 [project.scripts]
-xlsx-parser-api = "api:main"
+xlsx-parser-api = "ks_xlsx_parser.api:main"
 
 [project.optional-dependencies]
 api = [
@@ -64,6 +64,12 @@ dev = [
     "ruff>=0.6.0",
     "mypy>=1.0",
 ]
+# Retrieval-recall benchmark (scripts/eval_retrieval.py). Heavy — only the
+# benchmark Docker image and `make bench-retrieval` need these.
+bench = [
+    "sentence-transformers>=2.2.0",
+    "numpy>=1.24.0",
+]
 
 [tool.pytest.ini_options]
 testpaths = ["tests"]
@@ -80,6 +86,10 @@ addopts = "-m 'not corpus'"
 
 [tool.setuptools.packages.find]
 where = ["src"]
+include = ["ks_xlsx_parser*"]
+
+[tool.setuptools.package-data]
+ks_xlsx_parser = ["py.typed"]
 
 [tool.ruff]
 line-length = 110
diff --git a/scripts/append_bench_history.py b/scripts/append_bench_history.py
new file mode 100755
index 0000000..01def1f
--- /dev/null
+++ b/scripts/append_bench_history.py
@@ -0,0 +1,101 @@
+#!/usr/bin/env python3
+"""Append the latest retrieval-benchmark run to a commit-over-commit history.
+
+``eval_retrieval.py`` writes a timestamped ``summary.json`` per run. This
+script picks the most recent one, flattens the headline metrics, tags it
+with the current git commit, and appends one JSON line to
+``tests/benchmarks/reports/history.jsonl``.
+
+That history file is what makes "is recall improving over time?" answerable
+— plot it, diff it in CI, or just ``tail`` it. Goal: text recall@5 > 0.90.
+
+Usage:
+    python scripts/append_bench_history.py
+    python scripts/append_bench_history.py --reports-dir tests/benchmarks/reports/retrieval
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import subprocess
+from datetime import UTC, datetime
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+HISTORY = ROOT / "tests" / "benchmarks" / "reports" / "history.jsonl"
+
+
+def git_commit() -> str:
+    try:
+        return subprocess.check_output(
+            ["git", "rev-parse", "--short", "HEAD"], cwd=ROOT, text=True
+        ).strip()
+    except Exception:
+        return "unknown"
+
+
+def latest_summary(reports_dir: Path) -> Path:
+    summaries = sorted(reports_dir.glob("*/summary.json"))
+    if not summaries:
+        raise SystemExit(
+            f"no summary.json under {reports_dir} — run `make bench-retrieval` first"
+        )
+    return summaries[-1]
+
+
+def main(argv: list[str] | None = None) -> int:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--reports-dir", type=Path,
+                    default=ROOT / "tests" / "benchmarks" / "reports" / "retrieval")
+    ap.add_argument("--parser", default="ks-xlsx-parser",
+                    help="which parser's metrics to record")
+    args = ap.parse_args(argv)
+
+    summary_path = latest_summary(args.reports_dir)
+    summary = json.loads(summary_path.read_text())
+    metrics = summary.get(args.parser)
+    if metrics is None:
+        raise SystemExit(
+            f"parser {args.parser!r} not in {summary_path}; "
+            f"have: {list(summary)}"
+        )
+
+    row = {
+        "timestamp": datetime.now(UTC).isoformat(),
+        "commit": git_commit(),
+        "parser": args.parser,
+        "run": summary_path.parent.name,
+        "instances": metrics.get("instances"),
+        "recall_text@1": metrics.get("recall_text@1"),
+        "recall_text@3": metrics.get("recall_text@3"),
+        "recall_text@5": metrics.get("recall_text@5"),
+        "recall_geometric@5": metrics.get("recall_geometric@5"),
+        "table_fragmentation_rate": metrics.get("table_fragmentation_rate"),
+        "mean_parse_ms": metrics.get("mean_parse_ms"),
+        "errors": metrics.get("errors"),
+        "failure_buckets": metrics.get("failure_buckets"),
+    }
+
+    HISTORY.parent.mkdir(parents=True, exist_ok=True)
+    with HISTORY.open("a") as f:
+        f.write(json.dumps(row, separators=(",", ":")) + "\n")
+
+    print(f"appended to {HISTORY.relative_to(ROOT)}:")
+    print(f"  commit {row['commit']}  recall_text@5={row['recall_text@5']}  "
+          f"recall_text@1={row['recall_text@1']}")
+
+    # Show the trend if there's history to compare against.
+    rows = [json.loads(ln) for ln in HISTORY.read_text().splitlines() if ln.strip()]
+    if len(rows) >= 2:
+        prev, cur = rows[-2], rows[-1]
+        for k in ("recall_text@5", "recall_text@1"):
+            p, c = prev.get(k), cur.get(k)
+            if isinstance(p, int | float) and isinstance(c, int | float):
+                delta = c - p
+                arrow = "▲" if delta > 0 else ("▼" if delta < 0 else "—")
+                print(f"  {k}: {p:.4f} → {c:.4f}  {arrow} {delta:+.4f}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/enrich_failures.py b/scripts/enrich_failures.py
new file mode 100755
index 0000000..a9e4452
--- /dev/null
+++ b/scripts/enrich_failures.py
@@ -0,0 +1,348 @@
+#!/usr/bin/env python3
+"""Enrich a benchmark run with per-instance diagnostics for failure clustering.
+
+`eval_retrieval.py --emit-failures` only sees *text-match* misses. For
+citation-grade scoring (geometric recall@5 = 0.283) we also need to
+classify instances where the answer text is in some chunk but no chunk's
+A1 range covers the ground truth — those don't show up in failures.ndjson
+at all.
+
+This script re-parses each instance's input.xlsx with both ks-xlsx-parser
+and openpyxl, then emits one row per FAILED instance (text-miss OR
+geometric-miss) with diagnostic columns chosen so post-hoc clustering is
+easy:
+
+    instance_id           question id
+    bucket_combined       both_miss / text_hit_geom_miss / text_miss_geom_hit
+    answer_position       the GT spec from dataset.json
+    gt_sheet              ground-truth sheet name (default: answer_sheet)
+    gt_cell_raw           openpyxl raw value at the first GT cell
+    gt_cell_formula       formula string if any
+    gt_cell_data_only     cached computed value if any
+    gt_in_workbook_sheets is gt_sheet in wb.sheetnames?
+    gt_in_chunked_sheets  did the parser produce any chunk on gt_sheet?
+    n_workbook_sheets     total sheets in the workbook (incl hidden)
+    n_chunked_sheets      distinct sheets we emitted chunks for
+    n_workbook_cells_in_gt   non-empty openpyxl cells in the GT range
+    chunks_on_gt_sheet    how many chunks we emitted for gt_sheet
+    chunk_bbox_on_gt_sheet bbox (min_r,min_c,max_r,max_c) over all chunks on gt_sheet
+    gt_range_bbox         GT range as (r0,c0,r1,c1)
+    text_match_rank       rank of first chunk whose text contains a GT value
+    geom_match_rank       rank of first chunk whose A1 range overlaps GT
+
+Usage:
+    python scripts/enrich_failures.py <run-dir-or-results.ndjson>
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+from typing import Any
+
+# Hoist eval_retrieval helpers so we share the *exact* normalization
+# logic — clustering by "answer present in chunk" must be apples-to-apples.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from scripts.eval_retrieval import (  # noqa: E402
+    _matches_chunk_text,
+    parse_a1,
+    parse_position_spec,
+    parse_range,
+)
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+
+
+def find_run(arg: Path) -> Path:
+    if arg.is_file():
+        return arg.parent
+    if arg.is_dir():
+        if (arg / "results.ndjson").exists():
+            return arg
+        runs = sorted(p for p in arg.glob("*/results.ndjson"))
+        if runs:
+            return runs[-1].parent
+    sys.exit(f"ERROR: no results.ndjson at {arg}")
+
+
+def overlaps(box1: tuple[int, int, int, int], box2: tuple[int, int, int, int]) -> bool:
+    r0a, c0a, r1a, c1a = box1
+    r0b, c0b, r1b, c1b = box2
+    return not (r1a < r0b or r0a > r1b or c1a < c0b or c0a > c1b)
+
+
+def chunk_bbox(chunks) -> tuple[int, int, int, int] | None:
+    boxes = []
+    for c in chunks:
+        tl = parse_a1(c.top_left_cell) if c.top_left_cell else None
+        br = parse_a1(c.bottom_right_cell) if c.bottom_right_cell else None
+        if tl and br:
+            boxes.append((tl[0], tl[1], br[0], br[1]))
+    if not boxes:
+        return None
+    return (
+        min(b[0] for b in boxes),
+        min(b[1] for b in boxes),
+        max(b[2] for b in boxes),
+        max(b[3] for b in boxes),
+    )
+
+
+def enrich(run_dir: Path, corpus: Path, out_path: Path) -> None:
+    from openpyxl import load_workbook
+
+    from ks_xlsx_parser.pipeline import parse_workbook
+
+    # Load dataset.json once — we need question text + the original
+    # answer_sheet attribution for instances where the rank scoring
+    # already considers the spec parse-resolved.
+    dataset = {str(d["id"]): d for d in
+               json.loads((corpus / "dataset.json").read_text())}
+
+    results_path = run_dir / "results.ndjson"
+    rows: list[dict[str, Any]] = []
+    for line in results_path.read_text().splitlines():
+        if line.strip():
+            rows.append(json.loads(line))
+
+    out_rows: list[dict[str, Any]] = []
+    n_failed = 0
+    for rec in rows:
+        if rec.get("error"):
+            continue
+        text_rank = rec.get("rank_of_text_match")
+        geom_rank = rec.get("rank_of_first_overlap")
+        text_hit = text_rank is not None and text_rank <= 5
+        geom_hit = geom_rank is not None and geom_rank <= 5
+        if text_hit and geom_hit:
+            continue  # not a failure either way
+        n_failed += 1
+
+        inst_id = rec["instance_id"]
+        meta = dataset.get(inst_id, {})
+        instruction = meta.get("instruction", "")
+        answer_sheet = meta.get("answer_sheet") or None
+        if answer_sheet and "," in answer_sheet:
+            answer_sheet = answer_sheet.split(",")[0].strip()
+        answer_position = rec.get("answer_position") or meta.get("answer_position") or ""
+        data_position = meta.get("data_position") or ""
+
+        # Re-parse input.xlsx — cheap (50 ms median).
+        inst_dir = corpus / "spreadsheet" / inst_id
+        input_path = inst_dir / f"1_{inst_id}_input.xlsx"
+        if not input_path.exists():
+            continue
+
+        try:
+            result = parse_workbook(path=str(input_path))
+            chunks = list(result.chunks)
+        except Exception as exc:
+            out_rows.append({
+                "instance_id": inst_id,
+                "bucket_combined": "parse_error",
+                "error": f"{type(exc).__name__}: {exc}",
+            })
+            continue
+
+        chunked_sheets = sorted({c.sheet_name for c in chunks if c.sheet_name})
+
+        # openpyxl view — both formula and data_only passes so we can
+        # tell "formula uncached" from "cell genuinely empty".
+        try:
+            wb_f = load_workbook(str(input_path), data_only=False, read_only=False)
+            wb_d = load_workbook(str(input_path), data_only=True, read_only=False)
+            wb_sheets = list(wb_f.sheetnames)
+            hidden_sheets = [s for s in wb_sheets
+                             if getattr(wb_f[s], "sheet_state", "visible") != "visible"]
+        except Exception as exc:
+            wb_sheets = []
+            hidden_sheets = []
+            wb_f = wb_d = None
+
+        # Geometric overlap is scored against data_position (the input data
+        # region the question is asking about); falls back to answer_position
+        # when the dataset didn't fill data_position in (561 of 912 instances).
+        # This mirrors eval_retrieval.py's geom_spec = data_pos or answer_pos.
+        geom_spec = data_position or answer_position
+        regions = parse_position_spec(geom_spec, answer_sheet)
+        gt_sheet = regions[0][0] if regions else answer_sheet
+        gt_range_bbox = regions[0][1] if regions else None
+
+        # Separately track whether the *answer* region is empty in input.xlsx.
+        # If so, the question is "compute X and write here" — the parser
+        # cannot possibly contain the answer text, so this is a benchmark
+        # construct, not a parser bug. We flag it as instruction_requires_execution.
+        answer_regions = parse_position_spec(answer_position, answer_sheet)
+        answer_sheet_resolved = (answer_regions[0][0]
+                                 if answer_regions else answer_sheet)
+        answer_range_bbox = answer_regions[0][1] if answer_regions else None
+        n_input_cells_in_answer_range = 0
+        if (wb_d and answer_sheet_resolved and answer_range_bbox
+                and answer_sheet_resolved in wb_d.sheetnames):
+            try:
+                ws = wb_d[answer_sheet_resolved]
+                r0, c0, r1, c1 = answer_range_bbox
+                for row in ws.iter_rows(min_row=r0, max_row=r1, min_col=c0,
+                                        max_col=c1, values_only=True):
+                    for v in row:
+                        if v is not None and str(v).strip():
+                            n_input_cells_in_answer_range += 1
+            except Exception:
+                pass
+
+        gt_cell_raw = None
+        gt_cell_formula = None
+        gt_cell_data_only = None
+        n_workbook_cells_in_gt = 0
+        if wb_f and gt_sheet and gt_sheet in wb_f.sheetnames and gt_range_bbox:
+            ws_f = wb_f[gt_sheet]
+            ws_d = wb_d[gt_sheet]
+            r0, c0, r1, c1 = gt_range_bbox
+            # First cell only — enough to know "formula vs. value".
+            try:
+                tl_cell_f = ws_f.cell(row=r0, column=c0)
+                tl_cell_d = ws_d.cell(row=r0, column=c0)
+                gt_cell_raw = tl_cell_f.value
+                if isinstance(gt_cell_raw, str) and gt_cell_raw.startswith("="):
+                    gt_cell_formula = gt_cell_raw
+                gt_cell_data_only = tl_cell_d.value
+            except Exception:
+                pass
+            # Count non-empty cells across the range
+            try:
+                for row in ws_d.iter_rows(min_row=r0, max_row=r1, min_col=c0,
+                                          max_col=c1, values_only=True):
+                    for v in row:
+                        if v is not None and str(v).strip():
+                            n_workbook_cells_in_gt += 1
+            except Exception:
+                pass
+
+        chunks_on_gt = [c for c in chunks if gt_sheet and c.sheet_name == gt_sheet]
+        gt_chunk_bbox = chunk_bbox(chunks_on_gt)
+
+        if not text_hit and not geom_hit:
+            bucket = "both_miss"
+        elif text_hit and not geom_hit:
+            bucket = "text_hit_geom_miss"
+        elif geom_hit and not text_hit:
+            bucket = "text_miss_geom_hit"
+        else:
+            bucket = "other"
+
+        sheet_chunked = (gt_sheet in chunked_sheets) if gt_sheet else None
+        sheet_in_wb = (gt_sheet in wb_sheets) if (gt_sheet and wb_sheets) else None
+        sheet_hidden = (gt_sheet in hidden_sheets) if gt_sheet else False
+
+        # Pre-named signal heuristics — informational only; clustering is
+        # still done by reading. These flags help spot patterns FAST.
+        flags: list[str] = []
+        if sheet_in_wb is False:
+            flags.append("gt_sheet_missing_from_workbook")
+        elif sheet_chunked is False:
+            flags.append("gt_sheet_present_but_not_chunked")
+        if sheet_hidden:
+            flags.append("gt_sheet_hidden")
+        if gt_cell_formula and gt_cell_data_only in (None, ""):
+            flags.append("gt_cell_uncached_formula")
+        elif gt_cell_formula:
+            flags.append("gt_cell_is_formula")
+        if (gt_range_bbox and gt_chunk_bbox and
+                not overlaps(gt_range_bbox, gt_chunk_bbox)):
+            flags.append("gt_range_outside_chunk_bbox")
+        if (gt_range_bbox and gt_chunk_bbox and
+                overlaps(gt_range_bbox, gt_chunk_bbox)):
+            # Inside the chunk bbox but no individual chunk overlaps?
+            # That's "the parser saw the right area but split it wrong".
+            any_overlap = False
+            for c in chunks_on_gt:
+                tl = parse_a1(c.top_left_cell) if c.top_left_cell else None
+                br = parse_a1(c.bottom_right_cell) if c.bottom_right_cell else None
+                if tl and br and overlaps(
+                        gt_range_bbox, (tl[0], tl[1], br[0], br[1])):
+                    any_overlap = True
+                    break
+            if not any_overlap:
+                flags.append("gt_inside_bbox_but_no_chunk_overlap")
+        if (n_workbook_cells_in_gt == 0 and gt_range_bbox):
+            flags.append("gt_range_empty_in_workbook")
+        # The big one: if answer_position is empty in input, the benchmark
+        # is asking the system to WRITE the answer. Not a parser bug.
+        if (answer_range_bbox and n_input_cells_in_answer_range == 0):
+            flags.append("instruction_requires_execution")
+        # cell rendered but truncated to a sub-range?
+        if (gt_chunk_bbox and gt_range_bbox and
+                gt_chunk_bbox[2] < gt_range_bbox[2]):
+            flags.append("chunk_bbox_rows_truncated")
+        if (gt_chunk_bbox and gt_range_bbox and
+                gt_chunk_bbox[3] < gt_range_bbox[3]):
+            flags.append("chunk_bbox_cols_truncated")
+
+        out_rows.append({
+            "instance_id": inst_id,
+            "bucket_combined": bucket,
+            "instruction": instruction[:200],
+            "answer_position": answer_position,
+            "answer_sheet": answer_sheet,
+            "gt_sheet": gt_sheet,
+            "gt_range_bbox": list(gt_range_bbox) if gt_range_bbox else None,
+            "gt_cell_raw": str(gt_cell_raw)[:120] if gt_cell_raw is not None else None,
+            "gt_cell_formula": gt_cell_formula,
+            "gt_cell_data_only":
+                str(gt_cell_data_only)[:120] if gt_cell_data_only is not None else None,
+            "n_workbook_sheets": len(wb_sheets),
+            "n_chunked_sheets": len(chunked_sheets),
+            "wb_sheets": wb_sheets,
+            "hidden_sheets": hidden_sheets,
+            "chunked_sheets": chunked_sheets,
+            "n_chunks_total": len(chunks),
+            "n_chunks_on_gt_sheet": len(chunks_on_gt),
+            "n_workbook_cells_in_gt": n_workbook_cells_in_gt,
+            "chunk_bbox_on_gt_sheet": list(gt_chunk_bbox) if gt_chunk_bbox else None,
+            "rank_of_text_match": text_rank,
+            "rank_of_first_overlap": geom_rank,
+            "flags": flags,
+            "data_position": data_position,
+            "answer_range_bbox": list(answer_range_bbox) if answer_range_bbox else None,
+            "n_input_cells_in_answer_range": n_input_cells_in_answer_range,
+        })
+
+    out_path.write_text("\n".join(json.dumps(r, separators=(",", ":")) for r in out_rows) + "\n")
+    print(f"Examined {len(rows)} instances, {n_failed} failed (text OR geom).")
+    print(f"Wrote {len(out_rows)} enriched rows to {out_path}")
+
+    # Quick histogram for sanity
+    from collections import Counter
+    bc = Counter(r["bucket_combined"] for r in out_rows)
+    fc = Counter()
+    for r in out_rows:
+        for f in r.get("flags", []):
+            fc[f] += 1
+    print("\nCombined bucket counts:")
+    for b, n in bc.most_common():
+        print(f"  {b:<30s} {n}")
+    print("\nDiagnostic flags (rows can have multiple):")
+    for f, n in fc.most_common():
+        print(f"  {f:<40s} {n}")
+
+
+def main(argv: list[str] | None = None) -> int:
+    ap = argparse.ArgumentParser(description=__doc__,
+                                 formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("path", type=Path,
+                    help="run dir, results.ndjson, or parent reports dir")
+    ap.add_argument("--corpus", type=Path,
+                    default=REPO_ROOT / "data/corpora/spreadsheetbench/all_data_912_v0.1")
+    ap.add_argument("--out", type=Path, default=None,
+                    help="output path (default: <run>/enriched_failures.ndjson)")
+    args = ap.parse_args(argv)
+
+    run_dir = find_run(args.path)
+    out = args.out or (run_dir / "enriched_failures.ndjson")
+    enrich(run_dir, args.corpus, out)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/eval_retrieval.py b/scripts/eval_retrieval.py
index 64c1743..3d4cdff 100644
--- a/scripts/eval_retrieval.py
+++ b/scripts/eval_retrieval.py
@@ -34,13 +34,16 @@
 import signal
 import sys
 import time
+from collections.abc import Iterable
 from dataclasses import dataclass, field
 from pathlib import Path
-from typing import Any, Iterable
+from typing import Any
 
 REPO_ROOT = Path(__file__).resolve().parent.parent
+# Keep ``import scripts.X`` style imports working when invoked as
+# ``python scripts/eval_retrieval.py``. We no longer need ``src`` on the
+# path — ks_xlsx_parser is a properly-installed package now.
 sys.path.insert(0, str(REPO_ROOT))
-sys.path.insert(0, str(REPO_ROOT / "src"))
 
 
 def _normalize_value_for_match(s: str) -> set[str]:
@@ -254,7 +257,7 @@ def parse_position_spec(
 
 
 def extract_chunks_ks(path: Path) -> list[Chunk]:
-    from pipeline import parse_workbook
+    from ks_xlsx_parser.pipeline import parse_workbook
 
     result = parse_workbook(path=str(path))
     out: list[Chunk] = []
@@ -442,6 +445,62 @@ class InstanceResult:
     extra: dict[str, Any] = field(default_factory=dict)
 
 
+# ────────────────────────────────────────────────────────── failure triage
+#
+# recall@5 sitting at ~0.70 means ~30% of questions miss. A single recall
+# number can't tell you WHY — and the fix is completely different per cause:
+#
+#   * answer_absent_from_chunks → the answer value is in NO chunk at all.
+#     The parser dropped/garbled the cell. Fix the EXTRACTION pipeline.
+#   * present_but_ranked_low    → a chunk DOES contain the answer, but the
+#     embedding model ranked it past k. Fix CHUNKING (smaller/cleaner
+#     chunks rank better) or the embedding step — NOT the parser.
+#   * wrong_sheet               → answer is on a sheet we produced no chunk
+#     for, or mis-attributed the sheet. Fix sheet enumeration.
+#   * geometric_no_overlap      → no chunk's A1 range overlaps ground truth
+#     even though text may match. Fix RANGE bookkeeping (top_left /
+#     bottom_right anchors drift during merge/split).
+#   * no_chunks / parse_error   → upstream parser failure.
+#
+# Splitting the miss population into these buckets turns "recall is low"
+# into a ranked, actionable worklist. This is the single most useful thing
+# for an agent iterating on recall — see scripts/triage_recall.py.
+
+FAILURE_BUCKETS = (
+    "answer_absent_from_chunks",
+    "present_but_ranked_low",
+    "wrong_sheet",
+    "geometric_no_overlap",
+    "no_chunks",
+    "parse_error",
+    "unparseable_ground_truth",
+)
+
+
+def classify_text_failure(rec: dict[str, Any]) -> str | None:
+    """Bucket a single result record for the TEXT-match recall@5 metric.
+
+    Returns None if the instance was a recall@5 hit (rank ≤ 5) or was not
+    scoreable (no ground-truth values to match against).
+    """
+    if rec.get("error"):
+        return "no_chunks" if rec.get("n_chunks", 0) == 0 else "parse_error"
+    if not rec.get("had_answer_values", True):
+        return None  # not scoreable for text-match — exclude, don't penalise
+    if rec.get("n_chunks", 0) == 0:
+        return "no_chunks"
+    rank = rec.get("rank_of_text_match")
+    if rank is not None and rank <= 5:
+        return None  # hit
+    if rank is None:
+        # Not in any chunk. Distinguish "value never extracted" from
+        # "value extracted but on a sheet we never chunked".
+        if rec.get("answer_on_unchunked_sheet"):
+            return "wrong_sheet"
+        return "answer_absent_from_chunks"
+    return "present_but_ranked_low"  # rank > 5
+
+
 def score_instance(
     *,
     parser_name: str,
@@ -452,8 +511,10 @@ def score_instance(
     answer_position: str,
     default_sheet: str | None,
     answer_cell_values: list[str],
+    answer_regions: list[tuple[str | None, tuple[int, int, int, int]]] | None = None,
     model,
     per_parser_timeout_s: float = 60.0,
+    emit_failures: bool = False,
 ) -> InstanceResult:
     import numpy as np
 
@@ -547,6 +608,38 @@ def score_instance(
                 rank_text = r
                 break
 
+    # Did the parser produce ANY chunk on the sheet(s) the answer lives on?
+    # If not, a text miss is a sheet-enumeration bug, not a cell-drop bug.
+    answer_on_unchunked_sheet = False
+    if answer_regions:
+        gt_sheets = {s for s, _ in answer_regions if s}
+        chunk_sheets = {c.sheet for c in chunks if c.sheet}
+        if gt_sheets and chunk_sheets and not (gt_sheets & chunk_sheets):
+            answer_on_unchunked_sheet = True
+
+    extra: dict[str, Any] = {
+        "answer_on_unchunked_sheet": answer_on_unchunked_sheet,
+        "had_answer_values": bool(answer_cell_values),
+    }
+    if emit_failures:
+        # Dump the top-8 ranked chunks so a human/agent can eyeball WHY the
+        # answer was missed without re-running the (expensive) benchmark.
+        top: list[dict[str, Any]] = []
+        for r, idx in enumerate(ranking[:8], start=1):
+            c = chunks[idx]
+            top.append({
+                "rank": r,
+                "sheet": c.sheet,
+                "range": (
+                    f"{c.top_left}:{c.bottom_right}"
+                    if c.top_left else None
+                ),
+                "contains_answer": _matches_chunk_text(answer_cell_values, c.text or ""),
+                "text": (c.text or "")[:280],
+            })
+        extra["top_chunks"] = top
+        extra["answer_values"] = answer_cell_values[:20]
+
     return InstanceResult(
         instance_id=inst_id,
         parser=parser_name,
@@ -558,6 +651,7 @@ def score_instance(
         chunks_overlapping_data=len(overlap_idxs),
         rank_of_first_overlap=rank_overlap,
         rank_of_text_match=rank_text,
+        extra=extra,
     )
 
 
@@ -650,6 +744,21 @@ def _recall_at(k: int, key: str) -> float:
 
         parse_times = [r.parse_ms for r in recs if not r.error]
 
+        # Failure-bucket histogram for the text-match recall@5 metric.
+        # Only counts instances that HAD ground-truth values to match
+        # against (others aren't scoreable). See classify_text_failure.
+        buckets = dict.fromkeys(FAILURE_BUCKETS, 0)
+        for r in recs:
+            bucket = classify_text_failure({
+                "error": r.error,
+                "n_chunks": r.n_chunks,
+                "rank_of_text_match": r.rank_of_text_match,
+                "had_answer_values": r.extra.get("had_answer_values", True),
+                "answer_on_unchunked_sheet": r.extra.get("answer_on_unchunked_sheet"),
+            })
+            if bucket is not None:
+                buckets[bucket] += 1
+
         summary[parser] = {
             "instances": total,
             "ok": ok,
@@ -667,6 +776,9 @@ def _recall_at(k: int, key: str) -> float:
             if parse_times else None,
             "p50_parse_ms": round(sorted(parse_times)[len(parse_times) // 2], 2)
             if parse_times else None,
+            # Why the text-match recall@5 misses happen, bucketed. The
+            # biggest bucket is the highest-leverage thing to fix next.
+            "failure_buckets": buckets,
         }
 
     return summary
@@ -701,6 +813,10 @@ def main(argv: list[str] | None = None) -> int:
     parser.add_argument("--test-case", type=int, default=1,
                         help="Which of the (typically 3) test cases per instance "
                              "to score on. We use one to keep eval costs bounded.")
+    parser.add_argument("--emit-failures", action="store_true",
+                        help="Also write failures.ndjson — one row per "
+                             "recall@5 miss with the top-8 ranked chunks and "
+                             "ground-truth values, for failure triage.")
     parser.add_argument("--per-parser-timeout", type=float, default=60.0,
                         help="Wall-clock seconds before a parser is "
                              "considered hung on a single file (docling can "
@@ -736,6 +852,7 @@ def main(argv: list[str] | None = None) -> int:
     ndjson_path = out_dir / "results.ndjson"
 
     results: list[InstanceResult] = []
+    failure_rows: list[dict[str, Any]] = []
     n = len(instances) * len(parser_fns)
     done = 0
 
@@ -786,10 +903,20 @@ def main(argv: list[str] | None = None) -> int:
                     answer_position=answer_pos,
                     default_sheet=default_sheet,
                     answer_cell_values=answer_values,
+                    answer_regions=answer_regions,
                     model=model,
                     per_parser_timeout_s=args.per_parser_timeout,
+                    emit_failures=args.emit_failures,
                 )
                 results.append(res)
+                bucket = classify_text_failure({
+                    "error": res.error,
+                    "n_chunks": res.n_chunks,
+                    "rank_of_text_match": res.rank_of_text_match,
+                    "had_answer_values": res.extra.get("had_answer_values", True),
+                    "answer_on_unchunked_sheet": res.extra.get(
+                        "answer_on_unchunked_sheet"),
+                })
                 f.write(json.dumps({
                     "instance_id": res.instance_id,
                     "parser": res.parser,
@@ -801,8 +928,22 @@ def main(argv: list[str] | None = None) -> int:
                     "chunks_overlapping_data": res.chunks_overlapping_data,
                     "rank_of_first_overlap": res.rank_of_first_overlap,
                     "rank_of_text_match": res.rank_of_text_match,
+                    "failure_bucket": bucket,
                     "error": res.error,
                 }, separators=(",", ":")) + "\n")
+                if args.emit_failures and bucket is not None:
+                    failure_rows.append({
+                        "instance_id": res.instance_id,
+                        "parser": res.parser,
+                        "failure_bucket": bucket,
+                        "instruction": instr,
+                        "answer_position": answer_pos,
+                        "answer_values": res.extra.get("answer_values", []),
+                        "rank_of_text_match": res.rank_of_text_match,
+                        "n_chunks": res.n_chunks,
+                        "top_chunks": res.extra.get("top_chunks", []),
+                        "error": res.error,
+                    })
                 done += 1
                 if done % 10 == 0:
                     sys.stderr.write(f"\r[{done}/{n}] ")
@@ -810,6 +951,13 @@ def main(argv: list[str] | None = None) -> int:
 
     sys.stderr.write(f"\nWrote {ndjson_path}\n")
 
+    if args.emit_failures:
+        fail_path = out_dir / "failures.ndjson"
+        with fail_path.open("w") as ff:
+            for row in failure_rows:
+                ff.write(json.dumps(row, separators=(",", ":")) + "\n")
+        sys.stderr.write(f"Wrote {fail_path} ({len(failure_rows)} failure rows)\n")
+
     summary = aggregate(results)
     summary_path = out_dir / "summary.json"
     summary_path.write_text(json.dumps(summary, indent=2))
@@ -853,6 +1001,20 @@ def main(argv: list[str] | None = None) -> int:
                     "(sheet, range) per chunk — docling does not, so its geometric "
                     "recall is structurally 0.")
     md_lines.append("")
+    md_lines.append("## Failure buckets (text-match recall@5 misses)")
+    md_lines.append("")
+    md_lines.append("Why each miss happened — the biggest bucket is the highest-")
+    md_lines.append("leverage fix. `answer_absent_from_chunks` → fix extraction; ")
+    md_lines.append("`present_but_ranked_low` → fix chunking/embedding.")
+    md_lines.append("")
+    md_lines.append("| Bucket | " + " | ".join(parsers) + " |")
+    md_lines.append("|---|" + "|".join(["---"] * len(parsers)) + "|")
+    for b in FAILURE_BUCKETS:
+        row = [b]
+        for p in parsers:
+            row.append(str(summary[p].get("failure_buckets", {}).get(b, 0)))
+        md_lines.append("| " + " | ".join(row) + " |")
+    md_lines.append("")
     md_lines.append("**Text-match** = the answer cell's actual string value appears "
                     "as a substring of the chunk's text. Parser-agnostic; this is "
                     "the apples-to-apples retrieval comparison.")
diff --git a/scripts/run_bench.sh b/scripts/run_bench.sh
new file mode 100755
index 0000000..05a77fb
--- /dev/null
+++ b/scripts/run_bench.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Entrypoint for the benchmark Docker image (Dockerfile.bench).
+#
+# Ensures the SpreadsheetBench corpus is present, runs the retrieval-recall
+# benchmark for ks-xlsx-parser, appends the result to history.jsonl, and
+# prints a failure-bucket triage so accuracy can be tracked over time.
+#
+# Env vars:
+#   BENCH_SAMPLE   parse only N random instances (0 / unset = full 912)
+#   BENCH_PARSERS  comma list passed to eval_retrieval.py (default: ks)
+#   BENCH_TIMEOUT  per-file parser timeout in seconds (default: 120)
+set -euo pipefail
+
+cd "$(dirname "$0")/.."
+
+CORPUS_DIR="data/corpora/spreadsheetbench"
+SAMPLE="${BENCH_SAMPLE:-0}"
+PARSERS="${BENCH_PARSERS:-ks}"
+TIMEOUT="${BENCH_TIMEOUT:-120}"
+
+if [ ! -d "$CORPUS_DIR" ]; then
+  echo "→ Downloading SpreadsheetBench corpus ..."
+  mkdir -p "$CORPUS_DIR"
+  curl -L --fail --retry 3 --connect-timeout 20 \
+    -o /tmp/sb.tar.gz \
+    "https://raw.githubusercontent.com/RUCKBReasoning/SpreadsheetBench/main/data/spreadsheetbench_912_v0.1.tar.gz"
+  tar -xzf /tmp/sb.tar.gz -C "$CORPUS_DIR"
+  rm -f /tmp/sb.tar.gz
+fi
+
+# eval_retrieval.py expects the dataset.json + spreadsheet/ dir. Find it.
+CORPUS_ARG="$CORPUS_DIR"
+if [ -d "$CORPUS_DIR/all_data_912_v0.1" ]; then
+  CORPUS_ARG="$CORPUS_DIR/all_data_912_v0.1"
+fi
+
+SAMPLE_ARG=()
+if [ "$SAMPLE" != "0" ]; then
+  SAMPLE_ARG=(--sample "$SAMPLE")
+  echo "→ Sampling $SAMPLE instances"
+fi
+
+echo "→ Running retrieval benchmark (parsers=$PARSERS) ..."
+python scripts/eval_retrieval.py \
+  --corpus "$CORPUS_ARG" \
+  --parsers "$PARSERS" \
+  --emit-failures \
+  --per-parser-timeout "$TIMEOUT" \
+  --out tests/benchmarks/reports/retrieval \
+  "${SAMPLE_ARG[@]}"
+
+echo "→ Appending to history.jsonl ..."
+python scripts/append_bench_history.py
+
+echo
+python scripts/triage_recall.py tests/benchmarks/reports/retrieval
diff --git a/scripts/run_enterprise_metrics.py b/scripts/run_enterprise_metrics.py
index 1d6821a..6366fbd 100644
--- a/scripts/run_enterprise_metrics.py
+++ b/scripts/run_enterprise_metrics.py
@@ -7,9 +7,8 @@
 
 import json
 
-from xlsx_parser import parse_workbook
-
 from scripts.generate_enterprise_fixtures import generate_all
+from xlsx_parser import parse_workbook
 
 
 class EnterpriseScorecard:
diff --git a/scripts/track_corpus_metrics.py b/scripts/track_corpus_metrics.py
index 779995b..a0cb244 100644
--- a/scripts/track_corpus_metrics.py
+++ b/scripts/track_corpus_metrics.py
@@ -6,7 +6,6 @@
 from datetime import datetime
 from pathlib import Path
 
-
 ROOT = Path(__file__).resolve().parent.parent
 METRICS_DIR = ROOT / "metrics" / "corpus"
 
diff --git a/scripts/triage_recall.py b/scripts/triage_recall.py
new file mode 100755
index 0000000..41a766a
--- /dev/null
+++ b/scripts/triage_recall.py
@@ -0,0 +1,114 @@
+#!/usr/bin/env python3
+"""Triage retrieval-recall failures into a ranked, actionable worklist.
+
+Reads a ``failures.ndjson`` produced by
+``eval_retrieval.py --emit-failures`` and prints:
+
+  1. A histogram of failure buckets (biggest = highest leverage).
+  2. For each bucket, a few concrete example failures with the
+     ground-truth answer and the top ranked chunks the parser produced.
+
+The point: turn "recall@5 is 0.70" into "N misses are answer_absent_from_chunks
+— the parser is dropping cells; here are 5 examples to reproduce."
+
+Usage:
+    python scripts/triage_recall.py tests/benchmarks/reports/retrieval/<stamp>/failures.ndjson
+    python scripts/triage_recall.py <dir>           # finds the latest run
+    python scripts/triage_recall.py <file> --bucket answer_absent_from_chunks --examples 10
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from collections import Counter
+from pathlib import Path
+
+# Ordered worst→least-actionable. Used to sort the worklist.
+BUCKET_GUIDANCE = {
+    "parse_error": "Parser raised on the file. Reproduce with parse_workbook(path) and fix the crash.",
+    "no_chunks": "Parser produced zero chunks. Sheet/region detection collapsed — check chunking/segmenter.py.",
+    "answer_absent_from_chunks": "Answer value is in NO chunk. EXTRACTION gap — the cell was dropped or garbled. Highest leverage.",
+    "wrong_sheet": "Answer sheet was never chunked. Sheet enumeration bug — check workbook_parser.py sheet loop.",
+    "geometric_no_overlap": "No chunk's A1 range overlaps ground truth. RANGE bookkeeping drift in merge/split.",
+    "present_but_ranked_low": "A chunk DOES contain the answer but ranked >5. Not a parser bug — fix chunk granularity/embedding.",
+    "unparseable_ground_truth": "Could not parse the dataset's answer_position. Benchmark-harness issue, not the parser.",
+}
+
+
+def find_failures_file(arg: Path) -> Path:
+    if arg.is_file():
+        return arg
+    if arg.is_dir():
+        direct = arg / "failures.ndjson"
+        if direct.exists():
+            return direct
+        runs = sorted(p for p in arg.glob("*/failures.ndjson"))
+        if runs:
+            return runs[-1]
+    sys.exit(f"ERROR: no failures.ndjson found at {arg}")
+
+
+def load(path: Path) -> list[dict]:
+    rows = []
+    for line in path.read_text().splitlines():
+        line = line.strip()
+        if line:
+            rows.append(json.loads(line))
+    return rows
+
+
+def main(argv: list[str] | None = None) -> int:
+    ap = argparse.ArgumentParser(description=__doc__,
+                                 formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("path", type=Path,
+                    help="failures.ndjson, or a dir containing/parenting one")
+    ap.add_argument("--bucket", help="only show examples for this bucket")
+    ap.add_argument("--examples", type=int, default=3,
+                    help="example failures to print per bucket (default 3)")
+    args = ap.parse_args(argv)
+
+    path = find_failures_file(args.path)
+    rows = load(path)
+    if not rows:
+        print(f"{path} is empty — no failures recorded (or run was a no-op).")
+        return 0
+
+    print(f"# Recall failure triage — {path}")
+    print(f"# {len(rows)} total failure rows\n")
+
+    hist = Counter(r.get("failure_bucket") for r in rows)
+    print("## Bucket histogram (ranked by count — fix the top one first)\n")
+    width = max(len(b or "?") for b in hist)
+    for bucket, count in hist.most_common():
+        pct = 100.0 * count / len(rows)
+        print(f"  {str(bucket):<{width}}  {count:5d}  ({pct:5.1f}%)")
+        print(f"  {'':<{width}}         → {BUCKET_GUIDANCE.get(bucket, '')}")
+    print()
+
+    buckets = [args.bucket] if args.bucket else [b for b, _ in hist.most_common()]
+    for bucket in buckets:
+        examples = [r for r in rows if r.get("failure_bucket") == bucket][:args.examples]
+        if not examples:
+            continue
+        print(f"## Examples — {bucket}\n")
+        for r in examples:
+            print(f"  instance {r.get('instance_id')}  ({r.get('parser')})")
+            print(f"    Q: {(r.get('instruction') or '')[:160]}")
+            print(f"    answer_position: {r.get('answer_position')}")
+            print(f"    ground-truth values: {r.get('answer_values')}")
+            print(f"    n_chunks={r.get('n_chunks')} "
+                  f"rank_of_text_match={r.get('rank_of_text_match')}")
+            if r.get("error"):
+                print(f"    ERROR: {r['error']}")
+            for c in (r.get("top_chunks") or [])[:4]:
+                mark = "✓" if c.get("contains_answer") else " "
+                snippet = (c.get("text") or "").replace("\n", " ")[:120]
+                print(f"    [{mark}] #{c.get('rank')} {c.get('sheet')} "
+                      f"{c.get('range')}  {snippet}")
+            print()
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/scripts/verify_wheel.py b/scripts/verify_wheel.py
new file mode 100755
index 0000000..26801f3
--- /dev/null
+++ b/scripts/verify_wheel.py
@@ -0,0 +1,81 @@
+#!/usr/bin/env python3
+"""Verify the built wheel is installable and importable in a clean venv.
+
+This is the regression guard for the v0.2.0 packaging bug: ``pipeline.py``
+and ``api.py`` were top-level modules under ``src/`` and ``setuptools``
+``packages.find`` only picks up *packages*, so they were silently dropped
+from the wheel — ``from ks_xlsx_parser.pipeline import ...`` failed for
+every installed user. The flat layout also leaked 13 generic top-level
+packages (``models``, ``utils``, ``parsers`` ...) into ``site-packages``.
+
+Run after ``python -m build --wheel``. Exits non-zero on any problem so it
+can gate CI and ``make wheel-check``.
+"""
+from __future__ import annotations
+
+import subprocess
+import sys
+import tempfile
+import venv
+import zipfile
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+
+# Imports a real downstream consumer relies on. Keep in sync with the
+# public surface in ks_xlsx_parser/__init__.py.
+SMOKE_IMPORTS = [
+    "from ks_xlsx_parser import parse_workbook, ParseResult",
+    "from ks_xlsx_parser.pipeline import parse_workbook",
+    "from ks_xlsx_parser.verification import StageVerifier",
+    "from ks_xlsx_parser.analysis.table_assembler import TableAssembler",
+    "from ks_xlsx_parser.models.workbook import WorkbookDTO",
+]
+
+
+def find_wheel() -> Path:
+    wheels = sorted((ROOT / "dist").glob("*.whl"))
+    if not wheels:
+        sys.exit("ERROR: no wheel in dist/ — run `python -m build --wheel` first")
+    return wheels[-1]
+
+
+def check_wheel_contents(wheel: Path) -> None:
+    """Fail loudly if the wheel pollutes the global namespace or drops modules."""
+    with zipfile.ZipFile(wheel) as zf:
+        names = zf.namelist()
+        top_level = next((n for n in names if n.endswith("top_level.txt")), None)
+        if top_level:
+            packages = zf.read(top_level).decode().split()
+            if packages != ["ks_xlsx_parser"]:
+                sys.exit(
+                    f"ERROR: wheel exposes top-level packages {packages}; "
+                    "expected only ['ks_xlsx_parser']. The flat src/ layout leaked."
+                )
+    required = ["ks_xlsx_parser/pipeline.py", "ks_xlsx_parser/api.py"]
+    for req in required:
+        if not any(n == req for n in names):
+            sys.exit(f"ERROR: wheel is missing {req}")
+    print(f"wheel contents OK ({len(names)} entries, top-level: ks_xlsx_parser)")
+
+
+def check_install_and_import(wheel: Path) -> None:
+    with tempfile.TemporaryDirectory() as tmp:
+        env_dir = Path(tmp) / "venv"
+        venv.create(env_dir, with_pip=True)
+        py = env_dir / ("Scripts" if sys.platform == "win32" else "bin") / "python"
+        subprocess.run([str(py), "-m", "pip", "install", "-q", str(wheel)], check=True)
+        script = "; ".join(SMOKE_IMPORTS) + "; print('clean-venv import OK')"
+        subprocess.run([str(py), "-c", script], check=True)
+
+
+def main() -> None:
+    wheel = find_wheel()
+    print(f"verifying {wheel.name}")
+    check_wheel_contents(wheel)
+    check_install_and_import(wheel)
+    print("wheel verification PASSED")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/src/ks_xlsx_parser/__init__.py b/src/ks_xlsx_parser/__init__.py
index 184452b..cabcd96 100644
--- a/src/ks_xlsx_parser/__init__.py
+++ b/src/ks_xlsx_parser/__init__.py
@@ -1,25 +1,27 @@
 """
 ks_xlsx_parser — public API entry point for the ks-xlsx-parser package.
 
-The source tree is flat: top-level modules at ``src/`` (``pipeline``,
+All modules live under the ``ks_xlsx_parser`` package (``pipeline``,
 ``models``, ``analysis``, ``verification``, etc.). This module re-exports
 the stable, user-facing names so callers can do::
 
     from ks_xlsx_parser import parse_workbook, ParseResult
 
-regardless of internal layout.
+and submodules remain importable directly::
+
+    from ks_xlsx_parser.pipeline import parse_workbook
 """
 from __future__ import annotations
 
 __version__ = "0.2.0"
 
-from pipeline import (  # noqa: F401
+from .pipeline import (  # noqa: F401
     ParseResult,
     compare_workbooks,
     export_importer,
     parse_workbook,
 )
-from verification import (  # noqa: F401
+from .verification import (  # noqa: F401
     ExcellentStage,
     StageVerifier,
     VerificationReport,
diff --git a/src/analysis/__init__.py b/src/ks_xlsx_parser/analysis/__init__.py
similarity index 100%
rename from src/analysis/__init__.py
rename to src/ks_xlsx_parser/analysis/__init__.py
diff --git a/src/analysis/light_block_detector.py b/src/ks_xlsx_parser/analysis/light_block_detector.py
similarity index 96%
rename from src/analysis/light_block_detector.py
rename to src/ks_xlsx_parser/analysis/light_block_detector.py
index 9816cd0..7fc2d11 100644
--- a/src/analysis/light_block_detector.py
+++ b/src/ks_xlsx_parser/analysis/light_block_detector.py
@@ -10,9 +10,9 @@
 
 import logging
 
-from models.block import BlockDTO
-from models.common import BlockType, CellCoord, CellRange
-from models.table_structure import TableRegion, TableRegionRole, TableStructure
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange
+from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/llm_artifacts.py b/src/ks_xlsx_parser/analysis/llm_artifacts.py
similarity index 98%
rename from src/analysis/llm_artifacts.py
rename to src/ks_xlsx_parser/analysis/llm_artifacts.py
index 01d0777..2122c45 100644
--- a/src/analysis/llm_artifacts.py
+++ b/src/ks_xlsx_parser/analysis/llm_artifacts.py
@@ -11,24 +11,22 @@
 from __future__ import annotations
 
 import logging
-import re
 from collections import Counter
 
 from pydantic import Field
 
-from models.block import BlockDTO
-from models.chart import ChartDTO
-from models.common import (
-    BlockType,
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.chart import ChartDTO
+from ks_xlsx_parser.models.common import (
     CellCoord,
     SheetPurpose,
     StableModel,
     compute_hash,
 )
-from models.dependency import DependencyGraph
-from models.sheet import SheetDTO
-from models.table import TableDTO
-from models.workbook import KpiDTO, SheetSummaryDTO
+from ks_xlsx_parser.models.dependency import DependencyGraph
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table import TableDTO
+from ks_xlsx_parser.models.workbook import KpiDTO, SheetSummaryDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/pattern_splitter.py b/src/ks_xlsx_parser/analysis/pattern_splitter.py
similarity index 96%
rename from src/analysis/pattern_splitter.py
rename to src/ks_xlsx_parser/analysis/pattern_splitter.py
index 92ac82f..e065801 100644
--- a/src/analysis/pattern_splitter.py
+++ b/src/ks_xlsx_parser/analysis/pattern_splitter.py
@@ -10,12 +10,11 @@
 
 import logging
 import math
-from collections import defaultdict
 
-from models.block import BlockDTO
-from models.common import BlockType, CellAnnotation, CellCoord, CellRange
-from models.sheet import SheetDTO
-from models.table_structure import TableRegion, TableRegionRole, TableStructure
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellAnnotation, CellCoord, CellRange
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/table_assembler.py b/src/ks_xlsx_parser/analysis/table_assembler.py
similarity index 96%
rename from src/analysis/table_assembler.py
rename to src/ks_xlsx_parser/analysis/table_assembler.py
index 279603d..28bc3a3 100644
--- a/src/analysis/table_assembler.py
+++ b/src/ks_xlsx_parser/analysis/table_assembler.py
@@ -9,10 +9,10 @@
 
 import logging
 
-from models.block import BlockDTO
-from models.common import BlockType, CellCoord, CellRange, compute_hash
-from models.sheet import SheetDTO
-from models.table_structure import TableRegion, TableRegionRole, TableStructure
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table_structure import TableRegion, TableRegionRole, TableStructure
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/table_grouper.py b/src/ks_xlsx_parser/analysis/table_grouper.py
similarity index 97%
rename from src/analysis/table_grouper.py
rename to src/ks_xlsx_parser/analysis/table_grouper.py
index 94fe862..597da57 100644
--- a/src/analysis/table_grouper.py
+++ b/src/ks_xlsx_parser/analysis/table_grouper.py
@@ -11,10 +11,10 @@
 import logging
 from collections import defaultdict
 
-from models.block import BlockDTO
-from models.common import BlockType, CellCoord, CellRange, compute_hash
-from models.sheet import SheetDTO
-from models.table_structure import TableStructure
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table_structure import TableStructure
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/template_extractor.py b/src/ks_xlsx_parser/analysis/template_extractor.py
similarity index 95%
rename from src/analysis/template_extractor.py
rename to src/ks_xlsx_parser/analysis/template_extractor.py
index 27a8025..54dedd9 100644
--- a/src/analysis/template_extractor.py
+++ b/src/ks_xlsx_parser/analysis/template_extractor.py
@@ -10,15 +10,15 @@
 
 import logging
 
-from models.common import CellAnnotation, CellCoord
-from models.sheet import SheetDTO
-from models.template import (
+from ks_xlsx_parser.models.common import CellAnnotation, CellCoord
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.template import (
     DOFType,
     TemplateCellSpec,
     TemplateConstraint,
     TemplateNode,
 )
-from models.tree import TreeNode, TreeNodeType
+from ks_xlsx_parser.models.tree import TreeNode, TreeNodeType
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/analysis/tree_builder.py b/src/ks_xlsx_parser/analysis/tree_builder.py
similarity index 96%
rename from src/analysis/tree_builder.py
rename to src/ks_xlsx_parser/analysis/tree_builder.py
index 0e5192f..22a2ed1 100644
--- a/src/analysis/tree_builder.py
+++ b/src/ks_xlsx_parser/analysis/tree_builder.py
@@ -10,11 +10,10 @@
 
 import logging
 
-from models.block import BlockDTO
-from models.common import CellCoord, CellRange, compute_hash
-from models.sheet import SheetDTO
-from models.table_structure import TableStructure
-from models.tree import TreeNode, TreeNodeType
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table_structure import TableStructure
+from ks_xlsx_parser.models.tree import TreeNode, TreeNodeType
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/annotation/__init__.py b/src/ks_xlsx_parser/annotation/__init__.py
similarity index 100%
rename from src/annotation/__init__.py
rename to src/ks_xlsx_parser/annotation/__init__.py
diff --git a/src/annotation/block_splitter.py b/src/ks_xlsx_parser/annotation/block_splitter.py
similarity index 97%
rename from src/annotation/block_splitter.py
rename to src/ks_xlsx_parser/annotation/block_splitter.py
index 0ec332f..2d45209 100644
--- a/src/annotation/block_splitter.py
+++ b/src/ks_xlsx_parser/annotation/block_splitter.py
@@ -8,11 +8,10 @@
 from __future__ import annotations
 
 import logging
-from collections import defaultdict
 
-from models.block import BlockDTO
-from models.common import BlockType, CellAnnotation, CellCoord, CellRange
-from models.sheet import SheetDTO
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellAnnotation, CellCoord, CellRange
+from ks_xlsx_parser.models.sheet import SheetDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/annotation/cell_annotator.py b/src/ks_xlsx_parser/annotation/cell_annotator.py
similarity index 98%
rename from src/annotation/cell_annotator.py
rename to src/ks_xlsx_parser/annotation/cell_annotator.py
index 5f80bdc..af82bac 100644
--- a/src/annotation/cell_annotator.py
+++ b/src/ks_xlsx_parser/annotation/cell_annotator.py
@@ -11,9 +11,9 @@
 import logging
 from collections import defaultdict
 
-from models.cell import CellDTO
-from models.common import CellAnnotation
-from models.sheet import SheetDTO
+from ks_xlsx_parser.models.cell import CellDTO
+from ks_xlsx_parser.models.common import CellAnnotation
+from ks_xlsx_parser.models.sheet import SheetDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/api.py b/src/ks_xlsx_parser/api.py
similarity index 97%
rename from src/api.py
rename to src/ks_xlsx_parser/api.py
index 36469c5..82a367d 100644
--- a/src/api.py
+++ b/src/ks_xlsx_parser/api.py
@@ -2,7 +2,7 @@
 FastAPI application for uploading and parsing Excel files.
 
 Run with:
-    uvicorn xlsx_parser.api:app --reload --port 8080
+    uvicorn ks_xlsx_parser.api:app --reload --port 8080
 
 Or use the xlsx-parser-api script (runs on port 8080 by default):
     xlsx-parser-api
@@ -12,16 +12,11 @@
 
 from __future__ import annotations
 
-import json
-import time
-import traceback
-from pathlib import Path
-
 from fastapi import FastAPI, File, UploadFile
 from fastapi.responses import HTMLResponse, JSONResponse
 
-from pipeline import parse_workbook
-from verification import StageVerifier
+from ks_xlsx_parser.pipeline import parse_workbook
+from ks_xlsx_parser.verification import StageVerifier
 
 app = FastAPI(
     title="ks-xlsx-parser API",
diff --git a/src/charts/__init__.py b/src/ks_xlsx_parser/charts/__init__.py
similarity index 100%
rename from src/charts/__init__.py
rename to src/ks_xlsx_parser/charts/__init__.py
diff --git a/src/charts/chart_extractor.py b/src/ks_xlsx_parser/charts/chart_extractor.py
similarity index 99%
rename from src/charts/chart_extractor.py
rename to src/ks_xlsx_parser/charts/chart_extractor.py
index 4c4cdca..017feb9 100644
--- a/src/charts/chart_extractor.py
+++ b/src/ks_xlsx_parser/charts/chart_extractor.py
@@ -21,8 +21,8 @@
 from pathlib import Path
 from xml.etree import ElementTree as ET
 
-from models.chart import ChartAnchor, ChartAxis, ChartDTO, ChartSeries
-from models.common import ChartType
+from ks_xlsx_parser.models.chart import ChartAnchor, ChartAxis, ChartDTO, ChartSeries
+from ks_xlsx_parser.models.common import ChartType
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/chunking/__init__.py b/src/ks_xlsx_parser/chunking/__init__.py
similarity index 100%
rename from src/chunking/__init__.py
rename to src/ks_xlsx_parser/chunking/__init__.py
diff --git a/src/chunking/chunker.py b/src/ks_xlsx_parser/chunking/chunker.py
similarity index 95%
rename from src/chunking/chunker.py
rename to src/ks_xlsx_parser/chunking/chunker.py
index bb4d4f9..7057e78 100644
--- a/src/chunking/chunker.py
+++ b/src/ks_xlsx_parser/chunking/chunker.py
@@ -13,14 +13,12 @@
 
 import logging
 
-from models.block import BlockDTO, ChunkDTO, DependencySummary
-from models.common import CellCoord, EdgeType
-from models.dependency import DependencyGraph
-from models.sheet import SheetDTO
-from models.table import TableDTO
-from models.workbook import NamedRangeDTO, WorkbookDTO
-from rendering.html_renderer import HtmlRenderer
-from rendering.text_renderer import TextRenderer
+from ks_xlsx_parser.models.block import BlockDTO, ChunkDTO, DependencySummary
+from ks_xlsx_parser.models.common import CellCoord, EdgeType
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.workbook import WorkbookDTO
+from ks_xlsx_parser.rendering.html_renderer import HtmlRenderer
+from ks_xlsx_parser.rendering.text_renderer import TextRenderer
 
 logger = logging.getLogger(__name__)
 
@@ -179,7 +177,7 @@ def _chart_to_chunk(self, chart) -> ChunkDTO:
         top_left = "A1"
         bottom_right = "A1"
         if chart.anchor:
-            from models.common import col_number_to_letter
+            from ks_xlsx_parser.models.common import col_number_to_letter
             top_left = f"{col_number_to_letter(chart.anchor.from_col + 1)}{chart.anchor.from_row + 1}"
             if chart.anchor.to_col is not None and chart.anchor.to_row is not None:
                 bottom_right = f"{col_number_to_letter(chart.anchor.to_col + 1)}{chart.anchor.to_row + 1}"
diff --git a/src/chunking/segmenter.py b/src/ks_xlsx_parser/chunking/segmenter.py
similarity index 98%
rename from src/chunking/segmenter.py
rename to src/ks_xlsx_parser/chunking/segmenter.py
index 1b7898a..bde7b2a 100644
--- a/src/chunking/segmenter.py
+++ b/src/ks_xlsx_parser/chunking/segmenter.py
@@ -22,10 +22,10 @@
 
 import logging
 
-from models.block import BlockDTO
-from models.common import BlockType, CellCoord, CellRange
-from models.sheet import SheetDTO
-from models.table import TableDTO
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange
+from ks_xlsx_parser.models.sheet import SheetDTO
+from ks_xlsx_parser.models.table import TableDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/comparison/__init__.py b/src/ks_xlsx_parser/comparison/__init__.py
similarity index 100%
rename from src/comparison/__init__.py
rename to src/ks_xlsx_parser/comparison/__init__.py
diff --git a/src/comparison/template_comparator.py b/src/ks_xlsx_parser/comparison/template_comparator.py
similarity index 99%
rename from src/comparison/template_comparator.py
rename to src/ks_xlsx_parser/comparison/template_comparator.py
index e7e043e..0c4fd64 100644
--- a/src/comparison/template_comparator.py
+++ b/src/ks_xlsx_parser/comparison/template_comparator.py
@@ -11,8 +11,8 @@
 import logging
 from collections import defaultdict
 
-from models.common import CellCoord, compute_hash
-from models.template import (
+from ks_xlsx_parser.models.common import CellCoord
+from ks_xlsx_parser.models.template import (
     DOFConflict,
     DOFType,
     GeneralizedTemplate,
diff --git a/src/export/__init__.py b/src/ks_xlsx_parser/export/__init__.py
similarity index 100%
rename from src/export/__init__.py
rename to src/ks_xlsx_parser/export/__init__.py
diff --git a/src/export/model_exporter.py b/src/ks_xlsx_parser/export/model_exporter.py
similarity index 98%
rename from src/export/model_exporter.py
rename to src/ks_xlsx_parser/export/model_exporter.py
index d47dc7d..0784aac 100644
--- a/src/export/model_exporter.py
+++ b/src/ks_xlsx_parser/export/model_exporter.py
@@ -12,8 +12,8 @@
 from pathlib import Path
 from typing import Any
 
-from models.common import CellCoord
-from models.template import DOFType, GeneralizedTemplate, TemplateNode
+from ks_xlsx_parser.models.common import CellCoord
+from ks_xlsx_parser.models.template import DOFType, GeneralizedTemplate
 
 logger = logging.getLogger(__name__)
 
@@ -205,7 +205,7 @@ def export_code(
             Constants: {template.total_constants} | DOFs: {template.total_dofs} | Formulas: {template.total_formulas}
             """
 
-            from xlsx_parser.export.model_exporter import SpreadsheetImporter
+            from ks_xlsx_parser.export.model_exporter import SpreadsheetImporter
             from xlsx_parser import parse_workbook
 
 
diff --git a/src/formula/__init__.py b/src/ks_xlsx_parser/formula/__init__.py
similarity index 100%
rename from src/formula/__init__.py
rename to src/ks_xlsx_parser/formula/__init__.py
diff --git a/src/formula/dependency_builder.py b/src/ks_xlsx_parser/formula/dependency_builder.py
similarity index 93%
rename from src/formula/dependency_builder.py
rename to src/ks_xlsx_parser/formula/dependency_builder.py
index 8bd85b0..19785af 100644
--- a/src/formula/dependency_builder.py
+++ b/src/ks_xlsx_parser/formula/dependency_builder.py
@@ -11,13 +11,14 @@
 import logging
 from typing import TYPE_CHECKING
 
-from models.common import CellCoord, EdgeType
-from models.dependency import DependencyEdgeDTO, DependencyGraph
+from ks_xlsx_parser.models.common import CellCoord, EdgeType
+from ks_xlsx_parser.models.dependency import DependencyEdgeDTO, DependencyGraph
+
 from .formula_parser import FormulaParser
 
 if TYPE_CHECKING:
-    from models.sheet import SheetDTO
-    from models.workbook import NamedRangeDTO
+    from ks_xlsx_parser.models.sheet import SheetDTO
+    from ks_xlsx_parser.models.workbook import NamedRangeDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/formula/formula_parser.py b/src/ks_xlsx_parser/formula/formula_parser.py
similarity index 98%
rename from src/formula/formula_parser.py
rename to src/ks_xlsx_parser/formula/formula_parser.py
index f850b91..941f7a4 100644
--- a/src/formula/formula_parser.py
+++ b/src/ks_xlsx_parser/formula/formula_parser.py
@@ -10,9 +10,9 @@
 from __future__ import annotations
 
 import re
-from dataclasses import dataclass, field
+from dataclasses import dataclass
 
-from models.common import CellCoord, CellRange, col_letter_to_number
+from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number
 
 # Optional Rust fast-path: ks_xlsx_core.scan_formula returns reference tuples
 # in the same emit order as the Python regex pipeline. When available, we
diff --git a/src/models/__init__.py b/src/ks_xlsx_parser/models/__init__.py
similarity index 94%
rename from src/models/__init__.py
rename to src/ks_xlsx_parser/models/__init__.py
index 42814ea..5c91780 100644
--- a/src/models/__init__.py
+++ b/src/ks_xlsx_parser/models/__init__.py
@@ -1,8 +1,8 @@
 """
-Data models (DTOs) for the xlsx_parser.
+Data models (DTOs) for the ks_xlsx_parser.
 
 Re-exports all model classes for convenient importing:
-    from xlsx_parser.models import WorkbookDTO, SheetDTO, CellDTO, ...
+    from ks_xlsx_parser.models import WorkbookDTO, SheetDTO, CellDTO, ...
 """
 
 from .block import BlockDTO, ChunkDTO, DependencySummary
diff --git a/src/models/block.py b/src/ks_xlsx_parser/models/block.py
similarity index 100%
rename from src/models/block.py
rename to src/ks_xlsx_parser/models/block.py
diff --git a/src/models/cell.py b/src/ks_xlsx_parser/models/cell.py
similarity index 100%
rename from src/models/cell.py
rename to src/ks_xlsx_parser/models/cell.py
diff --git a/src/models/chart.py b/src/ks_xlsx_parser/models/chart.py
similarity index 100%
rename from src/models/chart.py
rename to src/ks_xlsx_parser/models/chart.py
diff --git a/src/models/common.py b/src/ks_xlsx_parser/models/common.py
similarity index 100%
rename from src/models/common.py
rename to src/ks_xlsx_parser/models/common.py
diff --git a/src/models/dependency.py b/src/ks_xlsx_parser/models/dependency.py
similarity index 100%
rename from src/models/dependency.py
rename to src/ks_xlsx_parser/models/dependency.py
diff --git a/src/models/shape.py b/src/ks_xlsx_parser/models/shape.py
similarity index 97%
rename from src/models/shape.py
rename to src/ks_xlsx_parser/models/shape.py
index 178ebaf..a0f94f9 100644
--- a/src/models/shape.py
+++ b/src/ks_xlsx_parser/models/shape.py
@@ -9,7 +9,7 @@
 
 from pydantic import Field
 
-from .common import CellRange, StableModel, compute_hash
+from .common import StableModel, compute_hash
 
 
 class ShapeAnchor(StableModel):
diff --git a/src/models/sheet.py b/src/ks_xlsx_parser/models/sheet.py
similarity index 99%
rename from src/models/sheet.py
rename to src/ks_xlsx_parser/models/sheet.py
index bfc2758..68c8cb1 100644
--- a/src/models/sheet.py
+++ b/src/ks_xlsx_parser/models/sheet.py
@@ -8,8 +8,6 @@
 
 from __future__ import annotations
 
-from typing import Any
-
 from pydantic import Field
 
 from .cell import CellDTO
diff --git a/src/models/table.py b/src/ks_xlsx_parser/models/table.py
similarity index 100%
rename from src/models/table.py
rename to src/ks_xlsx_parser/models/table.py
diff --git a/src/models/table_structure.py b/src/ks_xlsx_parser/models/table_structure.py
similarity index 100%
rename from src/models/table_structure.py
rename to src/ks_xlsx_parser/models/table_structure.py
diff --git a/src/models/template.py b/src/ks_xlsx_parser/models/template.py
similarity index 100%
rename from src/models/template.py
rename to src/ks_xlsx_parser/models/template.py
diff --git a/src/models/tree.py b/src/ks_xlsx_parser/models/tree.py
similarity index 100%
rename from src/models/tree.py
rename to src/ks_xlsx_parser/models/tree.py
diff --git a/src/models/workbook.py b/src/ks_xlsx_parser/models/workbook.py
similarity index 99%
rename from src/models/workbook.py
rename to src/ks_xlsx_parser/models/workbook.py
index 772c934..39a428b 100644
--- a/src/models/workbook.py
+++ b/src/ks_xlsx_parser/models/workbook.py
@@ -32,7 +32,7 @@
 from .sheet import SheetDTO
 from .table import TableDTO
 from .table_structure import TableStructure
-from .template import GeneralizedTemplate, TemplateNode
+from .template import TemplateNode
 from .tree import TreeNode
 
 
diff --git a/src/parsers/__init__.py b/src/ks_xlsx_parser/parsers/__init__.py
similarity index 100%
rename from src/parsers/__init__.py
rename to src/ks_xlsx_parser/parsers/__init__.py
diff --git a/src/parsers/calamine_core.py b/src/ks_xlsx_parser/parsers/calamine_core.py
similarity index 99%
rename from src/parsers/calamine_core.py
rename to src/ks_xlsx_parser/parsers/calamine_core.py
index 830df5d..e6ac959 100644
--- a/src/parsers/calamine_core.py
+++ b/src/ks_xlsx_parser/parsers/calamine_core.py
@@ -17,7 +17,6 @@
 """
 from __future__ import annotations
 
-import io
 import logging
 import os
 import tempfile
diff --git a/src/parsers/cell_parser.py b/src/ks_xlsx_parser/parsers/cell_parser.py
similarity index 99%
rename from src/parsers/cell_parser.py
rename to src/ks_xlsx_parser/parsers/cell_parser.py
index cc22092..bf1857c 100644
--- a/src/parsers/cell_parser.py
+++ b/src/ks_xlsx_parser/parsers/cell_parser.py
@@ -17,7 +17,7 @@
 from openpyxl.cell.cell import Cell as OpenpyxlCell
 from openpyxl.cell.cell import MergedCell as OpenpyxlMergedCell
 
-from models.cell import (
+from ks_xlsx_parser.models.cell import (
     AlignmentStyle,
     BorderSide,
     BorderStyle,
@@ -26,7 +26,7 @@
     FillStyle,
     FontStyle,
 )
-from models.common import CellCoord, RichTextRun
+from ks_xlsx_parser.models.common import CellCoord, RichTextRun
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/parsers/sheet_parser.py b/src/ks_xlsx_parser/parsers/sheet_parser.py
similarity index 98%
rename from src/parsers/sheet_parser.py
rename to src/ks_xlsx_parser/parsers/sheet_parser.py
index 5883a92..d49a28d 100644
--- a/src/parsers/sheet_parser.py
+++ b/src/ks_xlsx_parser/parsers/sheet_parser.py
@@ -18,21 +18,22 @@
 from lxml import etree
 from openpyxl.worksheet.worksheet import Worksheet as OpenpyxlWorksheet
 
-from models.cell import CellDTO
-from models.common import CellCoord, CellRange, ParseError, Severity, col_letter_to_number
+from ks_xlsx_parser.models.cell import CellDTO
+from ks_xlsx_parser.models.common import CellCoord, CellRange, ParseError, Severity, col_letter_to_number
 
 # OOXML namespace for spreadsheetML
 _OOXML_NS = {"s": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}
 
 # Regex to split an A1-style cell reference like "B1" into ("B", "1")
 _CELL_REF_RE = re.compile(r"^([A-Z]+)(\d+)$")
-from models.sheet import (
+from ks_xlsx_parser.models.sheet import (
     ConditionalFormatRule,
     DataValidationRule,
     MergedRegion,
     SheetDTO,
     SheetProperties,
 )
+
 from .cell_parser import CellParser
 
 logger = logging.getLogger(__name__)
@@ -282,7 +283,7 @@ def _extract_dimensions(self, sheet: SheetDTO) -> None:
 
         for col_letter, cd in self._ws.column_dimensions.items():
             if cd.width:
-                from models.common import col_letter_to_number
+                from ks_xlsx_parser.models.common import col_letter_to_number
                 col_num = col_letter_to_number(col_letter)
                 sheet.col_widths[col_num] = cd.width * 7.5  # chars to points approx
 
@@ -294,7 +295,7 @@ def _extract_hidden(self, sheet: SheetDTO) -> None:
 
         for col_letter, cd in self._ws.column_dimensions.items():
             if cd.hidden:
-                from models.common import col_letter_to_number
+                from ks_xlsx_parser.models.common import col_letter_to_number
                 col_num = col_letter_to_number(col_letter)
                 sheet.hidden_cols.add(col_num)
 
@@ -347,11 +348,10 @@ def _extract_autofilter(self, sheet: SheetDTO) -> None:
         if not af or not af.ref:
             return
         try:
-            from models.common import FilterCriteria
+            from ks_xlsx_parser.models.common import FilterCriteria
             ref = str(af.ref)
             parts = ref.split(":")
             if len(parts) == 2:
-                from models.common import col_letter_to_number as c2n
                 start_match = _CELL_REF_RE.match(parts[0])
                 end_match = _CELL_REF_RE.match(parts[1])
                 if start_match and end_match:
diff --git a/src/parsers/table_parser.py b/src/ks_xlsx_parser/parsers/table_parser.py
similarity index 97%
rename from src/parsers/table_parser.py
rename to src/ks_xlsx_parser/parsers/table_parser.py
index d2eb2eb..bdc0d55 100644
--- a/src/parsers/table_parser.py
+++ b/src/ks_xlsx_parser/parsers/table_parser.py
@@ -12,8 +12,8 @@
 
 from openpyxl.worksheet.worksheet import Worksheet as OpenpyxlWorksheet
 
-from models.common import CellCoord, CellRange, col_letter_to_number
-from models.table import TableColumn, TableDTO
+from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number
+from ks_xlsx_parser.models.table import TableColumn, TableDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/parsers/workbook_parser.py b/src/ks_xlsx_parser/parsers/workbook_parser.py
similarity index 97%
rename from src/parsers/workbook_parser.py
rename to src/ks_xlsx_parser/parsers/workbook_parser.py
index bee0452..57e0c93 100644
--- a/src/parsers/workbook_parser.py
+++ b/src/ks_xlsx_parser/parsers/workbook_parser.py
@@ -11,7 +11,6 @@
 import io
 import logging
 import time
-from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
 from pathlib import Path
 
 import xxhash
@@ -21,7 +20,7 @@
 # both values AND formulas in one read; python-calamine is kept as a fallback
 # for environments where the Rust crate hasn't been built. Both are bypassed
 # cleanly if absent — correctness always falls back to openpyxl.
-from parsers import calamine_core as _calamine_core
+from ks_xlsx_parser.parsers import calamine_core as _calamine_core
 
 try:
     from python_calamine import CalamineWorkbook
@@ -30,14 +29,14 @@
     CalamineWorkbook = None  # type: ignore[assignment]
     _HAS_PYCALAMINE = False
 
-from models.common import ParseError, Severity
-from models.common import CalculationMode, DateSystem
-from models.workbook import (
+from ks_xlsx_parser.models.common import CalculationMode, DateSystem, ParseError, Severity
+from ks_xlsx_parser.models.workbook import (
     ExternalLink,
     NamedRangeDTO,
     WorkbookDTO,
     WorkbookProperties,
 )
+
 from .sheet_parser import SheetParser
 from .table_parser import TableParser
 
@@ -234,7 +233,7 @@ def parse(self) -> WorkbookDTO:
 
         # Extract charts via OOXML parsing
         try:
-            from charts.chart_extractor import ChartExtractor
+            from ks_xlsx_parser.charts.chart_extractor import ChartExtractor
             chart_extractor = ChartExtractor(
                 self._path or self._content, wb_formula.sheetnames
             )
@@ -253,7 +252,7 @@ def parse(self) -> WorkbookDTO:
         # accounts for ~25% of the full-mode wall clock).
         if self._build_dep_graph:
             try:
-                from formula.dependency_builder import DependencyBuilder
+                from ks_xlsx_parser.formula.dependency_builder import DependencyBuilder
                 dep_builder = DependencyBuilder(result.sheets, result.named_ranges)
                 result.dependency_graph = dep_builder.build()
             except Exception as e:
diff --git a/src/pipeline.py b/src/ks_xlsx_parser/pipeline.py
similarity index 90%
rename from src/pipeline.py
rename to src/ks_xlsx_parser/pipeline.py
index 9b902c5..1451074 100644
--- a/src/pipeline.py
+++ b/src/ks_xlsx_parser/pipeline.py
@@ -1,7 +1,7 @@
 """
 End-to-end parsing pipeline.
 
-This is the primary public API for the xlsx_parser. It orchestrates
+This is the primary public API for the ks_xlsx_parser. It orchestrates
 the full 11-stage Excellent Algorithm:
   Stage 0: Sheet Chunking (adaptive gaps + style boundaries)
   Stage 1: Cell Annotation (feature-based scorer)
@@ -15,7 +15,7 @@
   Render + Store
 
 Usage:
-    from xlsx_parser.pipeline import parse_workbook
+    from ks_xlsx_parser.pipeline import parse_workbook
     result = parse_workbook("path/to/workbook.xlsx")
     print(result.workbook.total_cells)
     for chunk in result.chunks:
@@ -32,25 +32,25 @@
 
 ParseMode = Literal["full", "fast"]
 
-from analysis.light_block_detector import LightBlockDetector
-from models.common import CellCoord, CellRange, col_letter_to_number
-from analysis.pattern_splitter import PatternSplitter
-from analysis.table_assembler import TableAssembler
-from analysis.table_grouper import TableGrouper
-from analysis.template_extractor import TemplateExtractor
-from analysis.tree_builder import TreeBuilder
-from annotation.block_splitter import BlockSplitter
-from annotation.cell_annotator import CellAnnotator
-from chunking.chunker import ChunkBuilder
-from comparison.template_comparator import TemplateComparator
-from export.model_exporter import ModelExporter
-from models.block import ChunkDTO
-from models.table_structure import TableStructure
-from models.template import GeneralizedTemplate, TemplateNode
-from models.tree import TreeNode
-from models.workbook import WorkbookDTO
-from parsers.workbook_parser import WorkbookParser
-from storage.serializer import WorkbookSerializer
+from ks_xlsx_parser.analysis.light_block_detector import LightBlockDetector
+from ks_xlsx_parser.analysis.pattern_splitter import PatternSplitter
+from ks_xlsx_parser.analysis.table_assembler import TableAssembler
+from ks_xlsx_parser.analysis.table_grouper import TableGrouper
+from ks_xlsx_parser.analysis.template_extractor import TemplateExtractor
+from ks_xlsx_parser.analysis.tree_builder import TreeBuilder
+from ks_xlsx_parser.annotation.block_splitter import BlockSplitter
+from ks_xlsx_parser.annotation.cell_annotator import CellAnnotator
+from ks_xlsx_parser.chunking.chunker import ChunkBuilder
+from ks_xlsx_parser.comparison.template_comparator import TemplateComparator
+from ks_xlsx_parser.export.model_exporter import ModelExporter
+from ks_xlsx_parser.models.block import ChunkDTO
+from ks_xlsx_parser.models.common import CellCoord, CellRange, col_letter_to_number
+from ks_xlsx_parser.models.table_structure import TableStructure
+from ks_xlsx_parser.models.template import GeneralizedTemplate, TemplateNode
+from ks_xlsx_parser.models.tree import TreeNode
+from ks_xlsx_parser.models.workbook import WorkbookDTO
+from ks_xlsx_parser.parsers.workbook_parser import WorkbookParser
+from ks_xlsx_parser.storage.serializer import WorkbookSerializer
 
 logger = logging.getLogger(__name__)
 
@@ -261,7 +261,7 @@ def parse_workbook(
         annotator.annotate()
 
         # Stage 0+2: Segment then split by annotation
-        from chunking.segmenter import LayoutSegmenter
+        from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
         sheet_tables = [t for t in workbook.tables if t.sheet_name == sheet.sheet_name]
         sheet_named = [
             nr.name for nr in workbook.named_ranges
diff --git a/src/py.typed b/src/ks_xlsx_parser/py.typed
similarity index 100%
rename from src/py.typed
rename to src/ks_xlsx_parser/py.typed
diff --git a/src/rendering/__init__.py b/src/ks_xlsx_parser/rendering/__init__.py
similarity index 100%
rename from src/rendering/__init__.py
rename to src/ks_xlsx_parser/rendering/__init__.py
diff --git a/src/rendering/html_renderer.py b/src/ks_xlsx_parser/rendering/html_renderer.py
similarity index 96%
rename from src/rendering/html_renderer.py
rename to src/ks_xlsx_parser/rendering/html_renderer.py
index c7c1a1c..50a8b21 100644
--- a/src/rendering/html_renderer.py
+++ b/src/ks_xlsx_parser/rendering/html_renderer.py
@@ -14,10 +14,10 @@
 import html
 import logging
 
-from models.block import BlockDTO
-from models.cell import CellDTO, CellStyle
-from models.common import BlockType, CellCoord, col_number_to_letter
-from models.sheet import SheetDTO
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.cell import CellStyle
+from ks_xlsx_parser.models.common import BlockType, col_number_to_letter
+from ks_xlsx_parser.models.sheet import SheetDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/rendering/text_renderer.py b/src/ks_xlsx_parser/rendering/text_renderer.py
similarity index 97%
rename from src/rendering/text_renderer.py
rename to src/ks_xlsx_parser/rendering/text_renderer.py
index 76b9b76..e2c15f4 100644
--- a/src/rendering/text_renderer.py
+++ b/src/ks_xlsx_parser/rendering/text_renderer.py
@@ -11,10 +11,10 @@
 import datetime as _dt
 import logging
 
-from models.block import BlockDTO
-from models.common import BlockType, col_number_to_letter
-from models.chart import ChartDTO
-from models.sheet import SheetDTO
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.chart import ChartDTO
+from ks_xlsx_parser.models.common import BlockType, col_number_to_letter
+from ks_xlsx_parser.models.sheet import SheetDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/storage/__init__.py b/src/ks_xlsx_parser/storage/__init__.py
similarity index 100%
rename from src/storage/__init__.py
rename to src/ks_xlsx_parser/storage/__init__.py
diff --git a/src/storage/serializer.py b/src/ks_xlsx_parser/storage/serializer.py
similarity index 98%
rename from src/storage/serializer.py
rename to src/ks_xlsx_parser/storage/serializer.py
index 5ced8f9..66add02 100644
--- a/src/storage/serializer.py
+++ b/src/ks_xlsx_parser/storage/serializer.py
@@ -15,8 +15,8 @@
 import logging
 from typing import Any
 
-from models.block import ChunkDTO
-from models.workbook import WorkbookDTO
+from ks_xlsx_parser.models.block import ChunkDTO
+from ks_xlsx_parser.models.workbook import WorkbookDTO
 
 logger = logging.getLogger(__name__)
 
diff --git a/src/utils/__init__.py b/src/ks_xlsx_parser/utils/__init__.py
similarity index 100%
rename from src/utils/__init__.py
rename to src/ks_xlsx_parser/utils/__init__.py
diff --git a/src/utils/logging_config.py b/src/ks_xlsx_parser/utils/logging_config.py
similarity index 97%
rename from src/utils/logging_config.py
rename to src/ks_xlsx_parser/utils/logging_config.py
index ca23108..94ceb54 100644
--- a/src/utils/logging_config.py
+++ b/src/ks_xlsx_parser/utils/logging_config.py
@@ -1,5 +1,5 @@
 """
-Structured logging configuration for the xlsx_parser.
+Structured logging configuration for the ks_xlsx_parser.
 
 Provides a JSON-structured logging format with workbook/sheet/block context
 fields for observability and debugging in production.
diff --git a/src/verification/__init__.py b/src/ks_xlsx_parser/verification/__init__.py
similarity index 100%
rename from src/verification/__init__.py
rename to src/ks_xlsx_parser/verification/__init__.py
diff --git a/src/verification/stage_verifier.py b/src/ks_xlsx_parser/verification/stage_verifier.py
similarity index 97%
rename from src/verification/stage_verifier.py
rename to src/ks_xlsx_parser/verification/stage_verifier.py
index 3fb315b..88fb4f8 100644
--- a/src/verification/stage_verifier.py
+++ b/src/ks_xlsx_parser/verification/stage_verifier.py
@@ -16,23 +16,21 @@
 
 from pydantic import Field
 
-from analysis.light_block_detector import LightBlockDetector
-from analysis.pattern_splitter import PatternSplitter
-from analysis.table_assembler import TableAssembler
-from analysis.table_grouper import TableGrouper
-from analysis.template_extractor import TemplateExtractor
-from analysis.tree_builder import TreeBuilder
-from annotation.block_splitter import BlockSplitter
-from annotation.cell_annotator import CellAnnotator
-from chunking.segmenter import LayoutSegmenter
-from comparison.template_comparator import TemplateComparator
-from export.model_exporter import ModelExporter
-from models.block import BlockDTO
-from models.common import BlockType, CellAnnotation, StableModel
-from models.table_structure import TableStructure
-from models.template import TemplateNode
-from models.tree import TreeNode
-from parsers.workbook_parser import WorkbookParser
+from ks_xlsx_parser.analysis.light_block_detector import LightBlockDetector
+from ks_xlsx_parser.analysis.pattern_splitter import PatternSplitter
+from ks_xlsx_parser.analysis.table_assembler import TableAssembler
+from ks_xlsx_parser.analysis.table_grouper import TableGrouper
+from ks_xlsx_parser.analysis.template_extractor import TemplateExtractor
+from ks_xlsx_parser.analysis.tree_builder import TreeBuilder
+from ks_xlsx_parser.annotation.block_splitter import BlockSplitter
+from ks_xlsx_parser.annotation.cell_annotator import CellAnnotator
+from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
+from ks_xlsx_parser.models.block import BlockDTO
+from ks_xlsx_parser.models.common import BlockType, CellAnnotation, StableModel
+from ks_xlsx_parser.models.table_structure import TableStructure
+from ks_xlsx_parser.models.template import TemplateNode
+from ks_xlsx_parser.models.tree import TreeNode
+from ks_xlsx_parser.parsers.workbook_parser import WorkbookParser
 
 logger = logging.getLogger(__name__)
 
diff --git a/tests/benchmarks/_driver.py b/tests/benchmarks/_driver.py
index a66acd5..94ee633 100644
--- a/tests/benchmarks/_driver.py
+++ b/tests/benchmarks/_driver.py
@@ -18,10 +18,8 @@
 import subprocess
 import sys
 from collections import defaultdict
-from dataclasses import asdict
 from datetime import UTC, datetime
 from pathlib import Path
-from typing import Iterable
 
 from ._runner import Runner
 from ._schema import CSV_FIELDS, BenchmarkRecord, record_to_csv_row, validate_record
@@ -344,7 +342,7 @@ def generate_drift(out_dir: Path) -> None:
                 short = Path(fname).name
                 lines.append(f"| {short} | {va} | {vb} | {va - vb:+d} |")
         if n_drift == 0:
-            lines.append(f"| _no drift_ | | | |")
+            lines.append("| _no drift_ | | | |")
         elif n_drift > 50:
             lines.append(f"| … {n_drift - 50} more rows truncated | | | |")
         lines.append("")
diff --git a/tests/benchmarks/_runner.py b/tests/benchmarks/_runner.py
index 2bdb95c..a5febc9 100644
--- a/tests/benchmarks/_runner.py
+++ b/tests/benchmarks/_runner.py
@@ -18,7 +18,7 @@
 import time
 from dataclasses import dataclass
 from pathlib import Path
-from typing import Any, Iterator
+from typing import Any
 
 HERE = Path(__file__).resolve().parent
 REPO_ROOT = HERE.parent.parent
diff --git a/tests/benchmarks/adapters/docling_adapter.py b/tests/benchmarks/adapters/docling_adapter.py
index 17d4b02..bd573f5 100644
--- a/tests/benchmarks/adapters/docling_adapter.py
+++ b/tests/benchmarks/adapters/docling_adapter.py
@@ -41,7 +41,7 @@
 sys.path.insert(0, str(_HERE.parent.parent.parent))  # repo root
 
 from tests.benchmarks._mem import peak_rss_mb  # noqa: E402
-from tests.benchmarks._schema import BenchmarkRecord, SCHEMA_VERSION  # noqa: E402
+from tests.benchmarks._schema import SCHEMA_VERSION, BenchmarkRecord  # noqa: E402
 
 PARSER_NAME = "docling"
 MAX_ERR_LEN = 500
diff --git a/tests/benchmarks/adapters/ks_adapter.py b/tests/benchmarks/adapters/ks_adapter.py
index 917df29..754b2a6 100644
--- a/tests/benchmarks/adapters/ks_adapter.py
+++ b/tests/benchmarks/adapters/ks_adapter.py
@@ -28,10 +28,10 @@
 sys.path.insert(0, str(_HERE.parent.parent.parent))  # repo root
 sys.path.insert(0, str(_HERE.parent.parent.parent / "src"))  # src layout
 
-from tests.benchmarks._mem import peak_rss_mb  # noqa: E402
-from tests.benchmarks._schema import BenchmarkRecord, SCHEMA_VERSION  # noqa: E402
 from ks_xlsx_parser import __version__ as KS_VERSION  # noqa: E402
 from ks_xlsx_parser import parse_workbook  # noqa: E402
+from tests.benchmarks._mem import peak_rss_mb  # noqa: E402
+from tests.benchmarks._schema import SCHEMA_VERSION, BenchmarkRecord  # noqa: E402
 
 MAX_ERR_LEN = 500
 
diff --git a/tests/conftest.py b/tests/conftest.py
index d422b9b..1417e30 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -8,7 +8,6 @@
 
 
 
-import os
 import tempfile
 from pathlib import Path
 
@@ -17,7 +16,6 @@
 from openpyxl.chart import BarChart, Reference
 from openpyxl.formatting.rule import CellIsRule, ColorScaleRule, IconSetRule
 from openpyxl.styles import Alignment, Border, Font, PatternFill, Side
-from openpyxl.utils import get_column_letter
 from openpyxl.worksheet.datavalidation import DataValidation
 from openpyxl.worksheet.table import Table, TableStyleInfo
 
diff --git a/tests/helpers/invariant_checker.py b/tests/helpers/invariant_checker.py
index a077486..89efd1c 100644
--- a/tests/helpers/invariant_checker.py
+++ b/tests/helpers/invariant_checker.py
@@ -9,7 +9,7 @@
 
 import re
 
-from models.common import EdgeType
+from ks_xlsx_parser.models.common import EdgeType
 
 
 def check_invariants(workbook) -> list[str]:
diff --git a/tests/test_charts.py b/tests/test_charts.py
index 021e793..802e6a6 100644
--- a/tests/test_charts.py
+++ b/tests/test_charts.py
@@ -5,11 +5,10 @@
 type, series, axes, title, and position anchor.
 """
 
-import pytest
 
-from charts.chart_extractor import ChartExtractor
-from models import ChartType
-from parsers import WorkbookParser
+from ks_xlsx_parser.charts.chart_extractor import ChartExtractor
+from ks_xlsx_parser.models import ChartType
+from ks_xlsx_parser.parsers import WorkbookParser
 
 
 class TestChartExtraction:
diff --git a/tests/test_corpus_robustness.py b/tests/test_corpus_robustness.py
index 2253eea..4263c89 100644
--- a/tests/test_corpus_robustness.py
+++ b/tests/test_corpus_robustness.py
@@ -1,5 +1,5 @@
 """
-Corpus robustness tests for the xlsx_parser.
+Corpus robustness tests for the ks_xlsx_parser.
 
 Tests the parser against large sets of real-world .xlsx files to catch
 crashes, invariant violations, and regression. Corpus tests are skipped
@@ -15,17 +15,15 @@
 
 import pytest
 
-from pipeline import parse_workbook
-from models.common import Severity
-
-from tests.helpers.invariant_checker import check_invariants
+from ks_xlsx_parser.models.common import Severity
+from ks_xlsx_parser.pipeline import parse_workbook
 from tests.helpers.corpus_downloader import (
-    download_euses_corpus,
     download_enron_corpus,
+    download_euses_corpus,
     download_github_xlsx_samples,
     get_corpus_files,
 )
-
+from tests.helpers.invariant_checker import check_invariants
 
 CORPUS_DIR = Path(__file__).parent / "fixtures" / "corpus"
 
diff --git a/tests/test_formula_handling.py b/tests/test_formula_handling.py
index b3f8342..8402667 100644
--- a/tests/test_formula_handling.py
+++ b/tests/test_formula_handling.py
@@ -11,14 +11,11 @@
 - Renders formula cells in chunk output
 """
 
-import pytest
-
-from formula.dependency_builder import DependencyBuilder
-from formula.formula_parser import FormulaParser, ParsedReference
-from models.common import BlockType, CellCoord, EdgeType
-from parsers import WorkbookParser
-from pipeline import parse_workbook
 
+from ks_xlsx_parser.formula.formula_parser import FormulaParser
+from ks_xlsx_parser.models.common import CellCoord, EdgeType
+from ks_xlsx_parser.parsers import WorkbookParser
+from ks_xlsx_parser.pipeline import parse_workbook
 
 # ---------------------------------------------------------------------------
 # Helpers
@@ -705,7 +702,7 @@ class TestFormulaInBlocks:
     def test_block_formula_count(self, simple_formulas):
         result = parse_workbook(path=simple_formulas)
         # The block should report formula_count > 0
-        from chunking.segmenter import LayoutSegmenter
+        from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
         sheet = result.workbook.sheets[0]
         tables = [t for t in result.workbook.tables if t.sheet_name == sheet.sheet_name]
         segmenter = LayoutSegmenter(sheet, tables=tables)
@@ -715,7 +712,7 @@ def test_block_formula_count(self, simple_formulas):
 
     def test_cross_sheet_block_has_formulas(self, cross_sheet_formulas):
         result = parse_workbook(path=cross_sheet_formulas)
-        from chunking.segmenter import LayoutSegmenter
+        from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
         summary = [s for s in result.workbook.sheets if s.sheet_name == "Summary"][0]
         tables = [t for t in result.workbook.tables if t.sheet_name == summary.sheet_name]
         segmenter = LayoutSegmenter(summary, tables=tables)
diff --git a/tests/test_formula_parser.py b/tests/test_formula_parser.py
index 784e171..03c3959 100644
--- a/tests/test_formula_parser.py
+++ b/tests/test_formula_parser.py
@@ -5,10 +5,9 @@
 and circular reference detection.
 """
 
-import pytest
 
-from formula.formula_parser import FormulaParser
-from models import CellCoord, DependencyGraph, DependencyEdgeDTO, EdgeType
+from ks_xlsx_parser.formula.formula_parser import FormulaParser
+from ks_xlsx_parser.models import CellCoord, DependencyEdgeDTO, DependencyGraph, EdgeType
 
 
 class TestFormulaParser:
diff --git a/tests/test_llm_artifacts.py b/tests/test_llm_artifacts.py
index c3247eb..87f6335 100644
--- a/tests/test_llm_artifacts.py
+++ b/tests/test_llm_artifacts.py
@@ -5,16 +5,15 @@
 and reading-order linearization using programmatic fixtures.
 """
 
-import pytest
 
-from analysis import (
+from ks_xlsx_parser.analysis import (
     EntityIndexBuilder,
     KpiCatalogBuilder,
     ReadingOrderLinearizer,
     SheetSummaryAnalyzer,
 )
-from models.common import SheetPurpose
-from parsers import WorkbookParser
+from ks_xlsx_parser.models.common import SheetPurpose
+from ks_xlsx_parser.parsers import WorkbookParser
 
 
 class TestSheetSummaryAnalyzer:
diff --git a/tests/test_models.py b/tests/test_models.py
index bddec00..5393eea 100644
--- a/tests/test_models.py
+++ b/tests/test_models.py
@@ -4,12 +4,12 @@
 Covers CellCoord, CellRange, hashing, and serialization.
 """
 
-from models import (
+from ks_xlsx_parser.models import (
     CellCoord,
     CellRange,
-    compute_hash,
     col_letter_to_number,
     col_number_to_letter,
+    compute_hash,
 )
 
 
diff --git a/tests/test_multi_table_layout.py b/tests/test_multi_table_layout.py
index 7681a52..cca2057 100644
--- a/tests/test_multi_table_layout.py
+++ b/tests/test_multi_table_layout.py
@@ -8,11 +8,10 @@
 
 import pytest
 
-from chunking.segmenter import LayoutSegmenter
-from models import BlockType
-from parsers import WorkbookParser
-from pipeline import parse_workbook
-
+from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
+from ks_xlsx_parser.models import BlockType
+from ks_xlsx_parser.parsers import WorkbookParser
+from ks_xlsx_parser.pipeline import parse_workbook
 
 # ---------------------------------------------------------------------------
 # Helpers
diff --git a/tests/test_parsers.py b/tests/test_parsers.py
index 850a0bb..fe1e911 100644
--- a/tests/test_parsers.py
+++ b/tests/test_parsers.py
@@ -6,9 +6,8 @@
 tables, comments, data validations, and sheet properties.
 """
 
-import pytest
 
-from parsers import WorkbookParser
+from ks_xlsx_parser.parsers import WorkbookParser
 
 
 class TestSimpleWorkbook:
diff --git a/tests/test_pipeline.py b/tests/test_pipeline.py
index 534d125..7f1ab0e 100644
--- a/tests/test_pipeline.py
+++ b/tests/test_pipeline.py
@@ -7,9 +7,7 @@
 
 import json
 
-import pytest
-
-from pipeline import parse_workbook
+from ks_xlsx_parser.pipeline import parse_workbook
 
 
 class TestEndToEndPipeline:
diff --git a/tests/test_rendering.py b/tests/test_rendering.py
index e24072a..3a757ec 100644
--- a/tests/test_rendering.py
+++ b/tests/test_rendering.py
@@ -5,12 +5,11 @@
 headers, and coordinate annotations.
 """
 
-import pytest
 
-from chunking.segmenter import LayoutSegmenter
-from parsers import WorkbookParser
-from rendering.html_renderer import HtmlRenderer
-from rendering.text_renderer import TextRenderer
+from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
+from ks_xlsx_parser.parsers import WorkbookParser
+from ks_xlsx_parser.rendering.html_renderer import HtmlRenderer
+from ks_xlsx_parser.rendering.text_renderer import TextRenderer
 
 
 class TestHtmlRendering:
@@ -112,11 +111,10 @@ def test_numeric_cells_render_raw_not_display_formatted(self):
         sci-notation fallback (``1.272000e+03``) once the ``[=]`` formula
         marker pushed the rendered string past col_width — this test
         guards against that regression."""
-        from models.sheet import SheetDTO
-        from models.cell import CellDTO
-        from models.common import CellCoord, CellRange
-        from models.block import BlockDTO
-        from models.common import BlockType
+        from ks_xlsx_parser.models.block import BlockDTO
+        from ks_xlsx_parser.models.cell import CellDTO
+        from ks_xlsx_parser.models.common import BlockType, CellCoord, CellRange
+        from ks_xlsx_parser.models.sheet import SheetDTO
 
         coord = CellCoord(row=1, col=1)
         cell = CellDTO(
diff --git a/tests/test_segmentation.py b/tests/test_segmentation.py
index 8cb3cfc..45f13b0 100644
--- a/tests/test_segmentation.py
+++ b/tests/test_segmentation.py
@@ -5,11 +5,10 @@
 assumption blocks, result blocks, and text headers.
 """
 
-import pytest
 
-from chunking.segmenter import LayoutSegmenter
-from models import BlockType
-from parsers import WorkbookParser
+from ks_xlsx_parser.chunking.segmenter import LayoutSegmenter
+from ks_xlsx_parser.models import BlockType
+from ks_xlsx_parser.parsers import WorkbookParser
 
 
 class TestSegmentation:
diff --git a/tests/test_stage_verification.py b/tests/test_stage_verification.py
index 574be13..75dc8bc 100644
--- a/tests/test_stage_verification.py
+++ b/tests/test_stage_verification.py
@@ -8,9 +8,7 @@
 
 import json
 
-import pytest
-
-from verification import (
+from ks_xlsx_parser.verification import (
     ExcellentStage,
     ImplementationStatus,
     StageResult,
@@ -18,7 +16,6 @@
     VerificationReport,
 )
 
-
 # ---------------------------------------------------------------------------
 # Model tests
 # ---------------------------------------------------------------------------
diff --git a/tests/test_structural_invariants.py b/tests/test_structural_invariants.py
index 612d11a..bbe1460 100644
--- a/tests/test_structural_invariants.py
+++ b/tests/test_structural_invariants.py
@@ -1,5 +1,5 @@
 """
-Structural invariant tests for the xlsx_parser.
+Structural invariant tests for the ks_xlsx_parser.
 
 Tests properties that must always hold for any valid parse output,
 regardless of the input file: merge structure, used range bounds,
@@ -13,11 +13,8 @@
 
 import pytest
 
-from pipeline import parse_workbook
-from models.common import EdgeType
-
-from tests.helpers.invariant_checker import check_invariants
-
+from ks_xlsx_parser.models.common import EdgeType
+from ks_xlsx_parser.pipeline import parse_workbook
 
 # ---------------------------------------------------------------------------
 # Invariant tests on programmatic fixtures