Skip to content

bench(harness): make malformed GT specs scorable + add in-scope recall filter#10

Merged
arnav2 merged 1 commit into
mainfrom
bench/harness-regex-and-in-scope-filter
May 20, 2026
Merged

bench(harness): make malformed GT specs scorable + add in-scope recall filter#10
arnav2 merged 1 commit into
mainfrom
bench/harness-regex-and-in-scope-filter

Conversation

@arnav2
Copy link
Copy Markdown
Collaborator

@arnav2 arnav2 commented May 20, 2026

Summary

Two harness-side improvements to the SpreadsheetBench retrieval-recall benchmark, plus regression-test guards that lock in already-correct parser behavior. No src/ks_xlsx_parser/* changes — this PR is entirely scripts/ + tests/.

The motivating goal was the recall-90 roadmap in docs/planning/recall-90/. This PR doesn't reach 0.90; it ships the work that was actually completed and surfaces the metric the roadmap gates on (in-scope recall) so future parser work has a clean signal.

What changed

1. parse_position_spec tolerates malformed dataset entries

Eight to fourteen instances in SpreadsheetBench v0.1 carry GT strings the harness regex never matched:

Shape Example Count
unmatched single quote Sheet1'!A1, Dashboard'!B8, Output'!B11:B17 5
quote past separator 'RAWDATA!'A1:P6,'OUTPUT!'A1:P6' 2
fullwidth colon G12:J15 1
whitespace around : 'A1: A89', 'C1: C7' 6

All now parse via a normalize-then-fallback pass in scripts/eval_retrieval.py::parse_position_spec + parse_range. These instances were scored as misses despite the parser being correct.

Full 912 v0.1 impact: geom@5 0.369 → 0.478 (+10.9 pp), text@5 0.704 → 0.754 (+5.0 pp).

2. classify_execution_required + dual recall numbers (cluster 05)

~33% of SpreadsheetBench instances ask the system to compute and write a value that doesn't exist in the input (e.g. "Round B2 to nearest $5 and put it in C2"). A retrieval parser cannot satisfy these. They drag headline recall down ~16pp without representing parser quality.

summary.json / history.jsonl / summary.md now emit BOTH:

  • recall_*@5 — denominator = all 912 (long-standing number)
  • recall_*@5_in_scope — denominator excludes execution-required instances (the metric the roadmap gates on)

scripts/triage_recall.py default-filters out-of-scope rows; opt-in via --include-out-of-scope. An audit log out_of_scope.txt is written per run so the classifier's behavior is diffable.

Full 912 v0.1 in-scope: text@5 0.841, geom@5 0.534 (denominator 667 in-scope, 245 OOS).

3. enrich_failures.py: two real bug fixes

When the dataset entry has no explicit answer_sheet (561/912 instances) the old code skipped the first-worksheet fallback that the scoring path uses. Result: hundreds of spurious flags.

Flag Before After
instruction_requires_execution 498 193
gt_range_empty_in_workbook 477 172

The corrected counts now agree with the new classify_execution_required in eval_retrieval.py.

4. Regression-test guards (parser code unchanged)

Four new test files lock in behavior that's already correct on main but was undertested:

  • test_eval_retrieval_spec_parser.py (17 cases) — spec-parser tolerance for every malformed shape observed in the corpus
  • test_eval_retrieval_classify.py (8 cases) — classifier semantics + dual-recall aggregator math
  • test_array_formula_rendering.py (3 cases) — <ArrayFormula object …> repr must not leak into chunk render_text
  • test_formula_uncached_rendering.py (2 cases) — uncached formula cells render their formula source (not None or repr)

Type of change

  • 🐞 Bug fix (enrich_failures false positives; harness regex misses)
  • ✨ New feature (in-scope recall metric, dual numbers, out_of_scope.txt audit)
  • 🧪 Parser edge case / new regression test (4 new test files, +17 cases)

Test plan

  • make test passes locally — 1071 passed (1054 → 1071, +17 new)
  • Full 912 SpreadsheetBench v0.1 bench run completed; history.jsonl row appended (commit cf633d1, n=912)
  • scripts/enrich_failures.py re-runs cleanly on the corrected output; flag counts match the classifier
  • ruff check clean on every file this PR touches (pre-existing F401/F841 issues in untouched code remain; not widened)
  • Reviewer: spot-check the 14 malformed-spec instances flip miss → hit
  • Reviewer: confirm recall_*@5_in_scope is the metric the recall-90 roadmap should gate on

Notes for reviewers

  • No parser internals changed, only the eval harness. Tests-4 above guard behavior the parser already has so future refactors don't regress it.
  • Recall-90 target not reached: in-scope text@5 is 0.841 (target 0.90), in-scope geom@5 is 0.534 (target 0.70). The residual gap (cluster 02 range-bookkeeping + the cluster-04 chunk-size cap) is left to a follow-up. Cluster docs in docs/planning/recall-90/ (gitignored) carry status snapshots.
  • The dual recall_*_all / recall_*_in_scope schema is additive — existing consumers of summary.json keep working.

🤖 Generated with Claude Code

…l filter

Two harness-side changes to ks-xlsx-parser's SpreadsheetBench eval,
plus regression-test guards locking already-shipped parser behavior.

1. parse_position_spec tolerates malformed dataset entries:
   - unmatched single quotes (`Sheet1'!A1`, `'Sheet1!'A1`)
   - fullwidth colon (`G12:J15`)
   - whitespace around `:` (`A1: A89`)
   14 of 912 instances were scored as misses even when the parser was
   correct; they now score normally. On the full 912 v0.1:
     geom@5 0.369 → 0.478 (+10.9 pp), text@5 0.704 → 0.754 (+5.0 pp).

2. classify_execution_required + dual recall numbers.
   ~33% of SpreadsheetBench instances ask the system to compute and
   write a value that doesn't exist in the input — a retrieval parser
   cannot satisfy these. summary.json / history.jsonl / summary.md now
   emit BOTH `recall_*@5` (all 912) and `recall_*@5_in_scope`
   (denominator excludes execution-required instances). triage_recall
   default-filters out-of-scope rows; opt-in via --include-out-of-scope.
   `out_of_scope.txt` audit log written per run.
   On the full 912: text@5 in-scope 0.841, geom@5 in-scope 0.534.

3. enrich_failures: fix two `answer_sheet=None` false-positive bugs.
   When the dataset entry has no explicit sheet name (561/912 instances)
   the old code skipped the first-worksheet fallback used by the
   scoring path. Result: spurious flags on hundreds of instances.
   `instruction_requires_execution`  498 → 193
   `gt_range_empty_in_workbook`      477 → 172
   The remaining counts now match the corrected classifier in
   eval_retrieval.py.

4. Regression-test guards for parser behavior that is already correct
   on `main` (no parser code changed, just locked in tests):
   - test_eval_retrieval_spec_parser.py (17 cases) — spec-parser
     tolerance for every malformed shape observed in the corpus
   - test_eval_retrieval_classify.py (8 cases) — classifier semantics
     + dual-recall aggregator math
   - test_array_formula_rendering.py (3 cases) — ArrayFormula repr
     must not leak into chunk render_text
   - test_formula_uncached_rendering.py (2 cases) — uncached formula
     cells render their formula source (not None or repr)

Tests: 1071 passed (+17 new). No parser internals changed; all gains
come from harness scorability and from surfacing a metric that was
hidden inside the eval output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arnav2 arnav2 merged commit 1f852a3 into main May 20, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant