
Add eval harness scaffold: spec, scenarios, fixtures dir#52

Open
SakshiKekre wants to merge 11 commits into main from feat/eval-harness

SakshiKekre (Collaborator) commented May 15, 2026

Summary

End-to-end eval harness for uk-chat: scenarios, fixtures, runner, grader, and a first run's results. Originally scoped as just the scaffold + scenarios — grew to cover the full pipeline because the follow-up PRs would have been small and the harness only earns its keep once you can actually run it.

Pre-committed thresholds in evals/SPEC.md; first eval pass against PR #51's preview written up in evals/RESULTS-2026-05-27.md.

What's in the PR

Spec + scenarios (evals/SPEC.md, evals/scenarios/)

  • Two-test design: Test A (open-ended, rubric-graded, supplement positioning) vs Test B (numeric, fixture-graded, alternative positioning).
  • 9 hand-authored scenarios as YAML: 5 Test A + 4 Test B. B5 dropped — the reform (two-child limit removal) is a no-op vs policyengine_uk 2.88.20's baseline because Autumn Budget 2025 already removed it in current law. Drop documented in evals/fixtures/drift_report.md.

Runner (evals/runner/run.py)

  • POSTs each scenario N times to a configured chat backend, saves raw SSE + extracted text + manifest JSON per run.
  • --concurrency N for parallel scenarios (default sequential).
  • Captures tool_call_sequence, tool_call_counts_by_name, tool_failure_count, model_backend per run — needed for A/B'ing tool-registration changes.
  • Configurable via UK_CHAT_BACKEND_URL env or --backend-url flag; supports Vercel protection bypass.
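
The runner loop described above can be sketched roughly as follows. This is a minimal illustration rather than the code in evals/runner/run.py: `resolve_backend_url`, `run_scenarios`, and the injected `send` callable are hypothetical names, and the real HTTP POST, SSE capture, and manifest layout are not shown.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def resolve_backend_url(flag_value: Optional[str] = None) -> str:
    """Prefer the --backend-url flag, else fall back to UK_CHAT_BACKEND_URL."""
    url = flag_value or os.environ.get("UK_CHAT_BACKEND_URL")
    if not url:
        raise ValueError("set UK_CHAT_BACKEND_URL or pass --backend-url")
    return url


def run_scenarios(scenario_ids, n_runs, send: Callable[[str, int], dict], concurrency=1):
    """Run every scenario n_runs times.

    Scenarios run in parallel (up to `concurrency` at once); the repeats of a
    single scenario run in order. `send(scenario_id, run_index)` stands in for
    the HTTP POST and returns one manifest-style row (hypothetical shape).
    """
    def run_one(scenario_id):
        return [send(scenario_id, i) for i in range(n_runs)]

    with ThreadPoolExecutor(max_workers=max(1, concurrency)) as pool:
        per_scenario = list(pool.map(run_one, scenario_ids))  # keeps input order
    # Flatten into a single list of rows.
    return [row for rows in per_scenario for row in rows]


if __name__ == "__main__":
    fake_send = lambda sid, i: {"scenario": sid, "run": i, "ok": True}
    for row in run_scenarios(["a1_mechanism", "b3_household_calc"], 2, fake_send, concurrency=2):
        print(row)
```

Injecting `send` keeps the scheduling logic testable without a live backend; with `concurrency=1` the pool degenerates to the sequential default.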

Fixture builder (evals/runner/build_fixtures.py)

  • Builds Test B fixtures locally by running policyengine + policyengine_uk 2.88.20 against the EFRS 2023-24 dataset under real reform IDs (83092, 94906, 94910, 94938, 94911).
  • Pulls reform JSON from PE-API /uk/policy/<id> (read-only DB endpoint, unaffected by the May 12 API outage) rather than hand-rolling reform specs.
  • Filters each candidate field against Vahid's published blog figures with a 10% drift threshold; fields that drift more than 10% are dropped rather than tested against possibly-stale ground truth. Drift decisions documented in evals/fixtures/drift_report.md.
  • Uses production-aligned version stack: policyengine 0.13.0 + policyengine_uk 2.88.20 + policyengine_core 3.26.10. Separate requirements-fixtures.txt so the runner's runtime stays minimal.
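
The 10% drift filter can be illustrated with a short sketch. The function name and the dict shapes are assumptions for illustration; the actual logic lives in evals/runner/build_fixtures.py and may differ, for example in how it treats a field with no published reference.

```python
def filter_by_drift(computed: dict, published: dict, threshold: float = 0.10):
    """Split computed fields into (kept, dropped) by relative drift.

    A field is kept when |computed - published| / |published| <= threshold.
    Fields with no usable published value are dropped, since drift cannot
    be measured for them (an assumption of this sketch).
    """
    kept, dropped = {}, {}
    for name, value in computed.items():
        ref = published.get(name)
        if ref is None or ref == 0:
            dropped[name] = None
            continue
        drift = abs(value - ref) / abs(ref)
        (kept if drift <= threshold else dropped)[name] = drift
    return kept, dropped


if __name__ == "__main__":
    # Made-up numbers, purely to show the mechanics.
    kept, dropped = filter_by_drift(
        {"field_a": 105.0, "field_b": 120.0},
        {"field_a": 100.0, "field_b": 100.0},
    )
    print("kept:", kept)
    print("dropped:", dropped)
```

Dropping a drifting field, rather than widening its tolerance, keeps the grader from scoring chat against ground truth that may itself be stale.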

Grader (evals/runner/grade.py)

  • Test A: emits a per-response markdown grading sheet with rubric prompts. Human grades; --threshold-check aggregates afterwards.
  • Test B: extracts numbers from chat prose (markdown-table-aware + line-scan fallback), diffs against fixtures with per-field tolerances, checks self-consistency SD across runs, checks anchor must_mention / must_not_say for methodology drift, and applies the SPEC's pre-committed thresholds.
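
A rough sketch of the Test B mechanics (pull a number out of prose, compare it with a tolerance, measure spread across runs) is given below. All function names and the regular expression are hypothetical, and the matching is much simpler than a markdown-table-aware extractor.

```python
import re
import statistics

# Matches figures such as "£6,486", "6486.20" or "-1.5%" (deliberately simple).
NUMBER_RE = re.compile(r"-?£?\d[\d,]*(?:\.\d+)?%?")


def parse_number(token: str) -> float:
    """Strip currency symbols, thousands separators and percent signs."""
    return float(token.replace("£", "").replace(",", "").replace("%", ""))


def extract_near_label(text: str, label: str):
    """Return the first number after `label` on the first line mentioning it."""
    needle = label.lower()
    for line in text.splitlines():
        lowered = line.lower()
        if needle in lowered:
            # Search only after the label so digits inside it are skipped.
            tail = line[lowered.index(needle) + len(needle):]
            match = NUMBER_RE.search(tail)
            if match:
                return parse_number(match.group())
    return None


def within_tolerance(actual, expected, rel_tol):
    """True when a value was extracted and lies within rel_tol of expected."""
    return actual is not None and abs(actual - expected) <= rel_tol * abs(expected)


def self_consistency_sd(values):
    """Sample SD across runs; 0.0 when fewer than two runs produced a value."""
    present = [v for v in values if v is not None]
    return statistics.stdev(present) if len(present) >= 2 else 0.0
```

A line-based scan like this works on both table rows and plain sentences, but it will miss a number that appears on a different line from its label, which is one way false negatives can arise.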

Tool-usage aggregator (evals/runner/tool_usage.py) — rolls up tool_call_counts_by_name across an entire run directory. Used for comparing PR #51 (one execution tool) vs PR #55 (typed tools registered) to see whether Claude actually picks the typed surface.
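
As an illustration of the roll-up, the sketch below sums `tool_call_counts_by_name` over a run directory. It assumes the manifest is a file named `manifest.json` holding a list of row objects; both the filename and that layout are guesses, not confirmed details of tool_usage.py.

```python
import json
from collections import Counter
from pathlib import Path


def roll_up_tool_counts(run_dir) -> Counter:
    """Sum tool_call_counts_by_name over every row of <run_dir>/manifest.json."""
    manifest = json.loads((Path(run_dir) / "manifest.json").read_text())
    total = Counter()
    for row in manifest:
        # Rows without the field (e.g. runs that failed early) contribute nothing.
        total.update(row.get("tool_call_counts_by_name", {}))
    return total
```

Calling this on two run directories and comparing the resulting counters gives a quick view of whether the typed tools displaced the generic execution tool.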

Results from first pass (evals/RESULTS-2026-05-27.md)

  • Both tests fail by the pre-committed thresholds. Full numbers in the writeup.
  • Headline: Test B field accuracy 75% (need 95%); failure rate 67% (need <10%; most B-scenario timeouts are population-level questions hitting the 600s Modal HTTP timeout). Test A mean rubric 3.09 (need 4.0); fabrication 27% (need ≤20%); 10 trust-killer scores concentrated in A3 (multi-param what-if) and A5 (factual lookup) run 2.
  • The clean win: A4 (out-of-scope refusal) scores 5.00 on every run.
  • The clearest pattern: scenarios that need EFRS microdata + free-form Python time out; scenarios that need only schedule lookups complete reliably.
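
The headline verdicts follow mechanically from the figures quoted above. The sketch below encodes them; the comparison direction for "need 95%" and "need 4.0" is assumed to be inclusive, and the authoritative thresholds are the ones in evals/SPEC.md.

```python
import operator

# (metric, observed, comparator, threshold), taken from the headline above.
CHECKS = [
    ("Test B field accuracy", 0.75, operator.ge, 0.95),
    ("Test B failure rate", 0.67, operator.lt, 0.10),
    ("Test A mean rubric", 3.09, operator.ge, 4.0),
    ("Test A fabrication rate", 0.27, operator.le, 0.20),
]


def evaluate(checks):
    """Return {metric: passed} for each pre-committed threshold."""
    return {name: compare(observed, threshold) for name, observed, compare, threshold in checks}


if __name__ == "__main__":
    for metric, passed in evaluate(CHECKS).items():
        print(f"{metric}: {'pass' if passed else 'FAIL'}")
```

Every check fails, matching the writeup's conclusion that both tests miss their thresholds.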

What's out of scope

  • Test A grading is human-only — the runner generates the grading sheet; a person fills it in. One grader's judgement; aggregate verdict (fail on all three thresholds) is robust to ±1 per dimension, per-scenario means less so.
  • B3 extractor still false-negatives on some prose-embedded numbers — flagged in the writeup.
  • The eval doesn't yet cover structured-tool variants (PR #55, "Register the three dormant typed tools (UK)"); that's the next A/B run.

How to reproduce

# Run all scenarios against PR 51's preview
UK_CHAT_BACKEND_URL="..." python evals/runner/run.py --concurrency 4

# Grade
python evals/runner/grade.py evals/runs/<timestamp>

# After human fills in A_grading.md
python evals/runner/grade.py evals/runs/<timestamp> --threshold-check

Fixtures are pre-built and committed under evals/fixtures/pe_api/ — only re-build them if you bump policyengine_uk or want to retest drift against new published figures.

🤖 Generated with Claude Code

Scenarios

Test A — supplement positioning (chat seeded with a report's context, asked open-ended follow-ups):

| ID | Title | Prompt (gist) |
| --- | --- | --- |
| a1_mechanism | Mechanism explanation | Why does the top decile gain less in % terms (0.91%) than the 8th (1.56%) or 9th (1.54%)? Walk through the mechanism. |
| a2_subset_slice | Subset breakdown not in the report | How does the PA reform affect single parents with two children specifically: decile-by-decile gains in £? |
| a3_multiparam_what_if | Multi-parameter what-if the user invented | What if we also raised the higher-rate threshold from £50,270 to £55,000 alongside the PA raise? Compare budgetary impact and progressivity. |
| a4_out_of_scope | Out-of-scope question | How would this reform affect UK inflation over the next 12 months? (Chat should refuse cleanly; PolicyEngine doesn't model macro effects.) |
| a5_factual_lookup | Historical parameter lookup | How has the UK personal allowance changed over the last 15 years? Just the figures, no analysis. |

Test B — alternative positioning (chat replicates what app-v2 reports compute):

| ID | Title | Prompt (gist) | Fixture source |
| --- | --- | --- | --- |
| b1_society_wide_pa | Society-wide PA reform: baseline replication | Run an economy-wide comparison for UK 2025: raise income tax PA from £12,570 to £15,000 on EFRS 2023-24. Report budgetary impact, decile impacts, poverty changes. | PE-API reform 83092 (Vahid blog) |
| b2_ni_it_stacked | Stacked NI + income tax reform | UK 2026-27 economy comparison for a layered reform that adds an NI surcharge layer and raises income tax; a subset of the Reeves Nov-2025 pre-Budget package. | PE-API reforms 94906, 94910, 94938 |
| b3_household_calc | Household calculation: no microdata | UK 2025 figures for a single adult, age 35, £45,000 employment income, no dependents, England. Income tax, NI, household net income, MTR. | Local policyengine_uk.Simulation |
| b4_mtr_schedule | MTR schedule: sanity check | For a single adult in 2025/26, compute combined income tax + employee NI MTR at £10k, £20k, £30k, £50k, £75k, £100k, £125k, £150k. | Local rule-driven schedule |

Dropped: b5_two_child_limit (the reform removing the two-child limit is a no-op vs policyengine_uk 2.88.20's baseline because Autumn Budget 2025 already removed it in current law). See evals/fixtures/drift_report.md.
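
For orientation, the kind of rule-driven schedule b4_mtr_schedule asks for can be approximated by hand. The sketch below hard-codes the 2025/26 income tax and employee Class 1 NI rules for England, Wales and Northern Ireland, treats NI on an annualised basis, and ignores every other tax and benefit. It is an independent back-of-envelope check, not the fixture builder and not policyengine_uk, and all names in it are hypothetical.

```python
PERSONAL_ALLOWANCE = 12_570
TAPER_START = 100_000            # allowance falls by £1 per £2 above this
BASIC_BAND = 37_700              # width of the 20% band of taxable income
ADDITIONAL_THRESHOLD = 125_140   # taxable income above this is taxed at 45%
NI_PRIMARY_THRESHOLD = 12_570    # annualised
NI_UPPER_LIMIT = 50_270          # annualised


def income_tax(gross: float) -> float:
    allowance = PERSONAL_ALLOWANCE
    if gross > TAPER_START:
        allowance = max(0.0, PERSONAL_ALLOWANCE - (gross - TAPER_START) / 2)
    taxable = max(0.0, gross - allowance)
    basic = min(taxable, BASIC_BAND)
    higher = min(max(taxable - BASIC_BAND, 0.0), ADDITIONAL_THRESHOLD - BASIC_BAND)
    additional = max(taxable - ADDITIONAL_THRESHOLD, 0.0)
    return 0.20 * basic + 0.40 * higher + 0.45 * additional


def employee_ni(gross: float) -> float:
    main = min(max(gross - NI_PRIMARY_THRESHOLD, 0.0), NI_UPPER_LIMIT - NI_PRIMARY_THRESHOLD)
    upper = max(gross - NI_UPPER_LIMIT, 0.0)
    return 0.08 * main + 0.02 * upper


def combined_mtr(gross: float, delta: float = 1.0) -> float:
    """Marginal rate on the next `delta` pounds of employment income."""
    before = income_tax(gross) + employee_ni(gross)
    after = income_tax(gross + delta) + employee_ni(gross + delta)
    return round((after - before) / delta, 4)


if __name__ == "__main__":
    for gross in (10_000, 20_000, 30_000, 50_000, 75_000, 100_000, 125_000, 150_000):
        print(f"£{gross:>7,}: {combined_mtr(gross):.0%}")
```

Under these assumptions the sketch gives 0%, 28%, 28%, 28%, 42%, 62%, 62% and 47% at the eight income points. The 62% rows come from the personal allowance taper between £100,000 and £125,140; the committed fixture values remain the reference for grading.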


Moves the eval design doc into the repo as evals/SPEC.md and lays out
the directory structure the harness will use. Ten hand-authored
scenarios are included as YAML — five Test A (chat as supplement) and
five Test B (chat as alternative). Each scenario covers a distinct
question shape and stress-tests a specific failure mode.

No runner yet — that's the next PR. This PR is just the data and
schema. See evals/README.md for layout and evals/SPEC.md for design,
thresholds, and roadmap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes:

1. Replace the two made-up B scenarios (B2 PA+HRT, B5 Scotland) with
   ones drawn from Vahid Ahmadi's published UK analyses:
   - B2 → stacked NI/IT/threshold-freeze reform (Nov 2025 pre-Budget)
     with reference figures from uk-income-tax-ni-reforms-2025.md
   - B5 → remove the two-child benefit limit (Autumn Budget 2025)
     with reference figures from uk-two-child-limit.md

   This shifts Test B from "does chat match a one-off API call I made"
   to "does chat reproduce PolicyEngine's published analyses" — a much
   stronger framing.

2. Add `anchor` blocks to every scenario. Anchors carry:
   - must_mention: phrases a good answer must include
   - must_not_say: claims that would be wrong
   - ideal_explanation / ideal_finding: prose sketch the grader uses

   In v1, anchors are human-grader aids. In v2 they become inputs to
   an automated LLM-judge.

Per-scenario anchor sourcing documented in SPEC.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
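
A minimal sketch of how an `anchor` block could be checked mechanically is shown below. It assumes plain case-insensitive substring matching and a dict-shaped anchor; the phrases in the example are invented for illustration and are not taken from the scenario files.

```python
def check_anchor(response: str, anchor: dict) -> dict:
    """Report must_mention phrases that are absent and must_not_say phrases present."""
    text = response.lower()
    missing = [p for p in anchor.get("must_mention", []) if p.lower() not in text]
    forbidden = [p for p in anchor.get("must_not_say", []) if p.lower() in text]
    return {
        "missing": missing,
        "forbidden": forbidden,
        "passed": not missing and not forbidden,
    }


if __name__ == "__main__":
    # Hypothetical anchor for an out-of-scope question.
    anchor = {
        "must_mention": ["does not model"],
        "must_not_say": ["inflation will rise"],
    }
    reply = "PolicyEngine does not model macroeconomic effects such as inflation."
    print(check_anchor(reply, anchor))
```

Substring matching is brittle against paraphrase, which fits the commit's framing of anchors as human-grader aids in v1.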
- build_fixtures.py: fetches PE-API responses for B1/B2/B3/B5 and computes
  B4 locally via policyengine_uk (PE-API has no MTR endpoint). Output JSONs
  are committed so the grader doesn't refetch on every run.
- Generated fixtures for B3 (household calc) and B4 (MTR schedule).
- grade.py: split scalar vs list-of-dicts field comparison. List shape uses
  `key_by` (row identifier) + `compare` (field to diff). Adds a per-row
  extractor that locates the key in chat prose and pulls the nearby
  percentage.
- b4_mtr_schedule.yaml: switch fields_to_compare to the new shape so the
  grader diffs combined_mtr per gross-income row.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
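
The list-of-dicts comparison can be pictured as follows. `key_by` and `compare` mirror the field names in the commit message; the function name, the `gross_income` key, the absolute tolerance, and the result shape are assumptions of this sketch rather than grade.py's actual interface.

```python
def compare_rows(expected_rows, extracted_rows, key_by, compare, abs_tol):
    """Diff one field per row, matching rows by their key_by value."""
    extracted = {row[key_by]: row.get(compare) for row in extracted_rows}
    results = []
    for row in expected_rows:
        key = row[key_by]
        got = extracted.get(key)  # None when chat never reported this row
        match = got is not None and abs(got - row[compare]) <= abs_tol
        results.append({"key": key, "expected": row[compare], "got": got, "match": match})
    return results


if __name__ == "__main__":
    # Illustrative rows only; not the committed fixture.
    fixture = [
        {"gross_income": 20_000, "combined_mtr": 0.28},
        {"gross_income": 75_000, "combined_mtr": 0.42},
    ]
    from_chat = [{"gross_income": 20_000, "combined_mtr": 0.28}]
    for result in compare_rows(fixture, from_chat, "gross_income", "combined_mtr", 0.005):
        print(result)
```

Keying by row identifier, rather than by position, means a response that lists rows in a different order or omits one is still graded row by row.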
- summarise_events() now extracts tool_call_sequence,
  tool_call_counts_by_name, and tool_failure_count from the SSE stream.
- run_all() surfaces these in each manifest row so you don't have to
  re-read per-run meta.json to see what Claude called.
- New tool_usage.py prints a per-scenario tool-routing table from a
  finished run's manifest. Accepts one or more run dirs for A/B
  comparison.

The point: when we register a new typed tool (calculate_household etc.),
we need to see whether Claude actually picked it vs falling back to
run_python. Reading 60 SSE logs by hand doesn't scale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
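
To make the extraction concrete, here is a sketch of pulling the three fields out of a raw SSE capture. The event schema is assumed: JSON payloads on `data:` lines with a `type` of `tool_call` or `tool_result`, a `name`, and an `is_error` flag. The backend's real event format is not shown in this PR, so treat every field name below other than the three output keys as hypothetical.

```python
import json
from collections import Counter


def summarise_events_sketch(sse_text: str) -> dict:
    """Derive tool-call fields from an SSE capture (hypothetical event schema)."""
    sequence, failures = [], 0
    for line in sse_text.splitlines():
        if not line.startswith("data:"):
            continue  # skip event names, comments and blank separators
        payload = line[len("data:"):].strip()
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # e.g. a non-JSON sentinel or a truncated frame
        if not isinstance(event, dict):
            continue
        if event.get("type") == "tool_call":
            sequence.append(event.get("name", "unknown"))
        elif event.get("type") == "tool_result" and event.get("is_error"):
            failures += 1
    return {
        "tool_call_sequence": sequence,
        "tool_call_counts_by_name": dict(Counter(sequence)),
        "tool_failure_count": failures,
    }
```

Keeping the ordered sequence alongside the counts shows not only which tool was used but whether a typed tool was tried first and abandoned for `run_python`.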