Add eval harness scaffold: spec, scenarios, fixtures dir by SakshiKekre · Pull Request #52 · PolicyEngine/policyengine-uk-chat

SakshiKekre · 2026-05-15T18:23:16Z

Summary

End-to-end eval harness for uk-chat: scenarios, fixtures, runner, grader, and a first run's results. Originally scoped as just the scaffold and scenarios, it grew to cover the full pipeline because the follow-up PRs would have been small and the harness only earns its keep once you can actually run it.

Pre-committed thresholds in evals/SPEC.md; first eval pass against PR #51's preview written up in evals/RESULTS-2026-05-27.md.

What's in the PR

Spec + scenarios (evals/SPEC.md, evals/scenarios/)

Two-test design: Test A (open-ended, rubric-graded, supplement positioning) vs Test B (numeric, fixture-graded, alternative positioning).
9 hand-authored scenarios as YAML: 5 Test A + 4 Test B. B5 dropped — the reform (two-child limit removal) is a no-op vs policyengine_uk 2.88.20's baseline because Autumn Budget 2025 already removed it in current law. Drop documented in evals/fixtures/drift_report.md.

Runner (evals/runner/run.py)

POSTs each scenario N times to a configured chat backend, saves raw SSE + extracted text + manifest JSON per run.
--concurrency N for parallel scenarios (default sequential).
Captures tool_call_sequence, tool_call_counts_by_name, tool_failure_count, model_backend per run — needed for A/B'ing tool-registration changes.
Configurable via UK_CHAT_BACKEND_URL env or --backend-url flag; supports Vercel protection bypass.
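
As a rough illustration of the fan-out described above, here is a minimal sketch of running scenario jobs with a bounded worker pool. It is not the code in evals/runner/run.py: `run_jobs`, `fake_post`, and the manifest row shape are hypothetical, and the sketch parallelises every (scenario, run) pair, which may differ from how `--concurrency` actually schedules work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_jobs(
    scenario_ids: list[str],
    n_runs: int,
    post_fn: Callable[[str, int], dict],
    concurrency: int = 1,
) -> list[dict]:
    """Run every (scenario, run_index) pair and return rows in input order."""
    jobs = [(sid, i) for sid in scenario_ids for i in range(n_runs)]
    # concurrency=1 degenerates to sequential execution, matching the default.
    with ThreadPoolExecutor(max_workers=max(1, concurrency)) as pool:
        # pool.map preserves job order even when calls finish out of order.
        return list(pool.map(lambda job: post_fn(*job), jobs))


def fake_post(scenario_id: str, run_index: int) -> dict:
    """Hypothetical stand-in for the HTTP POST that would stream SSE."""
    return {"scenario": scenario_id, "run": run_index, "ok": True}


rows = run_jobs(
    ["a1_mechanism", "b3_household_calc"],
    n_runs=2,
    post_fn=fake_post,
    concurrency=4,
)
```

Threads suit this workload because each job spends most of its time waiting on the backend rather than computing.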

Fixture builder (evals/runner/build_fixtures.py)

Builds Test B fixtures locally by running policyengine + policyengine_uk 2.88.20 against the EFRS 2023-24 dataset under real reform IDs (83092, 94906, 94910, 94938, 94911).
Pulls reform JSON from PE-API /uk/policy/<id> (read-only DB endpoint, unaffected by the May 12 API outage) rather than hand-rolling reform specs.
Filters each candidate field against Vahid's published blog figures with a 10% drift threshold; fields that drift more than 10% are dropped rather than tested against possibly-stale ground truth. Drift decisions documented in evals/fixtures/drift_report.md.
Uses production-aligned version stack: policyengine 0.13.0 + policyengine_uk 2.88.20 + policyengine_core 3.26.10. Separate requirements-fixtures.txt so the runner's runtime stays minimal.

Grader (evals/runner/grade.py)

Test A: emits a per-response markdown grading sheet with rubric prompts. Human grades; --threshold-check aggregates afterwards.
Test B: extracts numbers from chat prose (markdown-table-aware + line-scan fallback), diffs against fixtures with per-field tolerances, checks self-consistency SD across runs, checks anchor must_mention / must_not_say for methodology drift, and applies the SPEC's pre-committed thresholds.
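
To make the Test B per-field checks concrete, here is a minimal sketch of a tolerance diff plus a self-consistency check across runs. The function names, the use of relative tolerance, the population standard deviation, and the example values are all assumptions; the actual grade.py may define these differently.

```python
from statistics import pstdev
from typing import Optional


def within_tolerance(extracted: float, expected: float, rel_tol: float) -> bool:
    """True when the extracted value is within rel_tol of the fixture value."""
    if expected == 0:
        return abs(extracted) <= rel_tol
    return abs(extracted - expected) / abs(expected) <= rel_tol


def grade_field(
    run_values: list[Optional[float]],
    expected: float,
    rel_tol: float,
    max_sd: float,
) -> dict:
    """Score one field across N runs; None means the number was not found."""
    found = [v for v in run_values if v is not None]
    hits = sum(within_tolerance(v, expected, rel_tol) for v in found)
    # Runs with no extracted value count as misses in the denominator.
    accuracy = hits / len(run_values) if run_values else 0.0
    sd = pstdev(found) if len(found) >= 2 else 0.0
    return {"accuracy": accuracy, "sd": sd, "self_consistent": sd <= max_sd}


# Illustrative values only: three runs, one of which yielded no number.
result = grade_field([7000.0, 7010.0, None], expected=7000.0, rel_tol=0.01, max_sd=50.0)
```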

Tool-usage aggregator (evals/runner/tool_usage.py) — rolls up tool_call_counts_by_name across an entire run directory. Used for comparing PR #51 (one execution tool) vs PR #55 (typed tools registered) to see whether Claude actually picks the typed surface.
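
A roll-up of this kind can be sketched with a `Counter`. The manifest row shape below is an assumption based on the field names in this PR, and `rollup_tool_counts` is a hypothetical name, not the function in tool_usage.py.

```python
from collections import Counter


def rollup_tool_counts(manifest_rows: list[dict]) -> Counter:
    """Sum tool_call_counts_by_name across every run in a manifest."""
    total: Counter = Counter()
    for row in manifest_rows:
        # Rows without tool calls contribute nothing.
        total.update(row.get("tool_call_counts_by_name", {}))
    return total


# Illustrative manifest rows; the counts are made up.
manifest = [
    {"scenario": "b3_household_calc", "tool_call_counts_by_name": {"run_python": 2}},
    {"scenario": "b3_household_calc", "tool_call_counts_by_name": {"calculate_household": 1, "run_python": 1}},
    {"scenario": "a4_out_of_scope"},
]
totals = rollup_tool_counts(manifest)
```

Running the same roll-up over two run directories and comparing the totals would show whether the typed tools are being picked over `run_python`.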

Results from first pass (evals/RESULTS-2026-05-27.md)

Both tests fail by the pre-committed thresholds. Full numbers in the writeup.
Headline: Test B field accuracy is 75% (need 95%) and failure rate is 67% (need <10%); most B-scenario timeouts are population-level questions hitting the 600s Modal HTTP timeout. Test A mean rubric is 3.09 (need 4.0) and fabrication is 27% (need ≤20%), with 10 trust-killer scores concentrated in A3 (multi-param what-if) and A5 (factual lookup) run 2.
The clean win: A4 (out-of-scope refusal) scores 5.00 on every run.
The clearest pattern: scenarios that need EFRS microdata + free-form Python time out; scenarios that need only schedule lookups complete reliably.
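
The headline verdicts follow mechanically from the observed numbers and the thresholds quoted above. The sketch below shows that comparison. The metric names are hypothetical, the "need 95%" and "need 4.0" thresholds are assumed to mean "at least", and the trust-killer check is omitted because its threshold is not stated here.

```python
import operator

# (metric, observed, comparison, required), taken from the headline figures.
CHECKS = [
    ("test_b_field_accuracy", 0.75, operator.ge, 0.95),
    ("test_b_failure_rate", 0.67, operator.lt, 0.10),
    ("test_a_mean_rubric", 3.09, operator.ge, 4.0),
    ("test_a_fabrication_rate", 0.27, operator.le, 0.20),
]

# True means the metric meets its pre-committed threshold.
verdicts = {name: op(observed, required) for name, observed, op, required in CHECKS}

for name, passed in verdicts.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```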

What's out of scope

Test A grading is human-only: the runner generates the grading sheet and a person fills it in. The scores reflect one grader's judgement; the aggregate verdict (fail on all three thresholds) is robust to ±1 per dimension, but per-scenario means are less so.
B3 extractor still false-negatives on some prose-embedded numbers — flagged in the writeup.
The eval doesn't yet cover structured-tool variants (PR #55, "Register the three dormant typed tools (UK)"); that's the next A/B run.

How to reproduce

```bash
# Run all scenarios against PR 51's preview
UK_CHAT_BACKEND_URL="..." python evals/runner/run.py --concurrency 4

# Grade
python evals/runner/grade.py evals/runs/<timestamp>

# After human fills in A_grading.md
python evals/runner/grade.py evals/runs/<timestamp> --threshold-check
```

Fixtures are pre-built and committed under evals/fixtures/pe_api/ — only re-build them if you bump policyengine_uk or want to retest drift against new published figures.

🤖 Generated with Claude Code

Scenarios

Test A — supplement positioning (chat seeded with a report's context, asked open-ended follow-ups):

| ID | Title | Prompt (gist) |
|---|---|---|
| `a1_mechanism` | Mechanism explanation | Why does the top decile gain less in % terms (0.91%) than the 8th (1.56%) or 9th (1.54%)? Walk through the mechanism. |
| `a2_subset_slice` | Subset breakdown not in the report | How does the PA reform affect single parents with two children specifically: decile-by-decile gains in £? |
| `a3_multiparam_what_if` | Multi-parameter what-if the user invented | What if we also raised the higher-rate threshold from £50,270 to £55,000 alongside the PA raise? Compare budgetary impact and progressivity. |
| `a4_out_of_scope` | Out-of-scope question | How would this reform affect UK inflation over the next 12 months? (Chat should refuse cleanly; PolicyEngine doesn't model macro effects.) |
| `a5_factual_lookup` | Historical parameter lookup | How has the UK personal allowance changed over the last 15 years? Just the figures, no analysis. |

Test B — alternative positioning (chat replicates what app-v2 reports compute):

| ID | Title | Prompt (gist) | Fixture source |
|---|---|---|---|
| `b1_society_wide_pa` | Society-wide PA reform: baseline replication | Run an economy-wide comparison for UK 2025: raise income tax PA from £12,570 to £15,000 on EFRS 2023-24. Report budgetary impact, decile impacts, poverty changes. | PE-API reform 83092 (Vahid blog) |
| `b2_ni_it_stacked` | Stacked NI + income tax reform | UK 2026-27 economy comparison for a layered reform that adds an NI surcharge layer and raises income tax (a subset of the Reeves Nov-2025 pre-Budget package). | PE-API reforms 94906, 94910, 94938 |
| `b3_household_calc` | Household calculation: no microdata | UK 2025 figures for a single adult, age 35, £45,000 employment income, no dependents, England. Income tax, NI, household net income, MTR. | Local `policyengine_uk.Simulation` |
| `b4_mtr_schedule` | MTR schedule: sanity check | For a single adult in 2025/26, compute combined income tax + employee NI MTR at £10k, £20k, £30k, £50k, £75k, £100k, £125k, £150k. | Local rule-driven schedule |

Dropped: b5_two_child_limit (the reform removing the two-child limit is a no-op vs policyengine_uk 2.88.20's baseline because Autumn Budget 2025 already removed it in current law). See evals/fixtures/drift_report.md.

vercel · 2026-05-15T18:23:22Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
policyengine-uk-chat	Ready	Preview, Comment	May 27, 2026 3:53pm

github-actions · 2026-05-15T18:23:45Z

Beta preview is ready.

Frontend: open preview
Backend: open backend

Moves the eval design doc into the repo as evals/SPEC.md and lays out the directory structure the harness will use.

Ten hand-authored scenarios are included as YAML: five Test A (chat as supplement) and five Test B (chat as alternative). Each scenario covers a distinct question shape and stress-tests a specific failure mode.

No runner yet; that's the next PR. This PR is just the data and schema.

See evals/README.md for layout and evals/SPEC.md for design, thresholds, and roadmap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two changes:

1. Replace the two made-up B scenarios (B2 PA+HRT, B5 Scotland) with ones drawn from Vahid Ahmadi's published UK analyses:
   - B2 → stacked NI/IT/threshold-freeze reform (Nov 2025 pre-Budget) with reference figures from uk-income-tax-ni-reforms-2025.md
   - B5 → remove the two-child benefit limit (Autumn Budget 2025) with reference figures from uk-two-child-limit.md

   This shifts Test B from "does chat match a one-off API call I made" to "does chat reproduce PolicyEngine's published analyses", a much stronger framing.

2. Add `anchor` blocks to every scenario. Anchors carry:
   - must_mention: phrases a good answer must include
   - must_not_say: claims that would be wrong
   - ideal_explanation / ideal_finding: prose sketch the grader uses

   In v1, anchors are human-grader aids. In v2 they become inputs to an automated LLM-judge. Per-scenario anchor sourcing documented in SPEC.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
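
An anchor check over `must_mention` and `must_not_say` can be sketched as a pair of substring scans. This is a simplification under stated assumptions: `check_anchor` is a hypothetical name, matching is case-insensitive substring matching, and the example phrases are invented rather than taken from the scenario YAML.

```python
def check_anchor(response: str, must_mention: list[str], must_not_say: list[str]) -> dict:
    """Report required phrases that are missing and forbidden phrases that appear."""
    text = response.casefold()
    missing = [p for p in must_mention if p.casefold() not in text]
    violations = [p for p in must_not_say if p.casefold() in text]
    return {
        "missing": missing,
        "violations": violations,
        "passed": not missing and not violations,
    }


# Invented phrases for an out-of-scope refusal, for illustration only.
report = check_anchor(
    "PolicyEngine does not model macroeconomic effects such as inflation.",
    must_mention=["does not model"],
    must_not_say=["inflation will rise"],
)
```

Plain substring matching is brittle against paraphrase, which is one reason a human grader or an LLM judge would still need to read the response.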

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- build_fixtures.py: fetches PE-API responses for B1/B2/B3/B5 and computes B4 locally via policyengine_uk (PE-API has no MTR endpoint). Output JSONs are committed so the grader doesn't refetch on every run.
- Generated fixtures for B3 (household calc) and B4 (MTR schedule).
- grade.py: split scalar vs list-of-dicts field comparison. List shape uses `key_by` (row identifier) + `compare` (field to diff). Adds a per-row extractor that locates the key in chat prose and pulls the nearby percentage.
- b4_mtr_schedule.yaml: switch fields_to_compare to the new shape so the grader diffs combined_mtr per gross-income row.
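
The `key_by` / `compare` idea can be illustrated with a small row-wise diff against a markdown table in the chat output. Everything here is a sketch under assumptions: the helper names, the percent-valued fixture rows, the absolute tolerance, and the sample values are illustrative and not taken from grade.py or the committed fixtures. Shorthand such as "£10k" is not handled.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
PERCENT = re.compile(r"(-?\d+(?:\.\d+)?)\s*%")


def table_rows(text: str):
    """Yield the cell lists of markdown table rows, skipping separator rows."""
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(set(c) <= set("-: ") for c in cells):
            continue  # the |---|---| separator row
        yield cells


def cell_number(cell: str):
    """Parse the first number in a cell, ignoring currency signs and commas."""
    m = NUMBER.search(cell)
    return float(m.group().replace(",", "")) if m else None


def percent_for_key(text: str, key: float):
    """Find the table row containing the key and return its first percentage."""
    for cells in table_rows(text):
        if any(cell_number(c) == key for c in cells if "%" not in c):
            for c in cells:
                m = PERCENT.search(c)
                if m:
                    return float(m.group(1))
    return None


def diff_rows(fixture_rows, chat_text, key_by, compare, abs_tol):
    """Compare each fixture row's `compare` field with the chat's value."""
    results = []
    for row in fixture_rows:
        got = percent_for_key(chat_text, row[key_by])
        ok = got is not None and abs(got - row[compare]) <= abs_tol
        results.append({key_by: row[key_by], "expected": row[compare], "got": got, "ok": ok})
    return results


chat = """Here are the rates:

| Gross income | Combined MTR |
|---|---|
| £20,000 | 28% |
| £75,000 | 42% |
"""
fixture = [
    {"gross_income": 20000, "combined_mtr": 28.0},
    {"gross_income": 75000, "combined_mtr": 42.0},
    {"gross_income": 100000, "combined_mtr": 62.0},
]
results = diff_rows(fixture, chat, key_by="gross_income", compare="combined_mtr", abs_tol=0.5)
```

A fixture row with no matching table row is reported as a miss rather than skipped, so an incomplete answer lowers the score.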

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- summarise_events() now extracts tool_call_sequence, tool_call_counts_by_name, and tool_failure_count from the SSE stream.
- run_all() surfaces these in each manifest row so you don't have to re-read per-run meta.json to see what Claude called.
- New tool_usage.py prints a per-scenario tool-routing table from a finished run's manifest. Accepts one or more run dirs for A/B comparison.

The point: when we register a new typed tool (calculate_household etc.), we need to see whether Claude actually picked it vs falling back to run_python. Reading 60 SSE logs by hand doesn't scale.
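
For readers unfamiliar with the shape of this extraction, here is a minimal sketch of pulling the three fields out of a raw SSE capture. The event schema is an assumption: the backend's real events are not shown in this PR, so the `type`, `name`, and `is_error` keys below are hypothetical, as is the function name.

```python
import json
from collections import Counter


def summarise_events_sketch(raw_sse: str) -> dict:
    """Extract tool-call statistics from 'data: {...}' lines of an SSE stream."""
    sequence: list[str] = []
    failures = 0
    for line in raw_sse.splitlines():
        if not line.startswith("data:"):
            continue  # skip event names, comments, and blank separators
        payload = line[len("data:"):].strip()
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # e.g. a non-JSON sentinel or keep-alive
        if not isinstance(event, dict):
            continue
        if event.get("type") == "tool_call":
            sequence.append(event.get("name", "unknown"))
        elif event.get("type") == "tool_result" and event.get("is_error"):
            failures += 1
    return {
        "tool_call_sequence": sequence,
        "tool_call_counts_by_name": dict(Counter(sequence)),
        "tool_failure_count": failures,
    }


# Hypothetical stream using the assumed schema.
sample = "\n".join([
    'data: {"type": "tool_call", "name": "run_python"}',
    'data: {"type": "tool_result", "name": "run_python", "is_error": true}',
    'data: {"type": "tool_call", "name": "calculate_household"}',
    'data: {"type": "tool_call", "name": "run_python"}',
    "data: [DONE]",
])
summary = summarise_events_sketch(sample)
```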

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 15, 2026 18:23 View deployment

SakshiKekre force-pushed the feat/eval-harness branch from 86af2b0 to 765afb5 Compare May 15, 2026 18:51

vercel Bot deployed to Preview May 15, 2026 18:53 View deployment

vercel Bot deployed to Preview May 19, 2026 19:57 View deployment

Add eval runner

5fac830

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 20, 2026 12:53 View deployment

Add eval grader

86dec72

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 20, 2026 14:30 View deployment

vercel Bot deployed to Preview May 21, 2026 22:39 View deployment

SakshiKekre mentioned this pull request May 27, 2026

WIP: US Python backend (latency spike — do not merge) #54

Open

5 tasks

Generate Test B fixtures via local policyengine, filter by 10% drift

dd768db

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 27, 2026 13:01 View deployment

Grader: parse markdown tables for list-of-dicts extraction

3ea1f80

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 27, 2026 13:27 View deployment

Runner: add --concurrency for parallel scenario runs

2161c99

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 27, 2026 15:40 View deployment

vercel Bot deployed to Preview May 27, 2026 15:43 View deployment

SakshiKekre mentioned this pull request May 27, 2026

Register the three dormant typed tools (UK) #55

Draft

Add 2026-05-27 eval results writeup

5c715fb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 27, 2026 15:50 View deployment

Add Test A grading aggregates to writeup

3c11d7b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 27, 2026 15:53 View deployment

SakshiKekre mentioned this pull request May 27, 2026

Add UK chat integration (drawer on reports + standalone page) PolicyEngine/policyengine-app-v2#1036

Open

5 tasks

SakshiKekre wants to merge 11 commits into main from feat/eval-harness
