Add eval harness scaffold: spec, scenarios, fixtures dir #52
Open
SakshiKekre wants to merge 11 commits into
Moves the eval design doc into the repo as evals/SPEC.md and lays out the directory structure the harness will use.

Ten hand-authored scenarios are included as YAML: five Test A (chat as supplement) and five Test B (chat as alternative). Each scenario covers a distinct question shape and stress-tests a specific failure mode.

No runner yet; that's the next PR. This PR is just the data and schema.

See evals/README.md for layout and evals/SPEC.md for design, thresholds, and roadmap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
86af2b0 to 765afb5
Two changes:
1. Replace the two made-up B scenarios (B2 PA+HRT, B5 Scotland) with
ones drawn from Vahid Ahmadi's published UK analyses:
- B2 → stacked NI/IT/threshold-freeze reform (Nov 2025 pre-Budget)
with reference figures from uk-income-tax-ni-reforms-2025.md
- B5 → remove the two-child benefit limit (Autumn Budget 2025)
with reference figures from uk-two-child-limit.md
This shifts Test B from "does chat match a one-off API call I made"
to "does chat reproduce PolicyEngine's published analyses" — a much
stronger framing.
2. Add `anchor` blocks to every scenario. Anchors carry:
- must_mention: phrases a good answer must include
- must_not_say: claims that would be wrong
- ideal_explanation / ideal_finding: prose sketch the grader uses
In v1, anchors are human-grader aids. In v2 they become inputs to
an automated LLM-judge.
Per-scenario anchor sourcing documented in SPEC.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
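
The `anchor` blocks described in this commit lend themselves to a simple mechanical pre-check before a human (or, later, an LLM judge) reads the answer. The following is a minimal sketch, not the harness's actual code: the `check_anchor` function, the `AnchorResult` type, and the example phrases are all hypothetical, and it assumes `must_mention` and `must_not_say` are plain lists of strings matched case-insensitively as substrings.

```python
from dataclasses import dataclass, field


@dataclass
class AnchorResult:
    missing: list[str] = field(default_factory=list)    # must_mention phrases not found
    forbidden: list[str] = field(default_factory=list)  # must_not_say phrases that appeared

    @property
    def passed(self) -> bool:
        return not self.missing and not self.forbidden


def _normalise(text: str) -> str:
    """Lower-case and collapse whitespace so line wraps do not break matching."""
    return " ".join(text.lower().split())


def check_anchor(answer: str, anchor: dict) -> AnchorResult:
    """Substring check of a chat answer against a scenario's anchor block."""
    text = _normalise(answer)
    result = AnchorResult()
    for phrase in anchor.get("must_mention", []):
        if _normalise(phrase) not in text:
            result.missing.append(phrase)
    for phrase in anchor.get("must_not_say", []):
        if _normalise(phrase) in text:
            result.forbidden.append(phrase)
    return result


# Illustrative anchor and answer only; not taken from any real scenario file.
anchor = {
    "must_mention": ["two-child limit", "Universal Credit"],
    "must_not_say": ["no fiscal cost"],
}
answer = "Removing the two-child limit raises Universal Credit awards for larger families."
result = check_anchor(answer, anchor)
print(result.passed, result.missing, result.forbidden)
```

Substring matching is deliberately crude: it will miss paraphrases and cannot tell a negated claim from an asserted one, which is presumably why the commit treats anchors as grader aids in v1 rather than as a verdict.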
- build_fixtures.py: fetches PE-API responses for B1/B2/B3/B5 and computes B4 locally via policyengine_uk (PE-API has no MTR endpoint). Output JSONs are committed so the grader doesn't refetch on every run.
- Generated fixtures for B3 (household calc) and B4 (MTR schedule).
- grade.py: split scalar vs list-of-dicts field comparison. List shape uses `key_by` (row identifier) + `compare` (field to diff). Adds a per-row extractor that locates the key in chat prose and pulls the nearby percentage.
- b4_mtr_schedule.yaml: switch fields_to_compare to the new shape so the grader diffs combined_mtr per gross-income row.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- summarise_events() now extracts tool_call_sequence, tool_call_counts_by_name, and tool_failure_count from the SSE stream.
- run_all() surfaces these in each manifest row so you don't have to re-read per-run meta.json to see what Claude called.
- New tool_usage.py prints a per-scenario tool-routing table from a finished run's manifest. Accepts one or more run dirs for A/B comparison.

The point: when we register a new typed tool (calculate_household etc.), we need to see whether Claude actually picked it vs falling back to run_python. Reading 60 SSE logs by hand doesn't scale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
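
As a rough illustration of the tool-usage summary this commit describes, the sketch below derives the three fields from a list of already-parsed stream events and then rolls them up per scenario. The event shape (`type`, `name`, `is_error`), the manifest row keys, and the function names `summarise_tool_events` and `routing_table` are assumptions made for the example; the real SSE payloads and manifest layout may differ.

```python
from collections import Counter


def summarise_tool_events(events: list[dict]) -> dict:
    """Collect the order, per-name counts, and failure count of tool calls in one run."""
    sequence: list[str] = []
    failures = 0
    for event in events:
        if event.get("type") == "tool_call":        # assumed event type name
            sequence.append(event["name"])
        elif event.get("type") == "tool_result" and event.get("is_error"):
            failures += 1
    return {
        "tool_call_sequence": sequence,
        "tool_call_counts_by_name": dict(Counter(sequence)),
        "tool_failure_count": failures,
    }


def routing_table(manifest_rows: list[dict]) -> dict[str, Counter]:
    """Sum tool-call counts per scenario across all manifest rows of a run."""
    table: dict[str, Counter] = {}
    for row in manifest_rows:
        table.setdefault(row["scenario"], Counter()).update(row["tool_call_counts_by_name"])
    return table


# Hypothetical events for a single run.
events = [
    {"type": "tool_call", "name": "calculate_household"},
    {"type": "tool_result", "is_error": True},
    {"type": "tool_call", "name": "run_python"},
    {"type": "tool_result", "is_error": False},
]
summary = summarise_tool_events(events)
table = routing_table([{"scenario": "b3_household_calc", **summary}])
print(summary)
print(table)
```

Comparing two such tables, one per run directory, is enough to see whether a newly registered typed tool displaced `run_python` for a given scenario.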
Summary
End-to-end eval harness for uk-chat: scenarios, fixtures, runner, grader, and a first run's results. Originally scoped as just the scaffold + scenarios — grew to cover the full pipeline because the follow-up PRs would have been small and the harness only earns its keep once you can actually run it.
Pre-committed thresholds in `evals/SPEC.md`; first eval pass against PR #51's preview written up in `evals/RESULTS-2026-05-27.md`.

What's in the PR

Spec + scenarios (`evals/SPEC.md`, `evals/scenarios/`)
- `b5_two_child_limit` dropped: the reform removing the two-child limit is a no-op vs `policyengine_uk 2.88.20`'s baseline because Autumn Budget 2025 already removed it in current law. Drop documented in `evals/fixtures/drift_report.md`.

Runner (`evals/runner/run.py`)
- `--concurrency N` for parallel scenarios (default sequential).
- `tool_call_sequence`, `tool_call_counts_by_name`, `tool_failure_count`, `model_backend` per run; needed for A/B'ing tool-registration changes.
- `UK_CHAT_BACKEND_URL` env or `--backend-url` flag; supports Vercel protection bypass.

Fixture builder (`evals/runner/build_fixtures.py`)
- `policyengine` + `policyengine_uk 2.88.20` against the EFRS 2023-24 dataset under real reform IDs (83092, 94906, 94910, 94938, 94911).
- `/uk/policy/<id>` (read-only DB endpoint, unaffected by the May 12 API outage) rather than hand-rolling reform specs.
- `evals/fixtures/drift_report.md`.
- `policyengine 0.13.0` + `policyengine_uk 2.88.20` + `policyengine_core 3.26.10`. Separate `requirements-fixtures.txt` so the runner's runtime stays minimal.

Grader (`evals/runner/grade.py`)
- `--threshold-check` aggregates afterwards.
- `must_mention` / `must_not_say` for methodology drift, and applies the SPEC's pre-committed thresholds.

Tool-usage aggregator (`evals/runner/tool_usage.py`): rolls up `tool_call_counts_by_name` across an entire run directory. Used for comparing PR #51 (one execution tool) vs PR #55 (typed tools registered) to see whether Claude actually picks the typed surface.

Results from first pass (`evals/RESULTS-2026-05-27.md`)

What's out of scope

How to reproduce

Fixtures are pre-built and committed under `evals/fixtures/pe_api/`; only re-build them if you bump `policyengine_uk` or want to retest drift against new published figures.

🤖 Generated with Claude Code
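
For readers unfamiliar with how the runner options listed above could fit together, here is a hedged sketch of a `--concurrency N` switch with a sequential default and a backend URL taken from either `--backend-url` or `UK_CHAT_BACKEND_URL`. It is not the code in `evals/runner/run.py`: the helper names, the use of a thread pool, and the choice to let the flag take precedence over the environment variable are assumptions for illustration.

```python
import argparse
import os
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run eval scenarios (sketch)")
    parser.add_argument("--concurrency", type=int, default=1,
                        help="number of scenarios to run in parallel (1 = sequential)")
    parser.add_argument("--backend-url", default=None,
                        help="chat backend URL; falls back to UK_CHAT_BACKEND_URL")
    return parser.parse_args(argv)


def resolve_backend_url(args, env=None):
    """Assumed precedence: explicit flag first, then the environment variable."""
    env = os.environ if env is None else env
    url = args.backend_url or env.get("UK_CHAT_BACKEND_URL")
    if not url:
        raise SystemExit("Set UK_CHAT_BACKEND_URL or pass --backend-url")
    return url.rstrip("/")


def run_scenarios(scenario_ids, run_one, concurrency=1):
    """Run scenarios sequentially by default, or in a thread pool when concurrency > 1."""
    if concurrency <= 1:
        return [run_one(s) for s in scenario_ids]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(run_one, scenario_ids))  # map preserves input order


# Placeholder URL and a stand-in for the real per-scenario request.
args = parse_args(["--concurrency", "2", "--backend-url", "https://example.invalid/"])
backend = resolve_backend_url(args, env={})
results = run_scenarios(["a1_mechanism", "b4_mtr_schedule"],
                        lambda scenario: (scenario, backend),
                        concurrency=args.concurrency)
print(results)
```

Threads suit this kind of workload because each scenario spends most of its time waiting on a streamed HTTP response; keeping the default at 1 makes runs easy to reproduce and debug.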
Scenarios
Test A (supplement positioning: chat seeded with a report's context, asked open-ended follow-ups):
- `a1_mechanism`
- `a2_subset_slice`
- `a3_multiparam_what_if`
- `a4_out_of_scope`
- `a5_factual_lookup`

Test B (alternative positioning: chat replicates what app-v2 reports compute):
- `b1_society_wide_pa`
- `b2_ni_it_stacked`
- `b3_household_calc`: `policyengine_uk.Simulation`
- `b4_mtr_schedule`

Dropped: `b5_two_child_limit` (the reform removing the two-child limit is a no-op vs `policyengine_uk 2.88.20`'s baseline because Autumn Budget 2025 already removed it in current law). See `evals/fixtures/drift_report.md`.