WIP: US Python backend (latency spike — do not merge)#54

Open
SakshiKekre wants to merge 2 commits into feat/model-backend-selector from feat/us-backend

@SakshiKekre
Collaborator

Purpose

Spike to measure how long US simulations take in the chat interface. Not for merge — exists to deploy a preview where we can run timed prompts against the US backend.

Branches off PR #51 (feat/model-backend-selector) so it includes the backend-selector + scenario_context plumbing already.

What's in the diff

  • USPolicyEnginePythonBackend in backend/model_backends.py, mirrors the UK Python backend
  • policyengine_us added to backend/requirements.txt
  • Frontend label map: UK (Compiled) / UK (Python) / US (Python)

Known gaps deferred for the latency test

  • reference.md is UK-compiled-only. Claude sees UK API docs when the US backend is selected, so it will likely write some wrong code on the first attempt. That's part of what we want to measure (recovery latency).
  • System prompt still says "British English" and the title-generation route still calls itself "a UK tax and benefit policy assistant."
  • Modal region is eu — US response latency will reflect the transatlantic hop, not a US-optimised deploy.
  • No preloaded US dataset. Microsimulation questions will be very slow on first call.

Latency numbers to capture

  • Cold start (first message after Modal app spin-up)
  • Warm response on a simple US household question (e.g. CA single earner, federal EITC)
  • Warm response on a multi-state comparison
  • Warm response on a microsimulation question (expected slow — no preloaded dataset)
  • UK baseline for comparison: same warm household question on uk_python
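A minimal sketch of how the cold and warm numbers above could be collected consistently. `send` is a hypothetical callable that posts one prompt to the preview backend and blocks until the reply is complete; its name and signature are assumptions, not part of this PR. The first call only counts as a true cold start if the Modal app has actually spun down beforehand.

```python
import statistics
import time
from typing import Callable, Dict


def time_prompt(send: Callable[[str], object], prompt: str, warm_runs: int = 3) -> Dict[str, float]:
    """Time one initial call, then `warm_runs` repeat calls of the same prompt.

    `send` is a hypothetical client function (an assumption for this sketch).
    """

    def timed_call() -> float:
        start = time.perf_counter()
        send(prompt)
        return time.perf_counter() - start

    # The first call is treated as the cold-start sample.
    cold = timed_call()
    # Subsequent calls hit an already-running container.
    warm = [timed_call() for _ in range(warm_runs)]
    return {
        "cold_s": cold,
        "warm_median_s": statistics.median(warm),
        "warm_max_s": max(warm),
    }


if __name__ == "__main__":
    # Stand-in for a real request so the sketch runs without network access.
    def fake_send(prompt: str) -> str:
        time.sleep(0.01)
        return "ok"

    print(time_prompt(fake_send, "CA single earner, federal EITC"))
```

Running the same function against `uk_python` with the equivalent household prompt would give the UK baseline on the same footing.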

What this PR does NOT do

- USPolicyEnginePythonBackend in model_backends.py: mirrors the UK Python
  backend, swaps to policyengine_us, capabilities() lists US variables
  and parameter roots, prompt notes the state_code requirement.
- Add policyengine_us to backend/requirements.txt.
- Frontend label map: UK (Compiled) / UK (Python) / US (Python).

Known gaps deferred:
- reference.md is still UK-compiled-only — Claude sees UK API docs on
  US backend. Acceptable for the initial latency smoke test.
- System prompt still says "British English" and the title-generation
  route still calls itself "a UK tax and benefit policy assistant".
- Modal region is "eu" — US response latency will reflect transatlantic
  hop, not a US-optimised deploy.
@vercel

vercel Bot commented May 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| policyengine-uk-chat | Ready | Preview, Comment | May 27, 2026 12:57pm |

@SakshiKekre SakshiKekre changed the base branch from main to feat/model-backend-selector May 27, 2026 12:40
@SakshiKekre SakshiKekre changed the base branch from feat/model-backend-selector to main May 27, 2026 12:53
@SakshiKekre SakshiKekre marked this pull request as ready for review May 27, 2026 12:54
@SakshiKekre SakshiKekre changed the base branch from main to feat/model-backend-selector May 27, 2026 12:55
@github-actions

Beta preview is ready.

@SakshiKekre
Collaborator Author

Smoke test findings

Tested on preview: https://policyengine-uk-chat-git-feat-us-backend-policy-engine.vercel.app
Backend: https://policyengine--peukchat-feat-us-backend-web.modal.run

What works ✅

  • policyengine_us installs and imports on the Modal image (build job: 2m58s)
  • Backend selector routes correctly to the US adapter
  • State code (CA) plumbs through Simulation.situation correctly
  • Federal tax calc returns the right number

Verification — tight prompt:

"Single adult, age 30, $30,000 employment income in 2024, California, no dependents. What's the federal income tax? Just the number."

Returned $1,616.00, which matches the hand calculation exactly: $30,000 less the $14,600 standard deduction leaves $15,400 taxable income, and 10% × $11,600 + 12% × $3,800 = $1,160 + $456 = $1,616.
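The hand calculation can be reproduced with a few lines of plain Python, which is handy for checking other smoke-test prompts. This is a sketch of the arithmetic only, not how policyengine_us computes tax: it assumes a single filer in 2024 with wage income only, the standard deduction, no credits, and taxable income within the lowest three brackets. The function and constant names are illustrative.

```python
# 2024 federal figures for a single filer.
STANDARD_DEDUCTION_SINGLE_2024 = 14_600
# (upper bound of bracket in dollars, marginal rate in whole percent).
# Only the lowest three brackets are listed; higher incomes are rejected.
BRACKETS_SINGLE_2024 = [(11_600, 10), (47_150, 12), (100_525, 22)]


def federal_income_tax_single_2024(wages: int) -> float:
    """Hand-check of federal income tax before credits, in dollars."""
    taxable = max(0, wages - STANDARD_DEDUCTION_SINGLE_2024)
    if taxable > BRACKETS_SINGLE_2024[-1][0]:
        raise ValueError("taxable income above the brackets covered by this sketch")

    tax_cents = 0  # dollars x whole percent = cents, so arithmetic stays in integers
    lower = 0
    for upper, rate_pct in BRACKETS_SINGLE_2024:
        if taxable <= lower:
            break
        tax_cents += (min(taxable, upper) - lower) * rate_pct
        lower = upper
    return tax_cents / 100


if __name__ == "__main__":
    print(federal_income_tax_single_2024(30_000))  # 1616.0
    print(federal_income_tax_single_2024(50_000))  # 4016.0
```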

What doesn't work yet ⚠️

  • Open-ended prompts trigger 30+ tool calls of dir(pe) / guess-variable-name / retry behaviour before Claude converges. It does eventually converge on a sensible answer (verified: $50k CA single filer with no kids returns $4,016 federal income tax and $0 EITC, both correct), but the UX is poor.
  • Root cause: reference.md is generated only from policyengine_uk_compiled. Claude has no API cheat sheet for policyengine_us, so it explores via tool calls instead. The UK backends short-circuit this because their reference doc lists every variable, parameter class, and method up front.

Other known gaps (deferred for the latency test, still present)

  • System prompt still says "British English" — confirmed in responses
  • Title-generation route still calls itself "a UK tax and benefit policy assistant"
  • Modal region is eu — transatlantic latency on every call
  • No preloaded US dataset → microsimulation questions will be very slow on the first call

Suggested next steps (priority order, not yet started)

  1. Generalise build_reference.py to take a package name and emit reference_uk.md / reference_us.md. Biggest single unlock — should eliminate the tool-call wandering.
  2. Conditionally load the matching reference doc based on model_backend in routes/chatbot.py.
  3. Fix UK-hardcoded prompt language (system prompt + title generator) so US responses don't read as British.
  4. Defer everything else (Modal region, US engine preload, US eval scenarios) until we decide whether this is a real product direction.
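Steps 1 and 2 could look roughly like the sketch below. It is not the existing build_reference.py, whose internals are not shown in this PR: it only lists a package's public classes and functions via introspection, whereas the real reference doc also covers variables and parameters. The function names, the country-prefix convention, and the backend ids other than `uk_python` are assumptions.

```python
import importlib
import inspect

# Assumed convention: backend ids start with a country prefix, e.g. "uk_python".
REFERENCE_BY_COUNTRY = {"uk": "reference_uk.md", "us": "reference_us.md"}


def build_reference(package_name: str) -> str:
    """Return a markdown cheat sheet of a package's public classes and functions."""
    module = importlib.import_module(package_name)
    lines = [f"# {package_name} reference", ""]
    for name, obj in sorted(vars(module).items()):
        if name.startswith("_"):
            continue
        if inspect.isclass(obj):
            kind = "class"
        elif callable(obj):
            kind = "function"
        else:
            continue  # skip submodules and constants
        doc = inspect.getdoc(obj) or ""
        summary = doc.splitlines()[0] if doc else ""
        lines.append(f"- `{name}` ({kind}): {summary}".rstrip())
    return "\n".join(lines) + "\n"


def reference_filename(model_backend: str) -> str:
    """Pick the reference doc that matches the selected backend."""
    country = model_backend.split("_", 1)[0]
    try:
        return REFERENCE_BY_COUNTRY[country]
    except KeyError:
        raise ValueError(f"no reference doc for backend {model_backend!r}") from None


if __name__ == "__main__":
    # Demonstrated on a stdlib package so the sketch runs without policyengine_us.
    print(build_reference("json"))
    print(reference_filename("us_python"))  # reference_us.md
```

Raising on an unknown backend, rather than falling back to the UK doc, would surface a missing reference file immediately instead of reproducing the tool-call wandering seen in the smoke test.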

PR remains draft / not for merge.
