feat(0.31.0): backend-integrity guard#65

Merged: drewstone merged 3 commits into main from feat/backend-integrity-guard on May 20, 2026

@drewstone

Summary

When a canonical eval reports `0/N passed`, today there's no way to tell whether the agent failed every persona or whether the LLM backend was never actually called. We hit this running a 4-vertical parallel eval — legal-agent's default sandbox backend returned hard-coded 33-char strings from a `legal-sandbox-stub` model and the eval reported `0/36` as if it were an agent-quality problem.

This PR adds a pure substrate primitive to differentiate the cases.

What's new

  • `summarizeBackendIntegrity(records)` → `BackendIntegrityReport` with verdict `'real' | 'mixed' | 'stub'`. Inspects RunRecord token usage + costUsd. Also flags uncosted records (output tokens but costUsd=0 — the gtm/creative cost-ledger gap we just found).
  • `assertRealBackend(records, opts)` — throws `BackendIntegrityError` on pure stub-mode runs. `opts.allowMixed = false` also rejects partial-failure runs (recommended for CI gates).
  • `BackendIntegrityError extends AgentEvalError` with stable code `'backend_integrity'`.

Pure functions, no I/O.
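
To make the verdict logic concrete, here is a minimal sketch of how the three primitives might fit together. It is not the merged implementation: the `RunRecord` fields, the report fields other than `verdict`, the `AgentEvalError` stand-in, the default of `allowMixed`, and the treatment of empty input as `'stub'` are all assumptions made for illustration.

```typescript
// Hypothetical record shape: the real RunRecord has more fields and may name these differently.
interface RunRecord {
  outputTokens: number;
  costUsd: number;
}

type Verdict = 'real' | 'mixed' | 'stub';

// Only `verdict` is named in the PR; the other fields are assumptions.
interface BackendIntegrityReport {
  verdict: Verdict;
  total: number;
  live: number;     // records that show output tokens
  uncosted: number; // records with output tokens but costUsd === 0
}

// Stand-in for the library's base error class.
class AgentEvalError extends Error {
  readonly code: string;
  constructor(message: string, code: string) {
    super(message);
    this.code = code;
  }
}

class BackendIntegrityError extends AgentEvalError {
  readonly report: BackendIntegrityReport;
  constructor(report: BackendIntegrityReport) {
    super(
      `backend verdict '${report.verdict}' (${report.live}/${report.total} records show output tokens)`,
      'backend_integrity',
    );
    this.report = report;
  }
}

// Pure: derives the report from the records alone, no I/O.
function summarizeBackendIntegrity(records: RunRecord[]): BackendIntegrityReport {
  let live = 0;
  let uncosted = 0;
  for (const r of records) {
    if (r.outputTokens > 0) {
      live += 1;
      if (r.costUsd === 0) uncosted += 1;
    }
  }
  const total = records.length;
  // Assumption: empty input counts as 'stub' because nothing proves a backend call.
  const verdict: Verdict = live === 0 ? 'stub' : live === total ? 'real' : 'mixed';
  return { verdict, total, live, uncosted };
}

function assertRealBackend(records: RunRecord[], opts: { allowMixed?: boolean } = {}): void {
  const report = summarizeBackendIntegrity(records);
  const allowMixed = opts.allowMixed ?? true; // assumed default
  if (report.verdict === 'stub' || (report.verdict === 'mixed' && !allowMixed)) {
    throw new BackendIntegrityError(report);
  }
}
```

In this sketch a record counts as live once it reports any output tokens, so an uncosted record still contributes to a `'real'` verdict and is only surfaced through the `uncosted` count.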

Test plan

  • `pnpm test` — 130 files / 1208 tests pass (12 new)
  • `pnpm typecheck` — clean
  • All 5 scenarios covered: stub, real, mixed, uncosted-cost-ledger, empty-input
  • Error shape covered — `code === 'backend_integrity'`, `report` attached

Follow-up

Separate PR wires `assertRealBackend(records)` into each agent's canonical eval (tax/legal/gtm/creative) immediately after the records-write block, so a misconfigured backend fails the run with a clear diagnosis instead of silently scoring 0%.
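
A rough sketch of what that wiring could look like, under stated assumptions: `finishCanonicalEval`, `EvalOutcome`, and `hasCode` are hypothetical names, the `RunRecord` shape is reduced to two fields, and the guard here is a simplified stand-in rather than the library's `assertRealBackend`. The point is only that a caller can match on the stable `'backend_integrity'` code and report a backend failure instead of a 0% score.

```typescript
// Hypothetical, reduced record shape for illustration only.
interface RunRecord {
  outputTokens: number;
  costUsd: number;
}

// Simplified stand-in for the library's error; only the stable code matters here.
class BackendIntegrityError extends Error {
  readonly code = 'backend_integrity';
}

// Simplified stand-in guard: throws when no record shows any output tokens.
function assertRealBackend(records: RunRecord[]): void {
  if (!records.some((r) => r.outputTokens > 0)) {
    throw new BackendIntegrityError('no record shows output tokens; backend looks stubbed');
  }
}

type EvalOutcome =
  | { kind: 'scored'; passed: number; total: number }
  | { kind: 'backend_failure'; message: string };

// Matches on the error code rather than the class, so it works across package boundaries.
function hasCode(err: unknown, code: string): boolean {
  return typeof err === 'object' && err !== null && 'code' in err
    && (err as { code: unknown }).code === code;
}

// Hypothetical tail of a canonical eval, called after the records-write block.
function finishCanonicalEval(records: RunRecord[], passed: number): EvalOutcome {
  try {
    assertRealBackend(records);
  } catch (err) {
    if (hasCode(err, 'backend_integrity')) {
      const message = err instanceof Error ? err.message : String(err);
      return { kind: 'backend_failure', message };
    }
    throw err; // unrelated errors propagate unchanged
  }
  return { kind: 'scored', passed, total: records.length };
}
```

A CI entry point could then map `backend_failure` to a distinct exit status so a misconfigured backend is not mistaken for an agent-quality regression.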

drewstone added 3 commits May 20, 2026 16:12
…ent failure

When a canonical eval reports "0/N passed", today there's no way to tell
whether the agent failed every persona or whether the LLM backend was
never actually called. We hit this running a 4-vertical parallel eval —
legal-agent's default sandbox backend returned hard-coded 33-char strings
from a "legal-sandbox-stub" model, and the eval dutifully reported 0/36
as if it were an agent-quality problem.

Adds:

- summarizeBackendIntegrity(records) → BackendIntegrityReport
  Inspects RunRecord token usage + costUsd. Verdict: 'real' | 'mixed' | 'stub'.
  Also flags uncosted records (output tokens but costUsd=0 — the gtm/creative
  cost-ledger gap we just found).

- assertRealBackend(records, opts) — throws BackendIntegrityError on pure
  stub-mode runs. opts.allowMixed=false also rejects partial-failure runs
  (recommended for CI gates).

- BackendIntegrityError extends AgentEvalError with code 'backend_integrity'
  so consumers can pattern-match.

Pure functions, no I/O. 12 new tests cover stub/real/mixed/uncosted verdicts,
the empty-input edge case, the partial-usage-propagation edge case, and the
error shape. Total suite: 130 files, 1208 tests passing.

Next step (separate PR): wire `assertRealBackend(records)` into each agent's
canonical eval immediately after the records-write block so a misconfigured
backend fails the run with a clear diagnosis instead of silently scoring 0%.

drewstone merged commit 3fef590 into main on May 20, 2026
1 check passed