feat(0.31.0): backend-integrity guard by drewstone · Pull Request #65 · tangle-network/agent-eval

drewstone · 2026-05-20T13:12:59Z

Summary

When a canonical eval reports `0/N passed`, today there's no way to tell whether the agent failed every persona or whether the LLM backend was never actually called. We hit this running a 4-vertical parallel eval — legal-agent's default sandbox backend returned hard-coded 33-char strings from a `legal-sandbox-stub` model and the eval reported `0/36` as if it were an agent-quality problem.

This PR adds a pure substrate primitive to differentiate the cases.

What's new

`summarizeBackendIntegrity(records)` → `BackendIntegrityReport` with verdict `'real' | 'mixed' | 'stub'`. Inspects RunRecord token usage + costUsd. Also flags uncosted records (output tokens but costUsd=0 — the gtm/creative cost-ledger gap we just found).
`assertRealBackend(records, opts)` — throws `BackendIntegrityError` on pure stub-mode runs. `opts.allowMixed = false` also rejects partial-failure runs (recommended for CI gates).
`BackendIntegrityError extends AgentEvalError` with stable code `'backend_integrity'`.

Pure functions, no I/O.
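For readers who want to see the shape of the check, here is a minimal sketch of how a verdict like this could be derived. It is not the code in this PR. The `SketchRecord` fields (`inputTokens`, `outputTokens`, `costUsd`), the `SketchReport` fields, the rule that a record counts as real when it reports any tokens, and the choice to treat empty input as `'stub'` are all assumptions made for illustration. The actual `RunRecord` and `BackendIntegrityReport` shapes may differ.

```typescript
// Hypothetical record shape; the real RunRecord in agent-eval may differ.
interface SketchRecord {
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

type SketchVerdict = 'real' | 'mixed' | 'stub';

// Hypothetical report shape, loosely modelled on the PR description.
interface SketchReport {
  verdict: SketchVerdict;
  total: number;
  realCount: number; // records that report any token usage
  stubCount: number; // records with zero token usage
  uncostedCount: number; // output tokens present but costUsd === 0
}

// Pure function: no I/O, no mutation of the input.
function sketchSummarize(records: readonly SketchRecord[]): SketchReport {
  let realCount = 0;
  let uncostedCount = 0;
  for (const r of records) {
    if (r.inputTokens + r.outputTokens > 0) realCount += 1;
    if (r.outputTokens > 0 && r.costUsd === 0) uncostedCount += 1;
  }
  const stubCount = records.length - realCount;

  // Assumption: an empty run carries no evidence of a real backend call,
  // so it falls into 'stub' here. The PR does not state the actual rule.
  let verdict: SketchVerdict;
  if (realCount === 0) verdict = 'stub';
  else if (stubCount === 0) verdict = 'real';
  else verdict = 'mixed';

  return { verdict, total: records.length, realCount, stubCount, uncostedCount };
}
```

Under these assumptions, a run where every record has zero tokens yields `'stub'`, which is the case that previously surfaced as a misleading `0/N passed`.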

Test plan

`pnpm test` — 130 files / 1208 tests pass (12 new)
`pnpm typecheck` — clean
All 5 cases covered: the stub, real, and mixed verdicts, plus uncosted records (cost-ledger gap) and empty input
Error shape covered — `code === 'backend_integrity'`, `report` attached

Follow-up

Separate PR wires `assertRealBackend(records)` into each agent's canonical eval (tax/legal/gtm/creative) immediately after the records-write block, so a misconfigured backend fails the run with a clear diagnosis instead of silently scoring 0%.
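The wiring described above might look roughly like the sketch below. Everything in it is a local stand-in so the example runs on its own: `GateRecord`, `gateVerdict`, `gateAssertRealBackend`, `GateIntegrityError`, and `finishEval` are hypothetical names, and the `allowMixed` default of `true` is an assumption inferred from the description. Only the error code `'backend_integrity'` and the `allowMixed: false` option come from this PR.

```typescript
// Stand-ins for the library exports; names and shapes here are hypothetical.
type GateVerdict = 'real' | 'mixed' | 'stub';

interface GateRecord {
  outputTokens: number;
  costUsd: number;
}

class GateIntegrityError extends Error {
  readonly code = 'backend_integrity';
  readonly verdict: GateVerdict;
  constructor(verdict: GateVerdict) {
    super(`backend integrity check failed: verdict=${verdict}`);
    this.name = 'GateIntegrityError';
    this.verdict = verdict;
  }
}

// Simplified verdict rule (assumption): a record is "real" if it has output tokens.
function gateVerdict(records: readonly GateRecord[]): GateVerdict {
  const real = records.filter((r) => r.outputTokens > 0).length;
  if (real === 0) return 'stub';
  return real === records.length ? 'real' : 'mixed';
}

function gateAssertRealBackend(
  records: readonly GateRecord[],
  opts: { allowMixed?: boolean } = {},
): void {
  const allowMixed = opts.allowMixed ?? true; // assumed default
  const verdict = gateVerdict(records);
  if (verdict === 'stub' || (verdict === 'mixed' && !allowMixed)) {
    throw new GateIntegrityError(verdict);
  }
}

// Hypothetical eval tail: runs right after the records have been written.
function finishEval(records: readonly GateRecord[]): string {
  try {
    // Strict mode, as recommended for CI gates.
    gateAssertRealBackend(records, { allowMixed: false });
    return 'ok';
  } catch (err) {
    // Match on the stable code rather than on the class identity.
    if ((err as { code?: string }).code === 'backend_integrity') {
      return 'backend_integrity';
    }
    throw err;
  }
}
```

Matching on `code` rather than `instanceof` keeps the check working across package boundaries and transpilation targets, which is presumably why the error carries a stable code.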

…ent failure

When a canonical eval reports "0/N passed", today there's no way to tell whether the agent failed every persona or whether the LLM backend was never actually called. We hit this running a 4-vertical parallel eval: legal-agent's default sandbox backend returned hard-coded 33-char strings from a "legal-sandbox-stub" model, and the eval dutifully reported 0/36 as if it were an agent-quality problem.

Adds:
- summarizeBackendIntegrity(records) → BackendIntegrityReport. Inspects RunRecord token usage + costUsd. Verdict: 'real' | 'mixed' | 'stub'. Also flags uncosted records (output tokens but costUsd=0, the gtm/creative cost-ledger gap we just found).
- assertRealBackend(records, opts): throws BackendIntegrityError on pure stub-mode runs. opts.allowMixed=false also rejects partial-failure runs (recommended for CI gates).
- BackendIntegrityError extends AgentEvalError with code 'backend_integrity' so consumers can pattern-match.

Pure functions, no I/O. 12 new tests cover stub/real/mixed/uncosted verdicts, the empty-input edge case, the partial-usage-propagation edge case, and the error shape. Total suite: 130 files, 1208 tests passing.

Next step (separate PR): wire `assertRealBackend(records)` into each agent's canonical eval immediately after the records-write block so a misconfigured backend fails the run with a clear diagnosis instead of silently scoring 0%.

drewstone added 3 commits May 20, 2026 16:12

ci: biome organizeImports auto-fix

b8c77fa

Merge branch 'main' into feat/backend-integrity-guard

ae002f7

drewstone merged commit 3fef590 into main May 20, 2026
1 check passed

drewstone merged 3 commits into main from feat/backend-integrity-guard
