
feat(0.31.0): JudgeScoresRecord on RunRecord.outcome#66

Merged
tangletools merged 1 commit into main from feat/0.31.0-judge-scores-record on May 20, 2026

@tangletools (Contributor)

The gap this closes

agent-builder PR #179 just landed a forge-chat judge rubric that
computes three per-dim scores (helpfulness, clarity, on_topic) per
cell, but only persists the composite to records.jsonl because
RunRecord.outcome doesn't have a per-judge / per-dim slot. The same
problem hits every consumer that wants ensemble scoring across the
five product agents (tax / creative / legal / gtm / agent-builder).

Consumers were either dropping the breakdown on the floor or
smuggling it through stringly-typed outcome.raw keys like
judge_kimi_helpfulness — neither survives a corpus-IRR run.
corpusInterRaterAgreement (0.27.2) expects structured per-judge
per-dim records, not parsed strings.
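
One way to see why flat keys are fragile, as an illustration rather than the exact failure hit in practice: a parser for keys shaped like judge_<judgeId>_<dim> has to guess where the judge id ends. The `parseFlatKey` helper and the `kimi_k2` judge id below are hypothetical, and the key convention itself is an assumption based on the single example above.

```typescript
// Hypothetical parser for flat outcome.raw keys such as "judge_kimi_helpfulness".
// ASSUMPTION: keys follow judge_<judgeId>_<dim>; that convention is a guess.
function parseFlatKey(key: string): { judgeId: string; dim: string } | null {
  const [prefix, judgeId, ...rest] = key.split("_");
  if (prefix !== "judge" || judgeId === undefined || rest.length === 0) {
    return null;
  }
  // Guess: the judge id is a single token and the remainder is the dimension.
  return { judgeId, dim: rest.join("_") };
}

// Works while judge ids contain no underscore:
// yields { judgeId: "kimi", dim: "on_topic" }.
const simple = parseFlatKey("judge_kimi_on_topic");

// Breaks for a hypothetical judge id "kimi_k2": the id is split in the wrong
// place, yielding { judgeId: "kimi", dim: "k2_clarity" }.
const ambiguous = parseFlatKey("judge_kimi_k2_clarity");

// The structured form needs no parsing: judge id and dim are separate keys.
const perJudge: Record<string, Record<string, number>> = {
  kimi_k2: { clarity: 0.75 },
};
const clarity = perJudge["kimi_k2"]?.["clarity"];
```

With perJudge[judgeId][dim], the judge id and the dimension never share a string, so there is nothing to mis-split.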

What this ships

  • JudgeScoresRecord type (src/run-record.ts):
    • perJudge[judgeId][dim]: number — canonical store
    • perDimMean[dim]: number — convenience projection
    • composite: number — mirrors the score the gate sees
    • failedJudges?: string[] — explicit dead-judge ids
    • notes?: string — panel prose
  • RunOutcome.judgeScores?: JudgeScoresRecord — additive on the
    outcome; existing single-judge runs leave it unset.
  • CampaignRunOutcome.judgeScores? wired through runEvalCampaign
    so per-cell ensemble outcomes land on RunRecord.outcome.judgeScores
    unchanged.
  • Validator extended in validateRunRecord: per-judge / per-dim /
    composite scores must be finite (no silent NaN-as-zero);
    failedJudges entries must be non-empty strings.
  • Tests in tests/run-record.test.ts and tests/eval-campaign.test.ts
    cover all four shapes (full, partial with failedJudges, missing,
    with notes) plus a fail-loud case where one judge throws and the
    record carries the dead-judge id, not a silent zero.
  • Consumer contract (tests/consumer-contract.test.ts) pins
    JudgeScoresRecord as a type-level export so consumer code stops
    compiling if the field gets renamed.
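
A minimal sketch of the shape and the finite-score checks described above. Field names follow the bullets; the nested `Record` typing is an assumption, and `checkJudgeScores` is a hypothetical stand-in, not the actual validateRunRecord code in src/run-record.ts.

```typescript
// Sketch only: field names come from the PR description, the exact typing is assumed.
interface JudgeScoresRecord {
  /** Canonical store: perJudge[judgeId][dim]. */
  perJudge: Record<string, Record<string, number>>;
  /** Convenience projection: mean score per dimension. */
  perDimMean: Record<string, number>;
  /** Mirrors the score the gate sees. */
  composite: number;
  /** Explicit ids of judges that failed. */
  failedJudges?: string[];
  /** Panel prose. */
  notes?: string;
}

// Hypothetical helper (not the real validator): returns a list of problems,
// empty when the record passes.
function checkJudgeScores(js: JudgeScoresRecord): string[] {
  const errors: string[] = [];
  for (const [judgeId, dims] of Object.entries(js.perJudge)) {
    for (const [dim, score] of Object.entries(dims)) {
      // Number.isFinite rejects NaN and +/-Infinity without coercion.
      if (!Number.isFinite(score)) {
        errors.push(`perJudge.${judgeId}.${dim} is not finite`);
      }
    }
  }
  for (const [dim, mean] of Object.entries(js.perDimMean)) {
    if (!Number.isFinite(mean)) {
      errors.push(`perDimMean.${dim} is not finite`);
    }
  }
  if (!Number.isFinite(js.composite)) {
    errors.push("composite is not finite");
  }
  (js.failedJudges ?? []).forEach((id, i) => {
    // Runtime check matters for records parsed from JSON, where types are not enforced.
    if (typeof id !== "string" || id.length === 0) {
      errors.push(`failedJudges[${i}] must be a non-empty string`);
    }
  });
  return errors;
}
```

Returning a list of messages rather than throwing on the first problem is a choice made for this sketch; the real validator may report errors differently.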

Design tradeoffs

  • perDimMean and composite are precomputed projections of
    perJudge. Storing both costs a few bytes per record but spares
    every reporter and IRR primitive a re-aggregation; that is the
    right trade for a read-heavy access pattern.
  • failedJudges?: string[] is the typed-outcome answer to partial
    failures. A judge key missing from perJudge would be ambiguous
    (silent zero vs. not run); the explicit list is fail-loud.
  • The field is optional on RunOutcome, so the 0.30.0 surface is
    preserved. Scalar-only runs leave it unset.
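
The tradeoffs above can be sketched as a small builder that precomputes the projections once and records dead judges explicitly. This is an illustration under stated assumptions: judges are modeled as functions that return per-dim scores or throw, `buildJudgeScores` is a hypothetical name, and the composite is taken as the unweighted mean of the per-dim means, which may not match how the gate actually computes it.

```typescript
type DimScores = Record<string, number>;
// ASSUMPTION: a judge is a function that returns per-dim scores or throws.
type Judge = () => DimScores;

interface JudgeScoresRecord {
  perJudge: Record<string, DimScores>;
  perDimMean: DimScores;
  composite: number;
  failedJudges?: string[];
}

// Hypothetical builder, not the runEvalCampaign implementation.
function buildJudgeScores(panel: Record<string, Judge>): JudgeScoresRecord {
  const perJudge: Record<string, DimScores> = {};
  const failedJudges: string[] = [];
  for (const [judgeId, judge] of Object.entries(panel)) {
    try {
      perJudge[judgeId] = judge();
    } catch {
      // Fail loud: record the dead judge id instead of writing zeros.
      failedJudges.push(judgeId);
    }
  }

  // Accumulate sum and count per dimension across the judges that answered.
  const acc: Record<string, { sum: number; n: number }> = {};
  for (const dims of Object.values(perJudge)) {
    for (const [dim, score] of Object.entries(dims)) {
      const cur = acc[dim] ?? { sum: 0, n: 0 };
      acc[dim] = { sum: cur.sum + score, n: cur.n + 1 };
    }
  }
  const perDimMean: DimScores = {};
  for (const [dim, { sum, n }] of Object.entries(acc)) {
    perDimMean[dim] = sum / n;
  }

  const means = Object.values(perDimMean);
  if (means.length === 0) {
    // Avoid emitting a NaN composite when every judge failed.
    throw new Error("no judge produced scores");
  }
  // ASSUMPTION: composite is the unweighted mean of the per-dim means.
  const composite = means.reduce((a, b) => a + b, 0) / means.length;

  return {
    perJudge,
    perDimMean,
    composite,
    // Leave the optional field unset when every judge answered.
    ...(failedJudges.length > 0 ? { failedJudges } : {}),
  };
}
```

A reader of the record can then tell "judge-c never scored this cell" apart from "judge-c scored it zero" by checking failedJudges, without re-deriving anything from perJudge.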

What consumers gain

  • Forge-chat (agent-builder) stops dropping per-dim scores;
    corpusInterRaterAgreement consumes the records directly.
  • Tax / creative / legal / gtm agents inherit the same typed slot
    without each implementing their own conventions on outcome.raw.
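
On the consumer side, reading the typed slot might look like the sketch below. It flattens records.jsonl lines into one row per (record, judge, dim), a common input shape for agreement statistics. `RunRecordLike`, `parseJsonl`, and `toRatingRows` are hypothetical names, only the outcome.judgeScores path is taken from this PR, and the input expected by corpusInterRaterAgreement is not shown here and may differ.

```typescript
interface JudgeScoresRecord {
  perJudge: Record<string, Record<string, number>>;
  perDimMean: Record<string, number>;
  composite: number;
  failedJudges?: string[];
}

// Hypothetical minimal view of a record; the real RunRecord has more fields.
interface RunRecordLike {
  outcome: { judgeScores?: JudgeScoresRecord };
}

interface RatingRow {
  recordIndex: number;
  judgeId: string;
  dim: string;
  score: number;
}

// Parse JSON Lines text: one JSON object per non-empty line.
function parseJsonl(text: string): RunRecordLike[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as RunRecordLike);
}

// One row per (record, judge, dim); scalar-only runs contribute no rows.
function toRatingRows(records: RunRecordLike[]): RatingRow[] {
  const rows: RatingRow[] = [];
  records.forEach((record, recordIndex) => {
    const js = record.outcome.judgeScores;
    if (js === undefined) return; // field is optional, single-judge runs leave it unset
    for (const [judgeId, dims] of Object.entries(js.perJudge)) {
      for (const [dim, score] of Object.entries(dims)) {
        rows.push({ recordIndex, judgeId, dim, score });
      }
    }
  });
  return rows;
}
```

Because the field is optional, the guard on an unset judgeScores is what keeps pre-0.31.0 records readable by the same code.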

Version bumps (lockstep)

  • package.json 0.30.0 → 0.31.0
  • clients/python/pyproject.toml 0.30.0 → 0.31.0
  • clients/python/src/agent_eval_rpc/__init__.py 0.30.0 → 0.31.0

Test plan

  • pnpm typecheck clean
  • pnpm test — 1208 tests pass (5 new in eval-campaign, 5 new in run-record, 1 new in consumer-contract)
  • pnpm build clean (tsup + openapi)
  • After merge: a human tags v0.31.0 and pushes the tag to fire the publish workflow.

Ensemble-judge consumers were dropping per-judge per-dim scores on the
floor because RunOutcome only had a slot for the composite. Adds a
typed `judgeScores?: JudgeScoresRecord` field, threaded through
runEvalCampaign and pinned in the consumer-contract test. Validator
rejects NaN scores and non-string failedJudges entries; fail-loud
test covers a panel where one judge throws.

Bumps TS + Python clients to 0.31.0 in lockstep.
@tangletools merged commit 51f6e74 into main on May 20, 2026
1 check passed
@tangletools deleted the feat/0.31.0-judge-scores-record branch on May 20, 2026 14:48