fix(phoenix): make full dry-run structural parity actionable #1283

@christso

Objective

Resolve or explicitly exclude current Phoenix adapter full dry-run failures so the dry-run report can become a reliable regression signal.

Acceptance Signals

  • Full dry-run exits successfully, or reports only explicitly documented non-blocking exclusions.
  • The four known failures are investigated and either fixed at source/baseline level or excluded with clear reasons:
    • examples/features/matrix-evaluation/evals/dataset.eval.yaml
    • examples/features/prompt-template-sdk/evals/dataset.eval.yaml
    • examples/features/tool-trajectory-simple/evals/dataset.eval.yaml
    • examples/features/weighted-graders/evals/dataset.eval.yaml
  • Parity report output distinguishes conversion crashes from source/baseline drift and documented exclusions.
  • packages/phoenix-adapter/docs/e2e-verification.md reflects the current expected full dry-run result.
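One way the report could separate the three failure classes named above is a tagged outcome type with a single summarizing function. This is a minimal sketch only: `ParityOutcome`, `ParitySummary`, and `summarizeParity` are hypothetical names, not the existing API of `report.ts`, and treating both crashes and drift as blocking is an assumption.

```typescript
// Hypothetical outcome kinds; the real report.ts may model these differently.
type ParityOutcome =
  | { kind: "pass"; evalPath: string }
  | { kind: "conversion_crash"; evalPath: string; error: string }
  | { kind: "drift"; evalPath: string; detail: string }
  | { kind: "excluded"; evalPath: string; reason: string };

interface ParitySummary {
  counts: Record<ParityOutcome["kind"], number>;
  blocking: ParityOutcome[];
  exitCode: 0 | 1;
}

// Counts each outcome kind and decides the exit code.
// Assumption: crashes and drift block; passes and documented exclusions do not.
function summarizeParity(outcomes: ParityOutcome[]): ParitySummary {
  const counts: Record<ParityOutcome["kind"], number> = {
    pass: 0,
    conversion_crash: 0,
    drift: 0,
    excluded: 0,
  };
  const blocking: ParityOutcome[] = [];
  for (const outcome of outcomes) {
    counts[outcome.kind] += 1;
    if (outcome.kind === "conversion_crash" || outcome.kind === "drift") {
      blocking.push(outcome);
    }
  }
  return { counts, blocking, exitCode: blocking.length === 0 ? 0 : 1 };
}
```

With this shape, a run containing only passes and documented exclusions yields exit code 0, which matches the first acceptance signal.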

Implementation Notes

Prefer fixing stale baselines or source references over adding new mechanisms. If an eval intentionally diverges, encode a small documented exclusion in adapter parity reporting.
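A documented exclusion could be as small as a list of paths with mandatory reasons. The sketch below is illustrative only: `ParityExclusion` and `applyExclusions` are hypothetical names, and flagging stale entries is an assumption about what would keep the list honest, not a stated requirement.

```typescript
// Hypothetical shape for a documented exclusion; not the adapter's actual API.
interface ParityExclusion {
  evalPath: string;
  reason: string; // must be non-empty so every exclusion stays documented
}

interface ExclusionResult {
  blocking: string[]; // failed evals with no exclusion
  excluded: ParityExclusion[]; // failed evals covered by an exclusion
  stale: string[]; // exclusions whose eval no longer fails
}

// Splits failed eval paths into blocking and excluded, and reports
// exclusions that are no longer needed.
function applyExclusions(
  failedPaths: string[],
  exclusions: ParityExclusion[],
): ExclusionResult {
  const reasons = new Map<string, string>();
  for (const exclusion of exclusions) {
    if (exclusion.reason.trim() === "") {
      throw new Error(`Exclusion for ${exclusion.evalPath} needs a reason`);
    }
    reasons.set(exclusion.evalPath, exclusion.reason);
  }

  const blocking: string[] = [];
  const excluded: ParityExclusion[] = [];
  for (const evalPath of failedPaths) {
    const reason = reasons.get(evalPath);
    if (reason === undefined) {
      blocking.push(evalPath);
    } else {
      excluded.push({ evalPath, reason });
    }
  }

  const failed = new Set(failedPaths);
  const stale = exclusions
    .map((exclusion) => exclusion.evalPath)
    .filter((evalPath) => !failed.has(evalPath));

  return { blocking, excluded, stale };
}
```

Rejecting empty reasons keeps the mechanism small while still forcing each exclusion to be explained, and the `stale` list gives a prompt to delete entries once a baseline or source reference has been fixed.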

Relevant files:

  • packages/phoenix-adapter/src/parity/compare.ts
  • packages/phoenix-adapter/src/parity/report.ts
  • packages/phoenix-adapter/test/parity.test.ts
  • packages/phoenix-adapter/docs/e2e-verification.md
  • The affected example evals/baselines listed above

Non-goals

  • Do not accept both old and new wire key names unless the old form has shipped and needs compatibility.
  • Do not turn dry-run into a live Phoenix or live provider test.

Metadata

    Labels

    phoenix: Phoenix integration and observability work

    Projects

    Status
    Backlog
