From 0e0147c9407782d6a1fc60bcd328b8369b9be554 Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Wed, 3 Jun 2026 06:18:09 +0200
Subject: [PATCH 1/2] docs(phoenix): plan integration completion

---
 ...eat-phoenix-integration-completion-plan.md | 355 ++++++++++++++++++
 1 file changed, 355 insertions(+)
 create mode 100644 docs/plans/2026-06-03-001-feat-phoenix-integration-completion-plan.md

diff --git a/docs/plans/2026-06-03-001-feat-phoenix-integration-completion-plan.md b/docs/plans/2026-06-03-001-feat-phoenix-integration-completion-plan.md
new file mode 100644
index 00000000..e732c912
--- /dev/null
+++ b/docs/plans/2026-06-03-001-feat-phoenix-integration-completion-plan.md
@@ -0,0 +1,355 @@
+---
+title: "feat: Complete the AgentV Phoenix integration"
+type: "feat"
+status: "active"
+date: "2026-06-03"
+---
+
+# feat: Complete the AgentV Phoenix integration
+
+## Summary
+
+Complete AgentV's Phoenix integration as two intentionally bounded surfaces: a first-class Phoenix OTLP observability preset for normal `agentv eval` runs, and a Phoenix dataset/experiment adapter that keeps AgentV eval YAML and AgentV scoring semantics authoritative. The current `@agentv/phoenix-adapter` package should stay repo-local/private until real AgentV execution, deterministic parity, documentation, and release-readiness criteria are satisfied.
+
+---
+
+## Problem Frame
+
+PR #1279 added the initial repo-local `packages/phoenix-adapter` package. It proves that AgentV-authored eval suites can be normalized through `@agentv/core`, converted into Phoenix dataset payloads, and run through a Phoenix experiment with deterministic CODE evaluator support for `contains`, `regex`, `equals`, and `is-json`.
+
+The integration is not yet complete from a user perspective. It has no AgentV CLI surface, no Phoenix OTLP backend preset, no publishable package posture, no real AgentV target execution inside Phoenix experiments, incomplete deterministic parity, large unsupported `llm-grader` / `code-grader` / trace-family gaps, and only a narrow CI smoke. Full dry-run currently reports 97 suites / 405 tests / 93 passed suites / 4 failed suites, with 217 unsupported entries across 122 distinct unsupported features.
+
+---
+
+## Requirements
+
+### Integration contract
+
+- R1. Define Phoenix integration as two complementary surfaces: OTLP trace export from normal AgentV eval runs, and a dataset/experiment adapter for AgentV-authored eval suites.
+- R2. Keep AgentV eval YAML as the source of truth for test discovery, case normalization, assertion parsing, interpolation, and metadata handling.
+- R3. Keep AgentV scoring authoritative for AgentV-specific semantics unless a Phoenix-native evaluator is explicitly proven equivalent and documented.
+- R4. Preserve AgentV's lightweight-core/plugin-extensibility boundary: do not reimplement workspace lifecycle, Docker sandboxing, target matrices, trials, or custom assertion discovery inside Phoenix unless a concrete need later justifies it.
+
+### User-facing behavior
+
+- R5. Users can export AgentV eval traces to Phoenix through a documented `phoenix` OTel backend preset.
+- R6. Users can understand what the Phoenix adapter supports, what it reports as unsupported, and how unsupported features affect scores/status.
+- R7. Phoenix experiment runs should execute real AgentV targets, or clearly declare a dry-run/reference mode that does not claim target parity.
+- R8. Repeated Phoenix adapter runs should avoid confusing dataset duplication and should preserve stable AgentV identifiers in Phoenix metadata.
+
+### Evaluator parity
+
+- R9. Deterministic assertion parity covers `contains`, `contains-any`, `contains-all`, `icontains`, `icontains-any`, `icontains-all`, `starts-with`, `ends-with`, `regex`, `equals`, and `is-json`.
+- R10. Deterministic scoring handles or explicitly declines `weight`, `required`, `min_score`, and `negate` semantics.
+- R11. `llm-grader` and `rubrics` support is designed around AgentV prompt/schema parity first, with Phoenix-native model evaluator reuse considered only where semantics remain clear.
+- R12. Trace and metric graders are supported only after Phoenix traces can be associated with AgentV test cases through stable trace IDs/spans.
+
+### Release and verification
+
+- R13. The repo-local/private package remains private until release and install expectations are met.
+- R14. If the package becomes publishable, release/version/publish scripts include it and package metadata exposes the intended CLI/API surface.
+- R15. Full dry-run structural parity is either green or has explicitly documented exclusions before it becomes a blocking CI gate.
+- R16. Live Phoenix verification covers both OTLP export and at least one experiment path before the integration is documented as complete.
+
+---
+
+## Key Technical Decisions
+
+- KTD1. **Treat Phoenix as observability plus experiment surface, not an alternate AgentV runtime:** Phoenix experiments can host runs and evaluations, but AgentV-specific YAML semantics, target execution, and scorer contracts should remain centralized in AgentV. This prevents duplicating complex runtime behavior in `packages/phoenix-adapter`.
+- KTD2. **Ship Phoenix OTLP preset before expanding adapter depth:** A `phoenix` backend preset gives users immediate value with normal `agentv eval` runs and uses existing OTel infrastructure in `packages/core/src/observability/otel-exporter.ts` and `apps/cli/src/commands/eval/run-eval.ts`.
+- KTD3. **Keep `@agentv/phoenix-adapter` private until real execution exists:** The package currently synthesizes task output in `packages/phoenix-adapter/src/phoenix/run-experiment.ts`; publishing before real AgentV execution risks users mistaking plumbing validation for true eval parity.
+- KTD4. **Prefer reusing AgentV evaluator logic over parallel adapter implementations:** The current adapter has its own deterministic evaluator implementation in `packages/phoenix-adapter/src/evaluators/deterministic.ts`; future work should reduce semantic drift by sharing or wrapping core grader behavior where feasible.
+- KTD5. **Make unsupported semantics visible and conservative:** Unsupported evaluator families should remain visible in reports and metadata. Scores should not overstate quality when unsupported assertions are present.
+- KTD6. **Use Phoenix trace IDs only after spans are available:** Phoenix trace-based evaluators are best planned around post-run evaluation so AgentV can fetch spans by trace ID and translate them into `TraceSummary`-like data for `tool-trajectory`, `execution-metrics`, `latency`, `cost`, and token usage graders.
+
+---
+
+## High-Level Technical Design
+
+```mermaid
+flowchart TB
+  A[AgentV eval YAML] --> B[@agentv/core loader]
+  B --> C[Normalized AgentV suite]
+  C --> D[Phoenix dataset payload]
+  D --> E[Phoenix experiment]
+  E --> F[AgentV target execution]
+  F --> G[AgentV-authored scores and metadata]
+  F --> H[OTLP spans to Phoenix]
+  H --> I[TraceId / span lookup]
+  I --> J[Trace and metric graders]
+  G --> K[Phoenix evaluation results]
+  J --> K
+```
+
+The integration should have two stable entry paths. Normal `agentv eval` runs export traces directly to Phoenix through OTel. The adapter path converts AgentV suites into Phoenix datasets and experiments, then runs AgentV targets/scorers while recording Phoenix experiment artifacts. The adapter should not become a second YAML parser, target runner, or sandbox system.
+
+---
+
+## Scope Boundaries
+
+### In scope
+
+- Phoenix OTel backend preset and documentation.
+- Support matrix and integration contract documentation.
+- Deterministic assertion parity for AgentV's common built-in deterministic primitives.
+- Real AgentV target execution inside Phoenix experiments.
+- LLM/rubric support where AgentV scorer semantics stay authoritative.
+- Trace-based grader support after trace IDs and spans are reliably wired.
+- Release-readiness changes if the package is later approved for publishing.
+
+### Deferred to Follow-Up Work
+
+- Phoenix-native equivalents for every AgentV custom/plugin evaluator.
+- Native Phoenix implementation of AgentV workspace lifecycle, Docker workspace setup, target matrices, and trials.
+- Dashboard-specific Phoenix UI beyond linking/exporting existing Phoenix artifacts.
+- Auto-opening issues from this plan; proposed issue bodies can be used later if requested.
+
+### Outside this integration's identity
+
+- Replacing AgentV's local result JSONL/artifact model with Phoenix as the sole results store.
+- Making Phoenix a required dependency for normal AgentV eval execution.
+- Adding provider-specific Phoenix config knobs that can be solved by existing OTel environment variables or plugin/wrapper patterns.
+
+---
+
+## Implementation Units
+
+### U1. Document the Phoenix integration contract
+
+- **Goal:** Establish a user-facing and contributor-facing contract for what the Phoenix integration is, what it supports today, and what “complete” means.
+- **Requirements:** R1, R2, R3, R4, R6, R13.
+- **Dependencies:** None.
+- **Files:**
+  - `apps/web/src/content/docs/docs/integrations/phoenix.mdx`
+  - `apps/web/src/content/docs/docs/evaluation/running-evals.mdx`
+  - `packages/phoenix-adapter/README.md`
+  - `packages/phoenix-adapter/docs/support-matrix.md`
+  - `packages/phoenix-adapter/docs/e2e-verification.md`
+  - `skills-data/agentv-eval-writer/SKILL.md`
+  - `skills-data/agentv-eval-writer/references/config-schema.json`
+- **Approach:** Add a Phoenix integration doc that distinguishes OTLP export from dataset/experiment adapter mode. State that AgentV YAML and scoring remain authoritative. Update the support matrix so unsupported families are grouped by reason instead of appearing as a flat first-pass list.
+- **Patterns to follow:** Existing Langfuse integration docs in `apps/web/src/content/docs/docs/integrations/langfuse.mdx`; OTel CLI docs in `apps/web/src/content/docs/docs/evaluation/running-evals.mdx`.
+- **Test scenarios:**
+  - Validate that docs include local Phoenix endpoint setup, API-key setup, project routing, privacy warning for content capture, adapter dry-run command, live experiment command, and unsupported-family behavior.
+  - Validate relative markdown links in new docs so the existing link checker can traverse them.
+  - Validate skill/config schema references use snake_case wire keys for any config examples.
+- **Verification:** A reader can tell when to use `--export-otel --otel-backend phoenix`, when to use the adapter, which evaluator families are supported, and why the adapter remains private.
+
+### U2. Add a Phoenix OTel backend preset
+
+- **Goal:** Allow normal AgentV eval runs to stream traces to Phoenix without using package-internal adapter commands.
+- **Requirements:** R1, R5, R16.
+- **Dependencies:** U1 for documentation alignment.
+- **Files:**
+  - `packages/core/src/observability/otel-exporter.ts`
+  - `packages/core/src/observability/types.ts`
+  - `apps/cli/src/commands/eval/run-eval.ts`
+  - `apps/cli/src/commands/eval/commands/run.ts`
+  - `packages/core/test/observability/otel-exporter.test.ts`
+  - `apps/cli/test/commands/eval/run.test.ts`
+- **Approach:** Extend `OTEL_BACKEND_PRESETS` with `phoenix`. Use `PHOENIX_COLLECTOR_ENDPOINT` when set and otherwise default to the local Phoenix OTLP traces endpoint. Add `Authorization: Bearer <PHOENIX_API_KEY>` when present and `x-project-name` when a Phoenix project name is configured. Keep generic `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` behavior intact.
+- **Patterns to follow:** Existing `langfuse`, `braintrust`, and `confident` presets in `packages/core/src/observability/otel-exporter.ts`; CLI OTel option resolution in `apps/cli/src/commands/eval/run-eval.ts`.
+- **Test scenarios:**
+  - With no Phoenix env vars, selecting the `phoenix` preset resolves to the local Phoenix OTLP traces endpoint.
+  - With `PHOENIX_COLLECTOR_ENDPOINT`, the preset uses the configured endpoint without appending duplicate path segments.
+  - With `PHOENIX_API_KEY`, the preset emits bearer auth headers.
+  - With a configured project name, the preset emits `x-project-name` and does not discard generic OTLP headers.
+  - Unknown OTel backend behavior remains unchanged.
+- **Verification:** `agentv eval ... --export-otel --otel-backend phoenix` can be documented as the primary observability path, with existing OTel export behavior unchanged for other backends.
+
+### U3. Complete deterministic evaluator parity in the adapter
+
+- **Goal:** Bring Phoenix adapter deterministic support up to AgentV's common deterministic assertion surface.
+- **Requirements:** R2, R3, R9, R10, R15.
+- **Dependencies:** U1.
+- **Files:**
+  - `packages/phoenix-adapter/src/evaluators/deterministic.ts`
+  - `packages/phoenix-adapter/src/evaluators/registry.ts`
+  - `packages/phoenix-adapter/src/evaluators/types.ts`
+  - `packages/phoenix-adapter/test/evaluators/deterministic.test.ts`
+  - `packages/phoenix-adapter/test/evaluators/registry.test.ts`
+  - `packages/phoenix-adapter/docs/support-matrix.md`
+- **Approach:** Add `contains-any`, `contains-all`, `icontains`, `icontains-any`, `icontains-all`, `starts-with`, and `ends-with`. Decide whether the adapter can call AgentV core deterministic graders directly; if not, mirror current semantics deliberately and add tests that compare expected outcomes against AgentV examples. Treat score-affecting fields (`weight`, `required`, `min_score`, `negate`) explicitly rather than silently ignoring them.
+- **Execution note:** Add characterization tests for current supported deterministic types before expanding behavior so regressions are visible.
+- **Patterns to follow:** Core deterministic factories in `packages/core/src/evaluation/registry/builtin-graders.ts`; current adapter shape in `packages/phoenix-adapter/src/evaluators/deterministic.ts`.
+- **Test scenarios:**
+  - `contains-any` passes when at least one configured string is present and fails when none are present.
+  - `contains-all` passes only when every configured string is present.
+  - `icontains` and `icontains-*` ignore case consistently.
+  - `starts-with` and `ends-with` follow AgentV trimming behavior.
+  - `negate` reverses pass/fail and score for deterministic assertions.
+  - Missing or malformed assertion values produce fail/unsupported explanations that are visible in Phoenix metadata.
+- **Verification:** The full dry-run unsupported report no longer lists extended deterministic string families as unsupported, and deterministic examples remain structurally green.
+
+### U4. Make full dry-run structural parity actionable
+
+- **Goal:** Resolve or explicitly exclude current full dry-run failures so the dry-run report can become a stronger regression signal.
+- **Requirements:** R15.
+- **Dependencies:** U1, U3.
+- **Files:**
+  - `packages/phoenix-adapter/src/parity/compare.ts`
+  - `packages/phoenix-adapter/src/parity/report.ts`
+  - `packages/phoenix-adapter/test/parity.test.ts`
+  - `packages/phoenix-adapter/docs/e2e-verification.md`
+  - `examples/features/matrix-evaluation/evals/dataset.eval.yaml`
+  - `examples/features/prompt-template-sdk/evals/dataset.eval.yaml`
+  - `examples/features/tool-trajectory-simple/evals/dataset.eval.yaml`
+  - `examples/features/weighted-graders/evals/dataset.eval.yaml`
+- **Approach:** Investigate the four known failures as source/baseline or loader-resolution issues, not Phoenix conversion crashes. Prefer fixing stale baselines or source references. If an eval intentionally diverges, encode a documented exclusion in adapter parity reporting rather than letting the full dry-run stay ambiguously red.
+- **Patterns to follow:** Baseline parsing in `packages/phoenix-adapter/src/parity/baselines.ts`; e2e notes in `packages/phoenix-adapter/docs/e2e-verification.md`.
+- **Test scenarios:**
+  - Matrix evaluation dry-run reports the expected source/baseline relationship after drift is resolved or excluded.
+  - Prompt-template SDK dry-run resolves prompt paths from the eval source context or documents a deliberate exclusion.
+  - Tool-trajectory-simple baseline count matches normalized cases or is excluded with a clear reason.
+  - Weighted-graders naming drift is resolved without accepting both `evaluator` and `grader` wire names as a new compatibility surface unless already shipped.
+- **Verification:** Full dry-run exits successfully or reports only explicitly documented non-blocking exclusions.
+
+### U5. Run real AgentV targets inside Phoenix experiments
+
+- **Goal:** Replace synthetic adapter task outputs with actual AgentV target execution so Phoenix experiments represent real AgentV behavior.
+- **Requirements:** R2, R3, R7, R8, R16.
+- **Dependencies:** U1, U3, U4.
+- **Files:**
+  - `packages/phoenix-adapter/src/phoenix/run-experiment.ts`
+  - `packages/phoenix-adapter/src/run/options.ts`
+  - `packages/phoenix-adapter/src/run/run-suite.ts`
+  - `packages/phoenix-adapter/src/agentv/load-spec.ts`
+  - `packages/phoenix-adapter/src/phoenix/types.ts`
+  - `packages/phoenix-adapter/test/phoenix-run-experiment.test.ts`
+  - `packages/phoenix-adapter/test/agentv-execution.test.ts`
+- **Approach:** Add an execution mode that invokes AgentV's programmatic evaluation/runtime for each Phoenix example. Preserve `agentv_test_id`, target, scores, assertions, duration, cost, token usage, and trace summary in Phoenix run/evaluation metadata. Keep dry-run/reference behavior separate and clearly named so it cannot be confused with live target parity.
+- **Technical design:** Directional guidance only: Phoenix task receives a dataset example, resolves the AgentV test case by stable metadata, invokes AgentV execution, returns actual candidate output, and stores AgentV result metadata for the evaluator to log.
+- **Patterns to follow:** Programmatic API in `packages/core/src/evaluation/evaluate.ts`; CLI orchestration in `apps/cli/src/commands/eval/run-eval.ts`; adapter payload metadata in `packages/phoenix-adapter/src/phoenix/datasets.ts`.
+- **Test scenarios:**
+  - Mock AgentV target returns a deterministic output and Phoenix task returns that output instead of synthesized assertion output.
+  - AgentV scores/assertions are preserved in Phoenix evaluation metadata with snake_case boundary keys where serialized.
+  - Missing target/configuration returns a clear run error and does not masquerade as an evaluator failure.
+  - Dry-run mode remains network-free and does not invoke Phoenix or real targets.
+- **Verification:** A live Phoenix smoke against a deterministic example creates Phoenix runs whose outputs match AgentV target outputs, not expected-output synthesis.
+
+### U6. Support LLM graders and rubrics with AgentV-authoritative scoring
+
+- **Goal:** Address the largest unsupported evaluator gap while preserving AgentV prompt/schema semantics.
+- **Requirements:** R3, R6, R11, R16.
+- **Dependencies:** U5.
+- **Files:**
+  - `packages/phoenix-adapter/src/evaluators/registry.ts`
+  - `packages/phoenix-adapter/src/evaluators/types.ts`
+  - `packages/phoenix-adapter/src/phoenix/run-experiment.ts`
+  - `packages/phoenix-adapter/test/evaluators/llm-grader.test.ts`
+  - `packages/phoenix-adapter/docs/support-matrix.md`
+  - `packages/phoenix-adapter/docs/e2e-verification.md`
+- **Approach:** First pass should run AgentV's `llm-grader` / `rubrics` path and log the resulting score, verdict, assertion details, and evidence into Phoenix evaluation metadata. Defer Phoenix-native model evaluator templates until exact semantic differences are understood and documented.
+- **Patterns to follow:** AgentV LLM grader implementation in `packages/core/src/evaluation/graders/llm-grader.ts`; prompt assembly in `packages/core/src/evaluation/graders/llm-grader-prompt.ts`; Phoenix evaluator wrapper in `packages/phoenix-adapter/src/phoenix/run-experiment.ts`.
+- **Test scenarios:**
+  - Checklist rubric result preserves per-rubric assertions and evidence in Phoenix metadata.
+  - Score-range rubric result preserves score, verdict, and details.
+  - LLM grader provider failure is surfaced as an evaluation error with clear explanation.
+  - Unsupported custom prompt modes remain visible if they cannot safely run in Phoenix adapter context.
+- **Verification:** A small rubric eval can run through Phoenix with AgentV-equivalent grader scores and visible rubric evidence.
+
+### U7. Add trace and metric grader support through Phoenix trace IDs
+
+- **Goal:** Enable trace-derived AgentV graders once Phoenix experiment task spans can be associated with each example.
+- **Requirements:** R3, R5, R12, R16.
+- **Dependencies:** U2, U5.
+- **Files:**
+  - `packages/phoenix-adapter/src/phoenix/run-experiment.ts`
+  - `packages/phoenix-adapter/src/phoenix/types.ts`
+  - `packages/phoenix-adapter/src/evaluators/registry.ts`
+  - `packages/phoenix-adapter/test/evaluators/trace-metrics.test.ts`
+  - `packages/core/src/evaluation/trace.ts`
+  - `apps/cli/src/commands/inspect/utils.ts`
+- **Approach:** Evaluate trace-based graders after Phoenix spans are available. Use Phoenix evaluator context trace IDs to fetch spans, translate them into AgentV trace summary/metric inputs, then run or mirror AgentV trace-family graders. Start with one `tool-trajectory` happy path and one `execution-metrics` threshold before broadening.
+- **Patterns to follow:** Trace summary logic in `packages/core/src/evaluation/trace.ts`; OTLP-derived trace parsing in `apps/cli/src/commands/inspect/utils.ts`; existing trace grader contracts in `packages/core/src/evaluation/graders/tool-trajectory.ts` and `packages/core/src/evaluation/graders/execution-metrics.ts`.
+- **Test scenarios:**
+  - A Phoenix trace with tool spans maps to expected tool-call counts and names.
+  - A missing trace ID produces a clear unsupported/failed explanation rather than an empty pass.
+  - Execution metrics can evaluate duration/token/cost fields only when the data is present.
+  - Trace lookup latency or Phoenix API failure is surfaced as an evaluation error with retry/defer guidance.
+- **Verification:** Trace-based adapter smoke demonstrates at least one `tool-trajectory` and one `execution-metrics` score generated from Phoenix-ingested span data.
+
+### U8. Decide and implement package publishing posture
+
+- **Goal:** Either intentionally keep the adapter private with clear repo-local usage, or make it publishable with complete release machinery.
+- **Requirements:** R13, R14.
+- **Dependencies:** U1, U3, U5; U6 if LLM/rubric support is part of the public promise.
+- **Files:**
+  - `packages/phoenix-adapter/package.json`
+  - `package.json`
+  - `scripts/release.ts`
+  - `scripts/publish.ts`
+  - `tsconfig.build.json`
+  - `.github/workflows/validate.yml`
+  - `packages/phoenix-adapter/README.md`
+  - `packages/phoenix-adapter/test/publish-smoke.test.ts`
+- **Approach:** Keep `@agentv/phoenix-adapter` private until the public contract is ready. If publishing is approved, remove `private`, add a package CLI/bin if needed, add `prepublishOnly`, include the package in release/publish scripts and build references, and add install smoke coverage.
+- **Patterns to follow:** Release script package lists in `scripts/release.ts` and `scripts/publish.ts`; package metadata in `packages/core/package.json` and `packages/eval/package.json`.
+- **Test scenarios:**
+  - Private posture: release and publish scripts intentionally omit the adapter and README documents repo-local usage.
+  - Publishable posture: release script updates adapter version with the other packages.
+  - Publishable posture: publish script includes the adapter only after build and package metadata are complete.
+  - Package install smoke imports the exported API and invokes the CLI help without relying on workspace-only paths.
+- **Verification:** Maintainers can tell whether the adapter is private by policy or publishable by release machinery, with no half-published state.
+
+---
+
+## Phased Delivery
+
+1. **Phase A: User-visible observability foundation** — U1 and U2. This gives users a supported Phoenix path through existing AgentV eval execution without overcommitting the adapter.
+2. **Phase B: Adapter correctness foundation** — U3 and U4. This reduces semantic drift and makes dry-run parity useful as a guardrail.
+3. **Phase C: Real experiment execution** — U5. This is the boundary where Phoenix experiments become trustworthy as AgentV eval runs.
+4. **Phase D: High-value evaluator depth** — U6 and U7. Add LLM/rubric and trace/metric support after execution and trace identity are in place.
+5. **Phase E: Release posture** — U8. Decide whether to publish only after the public promise is true.
+
+---
+
+## System-Wide Impact
+
+- **CLI and docs:** Adding a Phoenix OTel preset changes the documented backend list and must not regress existing `langfuse`, `braintrust`, `confident`, custom OTLP, or `--otel-file` behavior.
+- **Core observability:** Phoenix export should reuse existing OTel exporter architecture instead of adding Phoenix-specific SDK dependencies to normal eval execution.
+- **Adapter package:** Moving from synthetic task outputs to real AgentV execution changes the adapter from a conversion smoke into an execution integration; tests and docs must make that boundary obvious.
+- **Release process:** Publishing the adapter affects version synchronization, npm metadata, package install expectations, and CI validation.
+
+---
+
+## Risks & Dependencies
+
+- **Phoenix API/version churn:** The adapter currently pins `@arizeai/phoenix-client` and `@arizeai/phoenix-evals`; trace/evaluator APIs may evolve. Mitigate by isolating Phoenix API calls under `packages/phoenix-adapter/src/phoenix/`.
+- **Semantic drift from duplicated graders:** Adapter-local evaluator logic can diverge from AgentV core. Mitigate by wrapping core graders where feasible and adding parity tests when duplication remains.
+- **False confidence from unsupported scoring:** Ignoring unsupported assertions in averages can make results look better than they are. Mitigate by making unsupported status explicit and conservative.
+- **Live Phoenix test fragility:** CI may not have a Phoenix server. Mitigate with unit/contract tests in CI and optional live e2e documented separately.
+- **Scope creep toward alternate runtime:** Workspace, matrix, Docker, trials, and custom plugin semantics are tempting to adapt natively. Keep those out unless a later issue proves they are required.
+
+---
+
+## Open Decisions
+
+- **OD1. Public package timing:** Should `@agentv/phoenix-adapter` remain private until U5/U6 are complete, or be published earlier as experimental? Recommendation: keep private.
+- **OD2. CLI surface:** Should users run a separate adapter CLI or an integrated `agentv phoenix` subcommand? Recommendation: begin with docs/root scripts while private; consider `agentv phoenix` only if the adapter becomes public.
+- **OD3. LLM scorer authority:** Should Phoenix-native model evaluators ever be the primary score for AgentV-authored rubrics? Recommendation: AgentV-authoritative first, Phoenix-native only for documented optional comparisons.
+- **OD4. Dataset idempotency:** Should repeated adapter runs append to stable datasets, upsert examples, or create timestamped datasets? Recommendation: stable dataset names with explicit experiment runs, plus documented cleanup behavior.
+
+---
+
+## Documentation and Operational Notes
+
+- Add Phoenix setup to the docs site rather than only `packages/phoenix-adapter/README.md` so users can discover it alongside Langfuse and general OTel docs.
+- Document privacy implications of `--otel-capture-content`; Phoenix traces may include prompts, outputs, and tool I/O when content capture is enabled.
+- Keep `packages/phoenix-adapter/docs/e2e-verification.md` current whenever smoke or full dry-run expectations change.
+- Do not run unrelated dashboard deployment setup while working on this integration plan or implementation. If dashboard deployment setup is ever needed for separate work, note that `scripts/setup-dashboard-deployment.sh` supports `--no-start`, not `--skip-install`.
+
+---
+
+## Sources & Research
+
+- `packages/phoenix-adapter/**` — current adapter implementation, tests, support matrix, and e2e verification notes.
+- `packages/core/src/observability/otel-exporter.ts` — existing OTel backend preset architecture and span export behavior.
+- `apps/cli/src/commands/eval/run-eval.ts` and `apps/cli/src/commands/eval/commands/run.ts` — CLI OTel option parsing and exporter initialization.
+- `scripts/release.ts` and `scripts/publish.ts` — current package versioning and npm publishing scope.
+- `.github/workflows/validate.yml` — current Phoenix smoke in CI.
+- `packages/core/src/evaluation/registry/builtin-graders.ts` and `packages/core/src/evaluation/graders/**` — AgentV evaluator semantics to preserve.
+- Phoenix TypeScript experiments docs — `runExperiment`, `asExperimentEvaluator`, evaluator inputs, and `traceId` support.
+- Phoenix evaluator docs — CODE and model-backed evaluator expectations.
+- Phoenix OTel docs and endpoint FAQ — local endpoint, collector endpoint env vars, API key, shutdown lifecycle.
+- Phoenix OTLP project routing release note — `x-project-name` header support.

From 8dde43350d8e8b7be2b9e71b4caff7045a2ea68f Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Wed, 3 Jun 2026 06:41:57 +0200
Subject: [PATCH 2/2] feat(phoenix): add otel backend preset

---
 apps/cli/src/commands/eval/commands/run.ts    |   2 +-
 apps/cli/src/commands/eval/run-eval.ts        |   3 +-
 .../docs/docs/evaluation/running-evals.mdx    |  24 +++-
 .../docs/docs/integrations/phoenix.mdx        | 127 ++++++++++++++++++
 .../core/src/observability/otel-exporter.ts   |  23 ++++
 packages/core/src/observability/types.ts      |   2 +-
 .../test/observability/otel-exporter.test.ts  |  71 +++++++++-
 packages/phoenix-adapter/README.md            |   6 +
 .../phoenix-adapter/docs/e2e-verification.md  |  12 ++
 .../phoenix-adapter/docs/support-matrix.md    |  25 ++--
 skills-data/agentv-eval-writer/SKILL.md       |   3 +
 .../references/config-schema.json             |  21 +++
 12 files changed, 302 insertions(+), 17 deletions(-)
 create mode 100644 apps/web/src/content/docs/docs/integrations/phoenix.mdx

diff --git a/apps/cli/src/commands/eval/commands/run.ts b/apps/cli/src/commands/eval/commands/run.ts
index 38cabbb3..c6898e16 100644
--- a/apps/cli/src/commands/eval/commands/run.ts
+++ b/apps/cli/src/commands/eval/commands/run.ts
@@ -143,7 +143,7 @@ export const evalRunCommand = command({
     otelBackend: option({
       type: optional(string),
       long: 'otel-backend',
-      description: 'Use a backend preset (langfuse, braintrust, confident)',
+      description: 'Use a backend preset (langfuse, braintrust, confident, phoenix)',
     }),
     otelCaptureContent: flag({
       long: 'otel-capture-content',
diff --git a/apps/cli/src/commands/eval/run-eval.ts b/apps/cli/src/commands/eval/run-eval.ts
index 139d4780..1cb6e087 100644
--- a/apps/cli/src/commands/eval/run-eval.ts
+++ b/apps/cli/src/commands/eval/run-eval.ts
@@ -1172,7 +1172,8 @@ export async function runEvalCommand(
       if (options.otelBackend) {
         const preset = OTEL_BACKEND_PRESETS[options.otelBackend];
         if (preset) {
-          endpoint = preset.endpoint;
+          endpoint =
+            typeof preset.endpoint === 'function' ? preset.endpoint(process.env) : preset.endpoint;
           headers = preset.headers(process.env);
         } else {
           console.warn(`Unknown OTel backend preset: ${options.otelBackend}`);
diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
index f25cf6fa..2862014c 100644
--- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -125,7 +125,7 @@ OpenTelemetry-compatible backend.
 Stream traces directly to an observability backend during evaluation using `--export-otel`:
 
 ```bash
-# Use a backend preset (braintrust, langfuse, confident)
+# Use a backend preset (braintrust, langfuse, confident, phoenix)
 agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust
 
 # Include message content and tool I/O in spans (disabled by default for privacy)
@@ -179,6 +179,22 @@ export LANGFUSE_SECRET_KEY=sk-...
 agentv eval evals/my-eval.yaml --export-otel --otel-backend langfuse --otel-capture-content
 ```
 
+#### Phoenix
+
+```bash
+# Local Phoenix defaults to http://localhost:6006/v1/traces
+agentv eval evals/my-eval.yaml --export-otel --otel-backend phoenix
+
+# Hosted or remote Phoenix
+export PHOENIX_COLLECTOR_ENDPOINT=https://app.phoenix.arize.com/s/my-space
+export PHOENIX_API_KEY=px-...
+export PHOENIX_PROJECT_NAME=agentv-evals
+
+agentv eval evals/my-eval.yaml --export-otel --otel-backend phoenix --otel-capture-content
+```
+
+See [Phoenix](/docs/integrations/phoenix/) for project routing, privacy notes, and the separate repo-local dataset/experiment adapter.
+
 #### Custom OTLP Endpoint
 
 For backends not covered by presets, configure via environment variables:
@@ -400,6 +416,8 @@ Project-local YAML config takes precedence over home/global YAML config. AgentV
 execution:
   verbose: true
   keep_workspaces: false
+  export_otel: true
+  otel_backend: phoenix
   otel_file: .agentv/results/otel-{timestamp}.json
 ```
 
@@ -407,7 +425,11 @@ execution:
 |-------|---------------|------|---------|-------------|
 | `verbose` | `--verbose` | boolean | `false` | Enable verbose logging |
 | `keep_workspaces` | `--keep-workspaces` | boolean | `false` | Always keep temp workspaces after eval |
+| `export_otel` | `--export-otel` | boolean | `false` | Stream traces via OTLP/HTTP |
+| `otel_backend` | `--otel-backend` | string | none | Backend preset: `braintrust`, `langfuse`, `confident`, or `phoenix` |
 | `otel_file` | `--otel-file` | string | none | Write OTLP JSON trace to file |
+| `otel_capture_content` | `--otel-capture-content` | boolean | `false` | Include message and tool content in exported spans |
+| `otel_group_turns` | `--otel-group-turns` | boolean | `false` | Group multi-turn messages under `agentv.turn.N` spans |
 
 ### TypeScript config (`agentv.config.ts`)
 
diff --git a/apps/web/src/content/docs/docs/integrations/phoenix.mdx b/apps/web/src/content/docs/docs/integrations/phoenix.mdx
new file mode 100644
index 00000000..d8149abb
--- /dev/null
+++ b/apps/web/src/content/docs/docs/integrations/phoenix.mdx
@@ -0,0 +1,127 @@
+---
+title: Phoenix
+description: Export AgentV traces to Phoenix and understand the repo-local Phoenix adapter
+sidebar:
+  order: 2
+---
+
+AgentV integrates with [Arize Phoenix](https://arize.com/docs/phoenix/) through two separate surfaces:
+
+- **OTLP trace export** from normal `agentv eval` runs. This is the primary supported path for observing AgentV executions in Phoenix.
+- **Repo-local dataset/experiment adapter** in `packages/phoenix-adapter`. This keeps AgentV eval YAML as the source of truth while converting suites into Phoenix dataset and experiment payloads. The adapter is private and intentionally limited while parity work continues.
+
+AgentV scoring remains authoritative for AgentV-authored evals. Phoenix receives traces, run metadata, and adapter experiment artifacts; it does not replace AgentV's YAML loader, target runner, workspace lifecycle, or grader semantics.
+
+## Quick Start: Trace Export
+
+Start Phoenix locally or point AgentV at a hosted Phoenix collector endpoint:
+
+```bash
+# Local Phoenix default: http://localhost:6006
+agentv eval evals/my-eval.yaml --export-otel --otel-backend phoenix
+
+# Hosted or remote Phoenix
+export PHOENIX_COLLECTOR_ENDPOINT=https://app.phoenix.arize.com/s/my-space
+export PHOENIX_API_KEY=px-...
+export PHOENIX_PROJECT_NAME=agentv-evals
+
+agentv eval evals/my-eval.yaml --export-otel --otel-backend phoenix
+```
+
+The `phoenix` preset sends standard OTLP/HTTP traces to `{PHOENIX_COLLECTOR_ENDPOINT}/v1/traces`. If `PHOENIX_COLLECTOR_ENDPOINT` already ends in `/v1/traces`, AgentV uses it as-is. When unset, AgentV defaults to `http://localhost:6006/v1/traces`.
+
+## Environment Variables
+
+| Variable | Required | Description |
+| --- | --- | --- |
+| `PHOENIX_COLLECTOR_ENDPOINT` | no | Phoenix collector base URL or full OTLP traces URL. Defaults to `http://localhost:6006`. |
+| `PHOENIX_API_KEY` | hosted Phoenix | Adds `Authorization: Bearer ...` to OTLP exports. |
+| `PHOENIX_PROJECT_NAME` | no | Adds `x-project-name` for Phoenix project routing. |
+| `PHOENIX_PROJECT` | no | Fallback project name if `PHOENIX_PROJECT_NAME` is unset. |
+| `OTEL_EXPORTER_OTLP_HEADERS` | no | Extra OTLP headers, merged after preset headers. |
+
+Phoenix project routing via `x-project-name` requires Phoenix's OTLP HTTP endpoint support for that header. See Phoenix's [project setup docs](https://arize.com/docs/phoenix/tracing/how-to-tracing/setup-tracing/setup-projects) for the current behavior.
+
+## Config.yaml Alternative
+
+Set default Phoenix export in `.agentv/config.yaml`:
+
+```yaml
+execution:
+  export_otel: true
+  otel_backend: phoenix
+```
+
+Add content capture only when your Phoenix instance is approved to store prompts, outputs, and tool I/O:
+
+```yaml
+execution:
+  export_otel: true
+  otel_backend: phoenix
+  otel_capture_content: true
+```
+
+:::caution[Privacy]
+`--otel-capture-content` sends full message and tool content to Phoenix. Leave it disabled unless the data and Phoenix deployment meet your privacy requirements.
+:::
+
+## What Appears in Phoenix
+
+Each eval test case produces an `agentv.eval` trace with AgentV attributes such as test ID, suite, target, score, duration, token usage, and tool summary. With streaming providers, AgentV also emits model and tool spans. With `--otel-group-turns`, multi-turn eval messages are grouped under `agentv.turn.N` spans.
+
+```bash
+agentv eval evals/my-eval.yaml \
+  --export-otel \
+  --otel-backend phoenix \
+  --otel-group-turns
+```
+
+## Dataset/Experiment Adapter
+
+The repo-local `@agentv/phoenix-adapter` package converts AgentV eval YAML suites into Phoenix dataset payloads and can run Phoenix experiments for adapter verification:
+
+```bash
+bun --filter @agentv/phoenix-adapter phoenix:assert-smoke
+bun --filter @agentv/phoenix-adapter phoenix:dry-run
+```
+
+Use the adapter when you are developing or verifying Phoenix dataset/experiment parity. Use normal `agentv eval --export-otel --otel-backend phoenix` when you want to observe real AgentV eval runs.
+
+Current adapter support is intentionally small:
+
+| Family | Status |
+| --- | --- |
+| `contains`, `regex`, `equals`, `is-json` | Supported by the deterministic adapter |
+| Other deterministic string variants | Planned parity work |
+| `llm-grader`, `rubrics`, `code-grader`, trace and metric graders | Reported as unsupported |
+| Custom/plugin graders | Reported as unsupported by family name |
+
+Unsupported adapter entries stay visible in reports and do not block conversion unless `--fail-on-unsupported` is set. They should not be interpreted as passing scores.
+
+## Integration Contract
+
+- AgentV eval YAML remains the source of truth for test discovery, interpolation, assertion parsing, and metadata.
+- AgentV scoring remains authoritative unless a Phoenix-native evaluator is explicitly proven equivalent and documented.
+- Phoenix is optional observability and experiment infrastructure; it is not required for normal AgentV eval execution.
+- The adapter remains private until real AgentV target execution, deterministic parity, and release expectations are complete.
+
+## Troubleshooting
+
+### Traces do not appear
+
+Verify the collector endpoint and that Phoenix is listening:
+
+```bash
+echo "$PHOENIX_COLLECTOR_ENDPOINT"
+agentv eval evals/my-eval.yaml --export-otel --otel-backend phoenix
+```
+
+For local Phoenix, the preset expects Phoenix at `http://localhost:6006`.
+
+### Hosted Phoenix returns 401 or 403
+
+Check that `PHOENIX_API_KEY` is set and valid for the target Phoenix space.
+
+### Traces appear in the wrong project
+
+Set `PHOENIX_PROJECT_NAME` to the project that should receive the spans. Extra headers in `OTEL_EXPORTER_OTLP_HEADERS` are merged after preset headers, so they can override the preset if needed.
diff --git a/packages/core/src/observability/otel-exporter.ts b/packages/core/src/observability/otel-exporter.ts
index 73f1a98b..74e6b513 100644
--- a/packages/core/src/observability/otel-exporter.ts
+++ b/packages/core/src/observability/otel-exporter.ts
@@ -12,6 +12,13 @@ export type { OtelExportOptions, OtelBackendPreset };
 // Backend presets
 // ---------------------------------------------------------------------------
 
+function normalizePhoenixCollectorEndpoint(endpoint: string | undefined): string {
+  const base = (endpoint?.trim() || 'http://localhost:6006').replace(/\/+$/, '');
+  if (base.endsWith('/v1/traces')) return base;
+  if (base.endsWith('/v1')) return `${base}/traces`;
+  return `${base}/v1/traces`;
+}
+
 export const OTEL_BACKEND_PRESETS: Record<string, OtelBackendPreset> = {
   langfuse: {
     name: 'langfuse',
@@ -49,6 +56,22 @@ export const OTEL_BACKEND_PRESETS: Record<string, OtelBackendPreset> = {
       'x-confident-api-key': env.CONFIDENT_API_KEY ?? '',
     }),
   },
+  phoenix: {
+    name: 'phoenix',
+    endpoint: (env) => normalizePhoenixCollectorEndpoint(env.PHOENIX_COLLECTOR_ENDPOINT),
+    headers: (env) => {
+      const headers: Record<string, string> = {};
+      const apiKey = env.PHOENIX_API_KEY?.trim();
+      if (apiKey) {
+        headers.Authorization = `Bearer ${apiKey}`;
+      }
+      const projectName = (env.PHOENIX_PROJECT_NAME ?? env.PHOENIX_PROJECT)?.trim();
+      if (projectName) {
+        headers['x-project-name'] = projectName;
+      }
+      return headers;
+    },
+  },
 };
 
 // ---------------------------------------------------------------------------
diff --git a/packages/core/src/observability/types.ts b/packages/core/src/observability/types.ts
index 84fe94fd..94c76813 100644
--- a/packages/core/src/observability/types.ts
+++ b/packages/core/src/observability/types.ts
@@ -17,6 +17,6 @@ export interface OtelExportOptions {
 /** Preset configuration for a known observability backend. */
 export interface OtelBackendPreset {
   readonly name: string;
-  readonly endpoint: string;
+  readonly endpoint: string | ((env: Record<string, string | undefined>) => string);
   readonly headers: (env: Record<string, string | undefined>) => Record<string, string>;
 }
diff --git a/packages/core/test/observability/otel-exporter.test.ts b/packages/core/test/observability/otel-exporter.test.ts
index 9e2035a1..9322a4cf 100644
--- a/packages/core/test/observability/otel-exporter.test.ts
+++ b/packages/core/test/observability/otel-exporter.test.ts
@@ -5,23 +5,33 @@
 
 import { afterEach, describe, expect, it } from 'bun:test';
 import { OTEL_BACKEND_PRESETS, OtelTraceExporter } from '../../src/observability/otel-exporter.js';
+import type { OtelBackendPreset } from '../../src/observability/types.js';
 
 // ---------------------------------------------------------------------------
 // Backend presets
 // ---------------------------------------------------------------------------
 
 describe('OTel backend presets', () => {
+  function resolveEndpoint(
+    preset: OtelBackendPreset,
+    env: Record<string, string | undefined> = {},
+  ): string {
+    return typeof preset.endpoint === 'function' ? preset.endpoint(env) : preset.endpoint;
+  }
+
   describe('OTEL_BACKEND_PRESETS registry', () => {
-    it('contains langfuse, braintrust, and confident entries', () => {
+    it('contains langfuse, braintrust, confident, and phoenix entries', () => {
       expect(OTEL_BACKEND_PRESETS).toHaveProperty('langfuse');
       expect(OTEL_BACKEND_PRESETS).toHaveProperty('braintrust');
       expect(OTEL_BACKEND_PRESETS).toHaveProperty('confident');
+      expect(OTEL_BACKEND_PRESETS).toHaveProperty('phoenix');
     });
 
     it('each preset has name, endpoint, and headers function', () => {
       for (const [key, preset] of Object.entries(OTEL_BACKEND_PRESETS)) {
         expect(preset.name).toBe(key);
-        expect(typeof preset.endpoint).toBe('string');
+        expect(['function', 'string']).toContain(typeof preset.endpoint);
+        expect(typeof resolveEndpoint(preset)).toBe('string');
         expect(typeof preset.headers).toBe('function');
       }
     });
@@ -90,6 +100,63 @@ describe('OTel backend presets', () => {
       expect(preset.endpoint).toBe('https://otel.confident-ai.com/v1/traces');
     });
   });
+
+  describe('phoenix preset', () => {
+    const preset = OTEL_BACKEND_PRESETS.phoenix;
+
+    it('uses the local Phoenix OTLP traces endpoint by default', () => {
+      expect(resolveEndpoint(preset)).toBe('http://localhost:6006/v1/traces');
+    });
+
+    it('appends the OTLP traces path to PHOENIX_COLLECTOR_ENDPOINT', () => {
+      expect(
+        resolveEndpoint(preset, {
+          PHOENIX_COLLECTOR_ENDPOINT: 'https://app.phoenix.arize.com/s/my-space',
+        }),
+      ).toBe('https://app.phoenix.arize.com/s/my-space/v1/traces');
+    });
+
+    it('does not append duplicate OTLP traces path segments', () => {
+      expect(
+        resolveEndpoint(preset, {
+          PHOENIX_COLLECTOR_ENDPOINT: 'https://phoenix.example.com/v1/traces',
+        }),
+      ).toBe('https://phoenix.example.com/v1/traces');
+      expect(
+        resolveEndpoint(preset, {
+          PHOENIX_COLLECTOR_ENDPOINT: 'https://phoenix.example.com/v1/',
+        }),
+      ).toBe('https://phoenix.example.com/v1/traces');
+    });
+
+    it('adds bearer auth only when PHOENIX_API_KEY is set', () => {
+      expect(preset.headers({})).toEqual({});
+      expect(preset.headers({ PHOENIX_API_KEY: 'px-key-123' })).toEqual({
+        Authorization: 'Bearer px-key-123',
+      });
+    });
+
+    it('adds x-project-name from Phoenix project env vars', () => {
+      expect(preset.headers({ PHOENIX_PROJECT_NAME: 'agentv-evals' })).toEqual({
+        'x-project-name': 'agentv-evals',
+      });
+      expect(preset.headers({ PHOENIX_PROJECT: 'fallback-project' })).toEqual({
+        'x-project-name': 'fallback-project',
+      });
+    });
+
+    it('combines auth and project headers', () => {
+      expect(
+        preset.headers({
+          PHOENIX_API_KEY: 'px-key-123',
+          PHOENIX_PROJECT_NAME: 'agentv-evals',
+        }),
+      ).toEqual({
+        Authorization: 'Bearer px-key-123',
+        'x-project-name': 'agentv-evals',
+      });
+    });
+  });
 });
 
 // ---------------------------------------------------------------------------
diff --git a/packages/phoenix-adapter/README.md b/packages/phoenix-adapter/README.md
index 528400be..a740bc37 100644
--- a/packages/phoenix-adapter/README.md
+++ b/packages/phoenix-adapter/README.md
@@ -2,6 +2,12 @@
 
 Converts AgentV eval YAML suites into Phoenix datasets and can run Phoenix experiments while keeping AgentV eval files as the source of truth.
 
+This package is repo-local and private while Phoenix experiment parity is still being completed. For observing real AgentV eval runs in Phoenix, use the core OTel preset instead:
+
+```bash
+agentv eval evals/my-eval.yaml --export-otel --otel-backend phoenix
+```
+
 Current adapter support is intentionally small: deterministic `contains`, `regex`, `equals`, and `is-json` assertions run through a Phoenix CODE evaluator. LLM, code, trace, composite, metric, and custom evaluator families are reported as unsupported instead of being silently mapped.
 
 ```bash
diff --git a/packages/phoenix-adapter/docs/e2e-verification.md b/packages/phoenix-adapter/docs/e2e-verification.md
index bf0cad46..db99f15e 100644
--- a/packages/phoenix-adapter/docs/e2e-verification.md
+++ b/packages/phoenix-adapter/docs/e2e-verification.md
@@ -1,5 +1,17 @@
 # E2E Verification
 
+## Live OTel Export
+
+Normal AgentV eval runs can export traces to Phoenix without using the adapter package:
+
+```bash
+agentv eval examples/features/assert/evals/dataset.eval.yaml \
+  --export-otel \
+  --otel-backend phoenix
+```
+
+The `phoenix` OTel preset defaults to `http://localhost:6006/v1/traces`. For remote Phoenix, set `PHOENIX_COLLECTOR_ENDPOINT`, `PHOENIX_API_KEY`, and optionally `PHOENIX_PROJECT_NAME`.
+
 ## Dry-Run Conversion
 
 Dry-run mode discovers AgentV example evals, normalizes cases through `@agentv/core`, creates Phoenix dataset payloads in memory, and compares test IDs against AgentV baselines where present.
diff --git a/packages/phoenix-adapter/docs/support-matrix.md b/packages/phoenix-adapter/docs/support-matrix.md
index 6726bbf0..51db8970 100644
--- a/packages/phoenix-adapter/docs/support-matrix.md
+++ b/packages/phoenix-adapter/docs/support-matrix.md
@@ -2,22 +2,25 @@
 
 This workspace converts AgentV example evals into Phoenix dataset and experiment payloads.
 
-| AgentV family | Phoenix status |
+For observing real AgentV eval runs in Phoenix, use the core OTel preset:
+
+```bash
+agentv eval evals/my-eval.yaml --export-otel --otel-backend phoenix
+```
+
+The adapter remains repo-local/private until real AgentV execution and broader scorer parity are complete.
+
+| AgentV family | Phoenix adapter status |
 | --- | --- |
 | `contains` | Supported by deterministic adapter |
 | `regex` | Supported by deterministic adapter |
 | `equals` | Supported by deterministic adapter |
 | `is-json` | Supported by deterministic adapter |
-| `llm-grader` | Reported as unsupported in first pass |
-| `rubrics` | Reported as unsupported in first pass |
-| `code-grader` | Reported as unsupported in first pass |
-| `composite` | Reported as unsupported in first pass |
-| `field-accuracy` | Reported as unsupported in first pass |
-| `execution-metrics` | Reported as unsupported in first pass |
-| `tool-trajectory` | Reported as unsupported in first pass |
-| `cost` | Reported as unsupported in first pass |
-| `latency` | Reported as unsupported in first pass |
-| `trial-output-consistency` | Reported as unsupported in first pass |
+| `contains-any`, `contains-all`, `icontains`, `icontains-any`, `icontains-all`, `starts-with`, `ends-with` | Planned deterministic parity work |
+| `llm-grader`, `rubrics` | Unsupported until AgentV-authoritative LLM/rubric scoring is wired into Phoenix experiment evaluation |
+| `code-grader` | Unsupported until adapter runs real AgentV execution and code-grader context |
+| `composite`, `field-accuracy`, `trial-output-consistency` | Unsupported until composed scorer semantics are mapped without changing AgentV scoring authority |
+| `execution-metrics`, `tool-trajectory`, `cost`, `latency` | Unsupported until Phoenix trace IDs/spans can be associated with AgentV test cases |
 | Other custom families | Reported as unsupported with the family name |
 
 Unsupported does not block conversion unless `--fail-on-unsupported` is set. The report keeps unsupported families visible so parity gaps are explicit.
diff --git a/skills-data/agentv-eval-writer/SKILL.md b/skills-data/agentv-eval-writer/SKILL.md
index 9291b304..3d52893b 100644
--- a/skills-data/agentv-eval-writer/SKILL.md
+++ b/skills-data/agentv-eval-writer/SKILL.md
@@ -543,6 +543,9 @@ agentv eval <file.yaml> [--test-id <id>] [--target <name>] [--dry-run] [--thresh
 # Run with OTLP JSON file (importable by OTel backends)
 agentv eval <file.yaml> --otel-file traces/eval.otlp.json
 
+# Stream traces to an OTel backend preset
+agentv eval <file.yaml> --export-otel --otel-backend phoenix
+
 # Run a single assertion in isolation (no API keys needed)
 agentv eval assert <grader-name> --agent-output "..." --agent-input "..."
 
diff --git a/skills-data/agentv-eval-writer/references/config-schema.json b/skills-data/agentv-eval-writer/references/config-schema.json
index f95da9c4..4cc4a77c 100644
--- a/skills-data/agentv-eval-writer/references/config-schema.json
+++ b/skills-data/agentv-eval-writer/references/config-schema.json
@@ -37,10 +37,31 @@
           "description": "Always keep temp workspaces after eval (equivalent to --keep-workspaces)",
           "default": false
         },
+        "export_otel": {
+          "type": "boolean",
+          "description": "Stream traces via OTLP/HTTP to the configured endpoint or backend preset (equivalent to --export-otel)",
+          "default": false
+        },
+        "otel_backend": {
+          "type": "string",
+          "description": "Use a known OTel backend preset (equivalent to --otel-backend).",
+          "enum": ["braintrust", "langfuse", "confident", "phoenix"],
+          "examples": ["phoenix"]
+        },
         "otel_file": {
           "type": "string",
           "description": "Write OTLP JSON trace to this path (equivalent to --otel-file). Supports {timestamp} placeholder.",
           "examples": [".agentv/results/otel-{timestamp}.json"]
+        },
+        "otel_capture_content": {
+          "type": "boolean",
+          "description": "Include message and tool content in exported OTel spans (equivalent to --otel-capture-content). Disabled by default for privacy.",
+          "default": false
+        },
+        "otel_group_turns": {
+          "type": "boolean",
+          "description": "Group multi-turn messages under agentv.turn.N spans (equivalent to --otel-group-turns).",
+          "default": false
         }
       },
       "additionalProperties": false