feat(wan22): add VBench accuracy scorer for WAN 2.2 video generation #315

Merged
wu6u3tw merged 6 commits into mlcommons:main from wu6u3tw:feat/vbench-eval
May 22, 2026
@wu6u3tw
Collaborator

@wu6u3tw wu6u3tw commented May 18, 2026

Summary

  • Adds VBenchScorer (scorer_id="vbench") that scores video-generation outputs on six VBench dimensions (subject_consistency, background_consistency, motion_smoothness, dynamic_degree, appearance_style, scene) and returns the mean of the per-dimension aggregates.
  • Runs VBench in an isolated uv subproject under examples/09_Wan22_VideoGen_Example/accuracy/ (vbench pins transformers==4.33.2 + numpy<2, incompatible with the parent env). VBenchScorer.score() shells out to vbench_runner.py via uv run --project, so the benchmark process never imports vbench.
  • VideoGenAdapter.decode_response now mirrors video_path into response_output so the event log carries it to the scorer.
  • Wires ScorerMethod.VBENCH into the config enum and adds offline_wan22_accuracy.yaml as a peer to the existing perf example.
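The out-of-process pattern and the mean-of-six aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `build_vbench_command` and `mean_of_dimensions` are hypothetical names, and the sketch assumes the runner reports one aggregate float per dimension.

```python
from __future__ import annotations

from pathlib import Path
from statistics import fmean

# The six dimensions listed in the summary above.
VBENCH_DIMENSIONS = (
    "subject_consistency",
    "background_consistency",
    "motion_smoothness",
    "dynamic_degree",
    "appearance_style",
    "scene",
)


def build_vbench_command(project_dir: Path, runner_args: list[str]) -> list[str]:
    """Build an argv that runs vbench_runner.py inside the isolated uv project.

    Hypothetical helper: the caller supplies the runner's own arguments.
    """
    runner = project_dir / "vbench_runner.py"
    return ["uv", "run", "--project", str(project_dir), "python", str(runner), *runner_args]


def mean_of_dimensions(per_dimension: dict[str, float]) -> float:
    """Average the per-dimension aggregates; fail loudly if one is missing."""
    missing = [d for d in VBENCH_DIMENSIONS if d not in per_dimension]
    if missing:
        raise ValueError(f"VBench results missing dimensions: {missing}")
    return fmean(per_dimension[d] for d in VBENCH_DIMENSIONS)
```

Because the parent process only builds an argv and parses results, it never needs vbench (or its pinned transformers and numpy) to be importable.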

Test plan

  • Unit: tests/unit/evaluation/test_scoring.py covers registration, mean-of-6-dims with mocked subprocess.run, missing-subproject FileNotFoundError.
  • Unit: tests/unit/videogen/test_adapter.py + integration adapter test updated for the new response_output contract.
  • examples/09_Wan22_VideoGen_Example/offline_wan22_accuracy.yaml validates as OfflineBenchmarkConfig.
  • Subproject uv lock resolves (115 packages).
  • pre-commit run --all-files clean.
  • End-to-end VBench run on a GPU host with uv sync in the accuracy subproject — not yet executed.

Draft for code review.

🤖 Generated with Claude Code

@github-actions

github-actions Bot commented May 18, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@wu6u3tw wu6u3tw marked this pull request as ready for review May 18, 2026 20:57
@wu6u3tw wu6u3tw requested a review from a team May 18, 2026 20:57
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the VBenchScorer to evaluate video generation accuracy across six MLPerf dimensions. To manage conflicting dependencies, VBench is executed within an isolated uv subproject using a standalone runner script. The changes include a comprehensive runbook, configuration templates, and updates to the VideoGenAdapter to pass video paths to the scorer. Review feedback highlights the need for more robust error handling when parsing VBench results and ensuring that symlink creation is resilient to prompts containing directory separators.

Comment thread src/inference_endpoint/evaluation/scoring.py Outdated
Comment thread src/inference_endpoint/evaluation/scoring.py Outdated
@wu6u3tw wu6u3tw requested review from arekay-nv and nv-alicheng May 19, 2026 18:01
Comment thread AGENTS.md
Collaborator

@viraatc viraatc left a comment

lgtm, thanks!

Collaborator

@viraatc viraatc left a comment

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough

13 issues posted inline after re-verifying each finding against current HEAD and deduping against the 3 existing comments. Findings tagged [Codex+Claude] were flagged independently by both reviewers (boosted confidence).

Highest-priority: a path-traversal vector in _stage_videos (prompt-as-filename allows .. escape of staged_dir, then unlink + symlink_to mutate paths outside report_dir). Also: failed video generations silently dropped from the VBench denominator (overstates accuracy), and the example YAML cannot actually run because the benchmark requires a performance dataset.
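For illustration, one way to close the prompt-as-filename traversal is to force the prompt into a single path component before building the symlink. This is a hedged sketch, not the PR's `_stage_videos`: `safe_stem` and `stage_video` are hypothetical names, and replacing separators with `_` is an assumption.

```python
from __future__ import annotations

from pathlib import Path


def safe_stem(prompt: str) -> str:
    """Turn a prompt into a single path component, or raise ValueError."""
    if ".." in prompt:
        # Coarse check: this also rejects prompts containing an ellipsis.
        raise ValueError(f"prompt contains '..': {prompt!r}")
    # Assumption: separators become underscores.
    stem = prompt.replace("/", "_").replace("\\", "_")
    if not stem.strip():
        raise ValueError("prompt is empty after sanitizing")
    return stem


def stage_video(staged_dir: Path, prompt: str, index: int, src: Path) -> Path:
    """Symlink src into staged_dir as '{prompt}-{index}.mp4'."""
    root = staged_dir.resolve()
    dst = root / f"{safe_stem(prompt)}-{index}.mp4"
    # Defense in depth: the link must sit directly inside staged_dir.
    if dst.parent != root:
        raise ValueError(f"refusing to stage outside {root}: {dst}")
    # strict=True raises FileNotFoundError instead of creating a dangling link.
    target = src.resolve(strict=True)
    dst.unlink(missing_ok=True)  # makes re-staging into the same directory safe
    dst.symlink_to(target)
    return dst
```

Resolving the source strictly before touching the destination means a missing video fails the run before any existing link is removed.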

Comment thread src/inference_endpoint/evaluation/scoring.py Outdated
Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22_accuracy.yaml
Comment thread src/inference_endpoint/evaluation/scoring.py
Comment thread src/inference_endpoint/evaluation/scoring.py Outdated
Comment thread src/inference_endpoint/evaluation/scoring.py Outdated
Comment thread src/inference_endpoint/evaluation/scoring.py
Comment thread src/inference_endpoint/evaluation/scoring.py Outdated
Comment thread src/inference_endpoint/evaluation/scoring.py
Comment thread examples/09_Wan22_VideoGen_Example/accuracy/vbench_runner.py
Comment thread examples/09_Wan22_VideoGen_Example/accuracy/vbench_runner.py
@viraatc
Collaborator

viraatc commented May 20, 2026

Review Council — Summary

Reviewed by: Codex + Claude | Depth: thorough

Found 13 issues across 3 files. All findings re-verified against HEAD (8b32e3a9) and deduped against the 3 existing inline comments. [Codex+Claude] tags mark issues flagged independently by both reviewers.

🔴 Must Fix (critical/high)

| File | Line | Reviewer(s) | Category | Summary |
|---|---|---|---|---|
| evaluation/scoring.py | 948 | Claude | security | Path-traversal via `..` in prompt-as-filename (related to gemini's L951 `/` finding but strictly worse) |
| offline_wan22_accuracy.yaml | 36 | Codex | bug | Accuracy-only YAML cannot run — setup_benchmark() requires a performance dataset |
| evaluation/scoring.py | 986 | Codex+Claude | data-integrity | Failed generations dropped from VBench denominator → overstated accuracy + n_repeats=0 + NaN mean |
| evaluation/scoring.py | 1026 | Claude | bug | KeyError on missing VBench dimension crashes finalize after ~30 min scoring |
| evaluation/scoring.py | 975 | Claude | error-handling | subprocess.run no timeout, no captured output, inherits stdio → hangs / opaque failures |
| evaluation/scoring.py | 951 | Claude | bug | src.resolve(strict=False) silently produces dangling symlinks |

🟡 Should Fix (medium)

| File | Line | Reviewer(s) | Category | Summary |
|---|---|---|---|---|
| evaluation/scoring.py | 944 | Codex+Claude | bug | Stale symlinks in vbench_videos/ across re-scoring with fewer repeats |
| evaluation/scoring.py | 922 | Claude | api-contract | VBench-specific ctor args not plumbed through AccuracyConfig |
| evaluation/scoring.py | 893 | Claude | api-contract | extractor required-but-unused for VBench; example YAML lies with identity_extractor |
| evaluation/scoring.py | 977 | Claude | testing | No tests for subprocess failure, malformed results, missing dim, prompts with `/` / `..` |

🔵 Consider (low)

| File | Line | Reviewer(s) | Category | Summary |
|---|---|---|---|---|
| evaluation/scoring.py | 859 | Codex+Claude | design | _DEFAULT_VBENCH_PROJECT_PATH breaks under wheel install |
| accuracy/vbench_runner.py | 60 | Claude | bug | Silent CPU fallback for a GPU-required evaluator |
| accuracy/vbench_runner.py | 75 | Claude | error-handling | No try/except wrapping vb.evaluate |

Inline thread: #315 (review)
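To make the data-integrity finding concrete, the sketch below contrasts a mean taken only over successful generations with one that keeps failures in the denominator. It is an illustration under stated assumptions: `summarize` is a hypothetical helper, a failed generation is represented as `None`, and counting failures as zero is only one possible policy, not something the review prescribes.

```python
from __future__ import annotations

from statistics import fmean


def summarize(scores: list[float | None]) -> dict[str, float | int | None]:
    """Compare two ways of averaging when some generations failed (None)."""
    ok = [s for s in scores if s is not None]
    return {
        "n_total": len(scores),  # counted before filtering out failures
        "n_failed": len(scores) - len(ok),
        # Dropping failures shrinks the denominator and inflates the mean.
        "mean_over_successes": fmean(ok) if ok else None,
        # Assumed policy: a failed generation contributes a score of 0.
        "mean_failures_as_zero": (sum(ok) / len(scores)) if scores else None,
    }


# Four successes at 0.8 plus one failure: the two means differ.
report = summarize([0.8, 0.8, 0.8, 0.8, None])
```

Returning `None` instead of a number when nothing succeeded avoids reporting a NaN mean for an all-failed run.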

wu6u3tw added a commit to wu6u3tw/endpoints that referenced this pull request May 21, 2026
Addresses 13 review findings from the Codex+Claude review council and
gemini-code-assist on PR mlcommons#315. Highlights:

- Sanitize prompt-as-filename (reject "..", replace "/") so a hostile
  prompt cannot escape the staged dir.
- src.resolve(strict=True) and staged_dir wipe so missing videos and
  stale "{prompt}-{N}.mp4" from prior runs surface immediately.
- Subprocess hardening: DEVNULL stdin, captured stdout tee'd to
  vbench_subprocess.log, configurable timeout, stderr tail in error.
- Named ValueError on missing VBench dim instead of bare KeyError.
- score() returns (None, n_repeats) on empty/all-failed; n_repeats
  computed from pre-filter total so single failures do not zero it out.
- VBench-specific kwargs plumbed via AccuracyConfig.extras; extractor
  optional when Scorer.REQUIRES_EXTRACTOR=False (set on VBenchScorer).
- Wheel-safe project-path resolution via $VBENCH_PROJECT_PATH env var.
- vbench_runner.py fails fast on missing CUDA (with --allow-cpu opt-out)
  and wraps vb.evaluate with structured JSON error on stderr.
- Example YAML adds the required performance phase; drops the
  meaningless identity_extractor placeholder.
- 9 new unit tests cover the failure modes raised in review.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
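The subprocess hardening listed in this commit message could look roughly like the sketch below. It is an assumption-laden illustration rather than the merged code: `run_hardened` is a hypothetical name, and the log is written after the process exits instead of being tee'd while it runs.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def run_hardened(
    cmd: list[str], log_path: Path, timeout_s: float, tail_lines: int = 20
) -> str:
    """Run cmd with no stdin, captured output, a timeout, and a useful error."""
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,  # the child can never block on input
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=timeout_s,
            check=False,  # the return code is inspected manually below
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"command timed out after {timeout_s}s: {cmd}") from exc
    log_path.write_text(proc.stdout)
    if proc.returncode != 0:
        # Surface only the end of stderr so the error message stays readable.
        tail = "\n".join(proc.stderr.splitlines()[-tail_lines:])
        raise RuntimeError(
            f"command exited with {proc.returncode}; stderr tail:\n{tail}"
        )
    return proc.stdout
```

`subprocess.run` kills the child when the timeout expires, so a hung scorer surfaces as an exception instead of stalling the benchmark.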
wu6u3tw and others added 6 commits May 21, 2026 13:37
Scores generated videos on six VBench dimensions (subject_consistency,
background_consistency, motion_smoothness, dynamic_degree,
appearance_style, scene) and averages the per-dimension aggregates as
the accuracy score. The MLPerf WAN 2.2 prompt set is a subset of
VBench's standard prompt suite, so we use VBench's default evaluate()
flow with its bundled prompt-to-dimension lookup.

VBench pins transformers==4.33.2 and numpy<2, both incompatible with
the parent endpoints package (transformers==5.5.0, numpy==2.4.4). To
keep the parent lockfile solvable and the accuracy environment
reproducible, vbench lives in an isolated uv subproject under
examples/09_Wan22_VideoGen_Example/accuracy/, with its own
pyproject.toml and uv.lock. VBenchScorer.score() shells out to a
vbench_runner.py script in that subproject via uv run --project,
so the benchmark process never imports vbench.

VideoGenAdapter now mirrors video_path into response_output so the
event log carries it to the scorer (event publishing only forwards
response_output, not metadata). Added offline_wan22_accuracy.yaml as
a peer to the existing perf example, wired to eval_method: vbench.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- VBenchScorer.score(): drop empty-output rows (failed queries whose
  record.data is None produce output="") before _stage_videos, so
  Path("").resolve() no longer stages the repo cwd as a video and
  corrupts the run.
- _stage_videos: dst.unlink(missing_ok=True) before symlink, so
  re-scoring an existing report_dir is safe.
- vbench_runner.py: when --full-info-json is omitted, default to
  the VBench_full_info.json bundled in the vbench package. The
  default `vbench_standard` mode (required for scene + appearance_style)
  needs a real file path; None crashes inside VBench.
- offline_wan22_accuracy.yaml: prerequisites comment now points to
  `uv sync` in the accuracy subproject instead of `pip install vbench`,
  which would have broken the parent lockfile (vbench pins
  transformers==4.33.2 and numpy<2).
- AGENTS.md VideoGen entry: replace stale "switch to video_bytes for
  accuracy mode" claim with a description of VBenchScorer and the
  out-of-process subproject pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3-stage progressive validation for the accuracy subproject:
1. vbench_runner.py arg plumbing + VBench bundled JSON resolution
2. VBenchScorer.score() against a hand-picked video subset
3. End-to-end inference-endpoint benchmark from-config

Captures the unit-test gap (mocks vbench entirely) so the next
person can validate VBench API drift, prompt-suite coverage, and
filename-convention mismatches before marking the PR ready.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Stage C real-world validation (18-node DP=72, 248 prompts × 5 generations)
surfaced two adapter gaps that this commit closes:

1. trtllm-serve builds we have in deployment return raw mp4 bytes for
   response_format=video_path. The previous adapter blindly called
   json.loads on the response and either raised UnicodeDecodeError (for
   bytes starting with NUL) or routed a 28 MB payload through the
   worker→main ZMQ frame and tripped the 16 MB recv buffer. The new
   decode path sniffs the ISO BMFF `ftyp` box at offset 4 and, when
   matched, persists the bytes to $INFERENCE_ENDPOINT_VIDEOGEN_FALLBACK_DIR
   inside the worker, returning a QueryResult that carries only the
   small path string. ftyp-only sniff means HTTP error bodies and
   malformed JSON still raise as before. Missing env var raises a
   RuntimeError naming the variable.

2. The MLPerf WAN 2.2 request schema includes num_frames=81 and
   boundary_ratio=0.875. Without them, trtllm-serve falls back to
   defaults that may differ from the published MLPerf submission
   configuration. Both are now declared on VideoPathRequest with the
   MLPerf-canonical defaults; dataset rows can still override.

Adds two unit tests covering the binary-fallback path (env set → file
written, env unset → RuntimeError). All 97 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
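The `ftyp` sniff and binary fallback described in item 1 of this commit message can be sketched as below. This is a minimal illustration, not the adapter's code: `looks_like_mp4` and `decode_response` are hypothetical names, the random filename is an assumption, and the JSON body is assumed to carry a `video_path` field.

```python
from __future__ import annotations

import json
import os
import uuid
from pathlib import Path

# Environment variable named in the commit message.
FALLBACK_ENV = "INFERENCE_ENDPOINT_VIDEOGEN_FALLBACK_DIR"


def looks_like_mp4(body: bytes) -> bool:
    """ISO BMFF files start with a 4-byte box size, then the box type 'ftyp'."""
    return len(body) >= 8 and body[4:8] == b"ftyp"


def decode_response(body: bytes) -> str:
    """Return a video path, persisting raw mp4 bytes to disk when needed."""
    if looks_like_mp4(body):
        fallback_dir = os.environ.get(FALLBACK_ENV)
        if not fallback_dir:
            raise RuntimeError(f"{FALLBACK_ENV} must be set to persist raw mp4 responses")
        out = Path(fallback_dir) / f"{uuid.uuid4().hex}.mp4"  # assumed naming scheme
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_bytes(body)
        return str(out)  # only the short path string travels onward
    # Anything else is parsed as JSON, so error bodies still raise as before.
    return json.loads(body)["video_path"]  # assumed field name
```

Checking only the `ftyp` box keeps the fallback narrow: an HTTP error body or malformed JSON does not match it and still fails in the JSON branch.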
@wu6u3tw wu6u3tw force-pushed the feat/vbench-eval branch from 705f087 to ce9d048 Compare May 21, 2026 20:44
@wu6u3tw
Collaborator Author

wu6u3tw commented May 21, 2026

Modified code per review. Please check it @viraatc @arekay-nv.

Collaborator

@arekay-nv arekay-nv left a comment

LGTM. Thanks!

@wu6u3tw wu6u3tw merged commit 3131730 into mlcommons:main May 22, 2026
7 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators May 22, 2026