feat(wan22): add VBench accuracy scorer for WAN 2.2 video generation by wu6u3tw · Pull Request #315 · mlcommons/endpoints

wu6u3tw · 2026-05-18T20:56:50Z

Summary

- Adds VBenchScorer (scorer_id="vbench") that scores video-generation outputs on six VBench dimensions (subject_consistency, background_consistency, motion_smoothness, dynamic_degree, appearance_style, scene) and returns the mean of the per-dimension aggregates.
- Runs VBench in an isolated uv subproject under examples/09_Wan22_VideoGen_Example/accuracy/ (vbench pins transformers==4.33.2 + numpy<2, incompatible with the parent env). VBenchScorer.score() shells out to vbench_runner.py via uv run --project, so the benchmark process never imports vbench.
- VideoGenAdapter.decode_response now mirrors video_path into response_output so the event log carries it to the scorer.
- Wires ScorerMethod.VBENCH into the config enum and adds offline_wan22_accuracy.yaml as a peer to the existing perf example.
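
The "mean of the per-dimension aggregates" in the summary can be sketched as below. This is a minimal illustration, not the code in scoring.py: it assumes the runner's results have already been reduced to a flat mapping of dimension name to aggregate score, and the names `VBENCH_DIMENSIONS` and `mean_vbench_score` are hypothetical. The named ValueError mirrors the behaviour a later commit in this PR describes for a missing dimension.

```python
from statistics import fmean

# The six dimensions named in the PR summary.
VBENCH_DIMENSIONS = (
    "subject_consistency",
    "background_consistency",
    "motion_smoothness",
    "dynamic_degree",
    "appearance_style",
    "scene",
)


def mean_vbench_score(results: dict) -> float:
    """Average the per-dimension aggregates into one accuracy score.

    Assumption: `results` maps dimension name -> aggregate score. The real
    VBench results file may nest these values differently.
    """
    missing = [dim for dim in VBENCH_DIMENSIONS if dim not in results]
    if missing:
        # Name the missing dimensions instead of surfacing a bare KeyError.
        raise ValueError(f"VBench results missing dimensions: {missing}")
    return fmean(results[dim] for dim in VBENCH_DIMENSIONS)
```

Checking for every expected dimension up front means a partial results file fails with a readable message rather than part-way through aggregation.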

Test plan

- Unit: tests/unit/evaluation/test_scoring.py covers registration, mean-of-6-dims with mocked subprocess.run, missing-subproject FileNotFoundError.
- Unit: tests/unit/videogen/test_adapter.py + integration adapter test updated for the new response_output contract.
- examples/09_Wan22_VideoGen_Example/offline_wan22_accuracy.yaml validates as OfflineBenchmarkConfig.
- Subproject uv lock resolves (115 packages).
- pre-commit run --all-files clean.
- End-to-end VBench run on a GPU host with uv sync in the accuracy subproject: not yet executed.

Draft for code review.

🤖 Generated with Claude Code

github-actions · 2026-05-18T20:57:00Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces the VBenchScorer to evaluate video generation accuracy across six VBench dimensions. To manage conflicting dependencies, VBench is executed within an isolated uv subproject using a standalone runner script. The changes include a comprehensive runbook, configuration templates, and updates to the VideoGenAdapter to pass video paths to the scorer. Review feedback highlights the need for more robust error handling when parsing VBench results and for symlink creation that is resilient to prompts containing directory separators.

viraatc

lgtm, thanks!

viraatc

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough

13 issues posted inline after re-verifying each finding against current HEAD and deduping against the 3 existing comments. Findings tagged [Codex+Claude] were flagged independently by both reviewers (boosted confidence).

Highest-priority: a path-traversal vector in _stage_videos (prompt-as-filename allows .. escape of staged_dir, then unlink + symlink_to mutate paths outside report_dir). Also: failed video generations silently dropped from the VBench denominator (overstates accuracy), and the example YAML cannot actually run because the benchmark requires a performance dataset.
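
To make the path-traversal finding concrete: when the prompt is used verbatim as a filename, a prompt such as `../../x` resolves outside the staging directory. The sketch below shows one way to close that hole, following the "reject `..`, replace `/`" approach the follow-up commit describes. The function names and the `_` replacement character are assumptions for illustration, not the code in `_stage_videos`.

```python
from pathlib import Path


def staged_name(prompt: str, index: int) -> str:
    """Build a '{prompt}-{index}.mp4' filename that is a single path component."""
    if ".." in prompt:
        # Conservative: this also rejects harmless prompts that contain "..".
        raise ValueError(f"prompt contains '..': {prompt!r}")
    # Replace directory separators so the prompt cannot introduce subpaths.
    safe = prompt.replace("/", "_").replace("\\", "_")
    return f"{safe}-{index}.mp4"


def staged_path(staged_dir: Path, prompt: str, index: int) -> Path:
    """Return the destination path and verify it sits directly in staged_dir."""
    root = staged_dir.resolve()
    dst = root / staged_name(prompt, index)
    # Defensive check; dst itself is not resolved, so an existing symlink at
    # that location is not followed here.
    if dst.parent != root:
        raise ValueError(f"staged path escapes {root}: {dst}")
    return dst
```

With this in place, any later unlink or symlink_to call only ever touches entries directly inside the staging directory.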

viraatc · 2026-05-20T21:47:16Z

Review Council — Summary

Reviewed by: Codex + Claude | Depth: thorough

Found 13 issues across 3 files. All findings re-verified against HEAD (8b32e3a9) and deduped against the 3 existing inline comments. [Codex+Claude] tags mark issues flagged independently by both reviewers.

🔴 Must Fix (critical/high)

| File | Line | Reviewer(s) | Category | Summary |
|---|---|---|---|---|
| `evaluation/scoring.py` | 948 | Claude | security | Path-traversal via `..` in prompt-as-filename (related to gemini's L951 `/` finding but strictly worse) |
| `offline_wan22_accuracy.yaml` | 36 | Codex | bug | Accuracy-only YAML cannot run: `setup_benchmark()` requires a performance dataset |
| `evaluation/scoring.py` | 986 | Codex+Claude | data-integrity | Failed generations dropped from VBench denominator → overstated accuracy + `n_repeats=0` + `NaN` mean |
| `evaluation/scoring.py` | 1026 | Claude | bug | `KeyError` on missing VBench dimension crashes finalize after ~30 min scoring |
| `evaluation/scoring.py` | 975 | Claude | error-handling | `subprocess.run` no timeout, no captured output, inherits stdio → hangs / opaque failures |
| `evaluation/scoring.py` | 951 | Claude | bug | `src.resolve(strict=False)` silently produces dangling symlinks |
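
The data-integrity row above can be illustrated with a small sketch. It assumes a failed generation appears as an empty output string and that `n_repeats` is the number of rows per prompt; both are assumptions about the scorer's inputs, and `split_failed` and `safe_mean` are hypothetical names. The point is only that the repeat count should come from the pre-filter total and that an empty score list should yield `None` rather than `NaN`.

```python
from statistics import fmean
from typing import List, Optional, Tuple


def split_failed(outputs: List[str], num_prompts: int) -> Tuple[List[str], int]:
    """Drop failed rows, but compute n_repeats before filtering.

    Assumption: a failed generation shows up as an empty output string.
    """
    n_repeats = len(outputs) // num_prompts if num_prompts else 0
    succeeded = [out for out in outputs if out]
    return succeeded, n_repeats


def safe_mean(values: List[float]) -> Optional[float]:
    """Return None when nothing was scored, instead of NaN or an exception."""
    return fmean(values) if values else None
```

A caller can then report `(None, n_repeats)` for an all-failed run and log `len(outputs) - len(succeeded)` so that dropped generations stay visible next to the score.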

🟡 Should Fix (medium)

| File | Line | Reviewer(s) | Category | Summary |
|---|---|---|---|---|
| `evaluation/scoring.py` | 944 | Codex+Claude | bug | Stale symlinks in `vbench_videos/` across re-scoring with fewer repeats |
| `evaluation/scoring.py` | 922 | Claude | api-contract | VBench-specific ctor args not plumbed through `AccuracyConfig` |
| `evaluation/scoring.py` | 893 | Claude | api-contract | `extractor` required-but-unused for VBench; example YAML lies with `identity_extractor` |
| `evaluation/scoring.py` | 977 | Claude | testing | No tests for subprocess failure, malformed results, missing dim, prompts with `/`/`..` |

🔵 Consider (low)

| File | Line | Reviewer(s) | Category | Summary |
|---|---|---|---|---|
| `evaluation/scoring.py` | 859 | Codex+Claude | design | `_DEFAULT_VBENCH_PROJECT_PATH` breaks under wheel install |
| `accuracy/vbench_runner.py` | 60 | Claude | bug | Silent CPU fallback for a GPU-required evaluator |
| `accuracy/vbench_runner.py` | 75 | Claude | error-handling | No try/except wrapping `vb.evaluate` |
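
On the wheel-install row: a default path derived from the source tree stops existing once the package is installed from a wheel, so the follow-up commit resolves the subproject via `$VBENCH_PROJECT_PATH`. A minimal sketch of that kind of lookup follows; `DEFAULT_PROJECT_PATH` and `resolve_vbench_project` are hypothetical stand-ins, and checking for `pyproject.toml` is an assumption about what counts as a valid subproject.

```python
import os
from pathlib import Path

# Hypothetical stand-in for the source-tree default in scoring.py.
DEFAULT_PROJECT_PATH = Path("examples/09_Wan22_VideoGen_Example/accuracy")


def resolve_vbench_project(default: Path = DEFAULT_PROJECT_PATH) -> Path:
    """Prefer $VBENCH_PROJECT_PATH, fall back to the default, fail if absent."""
    override = os.environ.get("VBENCH_PROJECT_PATH")
    project = Path(override) if override else default
    if not (project / "pyproject.toml").is_file():
        raise FileNotFoundError(
            f"VBench subproject not found at {project}; "
            "set VBENCH_PROJECT_PATH to the accuracy subproject directory"
        )
    return project
```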

Inline thread: #315 (review)

Addresses 13 review findings from the Codex+Claude review council and gemini-code-assist on PR mlcommons#315. Highlights:

- Sanitize prompt-as-filename (reject "..", replace "/") so a hostile prompt cannot escape the staged dir.
- src.resolve(strict=True) and staged_dir wipe so missing videos and stale "{prompt}-{N}.mp4" from prior runs surface immediately.
- Subprocess hardening: DEVNULL stdin, captured stdout tee'd to vbench_subprocess.log, configurable timeout, stderr tail in error.
- Named ValueError on missing VBench dim instead of bare KeyError.
- score() returns (None, n_repeats) on empty/all-failed; n_repeats computed from pre-filter total so single failures do not zero it out.
- VBench-specific kwargs plumbed via AccuracyConfig.extras; extractor optional when Scorer.REQUIRES_EXTRACTOR=False (set on VBenchScorer).
- Wheel-safe project-path resolution via $VBENCH_PROJECT_PATH env var.
- vbench_runner.py fails fast on missing CUDA (with --allow-cpu opt-out) and wraps vb.evaluate with structured JSON error on stderr.
- Example YAML adds the required performance phase; drops the meaningless identity_extractor placeholder.
- 9 new unit tests cover the failure modes raised in review.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
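
The subprocess-hardening item can be sketched roughly as follows. This is an illustration under assumptions, not the code in scoring.py: `run_runner`, the one-hour default timeout, and the 20-line stderr tail are hypothetical, and the log is written after the process exits rather than tee'd live.

```python
import subprocess
import sys
from pathlib import Path


def run_runner(cmd: list, log_path: Path, timeout_s: float = 3600.0) -> str:
    """Run a scoring command without inheriting stdio and return its stdout."""
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,  # never block waiting on a terminal
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"runner timed out after {timeout_s}s: {cmd}") from exc
    # Keep the full stdout on disk for later debugging.
    log_path.write_text(proc.stdout)
    if proc.returncode != 0:
        # Surface only the end of stderr so the error message stays readable.
        tail = "\n".join(proc.stderr.splitlines()[-20:])
        raise RuntimeError(
            f"runner exited with code {proc.returncode}; stderr tail:\n{tail}"
        )
    return proc.stdout
```

For example, `run_runner([sys.executable, "-c", "print('ok')"], Path("vbench_subprocess.log"))` returns the child's stdout and leaves a copy in the log file.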

Scores generated videos on six VBench dimensions (subject_consistency, background_consistency, motion_smoothness, dynamic_degree, appearance_style, scene) and averages the per-dimension aggregates as the accuracy score. The MLPerf WAN 2.2 prompt set is a subset of the VBench standard prompt suite, so we use VBench's default evaluate() flow with its bundled prompt-to-dimension lookup.

VBench pins transformers==4.33.2 and numpy<2, both incompatible with the parent endpoints package (transformers==5.5.0, numpy==2.4.4). To keep the parent lockfile solvable and the accuracy environment reproducible, vbench lives in an isolated uv subproject under examples/09_Wan22_VideoGen_Example/accuracy/, with its own pyproject.toml and uv.lock. VBenchScorer.score() shells out to a vbench_runner.py script in that subproject via uv run --project, so the benchmark process never imports vbench.

VideoGenAdapter now mirrors video_path into response_output so the event log carries it to the scorer (event publishing only forwards response_output, not metadata).

Added offline_wan22_accuracy.yaml as a peer to the existing perf example, wired to eval_method: vbench.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
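
The out-of-process pattern described in this commit message amounts to building a `uv run --project` command line and handing it to a subprocess. The sketch below only constructs that command. The `--videos-path` and `--output-dir` flags are hypothetical placeholders, since the runner's real arguments are not listed in this conversation.

```python
from pathlib import Path


def build_runner_command(project_dir: Path, videos_dir: Path, output_dir: Path) -> list:
    """Build a command that runs vbench_runner.py inside the uv subproject."""
    return [
        "uv", "run", "--project", str(project_dir),
        "python", str(project_dir / "vbench_runner.py"),
        "--videos-path", str(videos_dir),   # hypothetical flag name
        "--output-dir", str(output_dir),    # hypothetical flag name
    ]
```

Because the parent process only passes paths on the command line and reads results back from disk, it never needs vbench, or its pinned transformers and numpy versions, in its own environment.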

- VBenchScorer.score(): drop empty-output rows (failed queries whose record.data is None produce output="") before _stage_videos, so that Path("").resolve() no longer stages the repo cwd as a video and corrupts the run.
- _stage_videos: dst.unlink(missing_ok=True) before symlink, so re-scoring an existing report_dir is safe.
- vbench_runner.py: when --full-info-json is omitted, default to the VBench_full_info.json bundled in the vbench package. The default `vbench_standard` mode (required for scene + appearance_style) needs a real file path; None crashes inside VBench.
- offline_wan22_accuracy.yaml: prerequisites comment now points to `uv sync` in the accuracy subproject instead of `pip install vbench`, which would have broken the parent lockfile (vbench pins transformers==4.33.2 and numpy<2).
- AGENTS.md VideoGen entry: replace stale "switch to video_bytes for accuracy mode" claim with a description of VBenchScorer and the out-of-process subproject pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
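
The first two fixes in this commit can be sketched together. The function below is a hypothetical stand-in for `_stage_videos`: it assumes rows arrive as (prompt, video_path) pairs, that repeat indices start at 0, and that prompts are already safe to use as filenames. It uses `resolve(strict=True)`, which is what the later review fix adopts.

```python
from pathlib import Path


def stage_videos(rows: list, staged_dir: Path) -> list:
    """Symlink each (prompt, video_path) row as '{prompt}-{N}.mp4'."""
    staged_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    staged = []
    for prompt, output in rows:
        if not output:
            # Path("").resolve() is the current directory, so an empty output
            # would otherwise be staged as if the cwd were a video.
            continue
        src = Path(output).resolve(strict=True)  # raises if the video is missing
        index = counts.get(prompt, 0)
        counts[prompt] = index + 1
        dst = staged_dir / f"{prompt}-{index}.mp4"
        dst.unlink(missing_ok=True)  # makes re-scoring the same report_dir safe
        dst.symlink_to(src)
        staged.append(dst)
    return staged
```

Calling the function twice on the same directory succeeds because each destination is removed before the symlink is recreated.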

3-stage progressive validation for the accuracy subproject:

1. vbench_runner.py arg plumbing + VBench bundled JSON resolution
2. VBenchScorer.score() against a hand-picked video subset
3. End-to-end inference-endpoint benchmark from-config

Captures the unit-test gap (mocks vbench entirely) so the next person can validate VBench API drift, prompt-suite coverage, and filename-convention mismatches before marking the PR ready.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Stage C real-world validation (18-node DP=72, 248 prompts × 5 generations) surfaced two adapter gaps that this commit closes:

1. trtllm-serve builds we have in deployment return raw mp4 bytes for response_format=video_path. The previous adapter blindly called json.loads on the response and either raised UnicodeDecodeError (for bytes starting with NUL) or routed a 28 MB payload through the worker→main ZMQ frame and tripped the 16 MB recv buffer. The new decode path sniffs the ISO BMFF `ftyp` box at offset 4 and, when matched, persists the bytes to $INFERENCE_ENDPOINT_VIDEOGEN_FALLBACK_DIR inside the worker, returning a QueryResult that carries only the small path string. ftyp-only sniff means HTTP error bodies and malformed JSON still raise as before. Missing env var raises a RuntimeError naming the variable.
2. The MLPerf WAN 2.2 request schema includes num_frames=81 and boundary_ratio=0.875. Without them, trtllm-serve falls back to defaults that may differ from the published MLPerf submission configuration. Both are now declared on VideoPathRequest with the MLPerf-canonical defaults; dataset rows can still override.

Adds two unit tests covering the binary-fallback path (env set → file written, env unset → RuntimeError). All 97 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
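
The binary-fallback decode path can be sketched as follows. An ISO BMFF (mp4) file starts with a 4-byte box size followed by the box type `ftyp`, so checking bytes 4 through 7 is enough to tell raw video from a JSON body. The function names, the `video_path` JSON key, and the random output filename are assumptions for illustration; only the environment variable name comes from the commit message.

```python
import json
import os
import uuid
from pathlib import Path

FALLBACK_ENV = "INFERENCE_ENDPOINT_VIDEOGEN_FALLBACK_DIR"


def looks_like_mp4(body: bytes) -> bool:
    """ISO BMFF files carry the box type 'ftyp' at byte offset 4."""
    return len(body) >= 8 and body[4:8] == b"ftyp"


def decode_video_response(body: bytes) -> str:
    """Return a video path from either a JSON body or raw mp4 bytes."""
    if looks_like_mp4(body):
        fallback_dir = os.environ.get(FALLBACK_ENV)
        if not fallback_dir:
            raise RuntimeError(f"{FALLBACK_ENV} must be set to persist raw mp4 responses")
        out_dir = Path(fallback_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        out_path = out_dir / f"{uuid.uuid4().hex}.mp4"  # hypothetical naming scheme
        out_path.write_bytes(body)
        # Only the short path string travels onward, not the video payload.
        return str(out_path)
    # Anything else, including HTTP error bodies, takes the JSON path and
    # raises on malformed input as before.
    return json.loads(body)["video_path"]
```

Writing the file inside the worker keeps the multi-megabyte payload off the inter-process channel, which is the failure the commit message describes.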

wu6u3tw · 2026-05-21T20:48:08Z

Modified code per review. Please check it @viraatc @arekay-nv.

arekay-nv

LGTM. Thanks!

wu6u3tw marked this pull request as ready for review May 18, 2026 20:57

wu6u3tw requested a review from a team May 18, 2026 20:57

gemini-code-assist Bot reviewed May 18, 2026

Comment thread src/inference_endpoint/evaluation/scoring.py Outdated

Comment thread src/inference_endpoint/evaluation/scoring.py Outdated

wu6u3tw requested review from arekay-nv and nv-alicheng May 19, 2026 18:01

viraatc reviewed May 20, 2026

Comment thread AGENTS.md

viraatc approved these changes May 20, 2026

viraatc reviewed May 20, 2026

wu6u3tw and others added 6 commits May 21, 2026 13:37

Apply suggestion from @gemini-code-assist[bot]

b7b274b

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

wu6u3tw force-pushed the feat/vbench-eval branch from 705f087 to ce9d048 Compare May 21, 2026 20:44

arekay-nv approved these changes May 22, 2026

wu6u3tw merged commit 3131730 into mlcommons:main May 22, 2026
7 checks passed

github-actions Bot locked and limited conversation to collaborators May 22, 2026
