feat(wan22): add VBench accuracy scorer for WAN 2.2 video generation #315
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces the VBenchScorer to evaluate video generation accuracy across six MLPerf dimensions. To manage conflicting dependencies, VBench is executed within an isolated uv subproject using a standalone runner script. The changes include a comprehensive runbook, configuration templates, and updates to the VideoGenAdapter to pass video paths to the scorer. Review feedback highlights the need for more robust error handling when parsing VBench results and ensuring that symlink creation is resilient to prompts containing directory separators.
viraatc
left a comment
Review Council — Multi-AI Code Review
Reviewed by: Codex + Claude | Depth: thorough
13 issues posted inline after re-verifying each finding against current HEAD and deduping against the 3 existing comments. Findings tagged [Codex+Claude] were flagged independently by both reviewers (boosted confidence).
Highest-priority: a path-traversal vector in _stage_videos (prompt-as-filename allows .. escape of staged_dir, then unlink + symlink_to mutate paths outside report_dir). Also: failed video generations silently dropped from the VBench denominator (overstates accuracy), and the example YAML cannot actually run because the benchmark requires a performance dataset.
Review Council - Summary
Reviewed by: Codex + Claude | Depth: thorough
Found 13 issues across 3 files. All findings re-verified against HEAD (
🔴 Must Fix (critical/high)
🟡 Should Fix (medium)
🔵 Consider (low)
Inline thread: #315 (review)
Addresses 13 review findings from the Codex+Claude review council and gemini-code-assist on PR mlcommons#315.

Highlights:
- Sanitize prompt-as-filename (reject "..", replace "/") so a hostile prompt cannot escape the staged dir.
- src.resolve(strict=True) and staged_dir wipe so missing videos and stale "{prompt}-{N}.mp4" from prior runs surface immediately.
- Subprocess hardening: DEVNULL stdin, captured stdout tee'd to vbench_subprocess.log, configurable timeout, stderr tail in error.
- Named ValueError on missing VBench dim instead of bare KeyError.
- score() returns (None, n_repeats) on empty/all-failed; n_repeats computed from pre-filter total so single failures do not zero it out.
- VBench-specific kwargs plumbed via AccuracyConfig.extras; extractor optional when Scorer.REQUIRES_EXTRACTOR=False (set on VBenchScorer).
- Wheel-safe project-path resolution via $VBENCH_PROJECT_PATH env var.
- vbench_runner.py fails fast on missing CUDA (with --allow-cpu opt-out) and wraps vb.evaluate with structured JSON error on stderr.
- Example YAML adds the required performance phase; drops the meaningless identity_extractor placeholder.
- 9 new unit tests cover the failure modes raised in review.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
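The first two highlights (filename sanitization and strict source resolution) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `_stage_videos`: the helper names `safe_stem` and `stage_video` are hypothetical, and replacing "/" with "_" is an assumption, since the commit message does not say which replacement character is used.

```python
from pathlib import Path


def safe_stem(prompt: str) -> str:
    """Turn a prompt into a single path component, or raise (hypothetical helper)."""
    if ".." in prompt:
        # Reject outright rather than try to repair a traversal attempt.
        raise ValueError(f"prompt contains '..': {prompt!r}")
    # Assumption: "_" as the replacement character for directory separators.
    stem = prompt.replace("/", "_")
    if not stem.strip():
        raise ValueError("prompt is empty after sanitizing")
    return stem


def stage_video(src: Path, staged_dir: Path, prompt: str, index: int) -> Path:
    """Symlink src into staged_dir as '{prompt}-{index}.mp4' (hypothetical helper)."""
    # strict=True raises FileNotFoundError when the video is missing.
    src = src.resolve(strict=True)
    root = staged_dir.resolve()
    dst = root / f"{safe_stem(prompt)}-{index}.mp4"
    # Defence in depth: the link must be a direct child of the staged dir.
    if dst.parent != root:
        raise ValueError(f"staged path escapes {root}: {dst}")
    # Remove any stale link so re-scoring the same directory is safe.
    dst.unlink(missing_ok=True)
    dst.symlink_to(src)
    return dst
```

With this shape, a prompt such as `"../../etc/passwd"` fails before any filesystem call, and a prompt such as `"a cat/dog"` is staged as `a cat_dog-0.mp4` inside the staged directory.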
Scores generated videos on six VBench dimensions (subject_consistency, background_consistency, motion_smoothness, dynamic_degree, appearance_style, scene) and averages the per-dimension aggregates as the accuracy score. The MLPerf WAN 2.2 prompt set is a subset of the VBench standard prompt suite, so we use VBench's default evaluate() flow with the bundled prompt-to-dimension lookup.

VBench pins transformers==4.33.2 and numpy<2, both incompatible with the parent endpoints package (transformers==5.5.0, numpy==2.4.4). To keep the parent lockfile solvable and the accuracy environment reproducible, vbench lives in an isolated uv subproject under examples/09_Wan22_VideoGen_Example/accuracy/, with its own pyproject.toml and uv.lock. VBenchScorer.score() shells out to a vbench_runner.py script in that subproject via uv run --project, so the benchmark process never imports vbench.

VideoGenAdapter now mirrors video_path into response_output so the event log carries it to the scorer (event publishing only forwards response_output, not metadata).

Added offline_wan22_accuracy.yaml as a peer to the existing perf example, wired to eval_method: vbench.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
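The out-of-process pattern described above might look roughly like the sketch below. It is an illustration under stated assumptions: `run_runner` and `mean_score` are hypothetical names, the child process here is a stand-in Python one-liner rather than the real `uv run --project ... vbench_runner.py` invocation, and the assumption that the runner reports per-dimension scores as a JSON object on stdout is not confirmed by the PR text.

```python
import json
import subprocess
import sys

# The six dimensions named in the commit message.
DIMENSIONS = (
    "subject_consistency",
    "background_consistency",
    "motion_smoothness",
    "dynamic_degree",
    "appearance_style",
    "scene",
)


def run_runner(cmd: list[str], timeout_s: float = 60.0) -> dict[str, float]:
    """Run a scoring command in a child process and parse JSON from stdout."""
    proc = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # the child can never block waiting for input
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on a hang
    )
    if proc.returncode != 0:
        # Surface only the tail of stderr so the error stays readable.
        tail = "\n".join(proc.stderr.splitlines()[-20:])
        raise RuntimeError(f"runner exited with {proc.returncode}:\n{tail}")
    return json.loads(proc.stdout)


def mean_score(per_dim: dict[str, float]) -> float:
    """Average the per-dimension aggregates, naming any missing dimension."""
    missing = [d for d in DIMENSIONS if d not in per_dim]
    if missing:
        raise ValueError(f"VBench result is missing dimensions: {missing}")
    return sum(per_dim[d] for d in DIMENSIONS) / len(DIMENSIONS)


# Stand-in child process: prints a fixed score for every dimension.
fake_scores = {d: 0.5 for d in DIMENSIONS}
cmd = [sys.executable, "-c", f"import json; print(json.dumps({fake_scores!r}))"]
print(mean_score(run_runner(cmd)))  # prints 0.5
```

Because the parent only exchanges a command line and text with the child, it never needs the child's dependencies installed, which is the property the isolated subproject relies on.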
- VBenchScorer.score(): drop empty-output rows (failed queries whose record.data is None produce output="") before _stage_videos, so Path("").resolve() can no longer stage the repo cwd as a video and corrupt the run.
- _stage_videos: dst.unlink(missing_ok=True) before symlink, so re-scoring an existing report_dir is safe.
- vbench_runner.py: when --full-info-json is omitted, default to the VBench_full_info.json bundled in the vbench package. The default `vbench_standard` mode (required for scene + appearance_style) needs a real file path; None crashes inside VBench.
- offline_wan22_accuracy.yaml: prerequisites comment now points to `uv sync` in the accuracy subproject instead of `pip install vbench`, which would have broken the parent lockfile (vbench pins transformers==4.33.2 and numpy<2).
- AGENTS.md VideoGen entry: replace stale "switch to video_bytes for accuracy mode" claim with a description of VBenchScorer and the out-of-process subproject pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
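The empty-output bug in the first item follows from how pathlib treats an empty string: `Path("")` is the current directory, so resolving it yields the working directory rather than an error. A minimal sketch of the filter, assuming outputs arrive as a list of path strings (the helper name `drop_failed` is hypothetical):

```python
from pathlib import Path


def drop_failed(outputs: list[str]) -> tuple[list[Path], int]:
    """Return (usable video paths, pre-filter total); hypothetical helper."""
    total = len(outputs)
    # Failed queries are assumed to surface as empty strings.
    kept = [Path(o) for o in outputs if o]
    return kept, total


# Why the filter matters: an empty path silently resolves to the cwd.
assert Path("").resolve() == Path.cwd().resolve()

kept, total = drop_failed(["a.mp4", "", "b.mp4"])
print(len(kept), total)  # prints 2 3
```

Returning the pre-filter total alongside the kept paths lets the caller see how many generations failed instead of hiding them.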
3-stage progressive validation for the accuracy subproject:
1. vbench_runner.py arg plumbing + VBench bundled JSON resolution
2. VBenchScorer.score() against a hand-picked video subset
3. End-to-end inference-endpoint benchmark from-config

Captures the unit-test gap (mocks vbench entirely) so the next person can validate VBench API drift, prompt-suite coverage, and filename-convention mismatches before marking the PR ready.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Stage C real-world validation (18-node DP=72, 248 prompts × 5 generations) surfaced two adapter gaps that this commit closes:

1. trtllm-serve builds we have in deployment return raw mp4 bytes for response_format=video_path. The previous adapter blindly called json.loads on the response and either raised UnicodeDecodeError (for bytes starting with NUL) or routed a 28 MB payload through the worker→main ZMQ frame and tripped the 16 MB recv buffer. The new decode path sniffs the ISO BMFF `ftyp` box at offset 4 and, when matched, persists the bytes to $INFERENCE_ENDPOINT_VIDEOGEN_FALLBACK_DIR inside the worker, returning a QueryResult that carries only the small path string. ftyp-only sniff means HTTP error bodies and malformed JSON still raise as before. Missing env var raises a RuntimeError naming the variable.
2. The MLPerf WAN 2.2 request schema includes num_frames=81 and boundary_ratio=0.875. Without them, trtllm-serve falls back to defaults that may differ from the published MLPerf submission configuration. Both are now declared on VideoPathRequest with the MLPerf-canonical defaults; dataset rows can still override.

Adds two unit tests covering the binary-fallback path (env set → file written, env unset → RuntimeError). All 97 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
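The binary-fallback decode path in item 1 can be sketched as below. This is a simplified illustration rather than the adapter's code: `decode_video_response` is a hypothetical name, the random file name and the `"video_path"` JSON key are assumptions, and the sketch returns a plain path string instead of a QueryResult.

```python
import json
import os
import uuid
from pathlib import Path

FALLBACK_ENV = "INFERENCE_ENDPOINT_VIDEOGEN_FALLBACK_DIR"


def looks_like_mp4(body: bytes) -> bool:
    # An ISO BMFF file starts with a box header: a 4-byte size followed by
    # the 4-byte box type, so the "ftyp" marker sits at offset 4.
    return body[4:8] == b"ftyp"


def decode_video_response(body: bytes) -> str:
    """Return a video path from either a JSON body or raw mp4 bytes."""
    if looks_like_mp4(body):
        fallback_dir = os.environ.get(FALLBACK_ENV)
        if not fallback_dir:
            raise RuntimeError(
                f"{FALLBACK_ENV} must be set to persist raw mp4 responses"
            )
        # Assumption: a random file name; the real naming scheme is not stated.
        out = Path(fallback_dir) / f"{uuid.uuid4().hex}.mp4"
        out.write_bytes(body)
        # Only the short path string travels onward, not the video payload.
        return str(out)
    # Anything else must still be valid JSON; errors propagate unchanged.
    # Assumption: the JSON body carries the path under a "video_path" key.
    return json.loads(body)["video_path"]
```

Checking only for `ftyp` keeps the fallback narrow: an HTTP error page or truncated JSON does not match the marker, so it still reaches `json.loads` and fails loudly.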
Modified code per review. Please check it @viraatc @arekay-nv.
Summary
- Adds VBenchScorer(scorer_id="vbench") that scores video-generation outputs on six VBench dimensions (subject_consistency, background_consistency, motion_smoothness, dynamic_degree, appearance_style, scene) and returns the mean of the per-dimension aggregates.
- VBench lives in an isolated uv subproject under examples/09_Wan22_VideoGen_Example/accuracy/ (vbench pins transformers==4.33.2 + numpy<2, incompatible with the parent env).
- VBenchScorer.score() shells out to vbench_runner.py via uv run --project, so the benchmark process never imports vbench.
- VideoGenAdapter.decode_response now mirrors video_path into response_output so the event log carries it to the scorer.
- Wires ScorerMethod.VBENCH into the config enum and adds offline_wan22_accuracy.yaml as a peer to the existing perf example.

Test plan
- tests/unit/evaluation/test_scoring.py covers registration, mean-of-6-dims with mocked subprocess.run, missing-subproject FileNotFoundError.
- tests/unit/videogen/test_adapter.py + integration adapter test updated for the new response_output contract.
- examples/09_Wan22_VideoGen_Example/offline_wan22_accuracy.yaml validates as OfflineBenchmarkConfig.
- uv lock resolves (115 packages).
- pre-commit run --all-files clean.
- uv sync in the accuracy subproject: not yet executed.

Draft for code review.
🤖 Generated with Claude Code