feat(text-metrics): split qa_accuracy by davidberenstein1957 · Pull Request #645 · PrunaAI/pruna

davidberenstein1957 · 2026-04-28T13:03:56Z

Summary

Splits qa_accuracy into its own stacked PR, adds QAAccuracyMetric, and wires GenEval to qa_accuracy + clip_score.

Stack Position

Base: PR feat(infrastructure): add VLM base classes and utilities #638 (feat/vlm-pr-2-infrastructure)
Next: PR feat(text-metrics): split oneig_alignment #646 (feat/vlm-pr-3b-oneig-alignment)
Final integration: PR feat(e2e-tests): stacked e2e after split metrics #641 (feat/vlm-pr-5-e2e-tests)
Canonical umbrella reference: PR feat(evaluation): add VLMMetrics #545 (feat/metrics-vlm-support)

Files

src/pruna/evaluation/metrics/metric_qa_accuracy.py
src/pruna/evaluation/benchmarks.py

Test Plan

uv run pytest tests/evaluation/test_text_metrics.py -k qa_accuracy

Review Focus

Aggregation behavior (all_or_nothing)
GenEval benchmark wiring

Review Flow (Order)

Review the stack in this exact order:

feat(vendor): add LLM2Vec embedding model #637 vendor
feat(infrastructure): add VLM base classes and utilities #638 infrastructure
feat(text-metrics): split qa_accuracy #645 qa_accuracy
feat(text-metrics): split oneig_alignment #646 oneig_alignment
feat(text-metrics): split text_score pair #647 text_score pair
feat(text-metrics): split oneig_reasoning #648 oneig_reasoning
feat(vision-metrics): split vqa #649 vqa
feat(vision-metrics): split vie_score #650 vie_score
feat(vision-metrics): split img_edit_score #651 img_edit_score
feat(e2e-tests): stacked e2e after split metrics #641 e2e tests

This PR in the flow (3/10)

Review after PR feat(infrastructure): add VLM base classes and utilities #638.
Next PR to review: feat(text-metrics): split oneig_alignment #646.
Confirm this PR's tests and scope before continuing.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.


^{Reviewed by Cursor Bugbot for commit 15db155. Configure here.}

cursor · 2026-04-28T13:09:41Z

            "between prompts and generated images via VQA-style questions."
        ),
-        metrics=["clip_score"],  # §3.2: Mask2Former; not in Pruna
+        metrics=["qa_accuracy", "clip_score"],  # strict QA + CLIP score


GenEval benchmark uses lenient mean aggregation

Medium Severity

The GenEval benchmark's qa_accuracy metric is intended for "strict QA" (all-or-nothing aggregation). However, Task.from_benchmark doesn't pass the all_or_nothing kwarg, causing QAAccuracyMetric to default to mean aggregation. This leads to inflated scores that are not comparable to the paper's reference.
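To make the difference concrete, here is a minimal sketch of the two aggregation modes. It is not Pruna's implementation: `aggregate_qa` and the sample data are hypothetical, and the only detail taken from this PR is the pair of mode names, `mean` and `all_or_nothing`.

```python
from statistics import fmean


def aggregate_qa(answers, *, aggregation="mean"):
    """Collapse per-question correctness flags for one image into a score.

    Hypothetical helper for illustration; not part of Pruna.
    """
    if aggregation not in {"mean", "all_or_nothing"}:
        raise ValueError(f"Unknown aggregation: {aggregation!r}")
    if not answers:
        raise ValueError("answers must not be empty")
    if aggregation == "mean":
        # Lenient: partial credit for each correctly answered question.
        return fmean([1.0 if ok else 0.0 for ok in answers])
    # Strict: the image only counts if every question is answered correctly.
    return 1.0 if all(answers) else 0.0


# Two images: one with a single wrong answer, one fully correct.
per_image = [[True, True, False], [True, True, True]]

mean_score = fmean([aggregate_qa(a) for a in per_image])
strict_score = fmean(
    [aggregate_qa(a, aggregation="all_or_nothing") for a in per_image]
)
```

On this sample the lenient mode gives 5/6 and the strict mode gives 0.5, which is the kind of gap the comment above describes.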


cursor · 2026-04-28T13:09:41Z

+            raise ValueError(
+                f"qa_accuracy aggregation must be one of {{'mean', 'all_or_nothing'}}. Got: {self.aggregation!r}."
+            )
+        self.metric_units = type(self).metric_units


Redundant metric_units self-assignment in __init__

Low Severity

Setting self.metric_units = type(self).metric_units is a no-op because metric_units is already declared as a class attribute and nothing earlier in __init__ (neither super().__init__ nor the surrounding code) overrides it. The line just shadows the class attribute with the same value and adds maintenance noise.
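The point can be checked with a small standalone sketch. The class names and the `"accuracy"` value are hypothetical stand-ins, not taken from the Pruna code base.

```python
class BaseMetricSketch:
    # Hypothetical class attribute standing in for metric_units.
    metric_units = "accuracy"


class WithSelfAssignment(BaseMetricSketch):
    def __init__(self):
        # Copies the class attribute onto the instance with the same value.
        self.metric_units = type(self).metric_units


class WithoutSelfAssignment(BaseMetricSketch):
    def __init__(self):
        pass


redundant = WithSelfAssignment()
lean = WithoutSelfAssignment()

# Attribute lookup returns the same value either way.
same_value = redundant.metric_units == lean.metric_units
# The only difference is where the value is stored.
stored_on_instance = "metric_units" in vars(redundant)
inherited_from_class = "metric_units" not in vars(lean)
```

Both instances read the same value; the assignment only creates an instance-level copy that shadows the class attribute.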


github-actions · 2026-05-19T00:29:28Z

This PR has been inactive for 10 days and is now marked as stale.


davidberenstein1957 · 2026-06-02T17:30:59Z

Follow-up: aggregation is keyword-only; removed redundant metric_units line; exported QAAccuracyMetric in __init__.py. GenEval all_or_nothing wiring is in #641 task.py.
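As a rough illustration of what a keyword-only `aggregation` parameter implies for callers, the sketch below uses a hypothetical `QAAccuracySketch` class. The real `QAAccuracyMetric` signature is not shown on this page, so everything except the parameter name and the error message quoted in the review is an assumption.

```python
class QAAccuracySketch:
    """Hypothetical stand-in; not the actual QAAccuracyMetric."""

    _ALLOWED = frozenset({"mean", "all_or_nothing"})

    def __init__(self, *, aggregation: str = "mean") -> None:
        # The bare * makes aggregation keyword-only.
        if aggregation not in self._ALLOWED:
            raise ValueError(
                "qa_accuracy aggregation must be one of "
                f"{{'mean', 'all_or_nothing'}}. Got: {aggregation!r}."
            )
        self.aggregation = aggregation


# Callers must name the argument explicitly.
strict = QAAccuracySketch(aggregation="all_or_nothing")

# Passing it positionally is rejected by Python itself.
try:
    QAAccuracySketch("all_or_nothing")
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Under this design a benchmark that needs strict scoring has to pass `aggregation="all_or_nothing"` by name, which keeps the choice visible at the call site.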

Drop the broken Intel uv index (aligned with main), fix QAAccuracy keyword-only aggregation syntax, pass single/y_gt call types correctly for OneIG alignment, and expose metric_units on results. Co-authored-by: Cursor <cursoragent@cursor.com>


This was referenced Apr 28, 2026

feat(text-metrics): add text-based VLM judge metrics #639

Closed

feat(vision-metrics): add vision-based VLM judge metrics #640

Closed

cursor Bot reviewed Apr 28, 2026


davidberenstein1957 force-pushed the feat/vlm-pr-2-infrastructure branch from 21212de to 7054e53 Compare May 8, 2026 09:01

davidberenstein1957 force-pushed the feat/vlm-pr-3a-qa-accuracy branch 2 times, most recently from 04ab2e5 to 161223e Compare May 8, 2026 09:44

github-actions Bot added the stale label May 19, 2026

davidberenstein1957 and others added 3 commits June 2, 2026 19:26

feat(text-metrics): split qa_accuracy into dedicated PR branch

7f0966c

Isolates qa_accuracy metric implementation and GenEval benchmark wiring so it can be reviewed independently before stacking the remaining text metrics. Made-with: Cursor

feat(text-metrics): export QAAccuracyMetric in split branch

f2d5c5a

Co-authored-by: Cursor <cursoragent@cursor.com>

fix(metrics): qa_accuracy keyword-only aggregation

e7fca2e

Remove redundant metric_units assignment; aggregation is keyword-only. Co-authored-by: Cursor <cursoragent@cursor.com>

davidberenstein1957 force-pushed the feat/vlm-pr-3a-qa-accuracy branch from 161223e to e7fca2e Compare June 2, 2026 17:30

davidberenstein1957 and others added 4 commits June 4, 2026 07:45

fix(ci): lint/docstrings and stack-appropriate VLM tests

1e36ec2

Replace forward-import VLM test module on pre-e2e branches with infrastructure-only tests; propagate docstring and conftest fixes. Co-authored-by: Cursor <cursoragent@cursor.com>

fix(ci): ruff on infra VLM test template

de1cc64

Co-authored-by: Cursor <cursoragent@cursor.com>

chore: drop local-only scripts from PR scope

2dc8924

Remove verify helper and duplicate infra test template from scripts/; tests live under tests/evaluation/ only. Co-authored-by: Cursor <cursoragent@cursor.com>


feat(text-metrics): split qa_accuracy #645
davidberenstein1957 wants to merge 7 commits into feat/vlm-pr-2-infrastructure from feat/vlm-pr-3a-qa-accuracy
