Skip to content

feat(text-metrics): split qa_accuracy#645

Open
davidberenstein1957 wants to merge 7 commits into
feat/vlm-pr-2-infrastructurefrom
feat/vlm-pr-3a-qa-accuracy
Open

feat(text-metrics): split qa_accuracy#645
davidberenstein1957 wants to merge 7 commits into
feat/vlm-pr-2-infrastructurefrom
feat/vlm-pr-3a-qa-accuracy

Conversation

@davidberenstein1957

@davidberenstein1957 davidberenstein1957 commented Apr 28, 2026

Copy link
Copy Markdown
Member

Summary

Splits qa_accuracy into its own stacked PR, adds QAAccuracyMetric, and wires GenEval to qa_accuracy + clip_score.

Stack Position

Files

  • src/pruna/evaluation/metrics/metric_qa_accuracy.py
  • src/pruna/evaluation/benchmarks.py

Test Plan

uv run pytest tests/evaluation/test_text_metrics.py -k qa_accuracy

Review Focus

  • Aggregation behavior (all_or_nothing)
  • GenEval benchmark wiring

Review Flow (Order)

Review the stack in this exact order:

  1. feat(vendor): add LLM2Vec embedding model #637 vendor
  2. feat(infrastructure): add VLM base classes and utilities #638 infrastructure
  3. feat(text-metrics): split qa_accuracy #645 qa_accuracy
  4. feat(text-metrics): split oneig_alignment #646 oneig_alignment
  5. feat(text-metrics): split text_score pair #647 text_score pair
  6. feat(text-metrics): split oneig_reasoning #648 oneig_reasoning
  7. feat(vision-metrics): split vqa #649 vqa
  8. feat(vision-metrics): split vie_score #650 vie_score
  9. feat(vision-metrics): split img_edit_score #651 img_edit_score
  10. feat(e2e-tests): stacked e2e after split metrics #641 e2e tests

This PR in the flow (3/10)

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 15db155. Configure here.

"between prompts and generated images via VQA-style questions."
),
metrics=["clip_score"], # §3.2: Mask2Former; not in Pruna
metrics=["qa_accuracy", "clip_score"], # strict QA + CLIP score

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenEval benchmark uses lenient mean aggregation

Medium Severity

The GenEval benchmark's qa_accuracy metric is intended for "strict QA" (all-or-nothing aggregation). However, Task.from_benchmark doesn't pass the all_or_nothing kwarg, causing QAAccuracyMetric to default to mean aggregation. This leads to inflated scores that are not comparable to the paper's reference.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 15db155. Configure here.

raise ValueError(
f"qa_accuracy aggregation must be one of {{'mean', 'all_or_nothing'}}. Got: {self.aggregation!r}."
)
self.metric_units = type(self).metric_units

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant metric_units self-assignment in __init__

Low Severity

Setting self.metric_units = type(self).metric_units is a no-op because metric_units is already declared as a class attribute and nothing earlier in __init__ (neither super().__init__ nor the surrounding code) overrides it. The line just shadows the class attribute with the same value and adds maintenance noise.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 15db155. Configure here.

@github-actions

Copy link
Copy Markdown

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions Bot added the stale label May 19, 2026
davidberenstein1957 and others added 3 commits June 2, 2026 19:26
Isolates qa_accuracy metric implementation and GenEval benchmark wiring so it can be reviewed independently before stacking the remaining text metrics.

Made-with: Cursor
Co-authored-by: Cursor <cursoragent@cursor.com>
Remove redundant metric_units assignment; aggregation is keyword-only.

Co-authored-by: Cursor <cursoragent@cursor.com>
@davidberenstein1957 davidberenstein1957 force-pushed the feat/vlm-pr-3a-qa-accuracy branch from 161223e to e7fca2e Compare June 2, 2026 17:30
@davidberenstein1957

Copy link
Copy Markdown
Member Author

Follow-up: aggregation is keyword-only; removed redundant metric_units line; exported QAAccuracyMetric in __init__.py. GenEval all_or_nothing wiring is in #641 task.py.

davidberenstein1957 and others added 4 commits June 4, 2026 07:45
Drop the broken Intel uv index (aligned with main), fix QAAccuracy
keyword-only aggregation syntax, pass single/y_gt call types correctly
for OneIG alignment, and expose metric_units on results.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace forward-import VLM test module on pre-e2e branches with
infrastructure-only tests; propagate docstring and conftest fixes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Remove verify helper and duplicate infra test template from scripts/;
tests live under tests/evaluation/ only.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant