Skip to content

Fixing inconsistent error traces across failure modes.#61

Merged
cemde merged 2 commits into
mainfrom
fix-exception-reporting
May 12, 2026
Merged

Fixing inconsistent error traces across failure modes.#61
cemde merged 2 commits into
mainfrom
fix-exception-reporting

Conversation

@cemde
Copy link
Copy Markdown
Collaborator

@cemde cemde commented May 12, 2026

Description

Benchmark.run() documents a stable report schema, but it wasn't actually stable. Reports built on the setup-failure and setup-timeout early returns, and on the except Exception fallback in _run_parallel, were missing the usage and task keys (and had empty traces/config), so a row that failed in setup looked structurally different from a row that succeeded, breaking any consumer that indexes report["task"] / report["usage"] on every row. Separately, in parallel runs the fail_on_setup_error / fail_on_task_error / fail_on_evaluation_error flags were effectively no-ops: when _execute_task_repetition re-raised, _run_parallel swallowed the exception into a degraded report and kept going, instead of aborting the run the way a sequential run does.

This PR makes every report carry the same keys regardless of outcome, and makes parallel fail-fast behave like sequential fail-fast.

What changed

  • New Benchmark._build_report(...): the single constructor for repetition reports. Every report (success, setup failure, setup timeout, agent/environment/user/unknown execution error, evaluation failure, or an unexpected worker failure in parallel mode) now always contains task_id, repeat_idx, status, error, traces, config, usage, eval, task. traces/config default to {} when unavailable; error is None only for SUCCESS and is otherwise always populated (a generic placeholder is synthesised as a backstop if a caller ever omits it).
  • _run_parallel now re-raises (and shuts the pool down, cancelling queued work) when a worker future raises and any fail_on_* flag is set — aborting the run like _run_sequential. Only genuinely unexpected worker failures (no fail_on_* set) are turned into a degraded-but-full-schema UNKNOWN_EXECUTION_ERROR report so the rest of the batch continues.
  • Updated the run() docstring (Returns/Raises) to describe the full schema and the fail-fast behaviour.

Tests

New tests/test_core/test_benchmark/test_report_schema.py:

  • Schema invariance over {success, setup_failure, setup_timeout, execution_failure, evaluation_failure} × {sequential, parallel}, plus a mixed-outcome run and an "error always populated when status != SUCCESS" check.
  • Parallel runs abort on each of fail_on_setup_error / fail_on_task_error / fail_on_evaluation_error.
  • An unexpected worker failure yields a full-schema degraded report (graceful mode) and propagates when a fail_on_* flag is set.

11 of these were red before the fix. just all is green (ruff format + check, ty check maseval tests, full default + offline suites — 2631 passed, 1 skipped, 2 xfailed).

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

Documentation

  • Added/updated docstrings for new/modified functions as instructed CONTRIBUTING.md
  • Updated relevant documentation in docs/ (if applicable)
  • Tag github issue with this PR (if applicable)

Changelog

  • Added entry to CHANGELOG.md under [Unreleased] section
    • Use Added section for new features
    • Use Changed section for modifications to existing functionality
    • Use Fixed section for bug fixes
    • Use Removed section for deprecated/removed features
  • OR this is a documentation-only change (no changelog needed)

Architecture (if applicable)

  • Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
  • Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies

Additional Notes

Behaviour change worth flagging for reviewers: under num_workers > 1, a fail_on_* flag now aborts the run (re-raising the original exception) instead of being silently absorbed into a report. This matches the documented intent and the sequential path; anyone who was (perhaps unknowingly) relying on the old swallow-and-continue behaviour in parallel mode should set the flag to False and inspect report status instead.

Adding keys to the report dict is additive and shouldn't break consumers that look up keys by name; consumers that asserted on the exact key set will see the two new keys (usage, task) on previously-degraded rows.

@github-actions
Copy link
Copy Markdown

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  maseval/core
  benchmark.py 559
Project Total  

This report was generated by python-coverage-comment-action

@cemde cemde merged commit 1c52f23 into main May 12, 2026
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant