feat(eval): add ClassifierEvaluator (pure-metadata aggregator) #1674

Open
ajay-kesavan wants to merge 6 commits into main from feat/eval-classifier-evaluator

ajay-kesavan commented May 21, 2026

What

Adds run-level aggregators to the eval framework — starting with a classification aggregator that builds a confusion matrix + precision/recall/F1 across a fixed class list. Works for both coded and low-code agents.

Architectural decisions (full record in Confluence: Design for Precision and Recall §5)

1. Aggregator config lives ON the evaluator, not as a separate evaluator.
We rejected a standalone Classifier evaluator (with a source_evaluator ID pointer) because it forced users to add two evaluators and copy an opaque ID, and the pointer broke during low-code→coded conversion. Config now lives on ExactMatch.aggregators — single source of truth, no cross-evaluator FK, travels with the evaluator JSON file.

2. Transport via per-datapoint justification, not the evaluator snapshot.
The snapshot mechanism is coded-only (low-code uses a different entity). The justification is persisted by BOTH pipelines, so it's the portable channel. ExactMatchJustification / the legacy details string carry the aggregators list per datapoint (identical content; deduped downstream).
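
Because every datapoint repeats the same aggregators list, the downstream pass has to collapse the copies before computing anything. The real pass lives in the C# backend; the following is only a Python sketch of that dedup step under stated assumptions. The function name `collect_aggregators` and the spec keys `type` and `classes` are hypothetical, and the justification is assumed to arrive as a JSON string.

```python
from __future__ import annotations

import json
from typing import Any


def collect_aggregators(justifications: list[str | None]) -> list[dict[str, Any]]:
    """Return the distinct aggregator specs found across per-datapoint justifications."""
    seen: set[str] = set()
    unique: list[dict[str, Any]] = []
    for raw in justifications:
        if not raw:
            continue  # datapoint without a justification
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # plain-text justification, nothing to aggregate
        if not isinstance(payload, dict):
            continue
        for spec in payload.get("aggregators") or []:
            # Canonical JSON is used as the identity key, so key order does not matter.
            key = json.dumps(spec, sort_keys=True)
            if key not in seen:
                seen.add(key)
                unique.append(spec)
    return unique
```

An empty result corresponds to the no-op case: if no datapoint carries aggregators, there is nothing for the run-level pass to compute.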

Changes in this PR (SDK)

  • ExactMatchEvaluatorConfig.aggregators (optional) + ExactMatchJustification.aggregators; evaluate() embeds it.
  • New _aggregators.py: AggregatorSpec / ClassificationAggregatorSpec.
  • LegacyExactMatchEvaluator (low-code): gains an aggregators field; emits a JSON-string details of {expected, actual, aggregators} (legacy J=str, so the string passes through _serialize_justification verbatim → lands in EvalScore.Justification on the C# side).
  • Reverted the _build_evaluator_snapshot aggregators extension (no longer the transport).
  • Deleted the old standalone ClassifierEvaluator + EvaluatorType.CLASSIFIER.
  • Version 2.10.70 → 2.10.72.
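
As a rough illustration of the first two bullets, the sketch below models the config and justification with stdlib dataclasses. It is not the SDK code: the real models are Pydantic-based, and the field names `name`, `type`, and `classes` as well as the free function `evaluate` are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ClassificationAggregatorSpec:
    classes: list[str]            # fixed class list for the confusion matrix
    type: str = "classification"  # hypothetical discriminator value


@dataclass
class ExactMatchEvaluatorConfig:
    name: str
    aggregators: Optional[list[ClassificationAggregatorSpec]] = None  # optional, default None


@dataclass
class ExactMatchJustification:
    expected: str
    actual: str
    aggregators: Optional[list[ClassificationAggregatorSpec]] = None


def evaluate(
    config: ExactMatchEvaluatorConfig, expected: str, actual: str
) -> tuple[float, ExactMatchJustification]:
    """Score one datapoint and copy the configured aggregators into its justification."""
    score = 1.0 if expected == actual else 0.0
    return score, ExactMatchJustification(expected, actual, config.aggregators)
```

Calling `asdict()` on the returned justification gives the nested dictionary that would be serialized per datapoint; with `aggregators` left unset the field stays `None`.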

Compatibility

The config field is optional (default None), so existing evaluators are unchanged. When it is unset, the justification carries no aggregators and the downstream pass is a no-op.

Test plan

  • mypy clean
  • ExactMatch round-trips config with/without aggregators
  • End-to-end with coded ExactMatch + classification aggregator
  • End-to-end with low-code (legacy) ExactMatch + classification aggregator

Adds a new evaluator type whose role is to carry a `classes` list and a
`source_evaluator` name to downstream consumers. It does not compute
classification metrics per datapoint — that work moves to the Studio Web
C# backend, which reads each datapoint's agent output and the source
evaluator's expected label after the per-datapoint loop finishes, scans
the output for each configured class, and builds the confusion matrix.

The per-datapoint evaluate() returns score=0.0 with a
ClassifierJustification(classes, source_evaluator) details payload. This
payload survives the existing CLI -> backend wire path via
StudioWebProgressReporter._serialize_justification (json.dumps of the
model_dump), arriving in the backend as a JSON string inside
CodedEvaluatorScore.Justification where the C# layer can read it.

Replaces the design in earlier draft PRs #1669 and #5307: the SDK no
longer owns the dataset-level computation. The pure-config approach is
~50 LOC instead of ~1500 LOC of dataset-evaluator framework + worker
workflow + factory + child workflow plumbing.

Files:
  src/uipath/eval/evaluators/classifier_evaluator.py  new (~90 LOC)
  src/uipath/eval/evaluators/__init__.py              re-export + EVALUATORS list
  src/uipath/eval/evaluators/evaluator.py             discriminator + Union entry
  src/uipath/eval/models/models.py                    EvaluatorType.CLASSIFIER
  tests/evaluators/test_classifier_evaluator.py       9 unit tests, all passing

Verified:
  pytest tests/evaluators tests/cli/eval --no-cov  -> 824 passed
  ruff check / ruff format / mypy                  -> clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the test:uipath-langchain and test:uipath-integrations labels May 21, 2026
A minimal 3-class intent classification agent (book / cancel / reschedule)
that exercises the new ClassifierEvaluator end-to-end via `uipath eval`.
Mirrors the wire shape Studio Web will see once the C# backend and frontend
PRs land, so SDK changes can be validated standalone before the full stack
is brought up.

Layout:
  main.py             — keyword classifier returning {"intent": "..."}
  evaluations/
    eval-sets/main.json
    evaluators/
      intent_match.json       per-datapoint ExactMatch on .intent
      intent_classifier.json  new uipath-classifier with classes + sourceEvaluator
  README.md           — Path A (SDK CLI) + Path B (Studio Web) instructions

Each datapoint has `evaluationCriterias.intent_classifier: {}` (the runtime
skips evaluators that aren't keyed there). 6/9 datapoints are correctly
classified by design; the resulting (expected, actual) pairs flow through
the existing CLI -> backend wire path inside the classifier's justification
payload as classes/source_evaluator metadata.

Verified live:
  - ExactMatch averages to 0.67 (6/9 correct).
  - ClassifierEvaluator emits {"expected":"","actual":"","classes":[...],
    "source_evaluator":"intent_match"} per datapoint.
  - Plugging the (expected, actual) pairs from the resulting output into the
    same confusion-matrix math the C# helper implements yields macro F1 of
    0.667 on this fixture — the number Studio Web's Aggregations panel
    would render once the backend pipeline is live.
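
For reference, the confusion-matrix math mentioned above can be sketched in a few lines of Python. This is not the C# helper, and the (expected, actual) pairs below are invented rather than taken from the fixture: they are one symmetric layout with 6/9 correct, which happens to give a macro F1 of 2/3. The fixture's actual errors may be distributed differently. The function name `macro_f1_report` is hypothetical.

```python
from __future__ import annotations

from collections import Counter


def macro_f1_report(pairs: list[tuple[str, str]], classes: list[str]):
    """Build a confusion matrix and per-class precision/recall/F1, plus macro F1."""
    confusion = Counter(pairs)  # key: (expected, actual), value: count
    per_class: dict[str, dict[str, float]] = {}
    for c in classes:
        tp = confusion[(c, c)]
        fp = sum(n for (e, a), n in confusion.items() if a == c and e != c)
        fn = sum(n for (e, a), n in confusion.items() if e == c and a != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[c] = {"precision": precision, "recall": recall, "f1": f1}
    # Macro average: unweighted mean of the per-class F1 scores.
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(classes)
    return confusion, per_class, macro_f1


# Hypothetical data: each class has 2 hits and 1 miss, misses rotate between classes.
CLASSES = ["book", "cancel", "reschedule"]
PAIRS = [
    ("book", "book"), ("book", "book"), ("book", "cancel"),
    ("cancel", "cancel"), ("cancel", "cancel"), ("cancel", "reschedule"),
    ("reschedule", "reschedule"), ("reschedule", "reschedule"), ("reschedule", "book"),
]
confusion, per_class, macro_f1 = macro_f1_report(PAIRS, CLASSES)
```

The sketch ignores predicted labels that fall outside the configured class list; how the backend treats those is not specified here.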

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ajay-kesavan marked this pull request as ready for review May 21, 2026 17:34
Pydantic's generic resolution leaves T = typing.Any when a TypeVar is
parameterized with its own bound (BaseEvaluationCriteria here), so
BaseEvaluator[BaseEvaluationCriteria, ...] tripped the runtime's
"X must be a subclass of BaseEvaluationCriteria" guard at load time:

  Failed to create evaluator from file 'evaluations/evaluators/classifier-*.json':
  typing.Any must be a subclass of BaseEvaluationCriteria.

Introduce an empty ClassifierEvaluationCriteria(BaseEvaluationCriteria)
subclass and parameterize Config + Evaluator with it. Mirrors how every
other built-in evaluator (ExactMatch via OutputEvaluationCriteria, etc.)
provides a concrete criteria type even when no per-datapoint fields are
needed.
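
The shape of the guard and of the fix can be sketched without Pydantic. The snippet below does not reproduce Pydantic's generic resolution; it only shows why a resolved type of `typing.Any` fails a subclass check while an empty concrete subclass passes. Both class bodies and the function `require_criteria_type` are hypothetical stand-ins, not the SDK's code.

```python
from typing import Any


class BaseEvaluationCriteria:
    """Hypothetical stand-in for the SDK base class."""


class ClassifierEvaluationCriteria(BaseEvaluationCriteria):
    """Empty concrete subclass: no per-datapoint fields are needed."""


def require_criteria_type(resolved: Any) -> type:
    """Hypothetical load-time guard: accept only subclasses of BaseEvaluationCriteria."""
    if not (isinstance(resolved, type) and issubclass(resolved, BaseEvaluationCriteria)):
        raise TypeError(f"{resolved} must be a subclass of BaseEvaluationCriteria")
    return resolved
```

Passing `ClassifierEvaluationCriteria` returns the class unchanged, whereas passing `typing.Any` raises a `TypeError` whose message mirrors the load-time failure quoted above.
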
Replaces the standalone ClassifierEvaluator with an `aggregators` config
field on per-datapoint evaluators (ExactMatch first). Run-level classification
metrics are now driven by the host evaluator's config, not by a separate
evaluator with a source-evaluator ID reference.

Design rationale (see Confluence "Design for Precision and Recall" §5.2):
the standalone evaluator forced users to add TWO evaluators and copy an
opaque ID between them. Moving aggregator config onto the evaluator that
already emits the labels keeps the source of truth in one place and makes
the JSON file portable across conversions (e.g. low-code -> coded).

- New module `_aggregators.py` with AggregatorSpec / ClassificationAggregatorSpec
- ExactMatchEvaluatorConfig gains optional `aggregators: list[AggregatorSpec] | None`.
  The Python runtime ignores the field; it is metadata for the downstream
  C# aggregation pass.
- `_progress_reporter.py:_build_evaluator_snapshot` now also emits `aggregators`
  so the field flows into EvaluatorRun.EvaluatorSnapshot and the C# layer can
  discover it without consulting the eval set definition file separately.
  Bug fix: previously the builder only emitted prompt+model (LLM-judge only),
  so for ExactMatch the dict was empty and the snapshot ended up null in
  the wire payload.
- ClassifierEvaluator, ClassifierEvaluationCriteria, ClassifierJustification,
  ClassifierEvaluatorConfig: all deleted.
- EvaluatorType.CLASSIFIER enum value removed.
- Discriminator union in evaluator.py drops the Classifier branch.

Version bump 2.10.70 -> 2.10.72 (the previous .71 was an unused dev cache-bust).
The new ExactMatch.aggregators field is a public API change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sonarqubecloud

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 90%)

See analysis details on SonarQube Cloud

ajay-kesavan and others added 2 commits May 27, 2026 09:28
Switches aggregator transport from the evaluator snapshot to the per-datapoint
justification (the snapshot path was coded-only; the justification path works
for both coded and low-code).

- ExactMatchJustification gains an optional `aggregators` field; evaluate()
  embeds config.aggregators into the justification it already emits.
- Reverts the _build_evaluator_snapshot extension (no longer the transport).

Design: aggregator config lives on the evaluator (single source of truth, no
cross-evaluator FK), travels per-datapoint in the justification, and is computed
once by the C# post-pass. See Confluence "Design for Precision and Recall" §5.

uv.lock: sync uipath 2.10.70 -> 2.10.72 (version bumped for the public
ExactMatch.aggregators field + to invalidate uv's build cache during dev).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds run-level aggregator support to the legacy (low-code) ExactMatch so it
reaches parity with the coded ExactMatch.

- LegacyExactMatchEvaluatorConfig / the evaluator gain an optional `aggregators`
  field (top-level, aliased "aggregators"), deserialized from the legacy
  evaluator JSON authored by the low-code editor.
- evaluate() emits a JSON-string `details` of {expected, actual, aggregators}
  when aggregators are configured. Legacy justification is typed `str` (J=str),
  so _serialize_justification passes the string through verbatim — it lands in
  EvalScore.Justification on the C# side, where the low-code aggregation pass
  reads it.

No behavioral change when aggregators is unset (details stays None).
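
A minimal sketch of the legacy path described above, assuming the aggregators are already plain dictionaries. `build_legacy_details` and `serialize_justification` are hypothetical stand-ins for the evaluator's details construction and for `_serialize_justification`; only the pass-through of `str` values and the `{expected, actual, aggregators}` shape come from the commit message.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def build_legacy_details(
    expected: str, actual: str, aggregators: Optional[list[dict[str, Any]]]
) -> Optional[str]:
    """Return a JSON-string details payload, or None when no aggregators are configured."""
    if not aggregators:
        return None  # unchanged behavior for existing evaluators
    return json.dumps({"expected": expected, "actual": actual, "aggregators": aggregators})


def serialize_justification(justification: Any) -> Optional[str]:
    """Stand-in serializer: strings pass through verbatim, other values are JSON-encoded."""
    if justification is None or isinstance(justification, str):
        return justification
    return json.dumps(justification)
```

Because the details value is already a string, the serializer returns it untouched, and the consumer can parse it back into `expected`, `actual`, and `aggregators`.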

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>