
Consistency improvements between adk eval and AgentEvaluator #4410

@ftnext

Description


Please make sure you read the contribution guide and file the issues in the right place.
Contribution guide.

🔴 Required Information

Please ensure all items in this section are completed to allow for efficient
triaging. Requests without complete information may be rejected / deprioritized.
If an item is not applicable to you - please mark it as N/A

Is your feature request related to a specific problem?

Yes. In v1.24.1, evaluation behavior differs between adk eval and AgentEvaluator.evaluate,
which makes it harder to run and compare evals consistently across CLI and API/test workflows.

Concrete gaps I hit:

  • adk eval has no --num_runs, so I need external loops to repeat runs (roughly the sketch shown after this list).
  • AgentEvaluator.evaluate does not provide built-in result persistence, so historical comparison is less convenient than CLI-based runs.
  • In adk eval, --config_file_path is optional, but when omitted it falls back to in-code default criteria and does not auto-discover per-test test_config.json next to each eval/test file.
  • adk eval resolves only agent_module.agent.root_agent and does not look for get_agent_async.
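
For concreteness, the external loop mentioned in the first bullet currently looks roughly like the following. The paths are placeholders, and the hard-coded range is exactly what a --num_runs flag would replace:

import subprocess

AGENT_DIR = "path/to/agent_dir"          # placeholder
EVAL_SET = "path/to/evalset.test.json"   # placeholder

# Repeat the whole CLI eval N times; the per-run results then have to be
# collected and compared by hand.
for _ in range(3):
    subprocess.run(["adk", "eval", AGENT_DIR, EVAL_SET], check=True)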

Describe the Solution You'd Like

I would like feature parity/consistency between both entry points:

  1. Add --num_runs to adk eval (default 1) and aggregate results per eval case across runs (a rough aggregation sketch follows this list).
  2. Add optional result persistence support to AgentEvaluator.evaluate via an EvalSetResultsManager (local by default).
  3. In adk eval, when --config_file_path is omitted, auto-discover test_config.json adjacent to each input eval/test file, then fall back to default criteria if no file is found.
  4. Align agent resolution behavior between adk eval and AgentEvaluator: support both root_agent and get_agent_async consistently.
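
As a rough sketch of what the aggregation in item 1 could mean (the helper name and the result shapes are my assumptions, not existing ADK API):

from collections import defaultdict
from statistics import mean

def aggregate_eval_case_results(runs):
    # Assumed shape: each run is a mapping of eval_case_id -> {metric_name: score}.
    per_case = defaultdict(lambda: defaultdict(list))
    for run in runs:
        for case_id, metrics in run.items():
            for metric, score in metrics.items():
                per_case[case_id][metric].append(score)
    # Average each metric per eval case across runs; thresholds would then be
    # applied to the averaged scores rather than to a single run.
    return {
        case_id: {metric: mean(scores) for metric, scores in metrics.items()}
        for case_id, metrics in per_case.items()
    }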

Impact on your work

This impacts reproducibility and operational efficiency in day-to-day evals.

  • Repeated runs are needed to reduce nondeterminism, but for the CLI they currently require custom scripting.
  • Comparing historical results is straightforward after CLI runs (results are persisted) but not after AgentEvaluator.evaluate runs.
  • The same dataset/config setup behaves differently depending on whether I use CLI or programmatic evaluation.
  • Agent modules that are valid in one evaluation path can fail in the other, forcing workflow-specific agent definitions.

Willingness to contribute

Yes


🟡 Recommended Information

Describe Alternatives You've Considered

  • Keeping current differences and documenting them: this still leaves manual repetition/persistence work and split workflows.
  • Wrapper scripts outside ADK: works short-term, but duplicates logic and reduces maintainability.

Proposed API / Implementation

Pseudo-proposal:

# CLI
adk eval <agent_dir> <eval_set...> --num_runs=3 [--config_file_path=...]

# Internals (conceptual): repeat inference, then aggregate per eval case.
inference_results = []
for _ in range(num_runs):
    inference_results += run_inference(...)

aggregated = aggregate_eval_case_results(inference_results)
save_if_configured(aggregated)

# AgentEvaluator: same knobs, plus optional result persistence.
await AgentEvaluator.evaluate(
    agent_module=...,
    eval_dataset_file_path_or_dir=...,
    num_runs=3,
    eval_set_results_manager=LocalEvalSetResultsManager(...),  # optional
)

# Config resolution for adk eval (conceptual)
if config_file_path:
    eval_config = load(config_file_path)
else:
    eval_config = discover_test_config_near_input_or_default(...)

# Conceptual shared agent loader used by both entry points.
agent_module = _get_agent_module(...)
if hasattr(agent_module.agent, "root_agent"):
    agent = agent_module.agent.root_agent
elif hasattr(agent_module.agent, "get_agent_async"):
    agent = await agent_module.agent.get_agent_async()
else:
    raise ValueError("Expected `root_agent` or `get_agent_async` in agent module.")
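
To make the config auto-discovery above concrete, a minimal version could look like this, assuming test_config.json sits next to each eval/test file and default_criteria stands in for the current in-code defaults:

import json
from pathlib import Path

def discover_test_config_near_input_or_default(eval_file_path, default_criteria):
    # Look for a sibling test_config.json next to the eval/test file.
    config_path = Path(eval_file_path).parent / "test_config.json"
    if config_path.exists():
        return json.loads(config_path.read_text())
    # Otherwise fall back to the same defaults adk eval uses today.
    return default_criteria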

Additional Context

Goal: make eval behavior consistent regardless of entry point (adk eval vs AgentEvaluator.evaluate) so that the same eval assets produce comparable outcomes with less custom glue code.
