Description
Please make sure you read the contribution guide and file the issues in the right place.
🔴 Required Information
Please ensure all items in this section are completed to allow for efficient
triaging. Requests without complete information may be rejected / deprioritized.
If an item is not applicable to you - please mark it as N/A
Is your feature request related to a specific problem?
Yes. In v1.24.1, evaluation behavior differs between `adk eval` and `AgentEvaluator.evaluate`, which makes it harder to run and compare evals consistently across CLI and API/test workflows.
Concrete gaps I hit:
- `adk eval` has no `--num_runs`, so I need external loops to repeat runs (a sketch of the workaround I use today follows this list).
- `AgentEvaluator.evaluate` does not provide built-in result persistence, so historical comparison is less convenient than CLI-based runs.
- In `adk eval`, `--config_file_path` is optional, but when omitted it falls back to in-code default criteria and does not auto-discover a per-test `test_config.json` next to each eval/test file.
- `adk eval` resolves only `agent_module.agent.root_agent` and does not look for `get_agent_async`.
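For illustration, this is roughly the external loop I run today to get repeated CLI runs; the paths, run count, and log handling are placeholders of my own, and the invocation assumes the existing `adk eval <agent_dir> <eval_set>` form with an optional `--config_file_path`.

```python
# Ad hoc repetition of `adk eval` because the CLI has no --num_runs.
# All paths and the run count are placeholders for illustration only.
import subprocess
import sys

NUM_RUNS = 3
AGENT_DIR = "path/to/agent_dir"
EVAL_SET = "path/to/eval_set.json"
CONFIG = "path/to/test_config.json"

for run in range(1, NUM_RUNS + 1):
    print(f"--- adk eval run {run}/{NUM_RUNS} ---")
    result = subprocess.run(
        ["adk", "eval", AGENT_DIR, EVAL_SET, f"--config_file_path={CONFIG}"],
        capture_output=True,
        text=True,
    )
    # Persist raw output per run so results can be compared by hand afterwards.
    with open(f"eval_run_{run}.log", "w") as f:
        f.write(result.stdout + result.stderr)
    if result.returncode != 0:
        print(f"run {run} reported failures", file=sys.stderr)
```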
Describe the Solution You'd Like
I would like feature parity/consistency between both entry points:
- Add `--num_runs` to `adk eval` (default `1`) and aggregate results per eval case across runs.
- Add optional result persistence support to `AgentEvaluator.evaluate` via an `EvalSetResultsManager` (local by default).
- In `adk eval`, when `--config_file_path` is omitted, auto-discover a `test_config.json` adjacent to each input eval/test file, then fall back to default criteria if no file is found (see the conceptual sketch after this list).
- Align agent resolution behavior between `adk eval` and `AgentEvaluator`: support both `root_agent` and `get_agent_async` consistently.
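To make the auto-discovery bullet concrete, here is a conceptual sketch of the resolution order I have in mind; `resolve_eval_config`, `DEFAULT_CRITERIA`, and the criteria keys are hypothetical placeholders, not existing ADK API.

```python
# Hypothetical config resolution for `adk eval`; names and criteria keys are illustrative only.
import json
from pathlib import Path
from typing import Optional

DEFAULT_CRITERIA = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.8}

def resolve_eval_config(eval_file: str, config_file_path: Optional[str]) -> dict:
    # 1. An explicit --config_file_path always wins.
    if config_file_path:
        return json.loads(Path(config_file_path).read_text())
    # 2. Otherwise look for a test_config.json next to the eval/test file.
    candidate = Path(eval_file).parent / "test_config.json"
    if candidate.exists():
        return json.loads(candidate.read_text())
    # 3. Fall back to the in-code default criteria.
    return DEFAULT_CRITERIA
```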
Impact on your work
This impacts reproducibility and operational efficiency in day-to-day evals.
- Repeated runs are needed to reduce nondeterminism, but currently require custom scripting for CLI.
- Comparing historical outcomes is easier on one path than the other.
- The same dataset/config setup behaves differently depending on whether I use CLI or programmatic evaluation.
- Agent modules that are valid in one evaluation path can fail in the other, forcing workflow-specific agent definitions.
Willingness to contribute
Yes
🟡 Recommended Information
Describe Alternatives You've Considered
- Keeping current differences and documenting them: this still leaves manual repetition/persistence work and split workflows.
- Wrapper scripts outside ADK: works short-term, but duplicates logic and reduces maintainability (a rough example follows).
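For context, a rough example of such a wrapper, using the `agent_module`/`eval_dataset_file_path_or_dir` parameters from the pseudo-proposal below and assuming failures surface as `AssertionError` (pytest-style usage); the module name, file name, and JSON summary format are my own assumptions, not ADK-provided behavior.

```python
# Ad hoc wrapper: repeat AgentEvaluator.evaluate and persist a pass/fail summary.
# Assumes failed criteria raise AssertionError; file name and JSON layout are made up here.
import asyncio
import json
import time

from google.adk.evaluation.agent_evaluator import AgentEvaluator

NUM_RUNS = 3

async def main() -> None:
    summary = []
    for run in range(1, NUM_RUNS + 1):
        outcome = {"run": run, "passed": True, "error": None}
        try:
            await AgentEvaluator.evaluate(
                agent_module="my_agent",  # placeholder module name
                eval_dataset_file_path_or_dir="evals/",  # placeholder path
            )
        except AssertionError as exc:  # evaluation criteria not met
            outcome["passed"] = False
            outcome["error"] = str(exc)
        summary.append(outcome)
    # Hand-rolled persistence that a built-in EvalSetResultsManager could replace.
    with open(f"eval_summary_{int(time.time())}.json", "w") as f:
        json.dump(summary, f, indent=2)

asyncio.run(main())
```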
Proposed API / Implementation
Pseudo-proposal:
```
# CLI
adk eval <agent_dir> <eval_set...> --num_runs=3 [--config_file_path=...]
```

```python
# Internals (conceptual)
inference_results = []
for i in range(num_runs):
    inference_results += run_inference(...)
aggregated = aggregate_eval_case_results(inference_results)
save_if_configured(aggregated)

# AgentEvaluator
await AgentEvaluator.evaluate(
    agent_module=...,
    eval_dataset_file_path_or_dir=...,
    num_runs=3,
    eval_set_results_manager=LocalEvalSetResultsManager(...),  # optional
)

# Config resolution for adk eval (conceptual)
if config_file_path:
    eval_config = load(config_file_path)
else:
    eval_config = discover_test_config_near_input_or_default(...)

# Conceptual shared loader used by both entry points.
agent_module = _get_agent_module(...)
if hasattr(agent_module.agent, "root_agent"):
    agent = agent_module.agent.root_agent
elif hasattr(agent_module.agent, "get_agent_async"):
    agent = await agent_module.agent.get_agent_async()
else:
    raise ValueError("Expected `root_agent` or `get_agent_async` in agent module.")
```

Additional Context
Goal: make eval behavior consistent regardless of entry point (`adk eval` vs `AgentEvaluator.evaluate`) so that the same eval assets produce comparable outcomes with less custom glue code.