
Consistency improvements between adk eval and AgentEvaluator #4410

@ftnext

Description


Please make sure you read the contribution guide and file the issues in the right place.
Contribution guide.

🔴 Required Information

Please ensure all items in this section are completed to allow for efficient
triaging. Requests without complete information may be rejected / deprioritized.
If an item is not applicable to you - please mark it as N/A

Is your feature request related to a specific problem?

Yes. In v1.24.1, evaluation behavior differs between adk eval and AgentEvaluator.evaluate,
which makes it harder to run and compare evals consistently across CLI and API/test workflows.

Concrete gaps I hit:

  • adk eval has no --num_runs, so I need external loops to repeat runs (roughly the sketch shown after this list).
  • AgentEvaluator.evaluate does not provide built-in result persistence, so historical comparison is less convenient than CLI-based runs.
  • In adk eval, --config_file_path is optional, but when omitted it falls back to in-code default criteria and does not auto-discover per-test test_config.json next to each eval/test file.
  • adk eval resolves only agent_module.agent.root_agent and does not look for get_agent_async.
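
For concreteness, the external loop mentioned in the first bullet currently looks roughly like the following. The paths are placeholders, and the hard-coded range is exactly what a --num_runs flag would replace:

import subprocess

AGENT_DIR = "path/to/agent_dir"          # placeholder
EVAL_SET = "path/to/evalset.test.json"   # placeholder

# Repeat the whole CLI eval N times; the per-run results then have to be
# collected and compared by hand.
for _ in range(3):
    subprocess.run(["adk", "eval", AGENT_DIR, EVAL_SET], check=True)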

Describe the Solution You'd Like

I would like feature parity/consistency between both entry points:

  1. Add --num_runs to adk eval (default 1) and aggregate results per eval case across runs (a rough aggregation sketch follows this list).
  2. Add optional result persistence support to AgentEvaluator.evaluate via an EvalSetResultsManager (local by default).
  3. In adk eval, when --config_file_path is omitted, auto-discover test_config.json adjacent to each input eval/test file, then fall back to default criteria if no file is found.
  4. Align agent resolution behavior between adk eval and AgentEvaluator: support both root_agent and get_agent_async consistently.
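
As a rough sketch of what the aggregation in item 1 could mean (the helper name and the result shapes are my assumptions, not existing ADK API):

from collections import defaultdict
from statistics import mean

def aggregate_eval_case_results(runs):
    # Assumed shape: each run is a mapping of eval_case_id -> {metric_name: score}.
    per_case = defaultdict(lambda: defaultdict(list))
    for run in runs:
        for case_id, metrics in run.items():
            for metric, score in metrics.items():
                per_case[case_id][metric].append(score)
    # Average each metric per eval case across runs; thresholds would then be
    # applied to the averaged scores rather than to a single run.
    return {
        case_id: {metric: mean(scores) for metric, scores in metrics.items()}
        for case_id, metrics in per_case.items()
    }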

Impact on your work

This impacts reproducibility and operational efficiency in day-to-day evals.

  • Repeated runs are needed to reduce nondeterminism, but for the CLI they currently require custom scripting.
  • Comparing historical results is straightforward after CLI runs (results are persisted) but not after AgentEvaluator.evaluate runs.
  • The same dataset/config setup behaves differently depending on whether I use CLI or programmatic evaluation.
  • Agent modules that are valid in one evaluation path can fail in the other, forcing workflow-specific agent definitions.

Willingness to contribute

Yes


🟡 Recommended Information

Describe Alternatives You've Considered

  • Keeping current differences and documenting them: this still leaves manual repetition/persistence work and split workflows.
  • Wrapper scripts outside ADK: works short-term, but duplicates logic and reduces maintainability.

Proposed API / Implementation

Pseudo-proposal:

# CLI
adk eval <agent_dir> <eval_set...> --num_runs=3 [--config_file_path=...]

# Internals (conceptual): repeat inference, then aggregate per eval case.
inference_results = []
for _ in range(num_runs):
    inference_results += run_inference(...)

aggregated = aggregate_eval_case_results(inference_results)
save_if_configured(aggregated)

# AgentEvaluator: same knobs, plus optional result persistence.
await AgentEvaluator.evaluate(
    agent_module=...,
    eval_dataset_file_path_or_dir=...,
    num_runs=3,
    eval_set_results_manager=LocalEvalSetResultsManager(...),  # optional
)

# Config resolution for adk eval (conceptual)
if config_file_path:
    eval_config = load(config_file_path)
else:
    eval_config = discover_test_config_near_input_or_default(...)

# Conceptual shared agent loader used by both entry points.
agent_module = _get_agent_module(...)
if hasattr(agent_module.agent, "root_agent"):
    agent = agent_module.agent.root_agent
elif hasattr(agent_module.agent, "get_agent_async"):
    agent = await agent_module.agent.get_agent_async()
else:
    raise ValueError("Expected `root_agent` or `get_agent_async` in agent module.")
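
To make the config auto-discovery above concrete, a minimal version could look like this, assuming test_config.json sits next to each eval/test file and default_criteria stands in for the current in-code defaults:

import json
from pathlib import Path

def discover_test_config_near_input_or_default(eval_file_path, default_criteria):
    # Look for a sibling test_config.json next to the eval/test file.
    config_path = Path(eval_file_path).parent / "test_config.json"
    if config_path.exists():
        return json.loads(config_path.read_text())
    # Otherwise fall back to the same defaults adk eval uses today.
    return default_criteria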

Additional Context

Goal: make eval behavior consistent regardless of entry point (adk eval vs AgentEvaluator.evaluate) so that the same eval assets produce comparable outcomes with less custom glue code.
