Erinbranch by erinlimbogoogle · Pull Request #418 · GoogleCloudPlatform/evalbench

erinlimbogoogle · 2026-06-05T21:35:55Z

No description provided.

#398)

* fix: surface eval failures instead of silently terminating or crashing

  Three issues found in prod logs (last 24h):

  - SimulatedUser silently returned "TERMINATE" on every turn when simulated_user_model_config was missing or failed to load, killing ~115 multi-turn scenarios with no visible cause. Now raises in __init__ so misconfiguration is caught at scenario start.
  - _process_results used `assert not results_df.empty`, which bubbled out of the Eval RPC as an unstructured INTERNAL error and triggered a client retry storm (~10s cadence) on a config-level problem. Now raises EmptyEvalResultError, translated to FAILED_PRECONDITION with a message pointing at dataset/dialect/database mismatch.
  - DB-queue acquire failures logged the bare exception, which is empty for queue.Empty timeouts ("...': "). Now falls back to the exception class name so the operator can see what failed.

* Fix: convert bird-interact-lite from broken submodule to regular directory

* chore: gitignore bird-interact-lite to prevent accidental submodule recommit

  The bird-interact-lite directory is populated at runtime by datasets/bird-interact/download_dataset.sh, which clones https://huggingface.co/datasets/birdsql/bird-interact-lite into it. Without this ignore, a subsequent 'git add' would re-record the embedded .git/ as a gitlink with no .gitmodules entry, which is exactly the broken-submodule state this PR is removing in the first place.

---------

Co-authored-by: Prerna Kakkar <prernakakkar@google.com>
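
The third fix described above (falling back to the exception class name when the message is empty) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR; `describe_exception` is a hypothetical helper name.

```python
import queue


def describe_exception(exc: BaseException) -> str:
    """Return a non-empty description of exc for log messages.

    Hypothetical helper: str(queue.Empty()) is an empty string, so fall
    back to the exception class name when there is no message.
    """
    message = str(exc)
    return message if message else type(exc).__name__


try:
    # An empty queue raises queue.Empty once the timeout expires.
    queue.Queue().get(timeout=0.01)
except queue.Empty as exc:
    description = describe_exception(exc)
```

With a helper like this, a log line for a queue timeout would contain the class name instead of ending in an empty string.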

* feat(dea): define DataEngineeringAgentGenerator and integrate A2A SDK with GCP ADC

  Task 1.1: Define DataEngineeringAgentGenerator inheriting from QueryGenerator in data_engineering_agent.py.
  Task 1.2: Integrate A2A SDK and configure GcpAdcCredentialService to use GCP Application Default Credentials (ADC) for authentication.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* feat(dea): integrate A2A SDK dependency and add setup unit tests

  - Add `a2a-sdk>=1.0.3` to dependencies in `pyproject.toml` and update `uv.lock`.
  - Update `DataEngineeringAgentGenerator` to configure `endpoint`, `target_workspace` and clean up GCP ADC Credential Service logic with event loop safety (async refresh).
  - Register `DataEngineeringAgentGenerator` in the generators factory.
  - Add `evalbench/test/data_engineering_agent_test.py` to verify correct generator setup and configurations.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* refactor(dea): remove hardcoded defaults and add config validation

  - Remove hardcoded test URLs for `endpoint` and `target_workspace` from `DataEngineeringAgentGenerator`.
  - Raise `ValueError` if either configuration key is missing or empty.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* style: remove comments from data_engineering_agent.py

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* test(dea): verify credential scheme check and clean up imports

  - Add a new unit test `test_get_credentials_invalid_scheme` verifying that `GcpAdcCredentialService` raises `ValueError` for unsupported auth schemes.
  - Replace absolute generator import in `evalbench/generators/models/__init__.py` with a relative one.
  - Clean up unused imports in `data_engineering_agent.py`.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* fix(dea): revert relative import of QueryGenerator in factory

  - Revert relative import of `QueryGenerator` back to absolute import `generators.models.generator.QueryGenerator` in `evalbench/generators/models/__init__.py` to prevent runtime package resolution issues.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* feat(dea): throw auth errors and add error resilience tests

  - Update `GcpAdcCredentialService` to propagate `DefaultCredentialsError` and `RefreshError` up instead of catching them silently and returning `None`.
  - Add `test_generator_setup_missing_endpoint` and `test_generator_setup_missing_workspace` in `data_engineering_agent_test.py`.
  - Add `test_get_credentials_error_resiliency_default` and `test_get_credentials_error_resiliency_refresh` to verify error propagation.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* style: resolve all pycodestyle violations in tests

  - Add `# noqa: E402` to imports in `data_engineering_agent_test.py` to support custom `sys.path` modification before importing generators.
  - Wrap long test URLs using parenthesized string concatenation (E501).
  - Remove trailing blank line at EOF (W391).

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* feat(dea): make GCP ADC credential retrieval thread-safe and non-blocking

  - Configure `GcpAdcCredentialService` using `asyncio.Lock` to ensure concurrency safety across parallel requests.
  - Move gcloud auth default credentials initialization off the main event loop using `asyncio.to_thread` to prevent blocking.
  - Clean up unused subprocess import in `data_engineering_agent.py`.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* style(dea): simplify GcpAdcCredentialService docstring in generators

  - Clean up and shorten the docstring of `GcpAdcCredentialService` to remove unnecessary details and resolve E501 line-length warnings.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

---------

Co-authored-by: James Nguyen <jamesamn@google.com>
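
The `asyncio.Lock` plus `asyncio.to_thread` pattern named in the commits above might look roughly like the sketch below. It is an assumption-laden illustration, not the PR's `GcpAdcCredentialService`: `CachedCredentialService` and `fake_blocking_load` are hypothetical names, and the real ADC lookup is replaced by a stand-in that sleeps.

```python
import asyncio
import time


class CachedCredentialService:
    """Hypothetical stand-in for a credential service (not the PR's class)."""

    def __init__(self, load_blocking):
        self._load_blocking = load_blocking  # blocking callable, e.g. an ADC lookup
        self._lock = asyncio.Lock()
        self._token = None

    async def get_token(self) -> str:
        async with self._lock:  # only one coroutine loads the token at a time
            if self._token is None:
                # Run the blocking call in a worker thread so the event loop stays free.
                self._token = await asyncio.to_thread(self._load_blocking)
            return self._token


calls = []


def fake_blocking_load() -> str:
    calls.append(1)
    time.sleep(0.05)  # simulate slow credential I/O
    return "fake-token"


async def main():
    service = CachedCredentialService(fake_blocking_load)
    # Five concurrent requests share a single load.
    return await asyncio.gather(*(service.get_token() for _ in range(5)))


tokens = asyncio.run(main())
```

Note that `asyncio.Lock` coordinates coroutines on one event loop; it is not a `threading.Lock`, so this sketch only serializes concurrent coroutines, not callers on separate threads.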

…ations (#407)

- Add `EvalDeaRequest` representing a native conversational evaluation scenario request for the Data Engineering Agent.
- Support mapping and unpacking Google3 serialization protobuf payloads cleanly.

TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

Co-authored-by: James Nguyen <jamesamn@google.com>

* feat: opt-in function-calling for the Gemini SDK judge

  Adds a tools/ package (Tool dataclass, registry) and a fetch_url tool that the GeminiGenerator can invoke via the google.genai function-calling loop. Tools are opt-in per judge via a `tools:` YAML list; configs without the key take the existing single-shot codepath unchanged.

  The fetch_url tool restricts to public HTTPS hosts (SSRF guard via ipaddress on each resolved IP), times out at 10s, caps responses at 50KB, and strips HTML to text. Tool errors are returned as `Error: ...` strings so the model can react rather than the judge crashing.

  The retry/backoff loop on rate-limit errors is extracted into a single _call_generate_content helper shared by both the single-shot and tool-loop paths. The tool loop is bounded at MAX_TOOL_ITERATIONS=5.

  Tested: 12 unit tests for fetch_url (scheme, SSRF, size cap, HTML strip, HTTP error, timeout, URL error); 6 unit tests for the Gemini loop (single-shot preserved, unknown tool fails fast, no-call-emit returns text, tool invoked + FunctionResponse threaded back, iteration cap, tool-exception-as-error-string); existing binaryrubricscorer tests still pass.

  Claude judge tool support deferred to a follow-up.

* docs: add judge_tools.md covering opt-in tool use, fetch_url, and how to add tools

  Documents the feature added in the previous commit: activation via the tools: YAML key, the fetch_url tool behavior and constraints, an end-to-end example wiring it to BinaryRubricScorer for a Beam-version rubric, security notes (HTTPS-only, SSRF guard, known redirect/TOCTOU gaps), and a recipe for adding new tools.

* address review bot nits: log HTML extraction failures, drop unused pytest import, assert retry-loop unreachable

  github-code-quality bot flagged three items on #409:

  1. fetch_url.py: empty 'except: pass' on HTML extraction now logs at debug with exc_info, with a comment that fallback to raw text is intentional.
  2. tools_fetch_url_test.py: unused 'import pytest' removed.
  3. gemini.py: _call_generate_content fell off the retry loop with an implicit None return. The loop always returns or raises, but the invariant is now explicit via a raise RuntimeError after the loop so static analyzers do not flag the fallthrough.
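
The "SSRF guard via ipaddress on each resolved IP" mentioned above could be sketched along these lines. This is not the PR's fetch_url.py; `is_public_ip` and `host_is_allowed` are hypothetical names, and DNS resolution is left out so the check can be shown on literal addresses.

```python
import ipaddress


def is_public_ip(address: str) -> bool:
    """Return True only if address is not in a private or special-purpose range."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def host_is_allowed(resolved_addresses) -> bool:
    """Allow a host only if it resolved to at least one IP and all are public."""
    addresses = list(resolved_addresses)
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

Checking every resolved address, rather than only the first, matters because a hostname can resolve to a mix of public and internal IPs. As the commit message notes, a check of this kind does not close redirect or TOCTOU gaps.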


IsmailMehdi · 2026-06-06T00:08:52Z

        .replace('sql: "', "")
        .replace("\\n", " ")
-        .replace("\\", "")
+        .replace("\\`", "`")


These changes might break other dialects; let's make the fix dialect-specific.
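
One way to scope the change as the review suggests is sketched below. This is only an illustration under assumptions: `sanitize_sql`, its `dialect` parameter, and the `BACKTICK_DIALECTS` set are hypothetical and do not come from the PR; the two `.replace` calls at the top mirror the context lines of the diff.

```python
# Hypothetical set of dialects that quote identifiers with backticks.
BACKTICK_DIALECTS = {"mysql", "bigquery"}


def sanitize_sql(raw: str, dialect: str) -> str:
    """Clean generated SQL text, handling backslashes per dialect."""
    cleaned = raw.replace('sql: "', "").replace("\\n", " ")
    if dialect in BACKTICK_DIALECTS:
        # New behavior: only unescape backticks, keep other backslashes.
        return cleaned.replace("\\`", "`")
    # Previous behavior, kept for all other dialects: drop every backslash.
    return cleaned.replace("\\", "")
```

With this split, dialects outside the set keep the old behavior, so the change cannot alter their output.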

erinlimbogoogle and others added 30 commits May 27, 2026 22:29

fix: prevent silent errors on DB query timeouts and extend deadline

c1a85de

pystyle

fbfd042

fix sanitization error

c7ad8cd

feat: add AgyCliGenerator support to evaluator and models, including test suite and configuration datasets

d54885b

refactor: apply line length formatting and style improvements to agy_cli.py

3347c9d

feat: pin Antigravity CLI version in Dockerfile.

3cfa33a

refactor: replace DataEngineeringAgentGenerator with AgyCliGenerator in models package

188da46

refactor: clean up agy_cli test suite with shared fixtures and update dataset documentation for skill installation and scoring

c623f10

refactor: decouple agy CLI path initialization, auth setup, and tool configuration into separate methods and improve logging practices

ff339fe

refactor: expose agy version via class attribute and document skill repo pinning limitations

334d13a

fix: use -- delimiter in agy plugin install command to prevent misinterpretation of targets starting with dashes

f04383f
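
The effect of a `--` delimiter can be shown with Python's `argparse`, which follows the same convention. This is a generic illustration only; the `install-demo` parser and its `--force` flag are hypothetical and say nothing about how the agy CLI parses its arguments.

```python
import argparse

# Hypothetical CLI used only to demonstrate the "--" convention.
parser = argparse.ArgumentParser(prog="install-demo")
parser.add_argument("--force", action="store_true")
parser.add_argument("target")

# Everything after "--" is treated as a positional argument, so a target
# that starts with a dash is not mistaken for an option.
args = parser.parse_args(["--", "-weird-name"])
```

Here `args.target` holds the dash-prefixed value and `args.force` stays unset.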

docs: clarify setup.skills plugin behavior and standalone MCP server usage in configuration and documentation

89a63b5

test: add unit tests for skill cloning timeouts and MCP configuration handling

4aa9db3

refactor: support runtime environment variable substitution for GCP project IDs in datasets and configurations

9d5940d
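
Runtime substitution of this kind might be sketched with `string.Template`, as below. The `${GCP_PROJECT_ID}` placeholder syntax, the `substitute_env` helper, and the sample config line are all assumptions for illustration; the PR's actual placeholder format is not shown on this page.

```python
from string import Template


def substitute_env(text: str, env) -> str:
    """Replace ${NAME} placeholders using env; raise KeyError if one is missing."""
    return Template(text).substitute(env)


config_line = "project_id: ${GCP_PROJECT_ID}"  # hypothetical config entry
resolved = substitute_env(config_line, {"GCP_PROJECT_ID": "my-project"})
```

In real use the mapping would typically be `os.environ`, and failing loudly on a missing variable avoids running an evaluation against an unintended project.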

feat: add runtime check for agy binary and document token snapshotting behavior

e1231ca

docs: add trust model explanation for antigravity CLI installation in Dockerfile

04017c6

fix: enforce strict casing for agy MCP tool names to match v1.0.3 schema

a95619c

chore: add synchronization notes between test argument definitions in agy_cli_test and tool_naming_test

3cb9d18

docs: add note regarding path dependency for agy_cli configuration file

62d35f2

chore: increase persistent volume claim storage request to 1000Gi

15e6212

chore(main): release 1.8.0 (#380)

6e19105

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

refactor: move agy binary installation to per-session sandbox and remove host PATH dependency

93d72f1

refactor: remove Secret Manager support for agy authentication and update subprocess calls to use stdin=DEVNULL

b81171b

refactor: validate MCP tool schema files by content to ignore non-tool JSON artifacts

4812fe3

chore: comment out token_consumption in agy-cli-tools config files due to missing usage data

84ae189

refactor: update agy CLI harness to use --model flag for model selection and support v1.0.5+ binary changes

4c38422

omkargaikwad23 added 3 commits June 5, 2026 21:35

feat: recover resolved model label from agy cli logs for statistics bucket tagging

d42e577

refactor: modularize agy-cli transcript parsing and improve tool call pairing and latency tracking logic

b65c88c

refactor: if-chain is now a dict dispatch with lambda values

1040990
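
The refactor named in the commit above (an if-chain turned into a dict dispatch with lambda values) follows a common pattern, sketched here. The function name, event kinds, and payload shape are hypothetical and are not taken from the PR's transcript parser.

```python
def classify_event(kind: str, payload: dict) -> str:
    """Map an event kind to a label via a dict of lambdas instead of if/elif."""
    handlers = {
        "tool_call": lambda p: f"call:{p['name']}",
        "tool_result": lambda p: f"result:{p['name']}",
        "message": lambda p: "message",
    }
    # Unknown kinds fall through to a default handler, like a final else branch.
    handler = handlers.get(kind, lambda p: "unknown")
    return handler(payload)
```

Compared with an if-chain, adding a new kind is a one-line change to the table, and the default branch is explicit in the `.get` call.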

erinlimbogoogle requested a review from IsmailMehdi as a code owner June 5, 2026 21:35

Merge branch 'main' into erinbranch

ce21703

IsmailMehdi reviewed Jun 6, 2026

erinlimbogoogle wants to merge 34 commits into main from erinbranch


4 participants
