Erinbranch #418

Open
erinlimbogoogle wants to merge 34 commits into main from erinbranch

Conversation

@erinlimbogoogle
Collaborator

No description provided.

erinlimbogoogle and others added 30 commits May 27, 2026 22:29
#398)

* fix: surface eval failures instead of silently terminating or crashing

Three issues found in prod logs (last 24h):

- SimulatedUser silently returned "TERMINATE" on every turn when
  simulated_user_model_config was missing or failed to load, killing
  ~115 multi-turn scenarios with no visible cause. Now raises in
  __init__ so misconfiguration is caught at scenario start.

- _process_results used `assert not results_df.empty`, which bubbled
  out of the Eval RPC as an unstructured INTERNAL error and triggered
  a client retry storm (~10s cadence) on a config-level problem. Now
  raises EmptyEvalResultError, translated to FAILED_PRECONDITION with
  a message pointing at dataset/dialect/database mismatch.

- DB-queue acquire failures logged the bare exception, which is empty
  for queue.Empty timeouts ("...': "). Now falls back to the exception
  class name so the operator can see what failed.
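
The second and third fixes above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `EmptyEvalResultError` is the name given in the commit message, while `check_results` and `describe_exception` are hypothetical helper names, and the translation to FAILED_PRECONDITION at the RPC layer is left out.

```python
import queue


class EmptyEvalResultError(Exception):
    """Raised when an eval run yields no result rows (name from the commit message)."""


def check_results(rows):
    # Hypothetical stand-in for `assert not results_df.empty`: a typed
    # exception is not stripped by `python -O` and can be mapped to a
    # specific status code by the caller.
    if not rows:
        raise EmptyEvalResultError(
            "eval produced no results; check dataset/dialect/database settings"
        )
    return rows


def describe_exception(exc):
    # str(queue.Empty()) is an empty string, so fall back to the class name.
    return str(exc) or type(exc).__name__
```

With this helper, a log line for a queue timeout would carry `Empty` instead of a blank message.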

* Fix: convert bird-interact-lite from broken submodule to regular directory

* chore: gitignore bird-interact-lite to prevent accidental submodule recommit

The bird-interact-lite directory is populated at runtime by
datasets/bird-interact/download_dataset.sh, which clones
https://huggingface.co/datasets/birdsql/bird-interact-lite into it.

Without this ignore, a subsequent 'git add' would re-record the embedded
.git/ as a gitlink with no .gitmodules entry — which is exactly the
broken-submodule state this PR is removing in the first place.

---------

Co-authored-by: Prerna Kakkar <prernakakkar@google.com>
* feat(dea): define DataEngineeringAgentGenerator and integrate A2A SDK with GCP ADC

Task 1.1: Define DataEngineeringAgentGenerator inheriting from QueryGenerator in data_engineering_agent.py.
Task 1.2: Integrate A2A SDK and configure GcpAdcCredentialService to use GCP Application Default Credentials (ADC) for authentication.

TAG=agy
CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* feat(dea): integrate A2A SDK dependency and add setup unit tests

- Add `a2a-sdk>=1.0.3` to dependencies in `pyproject.toml` and update `uv.lock`.
- Update `DataEngineeringAgentGenerator` to configure `endpoint`, `target_workspace` and clean up GCP ADC Credential Service logic with event loop safety (async refresh).
- Register `DataEngineeringAgentGenerator` in the generators factory.
- Add `evalbench/test/data_engineering_agent_test.py` to verify correct generator setup and configurations.

TAG=agy
CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* refactor(dea): remove hardcoded defaults and add config validation

- Remove hardcoded test URLs for `endpoint` and `target_workspace` from `DataEngineeringAgentGenerator`.
- Raise `ValueError` if either configuration key is missing or empty.
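
A minimal sketch of the validation described above, assuming the generator receives its settings as a plain dict. The helper names `read_required` and `validate_dea_config` are hypothetical; only the two keys and the `ValueError` come from the commit message.

```python
def read_required(config, key):
    # Hypothetical helper: reject missing, non-string, or blank values up front.
    value = config.get(key)
    if not isinstance(value, str) or not value.strip():
        raise ValueError(f"config requires a non-empty '{key}'")
    return value


def validate_dea_config(config):
    # `endpoint` and `target_workspace` are the two keys named in the commit.
    return {
        key: read_required(config, key)
        for key in ("endpoint", "target_workspace")
    }
```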

TAG=agy
CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* style: remove comments from data_engineering_agent.py

TAG=agy
CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* test(dea): verify credential scheme check and clean up imports

- Add a new unit test `test_get_credentials_invalid_scheme` verifying that `GcpAdcCredentialService` raises `ValueError` for unsupported auth schemes.
- Replace absolute generator import in `evalbench/generators/models/__init__.py` with a relative one.
- Clean up unused imports in `data_engineering_agent.py`.

TAG=agy
CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* fix(dea): revert relative import of QueryGenerator in factory

- Revert relative import of `QueryGenerator` back to absolute import `generators.models.generator.QueryGenerator` in `evalbench/generators/models/__init__.py` to prevent runtime package resolution issues.

TAG=agy
CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* feat(dea): throw auth errors and add error resilience tests

- Update `GcpAdcCredentialService` to propagate `DefaultCredentialsError` and `RefreshError` up instead of catching them silently and returning `None`.
- Add `test_generator_setup_missing_endpoint` and `test_generator_setup_missing_workspace` in `data_engineering_agent_test.py`.
- Add `test_get_credentials_error_resiliency_default` and `test_get_credentials_error_resiliency_refresh` to verify error propagation.

TAG=agy
CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* style: resolve all pycodestyle violations in tests

- Add `# noqa: E402` to imports in `data_engineering_agent_test.py` to support custom `sys.path` modification before importing generators.
- Wrap long test URLs using parenthesized string concatenation (E501).
- Remove trailing blank line at EOF (W391).

TAG=agy
CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* feat(dea): make GCP ADC credential retrieval thread-safe and non-blocking

- Configure `GcpAdcCredentialService` using `asyncio.Lock` to ensure concurrency safety across parallel requests.
- Move gcloud auth default credentials initialization off the main event loop using `asyncio.to_thread` to prevent blocking.
- Clean up unused subprocess import in `data_engineering_agent.py`.
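
The locking pattern described above might look roughly like this. It is a sketch under stated assumptions: `CachedCredentialService` and `load_token` are hypothetical stand-ins for `GcpAdcCredentialService` and its blocking ADC lookup, and token expiry is ignored.

```python
import asyncio


class CachedCredentialService:
    """Hypothetical stand-in for GcpAdcCredentialService."""

    def __init__(self, load_token):
        # load_token is a blocking callable, e.g. an ADC lookup plus refresh.
        self._load_token = load_token
        self._token = None
        self._lock = asyncio.Lock()

    async def get_token(self):
        async with self._lock:
            # Only the first caller pays for the blocking load; the others
            # wait on the lock and then reuse the cached value.
            if self._token is None:
                self._token = await asyncio.to_thread(self._load_token)
            return self._token
```

Note that `asyncio.Lock` serializes coroutines on a single event loop; it is not a lock between OS threads. `asyncio.to_thread` requires Python 3.9 or later.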

TAG=agy
CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* style(dea): simplify GcpAdcCredentialService docstring in generators

- Clean up and shorten the docstring of `GcpAdcCredentialService` to remove unnecessary details and resolve E501 line-length warnings.

TAG=agy
CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

---------

Co-authored-by: James Nguyen <jamesamn@google.com>
…ations (#407)

- Add `EvalDeaRequest` representing a native conversational evaluation scenario request for the Data Engineering Agent.
- Support mapping and unpacking Google3 serialization protobuf payloads cleanly.

TAG=agy
CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

Co-authored-by: James Nguyen <jamesamn@google.com>
* feat: opt-in function-calling for the Gemini SDK judge

Adds a tools/ package (Tool dataclass, registry) and a fetch_url tool that
the GeminiGenerator can invoke via the google.genai function-calling loop.
Tools are opt-in per judge via a `tools:` YAML list; configs without the
key take the existing single-shot codepath unchanged.

The fetch_url tool restricts to public HTTPS hosts (SSRF guard via
ipaddress on each resolved IP), times out at 10s, caps responses at 50KB,
and strips HTML to text. Tool errors are returned as `Error: ...` strings
so the model can react rather than the judge crashing.
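
As an illustration of the guard described above, here is a minimal sketch, not the PR's `fetch_url` code. `check_url` and `resolve_host` are hypothetical names, and the use of `is_global` as the public-address test is an assumption about how such a check could be written.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def resolve_host(host):
    # Default resolver: every address the host name maps to.
    return [info[4][0] for info in socket.getaddrinfo(host, 443)]


def check_url(url, resolve=resolve_host):
    """Return None if the URL looks safe to fetch, else an 'Error: ...' string."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return "Error: only https:// URLs with a host are allowed"
    try:
        addresses = resolve(parsed.hostname)
    except OSError as exc:
        return f"Error: could not resolve host ({type(exc).__name__})"
    if not addresses:
        return "Error: host resolved to no addresses"
    for address in addresses:
        # is_global is False for ranges such as loopback, private and link-local.
        if not ipaddress.ip_address(address).is_global:
            return f"Error: {parsed.hostname} resolves to a non-public address"
    return None
```

Checking every resolved address, rather than only the first, matters because a host name can map to a mix of public and internal addresses. A check like this does not by itself cover redirects or a DNS answer that changes between the check and the fetch.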

The retry/backoff loop on rate-limit errors is extracted into a single
_call_generate_content helper shared by both the single-shot and tool-loop
paths. The tool loop is bounded at MAX_TOOL_ITERATIONS=5.
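
A bounded tool loop of the kind described above can be sketched without any SDK calls. Everything here except the `MAX_TOOL_ITERATIONS` value is hypothetical: `ask_model` stands in for the model call, the tuple protocol is invented for the example, and raising at the cap is an assumption, since the commit message does not say what the real loop does there.

```python
MAX_TOOL_ITERATIONS = 5  # bound taken from the commit message


def run_tool_loop(ask_model, tools, prompt):
    """ask_model(history) returns ("text", str) or ("call", tool_name, args_dict)."""
    history = [("user", prompt)]
    for _ in range(MAX_TOOL_ITERATIONS):
        reply = ask_model(history)
        if reply[0] == "text":
            return reply[1]
        _, name, args = reply
        if name not in tools:
            result = f"Error: unknown tool {name!r}"
        else:
            try:
                result = tools[name](**args)
            except Exception as exc:
                # Hand tool failures back to the model instead of crashing.
                result = f"Error: {type(exc).__name__}: {exc}"
        history.append(("tool", name, result))
    raise RuntimeError(f"tool loop exceeded {MAX_TOOL_ITERATIONS} iterations")
```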

Tested: 12 unit tests for fetch_url (scheme, SSRF, size cap, HTML strip,
HTTP error, timeout, URL error); 6 unit tests for the Gemini loop
(single-shot preserved, unknown tool fails fast, no-call-emit returns
text, tool invoked + FunctionResponse threaded back, iteration cap,
tool-exception-as-error-string); existing binaryrubricscorer tests still
pass.

Claude judge tool support deferred to a follow-up.

* docs: add judge_tools.md covering opt-in tool use, fetch_url, and how to add tools

Documents the feature added in the previous commit: activation via the tools: YAML key, the fetch_url tool behavior and constraints, an end-to-end example wiring it to BinaryRubricScorer for a Beam-version rubric, security notes (HTTPS-only, SSRF guard, known redirect/TOCTOU gaps), and a recipe for adding new tools.

* address review bot nits: log HTML extraction failures, drop unused pytest import, assert retry-loop unreachable

github-code-quality bot flagged three items on #409:

1. fetch_url.py: empty 'except: pass' on HTML extraction now logs at debug with exc_info, with a comment that fallback to raw text is intentional.

2. tools_fetch_url_test.py: unused 'import pytest' removed.

3. gemini.py: _call_generate_content fell off the retry loop with an implicit None return. The loop always returns or raises, but the invariant is now explicit via a raise RuntimeError after the loop so static analyzers do not flag the fallthrough.
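
The third item can be illustrated with a generic retry helper. This is a sketch, not `_call_generate_content` itself: `RateLimitError`, `call_with_backoff`, and the delay parameters are hypothetical.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the SDK's rate-limit error."""


def call_with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    # Reached only if max_attempts < 1; otherwise every iteration returns
    # or raises. The explicit raise avoids an implicit `return None`.
    raise RuntimeError("retry loop exited without returning or raising")
```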
… dataset documentation for skill installation and scoring
…configuration into separate methods and improve logging practices
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
.replace('sql: "', "")
.replace("\\n", " ")
.replace("\\", "")
.replace("\\`", "`")
Collaborator

These changes might break other dialects; let's make them dialect-specific.
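
One way to act on this comment is to gate the cleanup on the dialect, sketched below. The function name `clean_sql` and the `escaped_dialects` parameter are hypothetical, and which dialects actually need the unescaping is not stated in the thread. The sketch also moves the backtick replacement ahead of the backslash removal, because in the order shown in the diff the last `.replace` can never match.

```python
def clean_sql(text, dialect, escaped_dialects):
    """Undo backslash escaping only for dialects listed in escaped_dialects."""
    if dialect not in escaped_dialects:
        # Leave other dialects untouched: a backslash may be meaningful there.
        return text
    return (
        text.replace('sql: "', "")
        .replace("\\n", " ")
        # Unescape backticks before dropping the remaining backslashes.
        .replace("\\`", "`")
        .replace("\\", "")
    )
```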

4 participants