Erinbranch #418
Open
erinlimbogoogle wants to merge 34 commits into
#398)

* fix: surface eval failures instead of silently terminating or crashing

  Three issues found in prod logs (last 24h):

  - SimulatedUser silently returned "TERMINATE" on every turn when simulated_user_model_config was missing or failed to load, killing ~115 multi-turn scenarios with no visible cause. Now raises in __init__ so misconfiguration is caught at scenario start.
  - _process_results used `assert not results_df.empty`, which bubbled out of the Eval RPC as an unstructured INTERNAL error and triggered a client retry storm (~10s cadence) on a config-level problem. Now raises EmptyEvalResultError, translated to FAILED_PRECONDITION with a message pointing at dataset/dialect/database mismatch.
  - DB-queue acquire failures logged the bare exception, which is empty for queue.Empty timeouts ("...': "). Now falls back to the exception class name so the operator can see what failed.

* Fix: convert bird-interact-lite from broken submodule to regular directory

* chore: gitignore bird-interact-lite to prevent accidental submodule recommit

  The bird-interact-lite directory is populated at runtime by datasets/bird-interact/download_dataset.sh, which clones https://huggingface.co/datasets/birdsql/bird-interact-lite into it. Without this ignore, a subsequent 'git add' would re-record the embedded .git/ as a gitlink with no .gitmodules entry, which is exactly the broken-submodule state this PR is removing in the first place.

---------

Co-authored-by: Prerna Kakkar <prernakakkar@google.com>
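The two error-handling fixes described above can be illustrated with a minimal sketch. `EmptyEvalResultError` is the name used in the commit message; `describe_exception` and `check_results` are hypothetical helpers introduced here for illustration and are not the PR's actual functions.

```python
import queue


class EmptyEvalResultError(Exception):
    """Typed error for an eval run that produced no rows (name from the commit message)."""


def describe_exception(exc: BaseException) -> str:
    # Hypothetical helper: str(queue.Empty()) is "", so fall back to the
    # exception class name to keep the log line informative.
    return str(exc) or type(exc).__name__


def check_results(rows: list) -> list:
    # Hypothetical stand-in for the result check: raise a typed error that an
    # RPC layer can map to a specific status, instead of a bare `assert`.
    if not rows:
        raise EmptyEvalResultError(
            "eval produced no results; check dataset/dialect/database settings"
        )
    return rows
```

A typed exception lets the RPC boundary choose a non-retryable status for configuration problems, whereas an `AssertionError` carries no such signal and is skipped entirely when Python runs with `-O`.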
* feat(dea): define DataEngineeringAgentGenerator and integrate A2A SDK with GCP ADC

  Task 1.1: Define DataEngineeringAgentGenerator inheriting from QueryGenerator in data_engineering_agent.py.
  Task 1.2: Integrate A2A SDK and configure GcpAdcCredentialService to use GCP Application Default Credentials (ADC) for authentication.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* feat(dea): integrate A2A SDK dependency and add setup unit tests

  - Add `a2a-sdk>=1.0.3` to dependencies in `pyproject.toml` and update `uv.lock`.
  - Update `DataEngineeringAgentGenerator` to configure `endpoint`, `target_workspace` and clean up GCP ADC Credential Service logic with event loop safety (async refresh).
  - Register `DataEngineeringAgentGenerator` in the generators factory.
  - Add `evalbench/test/data_engineering_agent_test.py` to verify correct generator setup and configurations.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* refactor(dea): remove hardcoded defaults and add config validation

  - Remove hardcoded test URLs for `endpoint` and `target_workspace` from `DataEngineeringAgentGenerator`.
  - Raise `ValueError` if either configuration key is missing or empty.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* style: remove comments from data_engineering_agent.py

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* test(dea): verify credential scheme check and clean up imports

  - Add a new unit test `test_get_credentials_invalid_scheme` verifying that `GcpAdcCredentialService` raises `ValueError` for unsupported auth schemes.
  - Replace absolute generator import in `evalbench/generators/models/__init__.py` with a relative one.
  - Clean up unused imports in `data_engineering_agent.py`.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* fix(dea): revert relative import of QueryGenerator in factory

  - Revert relative import of `QueryGenerator` back to absolute import `generators.models.generator.QueryGenerator` in `evalbench/generators/models/__init__.py` to prevent runtime package resolution issues.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* feat(dea): throw auth errors and add error resilience tests

  - Update `GcpAdcCredentialService` to propagate `DefaultCredentialsError` and `RefreshError` up instead of catching them silently and returning `None`.
  - Add `test_generator_setup_missing_endpoint` and `test_generator_setup_missing_workspace` in `data_engineering_agent_test.py`.
  - Add `test_get_credentials_error_resiliency_default` and `test_get_credentials_error_resiliency_refresh` to verify error propagation.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* style: resolve all pycodestyle violations in tests

  - Add `# noqa: E402` to imports in `data_engineering_agent_test.py` to support custom `sys.path` modification before importing generators.
  - Wrap long test URLs using parenthesized string concatenation (E501).
  - Remove trailing blank line at EOF (W391).

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* feat(dea): make GCP ADC credential retrieval thread-safe and non-blocking

  - Configure `GcpAdcCredentialService` using `asyncio.Lock` to ensure concurrency safety across parallel requests.
  - Move gcloud auth default credentials initialization off the main event loop using `asyncio.to_thread` to prevent blocking.
  - Clean up unused subprocess import in `data_engineering_agent.py`.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

* style(dea): simplify GcpAdcCredentialService docstring in generators

  - Clean up and shorten the docstring of `GcpAdcCredentialService` to remove unnecessary details and resolve E501 line-length warnings.

  TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

---------

Co-authored-by: James Nguyen <jamesamn@google.com>
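The `asyncio.Lock` plus `asyncio.to_thread` pattern named in the credential commits might look roughly like the sketch below. `CachedTokenService`, its `load_token` callable, and the demo helpers are hypothetical stand-ins, not the PR's `GcpAdcCredentialService`; the blocking loader stands in for an ADC lookup or refresh. Note that `asyncio.Lock` serializes coroutines on a single event loop rather than OS threads.

```python
import asyncio
import time
from typing import Callable, Optional


class CachedTokenService:
    """Hypothetical credential cache; illustrates the locking pattern only."""

    def __init__(self, load_token: Callable[[], str]) -> None:
        self._load_token = load_token  # blocking callable (assumed)
        self._lock = asyncio.Lock()    # serializes concurrent coroutines
        self._token: Optional[str] = None

    async def get_token(self) -> str:
        async with self._lock:
            if self._token is None:
                # Run the blocking call in a worker thread so the event loop
                # keeps serving other tasks. Exceptions raised by the loader
                # propagate to the caller unchanged.
                self._token = await asyncio.to_thread(self._load_token)
            return self._token


async def demo() -> tuple:
    calls = []

    def slow_loader() -> str:
        calls.append(1)
        time.sleep(0.05)  # simulate a slow credential lookup
        return "token-123"

    service = CachedTokenService(slow_loader)
    tokens = await asyncio.gather(*(service.get_token() for _ in range(5)))
    return tokens, len(calls)
```

Running `asyncio.run(demo())` issues five concurrent requests, and the lock ensures the slow loader runs only once while the other callers wait for the cached value.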
…ations (#407)

- Add `EvalDeaRequest` representing a native conversational evaluation scenario request for the Data Engineering Agent.
- Support mapping and unpacking Google3 serialization protobuf payloads cleanly.

TAG=agy CONV=aa927cc7-418a-41e3-b658-9b82915e18eb

Co-authored-by: James Nguyen <jamesamn@google.com>
* feat: opt-in function-calling for the Gemini SDK judge

  Adds a tools/ package (Tool dataclass, registry) and a fetch_url tool that the GeminiGenerator can invoke via the google.genai function-calling loop. Tools are opt-in per judge via a `tools:` YAML list; configs without the key take the existing single-shot codepath unchanged.

  The fetch_url tool restricts to public HTTPS hosts (SSRF guard via ipaddress on each resolved IP), times out at 10s, caps responses at 50KB, and strips HTML to text. Tool errors are returned as `Error: ...` strings so the model can react rather than the judge crashing.

  The retry/backoff loop on rate-limit errors is extracted into a single _call_generate_content helper shared by both the single-shot and tool-loop paths. The tool loop is bounded at MAX_TOOL_ITERATIONS=5.

  Tested: 12 unit tests for fetch_url (scheme, SSRF, size cap, HTML strip, HTTP error, timeout, URL error); 6 unit tests for the Gemini loop (single-shot preserved, unknown tool fails fast, no-call-emit returns text, tool invoked + FunctionResponse threaded back, iteration cap, tool-exception-as-error-string); existing binaryrubricscorer tests still pass.

  Claude judge tool support deferred to a follow-up.

* docs: add judge_tools.md covering opt-in tool use, fetch_url, and how to add tools

  Documents the feature added in the previous commit: activation via the tools: YAML key, the fetch_url tool behavior and constraints, an end-to-end example wiring it to BinaryRubricScorer for a Beam-version rubric, security notes (HTTPS-only, SSRF guard, known redirect/TOCTOU gaps), and a recipe for adding new tools.

* address review bot nits: log HTML extraction failures, drop unused pytest import, assert retry-loop unreachable

  github-code-quality bot flagged three items on #409:

  1. fetch_url.py: empty 'except: pass' on HTML extraction now logs at debug with exc_info, with a comment that fallback to raw text is intentional.
  2. tools_fetch_url_test.py: unused 'import pytest' removed.
  3. gemini.py: _call_generate_content fell off the retry loop with an implicit None return. The loop always returns or raises, but the invariant is now explicit via a raise RuntimeError after the loop so static analyzers do not flag the fallthrough.
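The "HTTPS-only, SSRF guard via ipaddress on each resolved IP" check described for fetch_url could be sketched as follows. This is an illustration under assumptions, not the PR's fetch_url.py: `is_public_ip`, `resolve_host`, and `check_url` are hypothetical names, and the injectable `resolver` parameter exists only so the check can be exercised without DNS. Like the commit's own security notes, the sketch does not cover redirects or the gap between resolution and connection (TOCTOU).

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_public_ip(address: str) -> bool:
    # Reject loopback, private, link-local, and other non-global ranges,
    # plus multicast, using the standard ipaddress classification.
    ip = ipaddress.ip_address(address)
    return ip.is_global and not ip.is_multicast


def resolve_host(host: str) -> list:
    # Resolve every address the host maps to; each one is checked below.
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def check_url(url: str, resolver=resolve_host) -> None:
    """Raise ValueError unless url is https and resolves only to public IPs."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError("only https URLs are allowed")
    if not parsed.hostname:
        raise ValueError("URL has no host")
    addresses = resolver(parsed.hostname)
    if not addresses or not all(is_public_ip(a) for a in addresses):
        raise ValueError("host resolves to a non-public address")
```

Checking every resolved address, rather than only the first, matters because a hostname can map to a mix of public and internal addresses.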
…test suite and configuration datasets
…in models package
… dataset documentation for skill installation and scoring
…configuration into separate methods and improve logging practices
…epo pinning limitations
…erpretation of targets starting with dashes
…usage in configuration and documentation
…roject IDs in datasets and configurations
… agy_cli_test and tool_naming_test
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
…ove host PATH dependency
…date subprocess calls to use stdin=DEVNULL
…e to missing usage data
…ion and support v1.0.5+ binary changes
… pairing and latency tracking logic
IsmailMehdi
reviewed
Jun 6, 2026
    .replace('sql: "', "")
    .replace("\\n", " ")
    .replace("\\", "")
    .replace("\\`", "`")
Collaborator
These changes might break other dialects; let's make them dialect-specific.
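One way to act on this comment is to keep the shared replacements unconditional and gate the backslash and backtick handling on the dialect. The sketch below is only an illustration: `clean_sql`, the two tables, and the choice of `"bigquery"` as the dialect that needs backtick unescaping are all assumptions, not taken from the PR. It also applies the escaped-backtick rule before the blanket backslash strip, since a chain that removes every backslash first leaves nothing for the backtick rule to match.

```python
# Replacements assumed safe for every dialect.
COMMON_REPLACEMENTS = [
    ('sql: "', ""),
    ("\\n", " "),  # literal backslash + n, not a newline character
]

# Hypothetical per-dialect rules; which dialects need them is an assumption.
DIALECT_REPLACEMENTS = {
    "bigquery": [
        ("\\`", "`"),  # unescape backticks first...
        ("\\", ""),    # ...then drop any remaining backslashes
    ],
}


def clean_sql(text: str, dialect: str) -> str:
    """Apply common rules, then only the rules registered for this dialect."""
    rules = COMMON_REPLACEMENTS + DIALECT_REPLACEMENTS.get(dialect, [])
    for old, new in rules:
        text = text.replace(old, new)
    return text
```

With this layout, a dialect that is not registered keeps its backslashes intact, so string literals that rely on escapes are not altered.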
No description provided.