fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646)#1691
Closed
Chibionos wants to merge 1 commit into
…models work

Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini). The mockers requested structured output via OpenAI-only `response_format` json_schema and parsed `choices[0].message.content`; for Claude that content is empty/None, so `json.loads(...)` raised.

Switch both mockers to provider-agnostic function calling (mirrors llm_as_judge_evaluator): build a forced tool that wraps the output/input schema under a `response` property, force it via tool_choice, and read `tool_calls[0].arguments["response"]` (already a parsed dict). Hoist nested `$defs` to the tool-parameters root so `$ref`s from nested Pydantic models still resolve. The normalized LLM gateway now accepts raw-dict tools so arbitrary nested schemas survive (the ToolDefinition converter only emits flat properties).

Regression introduced by UiPath#1555, which started routing the agent's model into simulations; before that, simulation always used a fixed OpenAI model, so non-OpenAI providers were never exercised on this path.

Fixes AE-1646.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Superseded by #1692: moved the branch into the main repo (UiPath/uipath-python) so CI can access the required secrets for the eval test cases (fork PRs run without them). Same commits plus version bumps.
Summary
Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with `AGENT_RUNTIME.UNEXPECTED_ERROR` for non-OpenAI models (Anthropic Claude via Bedrock, Gemini), but worked with OpenAI/GPT.

Both eval mockers requested structured output via OpenAI-only `response_format` json_schema and parsed `response.choices[0].message.content`. On the normalized LLM Gateway, `response_format` structured output is only honored for OpenAI models; for Claude the content comes back empty/None, so `json.loads(None)` raised → wrapped as `UiPathMockResponseGenerationError` → `AGENT_RUNTIME.UNEXPECTED_ERROR`.

Fixes AE-1646 (customer: Sarasota Memorial Health Care System).
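The failure mode described above can be reproduced without any gateway. This is a minimal sketch, not the mockers' actual code: `parse_structured_content` and the two reply dicts are hypothetical stand-ins for the old content-parsing path and for what an OpenAI-style versus a Claude-style reply might look like on it.

```python
import json


def parse_structured_content(content):
    """Hypothetical stand-in for the old path: assume JSON text in message.content."""
    return json.loads(content)


# Assumed reply shapes, for illustration only.
openai_like = {"content": '{"status": "ok"}'}  # JSON text present in content
claude_like = {"content": None}                # content empty, as reported for Claude

parsed = parse_structured_content(openai_like["content"])

try:
    parse_structured_content(claude_like["content"])
    failure = None
except TypeError as exc:
    # json.loads rejects None before any JSON parsing happens.
    failure = exc
```

The point of the sketch is that the error comes from the parsing assumption, not from the model: any provider that returns its structured result outside `content` would hit the same exception.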
Root cause / regression
Regression from #1555, which started routing the agent's model into simulations. Before that, simulation always used a fixed OpenAI model (`gpt_4_1_mini`), so non-OpenAI providers were never exercised on this path, which is why Claude "worked before."
Fix
Switch both mockers to provider-agnostic function calling, mirroring `llm_as_judge_evaluator` (whose docstring already states function calling is the way to get structured output across OpenAI/Claude/Gemini):

- Build a forced tool that wraps the output/input schema under a `response` property, force it via `tool_choice=required`, and read `tool_calls[0].arguments["response"]` (already a parsed dict).
- Hoist nested `$defs` to the tool-parameters root so `$ref`s from nested Pydantic models still resolve once the schema is wrapped.
- `chat_completions` now accepts raw-dict tools (pass-through) so arbitrary nested schemas survive; the `ToolDefinition` converter only emits flat properties.

New shared helper `eval/mocks/_structured_output.py` keeps both mockers DRY.
Tests
- `test_llm_mockable_structured_output_via_tool_call`: parametrized over `gpt-4.1-mini`, `anthropic.claude-sonnet-4-5`, `gemini-2.5-pro`; reproduces AE-1646 (content=None + tool_calls) and asserts the new contract.
- `test_build_response_tool_hoists_defs_to_root` + helper error-branch unit tests.
- `test_raw_dict_tool_passthrough_mocked` (platform): asserts a nested array schema is forwarded byte-for-byte.
- `tests/cli/eval` suite + platform mocked LLM tests green; ruff + mypy clean.

Note for reviewers
The OpenAI path also moves to function-calling (no longer `response_format`), matching the judge. Worth a live check that a Claude/Bedrock agent simulating a nested-model tool output round-trips through the gateway, since nested `$defs` in tool `parameters` has no prior precedent in this repo (the judge only used flat schemas).

🤖 Generated with Claude Code
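For reviewers who want a concrete picture of the wrap-and-hoist step, here is a minimal sketch under stated assumptions. It is not the contents of `eval/mocks/_structured_output.py`: the function names `build_response_tool` and `extract_response`, the tool name `submit_response`, and the tool/tool-call dict shapes are all hypothetical, chosen only to illustrate the idea of wrapping a schema under `response`, moving its `$defs` to the parameters root, and reading the already-parsed arguments.

```python
import copy
from typing import Any


def build_response_tool(name: str, schema: dict[str, Any]) -> dict[str, Any]:
    """Wrap `schema` under a `response` property and hoist its `$defs` (illustrative)."""
    inner = copy.deepcopy(schema)  # leave the caller's schema untouched
    # Refs such as "#/$defs/Address" resolve against the document root, so the
    # definitions must sit at the root of the tool parameters after wrapping.
    defs = inner.pop("$defs", None)
    parameters: dict[str, Any] = {
        "type": "object",
        "properties": {"response": inner},
        "required": ["response"],
    }
    if defs:
        parameters["$defs"] = defs
    # Assumed tool shape; the real gateway payload may differ.
    return {"name": name, "parameters": parameters}


def extract_response(tool_calls: list[dict[str, Any]]) -> Any:
    """Read the structured result from the first tool call (illustrative)."""
    if not tool_calls:
        raise ValueError("model returned no tool call")
    arguments = tool_calls[0].get("arguments") or {}
    if "response" not in arguments:
        raise ValueError("tool call is missing the 'response' argument")
    return arguments["response"]


# A hand-written schema shaped like what a nested Pydantic model would emit.
schema = {
    "type": "object",
    "properties": {"address": {"$ref": "#/$defs/Address"}},
    "required": ["address"],
    "$defs": {
        "Address": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        }
    },
}

tool = build_response_tool("submit_response", schema)

# Simulated tool call with content=None, mirroring the AE-1646 reproduction.
simulated_calls = [
    {"name": "submit_response", "arguments": {"response": {"address": {"city": "Sarasota"}}}}
]
result = extract_response(simulated_calls)
```

Without the hoist, `$defs` would end up under `properties.response`, and a root-relative `$ref` inside the wrapped schema would point at a location that no longer exists. The explicit `ValueError` branches correspond loosely to the helper error-branch tests mentioned above, though the actual error types used in the PR are not shown here.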