fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646)#1691

Closed
Chibionos wants to merge 1 commit into UiPath:main from Chibionos:fix/ae-1646-mocker-non-openai-models
@Chibionos
Contributor

Summary

Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini), but worked with OpenAI/GPT.

Both eval mockers requested structured output via the OpenAI-only response_format json_schema option and parsed response.choices[0].message.content. On the normalized LLM Gateway, response_format structured output is honored only for OpenAI models; for Claude the content comes back empty/None, so json.loads(None) raised, and the failure was wrapped as UiPathMockResponseGenerationError and surfaced as AGENT_RUNTIME.UNEXPECTED_ERROR.
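The failure mode can be reproduced in isolation. The snippet below is a minimal sketch, not the mocker code: parse_content and describe_failure are hypothetical names used only to show how decoding message content behaves when the content is a JSON string, None, or empty.

```python
import json
from typing import Any, Optional


def parse_content(content: Optional[str]) -> Any:
    """Mimic the old parsing path: decode JSON straight from message content."""
    return json.loads(content)  # type: ignore[arg-type]


def describe_failure(content: Optional[str]) -> Optional[str]:
    """Return the exception class name raised while parsing, or None on success."""
    try:
        parse_content(content)
    except (TypeError, json.JSONDecodeError) as exc:
        return type(exc).__name__
    return None


# Content carrying a JSON string decodes fine (the OpenAI case on this path).
print(describe_failure('{"status": "ok"}'))
# content=None makes json.loads raise TypeError (the Claude case).
print(describe_failure(None))
# An empty string fails as well, with JSONDecodeError.
print(describe_failure(""))
```

Either exception is unrelated to the model's actual answer, which is why the wrapped error looked like a generic runtime failure.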

Fixes AE-1646 (customer: Sarasota Memorial Health Care System).

Root cause / regression

Regression from #1555, which started routing the agent's model into simulations. Before that, simulation always used a fixed OpenAI model (gpt_4_1_mini), so non-OpenAI providers were never exercised on this path — which is why Claude "worked before."

Fix

Switch both mockers to provider-agnostic function calling, mirroring llm_as_judge_evaluator (whose docstring already states function calling is the way to get structured output across OpenAI/Claude/Gemini):

  • Build a forced tool that wraps the output/input schema under a response property, force it via tool_choice=required, and read tool_calls[0].arguments["response"] (already a parsed dict).
  • Hoist nested $defs to the tool-parameters root so $refs from nested Pydantic models still resolve once the schema is wrapped.
  • The normalized gateway's chat_completions now accepts raw-dict tools (pass-through) so arbitrary nested schemas survive — the ToolDefinition converter only emits flat properties.
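The wrap-and-hoist steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in eval/mocks/_structured_output.py: the function names, the ToolCall dataclass, the tool name submit_response, and the OpenAI-style tool dict layout are all hypothetical, and only top-level $defs (where Pydantic places them) are hoisted.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical stand-in for a gateway tool call with parsed arguments."""
    name: str
    arguments: dict[str, Any]


def build_response_tool(name: str, schema: dict[str, Any]) -> dict[str, Any]:
    """Wrap `schema` under a `response` property and hoist its $defs to the root."""
    inner = copy.deepcopy(schema)  # leave the caller's schema untouched
    # "#/$defs/X" references resolve from the document root, so once the schema
    # is nested under `response` its definitions must move up to the root.
    defs = inner.pop("$defs", None)
    parameters: dict[str, Any] = {
        "type": "object",
        "properties": {"response": inner},
        "required": ["response"],
    }
    if defs:
        parameters["$defs"] = defs
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": "Return the structured response.",
            "parameters": parameters,
        },
    }


def extract_response(tool_calls: list[ToolCall]) -> Any:
    """Read the already-parsed payload from the first (forced) tool call."""
    if not tool_calls:
        raise ValueError("model returned no tool call")
    arguments = tool_calls[0].arguments
    if "response" not in arguments:
        raise ValueError("tool call is missing the 'response' argument")
    return arguments["response"]


# A schema shaped like Pydantic output for a model with one nested model.
schema = {
    "type": "object",
    "properties": {"address": {"$ref": "#/$defs/Address"}},
    "$defs": {
        "Address": {"type": "object", "properties": {"city": {"type": "string"}}}
    },
}
tool = build_response_tool("submit_response", schema)
parameters = tool["function"]["parameters"]
print(sorted(parameters))  # $defs now sits beside properties at the root
```

Because the payload arrives as parsed tool-call arguments, no json.loads step runs on message content, so an empty content field no longer matters.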

New shared helper eval/mocks/_structured_output.py keeps both mockers DRY.

Tests

  • test_llm_mockable_structured_output_via_tool_call — parametrized over gpt-4.1-mini, anthropic.claude-sonnet-4-5, gemini-2.5-pro; reproduces AE-1646 (content=None + tool_calls) and asserts the new contract.
  • test_build_response_tool_hoists_defs_to_root + helper error-branch unit tests.
  • test_raw_dict_tool_passthrough_mocked (platform) — asserts a nested array schema is forwarded byte-for-byte.
  • Existing mocker/input/span tests updated to the function-calling contract (behavior assertions preserved).
  • Full tests/cli/eval suite + platform mocked LLM tests green; ruff + mypy clean.
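The pass-through assertion can be illustrated with a small stand-alone sketch. FlatTool and to_wire_tool are hypothetical names, not the platform's ToolDefinition or chat_completions API; the point is only the contrast between a converter that emits flat properties and a raw dict that is forwarded unchanged.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class FlatTool:
    """Hypothetical typed tool whose parameters are flat name -> JSON type pairs."""
    name: str
    description: str
    properties: dict[str, str] = field(default_factory=dict)


def to_wire_tool(tool: Union[FlatTool, dict[str, Any]]) -> dict[str, Any]:
    """Convert typed tools to a wire dict; forward raw dicts untouched."""
    if isinstance(tool, dict):
        # Pass-through: nested schemas, $defs and $refs survive as written.
        return tool
    return {
        "name": tool.name,
        "description": tool.description,
        "parameters": {
            "type": "object",
            "properties": {k: {"type": v} for k, v in tool.properties.items()},
        },
    }


# A nested array schema that the flat converter could not express.
nested = {
    "name": "submit_response",
    "parameters": {
        "type": "object",
        "properties": {
            "response": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {"id": {"type": "integer"}},
                },
            }
        },
    },
}
before = json.dumps(nested, sort_keys=True)
forwarded = to_wire_tool(nested)
# Serialized form is identical before and after forwarding.
assert json.dumps(forwarded, sort_keys=True) == before
```

Comparing serialized forms is one simple way to check that nothing was rewritten on the way through; the real test may compare the request payload captured from a mocked HTTP call instead.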

Note for reviewers

The OpenAI path also moves to function calling (no longer response_format), matching the judge. Worth a live check that a Claude/Bedrock agent simulating a nested-model tool output round-trips through the gateway, since nested $defs in tool parameters have no precedent in this repo (the judge only used flat schemas).

🤖 Generated with Claude Code

…models work

Tool simulation and input generation in Studio Debug and Evaluation Set runs
failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic
Claude via Bedrock, Gemini). The mockers requested structured output via
OpenAI-only `response_format` json_schema and parsed `choices[0].message.content`;
for Claude that content is empty/None, so `json.loads(...)` raised.

Switch both mockers to provider-agnostic function calling (mirrors
llm_as_judge_evaluator): build a forced tool that wraps the output/input schema
under a `response` property, force it via tool_choice, and read
`tool_calls[0].arguments["response"]` (already a parsed dict). Hoist nested
`$defs` to the tool-parameters root so `$ref`s from nested Pydantic models still
resolve. The normalized LLM gateway now accepts raw-dict tools so arbitrary
nested schemas survive (the ToolDefinition converter only emits flat properties).

Regression introduced by UiPath#1555, which started routing the agent's model into
simulations; before that, simulation always used a fixed OpenAI model, so
non-OpenAI providers were never exercised on this path.

Fixes AE-1646.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Chibionos Chibionos requested review from AAgnihotry, akshaylive and bai-uipath and removed request for akshaylive May 29, 2026 07:14
@Chibionos
Contributor Author

Superseded by #1692 — moved the branch into the main repo (UiPath/uipath-python) so CI can access the required secrets for the eval test-cases (fork PRs run without them). Same commits + version bumps.

@Chibionos Chibionos closed this May 29, 2026