fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646) by Chibionos · Pull Request #1691 · UiPath/uipath-python

Chibionos · 2026-05-29T06:59:07Z

Summary

Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini), but worked with OpenAI/GPT.

Both eval mockers requested structured output via OpenAI-only response_format json_schema and parsed response.choices[0].message.content. On the normalized LLM Gateway, response_format structured output is only honored for OpenAI models; for Claude the content comes back empty/None, so json.loads(None) raised → wrapped as UiPathMockResponseGenerationError → AGENT_RUNTIME.UNEXPECTED_ERROR.

Fixes AE-1646 (customer: Sarasota Memorial Health Care System).

Root cause / regression

Regression from #1555, which started routing the agent's model into simulations. Before that, simulation always used a fixed OpenAI model (gpt_4_1_mini), so non-OpenAI providers were never exercised on this path — which is why Claude "worked before."

Fix

Switch both mockers to provider-agnostic function calling, mirroring llm_as_judge_evaluator (whose docstring already states function calling is the way to get structured output across OpenAI/Claude/Gemini):

Build a forced tool that wraps the output/input schema under a response property, force it via tool_choice=required, and read tool_calls[0].arguments["response"] (already a parsed dict).
Hoist nested $defs to the tool-parameters root so $refs from nested Pydantic models still resolve once the schema is wrapped.
The normalized gateway's chat_completions now accepts raw-dict tools (pass-through) so arbitrary nested schemas survive — the ToolDefinition converter only emits flat properties.

New shared helper eval/mocks/_structured_output.py keeps both mockers DRY.

Tests

test_llm_mockable_structured_output_via_tool_call — parametrized over gpt-4.1-mini, anthropic.claude-sonnet-4-5, gemini-2.5-pro; reproduces AE-1646 (content=None + tool_calls) and asserts the new contract.
test_build_response_tool_hoists_defs_to_root + helper error-branch unit tests.
test_raw_dict_tool_passthrough_mocked (platform) — asserts a nested array schema is forwarded byte-for-byte.
Existing mocker/input/span tests updated to the function-calling contract (behavior assertions preserved).
Full tests/cli/eval suite + platform mocked LLM tests green; ruff + mypy clean.

Note for reviewers

The OpenAI path also moves to function-calling (no longer response_format), matching the judge. Worth a live check that a Claude/Bedrock agent simulating a nested-model tool output round-trips through the gateway, since nested $defs in tool parameters has no prior precedent in this repo (the judge only used flat schemas).

🤖 Generated with Claude Code

…models work Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini). The mockers requested structured output via OpenAI-only `response_format` json_schema and parsed `choices[0].message.content`; for Claude that content is empty/None, so `json.loads(...)` raised. Switch both mockers to provider-agnostic function calling (mirrors llm_as_judge_evaluator): build a forced tool that wraps the output/input schema under a `response` property, force it via tool_choice, and read `tool_calls[0].arguments["response"]` (already a parsed dict). Hoist nested `$defs` to the tool-parameters root so `$ref`s from nested Pydantic models still resolve. The normalized LLM gateway now accepts raw-dict tools so arbitrary nested schemas survive (the ToolDefinition converter only emits flat properties). Regression introduced by UiPath#1555, which started routing the agent's model into simulations; before that, simulation always used a fixed OpenAI model, so non-OpenAI providers were never exercised on this path. Fixes AE-1646. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Chibionos · 2026-05-29T07:21:05Z

Superseded by #1692 — moved the branch into the main repo (UiPath/uipath-python) so CI can access the required secrets for the eval test-cases (fork PRs run without them). Same commits + version bumps.

Chibionos requested review from AAgnihotry, akshaylive and bai-uipath and removed request for akshaylive May 29, 2026 07:14

Chibionos closed this May 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646)#1691

fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646)#1691
Chibionos wants to merge 1 commit into
UiPath:mainfrom
Chibionos:fix/ae-1646-mocker-non-openai-models

Chibionos commented May 29, 2026

Uh oh!

Chibionos commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Chibionos commented May 29, 2026

Summary

Root cause / regression

Fix

Tests

Note for reviewers

Uh oh!

Chibionos commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant