
feat(eval): add evaluate_full_response option to rubric-based evaluation #5316

Open
Siddhartha90 wants to merge 4 commits into google:main from Siddhartha90:sid/evaluate-full-response

Conversation

@Siddhartha90

Fixes #5217

Summary

When an agent emits text before a tool call (e.g. presenting a plan), then calls a tool, then emits more text (e.g. an explanation), rubric_based_final_response_quality_v1 only sends the post-tool-call text to the judge as final_response. The pre-tool-call text is stored in intermediate_data.invocation_events but is never included in the judge prompt.

This means rubrics that check for content in the pre-tool-call text always fail, even though the agent correctly produced that content.

Changes

  • Added evaluate_full_response: bool = False to RubricsBasedCriterion (following the pattern of evaluate_intermediate_nl_responses on HallucinationsCriterion)
  • When enabled, the evaluator concatenates all NL text from invocation_events + final_response before sending to the judge
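The concatenation described above could be sketched as follows (a minimal illustration; `Event`, `Invocation`, and `build_judge_text` are simplified stand-ins, not the actual ADK types or functions):

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Simplified stand-in for an ADK invocation event."""
    text: str = ""            # natural-language text, empty for tool calls
    is_tool_call: bool = False


@dataclass
class Invocation:
    """Simplified stand-in for intermediate_data plus final_response."""
    invocation_events: list = field(default_factory=list)
    final_response: str = ""


def build_judge_text(inv: Invocation, evaluate_full_response: bool) -> str:
    """Return the text placed in <final_answer> for the judge."""
    if not evaluate_full_response:
        # Default behavior: only the post-tool-call final response.
        return inv.final_response
    # Concatenate every NL segment from the invocation events,
    # then append the final response.
    segments = [e.text for e in inv.invocation_events if e.text]
    segments.append(inv.final_response)
    return "\n".join(s for s in segments if s)
```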

Usage

```json
{
  "rubric_based_final_response_quality_v1": {
    "threshold": 0.8,
    "evaluate_full_response": true,
    "rubrics": [...]
  }
}
```
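The corresponding field on the criterion could be declared along these lines (a sketch only; the real `RubricsBasedCriterion` lives in the ADK eval package and is a pydantic model with more fields):

```python
from dataclasses import dataclass, field


@dataclass
class RubricsBasedCriterion:
    """Sketch of the criterion config shown in the usage example above."""
    threshold: float = 0.8
    rubrics: list = field(default_factory=list)
    # New flag. Defaults to False so existing test_config.json files
    # keep their current behavior unchanged.
    evaluate_full_response: bool = False
```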

Motivation

We have a resume improvement agent that:

  1. Streams a plan to the user (text)
  2. Calls a tool (e.g. submit_improved_resume)
  3. Streams an explanation of changes (text)

From the user's perspective this is one continuous response. But the rubric evaluator only judges step 3. Rubrics checking for the plan (step 1) always fail.

With evaluate_full_response: true, the judge sees the complete agent output and can accurately evaluate all rubrics.

Backwards compatible

The flag defaults to false, so existing behavior is unchanged.

Test plan

Scenario: Agent emits text before and after a tool call within a single invocation

  • Without flag (default behavior preserved): Run rubric_based_final_response_quality_v1 without evaluate_full_response set. Confirm the judge only receives the post-tool-call text in <final_answer>. Rubrics checking for pre-tool-call content should fail. This validates no regression.
  • With evaluate_full_response: true: Run the same eval with the flag enabled. Confirm the judge receives the concatenated text from all invocation events + final_response in <final_answer>. Rubrics checking for pre-tool-call content should now pass.
  • Agent with no intermediate text: Run with the flag enabled against an agent that only emits a final response (no pre-tool-call text). Confirm behavior is identical to the default — the judge receives just the final_response text.
  • Agent with multiple intermediate text events: Run with the flag enabled against an agent that emits text → tool call → text → tool call → text. Confirm all three text segments are concatenated and sent to the judge.
  • Backwards compatibility: Confirm existing test_config.json files without evaluate_full_response continue to work unchanged (field defaults to false).
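The first two scenarios could be pinned down in unit tests roughly like this (a sketch; `concat_nl_text` is a hypothetical stand-in for the concatenation added to format_auto_rater_prompt):

```python
def concat_nl_text(intermediate_texts, final_response, evaluate_full_response=False):
    """Hypothetical stand-in for the PR's concatenation logic."""
    if not evaluate_full_response:
        return final_response
    return "\n".join([t for t in intermediate_texts if t] + [final_response])


def test_default_sends_only_final_response():
    # Pre-tool-call text must NOT reach the judge by default.
    assert concat_nl_text(["the plan"], "the explanation") == "the explanation"


def test_flag_concatenates_all_segments():
    # text -> tool call -> text -> tool call -> text: all segments kept.
    assert concat_nl_text(["a", "b"], "c", evaluate_full_response=True) == "a\nb\nc"
```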

Pre-PR validation: We installed the lib from this PR (uv pip install "google-adk[eval] @ git+https://github.com/Siddhartha90/adk-python.git@feat/evaluate-full-response")

Then tested the core logic (concatenating text from invocation_events + final_response) against a production agent that emits a plan text → calls a tool → emits explanation text.

Without full-response concatenation, two of our rubrics (presents_plan and warm_acknowledgment), which relied on pre-tool-call content, consistently scored 0.0. With full-response concatenation, all rubrics scored 1.0. The same logic is applied in this PR's changes to format_auto_rater_prompt.

🤖 Generated with Claude Code

@adk-bot adk-bot added the eval [Component] This issue is related to evaluation label Apr 14, 2026
@rohityan rohityan self-assigned this Apr 14, 2026
@rohityan rohityan added the needs review [Status] The PR/issue is awaiting review from the maintainer label Apr 14, 2026
@rohityan
Collaborator

Hi @Siddhartha90, can you please fix the mypy-diff errors?

@rohityan rohityan added request clarification [Status] The maintainer need clarification or more information from the author and removed needs review [Status] The PR/issue is awaiting review from the maintainer labels Apr 14, 2026
Collaborator

@ankursharmas ankursharmas left a comment


Overall I agree with the problem, and to some degree with the direction of the solution as well.
We should add the boolean field you have created to BaseCriterion instead, as this is a property that should be applicable to all metrics, not just the one that deals with rubrics. As an eval owner, one would want how the final response is assembled to be applied consistently across metrics.

The field name evaluate_full_response doesn't convey the intent clearly. I would recommend include_intermediate_responses_in_final, or something else that clarifies the intent.

We should update get_text_from_content to have this new behavior. That way, other metrics could also re-use the implementation.
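A shared helper along the lines this review suggests might look like the following (a sketch using plain dicts in place of the genai Content type; the include_intermediate_responses parameter and get_response_text name are hypothetical):

```python
def get_text_from_content(content: dict) -> str:
    """Sketch: join the NL text parts of a Content-like dict."""
    return "\n".join(
        p["text"] for p in content.get("parts", []) if p.get("text")
    )


def get_response_text(event_contents, final_content,
                      include_intermediate_responses=False) -> str:
    """Sketch of a metric-agnostic helper: optionally fold intermediate
    NL responses into the final-response text."""
    if not include_intermediate_responses:
        return get_text_from_content(final_content)
    texts = [get_text_from_content(c) for c in event_contents]
    texts.append(get_text_from_content(final_content))
    return "\n".join(t for t in texts if t)
```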

Lastly, please do add/update unit test cases.



Development

Successfully merging this pull request may close these issues.

rubric_based_final_response_quality_v1 does not evaluate text emitted before tool calls
