feat(eval): add evaluate_full_response option to rubric-based evaluation #5316
Siddhartha90 wants to merge 4 commits into google:main from feat/evaluate-full-response
Conversation
Hi @Siddhartha90, can you please fix the mypy-diff errors?
ankursharmas left a comment
Overall, I agree with the problem and, to some degree, with the direction of the solution as well.
We should add the boolean field that you have created to `BaseCriterion` instead, as this is a property that should be applicable to all the metrics, not just the one that deals with rubrics. As an eval owner, one would want the way the final response is assembled to be applied consistently across metrics.
Field name: `evaluate_full_response` doesn't convey the intent clearly. I would recommend `include_intermediate_responses_in_final` or something that clarifies the intent.
We should update `get_text_from_content` to have this new behavior. That way, other metrics could also re-use the implementation.
Lastly, please do add/update unit test cases.
Fixes #5217
Summary
When an agent emits text before a tool call (e.g. presenting a plan), then calls a tool, then emits more text (e.g. an explanation),
`rubric_based_final_response_quality_v1` only sends the post-tool-call text to the judge as `final_response`. The pre-tool-call text is stored in `intermediate_data.invocation_events` but is never included in the judge prompt. This means rubrics that check for content in the pre-tool-call text always fail, even though the agent correctly produced that content.
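The behavior described above can be sketched as follows. This is an illustrative simplification, not the PR's actual implementation; the function name and parameters are hypothetical.

```python
def build_judge_response_text(
    invocation_event_texts: list[str],
    final_response_text: str,
    evaluate_full_response: bool = False,
) -> str:
    """Assemble the text the judge sees as the final answer.

    Default: only the post-tool-call final response (current behavior,
    which drops pre-tool-call text). With evaluate_full_response=True:
    all invocation-event text plus the final response, so pre-tool-call
    text is judged too.
    """
    if not evaluate_full_response:
        return final_response_text
    parts = [t for t in invocation_event_texts if t] + [final_response_text]
    return "\n".join(parts)
```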
Changes
- Add `evaluate_full_response: bool = False` to `RubricsBasedCriterion` (following the pattern of `evaluate_intermediate_nl_responses` on `HallucinationsCriterion`)
- When enabled, concatenate text from `invocation_events` + `final_response` before sending to the judge

Usage
```json
{
  "rubric_based_final_response_quality_v1": {
    "threshold": 0.8,
    "evaluate_full_response": true,
    "rubrics": [...]
  }
}
```

Motivation
We have a resume improvement agent that:
1. Presents a plan (text)
2. Calls a tool (`submit_improved_resume`)
3. Emits explanation text

From the user's perspective this is one continuous response. But the rubric evaluator only judges step 3. Rubrics checking for the plan (step 1) always fail.
With `evaluate_full_response: true`, the judge sees the complete agent output and can accurately evaluate all rubrics.

Backwards compatible
The flag defaults to `false`, so existing behavior is unchanged.

Test plan
Scenario: Agent emits text before and after a tool call within a single invocation
1. Baseline: Run `rubric_based_final_response_quality_v1` without `evaluate_full_response` set. Confirm the judge only receives the post-tool-call text in `<final_answer>`. Rubrics checking for pre-tool-call content should fail. This validates no regression.
2. With `evaluate_full_response: true`: Run the same eval with the flag enabled. Confirm the judge receives the concatenated text from all invocation events + final_response in `<final_answer>`. Rubrics checking for pre-tool-call content should now pass.
3. Backwards compatibility: Existing `test_config.json` files without `evaluate_full_response` continue to work unchanged (the field defaults to `false`).

Pre-PR validation: We installed the lib from this PR (
`uv pip install "google-adk[eval] @ git+https://github.com/Siddhartha90/adk-python.git@feat/evaluate-full-response"`), then tested the core logic (concatenating text from `invocation_events` + `final_response`) against a production agent that emits plan text → calls a tool → emits explanation text.
Without full-response concatenation, a couple of our rubrics, `presents_plan` and `warm_acknowledgment`, which relied on pre-tool-call content, consistently scored 0.0. With full-response concatenation, all rubrics scored 1.0. The same logic is applied in this PR's changes to `format_auto_rater_prompt`.

🤖 Generated with Claude Code