fix(eval): fix search_tool correctness always scoring 0%#1675
Conversation
_args_match checked emit_widget (renamed to user_requested_search in the tool schema) and limit (optional, server-side default). Both mismatched on every real model call, so tool_correctness was always False regardless of whether the model used the right keywords and object types. Fix: evaluate only keywords (case-insensitive) and object_types — the two fields that actually determine whether the search was semantically correct.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthrough
ChangesSearch Tool Argument Matching
Estimated code review effort: 1 (Trivial) | ~5 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`:
- Around line 12-16: The _args_match comparison is not defensive enough and can
crash on malformed tool arguments from parsed_arguments(). Update _args_match to
validate and normalize actual_args["keywords"] and actual_args["object_types"]
before lowercasing or sorting, treating any non-string or unexpected entry as a
mismatch instead of raising. Keep the existing matching behavior for valid
inputs, but ensure bad JSON like mixed types or numeric keywords returns False
rather than aborting evaluation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1e3e175f-7f97-436d-9802-bd0a292bd4bb
📒 Files selected for processing (1)
packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1675 +/- ##
==========================================
- Coverage 79.21% 77.80% -1.41%
==========================================
Files 232 271 +39
Lines 15809 18602 +2793
==========================================
+ Hits 12523 14474 +1951
- Misses 3286 4128 +842 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
parsed_arguments() returns raw model-emitted JSON, so a bad tool call
like {"keywords":[1]} or mixed-type object_types would raise on
.lower()/sorted() and abort the whole eval run instead of scoring
tool_correctness=False. Add _normalize_str_list to drop non-string
entries defensively; valid comparisons (incl. case-insensitive keyword
match) are unchanged.
Addresses CodeRabbit review on PR #1675.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Question: This changes scoring behavior but adds no test locking it in. The two existing cases in Posted by Claude after discussing with @hkad98 |
Address review nits on search_tool _args_match: - reword _normalize_str_list comment: dropping non-string entries prevents a crash, it does not force a mismatch (surviving strings still compare) - note that object_types is compared case-sensitively on purpose (controlled ObjectType StrEnum values emitted verbatim) No behavior change. keywords/object_types are declared list[str] in the search_objects schema, so string-collapse false-negatives cannot occur. JIRA: TRIVIAL risk: nonprod
Summary
The
tool_correctnessmetric insearch_toolevaluations was alwaysFalse(0%), making the dashboard metric useless.Root cause:
_args_matchchecked two fields that never matched real model calls:emit_widget— this parameter was renamed touser_requested_searchin the tool schema; the model never sendsemit_widget, soactual_args.get("emit_widget")was alwaysNonevs the expectedfalselimit— declared as optional (int | None) in the schema; the model omits it and relies on the server default, while fixtures hardcode10Fix: Only check
keywords(case-insensitive) andobject_types— the two fields that determine whether the search was semantically correct.limitand the widget display flag are not part of search correctness.Test Plan
uv run pytest tests/ -x -qinpackages/gooddata-evaltool_correctnessnow reflects actual model behaviour (models that call with the right keywords and object types should score True)Risk
Low — single function change in the evaluator, no effect on eval runner or production code.
Summary by CodeRabbit