fix(eval): fix search_tool correctness always scoring 0% by Tomkess · Pull Request #1675 · gooddata/gooddata-python-sdk

Tomkess · 2026-06-27T20:34:01Z

Summary

The tool_correctness metric in search_tool evaluations was always False (0%), making the dashboard metric useless.

Root cause: _args_match checked two fields that never matched real model calls:

- emit_widget: this parameter was renamed to user_requested_search in the tool schema. The model never sends emit_widget, so actual_args.get("emit_widget") was always None and never equal to the expected false.
- limit: declared as optional (int | None) in the schema. The model omits it and relies on the server default, while the fixtures hardcode 10.

Fix: Only check keywords (case-insensitive) and object_types — the two fields that determine whether the search was semantically correct. limit and the widget display flag are not part of search correctness.
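
To make the root cause concrete, here is a minimal sketch of the strict and relaxed comparisons. The function names `args_match_before` and `args_match_after`, the exact shape of the argument dictionaries, and the sample values ("revenue", "metric") are assumptions made for illustration; the real `_args_match` in search_tool.py may be written differently.

```python
from typing import Any

# Fixture-style expectation (values are made-up examples).
EXPECTED = {"keywords": ["revenue"], "object_types": ["metric"], "limit": 10, "emit_widget": False}
# A plausible real model call: renamed flag, no limit, different keyword casing.
ACTUAL = {"keywords": ["Revenue"], "object_types": ["metric"], "user_requested_search": False}


def args_match_before(actual: dict[str, Any], expected: dict[str, Any]) -> bool:
    """Hypothetical reconstruction of the strict check: every field must be equal."""
    return (
        actual.get("keywords") == expected.get("keywords")
        and actual.get("object_types") == expected.get("object_types")
        and actual.get("limit") == expected.get("limit")  # None != 10
        and actual.get("emit_widget") == expected.get("emit_widget")  # None != False
    )


def args_match_after(actual: dict[str, Any], expected: dict[str, Any]) -> bool:
    """Sketch of the relaxed check: keywords (case-insensitive) and object_types only."""
    actual_keywords = sorted(k.lower() for k in actual.get("keywords", []))
    expected_keywords = sorted(k.lower() for k in expected.get("keywords", []))
    actual_types = sorted(actual.get("object_types", []))
    expected_types = sorted(expected.get("object_types", []))
    return actual_keywords == expected_keywords and actual_types == expected_types


print(args_match_before(ACTUAL, EXPECTED), args_match_after(ACTUAL, EXPECTED))
```

Under these assumptions the strict version fails even when the keywords are byte-identical, because the missing limit and the renamed flag can never equal the fixture values.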

Test Plan

- Run `uv run pytest tests/ -x -q` in `packages/gooddata-eval`
- Re-run search_tool eval items and verify tool_correctness now reflects actual model behaviour (models that call with the right keywords and object types should score True)

Risk

Low — single function change in the evaluator, no effect on eval runner or production code.

Summary by CodeRabbit

Bug Fixes
- Improved matching of search tool calls to judge correctness using the most relevant inputs.
- Search keywords are now compared case-insensitively, and object types are normalized for more reliable matching.
- Search options such as limit and widget emission are no longer required to match exactly, reducing false mismatches.
- Malformed or non-string keyword/object type inputs are handled defensively, preventing matching-time errors.

_args_match checked emit_widget (renamed to user_requested_search in the tool schema) and limit (optional, server-side default). Both mismatched on every real model call, so tool_correctness was always False regardless of whether the model used the right keywords and object types.

Fix: evaluate only keywords (case-insensitive) and object_types, the two fields that actually determine whether the search was semantically correct.

coderabbitai · 2026-06-27T20:34:19Z

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between f847679 and 53dc5d2.

📒 Files selected for processing (1)

packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

🚧 Files skipped from review as they are similar to previous changes (1)

packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

📝 Walkthrough

_args_match now compares only the normalized keywords and object_types. Malformed inputs (values that are not lists) are normalized to empty lists instead of being compared directly.

Changes

Search Tool Argument Matching

| Layer / File(s) | Summary |
|---|---|
| Normalize string lists (`packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`) | Adds a helper that filters list inputs to strings, optionally lowercases them, sorts them, and returns an empty list for malformed values. |
| Relax `_args_match` criteria (`packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`) | Compares normalized `keywords` and `object_types` only, and removes the previous `limit` and `emit_widget` equality checks. |

Estimated code review effort: 1 (Trivial) | ~5 minutes

Poem

🐇 I sniff the keywords, round and neat,
Object types line up to the beat.
Limits drift off into the mist,
Widget flags no longer insist.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly reflects the main change: fixing search_tool evaluation correctness that was incorrectly scoring 0%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`:
- Around line 12-16: The _args_match comparison is not defensive enough and can
crash on malformed tool arguments from parsed_arguments(). Update _args_match to
validate and normalize actual_args["keywords"] and actual_args["object_types"]
before lowercasing or sorting, treating any non-string or unexpected entry as a
mismatch instead of raising. Keep the existing matching behavior for valid
inputs, but ensure bad JSON like mixed types or numeric keywords returns False
rather than aborting evaluation.
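
The failure mode raised in this comment can be reproduced with plain Python, independent of the SDK. The sketch below is illustrative only: `error_name` is a hypothetical helper, and the sample payloads stand in for malformed model-emitted arguments.

```python
def error_name(thunk):
    """Run `thunk` and return the name of the exception it raises, or None."""
    try:
        thunk()
    except Exception as exc:  # broad on purpose: we only want to report the type
        return type(exc).__name__
    return None


# A numeric keyword has no .lower() method.
lower_error = error_name(lambda: [k.lower() for k in [1]])

# Mixed str/int entries cannot be ordered against each other.
sort_error = error_name(lambda: sorted(["metric", 1]))

print(lower_error, sort_error)
```

Either exception, if left uncaught inside the matcher, would propagate out of the evaluator instead of producing a False score for that item.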


📥 Commits

Reviewing files that changed from the base of the PR and between 726a0b0 and 2ebc8b0.

📒 Files selected for processing (1)

packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

codecov · 2026-06-27T20:37:49Z

Codecov Report

❌ Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 77.80%. Comparing base (653f5bc) to head (53dc5d2).
⚠️ Report is 14 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...l/src/gooddata_eval/core/evaluators/search_tool.py | 88.88% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1675      +/-   ##
==========================================
- Coverage   79.21%   77.80%   -1.41%     
==========================================
  Files         232      271      +39     
  Lines       15809    18602    +2793     
==========================================
+ Hits        12523    14474    +1951     
- Misses       3286     4128     +842


parsed_arguments() returns raw model-emitted JSON, so a bad tool call like {"keywords":[1]} or mixed-type object_types would raise on .lower()/sorted() and abort the whole eval run instead of scoring tool_correctness=False.

Add _normalize_str_list to drop non-string entries defensively; valid comparisons (incl. case-insensitive keyword match) are unchanged.

Addresses CodeRabbit review on PR #1675.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hkad98 · 2026-07-02T08:54:58Z

Question: This changes scoring behavior but adds no test locking it in. The two existing cases in test_search_tool_evaluator.py still pass only because their fixtures keep limit/emit_widget matching — nothing asserts the actual fix: a call where limit/emit_widget differ but keywords+object_types match should now score tool_correctness=True. Given this metric was silently wrong in production, could we add a regression test for that (plus case-insensitive keywords and the malformed-input path)?
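
A regression test along the lines requested here might look like the following sketch. It runs against a local stand-in matcher, `args_match`, so that the example is self-contained; the real tests would presumably call the evaluator in test_search_tool_evaluator.py and assert on tool_correctness. The stand-in, the fixture values, and the test names are all assumptions.

```python
def _str_list(value, lowercase=False):
    # Stand-in normalizer: non-lists become [], non-strings are dropped.
    if not isinstance(value, list):
        return []
    items = [v.lower() if lowercase else v for v in value if isinstance(v, str)]
    return sorted(items)


def args_match(actual_args, expected_args):
    # Stand-in for the relaxed matcher: keywords and object_types only.
    return _str_list(actual_args.get("keywords"), lowercase=True) == _str_list(
        expected_args.get("keywords"), lowercase=True
    ) and _str_list(actual_args.get("object_types")) == _str_list(expected_args.get("object_types"))


EXPECTED = {"keywords": ["revenue"], "object_types": ["metric"], "limit": 10, "emit_widget": False}


def test_limit_and_widget_flag_are_ignored():
    actual = {"keywords": ["revenue"], "object_types": ["metric"], "limit": 50, "user_requested_search": True}
    assert args_match(actual, EXPECTED) is True


def test_keywords_match_case_insensitively():
    actual = {"keywords": ["REVENUE"], "object_types": ["metric"]}
    assert args_match(actual, EXPECTED) is True


def test_malformed_keywords_score_false_without_raising():
    actual = {"keywords": [1], "object_types": ["metric"]}
    assert args_match(actual, EXPECTED) is False
```

The first test is the one that locks in the fix: it passes only if differing limit and widget values no longer influence the result.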

Posted by Claude after discussing with @hkad98

Address review nits on search_tool _args_match:

- reword _normalize_str_list comment: dropping non-string entries prevents a crash, it does not force a mismatch (surviving strings still compare)
- note that object_types is compared case-sensitively on purpose (controlled ObjectType StrEnum values emitted verbatim)

No behavior change. keywords/object_types are declared list[str] in the search_objects schema, so string-collapse false-negatives cannot occur.

JIRA: TRIVIAL
risk: nonprod
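
The two nits in this commit message describe behavior that is easy to misread, so a short illustration may help. The `normalize` function below is a hypothetical stand-in for `_normalize_str_list`, and the sample values are invented.

```python
def normalize(value, lowercase=False):
    # Hypothetical stand-in: keep only strings, optionally lowercase, then sort.
    if not isinstance(value, list):
        return []
    return sorted(v.lower() if lowercase else v for v in value if isinstance(v, str))


# Nit 1: the stray 1 is dropped, but "Revenue" survives and still matches.
keywords_match = normalize(["Revenue", 1], lowercase=True) == normalize(["revenue"], lowercase=True)

# Nit 2: object types are compared as emitted, so a casing difference is a mismatch.
object_types_match = normalize(["Metric"]) == normalize(["metric"])

print(keywords_match, object_types_match)
```

In other words, dropping bad entries only guards against exceptions; whether the call matches still depends on the strings that remain.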

Tomkess requested review from hkad98, lupko and pcerny as code owners June 27, 2026 20:34

coderabbitai Bot reviewed Jun 27, 2026

View reviewed changes

Comment thread packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py Outdated

hkad98 reviewed Jul 2, 2026

View reviewed changes

Comment thread packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

hkad98 reviewed Jul 2, 2026

View reviewed changes

Comment thread packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

hkad98 reviewed Jul 2, 2026

View reviewed changes

Comment thread packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py Outdated

Tomkess wants to merge 3 commits into master from fix/eval-search-tool-correctness
