Sometimes, LLMs disagree with their own thoughts by derekmisler · Pull Request #48 · docker/cagent-action

derekmisler · 2026-02-18T20:39:25Z

Summary

Adds structured output schemas and decision rules to the PR review agent to ensure consistent verdicts across multiple runs. The agent now returns JSON-formatted findings and applies mechanical rules to determine whether to APPROVE, COMMENT, or REQUEST_CHANGES based on severity and verification status.

Changes

review-pr/agents/pr-review.yaml: Added structured output schemas for drafter and verifier agents, enforcing JSON responses with required fields (file, line, severity, verdict, etc.). Introduced mandatory decision rules that map findings to review verdicts mechanically, preventing LLMs from overriding severity assessments. Added console output mode detection to support local testing without posting to GitHub.
review-pr/agents/evals/: Added 6 eval files (3 for a security vulnerability, 3 for a clean PR) to test consistency. Each eval runs the same PR through the agent and verifies the verdict matches expectations.
review-pr/README.md: Documented how to run the agent locally, how to run evals, and eval naming conventions (success-* for clean PRs, security-* for PRs with issues).
.gitignore: Excluded eval results directory from version control.
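To make the idea of mechanical decision rules concrete, here is a minimal sketch of how verified findings could be mapped to a review verdict. It is not taken from pr-review.yaml: the severity labels, the "confirmed"/"dismissed" verifier statuses, the blocking threshold, and the names Finding and decide are all assumptions made for illustration. Only the field names (file, line, severity, verdict) and the three review outcomes come from the description above.

```python
from dataclasses import dataclass

# Assumed severity labels that should block a merge; the PR does not list the real values.
BLOCKING_SEVERITIES = {"high", "critical"}


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    severity: str  # assumed labels, e.g. "low", "medium", "high", "critical"
    verdict: str   # assumed verifier status: "confirmed" or "dismissed"


def decide(findings: list[Finding]) -> str:
    """Map findings to a review verdict with fixed rules, no model judgment involved."""
    # Only findings the verifier confirmed count toward the decision.
    confirmed = [f for f in findings if f.verdict == "confirmed"]
    if any(f.severity in BLOCKING_SEVERITIES for f in confirmed):
        return "REQUEST_CHANGES"
    if confirmed:
        return "COMMENT"
    return "APPROVE"
```

Because the mapping is a pure function of the structured findings, the same findings always yield the same verdict, which is the consistency property the evals check.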

Breaking Changes

None. The agent still posts reviews to GitHub when running in CI and outputs to console when running locally.
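The PR does not say how the agent tells CI from a local run. One plausible approach, sketched here purely as an assumption, is to check the GITHUB_ACTIONS environment variable, which GitHub Actions sets to "true" on its runners. The function name output_mode and the mode labels "github" and "console" are hypothetical.

```python
import os


def output_mode(env=None):
    """Return "github" when running under GitHub Actions, otherwise "console".

    Assumption: detection relies on the GITHUB_ACTIONS variable; the actual
    agent may use a different signal.
    """
    env = os.environ if env is None else env
    return "github" if env.get("GITHUB_ACTIONS") == "true" else "console"
```

Passing the environment as a parameter keeps the check easy to test without modifying the real process environment.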

How to Test

Run the evals: cagent eval review-pr/agents/pr-review.yaml review-pr/agents/evals/ -e GITHUB_TOKEN -e GH_TOKEN
Verify that all 3 runs of security-redirect-uri-*.json produce REQUEST_CHANGES verdicts
Verify that all 3 runs of success-*.json produce APPROVE verdicts
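The two verification steps above could be automated along these lines. This is a sketch under assumptions: it takes a plain mapping from eval file name to the verdict that run produced, and does not parse the real cagent eval output, whose format is not shown in this PR. The helper names and the numbered file names in the usage note are hypothetical; only the security-* and success-* prefixes and the expected verdicts come from the description.

```python
from fnmatch import fnmatch

# Naming convention from the README: prefix determines the expected verdict.
EXPECTED = [
    ("security-*", "REQUEST_CHANGES"),
    ("success-*", "APPROVE"),
]


def expected_verdict(eval_name):
    """Return the verdict implied by the eval file name, or None if no rule matches."""
    for pattern, verdict in EXPECTED:
        if fnmatch(eval_name, pattern):
            return verdict
    return None


def inconsistent_runs(results):
    """Return the sorted names of evals whose verdict differs from the expected one.

    `results` maps an eval file name to the verdict produced by that run.
    Names that match no naming rule are ignored.
    """
    return sorted(
        name
        for name, verdict in results.items()
        if expected_verdict(name) not in (None, verdict)
    )
```

For example, calling inconsistent_runs with {"security-redirect-uri-1.json": "REQUEST_CHANGES", "success-1.json": "COMMENT"} would flag only success-1.json, since a clean PR is expected to be approved.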

Closes: https://github.com/docker/gordon/issues/153

Signed-off-by: Derek Misler <derek.misler@docker.com>

derekmisler · 2026-02-18T20:39:46Z

/describe

docker-agent · 2026-02-18T20:40:22Z

✅ PR description has been generated and updated!

docker-agent

Review Summary

No issues found! This PR adds comprehensive eval testing infrastructure and documentation.

Changes Reviewed

✅ .gitignore entry for eval results
✅ Documentation for running evals locally
✅ Six eval test files (3 security, 3 success) with proper JSON structure
✅ All eval files have valid UUIDs, proper relevance assertions, and test fixtures

The added code is structurally sound, well-documented, and ready to merge.

Sometimes, LLMs disagree with their own thoughts

fe74d99

Signed-off-by: Derek Misler <derek.misler@docker.com>

derekmisler self-assigned this Feb 18, 2026

derekmisler marked this pull request as ready for review February 18, 2026 20:39

derekmisler requested a review from a team as a code owner February 18, 2026 20:39

docker-agent bot approved these changes Feb 18, 2026

re-added example in readme

6047e5d

Signed-off-by: Derek Misler <derek.misler@docker.com>

rumpl approved these changes Feb 18, 2026

derekmisler merged commit 14e1c08 into docker:main Feb 19, 2026
7 checks passed

derekmisler merged 2 commits into docker:main from derekmisler:sometimes-llms-disagree-with-their-own-thoughts
