docs(skill): improve description to fix 20 pp recall gap in LLM routing #409

Closed
pedropaulovc wants to merge 1 commit into microsoft:main from pedropaulovc:docs-skill-description-invoke-rate

pedropaulovc commented May 18, 2026

Fixes #410

Summary

The current description causes the model to miss ~20% of valid invocations when routing from description text alone. This PR changes only the description field in skills/playwright-cli/SKILL.md.

Before (main):

Automate browser interactions, test web pages and work with Playwright tests.

After (this PR):

Use when the user says: "go to this URL", "click this button", "fill this
form", "take a screenshot", "scrape this page", "log in to X", "run my
Playwright tests", "this test is failing", "write a test for", "mock this
API", "record a demo". The skill opens a live browser with Playwright and drives it.

Benchmark

Tested against a 50-prompt set (40 should-trigger / 10 should-not-trigger) using OpenAI Codex as the routing judge (cross-model, independent of Anthropic's training data).

Methodology: skill renamed to fictional headless-pilot during evaluation so the judge cannot use prior knowledge of playwright-cli — forcing routing purely from description text. A first run with the real skill name produced identical scores for all variants (F1=0.987 each), revealing the judge was using training-data knowledge rather than the description. The isolated re-run produced differentiated, meaningful results.
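
The PR does not include the benchmark harness itself. As a rough sketch of the isolation step described above, assuming a hypothetical `judge` callable that wraps whichever model does the routing, the evaluation loop might look like this in Python. The names `build_routing_prompt` and `evaluate`, and the wording of the prompt shown to the judge, are illustrative and not taken from the PR.

```python
from typing import Callable, Iterable

FICTIONAL_NAME = "headless-pilot"  # fictional name used in the PR's isolated run


def build_routing_prompt(description: str, user_prompt: str,
                         skill_name: str = FICTIONAL_NAME) -> str:
    """Build the text shown to the judge (the wording here is hypothetical).

    Only the fictional name and the description are exposed, so the judge
    cannot lean on prior knowledge of the real skill name.
    """
    return (
        f"Skill: {skill_name}\n"
        f"Description: {description}\n"
        f"User request: {user_prompt}\n"
        "Should this skill be invoked? Answer yes or no."
    )


def evaluate(description: str,
             labeled_prompts: Iterable[tuple[str, bool]],
             judge: Callable[[str], bool]) -> dict[str, int]:
    """Count confusion-matrix cells for one description variant."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for user_prompt, should_trigger in labeled_prompts:
        invoked = judge(build_routing_prompt(description, user_prompt))
        if should_trigger:
            counts["tp" if invoked else "fn"] += 1
        else:
            counts["fp" if invoked else "tn"] += 1
    return counts
```

Passing the judge in as a plain callable keeps the loop independent of any particular model API, so the same labeled prompt set can be replayed against each description variant.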

Recall benchmark (isolated — fictional skill name)

| Variant | Recall | Precision | F1 | Misses |
|---|---|---|---|---|
| main (baseline) | 0.800 | 1.000 | 0.889 | 8 |
| this PR | 0.975 | 1.000 | 0.987 | 1 |
| Δ | +17.5 pp | 0.0 pp | +9.8 pp | −7 |

All variants had perfect precision — no over-triggering. The improvement is entirely in recall.
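
The reported figures follow from the miss counts alone. As a quick check, this sketch recomputes them from the numbers stated above (40 should-trigger prompts, zero false positives); the helper name `prf1` is illustrative.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


SHOULD_TRIGGER = 40  # size of the positive set in the PR's benchmark

for label, misses in [("main (baseline)", 8), ("this PR", 1)]:
    p, r, f1 = prf1(tp=SHOULD_TRIGGER - misses, fp=0, fn=misses)
    print(f"{label}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
# main (baseline): precision=1.000 recall=0.800 F1=0.889
# this PR: precision=1.000 recall=0.975 F1=0.987
```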

The 8 prompts the baseline misses all have unambiguous playwright-cli solutions but don't contain "automate" or "browser interactions": running a specific .spec.ts file, recording video, exporting PDF, mocking an API endpoint, intercepting network requests, debugging a test by file path.

Precision audit — "write a test for" false-positive risk

The trigger phrase "write a test for" is intentionally broad. Tested 15 prompts that should NOT trigger the skill:

| Framework | Result |
|---|---|
| JUnit, MSTest, pytest, Jest, RSpec, Mocha, NUnit, unittest, Go testing, MockMvc, Testing Library, supertest | 0 / 12 false positives |
| Generic ("write a test for this function") | 0 / 2 false positives |
| Cypress (browser tool, not Playwright) | Correctly excluded ✓ |

0 / 15 false positives. The model disambiguates using the framework name in the prompt and "with Playwright" in the closing sentence rather than keyword-matching the trigger phrase in isolation.
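
For contrast, here is what a literal substring match would do on the same 15 audit prompts (copied from the appendix). This is a strawman baseline for comparison, not the routing method used in the PR.

```python
# The 15 precision-audit prompts from the appendix, all labeled should-NOT-trigger.
AUDIT_PROMPTS = [
    "write a JUnit test for the UserService login method",
    "write an MSTest for the checkout controller",
    "write a pytest test for the API endpoint",
    "write a Jest unit test for my formatDate utility function",
    "create an RSpec test for the User model",
    "add a Mocha test for the payment service",
    "write a test for my Spring Boot controller using MockMvc",
    "write a test for my React component with Testing Library",
    "generate a test for the database pool using Go's testing package",
    "write a NUnit test for the calculator class",
    "write a test for this function — it takes an array and returns the sorted version",
    "write a test for the auth middleware using supertest",
    "add a test case for the edge case where the input is empty string",
    "write a test for my Python class using unittest",
    "write a Cypress test for the login flow",
]

TRIGGER = "write a test for"

# Naive router: fire whenever the trigger phrase appears verbatim.
naive_hits = [p for p in AUDIT_PROMPTS if TRIGGER in p.lower()]
print(f"{len(naive_hits)} / {len(AUDIT_PROMPTS)} prompts contain the literal trigger phrase")
# 5 / 15 prompts contain the literal trigger phrase
```

Five of the fifteen prompts contain the phrase verbatim, so a pure keyword router would have produced five false positives where the judge produced none.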


Appendix: test prompts

Should trigger (40 prompts)

Core interaction

  1. open https://example.com and click the sign-in button
  2. fill out the contact form at /contact with test data
  3. navigate to the homepage and check what links are in the nav
  4. can you click through this signup flow for me
  5. hover over the dropdown menu and list the options
  6. type 'hello world' into the search box and press Enter
  7. drag the card from column A to column B
  8. upload test.pdf to the file input on this page
  9. select 'United States' in the country dropdown
  10. open the browser and snapshot the current page
  11. check if the submit button is disabled
  12. double-click the filename to rename it
  13. dismiss the cookie consent dialog
  14. resize the viewport to 1920x1080 and take a screenshot
  15. right-click the image and check the context menu options
  16. tab through all the form fields and verify the labels

Test run / debug / generate

  17. run my playwright tests and show me what fails
  18. this playwright test is failing — help me fix it
  19. write e2e tests for the checkout flow
  20. generate a Playwright test for the login page
  21. debug why tests/auth.spec.ts line 42 is broken
  22. run tests/checkout.spec.ts in headed mode
  23. heal the broken playwright tests after my refactor
  24. help me write a spec plan for the dashboard feature
  25. step through the test to see what the DOM looks like at failure
  26. I need playwright tests for my new modal component

Screenshots / video / PDF

  27. take a screenshot of https://anthropic.com
  28. record a demo video of the checkout flow
  29. export the pricing page as a PDF
  30. screenshot the page at every responsive breakpoint for QA
  31. record a video with chapter titles for the onboarding demo
  32. capture a screenshot of just the hero section

Scraping / data extraction

  33. scrape the product names and prices from this page
  34. extract all the href links on the page
  35. scrape pages 1 through 5 and collect all article titles
  36. get the text content of the error messages on this form

Advanced (mock / auth / sessions)

  37. mock the /api/users endpoint to return an empty array
  38. save my authenticated session so I can reuse it later
  39. test the admin and regular-user flows in separate browser sessions
  40. intercept network requests and log which ones are slow

Should NOT trigger (10 prompts)
  1. write a Python script to parse a CSV file
  2. help me refactor this React component
  3. review my pull request
  4. run the security audit on the codebase
  5. what does this SQL query do
  6. update my CHANGELOG with the release notes
  7. do a code review of the diff on this branch
  8. analyze the webpack bundle size
  9. help me understand this sorting algorithm
  10. run my unit tests with jest

Precision audit — should NOT trigger (15 prompts)
  1. write a JUnit test for the UserService login method
  2. write an MSTest for the checkout controller
  3. write a pytest test for the API endpoint
  4. write a Jest unit test for my formatDate utility function
  5. create an RSpec test for the User model
  6. add a Mocha test for the payment service
  7. write a test for my Spring Boot controller using MockMvc
  8. write a test for my React component with Testing Library
  9. generate a test for the database pool using Go's testing package
  10. write a NUnit test for the calculator class
  11. write a test for this function — it takes an array and returns the sorted version
  12. write a test for the auth middleware using supertest
  13. add a test case for the edge case where the input is empty string
  14. write a test for my Python class using unittest
  15. write a Cypress test for the login flow

Copilot AI review requested due to automatic review settings May 18, 2026 21:23

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Expands the playwright-cli skill description to better enumerate supported Playwright CLI workflows and example user prompts.

Changes:

  • Replaced the single-line description with a folded multi-line YAML description.
  • Added concrete examples of when to use this skill (e.g., run/debug tests, scrape pages, mock APIs).

@pedropaulovc pedropaulovc force-pushed the docs-skill-description-invoke-rate branch from 8c7665a to ce4ae00 Compare May 18, 2026 21:44
@pedropaulovc pedropaulovc changed the title from "docs(skill): expand description to improve LLM routing precision" to "docs(skill): improve description to fix 20 pp recall gap in LLM routing" May 18, 2026

The current one-liner ("Automate browser interactions...") causes the
model to miss 20% of valid invocations when routing from description
text alone — verified by benchmarking 5 variants with OpenAI Codex as
judge against a 50-prompt test set (40 should-trigger, 10 should-not).

Benchmark methodology: skill renamed to a fictional name (`headless-pilot`)
so the judge cannot use prior knowledge of playwright-cli — forcing it to
route purely from description text. All descriptions had perfect precision
(1.000); recall was what differed.

Results:
  baseline (current)   recall=0.800  F1=0.889  (8 misses)
  v1 explicit triggers recall=0.950  F1=0.974  (2 misses)
  v2 intent-first      recall=0.925  F1=0.961  (3 misses)
  v3 verb-dense        recall=0.950  F1=0.974  (2 misses)
  v4 this PR (winner)  recall=0.975  F1=0.987  (1 miss)

The 8 prompts the baseline misses all have unambiguous playwright-cli
solutions but don't contain "automate" or "browser interactions":
record video, export PDF, mock API endpoint, run a specific spec file,
intercept network requests, debug a failing test by file path.

The winning description uses "Use when the user says: ..." with quoted
natural-language trigger phrases. This gives the routing model a direct
string-match signal rather than requiring it to infer that "record a
demo" ≡ "automate browser interactions".

Only change: the `description` field in the YAML frontmatter.
@pedropaulovc pedropaulovc force-pushed the docs-skill-description-invoke-rate branch from ce4ae00 to cd5b7f3 Compare May 18, 2026 21:54
Skn0tt (Member) commented May 19, 2026

Hi! The source for the skill lives in https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/tools/cli-client/skill/SKILL.md, could you reopen the PR against that file? The numbers look good, thanks for the research!

@Skn0tt Skn0tt closed this May 19, 2026
Linked issue: skill: description causes ~20% recall gap in LLM routing (baseline recall=0.800)