docs(skill): improve description to fix 20 pp recall gap in LLM routing by pedropaulovc · Pull Request #409 · microsoft/playwright-cli

pedropaulovc · 2026-05-18T21:23:16Z

Fixes #410

Summary

The current description causes the model to miss ~20% of valid invocations when routing from description text alone. This PR changes only the description field in skills/playwright-cli/SKILL.md.

Before (main):

Automate browser interactions, test web pages and work with Playwright tests.

After (this PR):

Use when the user says: "go to this URL", "click this button", "fill this
form", "take a screenshot", "scrape this page", "log in to X", "run my
Playwright tests", "this test is failing", "write a test for", "mock this
API", "record a demo". The skill opens a live browser with Playwright and drives it.

Benchmark

Tested against a 50-prompt set (40 should-trigger / 10 should-not-trigger) using OpenAI Codex as the routing judge (cross-model, independent of Anthropic's training data).

Methodology: skill renamed to fictional headless-pilot during evaluation so the judge cannot use prior knowledge of playwright-cli — forcing routing purely from description text. A first run with the real skill name produced identical scores for all variants (F1=0.987 each), revealing the judge was using training-data knowledge rather than the description. The isolated re-run produced differentiated, meaningful results.
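The isolation step can be sketched as follows. This is a minimal illustration, not the harness used for the benchmark: `build_routing_prompt`, `route`, and `stub_judge` are hypothetical names, the prompt wording is an assumption, and the judge is modelled as any callable that maps a prompt string to a "YES"/"NO" answer. The only point is that the judge sees the fictional alias and the description text, never the real skill name.

```python
from typing import Callable

# Baseline description, copied from the "Before (main)" text in this PR.
BASELINE = "Automate browser interactions, test web pages and work with Playwright tests."


def build_routing_prompt(skill_name: str, description: str, user_prompt: str) -> str:
    """Compose the text shown to the judge: one skill, one user request.

    The wording of this prompt is an assumption made for illustration.
    """
    return (
        "You are routing a user request to a skill.\n"
        f"Skill name: {skill_name}\n"
        f"Skill description: {description}\n"
        f"User request: {user_prompt}\n"
        "Answer YES if the skill should be invoked, otherwise NO."
    )


def route(judge: Callable[[str], str], description: str, user_prompt: str,
          alias: str = "headless-pilot") -> bool:
    """Ask the judge about the skill under a fictional alias, not its real name."""
    prompt = build_routing_prompt(alias, description, user_prompt)
    return judge(prompt).strip().upper().startswith("YES")


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call; looks only at the user request line."""
    request = next(line for line in prompt.splitlines()
                   if line.startswith("User request:"))
    return "YES" if "screenshot" in request.lower() else "NO"


if __name__ == "__main__":
    print(route(stub_judge, BASELINE, "take a screenshot of https://anthropic.com"))
    print(route(stub_judge, BASELINE, "review my pull request"))
```

Swapping `stub_judge` for a function that calls an actual model would turn this into a usable harness; that call is deliberately left out here because the PR does not describe how the judge was invoked.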

Recall benchmark (isolated — fictional skill name)

Variant	Recall	Precision	F1	Misses
main (baseline)	0.800	1.000	0.889	8
this PR	0.975	1.000	0.987	1
Δ	+17.5 pp	0.0 pp	+9.8 pp	−7

All variants had perfect precision — no over-triggering. The improvement is entirely in recall.
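For readers who want to check the table, the figures follow from the miss counts alone. The sketch below assumes the standard definitions of precision, recall, and F1, and takes the 40 should-trigger prompts, zero false positives, and the miss counts (8 and 1) from the table above; the function name `routing_metrics` is hypothetical.

```python
def routing_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F1 from raw routing counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


SHOULD_TRIGGER = 40  # size of the should-trigger set in this benchmark

# Misses are false negatives; neither variant produced a false positive.
baseline = routing_metrics(tp=SHOULD_TRIGGER - 8, fp=0, fn=8)
this_pr = routing_metrics(tp=SHOULD_TRIGGER - 1, fp=0, fn=1)

if __name__ == "__main__":
    for name, m in (("main (baseline)", baseline), ("this PR", this_pr)):
        print(f"{name}: recall={m['recall']:.3f} "
              f"precision={m['precision']:.3f} f1={m['f1']:.3f}")
    delta = (this_pr["recall"] - baseline["recall"]) * 100
    print(f"recall delta: {delta:+.1f} pp")
```

With these inputs the recall values are 32/40 = 0.800 and 39/40 = 0.975, which gives the +17.5 pp difference reported in the Δ row.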

The 8 prompts the baseline misses all have unambiguous playwright-cli solutions but don't contain "automate" or "browser interactions": running a specific .spec.ts file, recording video, exporting PDF, mocking an API endpoint, intercepting network requests, debugging a test by file path.

Precision audit — "write a test for" false-positive risk

The trigger phrase "write a test for" is intentionally broad. Tested 15 prompts that should NOT trigger the skill:

Framework	Result
JUnit, MSTest, pytest, Jest, RSpec, Mocha, NUnit, unittest, Go testing, MockMvc, Testing Library, supertest	0 / 12 false positives
Generic ("write a test for this function")	0 / 2 false positives
Cypress (browser tool, not Playwright)	Correctly excluded ✓

0 / 15 false positives. The model disambiguates using the framework name in the prompt and "with Playwright" in the closing sentence rather than keyword-matching the trigger phrase in isolation.
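To make the contrast with keyword matching concrete, the sketch below applies a naive, case-insensitive substring check for "write a test for" to the 15 audit prompts listed in the appendix. It is an illustration only: the function `naive_keyword_route` is hypothetical and is not how the judge in this benchmark works.

```python
TRIGGER_PHRASE = "write a test for"

# The 15 precision-audit prompts, copied from the appendix of this PR.
AUDIT_PROMPTS = [
    "write a JUnit test for the UserService login method",
    "write an MSTest for the checkout controller",
    "write a pytest test for the API endpoint",
    "write a Jest unit test for my formatDate utility function",
    "create an RSpec test for the User model",
    "add a Mocha test for the payment service",
    "write a test for my Spring Boot controller using MockMvc",
    "write a test for my React component with Testing Library",
    "generate a test for the database pool using Go's testing package",
    "write a NUnit test for the calculator class",
    "write a test for this function — it takes an array and returns the sorted version",
    "write a test for the auth middleware using supertest",
    "add a test case for the edge case where the input is empty string",
    "write a test for my Python class using unittest",
    "write a Cypress test for the login flow",
]


def naive_keyword_route(prompt: str, phrase: str = TRIGGER_PHRASE) -> bool:
    """Trigger whenever the literal phrase appears, ignoring all other context."""
    return phrase in prompt.lower()


naive_hits = [p for p in AUDIT_PROMPTS if naive_keyword_route(p)]

if __name__ == "__main__":
    print(f"{len(naive_hits)} / {len(AUDIT_PROMPTS)} would trigger on the phrase alone")
    for p in naive_hits:
        print("  -", p)
```

On these prompts the substring check fires 5 times out of 15 (the MockMvc, Testing Library, supertest, unittest, and generic-function prompts), whereas the judge reported 0 / 15. That gap is consistent with the judge weighing the framework named in the prompt rather than matching the phrase in isolation.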

Appendix: test prompts

Should trigger (40 prompts)

Core interaction

1. open https://example.com and click the sign-in button
2. fill out the contact form at /contact with test data
3. navigate to the homepage and check what links are in the nav
4. can you click through this signup flow for me
5. hover over the dropdown menu and list the options
6. type 'hello world' into the search box and press Enter
7. drag the card from column A to column B
8. upload test.pdf to the file input on this page
9. select 'United States' in the country dropdown
10. open the browser and snapshot the current page
11. check if the submit button is disabled
12. double-click the filename to rename it
13. dismiss the cookie consent dialog
14. resize the viewport to 1920x1080 and take a screenshot
15. right-click the image and check the context menu options
16. tab through all the form fields and verify the labels

Test run / debug / generate
17. run my playwright tests and show me what fails
18. this playwright test is failing — help me fix it
19. write e2e tests for the checkout flow
20. generate a Playwright test for the login page
21. debug why tests/auth.spec.ts line 42 is broken
22. run tests/checkout.spec.ts in headed mode
23. heal the broken playwright tests after my refactor
24. help me write a spec plan for the dashboard feature
25. step through the test to see what the DOM looks like at failure
26. I need playwright tests for my new modal component

Screenshots / video / PDF
27. take a screenshot of https://anthropic.com
28. record a demo video of the checkout flow
29. export the pricing page as a PDF
30. screenshot the page at every responsive breakpoint for QA
31. record a video with chapter titles for the onboarding demo
32. capture a screenshot of just the hero section

Scraping / data extraction
33. scrape the product names and prices from this page
34. extract all the href links on the page
35. scrape pages 1 through 5 and collect all article titles
36. get the text content of the error messages on this form

Advanced (mock / auth / sessions)
37. mock the /api/users endpoint to return an empty array
38. save my authenticated session so I can reuse it later
39. test the admin and regular-user flows in separate browser sessions
40. intercept network requests and log which ones are slow

Should NOT trigger (10 prompts)

write a Python script to parse a CSV file
help me refactor this React component
review my pull request
run the security audit on the codebase
what does this SQL query do
update my CHANGELOG with the release notes
do a code review of the diff on this branch
analyze the webpack bundle size
help me understand this sorting algorithm
run my unit tests with jest

Precision audit — should NOT trigger (15 prompts)

write a JUnit test for the UserService login method
write an MSTest for the checkout controller
write a pytest test for the API endpoint
write a Jest unit test for my formatDate utility function
create an RSpec test for the User model
add a Mocha test for the payment service
write a test for my Spring Boot controller using MockMvc
write a test for my React component with Testing Library
generate a test for the database pool using Go's testing package
write a NUnit test for the calculator class
write a test for this function — it takes an array and returns the sorted version
write a test for the auth middleware using supertest
add a test case for the edge case where the input is empty string
write a test for my Python class using unittest
write a Cypress test for the login flow

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Expands the playwright-cli skill description to better enumerate supported Playwright CLI workflows and example user prompts.

Changes:

Replaced the single-line description with a folded multi-line YAML description.
Added concrete examples of when to use this skill (e.g., run/debug tests, scrape pages, mock APIs).

The current one-liner ("Automate browser interactions...") causes the model to miss 20% of valid invocations when routing from description text alone. This was verified by benchmarking 5 variants with OpenAI Codex as judge against a 50-prompt test set (40 should-trigger, 10 should-not).

Benchmark methodology: skill renamed to a fictional name (`headless-pilot`) so the judge cannot use prior knowledge of playwright-cli, which forces it to route purely from description text. All descriptions had perfect precision (1.000); recall was what differed.

Results:

baseline (current): recall=0.800, F1=0.889 (8 misses)
v1 explicit triggers: recall=0.950, F1=0.974 (2 misses)
v2 intent-first: recall=0.925, F1=0.961 (3 misses)
v3 verb-dense: recall=0.950, F1=0.974 (2 misses)
v4 this PR (winner): recall=0.975, F1=0.987 (1 miss)

The 8 prompts the baseline misses all have unambiguous playwright-cli solutions but don't contain "automate" or "browser interactions": record video, export PDF, mock API endpoint, run a specific spec file, intercept network requests, debug a failing test by file path.

The winning description uses "Use when the user says: ..." with quoted natural-language trigger phrases. This gives the routing model a direct string-match signal rather than requiring it to infer that "record a demo" ≡ "automate browser interactions".

Only change: the `description` field in the YAML frontmatter.

Skn0tt · 2026-05-19T08:08:39Z

Hi! The source for the skill lives in https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/tools/cli-client/skill/SKILL.md, could you reopen the PR against that file? The numbers look good, thanks for the research!

Copilot AI review requested due to automatic review settings May 18, 2026 21:23

Copilot AI reviewed May 18, 2026

pedropaulovc mentioned this pull request May 18, 2026

skill: description causes ~20% recall gap in LLM routing (baseline recall=0.800) #410

Open

pedropaulovc force-pushed the docs-skill-description-invoke-rate branch from 8c7665a to ce4ae00 Compare May 18, 2026 21:44

pedropaulovc changed the title ~~docs(skill): expand description to improve LLM routing precision~~ docs(skill): improve description to fix 20 pp recall gap in LLM routing May 18, 2026

pedropaulovc force-pushed the docs-skill-description-invoke-rate branch from ce4ae00 to cd5b7f3 Compare May 18, 2026 21:54

Skn0tt closed this May 19, 2026

pedropaulovc wants to merge 1 commit into microsoft:main from pedropaulovc:docs-skill-description-invoke-rate
