Evaluation Driven Development (EDD) demonstration using promptfoo with multiple CLI providers and evaluation methodologies.
This project demonstrates how to use promptfoo for systematic prompt testing with different AI CLI providers and evaluation types, following Evaluation Driven Development principles.
promptfoo/
├── README.md # This file
├── promptfooconfig.yaml # Main promptfoo configuration
├── prompt-under-test.md # The prompt being tested
├── eval-script.js # Custom JavaScript evaluation functions
├── AGENTS.md # Project configuration notes
│
└── CLI Scripts (Cross-platform)
    ├── claude.js # Claude Code CLI (provider + grader)
    ├── devin.js # Devin CLI (provider + grader)
    ├── gh_copilot.js # GitHub Copilot CLI (provider + grader)
    └── gemini.js # Gemini CLI (provider + grader)
npx promptfoo eval
Results are generated in results.html after each evaluation run.
The configuration currently uses Devin CLI as the default provider for both generating responses and grading rubric evaluations.
All scripts are cross-platform Node.js files that work on Windows, Linux, and Mac:
- Claude Code CLI (`claude.js`) - Anthropic's Claude Code assistant
- Devin CLI (`devin.js`) - Cognition's Devin AI assistant
- GitHub Copilot CLI (`gh_copilot.js`) - GitHub's Copilot assistant
- Gemini CLI (`gemini.js`) - Google's Gemini assistant
To switch providers, edit promptfooconfig.yaml and uncomment the desired provider.
Each CLI script combines both provider and grader functionality in one file, auto-detecting the mode based on prompt format:
- Provider mode: When the prompt is plain text, the script acts as a provider and passes the text directly to the CLI tool
- Grader mode: When the prompt is a JSON array `[{role, content}, ...]`, the script acts as a grader, parsing the chat array and routing system/user messages appropriately
This eliminates the need for separate provider and grader scripts while maintaining full functionality.
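The auto-detection described above can be sketched as follows. This is a minimal illustration, not the code in the actual CLI scripts: the helper name `detectMode` and its return shape are assumptions made for this example.

```javascript
// Hypothetical helper: decides between provider and grader mode
// by checking whether the prompt parses as a JSON chat array.
function detectMode(prompt) {
  let parsed;
  try {
    parsed = JSON.parse(prompt);
  } catch {
    // Not valid JSON, so treat it as a plain-text prompt
    return { mode: 'provider', prompt };
  }

  const isChatArray =
    Array.isArray(parsed) &&
    parsed.length > 0 &&
    parsed.every(
      (m) =>
        m !== null &&
        typeof m === 'object' &&
        typeof m.role === 'string' &&
        typeof m.content === 'string'
    );

  // Valid JSON that is not a chat array (for example "42") is still plain text
  return isChatArray
    ? { mode: 'grader', messages: parsed }
    : { mode: 'provider', prompt };
}

console.log(detectMode('What is 2+2?').mode); // provider
console.log(
  detectMode(
    '[{"role":"system","content":"You are an evaluator"},{"role":"user","content":"Test"}]'
  ).mode
); // grader
```

Checking the shape of every element, rather than only the first character of the prompt, avoids misrouting a plain-text prompt that happens to start with `[`.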
This demonstration includes three types of evaluations:
Deterministic evaluations are simple pass/fail checks based on content presence or exact matches.
Example:
- vars:
    query: 'What is the difference between let and const in JavaScript?'
  assert:
    - type: contains
      value: 'let'
    - type: contains
      value: 'const'
    - type: contains
      value: 'reassign'
Rubric evaluations (`llm-rubric`) use a judge LLM to score the response against natural-language criteria. Each assertion evaluates a single, independent quality dimension, giving granular pass/fail per dimension in the web UI rather than one opaque score.
The judge LLM is configured under defaultTest.options.provider. This project uses devin.js (which auto-detects grader mode), but the combined scripts for Claude, GitHub Copilot, and Gemini are also available.
How grader mode works: When the combined scripts receive a JSON chat array (`[{"role":"system",...}, {"role":"user",...}]`), they automatically enter grader mode, parse the array, and route each part correctly to the CLI tool.
Claude Code supports `--system-prompt` to replace its default context entirely, keeping system and user messages separate. Devin, GitHub Copilot, and Gemini don't have an equivalent flag, so their scripts concatenate the system instructions and user message into one combined prompt instead.
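The two routing strategies can be sketched like this. The helper `routeMessages` and the `supportsSystemPrompt` option are hypothetical names chosen for illustration; the real scripts may structure this differently, and how the resulting strings are handed to each CLI is deliberately left out.

```javascript
// Hypothetical helper: splits a chat array into what would be sent
// as the system prompt and what would be sent as the main prompt.
function routeMessages(messages, { supportsSystemPrompt }) {
  const joinByRole = (role) =>
    messages
      .filter((m) => m.role === role)
      .map((m) => m.content)
      .join('\n\n');

  const system = joinByRole('system');
  const user = joinByRole('user');

  if (supportsSystemPrompt) {
    // e.g. Claude Code: system text would go to --system-prompt
    return { systemPrompt: system, prompt: user };
  }

  // No equivalent flag: concatenate into one combined prompt
  const combined = [system, user].filter(Boolean).join('\n\n');
  return { systemPrompt: null, prompt: combined };
}

const chat = [
  { role: 'system', content: 'You are an evaluator.' },
  { role: 'user', content: 'Test message' },
];

console.log(routeMessages(chat, { supportsSystemPrompt: true }));
console.log(routeMessages(chat, { supportsSystemPrompt: false }));
```

Keeping the system text separate where the CLI allows it means the grading instructions are not mixed with the tool's default context, which is the benefit the paragraph above attributes to `--system-prompt`.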
Each llm-rubric assertion accepts these parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `value` | string | Yes | The natural language criterion to evaluate. Write it as an observable statement: "The response includes a code example" not "the response is good". |
| `threshold` | float 0.0–1.0 | No | Minimum score to pass this assertion. Defaults to 0.5 if omitted. |
| `metric` | string | No | Label shown in the promptfoo web UI results table; useful for identifying which dimension failed. |
| `weight` | number | No | Relative importance when computing the test's overall weighted score. `weight: 2` counts twice as much as `weight: 1`. Defaults to 1. |
- vars:
    query: 'Explain the concept of recursion in programming'
  description: 'Recursion explanation - rubric with per-dimension scoring'
  assert:
    # Core definition: highest weight; must score at least 0.7
    - type: llm-rubric
      value: The response clearly defines what recursion is in plain language a non-expert could understand
      threshold: 0.7
      metric: clarity
      weight: 2
    # Code example: important for a technical explanation
    - type: llm-rubric
      value: The response includes at least one working code example that demonstrates recursion
      threshold: 0.7
      metric: code-example
      weight: 1.5
    # Base/recursive case: critical for correctness; strict threshold
    - type: llm-rubric
      value: The response explicitly explains both the base case (stopping condition) and the recursive case
      threshold: 0.8
      metric: base-case-explained
      weight: 2
    # Risks: nice to have; lower threshold and weight
    - type: llm-rubric
      value: The response mentions stack overflow or call stack depth as a potential pitfall
      threshold: 0.4
      metric: mentions-risks
      weight: 1
- Set `threshold` based on criticality. Core requirements warrant `0.7`–`0.8`. Nice-to-have criteria can use `0.4`–`0.5`.
- Use `weight` to express relative importance. Don't just vary the threshold: a critical dimension should also carry more weight so it influences the overall score proportionally.
- One criterion per assertion. Splitting dimensions into separate `llm-rubric` entries gives individual pass/fail per dimension in the web UI, rather than one opaque rubric score.
- Write criteria as observable statements. "The response includes at least one working code example" is better than "the response is well-explained".
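How `threshold` and `weight` interact in the recursion example can be sketched with a little arithmetic. The sketch assumes that the overall test score is a weighted average of the per-assertion scores and that an assertion passes when its score meets its threshold; both assumptions should be confirmed against the promptfoo documentation. The function name `scoreTest` and the judge scores below are hypothetical.

```javascript
// Hypothetical aggregation: weighted average plus per-assertion threshold check.
function scoreTest(results) {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  const weightedSum = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return {
    // Guard against an empty list or all-zero weights
    score: totalWeight > 0 ? weightedSum / totalWeight : 0,
    // Metrics whose score fell below their own threshold
    failed: results.filter((r) => r.score < r.threshold).map((r) => r.metric),
  };
}

// Thresholds and weights come from the recursion example; scores are made up.
const sampleResults = [
  { metric: 'clarity', score: 0.9, threshold: 0.7, weight: 2 },
  { metric: 'code-example', score: 0.8, threshold: 0.7, weight: 1.5 },
  { metric: 'base-case-explained', score: 0.6, threshold: 0.8, weight: 2 },
  { metric: 'mentions-risks', score: 0.5, threshold: 0.4, weight: 1 },
];

const summary = scoreTest(sampleResults);
console.log(summary.score.toFixed(2), summary.failed);
```

With these made-up scores the overall score is about 0.72, and only `base-case-explained` fails because 0.6 is below its 0.8 threshold. A decent aggregate can therefore hide a failure on a critical dimension, which is why the per-dimension view matters.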
Script-based evaluations use custom deterministic evaluation logic written in JavaScript. Functions live in eval-script.js and are referenced directly from the config using the `file://path:functionName` syntax.
Example:
- vars:
    query: 'Write a function to validate an email address'
  description: 'Email validation - script-based eval'
  assert:
    - type: javascript
      value: file://eval-script.js:hasCodeFormatting
    - type: javascript
      value: file://eval-script.js:explainsWhy
    - type: javascript
      value: file://eval-script.js:mentionsBestPractices
Functions in eval-script.js:
| Function | Checks |
|---|---|
| `hasCodeFormatting(output)` | Response contains a fenced code block |
| `explainsWhy(output)` | Response includes reasoning words (because, since, therefore…) |
| `mentionsBestPractices(output)` | Response mentions best practices or common patterns |
| `isConcise(output)` | Response is under 2000 characters |
Each function returns { pass, score, reason } so promptfoo can display a meaningful failure message.
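As an illustration of that return shape, two of the checks in the table could look roughly like the following. This is a sketch written from the table's descriptions, not the actual contents of eval-script.js; the regular expression and the reason strings are assumptions.

```javascript
// Sketch of two checks, based only on the descriptions in the table above.

// Passes when the output contains a pair of triple-backtick fences.
const hasCodeFormatting = (output) => {
  const pass = /`{3}[\s\S]*?`{3}/.test(output);
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? 'Found a fenced code block' : 'No fenced code block found',
  };
};

// Passes when the output is shorter than 2000 characters.
const isConcise = (output) => {
  const pass = output.length < 2000;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass
      ? `Response is ${output.length} characters`
      : `Response is ${output.length} characters; expected fewer than 2000`,
  };
};

module.exports = { hasCodeFormatting, isConcise };
```

Putting the measured length in `reason` makes a failure actionable without reopening the raw output.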
Adding your own functions:
module.exports = {
  yourCustomCheck: (output) => {
    // Placeholder logic: replace with your own check
    const pass = output.includes('expected text');
    return { pass, score: pass ? 1 : 0, reason: 'Explanation shown on failure' };
  }
};
You can test the CLI scripts directly to verify functionality:
# Provider mode (plain text prompt)
node ./devin.js "What is 2+2?" '{"config": {"model": "SWE-1.6"}}' "context"
# Grader mode (JSON array prompt)
node ./devin.js '[{"role":"system","content":"You are an evaluator..."},{"role":"user","content":"Test message"}]' '{}'
The prompt-under-test.md file contains the actual prompt being evaluated. Using an external file provides:
- Version control for prompt changes
- Easy editing without modifying configuration
- Reusability across different test configurations
- Clear separation between test logic and prompt content
- Define Success Criteria - Determine what makes a prompt "good" for your use case
- Create Evaluation Tests - Set up deterministic, rubric, and script-based tests
- Write Initial Prompt - Create your prompt in the external file
- Run Evaluations - Test against your criteria
- Iterate - Refine prompt based on evaluation results
- Regression Test - Ensure changes don't break existing successful tests
The current configuration includes 6 test cases:
- JavaScript concepts (deterministic) - Tests let vs const explanation
- Python string reversal (deterministic) - Tests Python coding knowledge
- Recursion explanation (rubric) - Evaluates completeness of recursion explanation
- Error handling best practices (rubric) - Assesses JavaScript error handling knowledge
- Email validation function (script-based) - Tests code generation quality
- Equality operators (script-based) - Tests explanation quality with custom assertions
- Replace `prompt-under-test.md` with your own prompt
- Modify test cases in `promptfooconfig.yaml` to match your domain
- Add custom evaluation functions to `eval-script.js` as needed
- Adjust rubric criteria to reflect your quality standards
- Switch to your preferred AI provider in the configuration
Edit eval-script.js to add custom functions:
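module.exports = {
  yourCustomFunction: (output) => {
    // Your evaluation logic; this placeholder passes any non-empty output
    const pass = output.trim().length > 0;
    return {
      pass,                 // boolean
      score: pass ? 1 : 0,  // number between 0 and 1
      reason: 'Explanation of result'
    };
  }
};
- Node.js (for promptfoo and CLI scripts)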
- CLI tools for the providers you want to use:
  - Claude Code CLI (`claude`)
  - Devin CLI (`devin`)
  - GitHub Copilot CLI (`gh copilot`)
  - Gemini CLI (`gemini`)
Ensure your chosen CLI tool is properly installed and accessible in your PATH.
If you encounter issues running the Node.js scripts:
- Verify Node.js is installed: `node --version`
- Check script permissions: `ls -la *.js` (should be executable)
- Test scripts directly using the examples in the "Running Scripts Directly" section
If promptfoo reports configuration errors:
- Verify YAML syntax in `promptfooconfig.yaml`
- Check that the referenced script files exist
- Ensure model names match what your CLI tool supports
This demonstration is provided as-is for educational and development purposes.