Skip to content

improving/promptfoo

Repository files navigation

Promptfoo EDD Demonstration

Evaluation Driven Development (EDD) demonstration using promptfoo with multiple CLI providers and evaluation methodologies.

Overview

This project demonstrates how to use promptfoo for systematic prompt testing with different AI CLI providers and evaluation types, following Evaluation Driven Development principles.

Files Structure

promptfoo/
├── README.md                    # This file
├── promptfooconfig.yaml         # Main promptfoo configuration
├── prompt-under-test.md         # The prompt being tested
├── eval-script.js               # Custom JavaScript evaluation functions
├── AGENTS.md                    # Project configuration notes
│
└── CLI Scripts (Cross-platform)
    ├── claude.js                # Claude Code CLI (provider + grader)
    ├── devin.js                 # Devin CLI (provider + grader)
    ├── gh_copilot.js            # GitHub Copilot CLI (provider + grader)
    └── gemini.js                # Gemini CLI (provider + grader)

Quick Start

Run Evaluations

npx promptfoo eval

View Results

Results are generated in results.html after each evaluation run.

Configuration

Active Provider

The configuration currently uses Devin CLI as the default provider for both generating responses and grading rubric evaluations.

Available Providers

All scripts are cross-platform Node.js files that work on Windows, Linux, and Mac:

  • Claude Code CLI (claude.js) - Anthropic's Claude Code assistant
  • Devin CLI (devin.js) - Cognition's Devin AI assistant
  • GitHub Copilot CLI (gh_copilot.js) - GitHub's Copilot assistant
  • Gemini CLI (gemini.js) - Google's Gemini assistant

To switch providers, edit promptfooconfig.yaml and uncomment the desired provider.

How the Scripts Work

Each CLI script combines both provider and grader functionality in one file, auto-detecting the mode based on prompt format:

  • Provider mode: When the prompt is plain text, the script acts as a provider and passes the text directly to the CLI tool
  • Grader mode: When the prompt is a JSON array [{role, content}, ...], the script acts as a grader, parsing the chat array and routing system/user messages appropriately

This eliminates the need for separate provider and grader scripts while maintaining full functionality.

Evaluation Types

This demonstration includes three types of evaluations:

1. Deterministic Evaluations (contains/equiv)

Simple pass/fail checks based on content presence or exact matches.

Example:

- vars:
    query: 'What is the difference between let and const in JavaScript?'
  assert:
    - type: contains
      value: 'let'
    - type: contains
      value: 'const'
    - type: contains
      value: 'reassign'

2. Rubric-Based Evaluations (llm-rubric)

Uses a judge LLM to score the response against natural language criteria. Each assertion evaluates a single, independent quality dimension, giving granular pass/fail per dimension in the web UI rather than one opaque score.

Grader Provider

The judge LLM is configured under defaultTest.options.provider. This project uses devin.js (which auto-detects grader mode), but the combined scripts for Claude, GitHub Copilot, and Gemini are also available.

How grader mode works: When the combined scripts receive a JSON chat array ([{"role":"system",...}, {"role":"user",...}]), they automatically enter grader mode, parse the array, and route each part correctly to the CLI tool.

Claude Code supports --system-prompt to replace its default context entirely, keeping system and user messages separate. Devin, GitHub Copilot, and Gemini don't have an equivalent flag, so their scripts concatenate the system instructions and user message into one combined prompt instead.

Assertion Parameters

Each llm-rubric assertion accepts these parameters:

Parameter Type Required Description
value string Yes The natural language criterion to evaluate. Write it as an observable statement: "The response includes a code example" not "the response is good".
threshold float 0.0–1.0 No Minimum score to pass this assertion. Defaults to 0.5 if omitted.
metric string No Label shown in the promptfoo web UI results table — useful for identifying which dimension failed.
weight number No Relative importance when computing the test's overall weighted score. weight: 2 counts twice as much as weight: 1. Defaults to 1.

Example

- vars:
    query: 'Explain the concept of recursion in programming'
  description: 'Recursion explanation - rubric with per-dimension scoring'
  assert:
    # Core definition — highest weight; must score at least 0.7
    - type: llm-rubric
      value: The response clearly defines what recursion is in plain language a non-expert could understand
      threshold: 0.7
      metric: clarity
      weight: 2

    # Code example — important for a technical explanation
    - type: llm-rubric
      value: The response includes at least one working code example that demonstrates recursion
      threshold: 0.7
      metric: code-example
      weight: 1.5

    # Base/recursive case — critical for correctness; strict threshold
    - type: llm-rubric
      value: The response explicitly explains both the base case (stopping condition) and the recursive case
      threshold: 0.8
      metric: base-case-explained
      weight: 2

    # Risks — nice to have; lower threshold and weight
    - type: llm-rubric
      value: The response mentions stack overflow or call stack depth as a potential pitfall
      threshold: 0.4
      metric: mentions-risks
      weight: 1

Design Guidance

  • Set threshold based on criticality. Core requirements warrant 0.70.8. Nice-to-have criteria can use 0.40.5.
  • Use weight to express relative importance. Don't just vary threshold — a critical dimension should also carry more weight so it influences the overall score proportionally.
  • One criterion per assertion. Splitting dimensions into separate llm-rubric entries gives individual pass/fail per dimension in the web UI, rather than one opaque rubric score.
  • Write criteria as observable statements. "The response includes at least one working code example" is better than "the response is well-explained".

3. Script-Based Evaluations

Custom deterministic evaluation logic written in JavaScript. Functions live in eval-script.js and are referenced directly from the config using file://path:functionName syntax.

Example:

- vars:
    query: 'Write a function to validate an email address'
  description: 'Email validation - script-based eval'
  assert:
    - type: javascript
      value: file://eval-script.js:hasCodeFormatting
    - type: javascript
      value: file://eval-script.js:explainsWhy
    - type: javascript
      value: file://eval-script.js:mentionsBestPractices

Functions in eval-script.js:

Function Checks
hasCodeFormatting(output) Response contains a fenced code block
explainsWhy(output) Response includes reasoning words (because, since, therefore…)
mentionsBestPractices(output) Response mentions best practices or common patterns
isConcise(output) Response is under 2000 characters

Each function returns { pass, score, reason } so promptfoo can display a meaningful failure message.

Adding your own functions:

module.exports = {
  yourCustomCheck: (output) => {
    const pass = /* your logic */;
    return { pass, score: pass ? 1 : 0, reason: 'Explanation shown on failure' };
  }
};

Running Scripts Directly

You can test the CLI scripts directly to verify functionality:

# Provider mode (plain text prompt)
node ./devin.js "What is 2+2?" '{"config": {"model": "SWE-1.6"}}' "context"

# Grader mode (JSON array prompt)
node ./devin.js '[{"role":"system","content":"You are an evaluator..."},{"role":"user","content":"Test message"}]' '{}'

Prompt Under Test

The prompt-under-test.md file contains the actual prompt being evaluated. Using an external file provides:

  • Version control for prompt changes
  • Easy editing without modifying configuration
  • Reusability across different test configurations
  • Clear separation between test logic and prompt content

EDD Workflow

  1. Define Success Criteria - Determine what makes a prompt "good" for your use case
  2. Create Evaluation Tests - Set up deterministic, rubric, and script-based tests
  3. Write Initial Prompt - Create your prompt in the external file
  4. Run Evaluations - Test against your criteria
  5. Iterate - Refine prompt based on evaluation results
  6. Regression Test - Ensure changes don't break existing successful tests

Test Cases

The current configuration includes 6 test cases:

  1. JavaScript concepts (deterministic) - Tests let vs const explanation
  2. Python string reversal (deterministic) - Tests Python coding knowledge
  3. Recursion explanation (rubric) - Evaluates completeness of recursion explanation
  4. Error handling best practices (rubric) - Assesses JavaScript error handling knowledge
  5. Email validation function (script-based) - Tests code generation quality
  6. Equality operators (script-based) - Tests explanation quality with custom assertions

Customization

For Your Own Use Case

  1. Replace prompt-under-test.md with your own prompt
  2. Modify test cases in promptfooconfig.yaml to match your domain
  3. Add custom evaluation functions to eval-script.js as needed
  4. Adjust rubric criteria to reflect your quality standards
  5. Switch to your preferred AI provider in the configuration

Adding New Evaluation Functions

Edit eval-script.js to add custom functions:

module.exports = {
  yourCustomFunction: (output) => {
    // Your evaluation logic
    return {
      pass: true/false,
      score: 0-1,
      reason: 'Explanation of result'
    };
  }
};

Requirements

  • Node.js (for promptfoo and CLI scripts)
  • CLI tools for the providers you want to use:
    • Claude Code CLI (claude)
    • Devin CLI (devin)
    • GitHub Copilot CLI (gh copilot)
    • Gemini CLI (gemini)

Troubleshooting

Provider CLI Installation

Ensure your chosen CLI tool is properly installed and accessible in your PATH.

Script Execution Issues

If you encounter issues running the Node.js scripts:

  • Verify Node.js is installed: node --version
  • Check script permissions: ls -la *.js (should be executable)
  • Test scripts directly using the examples in "Running Scripts Directly" section

Configuration Errors

If promptfoo reports configuration errors:

  • Verify YAML syntax in promptfooconfig.yaml
  • Check that the referenced script files exist
  • Ensure model names match what your CLI tool supports

License

This demonstration is provided as-is for educational and development purposes.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors