This project provides a local-first benchmarking system for comparing provider, model, harness, and harness-configuration performance through a shared LiteLLM proxy.
The system is built for interactive terminal agents and IDE agents that can be pointed at a custom inference base URL. The benchmark application does not own the harness runtime. It owns session registration, correlation, collection, normalization, storage, reporting, and dashboards.
Run this first:
```bash
make install-dev
```

Then start the local stack:

```bash
export LITELLM_MASTER_KEY="sk-litellm-master-$(openssl rand -hex 16)"
export FIREWORKS_API_KEY="your-fireworks-key"
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
docker compose up -d --force-recreate
uv run benchmark config init-db
```

If you already use `DATABASE_URL` for some other project, unset it first or set `BENCHMARK_DATABASE_URL` explicitly for this repo. The benchmark CLI falls back to `DATABASE_URL` when `BENCHMARK_DATABASE_URL` is not set.
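The fallback order can be sanity-checked from the shell. This is a minimal sketch of the assumed precedence only: `BENCHMARK_DATABASE_URL` first, then `DATABASE_URL`, then a hypothetical local SQLite default (the CLI's actual default may differ):

```shell
# Sketch of the assumed precedence; the sqlite:///benchmark.db fallback is illustrative.
unset BENCHMARK_DATABASE_URL DATABASE_URL
effective_db_url="${BENCHMARK_DATABASE_URL:-${DATABASE_URL:-sqlite:///benchmark.db}}"
echo "$effective_db_url"
# -> sqlite:///benchmark.db
```

With both variables unset, the shell's default-expansion picks the last fallback, which is why a stale `DATABASE_URL` from another project silently wins over it.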
Useful local URLs:
- LiteLLM health: http://localhost:4000/health/liveliness
- LiteLLM metrics: http://localhost:4000/metrics
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
Open Grafana first and inspect the Benchmark folder:
- Live Request Latency
- Live TTFT Metrics
- Live Error Rate
- Experiment Summary
Use Grafana for dashboards and Prometheus for raw metric/debug queries.
The completed system should make it easy to answer questions such as:
- Which provider and model combination is fastest for the same task card and harness?
- How does Claude Code compare with Codex, OpenCode, OpenHands, Gemini-oriented clients, or other agent harnesses when routed through the same local proxy?
- Does a harness configuration change improve TTFT, total latency, output throughput, error rate, or cache behavior?
- Does a provider-specific routing change improve session-level performance?
- How much variance exists between repeated sessions of the same benchmark variant?
- LiteLLM is the single shared proxy and routing layer.
- Every interactive benchmark session gets a benchmark-owned session ID.
- Session correlation is built around a session-scoped proxy credential plus benchmark tags.
- The project stores canonical benchmark records in a project-owned database.
- LiteLLM and Prometheus are telemetry sources, not the canonical query model.
- Prompt and response content are disabled by default.
- The benchmark application stays harness-agnostic in its core path.
- Define providers, harness profiles, variants, experiments, and task cards in versioned config files.
- Create a benchmark session for a chosen variant and task card.
- The session manager issues a session-scoped proxy credential and renders the exact environment snippet for the selected harness.
- Launch the harness manually and use it interactively against the local LiteLLM proxy.
- LiteLLM emits request data and Prometheus metrics while the benchmark app captures benchmark metadata.
- Collectors normalize request- and session-level data into the project database.
- Reports and dashboards compare sessions, variants, providers, models, and harnesses.
- README.md: quick start and operator workflow
- configs/litellm/README.md: proxy routes, model aliases, and harness env examples
- docs/operator-workflow.md: detailed session lifecycle guidance
- docs/launch-recipes.md: harness-specific launch instructions
- docs/architecture.md and docs/data-model-and-observability.md: deeper implementation details
- LiteLLM exposes raw metrics at http://localhost:4000/metrics
- Prometheus stores those metrics at http://localhost:9090
- Grafana visualizes live Prometheus data and historical PostgreSQL data at http://localhost:3000
Example Prometheus queries:
```promql
litellm_proxy_total_requests_metric_total
histogram_quantile(0.50, sum(rate(litellm_request_total_latency_metric_bucket[5m])) by (le))
histogram_quantile(0.50, sum(rate(litellm_llm_api_time_to_first_token_metric_bucket[5m])) by (le))
```
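The same queries can be run programmatically against Prometheus's HTTP API (`/api/v1/query`). A sketch using a canned response to show the instant-vector result shape, with the live call left as a comment:

```shell
# A live query would look like:
#   curl -s http://localhost:9090/api/v1/query \
#     --data-urlencode 'query=histogram_quantile(0.50, sum(rate(litellm_request_total_latency_metric_bucket[5m])) by (le))'
# Canned example of the /api/v1/query response shape (instant vector):
RESPONSE='{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1700000000,"0.42"]}]}}'
# Extract the sample value of the first series with jq
printf '%s' "$RESPONSE" | jq -r '.data.result[0].value[1]'
# -> 0.42
```

Each element of `result` carries a label set in `metric` and a `[timestamp, value]` pair in `value`, which is useful when scripting checks against latency thresholds.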
This section guides a new operator through a complete benchmark session from start to finish.
Before starting, ensure you have:
- Docker and Docker Compose installed
- `uv` package manager installed (for the benchmark application)
- API keys for your target providers (e.g., `FIREWORKS_API_KEY`, `OPENAI_API_KEY`)
- A terminal agent or CLI harness installed (e.g., Claude Code, OpenCode, Codex)
- A repository to benchmark against (can be any codebase)
```bash
# Install benchmark application dependencies
make install-dev

# Set required environment variables
export LITELLM_MASTER_KEY="sk-litellm-master-$(openssl rand -hex 16)"
export FIREWORKS_API_KEY="your-fireworks-key"
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"

# Start infrastructure services
docker compose up -d --force-recreate

# Verify services are healthy
docker compose ps
curl http://localhost:4000/health/liveliness
```

Expected output: all services show healthy status and LiteLLM returns "I'm alive!".
```bash
# Optional: avoid inheriting another project's DATABASE_URL
unset DATABASE_URL

# Create the local schema and import configs into the benchmark database
uv run benchmark config init-db
```

Expected output: schema initialized and config records synced into the database.
Choose an experiment, variant, task card, and harness from the available configs:
```bash
# List available experiments
uv run benchmark config list-experiments

# List available variants
uv run benchmark config list-variants

# List available task cards
uv run benchmark config list-task-cards

# List available harness profiles
uv run benchmark config list-harnesses
```

You can also validate cross-references between config files:

```bash
uv run benchmark config validate
```

An experiment is the comparison bucket, not a single run. To compare Claude Code and OpenCode on Fireworks Kimi K2.5, create one session for each variant in the experiment.
```bash
# Optional: avoid inheriting another project's DATABASE_URL
unset DATABASE_URL

# Create a Claude Code session for the Kimi K2.5 harness comparison
uv run benchmark session create \
  --experiment fireworks-kimi-k2-5-harness-comparison \
  --variant fireworks-kimi-k2-5-claude-code \
  --task-card repo-auth-analysis \
  --harness claude-code \
  --label "claude-code-run-1" \
  --notes "Initial benchmark run"

# Create an OpenCode session for the same comparison
uv run benchmark session create \
  --experiment fireworks-kimi-k2-5-harness-comparison \
  --variant fireworks-kimi-k2-5-opencode \
  --task-card repo-auth-analysis \
  --harness opencode \
  --label "opencode-run-1" \
  --notes "Initial benchmark run"
```

Expected output: each command creates a unique session_id (UUID). Git metadata (branch, commit, dirty state) is captured automatically.
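When scripting repeated runs, the session ID can be captured from the command's output. A sketch under an assumption: the UUID appears somewhere in what `session create` prints (the sample output line below is hypothetical, not the tool's guaranteed format):

```shell
# Hypothetical output line from `uv run benchmark session create`; real format may differ.
OUTPUT='Created session 9f1c2d3e-4b5a-4c7d-8e9f-0a1b2c3d4e5f for variant fireworks-kimi-k2-5-claude-code'
# Grab the first UUID-shaped token so later commands (env, finalize) can reuse it
SESSION_ID=$(printf '%s' "$OUTPUT" \
  | grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' \
  | head -n1)
echo "$SESSION_ID"
# -> 9f1c2d3e-4b5a-4c7d-8e9f-0a1b2c3d4e5f
```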
```bash
# Render the harness-specific snippet for your session
uv run benchmark session env <session-id>
```

Use the generated output according to the harness:

- Claude Code: evaluate the generated shell exports in your terminal
- OpenHands: evaluate the generated `LLM_*` shell exports in your terminal
- OpenCode: copy the generated JSON into `~/.config/opencode/opencode.json` or project `opencode.json`
- Codex: copy the generated TOML into `~/.codex/config.toml`, and export the referenced API key env var before launching Codex
Important: `sk-benchmark-<session-id>` is a placeholder for a generated LiteLLM virtual key, not your LLM provider API key.

- The benchmark session manager should generate it for the session and print it in `uv run benchmark session env <session-id>`.
- The session ID identifies the benchmark run in the benchmark database.
- The session virtual key is the proxy credential the harness uses when talking to LiteLLM.
- Per-session segmentation is done by issuing a different virtual key for each benchmark session, usually with session metadata attached.
If you are setting up a harness manually before the session-manager flow is complete, generate a proxy key yourself:
```bash
curl -s -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "metadata": {
      "session_id": "manual-dev-session",
      "harness": "claude-code"
    },
    "duration": "1h"
  }'
```

Use the returned key value as the harness-facing session credential. For env-based harnesses, that means `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `LLM_API_KEY`. For file-configured harnesses like OpenCode or Codex, insert the returned key into the generated config snippet.
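The `/key/generate` response is JSON; assuming the virtual key arrives in a top-level `key` field (as LiteLLM's key endpoints generally return), extraction and export can be done in one step. A sketch using a canned response in place of the live curl call:

```shell
# Canned /key/generate response; live usage would pipe the curl above into jq instead.
RESPONSE='{"key":"sk-litellm-v1-abc123","expires":"2099-01-01T00:00:00Z"}'
SESSION_VIRTUAL_KEY=$(printf '%s' "$RESPONSE" | jq -r '.key')
# Hand the virtual key to an env-based harness, e.g. Claude Code:
export ANTHROPIC_API_KEY="$SESSION_VIRTUAL_KEY"
echo "$ANTHROPIC_API_KEY"
# -> sk-litellm-v1-abc123
```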
With the environment set, launch your harness:
```bash
# For Claude Code
claude

# For OpenCode
opencode

# For Codex
codex
```

The harness now routes all traffic through the local LiteLLM proxy with session correlation.
The harness talks to LiteLLM using a protocol surface, and LiteLLM maps the requested model alias to a real upstream provider route.
Current built-in model aliases:
- Fireworks: `kimi-k2-5`, `kimi-k2-5-turbo`, `glm-5`, `glm-5-fast`
- OpenAI: `gpt-5.4`, `gpt-5.4-mini`
- Anthropic: `claude-opus-4-6`, `claude-sonnet-4-6`
You can see the current aliases with:
```bash
curl -s http://localhost:4000/models -H "Authorization: Bearer $LITELLM_MASTER_KEY"
```

Use those aliases as the model names your harness sends to the proxy.
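Assuming the proxy follows the OpenAI-style `/models` response shape (a `data` array of objects with `id` fields), the alias names can be listed one per line with jq. A sketch with a canned response:

```shell
# Canned OpenAI-style /models response; live usage would pipe the curl above into jq.
RESPONSE='{"data":[{"id":"kimi-k2-5"},{"id":"gpt-5.4"},{"id":"claude-sonnet-4-6"}]}'
printf '%s' "$RESPONSE" | jq -r '.data[].id'
# -> kimi-k2-5
#    gpt-5.4
#    claude-sonnet-4-6
```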
Examples:
- Claude Code uses Anthropic-style env vars and works well with the LiteLLM Anthropic endpoint.
- OpenHands uses `LLM_BASE_URL`, `LLM_API_KEY`, and `LLM_MODEL`.
- OpenCode is configured through `~/.config/opencode/opencode.json` or project `opencode.json`.
- Codex is configured through `~/.codex/config.toml`.
To add more models later, add a new LiteLLM alias in `configs/litellm/config.yaml` and keep the provider config in `configs/providers/` in sync.
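A new alias entry in `configs/litellm/config.yaml` might look like the following sketch of LiteLLM's `model_list` format; the alias name and upstream model string are illustrative only, so check the exact fields against the existing entries in that file:

```yaml
# Illustrative model_list entry (LiteLLM config format); names are examples, not real routes.
model_list:
  - model_name: my-new-alias            # alias the harness will request
    litellm_params:
      model: fireworks_ai/accounts/fireworks/models/my-new-model
      api_key: os.environ/FIREWORKS_API_KEY
```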
Follow the task card instructions. The benchmark system automatically captures:
- Request latencies and TTFT
- Token counts (input, output, cached)
- Error rates and status codes
- Cache hit behavior
When done, finalize the session:
```bash
uv run benchmark session finalize <session-id> --status completed
```

Expected output: session status updated, end time recorded.
```bash
# Open Grafana dashboards
open http://localhost:3000

# Export comparison reports
uv run benchmark export sessions --format csv --output sessions.csv
```

If you want to wipe local benchmark sessions and start over from a clean slate:
```bash
unset DATABASE_URL
rm -f benchmark.db
docker compose down -v
docker compose up -d --force-recreate
uv run benchmark config init-db
```

What this resets:

- `benchmark.db`: local benchmark sessions, requests, and imported config records
- Docker volumes: local Postgres, Prometheus, and Grafana persisted state
After this, `uv run benchmark session list` should be empty.
For a complete walkthrough of running a benchmark session, see configs/litellm/README.md. The quick version:
- Start the infrastructure stack (LiteLLM, PostgreSQL, Prometheus, Grafana)
- Run `uv run benchmark config init-db` to create the schema and import config records
- Use `uv run benchmark config list-experiments`, `list-variants`, and `list-task-cards` to pick a test case
- For onboarding, use the `fireworks-kimi-k2-5-harness-comparison` experiment to compare `claude-code` and `opencode`
- Create a session: `uv run benchmark session create --experiment <name> --variant <name> --task-card <name> --harness <name>`
- Copy the rendered environment snippet and launch your harness interactively
- Work on the task; the proxy captures all traffic with session correlation
- Finalize the session: `uv run benchmark session finalize --session-id <id> --status completed`
- View metrics in Grafana and export comparison reports
This section covers common setup failures and their solutions.
Symptom: `docker compose up -d` fails or services show unhealthy status.
Diagnosis:
```bash
# Check service status
docker compose ps

# Check service logs
docker compose logs litellm
docker compose logs postgres
docker compose logs prometheus
```

Common Causes and Solutions:
- **Port conflicts**: Another service is using port 4000, 5432, 9090, or 3000.

  ```bash
  # Find process using port
  lsof -i :4000
  # Kill the process or change port in docker-compose.yml
  ```

- **Missing environment variables**: LiteLLM master key or provider keys not set.

  ```bash
  # Verify environment variables
  echo $LITELLM_MASTER_KEY
  echo $FIREWORKS_API_KEY
  echo $OPENAI_API_KEY
  echo $ANTHROPIC_API_KEY
  ```

- **Docker not running**: Ensure Docker daemon is active.

  ```bash
  # Check Docker status
  docker info
  ```
Symptom: `curl http://localhost:4000/health/liveliness` returns an error or times out.
Diagnosis:
```bash
# Check if LiteLLM container is running
docker compose ps litellm

# Check LiteLLM logs
docker compose logs litellm --tail 100
```

Common Causes and Solutions:
- **Config syntax error**: LiteLLM config has invalid YAML.

  ```bash
  # Validate YAML syntax
  python -c "import yaml; yaml.safe_load(open('configs/litellm/config.yaml'))"
  ```

- **Missing provider keys**: Required API keys not set.

  ```bash
  # Verify provider keys
  echo $FIREWORKS_API_KEY
  echo $OPENAI_API_KEY
  ```

- **Database connection failure**: PostgreSQL not ready or connection string incorrect.

  ```bash
  # Check PostgreSQL status
  docker compose ps postgres
  # If using a custom benchmark database, inspect the benchmark DB URL
  echo $BENCHMARK_DATABASE_URL
  echo $DATABASE_URL
  ```
Symptom: `uv run benchmark session create` returns an error.
Diagnosis:
```bash
# Check if database is accessible
uv run benchmark health check

# Verify configs exist
ls configs/experiments/
ls configs/variants/
ls configs/task-cards/
```

Common Causes and Solutions:
- **Invalid experiment/variant/task-card name**: Name doesn't match a config file.

  ```bash
  # List available configs
  uv run benchmark config list-experiments
  uv run benchmark config list-variants
  uv run benchmark config list-task-cards
  ```

- **Database not initialized**: Benchmark database tables don't exist.

  ```bash
  # Create the schema and import config records
  unset DATABASE_URL
  uv run benchmark config init-db
  ```

- **Wrong database selected**: Another project's `DATABASE_URL` overrides the local benchmark DB.

  ```bash
  unset DATABASE_URL
  uv run benchmark config init-db
  ```

- **Not in a git repository**: Git metadata capture fails.

  ```bash
  # Check if in git repo
  git rev-parse --is-inside-work-tree
  # If not, the session will proceed with a warning (non-blocking)
  ```
Symptom: `uv run benchmark session env` shows incorrect environment variables.
Diagnosis:
```bash
# Check harness profile config
cat configs/harnesses/claude-code.yaml

# Verify variant config
cat configs/variants/fireworks-kimi-k2-5-claude-code.yaml
```

Common Causes and Solutions:
- **Wrong harness profile**: Session created with the wrong harness.

  ```bash
  # Check session details
  uv run benchmark session show <session-id>
  ```

- **Harness profile mismatch**: Variant specifies a different harness than the session.

  ```bash
  # Verify variant's harness_profile field
  grep harness_profile configs/variants/<variant-name>.yaml
  ```
Symptom: Harness sends requests directly to provider, not through LiteLLM.
Diagnosis:
```bash
# Check environment variables are set
env | grep ANTHROPIC
env | grep OPENAI

# Verify base URL points to proxy
echo $ANTHROPIC_BASE_URL  # Should be http://localhost:4000
echo $OPENAI_BASE_URL     # Should be http://localhost:4000
```

Common Causes and Solutions:
- **Environment not sourced**: Environment snippet not applied to the current shell.

  ```bash
  # Re-apply the environment snippet
  eval "$(uv run benchmark session env <session-id>)"
  # Or copy-paste the exports manually
  ```

- **Existing environment overrides**: Previous environment variables take precedence.

  ```bash
  # Unset old variables
  unset ANTHROPIC_BASE_URL ANTHROPIC_API_KEY
  unset OPENAI_BASE_URL OPENAI_API_KEY
  # Re-apply session environment
  uv run benchmark session env <session-id>
  ```

- **Harness config file override**: Harness has a hardcoded base URL in its config.

  ```bash
  # Check harness config files
  cat ~/.claude/config.json  # For Claude Code
  # Temporarily remove or rename the config to use environment variables
  ```
Symptom: Harness reports 401 Unauthorized or invalid API key.
Diagnosis:
```bash
# Test session virtual key
curl http://localhost:4000/key/info \
  -H "Authorization: Bearer $ANTHROPIC_API_KEY"

# Check if session exists
uv run benchmark session show <session-id>
```

Common Causes and Solutions:
- **Session not created**: Session ID doesn't exist.

  ```bash
  # List sessions to find the correct ID
  uv run benchmark session list
  ```

- **Virtual key expired**: Session key has a time or budget limit.

  ```bash
  # Create a new session or check key info
  curl http://localhost:4000/key/info \
    -H "Authorization: Bearer $SESSION_VIRTUAL_KEY"
  ```
Symptom: Grafana dashboards show "No data".
Diagnosis:
```bash
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Verify LiteLLM is emitting metrics
curl http://localhost:4000/metrics | grep litellm
```

Common Causes and Solutions:
- **Prometheus not scraping LiteLLM**: Scrape config missing or misconfigured.

  ```bash
  # Check Prometheus config
  cat configs/prometheus/prometheus.yml
  # Restart Prometheus
  docker compose restart prometheus
  ```

- **No traffic yet**: No requests have been sent through the proxy.

  ```bash
  # Send a test request
  curl http://localhost:4000/v1/chat/completions \
    -H "Authorization: Bearer $SESSION_VIRTUAL_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "kimi-k2-5", "messages": [{"role": "user", "content": "test"}]}'
  ```

- **Time range mismatch**: Grafana time picker is outside the session window. In Grafana, adjust the time range to "Last 1 hour" or the session time.
Symptom: Session queries return no data after the session is finalized.
Diagnosis:
```bash
# Check if session exists
uv run benchmark session show <session-id>

# Check request count
uv run benchmark session show <session-id> | grep request
```

Common Causes and Solutions:
- **Collection not run**: Normalization job not executed.

  ```bash
  # Run collection manually
  uv run benchmark normalize litellm --session-id <session-id>
  ```

- **LiteLLM logging disabled**: Request logs not being written.

  ```bash
  # Check LiteLLM config for logging settings
  grep -A5 "litellm_settings" configs/litellm/config.yaml
  ```
If issues persist after following troubleshooting steps:
- **Check logs**: Review all service logs for error messages.

  ```bash
  docker compose logs --tail 200
  ```

- **Verify versions**: Ensure you're using compatible versions.

  ```bash
  docker compose version
  uv --version
  python --version
  ```

- **Clean slate**: Reset the environment and start fresh.

  ```bash
  # Stop and remove containers, volumes, and networks
  docker compose down -v
  # Remove local database (if applicable)
  rm -f benchmark.db
  # Restart from Step 1
  docker compose up -d
  ```

- **File an issue**: Report bugs or documentation gaps at the project repository.