| 1 | +# Remote Evaluation Infrastructure |
| 2 | + |
| 3 | +This directory contains the infrastructure for running Codebuff evaluations in containerized environments (Docker Compose) for CI/CD and local testing. |
| 4 | + |
| 5 | +## Quick Start |
| 6 | + |
| 7 | +### Option 1: Using Drizzle Seed (Recommended) |
| 8 | +```bash |
| 9 | +bash evals/scripts/run-remote.sh seed |
| 10 | +``` |
| 11 | + |
| 12 | +### Option 2: Using Test Auth Bypass (Faster) |
| 13 | +```bash |
| 14 | +bash evals/scripts/run-remote.sh bypass |
| 15 | +``` |
| 16 | + |
| 17 | +## Prerequisites |
| 18 | + |
| 19 | +- Docker and Docker Compose |
| 20 | +- Bun runtime |
| 21 | +- Optional: the codebuff CLI (`npm install -g codebuff`); set `CODEBUFF_SKIP_BINARY_CHECK=1` to run without it |
| 22 | + |
| 23 | +## Architecture |
| 24 | + |
| 25 | +- **evals/docker-compose.evals.yml**: Orchestrates PostgreSQL database and backend services |
| 26 | +- **evals/backend.Dockerfile**: Backend container definition |
| 27 | +- **evals/seeds/seed-evals.ts**: Drizzle-based database seeding for test users/sessions |
| 28 | +- **evals/scripts/run-remote.sh**: Main runner script with teardown |
| 29 | +- **evals/scripts/wait-for-healthz.sh**: Health check waiting utility |
| 30 | + |
| 31 | +## Key Features |
| 32 | + |
| 33 | +### SDK Enhancements |
| 34 | +- **Binary Check Skip**: Set `CODEBUFF_SKIP_BINARY_CHECK=1` to skip the codebuff CLI requirement |
| 35 | +- **WebSocket URL Override**: Set `CODEBUFF_WEBSOCKET_URL=ws://127.0.0.1:4242/ws` to target the ephemeral backend |
| 36 | + |
| 37 | +### Backend Enhancements |
| 38 | +- **Test Auth Bypass**: Set `CODEBUFF_TEST_AUTH_TOKEN` together with `NODE_ENV=test` to bypass normal authentication in tests |
| 39 | +- **WebSocket-Ready Health Check**: `/healthz` returns 503 until the WebSocket server is accepting connections |
| 40 | + |
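The readiness contract above (503 until the WebSocket server accepts connections) lends itself to a simple polling loop. The following is a minimal sketch of that idea, not the contents of `evals/scripts/wait-for-healthz.sh`; the function names `wait_for_ready` and `probe_http` and the one-second retry interval are assumptions made for illustration.

```shell
# Hypothetical sketch: poll a health endpoint until it succeeds or a timeout expires.
# wait_for_ready and probe_http are illustrative names, not part of the repo.

# curl -f exits non-zero on HTTP 4xx/5xx responses such as 503; -s keeps retries quiet.
probe_http() {
  curl -fs -o /dev/null "$1"
}

# Usage: wait_for_ready URL [TIMEOUT_SECONDS] [PROBE_COMMAND]
# The timeout is approximate: it counts one-second sleeps, not time spent probing.
wait_for_ready() {
  url="$1"
  timeout="${2:-90}"
  probe="${3:-probe_http}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$probe" "$url"; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "timed out after ${timeout}s waiting for ${url}" >&2
  return 1
}
```

With these definitions, `wait_for_ready http://127.0.0.1:4242/healthz 90` would block until the backend reports ready or roughly 90 seconds pass. Accepting the probe as a parameter is a design choice that lets the loop be exercised without a running backend.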
| 41 | +### Container Strategy |
| 42 | +- **Loopback Binding**: Backend bound to `127.0.0.1:4242` only (no public exposure) |
| 43 | +- **Optimized PostgreSQL**: Durability settings relaxed for CI speed (`fsync=off`, etc.) |
| 44 | +- **Build Context**: Uses repo root with Dockerfile in evals/ for clean separation |
| 45 | + |
| 46 | +## Environment Variables |
| 47 | + |
| 48 | +- `CODEBUFF_WEBSOCKET_URL`: Override WebSocket URL (e.g., `ws://127.0.0.1:4242/ws`) |
| 49 | +- `CODEBUFF_SKIP_BINARY_CHECK=1`: Skip SDK binary presence check |
| 50 | +- `CODEBUFF_TEST_AUTH_TOKEN`: Enable test-only auth bypass (only when `NODE_ENV=test`) |
| 51 | +- `CODEBUFF_API_KEY`: API key for SDK authentication (set by scripts) |
| 52 | + |
| 53 | +## GitHub Actions Integration |
| 54 | + |
| 55 | +### Automatic Trigger |
| 56 | +Add `[remote-eval]` to your commit message to trigger remote evaluations: |
| 57 | +```bash |
| 58 | +git commit -m "fix: terminal CWD handling [remote-eval]" |
| 59 | +``` |
| 60 | + |
| 61 | +### Manual Trigger |
| 62 | +Go to Actions → Remote Evaluations → Run workflow: |
| 63 | +- **Eval file**: `eval-codebuff.json` (default) |
| 64 | +- **Commit index**: `0` (default) |
| 65 | +- **Mode**: `bypass` or `seed` |
| 66 | + |
| 67 | +### Matrix Evaluations |
| 68 | +Add `[remote-eval-all]` to run multiple evaluations in parallel: |
| 69 | +```bash |
| 70 | +git commit -m "major: refactor terminal logic [remote-eval-all]" |
| 71 | +``` |
| 72 | + |
| 73 | +### Workflow Files |
| 74 | +- `.github/workflows/remote-evals.yml`: Main remote evaluation workflow |
| 75 | +  - Uses our containerized infrastructure with Docker Compose |
| 76 | +  - Uploads artifacts and logs automatically |
| 77 | +  - Handles cleanup and error reporting |
| 78 | + |
| 79 | +### Usage in CI |
| 80 | + |
| 81 | +```yaml |
| 82 | +# Single evaluation |
| 83 | +- name: Run remote eval (bypass mode) |
| 84 | + run: bash evals/scripts/run-remote-parameterized.sh bypass eval-codebuff.json 0 |
| 85 | + |
| 86 | +# With database seeding |
| 87 | +- name: Run remote eval (seed mode) |
| 88 | + run: bash evals/scripts/run-remote-parameterized.sh seed eval-manifold.json 1 |
| 89 | +``` |
| 90 | + |
| 91 | +## Manual Usage |
| 92 | + |
| 93 | +1. Start services: |
| 94 | + ```bash |
| 95 | + docker compose -f evals/docker-compose.evals.yml up -d --build db backend |
| 96 | + ``` |
| 97 | + |
| 98 | +2. Wait for readiness: |
| 99 | + ```bash |
| 100 | + bash evals/scripts/wait-for-healthz.sh http://127.0.0.1:4242/healthz 90 |
| 101 | + ``` |
| 102 | + |
| 103 | +3. Seed database and capture API key: |
| 104 | + ```bash |
| 105 | + KEY_LINE=$(docker compose -f evals/docker-compose.evals.yml run --rm seeder | tail -n1) |
| 106 | + export CODEBUFF_API_KEY="${KEY_LINE#CODEBUFF_API_KEY=}" |
| 107 | + ``` |
| 108 | + |
| 109 | +4. Run evaluation: |
| 110 | + ```bash |
| 111 | + export CODEBUFF_WEBSOCKET_URL=ws://127.0.0.1:4242/ws |
| 112 | + export CODEBUFF_SKIP_BINARY_CHECK=1 |
| 113 | + bun scripts/git-evals/run-single-eval.ts --prompt "Your test prompt" |
| 114 | + ``` |
| 115 | + |
| 116 | +5. Cleanup: |
| 117 | + ```bash |
| 118 | + docker compose -f evals/docker-compose.evals.yml down -v |
| 119 | + ``` |
| 120 | + |
| 121 | +## Troubleshooting |
| 122 | + |
| 123 | +- **Connection Issues**: Check that `CODEBUFF_WEBSOCKET_URL=ws://127.0.0.1:4242/ws` is set |
| 124 | +- **Auth Failures**: Verify that `CODEBUFF_API_KEY` was captured correctly from the seeder output |
| 125 | +- **Backend Not Ready**: Ensure `/healthz` returns 200 before proceeding |
| 126 | +- **Port Conflicts**: The backend binds to `127.0.0.1:4242`; ensure the port is free |
| 127 | + |
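For the auth failure case, it can help to validate the seeder line before exporting it. This is a minimal sketch, assuming the seeder's last output line has the form `CODEBUFF_API_KEY=<value>` as the Manual Usage steps imply; `parse_api_key` is a hypothetical helper name.

```shell
# Hypothetical sketch: extract the key from a "CODEBUFF_API_KEY=<value>" line.
# Prints the value and returns 0, or returns 1 if the line has another shape
# or the value is empty.
parse_api_key() {
  line="$1"
  case "$line" in
    CODEBUFF_API_KEY=?*)
      # Strip the shortest matching prefix, leaving only the value.
      printf '%s\n' "${line#CODEBUFF_API_KEY=}"
      ;;
    *)
      return 1
      ;;
  esac
}
```

Checking the return status before exporting, for example `CODEBUFF_API_KEY="$(parse_api_key "$KEY_LINE")" || echo "no key in seeder output" >&2`, surfaces a malformed seeder line immediately instead of as a later authentication error.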
| 128 | +## Implementation Details |
| 129 | + |
| 130 | +Based on the `remote-eval-infra-plan.md` specification: |
| 131 | +- Compatible with the monorepo and Bun |
| 132 | +- Docker-agnostic backend (Dockerfile lives in evals/) |
| 133 | +- Idempotent Drizzle seeding with deterministic IDs |
| 134 | +- WebSocket readiness validation in health checks |
| 135 | +- Test-only auth bypass for fast smoke tests |
| 136 | +- Comprehensive error logging and cleanup |