66 commits
36a875a
open router
bradleyshep Mar 23, 2026
3b88747
no guidelines variant, new workflows, results save updates
bradleyshep Mar 23, 2026
76016e7
new evals batch one
bradleyshep Mar 23, 2026
3bdecca
query evals
bradleyshep Mar 23, 2026
b3ce8f7
more evals + categories
bradleyshep Mar 23, 2026
52e28b9
fixes
bradleyshep Mar 23, 2026
617e052
fixes
bradleyshep Mar 23, 2026
6eb1168
fmt
bradleyshep Mar 23, 2026
b9a545f
llm benchmark site
bradleyshep Mar 23, 2026
afee2e0
Create ModelDetail.tsx
bradleyshep Mar 23, 2026
61d815e
site + details
bradleyshep Mar 23, 2026
e132ed8
benchmark site + run
bradleyshep Mar 24, 2026
56e693f
more evals + fixes
bradleyshep Mar 24, 2026
1216af6
fixes
bradleyshep Mar 25, 2026
ec966f9
refinements
bradleyshep Mar 26, 2026
00d6598
updates
bradleyshep Mar 26, 2026
850254e
updates; guidelines mode
bradleyshep Mar 27, 2026
4abe096
Create README.md
bradleyshep Mar 27, 2026
b432278
fixes
bradleyshep Mar 27, 2026
bed39d0
updates
bradleyshep Mar 27, 2026
9ffba0b
remove tools/site
bradleyshep Mar 27, 2026
bb26681
normalize model names
bradleyshep Mar 28, 2026
139408e
scoring fixes
bradleyshep Mar 30, 2026
6f740ed
fixes
bradleyshep Mar 30, 2026
dd35c66
results
bradleyshep Mar 30, 2026
db0e185
rust concurrency and details updates
bradleyshep Mar 31, 2026
25a246e
Update spacetimedb-typescript.mdc
bradleyshep Mar 31, 2026
741fcf4
update actions
bradleyshep Mar 31, 2026
603b5ee
Merge branch 'master' into bradley/llm-benchmarks-improvements
bradleyshep Mar 31, 2026
5fd1a0e
Update llm-benchmark-periodic.yml
bradleyshep Mar 31, 2026
b6677f9
updates
bradleyshep Mar 31, 2026
b9e43b8
Update spacetimedb-typescript.mdc
bradleyshep Mar 31, 2026
68ae3ef
refinements
bradleyshep Apr 1, 2026
e8b039a
updates
bradleyshep Apr 1, 2026
28662c6
fixes/cleanup
bradleyshep Apr 1, 2026
8f070f4
cleanup
bradleyshep Apr 1, 2026
a3e4421
cleanup
bradleyshep Apr 1, 2026
6feb97d
Update global.json
bradleyshep Apr 1, 2026
920217c
Delete llm-comparison-details.lock
bradleyshep Apr 1, 2026
e5a5546
fmt
bradleyshep Apr 1, 2026
aa3caf1
fmt
bradleyshep Apr 1, 2026
c9770da
lints
bradleyshep Apr 1, 2026
fc1d685
clippy
bradleyshep Apr 1, 2026
5cf84bb
Single source ai docs
bradleyshep Apr 8, 2026
ef78cac
separate csharp client / unity; cpp; rust refinements
bradleyshep Apr 9, 2026
d5afe95
Update init.rs
bradleyshep Apr 9, 2026
0f356c2
Update llms.md
bradleyshep Apr 9, 2026
b91f238
docusaurus md generation
bradleyshep Apr 9, 2026
be7ca88
Merge branch 'master' into bradley/llm-single-source-of-truth
bradleyshep Apr 9, 2026
e663777
Remove unused
bradleyshep Apr 9, 2026
8b96bca
Merge remote-tracking branch 'origin/bradley/llm-benchmarks-improveme…
bradleyshep Apr 9, 2026
9987996
fixes
bradleyshep Apr 9, 2026
9acb33c
skill updates
bradleyshep Apr 9, 2026
c9c9732
unreal skill
bradleyshep Apr 10, 2026
dc7a0f1
Merge branch 'master' into bradley/llm-benchmarks-improvements
bradleyshep Apr 10, 2026
acbb618
Update SKILL.md
bradleyshep Apr 10, 2026
0974acf
multi index preference
bradleyshep Apr 13, 2026
5ab0938
Merge branch 'master' into bradley/llm-benchmarks-improvements
bradleyshep Apr 13, 2026
dee42a6
Remove summary, file io, ci quickfix/check; add analysis; remove jsons
bradleyshep Apr 14, 2026
2223a75
Update client.rs
bradleyshep Apr 14, 2026
dfb6779
analysis command + permissions
bradleyshep Apr 14, 2026
514068b
updates
bradleyshep Apr 14, 2026
61192f5
Merge remote-tracking branch 'origin/bradley/llm-single-source-of-tru…
bradleyshep Apr 14, 2026
4eac30e
remove cursor rules mode + dead code cleanup
bradleyshep Apr 14, 2026
91f5861
add goldens to evals, save runs and anlysis locally if dry run
bradleyshep Apr 14, 2026
3689d9c
Update client.rs
bradleyshep Apr 14, 2026
33 changes: 0 additions & 33 deletions .github/workflows/ci.yml
@@ -590,39 +590,6 @@ jobs:
        run: |
          cargo ci cli-docs

  llm_ci_check:
    name: Verify LLM benchmark is up to date
    permissions:
      contents: read
    runs-on: ubuntu-latest
    # Disable the tests because they are causing us headaches with merge conflicts and re-runs etc.
    if: false
    steps:
      # Build the tool from master to ensure consistent hash computation
      # with the llm-benchmark-update workflow (which also uses master's tool).
      - name: Checkout master (build tool from trusted code)
        uses: actions/checkout@v4
        with:
          ref: master
          fetch-depth: 1

      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2

      - name: Install llm-benchmark tool from master
        run: |
          cargo install --path tools/xtask-llm-benchmark --locked
          command -v llm_benchmark

      # Now checkout the PR branch to verify its benchmark files
      - name: Checkout PR branch
        uses: actions/checkout@v4
        with:
          clean: false

      - name: Run hash check (both langs)
        run: llm_benchmark ci-check

  unity-testsuite:
    needs: [lints]
    # Skip if this is an external contribution.
67 changes: 67 additions & 0 deletions .github/workflows/docs-update-llms.yaml
@@ -0,0 +1,67 @@
name: Docs / Update llms files

permissions:
  contents: write

on:
  push:
    branches:
      - docs/release
    paths:
      - 'docs/docs/**'
      - 'skills/**'
  workflow_dispatch: # Allow manual trigger

jobs:
  update-llms:
    runs-on: spacetimedb-new-runner-2
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: docs/release

      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '22'

      - uses: pnpm/action-setup@v4
        with:
          run_install: true

      - name: Get pnpm store directory
        working-directory: sdks/typescript
        shell: bash
        run: |
          echo "STORE_PATH=$(pnpm store path --silent)" >> $GITHUB_ENV

      - uses: actions/cache@v4
        name: Setup pnpm cache
        with:
          path: ${{ env.STORE_PATH }}
          key: ${{ runner.os }}-pnpm-store-${{ hashFiles('**/pnpm-lock.yaml') }}
          restore-keys: |
            ${{ runner.os }}-pnpm-store-

      - name: Install dependencies
        working-directory: docs
        run: pnpm install

      - name: Docusaurus build
        working-directory: docs
        run: pnpm build

      - name: Generate llms files
        working-directory: docs
        run: node scripts/generate-llms.mjs

      - name: Commit updated llms files
        working-directory: docs
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add static/llms.md
          git diff --staged --quiet && echo "No changes" && exit 0
          git commit -m "Update llms files from docs build"
          git push
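
The commit step above hinges on the exit status of `git diff --staged --quiet`: it exits 0 when nothing is staged and 1 when the index contains changes, so the `&& echo "No changes" && exit 0` chain only fires in the no-change case. A minimal sketch of that check as a reusable function follows; `has_staged_changes` is a name introduced here for illustration and is not part of the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the workflow): succeeds when the repository
# at "$1" has staged changes, fails when the index is clean.
# `git diff --staged --quiet` exits 1 if differences exist and 0 otherwise,
# so the result is inverted here to read naturally in an `if`.
has_staged_changes() {
  if git -C "$1" diff --staged --quiet; then
    return 1
  fi
  return 0
}
```

Under this sketch, the commit and push commands would sit inside `if has_staged_changes .; then ... fi`, which expresses the same skip-when-unchanged behavior as the one-line chain.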
118 changes: 118 additions & 0 deletions .github/workflows/llm-benchmark-periodic.yml
@@ -0,0 +1,118 @@
name: Periodic LLM benchmarks

on:
  schedule:
    # Daily at midnight UTC. Change to '0 */6 * * *' for every 6h,
    # or '0 */4 * * *' for every 4h.
    - cron: '0 0 * * *'
  workflow_dispatch:
    inputs:
      models:
        description: 'Models to run (provider:model format, comma-separated, or "all")'
        required: false
        default: 'all'
      languages:
        description: 'Languages to benchmark (comma-separated: rust,csharp,typescript)'
        required: false
        default: 'rust,csharp,typescript'
      modes:
        description: 'Modes to run (comma-separated: guidelines,no_context,docs,...)'
        required: false
        default: 'guidelines,no_context'

permissions:
  contents: read

concurrency:
  group: llm-benchmark-periodic
  cancel-in-progress: true

jobs:
  run-benchmarks:
    runs-on: spacetimedb-new-runner
    container:
      image: localhost:5000/spacetimedb-ci:latest
      options: >-
        --privileged
    timeout-minutes: 180

    steps:
      - name: Install spacetime CLI
        run: |
          curl -sSf https://install.spacetimedb.com | sh -s -- -y
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Checkout master
        uses: actions/checkout@v4
        with:
          ref: master
          fetch-depth: 1

      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2

      - name: Setup .NET SDK
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: "8.0.x"

      - name: Install WASI workload
        env:
          DOTNET_MULTILEVEL_LOOKUP: "0"
          DOTNET_CLI_HOME: ${{ runner.temp }}/dotnet-home
          DOTNET_SKIP_FIRST_TIME_EXPERIENCE: "1"
        run: |
          dotnet workload install wasi-experimental --skip-manifest-update --disable-parallel

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 22

      - name: Install pnpm
        uses: pnpm/action-setup@v4

      - name: Build llm-benchmark tool
        run: cargo install --path tools/xtask-llm-benchmark --locked

      - name: Run benchmarks
        env:
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          LLM_BENCHMARK_API_KEY: ${{ secrets.LLM_BENCHMARK_API_KEY }}
          LLM_BENCHMARK_UPLOAD_URL: ${{ secrets.LLM_BENCHMARK_UPLOAD_URL }}
          MSBUILDDISABLENODEREUSE: "1"
          DOTNET_CLI_USE_MSBUILD_SERVER: "0"
          INPUT_LANGUAGES: ${{ inputs.languages || 'rust,csharp,typescript' }}
          INPUT_MODELS: ${{ inputs.models || 'all' }}
          INPUT_MODES: ${{ inputs.modes || 'guidelines,no_context' }}
        run: |
          LANGS="$INPUT_LANGUAGES"
          MODELS="$INPUT_MODELS"
          MODES="$INPUT_MODES"

          SUCCEEDED=0
          FAILED=0
          for LANG in $(echo "$LANGS" | tr ',' ' '); do
            if [ "$MODELS" = "all" ]; then
              if llm_benchmark run --lang "$LANG" --modes "$MODES"; then
                SUCCEEDED=$((SUCCEEDED + 1))
              else
                echo "::warning::Benchmark run failed for lang=$LANG"
                FAILED=$((FAILED + 1))
              fi
            else
              if llm_benchmark run --lang "$LANG" --modes "$MODES" --models "$MODELS"; then
                SUCCEEDED=$((SUCCEEDED + 1))
              else
                echo "::warning::Benchmark run failed for lang=$LANG models=$MODELS"
                FAILED=$((FAILED + 1))
              fi
            fi
          done
          echo "Benchmark runs: $SUCCEEDED succeeded, $FAILED failed"
          if [ "$SUCCEEDED" -eq 0 ] && [ "$FAILED" -gt 0 ]; then
            echo "::error::All benchmark runs failed"
            exit 1
          fi
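
The tallying logic in the `Run benchmarks` step can be exercised locally without API keys or the benchmark tool. The sketch below rests on stated assumptions: `run_one` is a hypothetical stub standing in for `llm_benchmark run`, and `tally_runs` and `FAIL_LANGS` are names introduced here for illustration. Only the comma splitting, the two counters, and the "fail only when every run failed" rule mirror the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical stub standing in for `llm_benchmark run --lang "$1" ...`.
# It "fails" for any language listed in FAIL_LANGS (space-separated).
run_one() {
  case " ${FAIL_LANGS:-} " in
    *" $1 "*) return 1 ;;
    *) return 0 ;;
  esac
}

# Mirrors the workflow loop: split the list on commas, count outcomes, and
# return non-zero only when no run succeeded and at least one failed.
tally_runs() {
  local langs="$1" succeeded=0 failed=0 lang
  for lang in $(echo "$langs" | tr ',' ' '); do
    if run_one "$lang"; then
      succeeded=$((succeeded + 1))
    else
      failed=$((failed + 1))
    fi
  done
  echo "Benchmark runs: $succeeded succeeded, $failed failed"
  if [ "$succeeded" -eq 0 ] && [ "$failed" -gt 0 ]; then
    return 1
  fi
  return 0
}
```

With `FAIL_LANGS="csharp"`, calling `tally_runs "rust,csharp,typescript"` prints `Benchmark runs: 2 succeeded, 1 failed` and returns 0, matching the workflow's choice to treat partial failures as warnings rather than job failures.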