exp: v1 runtime benchmark matrix #1630

Draft
mikasenghaas wants to merge 9 commits into feat/nano-as-v1 from exp/v1-bench-runtime-matrix

mikasenghaas commented Jun 11, 2026

Summary

Reshapes the v1 benchmark to compare runtimes instead of env-server worker modes, writes one results file per matrix cell, adds a plot script, and commits a single-turn results set (subprocess / prime / modal).

  • Swept axis is the runtime: benchmark.sh (gsm8k-v1), agentic_benchmark.sh (terminal-bench-2-v1).
  • Rollouts {64, 512, 1024} (single-turn); concurrency unbounded by default (MAX_CONCURRENT knob to cap it); agentic turns capped at 32; gsm8k generation unbounded.
  • Per-cell results: bench/results/<suite>/<runtime>-r<rollouts>.json. bench_aggregate.py writes one file per cell (records max_concurrent, reward/errors, and the sorted per-rollout duration lists for setup / generation / scoring / total); the scripts write into the committed results dir without wiping it, so re-running a subset (e.g. just modal) refreshes only those cells. bench/plot.py renders the committed plot below.
  • Base-compat fixes: --retry.attempts → --retries.*.max_retries 0 (the retry config was restructured by "feat: per-call model + runtime retries" #1632), and a guard on the ~/.env source so a detached run doesn't die re-reading the 1Password FIFO.
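
For reviewers who want the per-cell convention spelled out, here is a minimal sketch of parsing a `<runtime>-r<rollouts>` label and writing one JSON file per cell without clearing the directory. The helper names (`parse_label`, `write_cell`) and the exact JSON shape are illustrative assumptions, not the actual `bench_aggregate.py` code.

```python
import json
import re
import tempfile
from pathlib import Path

# Matches labels such as "modal-r512": runtime name, then rollout count.
LABEL_RE = re.compile(r"^(?P<runtime>[a-z0-9_]+)-r(?P<rollouts>\d+)$")


def parse_label(label: str) -> tuple[str, int]:
    """Split a '<runtime>-r<rollouts>' label into (runtime, rollouts)."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not a '<runtime>-r<rollouts>' label: {label!r}")
    return match["runtime"], int(match["rollouts"])


def write_cell(results_dir: Path, suite: str, label: str, cell: dict) -> Path:
    """Write one cell to <results_dir>/<suite>/<label>.json, leaving other cells alone."""
    runtime, rollouts = parse_label(label)
    out_dir = results_dir / suite
    out_dir.mkdir(parents=True, exist_ok=True)  # creates if missing, never wipes
    path = out_dir / f"{label}.json"
    payload = {"runtime": runtime, "rollouts": rollouts, **cell}
    path.write_text(json.dumps(payload, indent=2))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Toy values, not taken from the committed results.
        write_cell(root, "single_turn", "prime-r64", {"max_concurrent": None, "setup_durations": [1.0, 2.0]})
        write_cell(root, "single_turn", "modal-r64", {"max_concurrent": None, "setup_durations": [1.0, 3.0]})
        # Re-running only modal overwrites modal-r64.json and keeps prime-r64.json.
        write_cell(root, "single_turn", "modal-r64", {"max_concurrent": None, "setup_durations": [1.5, 2.5]})
        print(sorted(p.name for p in (root / "single_turn").iterdir()))
```

Because each cell owns its own file, a partial re-run can only overwrite the cells it actually produced.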

Results (bench/results/single_turn/)

gsm8k-v1 on deepseek-v4-flash, unbounded concurrency. Each stage column is p50 (p10–p90) seconds per rollout; e2e is the cell wall clock. (subprocess runs rollouts on the eval host; prime/modal offload to remote sandboxes.)

[Figure: runtime benchmark plot]

| cell | conc | e2e | err | rew | setup | generation | scoring | total |
|---|---|---|---|---|---|---|---|---|
| subprocess r64 | | 12s | 0 | 1.0 | 0.0 (0.0–0.0) | 6.3 (3.9–7.6) | 0.3 (0.2–0.3) | 6.5 (4.2–7.8) |
| subprocess r512 | | 33s | 0 | 0.994 | 0.0 (0.0–0.0) | 18.6 (11.8–21.6) | 5.3 (4.3–6.9) | 24.6 (17.1–26.9) |
| subprocess r1024 | | 56s | 0 | 0.997 | 0.0 (0.0–0.0) | 30.3 (15.9–45.5) | 14.1 (4.7–16.7) | 45.0 (31.7–50.3) |
| prime r64 | | 32s | 0 | 1.0 | 10.9 (8.7–11.8) | 9.6 (9.6–12.6) | 3.5 (3.5–3.5) | 24.6 (21.8–27.2) |
| prime r512 | | 49s | 0 | 0.996 | 14.2 (10.2–17.2) | 12.8 (12.6–16.0) | 3.5 (3.5–3.7) | 31.9 (26.7–35.8) |
| prime r1024 | | 84s | 0 | 0.999 | 20.4 (12.6–40.6) | 16.0 (12.8–19.6) | 6.8 (3.5–9.9) | 45.7 (34.9–57.4) |
| modal r64 | | 41s | 0 | 1.0 | 8.0 (2.5–12.5) | 10.3 (8.7–12.7) | 3.5 (2.7–4.4) | 22.0 (15.6–28.0) |
| modal r512 | | 173s | 0 | 1.0 | 51.1 (11.8–91.4) | 10.7 (8.5–13.2) | 3.4 (2.6–4.5) | 64.0 (26.1–105.8) |
| modal r1024 | | 283s | 0 | 0.998 | 101.7 (21.7–183.7) | 10.0 (8.2–13.3) | 3.5 (2.6–4.5) | 115.7 (35.8–197.7) |
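
A stage column of the form `p50 (p10–p90)` can be derived from a sorted per-rollout duration list roughly as sketched below. The linear-interpolation percentile and the function names are assumptions for illustration; the aggregation script may use a different percentile method.

```python
import math


def percentile(sorted_values: list[float], q: float) -> float:
    """Linear-interpolated percentile of an already-sorted list, q in [0, 1]."""
    if not sorted_values:
        raise ValueError("need at least one duration")
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def stage_column(sorted_values: list[float]) -> str:
    """Format a stage as 'p50 (p10–p90)' in seconds, one decimal place."""
    p10, p50, p90 = (percentile(sorted_values, q) for q in (0.1, 0.5, 0.9))
    return f"{p50:.1f} ({p10:.1f}–{p90:.1f})"


if __name__ == "__main__":
    # Toy durations, not taken from the committed results.
    durations = sorted([4.0, 6.0, 5.0, 8.0, 7.0])
    print(stage_column(durations))
```

Storing the sorted lists rather than precomputed percentiles keeps the per-cell files reusable if the reported quantiles change later.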

Results (bench/results/agentic/)

terminal-bench-2-v1 fix-git on deepseek-v4-flash, 32-turn agent. docker capped at 128 (local containers), prime/modal unbounded. subprocess is excluded — agentic needs a container. Stage columns are p50 (p10–p90) seconds per rollout.

[Figure: agentic benchmark plot]

| cell | conc | e2e | err | rew | setup | generation | scoring | total |
|---|---|---|---|---|---|---|---|---|
| docker r64 | 128 | 165s | 0 | 0.984 | 0.5 (0.4–0.6) | 68.3 (62.0–85.1) | 4.5 (4.2–5.4) | 74.0 (66.9–89.8) |
| docker r512 | 128 | 485s | 0 | 0.992 | 0.9 (0.3–2.0) | 90.0 (58.6–111.2) | 7.5 (3.5–18.8) | 99.0 (65.5–131.6) |
| prime r64 | | 233s | 0 | 0.984 | 10.4 (8.4–12.3) | 60.6 (50.5–75.1) | 14.7 (8.1–16.3) | 87.4 (69.7–94.3) |
| prime r512 | | 289s | 0 | 0.996 | 14.1 (10.2–17.3) | 105.3 (81.0–117.9) | 6.8 (6.7–14.5) | 126.0 (108.6–140.0) |
| modal r64 | | 136s | 0 | 1.0 | 8.4 (2.8–13.1) | 64.0 (50.9–84.5) | 6.6 (4.8–7.9) | 80.3 (62.7–100.4) |
| modal r512 | | 234s | 0 | 0.977 | 51.3 (11.3–92.8) | 60.6 (52.0–73.4) | 6.4 (4.6–8.3) | 121.9 (79.4–161.2) |

Depends on

mikasenghaas force-pushed the exp/v1-bench-runtime-matrix branch 3 times, most recently from 7928584 to 577ae0c, on June 11, 2026 21:22
Comment thread bench/bench_aggregate.py Outdated
mikasenghaas and others added 5 commits June 11, 2026 23:25
Swap the env-server WORKERS axis for a RUNTIME axis: benchmark.sh (gsm8k-v1)
compares subprocess/docker/prime; agentic_benchmark.sh compares docker/prime.
Rollouts {64, 512}, concurrency unbounded (--max_concurrent None), agentic turns
capped at 32, gsm8k generation unbounded (MAX_TOKENS optional). bench_aggregate.py
parses <runtime>-r<rollouts> labels and surfaces per-rollout setup_durations next
to gen_durations (setup span landed in #1631).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt base

Results (bench/results/single_turn.json) from gsm8k-v1 on deepseek-v4-flash,
subprocess + prime x {64,512}, unbounded concurrency. docker was excluded on this
box: per-container cold-install fills the disk at 512 (~100GB needed vs ~50GB free).
It stays in the script's default matrix. Highlights: prime sandbox provisioning
~10-15s/rollout (grows with concurrency) vs ~0 for subprocess; 0 errors, reward ~1.0.

- bench/plot.py: stacked setup-vs-generation p50 per cell + gen p90 whisker (PNG
  is gitignored; regenerate from the JSON).
- benchmark.sh / agentic_benchmark.sh: --retry.attempts -> --retries.*.max_retries 0
  (retry config was restructured by #1632), and guard the ~/.env source so a
  detached run doesn't die re-reading the 1Password FIFO.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…RENT knob

One JSON per matrix cell under bench/results/single_turn/ so a subset re-run
refreshes only those cells: bench_aggregate.py writes per-cell (recording
max_concurrent); benchmark.sh / agentic_benchmark.sh write into the committed
results dir without wiping it; plot.py reads the whole dir. Adds a MAX_CONCURRENT
knob — docker ran capped at 128 here to bound its per-container cold-install disk
use (subprocess/prime unbounded; recorded per cell). Includes the docker results
(its cold-install lands in generation, not setup). Drops the combined
single_turn.json.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
modal r64/r512 (remote sandboxes, unbounded). modal-r64 is clean (0 errors,
setup_p50 ~3.5s, like prime); modal-r512 hits Modal's sandbox provisioning rate
limit at 512 concurrent — 221/512 errors (reward 0.57), a real scaling signal.
Sandboxes terminate per-rollout + via the lifetime backstop (0 left after the run).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lel limit)

Re-run on the merged host-global creation limiter (5/s) with the account's
raised 1000-container parallel ceiling: modal-r512 now 0 errors / reward 1.0
(was 221 errors / 0.568), e2e 173s (was 257s). Setup p90 ~92s is the 5/s
creation ramp for 512 sandboxes, not failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
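
The ramp explanation can be checked with back-of-the-envelope arithmetic: if sandboxes are admitted in order at 5 per second, the rollout at quantile q of a 512-rollout queue waits about q × 511 / 5 seconds. A minimal sketch of that model, assuming strict in-order admission at the limiter rate and ignoring per-sandbox provisioning time (both simplifying assumptions, not a description of the limiter itself):

```python
def ramp_percentile(n_sandboxes: int, rate_per_s: float, q: float) -> float:
    """Queueing delay at quantile q when sandbox i (0-based) starts at i / rate."""
    return q * (n_sandboxes - 1) / rate_per_s


if __name__ == "__main__":
    for n in (512, 1024):
        p10, p50, p90 = (ramp_percentile(n, 5.0, q) for q in (0.1, 0.5, 0.9))
        print(f"n={n}: p50 {p50:.1f}s (p10 {p10:.1f}s, p90 {p90:.1f}s)")
```

For 512 sandboxes this gives a p50 of about 51s and a p90 of about 92s, close to the 51.1 (11.8–91.4) setup column reported for modal r512, which supports reading the setup time as queueing rather than failures.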
mikasenghaas force-pushed the exp/v1-bench-runtime-matrix branch from 577ae0c to 597fd44 on June 11, 2026 23:30
Aggregate now records per-rollout scoring and total (whole-rollout) durations
alongside setup + generation. plot.py becomes a 2x2 of the four stages, each a
per-runtime p50 bar with a p10-p90 whisker grouped by rollout count (committed
PNG). Refreshed subprocess / prime / modal cells carry the new metrics; docker
not re-run this round, so it appears only in the setup/generation panels.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread bench/plot.py Outdated
mikasenghaas and others added 3 commits June 12, 2026 00:03
… modal

All clean (0 errors). prime scales flattest (total p50 46s); modal is gated by
5/s sandbox creation (setup p50 102s, the 1024/5 ramp) but gen/scoring stay flat
since compute is offloaded; subprocess saturates the host at 1024 concurrent
rollouts (gen p50 30s, scoring p50 14s). plot.py now handles >2 rollout groups.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le-turn

plot.py is now three panels (setup, generation, scoring) — drop the total
panel. Remove the docker single-turn cells: strictly worse than subprocess
(local container start + per-container uv cold-install for no isolation benefit
on a single-turn task), so not a useful comparison point there. docker stays in
the agentic suite, where container isolation is required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
terminal-bench-2-v1 fix-git, 32-turn agent. All clean (0 errors, reward
0.98-1.0). docker capped at 128, prime/modal unbounded. agentic.png committed
(setup/generation/scoring panels).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mikasenghaas changed the title from "exp: v1 runtime benchmark matrix + record per-rollout setup timing" to "exp: v1 runtime benchmark matrix" on Jun 12, 2026