exp: v1 runtime benchmark matrix by mikasenghaas · Pull Request #1630 · PrimeIntellect-ai/verifiers

mikasenghaas · 2026-06-11T19:41:33Z

Summary

Reshapes the v1 benchmark to compare runtimes instead of env-server worker modes, writes one results file per matrix cell, adds a plot script, and commits a single-turn results set (subprocess / prime / modal).

Swept axis is the runtime: benchmark.sh (gsm8k-v1), agentic_benchmark.sh (terminal-bench-2-v1).
Rollouts {64, 512, 1024} (single-turn); concurrency unbounded by default (MAX_CONCURRENT knob to cap it); agentic turns capped at 32; gsm8k generation unbounded.
Per-cell results: bench/results/<suite>/<runtime>-r<rollouts>.json. bench_aggregate.py writes one file per cell (records max_concurrent, reward/errors, and the sorted per-rollout duration lists for setup / generation / scoring / total); the scripts write into the committed results dir without wiping it, so re-running a subset (e.g. just modal) refreshes only those cells. bench/plot.py renders the committed plot below.
Base-compat fixes: --retry.attempts → --retries.*.max_retries 0 (the retry config was restructured by #1632, "feat: per-call model + runtime retries"), and a guard on the ~/.env source so a detached run doesn't die re-reading the 1Password FIFO.

Results (bench/results/single_turn/)

gsm8k-v1 on deepseek-v4-flash, unbounded concurrency. Each stage column is p50 (p10–p90) seconds per rollout; e2e is the cell wall clock. (subprocess runs rollouts on the eval host; prime/modal offload to remote sandboxes.)

cell	concurrency	e2e wall clock	reward	setup (s)	generation (s)	scoring (s)	total (s)
subprocess r64	∞	12s	1.0	0.0 (0.0–0.0)	6.3 (3.9–7.6)	0.3 (0.2–0.3)	6.5 (4.2–7.8)
subprocess r512	∞	33s	0.994	0.0 (0.0–0.0)	18.6 (11.8–21.6)	5.3 (4.3–6.9)	24.6 (17.1–26.9)
subprocess r1024	∞	56s	0.997	0.0 (0.0–0.0)	30.3 (15.9–45.5)	14.1 (4.7–16.7)	45.0 (31.7–50.3)
prime r64	∞	32s	1.0	10.9 (8.7–11.8)	9.6 (9.6–12.6)	3.5 (3.5–3.5)	24.6 (21.8–27.2)
prime r512	∞	49s	0.996	14.2 (10.2–17.2)	12.8 (12.6–16.0)	3.5 (3.5–3.7)	31.9 (26.7–35.8)
prime r1024	∞	84s	0.999	20.4 (12.6–40.6)	16.0 (12.8–19.6)	6.8 (3.5–9.9)	45.7 (34.9–57.4)
modal r64	∞	41s	1.0	8.0 (2.5–12.5)	10.3 (8.7–12.7)	3.5 (2.7–4.4)	22.0 (15.6–28.0)
modal r512	∞	173s	1.0	51.1 (11.8–91.4)	10.7 (8.5–13.2)	3.4 (2.6–4.5)	64.0 (26.1–105.8)
modal r1024	∞	283s	0.998	101.7 (21.7–183.7)	10.0 (8.2–13.3)	3.5 (2.6–4.5)	115.7 (35.8–197.7)
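The stage columns above condense each cell's sorted per-rollout duration list into p50 (p10–p90). A minimal sketch of how such a column could be derived, assuming linear interpolation between ranks (the PR does not say which percentile method bench_aggregate.py uses). The helper names `percentile` and `stage_summary` are hypothetical, and the sample durations are made up:

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("no durations recorded")
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def stage_summary(durations):
    """Format one stage column as 'p50 (p10–p90)' with one decimal place."""
    values = sorted(durations)
    p10, p50, p90 = (percentile(values, q) for q in (10, 50, 90))
    return f"{p50:.1f} ({p10:.1f}–{p90:.1f})"


# Made-up durations in seconds, for illustration only (not taken from the results).
sample = [float(i) for i in range(11)]
print(stage_summary(sample))
```

A different percentile convention (for example nearest-rank) would shift the values slightly for small cells such as r64.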

Results (bench/results/agentic/)

terminal-bench-2-v1 fix-git on deepseek-v4-flash, 32-turn agent. docker capped at 128 (local containers), prime/modal unbounded. subprocess is excluded — agentic needs a container. Stage columns are p50 (p10–p90) seconds per rollout.

cell	concurrency	e2e wall clock	reward	setup (s)	generation (s)	scoring (s)	total (s)
docker r64	128	165s	0.984	0.5 (0.4–0.6)	68.3 (62.0–85.1)	4.5 (4.2–5.4)	74.0 (66.9–89.8)
docker r512	128	485s	0.992	0.9 (0.3–2.0)	90.0 (58.6–111.2)	7.5 (3.5–18.8)	99.0 (65.5–131.6)
prime r64	∞	233s	0.984	10.4 (8.4–12.3)	60.6 (50.5–75.1)	14.7 (8.1–16.3)	87.4 (69.7–94.3)
prime r512	∞	289s	0.996	14.1 (10.2–17.3)	105.3 (81.0–117.9)	6.8 (6.7–14.5)	126.0 (108.6–140.0)
modal r64	∞	136s	1.0	8.4 (2.8–13.1)	64.0 (50.9–84.5)	6.6 (4.8–7.9)	80.3 (62.7–100.4)
modal r512	∞	234s	0.977	51.3 (11.3–92.8)	60.6 (52.0–73.4)	6.4 (4.6–8.3)	121.9 (79.4–161.2)

Depends on

feat: record per-rollout setup timing as a distinct phase #1631 (setup span) — merged.
The Trace.to_wire timing-exclusion fix, needed to run over the env-server path: merged into the base as a generic computed-field exclusion; #1634 (fix: exclude timing.setup.duration from Trace.to_wire) was closed as superseded.

Swap the env-server WORKERS axis for a RUNTIME axis: benchmark.sh (gsm8k-v1) compares subprocess/docker/prime; agentic_benchmark.sh compares docker/prime. Rollouts {64, 512}, concurrency unbounded (--max_concurrent None), agentic turns capped at 32, gsm8k generation unbounded (MAX_TOKENS optional). bench_aggregate.py parses <runtime>-r<rollouts> labels and surfaces per-rollout setup_durations next to gen_durations (setup span landed in #1631). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
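As an illustration of the `<runtime>-r<rollouts>` label convention, here is a minimal parsing sketch. It is not the code in bench_aggregate.py: the regular expression, the `Cell` type, and `parse_cell_label` are hypothetical, and the pattern assumes runtime names are lowercase identifiers such as subprocess, docker, prime, or modal.

```python
import re
from typing import NamedTuple

# Assumed label shape: "<runtime>-r<rollouts>", e.g. "prime-r512".
CELL_LABEL = re.compile(r"^(?P<runtime>[a-z][a-z0-9_]*)-r(?P<rollouts>\d+)$")


class Cell(NamedTuple):
    runtime: str
    rollouts: int


def parse_cell_label(label: str) -> Cell:
    """Split a matrix-cell label into its runtime name and rollout count."""
    match = CELL_LABEL.match(label)
    if match is None:
        raise ValueError(f"not a <runtime>-r<rollouts> label: {label!r}")
    return Cell(match["runtime"], int(match["rollouts"]))


print(parse_cell_label("prime-r512"))
```

Rejecting malformed labels early keeps a stray file name from silently becoming a cell in the aggregate.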

…nt base Results (bench/results/single_turn.json) from gsm8k-v1 on deepseek-v4-flash, subprocess + prime x {64,512}, unbounded concurrency. docker was excluded on this box (per-container cold-install fills the disk at 512: ~100 GB needed vs ~50 GB free; it stays in the script's default matrix). Highlights: prime sandbox provisioning ~10-15s/rollout (grows with concurrency) vs ~0 for subprocess; 0 errors, reward ~1.0.
- bench/plot.py: stacked setup-vs-generation p50 per cell + gen p90 whisker (PNG is gitignored; regenerate from the JSON).
- benchmark.sh / agentic_benchmark.sh: --retry.attempts -> --retries.*.max_retries 0 (retry config was restructured by #1632), and guard the ~/.env source so a detached run doesn't die re-reading the 1Password FIFO.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…RENT knob One JSON per matrix cell under bench/results/single_turn/ so a subset re-run refreshes only those cells: bench_aggregate.py writes per-cell (recording max_concurrent); benchmark.sh / agentic_benchmark.sh write into the committed results dir without wiping it; plot.py reads the whole dir. Adds a MAX_CONCURRENT knob — docker ran capped at 128 here to bound its per-container cold-install disk use (subprocess/prime unbounded; recorded per cell). Includes the docker results (its cold-install lands in generation, not setup). Drops the combined single_turn.json. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
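The one-file-per-cell layout is what makes a subset re-run safe: writing a cell overwrites only its own file, and a reader that scans the directory picks up the rest unchanged. A minimal sketch of that idea, assuming the path scheme `<results_dir>/<suite>/<runtime>-r<rollouts>.json`. The function names and the record keys (`max_concurrent`, `errors`) are illustrative and not the actual schema written by bench_aggregate.py:

```python
import json
import tempfile
from pathlib import Path


def write_cell(results_dir, suite, runtime, rollouts, record):
    """Write (or overwrite) exactly one matrix cell, leaving sibling files untouched."""
    cell_path = Path(results_dir) / suite / f"{runtime}-r{rollouts}.json"
    cell_path.parent.mkdir(parents=True, exist_ok=True)
    cell_path.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
    return cell_path


def load_cells(results_dir, suite):
    """Read every cell in a suite directory, keyed by its '<runtime>-r<rollouts>' label."""
    suite_dir = Path(results_dir) / suite
    return {p.stem: json.loads(p.read_text()) for p in sorted(suite_dir.glob("*.json"))}


# Demo in a throwaway directory with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    write_cell(tmp, "single_turn", "prime", 64, {"max_concurrent": None, "errors": 0})
    write_cell(tmp, "single_turn", "modal", 64, {"max_concurrent": None, "errors": 3})
    # Re-running only the modal cell refreshes that file alone.
    write_cell(tmp, "single_turn", "modal", 64, {"max_concurrent": None, "errors": 0})
    print(sorted(load_cells(tmp, "single_turn")))
```

Here `max_concurrent: None` stands in for the unbounded case; a capped run would record the cap (for example 128).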

modal r64/r512 (remote sandboxes, unbounded). modal-r64 is clean (0 errors, setup_p50 ~3.5s, like prime); modal-r512 hits Modal's sandbox provisioning rate limit at 512 concurrent — 221/512 errors (reward 0.57), a real scaling signal. Sandboxes terminate per-rollout + via the lifetime backstop (0 left after the run). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…lel limit) Re-run on the merged host-global creation limiter (5/s) with the account's raised 1000-container parallel ceiling: modal-r512 now 0 errors / reward 1.0 (was 221 errors / 0.568), e2e 173s (was 257s). Setup p90 ~92s is the 5/s creation ramp for 512 sandboxes, not failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Aggregate now records per-rollout scoring and total (whole-rollout) durations alongside setup + generation. plot.py becomes a 2x2 of the four stages, each a per-runtime p50 bar with a p10-p90 whisker grouped by rollout count (committed PNG). Refreshed subprocess / prime / modal cells carry the new metrics; docker not re-run this round, so it appears only in the setup/generation panels. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… modal All clean (0 errors). prime scales flattest (total p50 46s); modal is gated by 5/s sandbox creation (setup p50 102s, the 1024/5 ramp) but gen/scoring stay flat since compute is offloaded; subprocess saturates the host at 1024 concurrent rollouts (gen p50 30s, scoring p50 14s). plot.py now handles >2 rollout groups. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
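The "1024/5 ramp" can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough model, not the limiter's implementation: it assumes every rollout requests its sandbox at time zero, creation proceeds at a steady 5 per second, and per-sandbox boot time is ignored. The name `expected_queue_wait` is hypothetical.

```python
def expected_queue_wait(n_sandboxes: int, rate_per_s: float, quantile: float) -> float:
    """Seconds until the sandbox at the given queue quantile is created under a steady rate limit."""
    return quantile * n_sandboxes / rate_per_s


for n in (512, 1024):
    p50 = expected_queue_wait(n, 5.0, 0.5)
    p90 = expected_queue_wait(n, 5.0, 0.9)
    print(f"r{n}: p50 ~{p50:.1f}s, p90 ~{p90:.1f}s")
```

Under these assumptions the model gives about 51 s / 92 s for 512 sandboxes and about 102 s / 184 s for 1024, close to the measured modal setup p50 and p90 in the single-turn results (51.1 / 91.4 and 101.7 / 183.7).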

…le-turn plot.py is now three panels (setup, generation, scoring) — drop the total panel. Remove the docker single-turn cells: strictly worse than subprocess (local container start + per-container uv cold-install for no isolation benefit on a single-turn task), so not a useful comparison point there. docker stays in the agentic suite, where container isolation is required. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

terminal-bench-2-v1 fix-git, 32-turn agent. All clean (0 errors, reward 0.98-1.0). docker capped at 128, prime/modal unbounded. agentic.png committed (setup/generation/scoring panels). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mikasenghaas force-pushed the exp/v1-bench-runtime-matrix branch 3 times, most recently from 7928584 to 577ae0c, on June 11, 2026 21:22

macroscopeapp Bot reviewed Jun 11, 2026


Comment thread bench/bench_aggregate.py Outdated

mikasenghaas and others added 5 commits June 11, 2026 23:25

mikasenghaas force-pushed the exp/v1-bench-runtime-matrix branch from 577ae0c to 597fd44 on June 11, 2026 23:30

macroscopeapp Bot reviewed Jun 11, 2026


Comment thread bench/plot.py Outdated

mikasenghaas and others added 3 commits June 12, 2026 00:03

mikasenghaas changed the title from "exp: v1 runtime benchmark matrix + record per-rollout setup timing" to "exp: v1 runtime benchmark matrix" on Jun 12, 2026

mikasenghaas wants to merge 9 commits into feat/nano-as-v1 from exp/v1-bench-runtime-matrix
