exp: v1 runtime benchmark matrix #1630
Draft
mikasenghaas wants to merge 9 commits into
Swap the env-server WORKERS axis for a RUNTIME axis: benchmark.sh (gsm8k-v1)
compares subprocess/docker/prime; agentic_benchmark.sh compares docker/prime.
Rollouts {64, 512}, concurrency unbounded (--max_concurrent None), agentic turns
capped at 32, gsm8k generation unbounded (MAX_TOKENS optional). bench_aggregate.py
parses <runtime>-r<rollouts> labels and surfaces per-rollout setup_durations next
to gen_durations (setup span landed in #1631).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
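The `<runtime>-r<rollouts>` label convention from the commit above could be parsed along these lines. This is a minimal sketch, not the actual bench_aggregate.py code: the names `CellLabel` and `parse_label`, and the exact character set allowed in a runtime name, are assumptions made for illustration.

```python
import re
from typing import NamedTuple


class CellLabel(NamedTuple):
    """One matrix cell: which runtime ran, and how many rollouts."""

    runtime: str
    rollouts: int


# Assumed shape: a lowercase runtime name, then "-r", then the rollout count.
_LABEL_RE = re.compile(r"^(?P<runtime>[a-z0-9_]+)-r(?P<rollouts>\d+)$")


def parse_label(label: str) -> CellLabel:
    """Split a label such as 'prime-r512' into its runtime and rollout count."""
    match = _LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not a <runtime>-r<rollouts> label: {label!r}")
    return CellLabel(match["runtime"], int(match["rollouts"]))
```

Rejecting anything that does not match keeps a stray file or an old worker-style label from being aggregated silently.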
…nt base
Results (bench/results/single_turn.json) from gsm8k-v1 on deepseek-v4-flash,
subprocess + prime x {64,512}, unbounded concurrency. docker was excluded on this
box: per-container cold-install fills the disk at 512 (~100 GB needed vs ~50 GB
free), though it stays in the script's default matrix. Highlights: prime sandbox provisioning
~10-15s/rollout (grows with concurrency) vs ~0 for subprocess; 0 errors, reward ~1.0.
- bench/plot.py: stacked setup-vs-generation p50 per cell + gen p90 whisker (PNG
is gitignored; regenerate from the JSON).
- benchmark.sh / agentic_benchmark.sh: --retry.attempts -> --retries.*.max_retries 0
(retry config was restructured by #1632), and guard the ~/.env source so a
detached run doesn't die re-reading the 1Password FIFO.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…RENT knob
One JSON per matrix cell under bench/results/single_turn/ so a subset re-run refreshes only those cells: bench_aggregate.py writes per-cell (recording max_concurrent); benchmark.sh / agentic_benchmark.sh write into the committed results dir without wiping it; plot.py reads the whole dir. Adds a MAX_CONCURRENT knob: docker ran capped at 128 here to bound its per-container cold-install disk use (subprocess/prime unbounded; recorded per cell). Includes the docker results (its cold-install lands in generation, not setup). Drops the combined single_turn.json.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
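The per-cell layout described above can be sketched as follows. This is an illustration under assumptions rather than the code in bench_aggregate.py or plot.py: `write_cell` and `load_cells` are hypothetical names, and the cell payload is treated as an opaque dict.

```python
import json
from pathlib import Path


def write_cell(results_dir: Path, runtime: str, rollouts: int, cell: dict) -> Path:
    """Write (or overwrite) exactly one cell file; other cells are left alone."""
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{runtime}-r{rollouts}.json"
    path.write_text(json.dumps(cell, indent=2, sort_keys=True) + "\n")
    return path


def load_cells(results_dir: Path) -> dict[str, dict]:
    """Read every cell in the directory, keyed by its <runtime>-r<rollouts> label."""
    return {
        path.stem: json.loads(path.read_text())
        for path in sorted(results_dir.glob("*.json"))
    }
```

Because each cell owns its file, re-running only one runtime replaces only that runtime's files, and a reader that globs the directory picks up the mix of old and refreshed cells.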
modal r64/r512 (remote sandboxes, unbounded). modal-r64 is clean (0 errors, setup_p50 ~3.5s, like prime); modal-r512 hits Modal's sandbox provisioning rate limit at 512 concurrent: 221/512 errors (reward 0.57), a real scaling signal. Sandboxes terminate per-rollout + via the lifetime backstop (0 left after the run).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lel limit)
Re-run on the merged host-global creation limiter (5/s) with the account's raised 1000-container parallel ceiling: modal-r512 now 0 errors / reward 1.0 (was 221 errors / 0.568), e2e 173s (was 257s). Setup p90 ~92s is the 5/s creation ramp for 512 sandboxes, not failures.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Aggregate now records per-rollout scoring and total (whole-rollout) durations alongside setup + generation. plot.py becomes a 2x2 of the four stages, each a per-runtime p50 bar with a p10-p90 whisker grouped by rollout count (committed PNG). Refreshed subprocess / prime / modal cells carry the new metrics; docker not re-run this round, so it appears only in the setup/generation panels.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
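The p50 bar and p10-p90 whisker can be derived from a per-rollout duration list roughly as below. This is a sketch assuming the nearest-rank percentile definition; the PR does not say which definition plot.py uses, and `percentile` and `stage_summary` are hypothetical helper names.

```python
import math
from typing import Sequence


def percentile(sorted_values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile of an ascending list, for an integer q in 1..100."""
    if not sorted_values:
        raise ValueError("no durations recorded for this stage")
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    rank = math.ceil(q * len(sorted_values) / 100)  # 1-based rank
    return sorted_values[rank - 1]


def stage_summary(durations: Sequence[float]) -> dict[str, float]:
    """Bar height (p50) and whisker ends (p10, p90) for one stage of one cell."""
    ordered = sorted(durations)
    return {
        "p10": percentile(ordered, 10),
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
    }
```

Nearest-rank always returns an observed duration, which keeps the summary easy to cross-check against the raw list; an interpolating definition would give slightly different values on small cells.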
… modal
All clean (0 errors). prime scales flattest (total p50 46s); modal is gated by 5/s sandbox creation (setup p50 102s, the 1024/5 ramp) but gen/scoring stay flat since compute is offloaded; subprocess saturates the host at 1024 concurrent rollouts (gen p50 30s, scoring p50 14s). plot.py now handles >2 rollout groups.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
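The ramp figures quoted in these commits follow from simple queueing arithmetic. If all N sandboxes are requested at once and the limiter admits them at a steady rate r per second, the rollout at quantile q waits about q * N / r seconds. The sketch below assumes that sandbox boot time is negligible next to the queueing delay, which the PR does not state; `ramp_wait` is a hypothetical helper.

```python
def ramp_wait(n_sandboxes: int, rate_per_s: float, quantile: float) -> float:
    """Approximate setup wait (seconds) at a given quantile under a creation limiter.

    Assumes all sandboxes are requested together and admitted at a constant rate,
    so the k-th sandbox starts at roughly k / rate_per_s seconds.
    """
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be within [0, 1]")
    return quantile * n_sandboxes / rate_per_s


# 512 sandboxes at 5/s: the p90 rollout waits about 0.9 * 512 / 5 = 92.16 s.
p90_r512 = ramp_wait(512, 5.0, 0.9)

# 1024 sandboxes at 5/s: the p50 rollout waits about 0.5 * 1024 / 5 = 102.4 s.
p50_r1024 = ramp_wait(1024, 5.0, 0.5)
```

Both values line up with the reported setup p90 of ~92s at 512 rollouts and setup p50 of 102s at 1024 rollouts, which supports reading those numbers as limiter queueing rather than slow provisioning.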
…le-turn
plot.py is now three panels (setup, generation, scoring), dropping the total panel. Remove the docker single-turn cells: strictly worse than subprocess (local container start + per-container uv cold-install for no isolation benefit on a single-turn task), so not a useful comparison point there. docker stays in the agentic suite, where container isolation is required.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
terminal-bench-2-v1 fix-git, 32-turn agent. All clean (0 errors, reward 0.98-1.0). docker capped at 128, prime/modal unbounded. agentic.png committed (setup/generation/scoring panels).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Reshapes the v1 benchmark to compare runtimes instead of env-server worker modes, writes one results file per matrix cell, adds a plot script, and commits a single-turn results set (subprocess / prime / modal).
- benchmark.sh (gsm8k-v1), agentic_benchmark.sh (terminal-bench-2-v1).
- Rollouts {64, 512, 1024} (single-turn); concurrency unbounded by default (MAX_CONCURRENT knob to cap it); agentic turns capped at 32; gsm8k generation unbounded.
- bench/results/<suite>/<runtime>-r<rollouts>.json. bench_aggregate.py writes one file per cell (records max_concurrent, reward/errors, and the sorted per-rollout duration lists for setup / generation / scoring / total); the scripts write into the committed results dir without wiping it, so re-running a subset (e.g. just modal) refreshes only those cells.
- bench/plot.py renders the committed plot below.
- --retry.attempts → --retries.*.max_retries 0 (feat: per-call model + runtime retries #1632), and a guard on the ~/.env source so a detached run doesn't die re-reading the 1Password FIFO.
Results (bench/results/single_turn/)
gsm8k-v1 on deepseek-v4-flash, unbounded concurrency. Each stage column is p50 (p10–p90) seconds per rollout; e2e is the cell wall clock. (subprocess runs rollouts on the eval host; prime/modal offload to remote sandboxes.)
Results (bench/results/agentic/)
terminal-bench-2-v1 fix-git on deepseek-v4-flash, 32-turn agent. docker capped at 128 (local containers), prime/modal unbounded. subprocess is excluded: agentic needs a container. Stage columns are p50 (p10–p90) seconds per rollout.
Depends on
Trace.to_wire timing-exclusion fix, needed to run over the env-server path, is merged into the base (generic computed-field exclusion); fix: exclude timing.setup.duration from Trace.to_wire #1634 closed as superseded.