ci: Standardize benchmarks and input constants by samuelburnham · Pull Request #467 · argumentcomputer/ix

samuelburnham · 2026-07-02T01:27:25Z

Note

Update bencher testbed from aiur-typecheck-x64-32x to aiur-check-x64-32x before merge

Lowering the witness cap below Zisk's built-in default (10) was measured to have a negligible effect on host RAM and prove time for the kernel typecheck workload, so the CLI override (which defaulted to 5) is removed and the prover uses EmbeddedOpts::default(). Drop the now-stale flag mentions and per-RAM tuning guidance from the README, the cost-model doc, and the shard.rs model comments.

Rebased onto main (post #411 + #459) and integrated with the renamed bench-main.yml and reworked bencher-track interface. - Benchmarks/Vectors.csv: single shared source of truth (71 library constants from Init/Std/Mathlib/Lean). Consumed by Aiur (bench-typecheck --manifest), the zkVM hosts, and shell. - bench-pr.yml: `!benchmark [aiur] [zisk] [sp1|all] [execute|prove]` over the curated set, posting a main-vs-PR table; main results cached by base SHA. Hardened: comment body from env (no injection), allowlisted env parsing. - .github/scripts/{bench.py,run.sh}: parse/manifest/compare/comment, and the compile-.ixe + backend driver (cycles/execute-time/throughput/peak-rss, plus shards/max-shard-cycles for sharded runs). - .github/actions/install-{sp1,zisk}: shared zkVM toolchain + deps install, used by bench-pr.yml, bench-main.yml, and riscv-bench.yml. - bench-main.yml: add zkvm-execute (Zisk/SP1 cycle counts + time/RAM) and native-check (native parallel `ix check --anon` throughput) jobs, using the new bencher-track workload/thresholds interface. - bench-typecheck: add --constant / --skip-deps (align with the zkVM hosts; --skip-deps replaces --subject-only) and --execute-only (fast execute path).

issue_comment workflows only run from the default branch, so the `!benchmark` path can't be exercised on a PR branch. Add a pull_request trigger (base/head from the PR payload; empty comment → parser defaults) to test pre-merge. Revert before merge — delete the `pull_request:` trigger; the dual base/head resolution and `|| pull_request.number` fallbacks are harmless to keep.

The full curated set (~60 InitStd) is too slow to run on every !benchmark and on the bencher prove/zkVM jobs. Add a `primary` column to Vectors.csv marking 11 constants spanning shape (nat/list/array/int/string/vector/multiset, defs + proofs) and the cheap→heavy cost range (incl. 3 shard targets), and make it the default: - bench.py manifest --primary; parse honors BENCH_FULL (run the full set). - bench-pr.yml: !benchmark defaults to the primary subset; BENCH_FULL=1 runs the whole curated set. - bench-main.yml: prove + zkvm-execute derive constants from `manifest --primary` (replacing the hardcoded lists), so all backends bench the same set from the one source of truth. The tier filter keeps prove on the cheap primaries (heavy ones would OOM a single-shard prove); execute/native get the heavy ones for scale.

Bring both surfaces (!benchmark PR comment and bencher.dev) to parity across all four backends and remove the test scaffolding. - Native kernel on !benchmark: add a `native` backend (whole-env `ix check --anon`, the parallel out-of-circuit kernel) — bench.py backend/runner/metrics, a run.sh `native` branch, and a GNU-time install for native PR cells. `all` now fans out aiur/zisk/sp1/native. - run.sh: single-source the Aiur path through a per-constant bench-typecheck loop (per-constant peak-rss), add a per-constant `timeout` to the zkVM execute path (heavy primaries can't hang a job), and accept the env arg case-insensitively (bencher reuses the cached InitStd.ixe; bench-pr compiles initStd.ixe). - bench-main.yml: the prove and native-check jobs now drive run.sh too (dedup with the PR path); zkvm-execute gains the Mathlib env to match prove/native. - bench-thresholds-reset.yml: register the zisk / sp1 / native-check workloads. - bench-pr.yml: drop the temporary pull_request trigger (keep the harmless dual-SHA / number fallbacks). - docs/benchmarking.md: document the two surfaces, backends, Vectors.csv, the !benchmark grammar, and the bencher workloads / threshold resets.

…maries too - Aiur prove now covers the whole primary subset: bench.py exempts --primary from the prove cheap-tier cap, and run.sh proves each constant whose Aiur fft-cost fits the prover RAM ceiling (AIUR_PROVE_MAX_FFT, ~128 GB at 2.34 GB per billion fft) and execute-only's the rest — so heavy primaries (only Vector.extract_append at ~145 GB) still report execute metrics instead of being dropped. BENCH_FULL prove stays capped at the cheap tier so it doesn't balloon. - native now reports two views per env: the whole env (`ix check --anon`, keyed by env) and a per-primary subject check (`ix check --consts`, keyed by constant) — apples-to-apples with the zkVM --skip-deps execute. Wired into both the native-check bencher job and the !benchmark native backend. - run.sh: fix a stream-corruption bug — tool stdout and ::warning::/::notice:: now go to logs/stderr so only JSON reaches the per-constant `jq -s` merge.

Route every backend's peak-rss and per-phase timings through tracing-texray: - peak-rss now comes from the fork's process-tree sampler, so Zisk's ASM microservices' memory (tens of GB in separate PIDs) is captured — a bare /proc/self/status read of the host missed it. Verified on real execute runs: sp1 ~9GB, zisk ~49GB peak. - zisk/sp1 hosts and bench-typecheck self-report cycles/time/throughput/ peak-rss via --json, retiring run.sh's grep/awk/time parsing. They also emit per-phase span timings via --texray-json; native check-rs reports peak-rss as a 6th ##check## field. - run.sh folds span timings into a per-constant `phases` object; bench.py renders a collapsible per-phase drill-down in the !benchmark comparison, and bench-main.yml tracks phase: measures on bencher. tracing-texray is pinned to the argumentcomputer/tracing-texray json-ram branch (rev 15ae57c), which adds the process-tree RSS sampler + JSON sink.

…h -check - Backend `native` is renamed to `ooc` (out-of-circuit) throughout: the BACKENDS tuple in bench.py, the run.sh case arm, the `native-check` job in bench-main.yml (now `ooc-check`), the `\!benchmark` grammar in bench-pr.yml, and docs/benchmarking.md. - Bencher workload names carry an explicit `-check` suffix so the tracked identity is self-describing: `aiur-check`, `zisk-check`, `sp1-check`, `ooc-check` (was `aiur`, `zisk`, `sp1`, `native-check`). Reset workflow tokens (options / valid / accepted / error text), tag-reference comments, and the bencher-track action's docstring updated to match. - Clarify the `\!benchmark` grammar: `([aiur] [zisk] [sp1] [ooc] | all)` — `all` is an alternative to the whole backend list, not just to `ooc`.

…ness **bench.py !benchmark surface** - `fetch-main`: queries bencher.dev's public reports API for `branch=main` at the base SHA, filters results by `--names`, and reshapes into the neutral `{ "<name>": { "<metric>": v } }` shape. Replaces the actions/cache layer; bench-pr.yml falls back to a local base run only when bencher hasn't ingested the base SHA yet. TODO left for non-main base branches. - Mode is fixed per backend (`aiur=prove`, others=`execute`); the optional bare `execute` token in `!benchmark` flips aiur to `--execute-only`. Grammar drops `[execute|prove]` and `BENCH_GPU`; zkVM prove paths removed. - Single runner `warp-ubuntu-latest-x64-32x` for every cell; testbeds unified to `<backend>-check-x64-32x`; `MAIN_TESTBEDS` maps (backend, mode) to the slug. - `compare` renders the OOM sentinel `{"oom": true}` as `OOM` cells + `n/a` Δ%. Regression/improvement flags fire on every metric column (previously only the first); summary counts distinct constants; worst names constant + metric. `_human` is unit-aware: bytes → GiB/MiB, seconds → µs/ms/s/m, counts → K/M/B/T. - Sub-span (per-constant phase) drill-down removed with a TODO; `run.sh`'s `merge_phases` and `bench-main.yml`'s `phase:` flattening still populate the neutral JSON for later reinstatement. **CLI unification** (bench-typecheck, zisk-host, sp1-host) - Dropped `--constant`; added `--consts <n1,n2,…>` (comma-list, clap `value_delimiter` on the Rust hosts) and `--consts-file <path>`. Multi-const runs loop the single-const path per name, accumulating one entry per name via a merge-safe `write_json_entry`. - Dropped `--texray-json <path>` and `--no-texray`. `--texray` (bool) alone toggles the tracing-texray subscriber; combined with `--json <path>`, span timings auto-write to `<json>.spans`. - Inline `#[cfg(test)] mod cli_tests` in each Rust host covers value-delimiter parsing, `--consts` `requires`, and `collect_consts` union/dedup. **Vectors.csv** - Removed `aiur_fft`, `zisk_cycles` (measurements live in bencher, never here). Rows can now omit trailing zero fields; parser tolerates 3+ cols. - Primary set curated: renamed `Vector.extract_append` → `Vector.extract_append._proof_1` (kept as shard_target); promoted `Vector.append` and `Nat.sub_le_of_le_add` to primary; added the un- shardable Init constants from the Zisk cost-model doc's canonical 12 (`Char.ofOrdinal_le_of_le`, `Array.extract_append._proof_1_1`, the `SInt.Int{8,16,32,64}.instRxcHasSize_eq` family with the case corrected), plus `ByteArray.utf8DecodeChar?_utf8EncodeChar_append`, `String.append`, `IxVMPrim.nat_pow_big`, and `Std.Tactic.BVDecide.BVExpr.bitblast.goCache_Inv_of_Inv._mutual`. Trimmed obvious duplicates (`Array.qsort`, `Int.ediv`, `List.dropLast`, `List.range`, `UInt32.toNat`, `Std.Time.Month.Offset.ofNat`). **RAM harness** (run.sh, aiur prove) - Tier gate removed. Every constant attempts a full prove. - `watch_ram_kill` samples `ps -eo pid,ppid,rss` every ~3 s and SIGKILLs the tree if it exceeds `AIUR_PROVE_MAX_RSS_GB` (default 120 GB — 8 GB headroom under 128 GB). Killed constants record `{"oom": true}` for the compare table. **Misc** - `install-zisk`'s description now correctly says the proving key is installed (`client.setup()` loads const-trees before execute too). - `bencher-track`'s workload description enumerates all options. - `zisk/Cargo.toml`'s patch example uses `/path/to/…`, not `/home/ubuntu`. - `docs/zisk-cycle-cost-model.md` finding #4 disambiguates "not-shardable" (mutual blocks only) from "not full-closure-single-leaf provable" (the canonical 12). - `riscv-bench.yml` still on the temporary `sb/ci-benchmarks` push trigger; drop before merge.

…threshold + doc cleanup **`compile` as a !benchmark backend** - bench.py: `DEFAULT_MODE["compile"]="compile"`, `MAIN_TESTBEDS[("compile","compile")]="ix-compile-x64-32x"`, `METRICS[("compile","compile")]=["compile-time","throughput","file-size", "constants"]`. `cmd_manifest` short-circuits on `--backend compile`: writes a one-line names.txt with the CamelCase env slug (`InitStd` / `Lean` / `Mathlib` / `FLT`) and skips the CSV. New `--backend` arg on manifest. - run.sh: setup step teed to `$compile_log` for every backend; `compile` ignores REUSE_IXE (needs a fresh compile to measure) and parses `##benchmark## <elapsed_ms> <bytes> <constants>` into the neutral `{ "<CamelCase>": {"compile-time","file-size","constants","throughput"} }` shape. Compile job on main (`bench-main.yml` testbed `ix-compile-x64-32x`) is the same run bencher already stores. - bench-pr.yml: `!benchmark ([aiur] [zisk] [sp1] [ooc] [compile] | all)`; manifest step passes `--backend "$BACKEND"`. **zkVM parity with ooc (drop `--skip-deps`)** - run.sh zisk/sp1 branch: `--consts <c>` without `--skip-deps` so the check is directly comparable to `ooc`'s `ix check-rs --consts <c>` (also full-closure). Comparing subject-only zkVM against full-closure ooc mixed in-circuit-vs-out-of-circuit overhead with scope; both are now full-closure, so the delta isolates just the overhead. **Threshold semantics for deterministic-but-directional measures** - bench-main.yml zkvm-execute (now zkvm-typecheck): `cycles`, `shards`, `max-shard-cycles` change from pinned `0/0` to `upper 0 / lower _`. These are deterministic (no noise) but a real guest / packer improvement legitimately drops them; the old pin flagged wins as regressions. `constants` (definitional count) stays `0/0`. **docs/benchmarking.md — stale content swept** - Removed the "design-level skips post a note explaining why" paragraph (that machinery was already removed). - Replaced the `cheap → prove / heavy → execute-only` fallback description with the RAM-watchdog reality (attempts prove for every primary; OOMs go through as `{"oom": true}` sentinels rendered `OOM`). - Backends table adds `compile`; drops metrics `aiur` doesn't actually emit. - "Constant set": ~20 primaries in ~68 total (was ~11 in ~60); notes tier is informational-only now, rows may omit trailing zeros, and `compile` short-circuits the CSV. - Per-measure threshold breakdown replaces the incorrect "Deterministic measures … are pinned exactly" claim. - "Not yet covered" now lists three real TODOs (zkVM prove, per-constant phase drill-down, non-main base branches) with pointers to the in-code TODO markers. **Code comment audit** — every touched comment/docstring re-read; stale references to `--skip-deps` parity, tier-based fallback, sub-span drill-down, and cache-hit rendering all updated to match implementation.

Δ% is easier to interpret at scale when paired with a ratio — e.g. `-33.3% (1.50x faster)` for a big prove-time drop; `+15.2% (1.15x slower)` for a real execute-time regression. Only shown when the factor is >= 1.05x (roughly ±5% in either direction) so sub-noise deltas don't get `(1.02x slower)` clutter next to `+1.1%`. Cell format cascades cleanly: `{Δ%} [(1.NN× {slower|faster})] [emoji]`.

…dedup Fixes from the branch-wide review, plus build-once caching and a dedup pass. **check-rs full-closure mode (new kernel FFI)** - `rs_kernel_check_anon_consts`: resolve displayed names via the env's `named` map (the zkVM hosts' resolution), then anon-check the seeds' FULL dependency closures via the existing `closure_addrs` + `build_anon_work` primitives — no sub-env serialization. `skip_deps` restricts to subject-only. `index_anon_work` extracted so the whole-env and filtered paths share the result-slot indexing. - `ix check-rs --anon --consts <names>` now runs that (full-closure default, `--skip-deps` opt-out — the hosts' CLI shape); meta-mode seeded checks are unchanged (subject-only, bisect flows intact). run.sh's ooc per-constant view uses it, so ooc-vs-zkVM deltas genuinely isolate in-circuit overhead (both sides now full-closure anon; previously ooc was subject-only meta). **OOM sentinel end-to-end** - run.sh merges `oom: true` INTO already-measured Phase-1 metrics (was: overwrite); compare renders OOM only for the missing metrics. - `bench.py bmf` (one neutral→BMF converter replacing four hand-copied jq/awk pipelines in bench-main.yml) strips the boolean sentinel — one OOM row no longer fails the whole bencher upload. This also covers the zkVM execute path, which gained its own RAM watchdog + sentinel. - Watchdog: setsid + process-group kill (reaches all descendants); kernel OOM beating the 3 s sampler (exit 137, no marker) labels OOM instead of silently dropping; per-constant wall-clock timeouts (AIUR_PROVE_TIMEOUT 50m / AIUR_EXECUTE_TIMEOUT 25m); $out re-merged per constant so a job-level kill keeps completed rows. **compare correctness** - throughput is higher-is-better: flags/worst/ratio direction-aware (HIGHER_IS_BETTER + _badness). - Ratio wording follows metric kind: times faster/slower, bytes larger/smaller, counts more/fewer. - `execute-peak-rss`: Phase-1 RSS high-water sampled at the Phase 1/2 boundary in bench-typecheck (both modes) and adopted by the zkVM hosts' execute JSON — one name for "execute-phase peak" on every backend; bare `peak-rss` reserved for prove-phase peaks. aiur-execute compares it apples-to-apples against prove-run baselines (raw peak-rss would dwarf it). - One-side-empty tables get a loud note (e.g. CLI-incompatible base). - Tidy JSON: bench-typecheck emits decimal `JsonNumber`s (`jsonRound`) instead of Float's full binary repr. **fetch-main robustness** - Retry with backoff + newest-first pagination (base SHA beyond page one no longer forces a permanent fallback). - Exit codes honored by bench-pr.yml: 3 (transient) → local fallback; 2 (BACKEND_TABLE / bench-main drift) fails the cell loudly. - ooc whole-env row keyed CamelCase on both sides (benv_cc table in run.sh, ENV_CC in bench.py) and admitted past the --names filter. **build-once + caching (bench-pr.yml)** - New build job: ix + bench-typecheck built once per head SHA (they embed the IxVM kernel/prover), cached as `bench-bins-<sha>`; cells restore into .bins/pr instead of per-cell Lean builds. Re-running \!benchmark on the same commit skips the build. - Base fallback restores bench-main's own `bench-bins-`/`bench-ixe-` caches (toolchain cmp + mathlib-oleans guard) before paying for a from-scratch base build. PR-side .ixe cached across cells. - Cache keys renamed `aiur-bench-bins-`/`aiur-ixe-` → `bench-bins-`/`bench-ixe-` (they serve every backend). - run.sh resolves tools in-tree first, then PATH (`resolve_bin`). **dedup** - bench.py: DEFAULT_MODE + METRICS + MAIN_TESTBEDS collapsed into one BACKEND_TABLE (mode / testbed / per-mode metrics). - bench-main's compile job routes through run.sh's compile backend + bmf — the `##benchmark##` line is parsed in exactly one place (run.sh gains `flt` in its env table for the FLT matrix cell). - `Ix.Cli.ConstsFile`: one names-file/comma-list parser (inline-`#` comments, dedup) shared by check-rs meta+anon, bench-typecheck, and `ix compile --exclude-file` — closes the whole-line-vs-inline comment drift across the five previous copies. **hosts** - clap `requires = "consts"` dropped from --json/--skip-deps (rejected valid --consts-file-only runs); validated in main after collect_consts. zisk's multi-`--ixe` guard now covers --consts-file. **misc** - Vectors.csv / docs tier semantics corrected (manifest-only consumer); stale prove-fallback comments fixed; tracing-texray bumped to bd4faa08; push-to-main workflows drop their concurrency groups (every merged commit must be benchmarked; a later merge must never cancel one). Verified: cargo check -p ix-ffi, lake build ix / bench-typecheck, live runs of the new check-rs mode against a 480 MB Init/Std env (closure = 34 items / 42 targets for Nat.add_comm; --skip-deps = 1; missing names error loudly), bmf/parse/compare dry-runs, YAML + bash -n + AST checks. zisk/sp1 host crates compile-verified in CI only (local sandbox lacks the pil2-proofman C++ toolchain).

…key) bench-main.yml's zkvm-execute job now runs real executions of both hosts on every main push, but tolerates per-constant failures by design (dropped rows, OOM sentinels) — it never turns red on a breakage. This workflow becomes the red-X signal instead, kept cheap: - Build-only + unit tests: `cargo build --release --bin <host>` plus `cargo test --release --bin <host>`, which covers the clap surface run.sh drives (--consts comma-splitting, --consts-file union/dedup, shard-plan conflicts). No execution, so no fixture compile job, no minimal.ixe artifact, no memlock prlimit. - install-zisk gains a `proving-key` input (default true): the key is loaded at runtime by `client.setup()`, never at build time, so the build gate skips the ~3 GB download + const-tree regeneration. - Renamed "RISC-V bench" → "zkVM host build" (it no longer benchmarks). - The TEMPORARY sb/ci-benchmarks push trigger (from the mid-branch test commit) reverts to main-only here; the tip commit is its sole carrier. The runtime/typecheck smoke signal (executing myReflEq and asserting failures == 0) is retired: that coverage now lives in zkvm-execute's real runs, at the cost of being warnings-not-red-X.

zisk-host pulls zisk-sdk as both a dependency and a build-dependency, so cargo compiles proofman-starks-lib-c as two units whose build scripts can run concurrently — and both run `make` inside the SHARED ~/.cargo/git/checkouts/pil2-proofman-* source tree (not OUT_DIR). On a cold runner the Makefile stamp is absent, so both units take the `make clean` + `make -j` path and race: one unit's clean deletes build/ while the other's g++ is mid-compile, dying with "opening dependency file ….d: No such file or directory" (g++ writes .d files at the END of compilation, so the dir vanished underneath it). Warm runners never hit it — the stamp short-circuits the clean — which is why the race only surfaced on this branch's cold cache (lockfile change → new rust-cache key). install-zisk now builds the sys crate solo first: its build script runs once, `make` completes, the stamp lands; both units of the subsequent parallel build then skip the clean and their `make -j` is a no-op. Same total work, just ordered. The proper fix is an flock around the make in pil2-proofman's build.rs — worth an upstream PR.

`resolve_bin ix` ran unconditionally at startup, but bench-main's zkvm-execute job legitimately has no `ix` anywhere: it restores only the `.ixe` cache (REUSE_IXE=1 skips the compile step) and builds its zkVM host via cargo — before the resolve_bin refactor that job never touched `ix` at all, so the eager check regressed it with "ix not found (in-tree or PATH)". Resolve per use site instead: the compile step and the ooc branch resolve `ix`, the aiur branch resolves `bench-typecheck`, and the zkVM branch needs neither.

…or names Three fixes for the failing bench-main runs: - run.sh: a heavy prove can die of memory without tripping the watchdog or the kernel OOM killer — one huge trace allocation fails and the runtime aborts (SIGABRT, exit 134) with an allocator message. looks_like_oom() folds that third case into the OOM classification so those constants get an OOM row instead of being dropped. Non-OOM failures now log the first lines of the tool output so drops are diagnosable from the job log. - run.sh: after a watchdog group-kill, the Zisk host never runs its Drop-time cleanup of /dev/shm/ZISK_* segments (multi-GB; the MT output segment alone starts at 6 GB), so the next host launch fails creating its own segments before Zisk's startup stale-pid sweep runs — exactly one dropped constant after every OOM kill (the alternating OOM/exit-1 pattern in the zisk-execute logs). Sweep the debris between constants. - Vectors.csv: IxVMPrim.nat_pow_big removed (kernel primitive, never present in compiled envs); Vector.extract_append._proof_1 → ._proof_2 and Array.extract_append._proof_1_1 → Array.extract_append (proof-term names are toolchain-dependent; both replacements verified against a freshly compiled initStd.ixe and measured heavy: fft-cost 23.4e9 / 40.1e9).

…nchmarks riscv-bench.yml's two jobs (build + unit-test the Zisk/SP1 hosts, no execution, no proving key) are cheap enough (~5 min with warm caches) to gate every PR commit, and ci.yml already has exactly the triggers and per-ref cancellation they need — so they move there and the standalone workflow goes away. sp1 benchmarks are commented out for now — the execute run is too slow: the sp1 cell of bench-main's zkvm-execute matrix and bench-pr's Install SP1 step. Uncomment both to re-enable. The \!benchmark backend-subset selection is unchanged.

The op (added with the native execution pipeline, #463) pushes two values onto the value map during execution — the quotient/remainder list-head pointers — and the constraint generator allocates two auxiliary columns for them, but the trace populator treated it as a no-op. Every ValIdx and witness column after the first big-Nat division in a block was therefore off by two, and trace population panicked with "index out of bounds" at trace.rs:308 — the exit-134 failures on every big-Nat-heavy prove in the bench-main aiur job (List.mergeSort, String.split, the SInt instRxcHasSize_eq family, Multiset.sort, …). Reproducible in seconds with any Nat-division user, e.g. `bench-typecheck --consts Nat.repr`. The trace arm now mirrors the execute arm: recompute (q, r) with num_bigint and resolve the head pointers execution already recorded in memory[10] via a read-only twin of build_klimbs_u64, pushing both as map entries + auxiliary columns. Verified: Nat.repr proves end-to-end (panicked before); `lake test -- --ignored aiur ixvm` passes; clippy clean.

Prover changes can trade prove speed against proof size or verification cost, so Phase 2 now serializes each fresh proof (proof-size, bytes) and verifies it (verify-time; a verification failure is reported loudly and drops only that measure). The full-closure path rebuilds the Array-G claim from proveAddrWithEnv's serialized claim bytes — verify_claim's input is the 32-G blake3 digest of those bytes, the same recipe as `ix verify`. Both measures flow through the neutral JSON into the PR compare table (bench.py aiur prove metrics + byte/second unit mapping) and onto bencher (verify-time 10% upper, proof-size 5% upper). Measured: Nat.add_comm proves in 1.63s, verifies in 0.19s, 33.4 MB proof.

The constants that OOM as single full-closure leaves (extract_append proofs, the instRxcHasSize_eq family, …) are only measurable under env sharding — their checks fit in one manifest shard each, with deps checked in other shards (docs/zisk-cycle-cost-model.md, finding 4). Wire the existing offline partitioner into CI instead of any per-constant sharding: - compile job: after the compile benchmark, `ix profile <Env>.ixe` → `ix shard --max-ram 120` for InitStd/Mathlib; the `.ixes` manifest is cached next to the `.ixe` (every restore of that key lists the same paths — actions/cache versions entries by path list). - zisk-host: `--shard-plan --execute --json` now writes one env-level row — total cycles, shards, max-shard-cycles, execute-time, throughput, execute-peak-rss, and a per-shard `shard-cycles` breakdown — keyed by the new `--json-name` (default: manifest stem). - run.sh: the zkVM guest-run dance (setsid + RAM watchdog + OOM classification + shm sweep + merge) is factored into one `zkvm_run` helper; when a `.ixes` sits next to the `.ixe` (bench-main only — the \!benchmark PR path has none) the zisk branch appends the env-sharded run after the per-constant loop. - bench.py bmf: nested-object flattening generalized — `phases` → `phase:` as before, any other dict (e.g. `shard-cycles`) → `<key>:` — so per-shard cycles land on bencher as un-thresholded measures while the thresholded aggregates (cycles, shards, max-shard-cycles) do the alerting. zkvm job timeout 60 → 150 min: the whole-env run (own 60m cap inside run.sh) rides on top of the per-constant loop.

fetch-main previously treated main.json as all-or-nothing: any name bencher lacked at the base SHA (typically constants the PR itself adds to Vectors.csv) silently rendered n/a. It now writes the uncovered subset to --missing-out and still exits 0; bench-pr runs the base checkout on JUST those names and merges under bencher's rows (bencher wins overlaps — it is the canonical main side), so a brand-new constant gets a real main-vs-PR delta on its first \!benchmark. The full-set base fallback (exit 3, SHA not ingested) is unchanged.

A PR can change the kernel's cost profile, so the env-sharded zisk run must not inherit main's partition. run.sh now cuts the manifest in-place (ix profile → ix shard) whenever one isn't already sitting next to the .ixe — the profiler counts heartbeats rather than wall time, so an unchanged tree re-partitions deterministically and per-shard comparisons stay meaningful. bench-main's compile-job manifest keeps pre-empting the generation on the main side; the \!benchmark PR side profiles its own tree (manifest cached per head SHA under bench-pr-ixes-*). The base fallback reuses bench-main's cached manifest and pays the env-sharded run only on a full bencher miss — a partial miss means bencher already holds main's env row (ZISK_ENV_SHARD=0 skips it). bench-pr benchmark timeout 120 → 180 min for the zisk cell's worst case (per-constant loop + profile + the env execute's own 60m cap).

The shard breakdown carried cycles only. Each shard now also records its guest execute time (SDK-reported) and its RAM high-water — the texray sampler's peak is reset before every shard, so the windows are independent and the env row's execute-peak-rss becomes their max (the run's execution-phase high-water; setup RAM no longer counted). Uploaded as shard-time:<k> / shard-peak-rss:<k> alongside shard-cycles:<k>, all un-thresholded like phase:. No prove-side equivalent yet: zisk proving isn't wired up in CI at all (needs a GPU runner), so there is no prove-time for any zisk row.

…mand Two ways to get a closure-only `.ixe` for one constant: - `ix compile <file.lean> --consts <n1,n2,…>` (+ --consts-file): seed the compile by exact constant name instead of the whole import env, with transitive deps via the existing collectDeps — same names vocabulary as `ix check --consts`. Mutually exclusive with --module. - `ix shard extract <env.ixe> --consts <n1,n2,…> --out <sub.ixe>`: the sharding pipeline's scoping step — cut the closure out of an EXISTING env without recompiling from source. The output carries the closure's genuine constant bytes, blobs, and reducibility hints (build_sub_env, now shared via sub_env_of), plus a name→address entry per closure constant so `--consts`-style tools still resolve. Metadata is dropped (real ConstantMeta references name addresses throughout its tree and would need the full hash-consed name index) — extracted envs serve the anon pipeline; meta-mode tools need the source env. A mutual-block member extracts its whole block. Verified against a fresh initStd.ixe: both forms produce envs whose bench-typecheck fft-cost is bit-identical to the full-env run (content addressing at work), and extract → ix profile → ix shard composes cleanly (Int8.instRxcHasSize_eq: 2.1 MB closure env, 16.08e9 steps → 11 shards at the 120 GiB cap).

Replaces the whole-env sharded run (wrong scope: all of InitStd is 1.94e12 steps / 1263 shards, and its biggest atomic block is INFEASIBLE under the cap) with per-constant closure sharding through the canonical pipeline. For each heavy-tier primary, run.sh runs `ix shard extract` (closure-only env, no recompile) → `ix profile` → `ix shard --max-ram 120`, then one `zisk-host --shard-plan --execute` run executes the shards sequentially — the constant's row carries totals plus the per-shard shard-{cycles,time,peak-rss}:<k> breakdown, uploaded to bencher like every other measure. Cheap primaries keep the single-leaf --consts run. - run.sh: cut_closure_shards (also exposed as the `cutshards` backend so bench-main's compile job pre-cuts through the same code path) + heavy-tier dispatch in the zisk loop (ZISK_HEAVY_NAMES). A constant whose partition still can't fit (atomic mutual block over the cap) OOMs under the RAM watchdog: honest OOM row, remaining shards skipped, loop proceeds to the next constant. Cutting failures fall back to the single-leaf run. - bench.py manifest --heavy-out: the selected heavy-tier names, from Vectors.csv's tier column. - bench-main: the compile job pre-cuts zkshards-<Bench>/ next to the fresh .ixe (it has ix + the toolchain; the zkvm job stays Lean-free) and ships it to the zkvm job in the sha-keyed bench-ixe-* cache entry. - bench-pr: the PR side cuts its own artifacts fresh every run (cheap — seconds per closure — and a PR can change the cost profile; profiling counts heartbeats, so an unchanged tree re-partitions deterministically); the base fallback reuses bench-main's pre-cut dir.

… job A constant the kernel rejects is a correctness regression, not a benchmark datum. End to end: - zisk-host: every sharded loop (shard-plan execute + prove, whole-env execute + leaf prove) bails on the FIRST failing shard via the shared reject_failures helper — mirroring the OOM kill, which also cancels the constant's remaining shards — instead of accumulating failures across the full manifest. sp1 and the single-leaf paths already bailed immediately. - run.sh: a zkVM failure whose log carries the host's "kernel typecheck produced" abort records the neutral `{"failed": true}` sentinel (an ::error:: annotation, not a silent drop); ooc does the same when ##check## reports failures. bench-typecheck marks Phase-1 check errors failed and skips them in Phase 2. Failure warnings now print the log's head AND tail (the host's abort lands at the end of a mid-manifest log). - bench.py compare: `failed` renders ❌ cells (outranking OOM) plus a bold "FAILED TO TYPECHECK on the <side> side" note under the table; bmf strips the sentinel so rejected rows never reach bencher. - workflows: prove/zkvm/ooc jobs (bench-main) and the benchmark cell (bench-pr) exit nonzero when the neutral JSON carries a failed row — AFTER the clean rows upload / the table posts, so the red X lands on the commit/PR without losing the report.

Drop this commit before merging. The zkVM build gate needs no branch trigger — it lives in ci.yml now and runs on pull_request.

samuelburnham force-pushed the sb/ci-benchmarks branch 2 times, most recently from 43d00a3 to eb6b068 Compare July 2, 2026 20:09

samuelburnham added 15 commits July 2, 2026 21:50

samuelburnham force-pushed the sb/ci-benchmarks branch from c73bafa to 00dd93d Compare July 3, 2026 02:10

samuelburnham added 2 commits July 2, 2026 23:00

samuelburnham force-pushed the sb/ci-benchmarks branch from 00dd93d to a959f36 Compare July 3, 2026 13:20

samuelburnham force-pushed the sb/ci-benchmarks branch from a959f36 to 6949298 Compare July 3, 2026 15:26

samuelburnham added 4 commits July 3, 2026 12:16

samuelburnham force-pushed the sb/ci-benchmarks branch from 6949298 to 07264b9 Compare July 3, 2026 17:00

samuelburnham added 2 commits July 3, 2026 13:09

samuelburnham added 3 commits July 3, 2026 15:53

test(ci): trigger bench-main on branch pushes (TEMPORARY)

f1ff88a

Drop this commit before merging. The zkVM build gate needs no branch trigger — it lives in ci.yml now and runs on pull_request.

samuelburnham force-pushed the sb/ci-benchmarks branch from 07264b9 to f1ff88a Compare July 3, 2026 20:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ci: Standardize benchmarks and input constants#467

ci: Standardize benchmarks and input constants#467
samuelburnham wants to merge 27 commits into
mainfrom
sb/ci-benchmarks

samuelburnham commented Jul 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

samuelburnham commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

samuelburnham commented Jul 2, 2026 •

edited

Loading