Skip to content

ci: Standardize benchmarks and input constants#467

Draft
samuelburnham wants to merge 27 commits into
mainfrom
sb/ci-benchmarks
Draft

ci: Standardize benchmarks and input constants#467
samuelburnham wants to merge 27 commits into
mainfrom
sb/ci-benchmarks

Conversation

@samuelburnham

@samuelburnham samuelburnham commented Jul 2, 2026

Copy link
Copy Markdown
Member

After argumentcomputer/tracing-texray#3

Note

Update bencher testbed from aiur-typecheck-x64-32x to aiur-check-x64-32x before merge

@samuelburnham samuelburnham force-pushed the sb/ci-benchmarks branch 2 times, most recently from 43d00a3 to eb6b068 Compare July 2, 2026 20:09
Lowering the witness cap below Zisk's built-in default (10) was measured to
have a negligible effect on host RAM and prove time for the kernel typecheck
workload, so the CLI override (which defaulted to 5) is removed and the prover
uses EmbeddedOpts::default(). Drop the now-stale flag mentions and per-RAM
tuning guidance from the README, the cost-model doc, and the shard.rs model
comments.
Rebased onto main (post #411 + #459) and integrated with the renamed
bench-main.yml and reworked bencher-track interface.

- Benchmarks/Vectors.csv: single shared source of truth (71 library constants
  from Init/Std/Mathlib/Lean). Consumed by Aiur (bench-typecheck --manifest),
  the zkVM hosts, and shell.
- bench-pr.yml: `!benchmark [aiur] [zisk] [sp1|all] [execute|prove]` over the
  curated set, posting a main-vs-PR table; main results cached by base SHA.
  Hardened: comment body from env (no injection), allowlisted env parsing.
- .github/scripts/{bench.py,run.sh}: parse/manifest/compare/comment, and the
  compile-.ixe + backend driver (cycles/execute-time/throughput/peak-rss, plus
  shards/max-shard-cycles for sharded runs).
- .github/actions/install-{sp1,zisk}: shared zkVM toolchain + deps install,
  used by bench-pr.yml, bench-main.yml, and riscv-bench.yml.
- bench-main.yml: add zkvm-execute (Zisk/SP1 cycle counts + time/RAM) and
  native-check (native parallel `ix check --anon` throughput) jobs, using the
  new bencher-track workload/thresholds interface.
- bench-typecheck: add --constant / --skip-deps (align with the zkVM hosts;
  --skip-deps replaces --subject-only) and --execute-only (fast execute path).
issue_comment workflows only run from the default branch, so the `!benchmark`
path can't be exercised on a PR branch. Add a pull_request trigger (base/head
from the PR payload; empty comment → parser defaults) to test pre-merge.

Revert before merge — delete the `pull_request:` trigger; the dual base/head
resolution and `|| pull_request.number` fallbacks are harmless to keep.
The full curated set (~60 InitStd) is too slow to run on every !benchmark and on
the bencher prove/zkVM jobs. Add a `primary` column to Vectors.csv marking 11
constants spanning shape (nat/list/array/int/string/vector/multiset, defs +
proofs) and the cheap→heavy cost range (incl. 3 shard targets), and make it the
default:

- bench.py manifest --primary; parse honors BENCH_FULL (run the full set).
- bench-pr.yml: !benchmark defaults to the primary subset; BENCH_FULL=1 runs the
  whole curated set.
- bench-main.yml: prove + zkvm-execute derive constants from `manifest --primary`
  (replacing the hardcoded lists), so all backends bench the same set from the one
  source of truth. The tier filter keeps prove on the cheap primaries (heavy ones
  would OOM a single-shard prove); execute/native get the heavy ones for scale.
Bring both surfaces (!benchmark PR comment and bencher.dev) to parity across all
four backends and remove the test scaffolding.

- Native kernel on !benchmark: add a `native` backend (whole-env `ix check
  --anon`, the parallel out-of-circuit kernel) — bench.py backend/runner/metrics,
  a run.sh `native` branch, and a GNU-time install for native PR cells. `all`
  now fans out aiur/zisk/sp1/native.
- run.sh: single-source the Aiur path through a per-constant bench-typecheck loop
  (per-constant peak-rss), add a per-constant `timeout` to the zkVM execute path
  (heavy primaries can't hang a job), and accept the env arg case-insensitively
  (bencher reuses the cached InitStd.ixe; bench-pr compiles initStd.ixe).
- bench-main.yml: the prove and native-check jobs now drive run.sh too (dedup
  with the PR path); zkvm-execute gains the Mathlib env to match prove/native.
- bench-thresholds-reset.yml: register the zisk / sp1 / native-check workloads.
- bench-pr.yml: drop the temporary pull_request trigger (keep the harmless
  dual-SHA / number fallbacks).
- docs/benchmarking.md: document the two surfaces, backends, Vectors.csv, the
  !benchmark grammar, and the bencher workloads / threshold resets.
…maries too

- Aiur prove now covers the whole primary subset: bench.py exempts --primary from
  the prove cheap-tier cap, and run.sh proves each constant whose Aiur fft-cost
  fits the prover RAM ceiling (AIUR_PROVE_MAX_FFT, ~128 GB at 2.34 GB per billion
  fft) and execute-only's the rest — so heavy primaries (only Vector.extract_append
  at ~145 GB) still report execute metrics instead of being dropped. BENCH_FULL
  prove stays capped at the cheap tier so it doesn't balloon.
- native now reports two views per env: the whole env (`ix check --anon`, keyed by
  env) and a per-primary subject check (`ix check --consts`, keyed by constant) —
  apples-to-apples with the zkVM --skip-deps execute. Wired into both the
  native-check bencher job and the !benchmark native backend.
- run.sh: fix a stream-corruption bug — tool stdout and ::warning::/::notice:: now
  go to logs/stderr so only JSON reaches the per-constant `jq -s` merge.
Route every backend's peak-rss and per-phase timings through tracing-texray:

- peak-rss now comes from the fork's process-tree sampler, so Zisk's ASM
  microservices' memory (tens of GB in separate PIDs) is captured — a bare
  /proc/self/status read of the host missed it. Verified on real execute
  runs: sp1 ~9GB, zisk ~49GB peak.

- zisk/sp1 hosts and bench-typecheck self-report cycles/time/throughput/
  peak-rss via --json, retiring run.sh's grep/awk/time parsing. They also
  emit per-phase span timings via --texray-json; native check-rs reports
  peak-rss as a 6th ##check## field.

- run.sh folds span timings into a per-constant `phases` object; bench.py
  renders a collapsible per-phase drill-down in the !benchmark comparison,
  and bench-main.yml tracks phase:<span> measures on bencher.

tracing-texray is pinned to the argumentcomputer/tracing-texray json-ram
branch (rev 15ae57c), which adds the process-tree RSS sampler + JSON sink.
…h -check

- Backend `native` is renamed to `ooc` (out-of-circuit) throughout: the
  BACKENDS tuple in bench.py, the run.sh case arm, the `native-check` job
  in bench-main.yml (now `ooc-check`), the `\!benchmark` grammar in
  bench-pr.yml, and docs/benchmarking.md.

- Bencher workload names carry an explicit `-check` suffix so the tracked
  identity is self-describing: `aiur-check`, `zisk-check`, `sp1-check`,
  `ooc-check` (was `aiur`, `zisk`, `sp1`, `native-check`). Reset workflow
  tokens (options / valid / accepted / error text), tag-reference comments,
  and the bencher-track action's docstring updated to match.

- Clarify the `\!benchmark` grammar: `([aiur] [zisk] [sp1] [ooc] | all)` —
  `all` is an alternative to the whole backend list, not just to `ooc`.
…ness

**bench.py !benchmark surface**
- `fetch-main`: queries bencher.dev's public reports API for `branch=main` at
  the base SHA, filters results by `--names`, and reshapes into the neutral
  `{ "<name>": { "<metric>": v } }` shape. Replaces the actions/cache layer;
  bench-pr.yml falls back to a local base run only when bencher hasn't
  ingested the base SHA yet. TODO left for non-main base branches.
- Mode is fixed per backend (`aiur=prove`, others=`execute`); the optional
  bare `execute` token in `!benchmark` flips aiur to `--execute-only`.
  Grammar drops `[execute|prove]` and `BENCH_GPU`; zkVM prove paths removed.
- Single runner `warp-ubuntu-latest-x64-32x` for every cell; testbeds
  unified to `<backend>-check-x64-32x`; `MAIN_TESTBEDS` maps (backend, mode)
  to the slug.
- `compare` renders the OOM sentinel `{"oom": true}` as `OOM` cells + `n/a`
  Δ%. Regression/improvement flags fire on every metric column (previously
  only the first); summary counts distinct constants; worst names constant
  + metric. `_human` is unit-aware: bytes → GiB/MiB, seconds → µs/ms/s/m,
  counts → K/M/B/T.
- Sub-span (per-constant phase) drill-down removed with a TODO; `run.sh`'s
  `merge_phases` and `bench-main.yml`'s `phase:<span>` flattening still
  populate the neutral JSON for later reinstatement.

**CLI unification** (bench-typecheck, zisk-host, sp1-host)
- Dropped `--constant`; added `--consts <n1,n2,…>` (comma-list, clap
  `value_delimiter` on the Rust hosts) and `--consts-file <path>`.
  Multi-const runs loop the single-const path per name, accumulating one
  entry per name via a merge-safe `write_json_entry`.
- Dropped `--texray-json <path>` and `--no-texray`. `--texray` (bool) alone
  toggles the tracing-texray subscriber; combined with `--json <path>`,
  span timings auto-write to `<json>.spans`.
- Inline `#[cfg(test)] mod cli_tests` in each Rust host covers
  value-delimiter parsing, `--consts` `requires`, and `collect_consts`
  union/dedup.

**Vectors.csv**
- Removed `aiur_fft`, `zisk_cycles` (measurements live in bencher, never
  here). Rows can now omit trailing zero fields; parser tolerates 3+ cols.
- Primary set curated: renamed `Vector.extract_append` →
  `Vector.extract_append._proof_1` (kept as shard_target); promoted
  `Vector.append` and `Nat.sub_le_of_le_add` to primary; added the un-
  shardable Init constants from the Zisk cost-model doc's canonical 12
  (`Char.ofOrdinal_le_of_le`, `Array.extract_append._proof_1_1`, the
  `SInt.Int{8,16,32,64}.instRxcHasSize_eq` family with the case corrected),
  plus `ByteArray.utf8DecodeChar?_utf8EncodeChar_append`, `String.append`,
  `IxVMPrim.nat_pow_big`, and
  `Std.Tactic.BVDecide.BVExpr.bitblast.goCache_Inv_of_Inv._mutual`.
  Trimmed obvious duplicates (`Array.qsort`, `Int.ediv`, `List.dropLast`,
  `List.range`, `UInt32.toNat`, `Std.Time.Month.Offset.ofNat`).

**RAM harness** (run.sh, aiur prove)
- Tier gate removed. Every constant attempts a full prove.
- `watch_ram_kill` samples `ps -eo pid,ppid,rss` every ~3 s and SIGKILLs
  the tree if it exceeds `AIUR_PROVE_MAX_RSS_GB` (default 120 GB — 8 GB
  headroom under 128 GB). Killed constants record `{"oom": true}` for the
  compare table.

**Misc**
- `install-zisk`'s description now correctly says the proving key is
  installed (`client.setup()` loads const-trees before execute too).
- `bencher-track`'s workload description enumerates all options.
- `zisk/Cargo.toml`'s patch example uses `/path/to/…`, not `/home/ubuntu`.
- `docs/zisk-cycle-cost-model.md` finding #4 disambiguates
  "not-shardable" (mutual blocks only) from "not full-closure-single-leaf
  provable" (the canonical 12).
- `riscv-bench.yml` still on the temporary `sb/ci-benchmarks` push
  trigger; drop before merge.
…threshold + doc cleanup

**`compile` as a !benchmark backend**
- bench.py: `DEFAULT_MODE["compile"]="compile"`,
  `MAIN_TESTBEDS[("compile","compile")]="ix-compile-x64-32x"`,
  `METRICS[("compile","compile")]=["compile-time","throughput","file-size",
  "constants"]`. `cmd_manifest` short-circuits on `--backend compile`: writes
  a one-line names.txt with the CamelCase env slug (`InitStd` / `Lean` /
  `Mathlib` / `FLT`) and skips the CSV. New `--backend` arg on manifest.
- run.sh: setup step teed to `$compile_log` for every backend; `compile`
  ignores REUSE_IXE (needs a fresh compile to measure) and parses
  `##benchmark## <elapsed_ms> <bytes> <constants>` into the neutral
  `{ "<CamelCase>": {"compile-time","file-size","constants","throughput"} }`
  shape. Compile job on main (`bench-main.yml` testbed
  `ix-compile-x64-32x`) is the same run bencher already stores.
- bench-pr.yml: `!benchmark ([aiur] [zisk] [sp1] [ooc] [compile] | all)`;
  manifest step passes `--backend "$BACKEND"`.

**zkVM parity with ooc (drop `--skip-deps`)**
- run.sh zisk/sp1 branch: `--consts <c>` without `--skip-deps` so the check
  is directly comparable to `ooc`'s `ix check-rs --consts <c>` (also
  full-closure). Comparing subject-only zkVM against full-closure ooc mixed
  in-circuit-vs-out-of-circuit overhead with scope; both are now
  full-closure, so the delta isolates just the overhead.

**Threshold semantics for deterministic-but-directional measures**
- bench-main.yml zkvm-execute (now zkvm-typecheck): `cycles`, `shards`,
  `max-shard-cycles` change from pinned `0/0` to `upper 0 / lower _`.
  These are deterministic (no noise) but a real guest / packer improvement
  legitimately drops them; the old pin flagged wins as regressions.
  `constants` (definitional count) stays `0/0`.

**docs/benchmarking.md — stale content swept**
- Removed the "design-level skips post a note explaining why" paragraph
  (that machinery was already removed).
- Replaced the `cheap → prove / heavy → execute-only` fallback description
  with the RAM-watchdog reality (attempts prove for every primary; OOMs go
  through as `{"oom": true}` sentinels rendered `OOM`).
- Backends table adds `compile`; drops metrics `aiur` doesn't actually emit.
- "Constant set": ~20 primaries in ~68 total (was ~11 in ~60); notes tier
  is informational-only now, rows may omit trailing zeros, and `compile`
  short-circuits the CSV.
- Per-measure threshold breakdown replaces the incorrect "Deterministic
  measures … are pinned exactly" claim.
- "Not yet covered" now lists three real TODOs (zkVM prove, per-constant
  phase drill-down, non-main base branches) with pointers to the in-code
  TODO markers.

**Code comment audit** — every touched comment/docstring re-read; stale
references to `--skip-deps` parity, tier-based fallback, sub-span
drill-down, and cache-hit rendering all updated to match implementation.
Δ% is easier to interpret at scale when paired with a ratio — e.g.
`-33.3% (1.50x faster)` for a big prove-time drop; `+15.2% (1.15x slower)`
for a real execute-time regression. Only shown when the factor is >= 1.05x
(roughly ±5% in either direction) so sub-noise deltas don't get
`(1.02x slower)` clutter next to `+1.1%`. Cell format cascades cleanly:
`{Δ%} [(1.NN× {slower|faster})] [emoji]`.
…dedup

Fixes from the branch-wide review, plus build-once caching and a dedup pass.

**check-rs full-closure mode (new kernel FFI)**
- `rs_kernel_check_anon_consts`: resolve displayed names via the env's
  `named` map (the zkVM hosts' resolution), then anon-check the seeds'
  FULL dependency closures via the existing `closure_addrs` +
  `build_anon_work` primitives — no sub-env serialization. `skip_deps`
  restricts to subject-only. `index_anon_work` extracted so the whole-env
  and filtered paths share the result-slot indexing.
- `ix check-rs --anon --consts <names>` now runs that (full-closure default,
  `--skip-deps` opt-out — the hosts' CLI shape); meta-mode seeded checks are
  unchanged (subject-only, bisect flows intact). run.sh's ooc per-constant
  view uses it, so ooc-vs-zkVM deltas genuinely isolate in-circuit overhead
  (both sides now full-closure anon; previously ooc was subject-only meta).

**OOM sentinel end-to-end**
- run.sh merges `oom: true` INTO already-measured Phase-1 metrics (was:
  overwrite); compare renders OOM only for the missing metrics.
- `bench.py bmf` (one neutral→BMF converter replacing four hand-copied
  jq/awk pipelines in bench-main.yml) strips the boolean sentinel — one
  OOM row no longer fails the whole bencher upload. This also covers the
  zkVM execute path, which gained its own RAM watchdog + sentinel.
- Watchdog: setsid + process-group kill (reaches all descendants); kernel
  OOM beating the 3 s sampler (exit 137, no marker) labels OOM instead of
  silently dropping; per-constant wall-clock timeouts (AIUR_PROVE_TIMEOUT
  50m / AIUR_EXECUTE_TIMEOUT 25m); $out re-merged per constant so a
  job-level kill keeps completed rows.

**compare correctness**
- throughput is higher-is-better: flags/worst/ratio direction-aware
  (HIGHER_IS_BETTER + _badness).
- Ratio wording follows metric kind: times faster/slower, bytes
  larger/smaller, counts more/fewer.
- `execute-peak-rss`: Phase-1 RSS high-water sampled at the Phase 1/2
  boundary in bench-typecheck (both modes) and adopted by the zkVM hosts'
  execute JSON — one name for "execute-phase peak" on every backend;
  bare `peak-rss` reserved for prove-phase peaks. aiur-execute compares
  it apples-to-apples against prove-run baselines (raw peak-rss would
  dwarf it).
- One-side-empty tables get a loud note (e.g. CLI-incompatible base).
- Tidy JSON: bench-typecheck emits decimal `JsonNumber`s (`jsonRound`)
  instead of Float's full binary repr.

**fetch-main robustness**
- Retry with backoff + newest-first pagination (base SHA beyond page one
  no longer forces a permanent fallback).
- Exit codes honored by bench-pr.yml: 3 (transient) → local fallback;
  2 (BACKEND_TABLE / bench-main drift) fails the cell loudly.
- ooc whole-env row keyed CamelCase on both sides (benv_cc table in
  run.sh, ENV_CC in bench.py) and admitted past the --names filter.

**build-once + caching (bench-pr.yml)**
- New build job: ix + bench-typecheck built once per head SHA (they embed
  the IxVM kernel/prover), cached as `bench-bins-<sha>`; cells restore
  into .bins/pr instead of per-cell Lean builds. Re-running \!benchmark on
  the same commit skips the build.
- Base fallback restores bench-main's own `bench-bins-`/`bench-ixe-`
  caches (toolchain cmp + mathlib-oleans guard) before paying for a
  from-scratch base build. PR-side .ixe cached across cells.
- Cache keys renamed `aiur-bench-bins-`/`aiur-ixe-` →
  `bench-bins-`/`bench-ixe-` (they serve every backend).
- run.sh resolves tools in-tree first, then PATH (`resolve_bin`).

**dedup**
- bench.py: DEFAULT_MODE + METRICS + MAIN_TESTBEDS collapsed into one
  BACKEND_TABLE (mode / testbed / per-mode metrics).
- bench-main's compile job routes through run.sh's compile backend + bmf —
  the `##benchmark##` line is parsed in exactly one place (run.sh gains
  `flt` in its env table for the FLT matrix cell).
- `Ix.Cli.ConstsFile`: one names-file/comma-list parser (inline-`#`
  comments, dedup) shared by check-rs meta+anon, bench-typecheck, and
  `ix compile --exclude-file` — closes the whole-line-vs-inline comment
  drift across the five previous copies.

**hosts**
- clap `requires = "consts"` dropped from --json/--skip-deps (rejected
  valid --consts-file-only runs); validated in main after collect_consts.
  zisk's multi-`--ixe` guard now covers --consts-file.

**misc**
- Vectors.csv / docs tier semantics corrected (manifest-only consumer);
  stale prove-fallback comments fixed; tracing-texray bumped to bd4faa08;
  push-to-main workflows drop their concurrency groups (every merged
  commit must be benchmarked; a later merge must never cancel one).

Verified: cargo check -p ix-ffi, lake build ix / bench-typecheck, live
runs of the new check-rs mode against a 480 MB Init/Std env (closure =
34 items / 42 targets for Nat.add_comm; --skip-deps = 1; missing names
error loudly), bmf/parse/compare dry-runs, YAML + bash -n + AST checks.
zisk/sp1 host crates compile-verified in CI only (local sandbox lacks
the pil2-proofman C++ toolchain).
…key)

bench-main.yml's zkvm-execute job now runs real executions of both hosts
on every main push, but tolerates per-constant failures by design
(dropped rows, OOM sentinels) — it never turns red on a breakage. This
workflow becomes the red-X signal instead, kept cheap:

- Build-only + unit tests: `cargo build --release --bin <host>` plus
  `cargo test --release --bin <host>`, which covers the clap surface
  run.sh drives (--consts comma-splitting, --consts-file union/dedup,
  shard-plan conflicts). No execution, so no fixture compile job, no
  minimal.ixe artifact, no memlock prlimit.
- install-zisk gains a `proving-key` input (default true): the key is
  loaded at runtime by `client.setup()`, never at build time, so the
  build gate skips the ~3 GB download + const-tree regeneration.
- Renamed "RISC-V bench" → "zkVM host build" (it no longer benchmarks).
- The TEMPORARY sb/ci-benchmarks push trigger (from the mid-branch test
  commit) reverts to main-only here; the tip commit is its sole carrier.

The runtime/typecheck smoke signal (executing myReflEq and asserting
failures == 0) is retired: that coverage now lives in zkvm-execute's
real runs, at the cost of being warnings-not-red-X.
zisk-host pulls zisk-sdk as both a dependency and a build-dependency, so
cargo compiles proofman-starks-lib-c as two units whose build scripts can
run concurrently — and both run `make` inside the SHARED
~/.cargo/git/checkouts/pil2-proofman-* source tree (not OUT_DIR). On a
cold runner the Makefile stamp is absent, so both units take the
`make clean` + `make -j` path and race: one unit's clean deletes build/
while the other's g++ is mid-compile, dying with "opening dependency
file ….d: No such file or directory" (g++ writes .d files at the END of
compilation, so the dir vanished underneath it). Warm runners never hit
it — the stamp short-circuits the clean — which is why the race only
surfaced on this branch's cold cache (lockfile change → new rust-cache
key).

install-zisk now builds the sys crate solo first: its build script runs
once, `make` completes, the stamp lands; both units of the subsequent
parallel build then skip the clean and their `make -j` is a no-op. Same
total work, just ordered. The proper fix is an flock around the make in
pil2-proofman's build.rs — worth an upstream PR.
`resolve_bin ix` ran unconditionally at startup, but bench-main's
zkvm-execute job legitimately has no `ix` anywhere: it restores only the
`.ixe` cache (REUSE_IXE=1 skips the compile step) and builds its zkVM
host via cargo — before the resolve_bin refactor that job never touched
`ix` at all, so the eager check regressed it with "ix not found
(in-tree or PATH)". Resolve per use site instead: the compile step and
the ooc branch resolve `ix`, the aiur branch resolves `bench-typecheck`,
and the zkVM branch needs neither.
…or names

Three fixes for the failing bench-main runs:

- run.sh: a heavy prove can die of memory without tripping the watchdog or
  the kernel OOM killer — one huge trace allocation fails and the runtime
  aborts (SIGABRT, exit 134) with an allocator message. looks_like_oom()
  folds that third case into the OOM classification so those constants get
  an OOM row instead of being dropped. Non-OOM failures now log the first
  lines of the tool output so drops are diagnosable from the job log.

- run.sh: after a watchdog group-kill, the Zisk host never runs its
  Drop-time cleanup of /dev/shm/ZISK_* segments (multi-GB; the MT output
  segment alone starts at 6 GB), so the next host launch fails creating its
  own segments before Zisk's startup stale-pid sweep runs — exactly one
  dropped constant after every OOM kill (the alternating OOM/exit-1 pattern
  in the zisk-execute logs). Sweep the debris between constants.

- Vectors.csv: IxVMPrim.nat_pow_big removed (kernel primitive, never
  present in compiled envs); Vector.extract_append._proof_1 →
  ._proof_2 and Array.extract_append._proof_1_1 → Array.extract_append
  (proof-term names are toolchain-dependent; both replacements verified
  against a freshly compiled initStd.ixe and measured heavy:
  fft-cost 23.4e9 / 40.1e9).
…nchmarks

riscv-bench.yml's two jobs (build + unit-test the Zisk/SP1 hosts, no
execution, no proving key) are cheap enough (~5 min with warm caches) to
gate every PR commit, and ci.yml already has exactly the triggers and
per-ref cancellation they need — so they move there and the standalone
workflow goes away.

sp1 benchmarks are commented out for now — the execute run is too slow:
the sp1 cell of bench-main's zkvm-execute matrix and bench-pr's Install
SP1 step. Uncomment both to re-enable. The \!benchmark backend-subset
selection is unchanged.
The op (added with the native execution pipeline, #463) pushes two values
onto the value map during execution — the quotient/remainder list-head
pointers — and the constraint generator allocates two auxiliary columns
for them, but the trace populator treated it as a no-op. Every ValIdx and
witness column after the first big-Nat division in a block was therefore
off by two, and trace population panicked with "index out of bounds" at
trace.rs:308 — the exit-134 failures on every big-Nat-heavy prove in the
bench-main aiur job (List.mergeSort, String.split, the SInt
instRxcHasSize_eq family, Multiset.sort, …). Reproducible in seconds
with any Nat-division user, e.g. `bench-typecheck --consts Nat.repr`.

The trace arm now mirrors the execute arm: recompute (q, r) with
num_bigint and resolve the head pointers execution already recorded in
memory[10] via a read-only twin of build_klimbs_u64, pushing both as map
entries + auxiliary columns.

Verified: Nat.repr proves end-to-end (panicked before); `lake test --
--ignored aiur ixvm` passes; clippy clean.
Prover changes can trade prove speed against proof size or verification
cost, so Phase 2 now serializes each fresh proof (proof-size, bytes) and
verifies it (verify-time; a verification failure is reported loudly and
drops only that measure). The full-closure path rebuilds the Array-G
claim from proveAddrWithEnv's serialized claim bytes — verify_claim's
input is the 32-G blake3 digest of those bytes, the same recipe as
`ix verify`.

Both measures flow through the neutral JSON into the PR compare table
(bench.py aiur prove metrics + byte/second unit mapping) and onto
bencher (verify-time 10% upper, proof-size 5% upper).

Measured: Nat.add_comm proves in 1.63s, verifies in 0.19s, 33.4 MB proof.
The constants that OOM as single full-closure leaves (extract_append
proofs, the instRxcHasSize_eq family, …) are only measurable under env
sharding — their checks fit in one manifest shard each, with deps checked
in other shards (docs/zisk-cycle-cost-model.md, finding 4). Wire the
existing offline partitioner into CI instead of any per-constant
sharding:

- compile job: after the compile benchmark, `ix profile <Env>.ixe` →
  `ix shard --max-ram 120` for InitStd/Mathlib; the `.ixes` manifest is
  cached next to the `.ixe` (every restore of that key lists the same
  paths — actions/cache versions entries by path list).
- zisk-host: `--shard-plan --execute --json` now writes one env-level
  row — total cycles, shards, max-shard-cycles, execute-time,
  throughput, execute-peak-rss, and a per-shard `shard-cycles`
  breakdown — keyed by the new `--json-name` (default: manifest stem).
- run.sh: the zkVM guest-run dance (setsid + RAM watchdog + OOM
  classification + shm sweep + merge) is factored into one `zkvm_run`
  helper; when a `.ixes` sits next to the `.ixe` (bench-main only — the
  \!benchmark PR path has none) the zisk branch appends the env-sharded
  run after the per-constant loop.
- bench.py bmf: nested-object flattening generalized — `phases` →
  `phase:<span>` as before, any other dict (e.g. `shard-cycles`) →
  `<key>:<sub>` — so per-shard cycles land on bencher as un-thresholded
  measures while the thresholded aggregates (cycles, shards,
  max-shard-cycles) do the alerting.

zkvm job timeout 60 → 150 min: the whole-env run (own 60m cap inside
run.sh) rides on top of the per-constant loop.
fetch-main previously treated main.json as all-or-nothing: any name
bencher lacked at the base SHA (typically constants the PR itself adds
to Vectors.csv) silently rendered n/a. It now writes the uncovered
subset to --missing-out and still exits 0; bench-pr runs the base
checkout on JUST those names and merges under bencher's rows (bencher
wins overlaps — it is the canonical main side), so a brand-new constant
gets a real main-vs-PR delta on its first \!benchmark. The full-set base
fallback (exit 3, SHA not ingested) is unchanged.
A PR can change the kernel's cost profile, so the env-sharded zisk run
must not inherit main's partition. run.sh now cuts the manifest in-place
(ix profile → ix shard) whenever one isn't already sitting next to the
.ixe — the profiler counts heartbeats rather than wall time, so an
unchanged tree re-partitions deterministically and per-shard comparisons
stay meaningful. bench-main's compile-job manifest keeps pre-empting the
generation on the main side; the \!benchmark PR side profiles its own
tree (manifest cached per head SHA under bench-pr-ixes-*). The base
fallback reuses bench-main's cached manifest and pays the env-sharded
run only on a full bencher miss — a partial miss means bencher already
holds main's env row (ZISK_ENV_SHARD=0 skips it).

bench-pr benchmark timeout 120 → 180 min for the zisk cell's worst case
(per-constant loop + profile + the env execute's own 60m cap).
The shard breakdown carried cycles only. Each shard now also records its
guest execute time (SDK-reported) and its RAM high-water — the texray
sampler's peak is reset before every shard, so the windows are
independent and the env row's execute-peak-rss becomes their max (the
run's execution-phase high-water; setup RAM no longer counted). Uploaded
as shard-time:<k> / shard-peak-rss:<k> alongside shard-cycles:<k>, all
un-thresholded like phase:<span>.

No prove-side equivalent yet: zisk proving isn't wired up in CI at all
(needs a GPU runner), so there is no prove-time for any zisk row.
…mand

Two ways to get a closure-only `.ixe` for one constant:

- `ix compile <file.lean> --consts <n1,n2,…>` (+ --consts-file): seed the
  compile by exact constant name instead of the whole import env, with
  transitive deps via the existing collectDeps — same names vocabulary as
  `ix check --consts`. Mutually exclusive with --module.

- `ix shard extract <env.ixe> --consts <n1,n2,…> --out <sub.ixe>`: the
  sharding pipeline's scoping step — cut the closure out of an EXISTING
  env without recompiling from source. The output carries the closure's
  genuine constant bytes, blobs, and reducibility hints (build_sub_env,
  now shared via sub_env_of), plus a name→address entry per closure
  constant so `--consts`-style tools still resolve. Metadata is dropped
  (real ConstantMeta references name addresses throughout its tree and
  would need the full hash-consed name index) — extracted envs serve the
  anon pipeline; meta-mode tools need the source env. A mutual-block
  member extracts its whole block.

Verified against a fresh initStd.ixe: both forms produce envs whose
bench-typecheck fft-cost is bit-identical to the full-env run
(content addressing at work), and extract → ix profile → ix shard
composes cleanly (Int8.instRxcHasSize_eq: 2.1 MB closure env,
16.08e9 steps → 11 shards at the 120 GiB cap).
Replaces the whole-env sharded run (wrong scope: all of InitStd is
1.94e12 steps / 1263 shards, and its biggest atomic block is INFEASIBLE
under the cap) with per-constant closure sharding through the canonical
pipeline. For each heavy-tier primary, run.sh runs `ix shard extract`
(closure-only env, no recompile) → `ix profile` → `ix shard --max-ram
120`, then one `zisk-host --shard-plan --execute` run executes the
shards sequentially — the constant's row carries totals plus the
per-shard shard-{cycles,time,peak-rss}:<k> breakdown, uploaded to
bencher like every other measure. Cheap primaries keep the single-leaf
--consts run.

- run.sh: cut_closure_shards (also exposed as the `cutshards` backend so
  bench-main's compile job pre-cuts through the same code path) +
  heavy-tier dispatch in the zisk loop (ZISK_HEAVY_NAMES). A constant
  whose partition still can't fit (atomic mutual block over the cap)
  OOMs under the RAM watchdog: honest OOM row, remaining shards skipped,
  loop proceeds to the next constant. Cutting failures fall back to the
  single-leaf run.
- bench.py manifest --heavy-out: the selected heavy-tier names, from
  Vectors.csv's tier column.
- bench-main: the compile job pre-cuts zkshards-<Bench>/ next to the
  fresh .ixe (it has ix + the toolchain; the zkvm job stays Lean-free)
  and ships it to the zkvm job in the sha-keyed bench-ixe-* cache entry.
- bench-pr: the PR side cuts its own artifacts fresh every run (cheap —
  seconds per closure — and a PR can change the cost profile; profiling
  counts heartbeats, so an unchanged tree re-partitions
  deterministically); the base fallback reuses bench-main's pre-cut dir.
… job

A constant the kernel rejects is a correctness regression, not a
benchmark datum. End to end:

- zisk-host: every sharded loop (shard-plan execute + prove, whole-env
  execute + leaf prove) bails on the FIRST failing shard via the shared
  reject_failures helper — mirroring the OOM kill, which also cancels
  the constant's remaining shards — instead of accumulating failures
  across the full manifest. sp1 and the single-leaf paths already bailed
  immediately.
- run.sh: a zkVM failure whose log carries the host's "kernel typecheck
  produced" abort records the neutral `{"failed": true}` sentinel (an
  ::error:: annotation, not a silent drop); ooc does the same when
  ##check## reports failures. bench-typecheck marks Phase-1 check errors
  failed and skips them in Phase 2. Failure warnings now print the log's
  head AND tail (the host's abort lands at the end of a mid-manifest
  log).
- bench.py compare: `failed` renders ❌ cells (outranking OOM) plus a
  bold "FAILED TO TYPECHECK on the <side> side" note under the table;
  bmf strips the sentinel so rejected rows never reach bencher.
- workflows: prove/zkvm/ooc jobs (bench-main) and the benchmark cell
  (bench-pr) exit nonzero when the neutral JSON carries a failed row —
  AFTER the clean rows upload / the table posts, so the red X lands on
  the commit/PR without losing the report.
Drop this commit before merging. The zkVM build gate needs no branch
trigger — it lives in ci.yml now and runs on pull_request.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant