Multithreading for KeY by unp1 · Pull Request #3842 · KeYProject/key

unp1 · 2026-06-18T21:19:29Z

Intended Change

This pull request adds a multi-core (parallel) prover to KeY: automatic proof
search runs on a worker pool. An open goal is either not yet assigned to a worker or
it has exactly one worker, proving independent
sibling goals concurrently. There are at most as many workers as configured and the number of workers is capped at the number of processor cores. It is a port and modernisation of the 2018
HacKeYthon multithreading prototype, rebuilt on current main.

The legacy single-threaded prover stays the default and the safe fallback;
the multi-core prover is opt-in per the configuration below.

Default in this PR: to make the feature easy to try, this branch ships with
the multi-core prover enabled at 4 threads. When this PR is accepted for
main, the default will be reverted to single-core so that the
single-core-only features (proof caching, slicing, merge rule) are active as
before. This is a one-line change to the setting default.

Architecture (conceptual)

The guiding principle is isolation, not locking. Goal.apply is split into a
concurrent compute phase (rule selection + matching, the expensive part) and a
commit phase (proof-tree mutation) serialized under a single lock. A
single-monitor GoalScheduler hands goals to workers depth-first. Non-essential
listeners are detached for the run, fresh names come from a per-worker partition
of the name allocator, shared matching/cost caches use thread-safe cache
utilities, and rules not yet concurrency-safe (the merge rule) disable themselves
during a multi-worker run. Only the standard Java profile runs in parallel.

Full conceptual write-up: Multi-Core Proving (key-docs).

Problems encountered, fixed and verified

Lost goals / non-closure: sibling goals shared a mutable taclet-index
cache key and raced on it → a dropped goal and a nondeterministic non-closure.
Fixed with an immutable per-lookup key. Reproduced in ~50% of 8-worker runs
before, 0 after; guarded by a CI stress test.
Early termination: completing a goal and offering its successors is now a
single atomic step under the scheduler monitor.
Shared-state hazards: a systematic audit hardened the remaining shared
proving state; each is a separate, behaviour-preserving commit.
Off-EDT view updates: per-goal listeners (attached to each Goal, outside
the proof-level suspension) fired on worker threads and touched Swing off the
event-dispatch thread. The suspension now covers them too, and the views refresh
once from the final proof state after a run.

Single-core-only features

Proof caching, proof slicing and the merge rule are switched off while the
multi-core prover is active (GUI greys them out with a tooltip; the status line
shows the mode) and restored when switching back. Loading a proof script and
running macros use the multi-core prover when active and the profile supports it,
with one deliberate exception: TryClose (and the Full Auto pilot that ends in
it) closes goals one at a time under a tight per-goal step budget and is pinned to
the single-core prover — a single goal offers no parallelism, and several workers
exploring its subtree apply rules less step-efficiently and can exhaust the budget
before the goal closes. Either way a run returns only once all workers stop, so
pruning sees a quiescent proof. These restrictions will be lifted incrementally
with the feature owners once the
subset is confirmed safe. Because the default is single-core, these features
work exactly as before.

Configuration

GUI: Settings → Prover (Single / Multi-Core) + a clickable SC / MT N×
status-line indicator.
CLI: --threads N.
Tests: pinned single-core; -Dkey.prover.parallel[.threads] overrides.

Benchmarks

Measured on a 16-core machine, one isolated run per proof (each proof loaded in a
clean settings state). seq (s) is the single-threaded wall-clock time on current
main (which already includes the single-thread performance work); the 2× / 4× / 8×
columns give the speedup at that worker count over that baseline. The multi-core prover
does not change the single-threaded path.

Real-world proofs — automode (speedup by worker count).

Proof	seq (s)	2×	4×	8×
SimplifiedLinkedList.remove	17.0	1.59×	2.15×	2.08×
gemplusDecimal/add	8.9	1.73×	2.70×	3.55×
Saddleback_search	11.6	1.64×	1.82×	1.16×
symmArray	14.9	1.72×	2.44×	2.74×
ArrayList.remove.1	2.4	1.52×	1.85×	1.89×
ArrayList_concatenate	8.8	2.37×	3.25×	3.67×
arith/median	3.0	1.79×	2.26×	3.25×
ArrayList.remFirst	0.6	1.14×	1.34×	1.40×

Wide, splitting proofs scale furthest — ArrayList_concatenate reaches 3.67× at 8
workers, add 3.55×, median 3.25×. Moderately wide proofs plateau around 4 workers
(~2–2.7×). Narrow proofs gain little and can regress at 8× (see the shape table).
(quicksort/sort is omitted — it needs its proof script to close, so it is not a
pure-automode entry.)

Proof	Shape	Why this speedup
ArrayList_concatenate, add, median	wide / splitting	many independent sibling goals → scale furthest, ~3.3–3.7× and still improving at 8 workers
symmArray, SLL.remove, ArrayList.remove.1	moderately wide	steady scaling, plateauing ~1.9–2.7× by 4–8 workers
Saddleback_search	narrow (~3-goal frontier)	limited parallel width; ~1.8× at 4×, regresses to 1.16× at 8× — extra workers do speculative work off the critical path with no width to amortise it
ArrayList.remFirst	tiny (~2k nodes)	too small to parallelise; ~1.1–1.4×

Diagrams. Rendered speedup bar charts are in the dev doc's Measured speedup section.

Plan

Parallel engine, scheduler, isolation mechanisms
GUI / CLI integration, feature gating
Proof-macro support on the multi-core prover
Thread-safety audit of shared proving state
Equivalence gate + 8-worker stress tests (CI)
spotless / NULLNESS / unit tests green (symb_exec & proof_ref excluded, known-broken)
Fill in benchmark tables (diagrams generated, to attach)
Switch the default back to single-core before merge

Type of pull request

New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue) — see the latent-bug commits below
There are changes to the (Java) code
There are changes to the deployment/CI infrastructure (gradle, github, ...)

Ensuring quality

Introduced/changed code is documented (javadoc + inline comments).
End-user features are documented (key-docs: Multi-Core Proving).
Added equivalence, stress and unit tests for the new functionality.
Tested: the equivalence gate (single vs parallel outcome) and 8-worker stress
tests pass; full key.core + touched-module unit tests pass.
Runtime performance: the single-core path is unchanged (the parallel prover is
opt-in); see the benchmark section for multi-core speedups.

Separable, main-relevant fixes (kept as standalone commits for cherry-picking,
useful even single-core):

Compute NodeInfo's active statement lazily, off the proving path
Cache loop-invariant instantiation on the rule app, not a static field (latent
cross-proof leak via a static field)
Make per-call program transformers stateless (MethodCall, JmlAssert)
Robustly read numeric settings stored as Integer or Long — fixes a startup
ClassCastException (present on main) that can prevent KeY from launching when
loading certain settings files (Cherry-Picked on Fixes for minor regressions on main #3844)
Don't crash reloading a recent file with no stored profile — a recent-file
entry without a profile serialized null as the string "null", so reloading (Cherry-Picked on Fixes for minor regressions on main #3844)
it failed to resolve a profile and showed an error instead of opening the file
Show a proof macro's aggregate statistics in the status line — the status line
kept a macro's last internal-pass count instead of its aggregate result (Cherry-Picked on Fixes for minor regressions on main #3844)

Additional information and contact(s)

This PR provides a clear opt-in benefit (parallel speedup on splitting proofs)
with a safe fallback to the unchanged single-core prover.

This PR has been done with AI tooling support.

The contributions within this pull request are licensed under GPLv2 (only) for inclusion in KeY.

wadoon · 2026-06-19T08:58:12Z

one thread per open goal

So you are creating 1000 threads if you have 1000 open goals?
Or do you mean, you create 1000 callable instances, ...

The idea of a worker pool is to have a fix/max amount of threads, and to dispatch work on them.

Are you using/Consider to use the new virtual threads. Somebody would also call them greenlets, or stackless...

unp1 · 2026-06-19T11:06:29Z

one thread per open goal

So you are creating 1000 threads if you have 1000 open goals?
Or do you mean, you create 1000 callable instances, ...

no that is just the starting idea. Depending on the amounts of threads you configure to use, you have that nr of threads and it is capped at the number of processor cores of your computer. Then there is a worker pool and a lifo queue from which a free worker is assigned the next goal. I'll clarify the description thanks.

The idea of a worker pool is to have a fix/max amount of threads, and to dispatch work on them.

Are you using/Consider to use the new virtual threads. Somebody would also call them greenlets, or stackless...

not yet. The first milestone was to get to a point where we have for some problems a win and KeY is (at least as far as one can test) race free. The problem is, I did not have any race conditions even under high pressure. But the problem is that other OSes with other schedulers might discover some still existing problems, so if someone has time and can just run it once in a while to get feedback, would be appreciated.

For trying out, I recommend the preview integration with the single core improvemnts. It solves some congestions here already.

wadoon · 2026-06-19T13:56:35Z

Could you provide the command to be executed? I could try to get a 64-core machine if we would have proofs having such number of goals.

The active statement (and its position) were eagerly computed when a node's rule application was set, on the proving path. Compute them lazily on first access instead; nodes whose active statement is never inspected do no work.

The loop-invariant rule cached the last instantiation in a static field shared across proofs, which could leak an instantiation between unrelated proofs. Store it on the rule application instead.

Give the MethodCall metaconstruct a fresh per-application copy instead of mutating shared instance state, make the class final to lock in that argument, and make JmlAssert's condition final.

ConcurrentLruCache (exact-LRU, single-lock, drop-in for synchronizedMap over an LRUCache) for eviction-sensitive caches, StripedLruCache for pure caches, and WeakValueInterner for identity-preserving interning under concurrency, with concurrency tests.

Add the parallelProverEnabled / parallelProverThreadCount settings and a Profile.supportsParallelAutomode() capability: the standard Java profile opts in, the well-definedness, information-flow and symbolic-execution profiles keep the single-core fallback.

Add EssentialProofListener and Proof.suspendNonEssentialListeners(), which detaches every proof-tree and rule-app listener not tagged essential for the duration of a run and re-attaches them afterwards, so nothing unrelated to proving fires on a worker thread.

ParallelProver runs automatic proof search on a worker pool, one open goal per worker, behind the AutoProvers selection seam. A single GoalScheduler monitor hands out goals depth-first; Goal.apply is split into a concurrent compute phase and a commit phase serialized on one lock; fresh names come from a partitioned per-proof allocator (atomic Counter), namespace flushes are deferred, and the merge rule is disabled during parallel runs. The taclet-index cache key is immutable so siblings sharing the cache set cannot race on it. Includes the scheduler unit test and the proof-equivalence gate.

Harden the shared state the workers touch during search: route the matching and cost-path caches through the thread-safe cache utilities, intern parametric operators and elementary updates identity-preservingly, give the per-call instantiation caches and counters safe publication or per-worker copies, and guard the specification repository's proving-time maps. Behaviour-preserving.

Add a settings pane and a clickable status-line indicator to switch between the single-core and multi-core prover, a --threads CLI option, and gating of the single-core-only features: proof caching, slicing and the merge rule are disabled (with explanatory tooltips, updated live) while the multi-core prover is active, and restored when switching back.

Route StrategyProofMacro and TryCloseMacro through the AutoProvers seam. The try-close prune stays safe because start() only returns once every worker has stopped, so no worker is live during the tree mutation; a stress test exercises the prune-and-close path at eight workers.

Add the real-proof speedup benchmark, the synthetic best/worst-case benchmark, a JFR profiling probe, and a high-worker non-closure stress test wired into the CI integration-test matrix. Pin the test suite to the single-core prover by default; opt-in parallel tests override it at runtime.

…rover The spec maps mutated by meta-construct transforms during the parallel prover's lock-free computeRuleApp phase (statementMap, mergeContracts, blockContracts, loopContracts, loopContractsOnLoops, loopInvs) were plain Linked/IdentityHashMaps. Concurrent puts from IntroAtPreDefsOp (introduced via main's JmlAssert/SetStatement statement-spec mechanism, which postdates the original MT shared-state audit) corrupt the table -- a put-triggered resize racing a get returns null, crashing IntroAtPreDefsOp.handleJmlStatement with an NPE. Wrap them in synchronizedMap, which keeps insertion order (so proof digests stay deterministic) while making get/put/remove atomic.

…spatch After a multi-core run the proof-tree listener is re-attached and the view must reconcile every change at once; on a large proof that rebuild ran inline in the synchronous autoModeStopped dispatch, blocking the cheaper views (goal list, sequent, status) until it finished -- looking like a slow stop / GUI freeze. Add GuiUtilities.deferAfterAutoMode (a thin EDT re-schedule) and use it in ProofTreeView.autoModeStopped: capture the per-run state, then run the heavy tree reconcile as its own EDT task. The cheap views finish the synchronous dispatch and repaint first; only the tree update is heavy, and isolated. The deferred task is guarded against a new run having started in the meantime.

… a run The deferred post-run tree reconcile can still hold the EDT for a while on very large proofs. Wrap the tree's scroll pane in a JLayer and paint a busy banner over it before the reconcile -- forced synchronously with paintImmediately, since the reconcile then blocks the EDT -- so the user sees the tree is updating rather than frozen, and clear it afterwards. Gated above BUSY_OVERLAY_NODE_THRESHOLD nodes so small/cheap updates do not flicker. The tree/scroll-pane hierarchy below the layer is unchanged.

…up after a run" This reverts commit bb05dcf.

The deferred autoModeStopped catch-up can run after the user has switched away from the proof tree tab. Touching a non-displayed large-model JTree fires treeStructureChanged, which NPEs in FixedHeightLayoutCache (no realized root node) -- this is the reported crash when switching to the goal list while the tree was updating. When the panel is passivated (panelActive == false), just set dirtyWhileHidden and return, so activatePanel rebuilds the tree (and re-attentivates the model) when the tab is shown again -- the established hidden-tab catch-up path -- instead of updating the hidden tree now.

…adoc)

…redo) Show an opaque overlay with an 'Updating proof tree...' banner over the tree while the active panel reconciles a large proof after a run, so the user sees it is updating rather than frozen. It is stacked over the scroll pane via OverlayLayout (no JLayer), shown only in the panelActive path (never on a hidden/passivated tree) and only above BUSY_OVERLAY_NODE_THRESHOLD nodes. Being opaque, paintImmediately paints only the overlay -- the covered tree is never forced to lay out -- so it cannot reintroduce the FixedHeightLayoutCache NPE.

… (fix NPE) Firing a bare root treeStructureChanged (updateTree(null)) on the attached large-model JTree NPEs in FixedHeightLayoutCache when its cached root no longer matches the run-stale model -- which the deferred catch-up reliably hits even with the panel active (the proof-tree listener is suspended during the run, so the model no longer matches the proof). Rebuild robustly instead: detach the view, clear+refresh the model, then re-attach so BasicTreeUI rebuilds its layout cache with a fresh root. Since the model is always stale after a suspended run, this replaces the incremental modifiedSubtrees path (now dead -- removed) with one clean rebuild.

…ll rebuild GUIProofTreeModel.updateTree's full-rebuild cleared branchNodes, so getRoot() minted a new root GUIBranchNode; firing a structure-change that carried a root the view's FixedHeightLayoutCache did not recognise NPEs it (changedNode == null), which the deferred catch-up reliably hit. Re-insert the same root and flush its children instead -- a clean rebuild with no identity mismatch. This lets ProofTreeView's deferred catch-up drop the setModel(null)/setModel re-attach workaround -- which corrupted the shared selection (tree and goal list showing every row selected) and slowed pane switching -- and go back to plain updateTree(null).

Switching the deferred catch-up to an always-full rebuild (updateTree(null)) collapsed and re-rendered the whole tree on every stop, resetting the selection and re-applying the per-node active-statement highlight to every row (the 'everything selected' look, which also disturbed the goal list via re-selection). Restore the original incremental path: refresh only the subtrees the prover worked on (getProofTreeNode creates lazily, so this does a subtree update, not the root rebuild, and never NPEs on a shown tree), preserving selection/expansion elsewhere. The full updateTree(null) stays only for the dirty-while-hidden case, where the root-identity fix keeps it NPE-free.

The opaque overlay (OverlayLayout wrapper + BusyOverlay) covered the entire tree area, which showed up as a light-blue tint over the whole panel (including the empty space below the tree) after pane switches -- the reported regression that only appeared with the deferral. Remove it entirely and go back to the plain scroll pane. The deferred catch-up, incremental subtree refresh, passivated-tab guard and the root-identity NPE fix all stay, so the GUI is still responsive on stop without the freeze, NPE, or selection/tint issues. A non-intrusive 'updating' indicator can be added later outside the tree area if wanted.

…annot fail it ProofStarter.init copies the proof's own modest maxSteps, so a parallel run that diverges in goal order -- exploring more goals before closing than the single- threaded proof needs -- could exhaust that budget and leave the proof open. SimplifiedLinkedList.remove @ 8 workers did this intermittently on CI. That is benign step exhaustion, not a lost goal: a real race saturates open (no rule left) below any cap, so raising the cap cannot mask it -- it only removes the step- exhaustion false positive, making the test a cleaner race detector. Override the cap to 200_000 after init(). Verified with 3 local reruns: 9/9 SimplifiedLinkedList.remove and 24/24 symmArray parallel reps at 8 workers closed.

The deferral (and its overlay / root-identity / panelActive follow-ups) moved the post-run proof-tree catch-up off the stop dispatch, but it never made the catch-up cheap -- it only relocated the freeze to the next EDT operation, where it collided with loading a new proof (the long spinner). It also caused the NPE, whole-area tint and 'everything selected' regressions. Restore ProofTreeView, GUIProofTreeModel and GuiUtilities to origin/main so the catch-up runs synchronously again (it finishes during the stop, before any load). Making the catch-up itself cheap on very large proofs is the real fix and will be a separate, measured change. Non-GUI fixes (SpecificationRepository thread-safety, MtStressTest step cap) are unaffected.

A thread dump on a ~300k-node proof showed the EDT RUNNABLE with 70+s of CPU in ProofRenderer.renderBranch: for every OPEN branch node, on every repaint, it ran a breadth-first search over the branch's whole subtree and called the linear-scan Proof.getOpenGoal on each visited node -- O(subtree x openGoals) per branch, per paint -- purely to decide the rare linked-folder icon. On a large proof with many open goals that is billions of operations per paint, which is the GUI sluggishness. Replace it with a scan over the proof's open goals (far fewer than a big subtree), short-circuited by the cheap isLinked() pre-check, walking ancestors only for the linked ones. Equivalent result, no caching/staleness. Also skip the getOpenGoal scan for closed leaves in renderLeaf (a closed leaf has no open goal). This is stock-KeY behaviour, exposed now that MT makes such large proofs reachable.

RuleAppContainer.compareTo compared only cost, so equal-cost candidates were left in their generation/insertion order. Under concurrent goal processing that order is timing-dependent, which made parallel proofs non-deterministic in size and not reproducible run to run. Add a deterministic, content-based tie-break for equal-cost apps: rule name, then structural position (antec/succ, position-in-term path, focus formula by operator names), then taclet instantiations (schema-variable names + instantiated terms by operator names). Only stable keys are used -- never object hashCodes or toString(), which can embed identity (e.g. term-label hashes). Cost still dominates; only ties change. This cuts parallel proof-size variance several-fold (SimplifiedLinkedList.remove at 2/4/8 workers: ~18-28k/6-16k/10-15k node spread down to ~4.5k/5k/2.5k) and stays deterministic single-threaded. The taclet-lemma corpus (203 proofs) and the 8-worker stress test still close.

The comments in ServiceCaches and AbstractMonomialSmallerThanFeature claimed the introductionTimeCache is eviction/history-sensitive (its value 'reflects the goal history at first-cache time'). It is not: introductionTime is the introducer's depth from the root, identical for every goal below that introducer, so the value is goal- and order-independent (verified -- bypassing the cache leaves proofs unchanged). Reword both comments to state this.

The workers==0 baseline cleared key.prover.parallel, but an unset property falls back to the persisted general setting (ParallelProver.isEnabled), which can be parallel-on -- so the 'single' baseline silently ran the parallel prover and every reported speedup was ~parallel/parallel (meaningless). Set the property explicitly to false for a true single-threaded baseline.

unp1 mentioned this pull request Jun 18, 2026

Multithreading for KeY: combined preview with the performance series #3843

Closed

unp1 force-pushed the bubel/mt-goals branch from 1518256 to cd9f12e Compare June 18, 2026 21:32

unp1 mentioned this pull request Jun 18, 2026

Fixes for minor regressions on main #3844

Merged

unp1 changed the title ~~Multithreading: a multi-core prover for automatic proof search~~ Multithreading for KeY Jun 18, 2026

unp1 self-assigned this Jun 18, 2026

unp1 added the 🚀 Performance label Jun 18, 2026

unp1 force-pushed the bubel/mt-goals branch from cd9f12e to 5fc660b Compare June 18, 2026 22:29

unp1 marked this pull request as ready for review June 18, 2026 23:15

unp1 force-pushed the bubel/mt-goals branch from 5fc660b to 7cf6734 Compare June 19, 2026 04:26

unp1 added 14 commits June 25, 2026 07:35

Cache loop-invariant instantiation on the rule app, not a static field

ae224c8

The loop-invariant rule cached the last instantiation in a static field shared across proofs, which could leak an instantiation between unrelated proofs. Store it on the rule application instead.

Make per-call program transformers stateless

02c27e1

Give the MethodCall metaconstruct a fresh per-application copy instead of mutating shared instance state, make the class final to lock in that argument, and make JmlAssert's condition final.

Pool the matching cursor per thread instead of behind a global lock

5f20095

Stop multi-core workers before restoring proof listeners on cancel

4214f2e

unp1 force-pushed the bubel/mt-goals branch from 890f466 to b7bd241 Compare June 25, 2026 06:09

unp1 added 3 commits June 25, 2026 10:56

Revert "Show an 'updating proof tree' overlay while the tree catches …

c3ed8d3

…up after a run" This reverts commit bb05dcf.

unp1 added 13 commits June 25, 2026 12:01

Apply spotless formatting (line wrapping in the deferred-catch-up jav…

b1d9845

…adoc)

unp1 mentioned this pull request Jun 26, 2026

Make proof-tree painting cheap on large proofs #3863

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Multithreading for KeY#3842

Multithreading for KeY#3842
unp1 wants to merge 30 commits into
mainfrom
bubel/mt-goals

unp1 commented Jun 18, 2026 •

edited

Loading

Uh oh!

wadoon commented Jun 19, 2026

Uh oh!

unp1 commented Jun 19, 2026

Uh oh!

wadoon commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

unp1 commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Intended Change

Architecture (conceptual)

Problems encountered, fixed and verified

Single-core-only features

Configuration

Benchmarks

Plan

Type of pull request

Ensuring quality

Additional information and contact(s)

Uh oh!

wadoon commented Jun 19, 2026

Uh oh!

unp1 commented Jun 19, 2026

Uh oh!

wadoon commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

unp1 commented Jun 18, 2026 •

edited

Loading