PERF: Speed up permutation cluster tests via Numba union-find + compact graph#13731
Open
sharifhsn wants to merge 2 commits intomne-tools:mainfrom
Open
PERF: Speed up permutation cluster tests via Numba union-find + compact graph#13731sharifhsn wants to merge 2 commits intomne-tools:mainfrom
sharifhsn wants to merge 2 commits intomne-tools:mainfrom
Conversation
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reference issue
Related: #5439, #7784, #8095, #12609
What does this implement/fix?
Speeds up
spatio_temporal_cluster_1samp_test(and the otherpermutation_cluster_*functions) by ~5-10x on realistic data. The PR is split into 7 incremental commits. Maintainers can accept or reject each layer independently.Commit 1 — Precompute sum-of-squares for sign-flip t-test (+29/−9 lines, 3.2x)
For the default
ttest_1samp_no_p, s²=1 meanssum(X²)is constant across permutations. Each permutation becomes a singlesigns @ Xdot product instead of callingstat_fun. Also skipsbuffer_sizeverification for built-in stat functions.Commit 2 — Numba union-find for spatio-temporal CCL (+226/−11 lines, 10.3x cumulative)
JIT-compiled union-find kernel (
_st_fused_ccl) with path compression and union-by-rank, replacing the Python BFS in_get_clusters_st. Bundles tightly-coupled pieces: pre-computed CSR adjacency arrays,_sums_onlyflag to skip cluster list construction (usesnp.bincountinstead), and_csr_dataparameter threading. These are bundled because_sums_onlyonly fires insideif has_numba:and CSR data is only consumed by the Numba kernel.Commit 3 — Extract _union helper + simplify (+36/−72 lines)
Extract duplicated find+union logic into
_union()withinline="always", simplify_sum_cluster_data, trim docstrings/comments to match codebase style.Commit 4 — Fix step-down reshape (+1/−1 lines)
Pre-existing bug:
adjacency is None and adjacency is not Falsewas equivalent to justadjacency is None, missing theadjacency is Falsecase wherestep_down_includestill needs reshaping.Commit 5 — Changelog entries
Commit 6 — Test fixture (+1 line)
Patch
has_numbainnumba_conditionalfixture so the "NumPy" test variant actually exercises the Python BFS fallback path for spatio-temporal clustering.Commit 7 — Docstring (+18/−1 lines)
Expand
_st_fused_ccldocstring with algorithm description, complexity analysis, and Wikipedia reference, per reviewer request.Commits 3-7 are cleanup, bugfix, docs, and tests — they don't affect performance. All optimizations fall back to the original code paths when Numba is not installed. No public API changes.
Benchmarks
Per-commit cumulative speedup (local, Apple M-series,
spatio_temporal_cluster_1samp_test, ico-5, 15 subjects x 15 timepoints x 20,484 vertices, threshold=3.0, 512 permutations, median of 3 runs):AWS HPC end-to-end (AMD EPYC 7R13, same data dimensions):
Per-permutation cost: 15.8 ms → 3.1 ms (5.2x). Projected 10,000 permutations: 31 s vs 159 s.
Reproduce benchmarks locally
Additional information
threshold=dict(...)) correctly falls back to the original code path