Skip to content

Parallelize spatial neighbors#1198

Open
grst wants to merge 3 commits into
scverse:mainfrom
grst:parallelize-spatial-neighbors
Open

Parallelize spatial neighbors#1198
grst wants to merge 3 commits into
scverse:mainfrom
grst:parallelize-spatial-neighbors

Conversation

@grst
Copy link
Copy Markdown
Contributor

@grst grst commented Jun 3, 2026

Description

This PR addresses two performance leverages in sptial_neighbors:

  • extract per-library sub-matrices upfront and once (most important)
  • enable per-library parallelism with joblib (relatively high per-worker overhead, only worthwhile with many samples)

The first step brings down the runtime by ~50% in my real-world dataset.
Parallelism doesn't add a ton, but still slightly improves the wall time. It really depends on the dataset how much it adds, therefore I left the default at n_jobs=1.

When working with spatialdata, there's an additional bottleneck in _prepare_spatial_neighbors_input that I'll follow up on separately.

How has this been tested?

  • unit tests pass
  • manually on a real-world dataset

Closes

no related issue

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.29%. Comparing base (da789d0) to head (d2305e3).

Files with missing lines Patch % Lines
src/squidpy/gr/neighbors.py 42.85% 3 Missing and 1 partial ⚠️
src/squidpy/gr/_build.py 84.61% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1198      +/-   ##
==========================================
- Coverage   75.33%   75.29%   -0.04%     
==========================================
  Files          56       56              
  Lines        7922     7934      +12     
  Branches     1292     1294       +2     
==========================================
+ Hits         5968     5974       +6     
- Misses       1444     1448       +4     
- Partials      510      512       +2     
Files with missing lines Coverage Δ
src/squidpy/_docs.py 94.59% <100.00%> (+0.07%) ⬆️
src/squidpy/gr/_build.py 80.69% <84.61%> (-0.34%) ⬇️
src/squidpy/gr/neighbors.py 90.82% <42.85%> (-1.62%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@grst grst marked this pull request as ready for review June 3, 2026 11:07
@grst
Copy link
Copy Markdown
Contributor Author

grst commented Jun 3, 2026

The failing pre-release tests are unreleated (they already fail on main).

@grst grst requested review from selmanozleyen and timtreis and removed request for timtreis June 3, 2026 11:14
@selmanozleyen
Copy link
Copy Markdown
Member

Thanks, extracting the spatial coords and then slicing was very much needed yes but I really doubt we need multiprocess parallelization here. We can use thread parallelisim because scikit-learn's nearest neighbor search is written in Cython/C++ and releases the GIL, therefore it will utilize the cores.

@grst
Copy link
Copy Markdown
Contributor Author

grst commented Jun 3, 2026

Usually the effectiveness of thread-based parallelism is limited, even in C++, because of Ahmdal's law and communication overhead. It really depends a lot on the concrete method, but it is often more effective to parallelize at the level of samples, or use a combination of process and thread-based parallelism.

@selmanozleyen
Copy link
Copy Markdown
Member

"Ahmdal's law and communication overhead" these are also true for process based parallelism, and there is even more overhead on communication in processes. If we parallelize with backend=threads how different would the results here look like?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants