perf: batch community cohesion to stop hangs on large repos #183
Open
realkotob wants to merge 1 commit into tirth8205:main
The problem
Running `code-review-graph build` on a large source tree (~9k files, ~100k edges, ~93k nodes) on my MacBook M1 Pro effectively hangs during post-processing. The parse phase finishes fine, but community detection never returns and Python is pinned at 100% CPU. A sampled stack trace points inside `_detect_file_based`.

Why it happens
`_compute_cohesion` walks the entire edge list to count internal/external edges for one community. It's called once per community, from three places:

- `_detect_file_based` — once per file community ← the killer
- `_detect_leiden` — once per Leiden cluster
- `_detect_leiden_sub` — once per sub-cluster

The resulting work is O(edges × communities). For the file-based fallback on Godot that's ~9k × ~100k ≈ ~10⁹ pure-Python comparisons — the hang I hit.
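A minimal sketch of that quadratic shape (hypothetical code, not the project's actual implementation; cohesion is assumed here to be internal / (internal + external)):

```python
def compute_cohesion(member_qns: set[str], all_edges: list[tuple[str, str]]) -> float:
    """Hypothetical per-community cohesion: one full walk over all edges."""
    internal = external = 0
    for src, dst in all_edges:                      # O(edges) per call
        src_in, dst_in = src in member_qns, dst in member_qns
        if src_in and dst_in:
            internal += 1                           # both endpoints inside
        elif src_in or dst_in:
            external += 1                           # exactly one endpoint inside
    total = internal + external
    return internal / total if total else 0.0

# Called once per community, so the whole step is O(edges * communities):
# cohesions = [compute_cohesion(c, all_edges) for c in communities]
```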
The Leiden path has the same shape but with far fewer clusters, so it was merely slow rather than catastrophic. It's still a real bug, just less dramatic.
The fix
Compute cohesion for all communities in a single O(edges) pass.
New helper `_compute_cohesion_batch(community_member_qns, all_edges)` builds a `qualified_name → community_index` reverse map once — each node appears in at most one community since all three detection functions produce partitions — then walks every edge exactly once and buckets it into internal/external counters per community.

Total work drops from O(edges × communities) to O(edges + Σ|members|). For the big repo, community detection now completes in seconds instead of hanging.
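Under those assumptions (edges as `(src, dst)` qualified-name pairs, cohesion = internal / (internal + external)), the single-pass helper could look roughly like this; a sketch under those assumptions, not the PR's exact code:

```python
def compute_cohesion_batch(
    community_member_qns: list[set[str]],
    all_edges: list[tuple[str, str]],
) -> list[float]:
    """One O(edges) pass over the graph for all communities at once."""
    # Reverse map built once; valid because the detection functions
    # produce partitions, so each node maps to at most one community.
    qn_to_idx = {
        qn: idx
        for idx, members in enumerate(community_member_qns)
        for qn in members
    }
    internal = [0] * len(community_member_qns)
    external = [0] * len(community_member_qns)
    for src, dst in all_edges:                      # single pass over edges
        si, di = qn_to_idx.get(src), qn_to_idx.get(dst)
        if si is not None and si == di:
            internal[si] += 1                       # both endpoints inside
            continue
        if si is not None:
            external[si] += 1                       # crosses a boundary
        if di is not None:
            external[di] += 1
    return [
        i / (i + e) if i + e else 0.0
        for i, e in zip(internal, external)
    ]
```

With this shape, keeping the old single-community entry point is trivial: it calls the batch helper with a one-element list and returns the first result.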
All three detection functions (`_detect_file_based`, `_detect_leiden`, `_detect_leiden_sub`) are refactored to:

- collect all cluster member sets first,
- call `_compute_cohesion_batch` once,
- `zip` the cohesions back onto the community metadata.

`_compute_cohesion` is kept as a thin single-community wrapper around the batch helper, so existing callers and the direct unit tests (which pin exact cohesion values like 0.5 and 1.0) keep working unchanged.

Test coverage
Three layers of regression protection, all added in this PR:
- Unit: `test_compute_cohesion_batch_matches_single` verifies that `_compute_cohesion_batch` matches `_compute_cohesion` on a mixed-edge fixture, and that the batch helper handles empty community lists and edge-less input.
- Integration (deterministic): `test_detect_file_based_integration` builds an asymmetric fixture (one community with cohesion 1.0, another with cohesion 0.6667) and pins exact member sets and cohesion values — catches zip misalignment or wrong `member_qns` passed to the batch helper.
- End-to-end (algorithm-agnostic): `test_detected_cohesions_match_direct_computation` runs `detect_communities` on the seeded fixture (with an extra edge added to break cohesion symmetry) and asserts every stored cohesion equals what `_compute_cohesion` produces directly on the same member set. Works whether Leiden (igraph) or the file-based fallback is active.
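The end-to-end invariant can be sketched like this (the dict shape `{"member_qns", "cohesion"}` and the helper names are hypothetical, for illustration only):

```python
def direct_cohesion(members: set[str], edges: list[tuple[str, str]]) -> float:
    """Independent recomputation, assumed to be internal / (internal + external)."""
    internal = sum(1 for s, d in edges if s in members and d in members)
    external = sum(1 for s, d in edges if (s in members) != (d in members))
    total = internal + external
    return internal / total if total else 0.0

def check_detected_cohesions(communities: list[dict], edges: list[tuple[str, str]]) -> None:
    """Every stored cohesion must equal a direct recompute on its member set."""
    seen = set()
    for comm in communities:
        expected = direct_cohesion(comm["member_qns"], edges)
        assert abs(comm["cohesion"] - expected) < 1e-9, comm
        seen.add(round(expected, 4))
    # Asymmetry guard: a reversed or mis-zipped result list can only pass
    # when all cohesions happen to be equal, so demand at least two values.
    assert len(seen) >= 2, "fixture has become symmetric"
```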
All three were validated with a mutation test (`results.reverse()` injected into `_compute_cohesion_batch`) and caught the bug with clear, specific error messages before being reverted. Without the fixture asymmetry, `test_detected_cohesions_match_direct_computation` alone wouldn't catch a simple swap — the test explicitly asserts that at least two distinct cohesion values are seen, so a future fixture regression that accidentally re-symmetrizes it will fail loudly.

Test plan
- `uv run pytest tests/test_communities.py` — 23 passed, 1 skipped (igraph-only test)
- `uv run pytest tests/` — 624 passed, 1 skipped, 2 xpassed
- `uv run ruff check code_review_graph/ tests/test_communities.py` — clean
- `uv run mypy code_review_graph/communities.py --ignore-missing-imports --no-strict-optional` — clean
- Mutation check: injected `results.reverse()` into `_compute_cohesion_batch`, confirmed all three new tests fail loudly with specific error messages, then reverted.

Out of scope
- No changes to edge weights (`EDGE_WEIGHTS`) or edge-kind handling.
- No changes to the `cohesion` field's serialization or rounding.
- No changes to the `_compute_cohesion` signature.
- `g.community_leiden(...)` itself can be very slow on graphs this size, likely because it's called without an `n_iterations` cap. That's a distinct problem and will be addressed in a follow-up PR.