
perf: batch community cohesion to stop hangs on large repos#183

Open
realkotob wants to merge 1 commit into tirth8205:main from realkotob:perf/batch-community-cohesion

Conversation

@realkotob realkotob commented Apr 9, 2026

The problem

Running code-review-graph build on a large source tree (~9k files, ~100k edges, ~93k nodes) on my MacBook Pro (M1 Pro) effectively hangs during post-processing. The parse phase finishes fine, but community detection never returns and Python stays pinned at 100% CPU:

INFO: Progress: 9234/9234 files parsed
INFO: FTS index rebuilt: 93541 rows indexed
INFO: igraph not available, using file-based community detection
    ... <never returns, 100% CPU> ...

A sampled stack trace points inside _detect_file_based.

Why it happens

_compute_cohesion walks the entire edge list to count internal/external edges for one community. It's called once per community, from three places:

  • _detect_file_based — once per file community ← the killer
  • _detect_leiden — once per Leiden cluster
  • _detect_leiden_sub — once per sub-cluster

The resulting work is O(edges × communities). For the file-based fallback on the Godot source tree, that's ~9k communities × ~100k edges ≈ ~10⁹ pure-Python comparisons: the hang I hit.

The Leiden path has the same shape but with far fewer clusters, so it was merely slow rather than catastrophic. It's still a real bug, just less dramatic.
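The quadratic shape above can be sketched as follows. This is a hypothetical simplification, not the project's actual signature: the names, the `(src, dst)` edge representation, and the cohesion formula (internal over internal-plus-external touching edges, consistent with the 0.5 and 1.0 values pinned by the existing unit tests) are illustrative assumptions.

```python
# Hypothetical simplification of the pre-fix shape: one full edge scan
# per community, so the caller's loop is O(edges * communities) overall.
def compute_cohesion(member_qns: set[str], all_edges: list[tuple[str, str]]) -> float:
    internal = external = 0
    for src, dst in all_edges:  # O(edges) for a single community
        src_in, dst_in = src in member_qns, dst in member_qns
        if src_in and dst_in:
            internal += 1      # both endpoints inside the community
        elif src_in or dst_in:
            external += 1      # edge crosses the community boundary
    total = internal + external
    return internal / total if total else 0.0

# The killer pattern: one full scan per community.
# cohesions = [compute_cohesion(c, all_edges) for c in communities]
```

With ~9k file communities, that commented-out loop is where the 100%-CPU hang lives.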

The fix

Compute cohesion for all communities in a single O(edges) pass.

A new helper, _compute_cohesion_batch(community_member_qns, all_edges), builds a qualified_name → community_index reverse map once; this is valid because all three detection functions produce partitions, so each node appears in at most one community. It then walks every edge exactly once and buckets it into internal/external counters per community.

Total work drops from O(edges × communities) to O(edges + Σ|members|). For the big repo, community detection now completes in seconds instead of hanging.
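The single-pass approach might look like this sketch, under the same illustrative assumptions as above (hypothetical names and edge representation; the real helper lives in code_review_graph/communities.py):

```python
def compute_cohesion_batch(
    community_member_qns: list[set[str]],
    all_edges: list[tuple[str, str]],
) -> list[float]:
    # Reverse map built once: qualified name -> community index.
    # Safe because the detection functions produce partitions,
    # so each node belongs to at most one community.
    qn_to_comm = {
        qn: idx
        for idx, members in enumerate(community_member_qns)
        for qn in members
    }
    internal = [0] * len(community_member_qns)
    external = [0] * len(community_member_qns)
    for src, dst in all_edges:  # single O(edges) pass
        ci, cj = qn_to_comm.get(src), qn_to_comm.get(dst)
        if ci is not None and ci == cj:
            internal[ci] += 1          # edge internal to one community
        else:
            if ci is not None:
                external[ci] += 1      # boundary edge, counted per side
            if cj is not None:
                external[cj] += 1
    return [
        i / (i + e) if (i + e) else 0.0
        for i, e in zip(internal, external)
    ]
```

A boundary edge increments the external counter of both communities it touches, which matches running the single-community scan once per community.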

All three detection functions (_detect_file_based, _detect_leiden, _detect_leiden_sub) are refactored to:

  1. Collect member sets for all qualifying communities first.
  2. Call _compute_cohesion_batch once.
  3. zip the cohesions back onto the community metadata.

_compute_cohesion is kept as a thin single-community wrapper around the batch helper, so existing callers and the direct unit tests (which pin exact cohesion values like 0.5 and 1.0) keep working unchanged.
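The collect → batch → zip refactor and the thin wrapper could look like this sketch (a compact, hypothetical re-implementation for illustration; the metadata keys are made up, and the real code uses the project's own community records):

```python
def compute_cohesion_batch(member_sets, edges):
    # Compact stand-in for the batch helper: reverse map, one edge pass.
    qn_to_comm = {qn: i for i, ms in enumerate(member_sets) for qn in ms}
    internal = [0] * len(member_sets)
    external = [0] * len(member_sets)
    for src, dst in edges:
        ci, cj = qn_to_comm.get(src), qn_to_comm.get(dst)
        if ci is not None and ci == cj:
            internal[ci] += 1
        else:
            if ci is not None:
                external[ci] += 1
            if cj is not None:
                external[cj] += 1
    return [i / (i + e) if i + e else 0.0 for i, e in zip(internal, external)]

def compute_cohesion(member_qns, edges):
    # Thin single-community wrapper: existing callers and the unit tests
    # that pin exact values like 0.5 and 1.0 keep working unchanged.
    return compute_cohesion_batch([member_qns], edges)[0]

def attach_cohesions(communities, edges):
    # Steps 1-3: collect member sets, one batch call, zip back onto metadata.
    member_sets = [c["member_qns"] for c in communities]
    cohesions = compute_cohesion_batch(member_sets, edges)
    for community, cohesion in zip(communities, cohesions):
        community["cohesion"] = cohesion
    return communities
```

Because `zip` pairs results positionally, the batch call must receive member sets in the same order as the community records; that ordering invariant is exactly what the integration test below is designed to catch if it breaks.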

Test coverage

Three layers of regression protection, all added in this PR:

  • test_compute_cohesion_batch_matches_single (unit): batch-vs-single parity on a mixed-edge fixture.
  • test_detect_file_based_integration (integration, deterministic): pins exact member sets and cohesion values on a hand-built fixture with asymmetric cohesions (1.0 vs 0.6667); catches zip misalignment or a wrong member_qns passed to the batch helper.
  • test_detected_cohesions_match_direct_computation (end-to-end, algorithm-agnostic): runs detect_communities on the seeded fixture (with an extra edge added to break cohesion symmetry) and asserts that every stored cohesion equals what _compute_cohesion produces directly on the same member set.

All three were validated with a mutation test (results.reverse() injected into _compute_cohesion_batch): each caught the bug with a clear, specific error message before the mutation was reverted. Without the fixture asymmetry, test_detected_cohesions_match_direct_computation alone wouldn't catch a simple swap, so the test also asserts that at least two distinct cohesion values are seen; a future fixture change that accidentally re-symmetrizes the cohesions will fail loudly.

Test plan

  • uv run pytest tests/test_communities.py — 23 passed, 1 skipped (igraph-only test)
  • uv run pytest tests/ — 624 passed, 1 skipped, 2 xpassed
  • uv run ruff check code_review_graph/ tests/test_communities.py — clean
  • uv run mypy code_review_graph/communities.py --ignore-missing-imports --no-strict-optional — clean
  • Mutation test: injected results.reverse() into _compute_cohesion_batch, confirmed all three new tests fail loudly with specific error messages, then reverted.
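Why the fixture asymmetry matters for the mutation test can be shown in a few lines (hypothetical values only; 1.0 and 0.6667 are the cohesions from the integration fixture above):

```python
# If both communities had equal cohesion, reversing the batch results
# (the injected mutation) would be invisible to a parity assertion.
symmetric = [1.0, 1.0]
assert list(reversed(symmetric)) == symmetric  # swap goes undetected

# With asymmetric cohesions, the swap is caught immediately.
asymmetric = [1.0, 0.6667]
assert list(reversed(asymmetric)) != asymmetric

# The guard the end-to-end test adds against fixture regressions:
assert len(set(asymmetric)) >= 2  # at least two distinct cohesion values
```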

Out of scope

  • No changes to the Leiden algorithm, edge weights (EDGE_WEIGHTS), or edge-kind handling.
  • No change to the cohesion field's serialization or rounding.
  • No change to the public _compute_cohesion signature.
  • A separate issue I observed during end-to-end verification — g.community_leiden(...) itself can be very slow on graphs this size, likely because it's called without an n_iterations cap. That's a distinct problem and will be addressed in a follow-up PR.

Community detection was effectively hanging on the Godot source
(~9k files, ~100k edges, ~93k nodes) during the post-parse phase. The
parser finished fine, but `_detect_file_based` never returned, with
Python pinned at 100% CPU.

The cause was a quadratic pattern in `_compute_cohesion`: for each
community, it walked every edge in the graph once to count
internal/external edges. That's O(edges) per community, called once
per community, so the whole step was O(edges × communities).

- `_detect_file_based` creates one community per file. On Godot that's
  ~9k × ~100k ≈ ~10⁹ pure-Python comparisons — the hang I hit.
- `_detect_leiden` and `_detect_leiden_sub` have the same shape, just
  with fewer clusters, so they were merely slow rather than
  catastrophic.

Fix: compute cohesion for all communities in a single O(edges) pass
via a new `_compute_cohesion_batch` helper. It builds a
`qualified_name → community_index` reverse map once (since all three
detection functions produce partitions, each node belongs to at most
one community), then walks every edge exactly once and buckets it
into internal/external counters per community. Total work drops from
O(edges × communities) to O(edges + Σ|members|).

Both `_detect_leiden` and `_detect_leiden_sub` are refactored to
collect all cluster member sets first and then call
`_compute_cohesion_batch` once, the same pattern as the file-based
path.

`_compute_cohesion` is kept as a thin wrapper around the batch helper
so existing callers and the direct unit tests (which pin exact
values like 0.5 and 1.0) keep working unchanged.

Test coverage:

- Unit: 3 new tests verify `_compute_cohesion_batch` matches
  `_compute_cohesion` on a mixed-edge fixture, handles empty
  community lists, and handles edge-less input.
- Integration (deterministic): `test_detect_file_based_integration`
  builds an asymmetric fixture (one community with cohesion 1.0,
  another with cohesion 0.6667) and pins exact member sets and
  cohesion values — catches zip misalignment or wrong `member_qns`
  passed to the batch helper.
- End-to-end (algorithm-agnostic):
  `test_detected_cohesions_match_direct_computation` runs
  `detect_communities` on the seeded two-cluster fixture (with an
  extra edge added to break cohesion symmetry) and asserts that
  every stored community cohesion equals what `_compute_cohesion`
  produces when called directly on that community's member set.
  Works whether Leiden (igraph) or the file-based fallback is active.

All three new tests were validated with a mutation test
(`results.reverse()` injected into `_compute_cohesion_batch`) and
caught the bug with clear, specific error messages before being
reverted.

Full suite: 624 passed, 1 skipped, 2 xpassed. Ruff and mypy clean.
