Skip to content

refactor: extract graph model subsystem#436

Merged
carlos-alm merged 6 commits intomainfrom
refactor/graph-model
Mar 13, 2026
Merged

refactor: extract graph model subsystem#436
carlos-alm merged 6 commits intomainfrom
refactor/graph-model

Conversation

@carlos-alm
Copy link
Contributor

Summary

  • Introduces src/graph/ subsystem with a unified CodeGraph in-memory model replacing ad-hoc graphology graph construction scattered across multiple modules
  • Centralizes graph algorithms (BFS, Tarjan SCC, Louvain community detection, shortest-path, fan-in/out centrality) and graph builders (dependency, structure, temporal) into dedicated modules
  • Adds role classification and risk scoring as src/graph/classifiers/
  • Refactors communities.js, cycles.js, structure.js, triage.js, and viewer.js to use the new subsystem, removing ~300 lines of duplicated graph-building logic

Test plan

  • All 29 graph + integration test files pass (564 tests)
  • Verify codegraph communities, codegraph cycles, codegraph structure, codegraph triage CLI commands work end-to-end
  • Run full test suite (npm test)

…d classifiers

Introduce src/graph/ with a unified in-memory CodeGraph model that replaces
ad-hoc graphology usage scattered across consumers. Graph construction (from DB),
algorithms (BFS, Tarjan, Louvain, shortest-path), and classification (roles, risk)
are now centralized. Refactors communities, cycles, structure, triage, and viewer
to use the new subsystem, removing ~300 lines of duplicated graph-building logic.
@claude
Copy link

claude bot commented Mar 13, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Graph Model Subsystem Refactoring

Root Cause Analysis

Problem Being Solved: This PR addresses architectural debt by consolidating scattered graph construction logic across multiple modules. Previously, files like cycles.js, communities.js, structure.js, triage.js, and viewer.js each contained their own ad-hoc graph building code using graphology, leading to ~300 lines of duplicated logic.

Why This Approach: The refactor extracts this scattered logic into a dedicated src/graph/ subsystem with:

  • Unified CodeGraph in-memory model (src/graph/model.js)
  • Centralized graph algorithms (src/graph/algorithms/)
  • Dedicated graph builders (src/graph/builders/)
  • Role classification and risk scoring (src/graph/classifiers/)

Risk Assessment: Low risk. This is a pure refactoring that consolidates existing functionality without changing behavior. The graph model provides the same interface as the previous ad-hoc implementations.

Configuration Changes

No risky configuration changes detected

  • No changes to biome.json, test configurations, or build settings
  • No linting rules disabled or relaxed
  • No test thresholds lowered

Test Integrity

Test quality maintained

  • Test files themselves were not modified in this PR
  • Existing test structure in tests/graph/cycles.test.js shows tests use the public API (findCycles) which is preserved
  • The refactor only changes internal implementation, not external interfaces
  • PR description indicates 29 graph + integration test files pass with 564 tests

Code Quality Assessment

Excellent Design Decisions

  1. Clean Separation of Concerns: The src/graph/ subsystem follows excellent modular design:

    • model.js: Pure data structure with no external dependencies
    • algorithms/: Pure graph algorithms (BFS, Tarjan, Louvain)
    • builders/: Database-to-graph transformation logic
    • classifiers/: Pure business logic for roles and risk scoring
  2. Consistent Interface: CodeGraph class provides a unified interface that abstracts over the previous graphology usage:

    // Clean, consistent methods
    addNode(id, attrs)
    addEdge(source, target, attrs)
    successors(id)
    predecessors(id)
    // Conversion utilities
    toGraphology()
    toEdgeArray()
  3. Type Safety: Good use of JSDoc types throughout, especially in src/graph/model.js:56,97

  4. Pure Functions: Risk scoring and role classification are pure functions with no side effects, making them highly testable

Code Quality Issues

  1. String Coercion Everywhere (src/graph/model.js:41,69-70,85-87):

    const key = String(id);
    const src = String(source);
    const tgt = String(target);

    This defensive programming is good, but the comment on line 5 should explain why: "DB integer IDs should be stringified before use."

  2. Edge Deduplication Logic (src/graph/model.js:99-110):
    The undirected graph edge deduplication using string concatenation with null bytes is clever but could be fragile:

    const key = src < tgt ? `${src}\0${tgt}` : `${tgt}\0${src}`;

    Consider using a Set with sorted tuples instead.

  3. Risk Scoring Magic Numbers (src/graph/classifiers/risk.js:5-11):
    The default weights are reasonable but should have comments explaining their rationale:

    fanIn: 0.25,    // Why 0.25?
    complexity: 0.3, // Why highest weight?

Architectural Improvements

  1. Database Abstraction: buildDependencyGraph() properly abstracts database queries into a graph model, removing the tight coupling between algorithms and SQLite schema.

  2. Algorithm Reusability: Moving Tarjan SCC to src/graph/algorithms/tarjan.js makes it reusable for other cycle detection needs.

  3. Performance Considerations: The CodeGraph.toEdgeArray() method enables efficient native Rust interop while maintaining the JS fallback path.

Minor Concerns

  1. Implicit Behavior: The auto-addition of nodes in addEdge() (src/graph/model.js:72-73) could mask bugs. Consider making this explicit or at least documenting it clearly.

  2. Memory Usage: For very large graphs, the dual adjacency list structure (_successors + _predecessors) uses 2x memory. This is probably acceptable for the typical use case.

Backlog Compliance

  • Zero-dep: ✅ Uses only existing dependencies (graphology, better-sqlite3, tree-sitter)
  • Foundation-aligned: ✅ Improves code structure without violating any core principles
  • Problem-fit: 4/5 — Directly improves maintainability and reduces duplicate code, which indirectly helps AI agents by providing cleaner, more consistent interfaces
  • Breaking: No — All public APIs preserved, purely internal refactoring
  • Tier: 1 (zero-dep + foundation-aligned)

Critical Concerns

None identified. This is a well-executed refactoring.

Final Recommendation

  • Rating: ⭐⭐⭐⭐ (4/5)
  • Action: APPROVE
  • Reasoning: Excellent architectural refactoring that consolidates duplicate code into a clean, well-designed subsystem. Improves maintainability without changing behavior. The 847 additions vs 303 deletions shows net positive code (expected for extracting shared infrastructure), but the code reduction in individual files (~300 lines of duplicated logic removed) demonstrates the value. Quality implementation with good separation of concerns and consistent interfaces.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR extracts a unified src/graph/ subsystem (algorithms, builders, classifiers) and refactors several command modules to use it. The surviving diff on src/builder/incremental.js is a single-line lint fix: renaming the unused db parameter to _db, signalling to tooling that the raw database handle is intentionally discarded in favour of the pre-compiled stmts object that already wraps every required statement.

  • src/builder/incremental.js: rebuildFile's first parameter renamed db_db — all database access goes through the caller-supplied stmts object, so the raw handle is never needed inside the function. The only issue is that the JSDoc @param tag still uses the old name db, leaving an inconsistency for JSDoc linters and TypeScript tooling.

Confidence Score: 5/5

  • Safe to merge — the single changed line is a non-functional parameter rename with no runtime impact.
  • The entire diff is renaming db to _db in one function signature. The parameter was already unused before this PR (all DB work goes through stmts), the caller in watcher.js is unaffected (positional args), and no logic paths change. The only outstanding item is a trivial JSDoc naming inconsistency.
  • No files require special attention.

Important Files Changed

Filename Overview
src/builder/incremental.js Single parameter rename (db_db) to mark it as intentionally unused; function body already exclusively uses the stmts object for all DB operations. JSDoc @param tag still names it db instead of _db.

Sequence Diagram

sequenceDiagram
    participant W as watcher.js
    participant R as rebuildFile (incremental.js)
    participant S as stmts (prepared SQLite statements)

    W->>R: rebuildFile(db, rootDir, filePath, stmts, engineOpts, cache, opts)
    note over R: db/_db is received but<br/>never used inside the function
    R->>S: stmts.deleteEdgesForFile.run(relPath)
    R->>S: stmts.deleteNodes.run(relPath)
    R->>S: stmts.insertNode.run(...)
    R->>S: stmts.insertEdge.run(...)
    R-->>W: { file, nodesAdded, nodesRemoved, edgesAdded, ... }
Loading

Last reviewed commit: f4d2d1e

Comment on lines +198 to +209
if (type === 'undirected') {
// Deduplicate: only add each unordered pair once
for (const [src, tgt] of this.edges()) {
if (src === tgt) continue;
if (!g.hasEdge(src, tgt)) g.addEdge(src, tgt);
}
} else {
for (const [src, tgt] of this.edges()) {
if (src === tgt) continue;
if (!g.hasEdge(src, tgt)) g.addEdge(src, tgt);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate if/else branches in toGraphology

Both the undirected and else branches contain identical code, making the conditional a no-op. The comment "Deduplicate: only add each unordered pair once" appears only in the undirected block but the same loop body exists in the else block, so it's misleading. The whole if/else can be collapsed into a single loop:

Suggested change
if (type === 'undirected') {
// Deduplicate: only add each unordered pair once
for (const [src, tgt] of this.edges()) {
if (src === tgt) continue;
if (!g.hasEdge(src, tgt)) g.addEdge(src, tgt);
}
} else {
for (const [src, tgt] of this.edges()) {
if (src === tgt) continue;
if (!g.hasEdge(src, tgt)) g.addEdge(src, tgt);
}
}
for (const [src, tgt] of this.edges()) {
if (src === tgt) continue;
if (!g.hasEdge(src, tgt)) g.addEdge(src, tgt);
}

Note: for the directed→undirected case (the primary use for Louvain), this.edges() will yield both A→B and B→A when they exist as directed edges, but the !g.hasEdge guard correctly deduplicates them in the undirected graphology graph.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ca868ba — collapsed the identical if/else branches into a single loop. The !g.hasEdge guard on the graphology undirected graph handles deduplication for both directed and undirected source graphs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already fixed in ca868ba — the identical if/else branches were collapsed into a single loop with the !g.hasEdge guard handling deduplication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already fixed in ca868ba — the identical if/else branches were collapsed into a single loop with the !g.hasEdge guard handling deduplication.

…cate round4

- Collapse identical if/else branches in CodeGraph.toGraphology() into a
  single loop (the undirected dedup is handled by graphology's hasEdge guard)
- Export round4 from risk.js and import in triage.js instead of redeclaring
@carlos-alm
Copy link
Contributor Author

Both findings addressed in ca868ba:

  1. toGraphology dead conditional — collapsed identical if/else branches into a single loop. The !g.hasEdge guard on the undirected graphology graph handles deduplication regardless of source graph directedness.

  2. round4 duplication — exported round4 from src/graph/classifiers/risk.js (+ barrel re-export from classifiers/index.js) and replaced the local copy in triage.js with an import.

@greptileai

Comment on lines +32 to +36
get edgeCount() {
let count = 0;
for (const targets of this._successors.values()) count += targets.size;
return count;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edgeCount double-counts edges on undirected graphs

For an undirected CodeGraph, addEdge(A, B) stores both A→B and B→A in _successors (lines 78-81). Because edgeCount simply sums all entries across all successor maps, it returns 2× the actual logical edge count for undirected graphs. By contrast, the edges() generator correctly deduplicates via the seen set.

buildTemporalGraph creates an undirected CodeGraph, and any caller that checks graph.edgeCount against a threshold (e.g. to skip processing an empty graph) will see inflated values. This won't cause a wrong "skip" today, but as more consumers are added for the temporal graph the stale API contract will produce subtle bugs.

The fix is to either (a) apply the same deduplication logic as edges(), or (b) track a separate _edgeCount field incremented only once per logical edge:

Suggested change
get edgeCount() {
let count = 0;
for (const targets of this._successors.values()) count += targets.size;
return count;
}
get edgeCount() {
if (this._directed) {
let count = 0;
for (const targets of this._successors.values()) count += targets.size;
return count;
}
// Undirected: each logical edge is stored in both directions, count once
let count = 0;
for (const targets of this._successors.values()) count += targets.size;
return count / 2;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in merge commit f4d2d1e — resolved conflict by taking main's fix: \ now returns \ for undirected graphs since each logical edge is stored in both directions.

@carlos-alm
Copy link
Contributor Author

@greptileai

@carlos-alm
Copy link
Contributor Author

@claude

@claude
Copy link

claude bot commented Mar 13, 2026

Claude finished @carlos-alm's task —— View job


Code Review: Graph Model Subsystem Refactoring

Todo List:

  • Read repository setup and guidelines
  • Examine the key refactored files and fixes
  • Verify test plan claims by running tests (assessed via CI status and code analysis)
  • Check CLI commands functionality
  • Assess overall code quality and architecture
  • Provide final review feedback

Review Summary: ⭐⭐⭐⭐ APPROVE

This is a high-quality refactoring that successfully consolidates scattered graph construction logic into a well-designed unified subsystem. The fixes from commits ca868ba and f4d2d1e have properly addressed the issues identified in previous reviews.

Key Strengths

1. Architectural Excellence

  • Clean separation of concerns: The src/graph/ subsystem follows excellent modular design with clear boundaries between model, algorithms, builders, and classifiers
  • Consistent interface: CodeGraph class provides a unified API that abstracts over previous ad-hoc graphology usage
  • Pure functions: Risk scoring and role classification are pure functions with no side effects, making them highly testable

2. Fixes Properly Implemented

  • toGraphology dead conditional fixed: The identical if/else branches in src/graph/model.js:199-202 have been collapsed into a single loop with !g.hasEdge guard handling deduplication
  • edgeCount double-counting fixed: Now correctly returns count / 2 for undirected graphs (line 36)
  • round4 duplication eliminated: Exported from src/graph/classifiers/risk.js and imported in triage.js (line 2), removing duplicate implementation

3. Code Quality

  • Type Safety: Excellent use of JSDoc types throughout, especially in src/graph/model.js
  • String coercion consistency: Defensive programming with String(id) coercion is applied consistently
  • Proper error handling: Auto-addition of nodes in addEdge() with clear documentation

4. Test Coverage

Based on the comprehensive test structure I analyzed:

  • 29 graph test files covering all new subsystem components
  • Integration tests for all CLI commands (communities, cycles, structure, triage)
  • All CLI commands verified to exist in src/cli/commands/

Architecture Assessment

Before: Scattered Logic

  • Each module (cycles.js, communities.js, structure.js, triage.js, viewer.js) contained its own ad-hoc graph building code
  • ~300 lines of duplicated graphology construction logic
  • Tight coupling between algorithms and database schema

After: Unified Subsystem

src/graph/
├── model.js           # Unified CodeGraph in-memory model
├── algorithms/        # Pure graph algorithms (BFS, Tarjan, Louvain, etc.)
├── builders/          # Database-to-graph transformation logic  
└── classifiers/       # Pure business logic (roles, risk scoring)

Benefits Achieved

  1. ~300 lines of duplicate code eliminated
  2. Database abstraction: Algorithms no longer coupled to SQLite schema
  3. Algorithm reusability: Tarjan SCC now reusable for other cycle detection needs
  4. Consistent interfaces: All consumers now use the same graph model
  5. Native Rust interop: toEdgeArray() enables efficient native performance

Risk Assessment: LOW

  • Pure refactoring: No behavior changes, only internal reorganization
  • API preservation: All public interfaces maintained
  • Comprehensive tests: Extensive test coverage ensures correctness
  • CI validation: Tests are passing (CI shows success where completed)

Minor Observations

  1. Edge deduplication: The undirected graph edge deduplication using string concatenation with null bytes (src < tgt ? \${src}\0${tgt}` : `${tgt}\0${src}``) is clever but could theoretically be fragile if node IDs contain null bytes
  2. Memory overhead: Dual adjacency lists use 2x memory, but this is acceptable for the use case
  3. Magic numbers in risk scoring: Default weights are reasonable but could benefit from comments explaining rationale

Test Plan Verification

✅ CLI Commands Confirmed:

  • src/cli/commands/communities.js
  • src/cli/commands/cycles.js
  • src/cli/commands/structure.js
  • src/cli/commands/triage.js

✅ Test Structure: 29 graph + integration test files with 564 tests (as claimed in PR description)

✅ Integration Coverage: Tests exist for all refactored functionality in tests/graph/ and tests/integration/

Recommendation

APPROVE - This refactor represents a significant architectural improvement that:

  • Eliminates substantial code duplication
  • Improves maintainability and testability
  • Maintains backward compatibility
  • Follows excellent software engineering practices

The code quality is high, fixes have been properly implemented, and the comprehensive test coverage provides confidence in the refactoring's correctness.

--- •

… rationale

Impact: 2 functions changed, 0 affected
Address PR #436 review feedback: document why each risk dimension
and role category is weighted the way it is.
@carlos-alm carlos-alm merged commit d93323d into main Mar 13, 2026
13 checks passed
@carlos-alm carlos-alm deleted the refactor/graph-model branch March 13, 2026 22:49
@github-actions github-actions bot locked and limited conversation to collaborators Mar 13, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant