Add duplicate detection #101
Merged
Pull request overview
Adds a new structural duplicate/near-duplicate detection capability to Codegraph, exposed as both a library API (findDuplicates) and a new codegraph duplicates CLI command, with supporting docs and tests.
Changes:
- Implement an in-memory duplicate detection engine over indexed symbol ranges + semantic chunks, producing scored suggestions with confidence tiers and omission counts.
- Add codegraph duplicates CLI command + top-level help integration and option parsing (--min-confidence, --max-bucket-size, etc.).
- Add documentation (CLI + library API + README + skill) and a new Vitest suite covering core scenarios and CLI JSON output.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/duplicates.test.ts | New tests for exact/near/renamed duplicates, small-unit filtering, same-file behavior, and CLI JSON output. |
| src/index.ts | Re-export duplicate detection API/types from the package root. |
| src/duplicates.ts | New duplicate detection engine (unit collection, fingerprinting, candidate generation, scoring, sorting, output). |
| src/cli/options.ts | Register new value-taking CLI options used by the duplicates command. |
| src/cli/help.ts | Add duplicates to command list + add command-specific help text. |
| src/cli/duplicates.ts | New CLI handler for building an index from scoped files and emitting duplicate suggestions JSON. |
| src/cli.ts | Wire duplicates into main CLI dispatch and include-root scoping logic. |
| README.md | Document duplicates feature and add example workflow usage. |
| docs/superpowers/plans/2026-05-19-duplicate-detection.md | Design plan describing approach, taxonomy, and rollout phases. |
| docs/library-api.md | Document findDuplicates() usage and options. |
| docs/cli.md | Document codegraph duplicates usage and guidance. |
| codegraph-skill/codegraph/SKILL.md | Add duplicates command to the bundled skill workflow suggestions. |
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/duplicates.ts:430
shouldKeepUnit() returns true for all units when includeSmall is set. Given that symbol units can be identifier/signature-only (very low token counts), --include-small can unintentionally include lots of tiny symbol/name units and produce false-positive “duplicates” based on names alone. Consider applying includeSmall only to chunk units, or still enforcing a minimal token/line span for kind: "symbol" units even when includeSmall is enabled.
function shouldKeepUnit(unit: DuplicateInternalUnit, includeSmall: boolean, minTokens: number): boolean {
  if (includeSmall) return true;
  return unit.tokenCount >= minTokens;
}
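One way to act on the suggestion above is to let includeSmall lift the limit for chunk units while still enforcing a small floor for symbol units. The sketch below is hypothetical: the kind and tokenCount fields are taken from the comment, but the reduced Unit type, the "chunk" literal, and the MIN_SYMBOL_TOKENS floor of 8 are assumptions for illustration, not values from this PR.

```typescript
// Reduced stand-in for DuplicateInternalUnit; only the fields used here (assumed shape).
type Unit = { kind: "symbol" | "chunk"; tokenCount: number };

// Hypothetical floor for symbol units; a real threshold would need tuning.
const MIN_SYMBOL_TOKENS = 8;

function shouldKeepUnit(unit: Unit, includeSmall: boolean, minTokens: number): boolean {
  // Default path is unchanged: every unit must reach minTokens.
  if (!includeSmall) return unit.tokenCount >= minTokens;
  // With includeSmall, symbol units still need a minimal body so that
  // identifier-only or signature-only units are not matched on names alone.
  if (unit.kind === "symbol") return unit.tokenCount >= MIN_SYMBOL_TOKENS;
  // Chunk units are kept regardless of size when includeSmall is set.
  return true;
}
```

Keeping the floor separate from minTokens means --include-small would still widen the results, just not down to name-only symbols.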
Comment on lines +27 to +33
  files: context.files,
  ...(minConfidence !== undefined ? { minConfidence } : {}),
  limit: parsePositiveIntegerOption(context.getOpt("--limit"), "--limit", 50),
  minTokens: parsePositiveIntegerOption(context.getOpt("--min-tokens"), "--min-tokens", 40),
  maxTokens: parsePositiveIntegerOption(context.getOpt("--max-tokens"), "--max-tokens", 800),
  maxBucketSize: parsePositiveIntegerOption(context.getOpt("--max-bucket-size"), "--max-bucket-size", 200),
  ...(context.hasFlag("--include-same-file") ? { includeSameFile: true } : {}),
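The snippet above calls parsePositiveIntegerOption with a raw option value, the flag name, and a default, but the helper's body is not shown on this page. A minimal sketch of what such a helper could look like follows. Only the three-argument call shape comes from the diff; the body is an assumption, including the choice to throw on invalid input.

```typescript
// Hypothetical implementation; the PR's actual helper may behave differently.
function parsePositiveIntegerOption(raw: string | undefined, name: string, fallback: number): number {
  // A missing option falls back to the supplied default.
  if (raw === undefined) return fallback;
  const text = raw.trim();
  // Accept plain decimal digits only, which rejects "abc", "-5", "1.5", and "".
  if (!/^[0-9]+$/.test(text)) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  const value = Number(text);
  // Reject zero and values too large to represent exactly.
  if (!Number.isSafeInteger(value) || value < 1) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}
```

Under these assumptions, an omitted --limit resolves to 50 and an invalid value such as "0" fails fast instead of silently disabling the limit.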
Comment on lines 346 to 352
const indexOpts: BuildOptions = {
  threads: options.threads ?? 0,
  discovery: context.discoveryOptions,
  onProgress: context.progressHandler,
  keepParsed: true,
  ...(context.nativeMode !== "auto" ? { native: context.nativeMode } : {}),
  ...context.workerOpts,
Summary
codegraph duplicates with scoped roots, confidence filters, token bounds, and JSON output
Verification