Add duplicate detection #101

Merged
lzehrung merged 15 commits into main from implement-duplicate-detection on May 21, 2026

Conversation

@lzehrung
Owner

Summary

  • add a structural duplicate detection engine over indexed symbols and semantic chunks
  • add a codegraph duplicates CLI command with scoped roots, confidence filters, token bounds, and JSON output
  • document the CLI, library API, README workflow, bundled skill, and design plan

Verification

  • npm run lint
  • npx tsc -p tsconfig.json --noEmit
  • git diff --check
  • npx vitest run tests/duplicates.test.ts tests/cli-command-modules.test.ts
  • npm run build
  • npm run test:ci
  • npx tsx src/cli.ts review --base origin/main --head HEAD --summary
  • npx tsx src/cli.ts duplicates src --min-confidence high --limit 5

Copilot AI left a comment

Pull request overview

Adds a new structural duplicate/near-duplicate detection capability to Codegraph, exposed as both a library API (findDuplicates) and a new codegraph duplicates CLI command, with supporting docs and tests.

Changes:

  • Implement an in-memory duplicate detection engine over indexed symbol ranges + semantic chunks, producing scored suggestions with confidence tiers and omission counts.
  • Add codegraph duplicates CLI command + top-level help integration and option parsing (--min-confidence, --max-bucket-size, etc.).
  • Add documentation (CLI + library API + README + skill) and a new Vitest suite covering core scenarios and CLI JSON output.
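
To make the first item above more concrete, here is a minimal sketch of fingerprint-based candidate generation with a bucket-size cap and an omission count. It is not the code in src/duplicates.ts: the `Unit` shape, the `fingerprint` normalization, the keyword list, and the `candidatePairs` name are all assumptions made for illustration.

```typescript
// Hypothetical unit shape; the real internal unit type in src/duplicates.ts may differ.
interface Unit {
  id: string;
  tokens: string[];
}

interface CandidatePair {
  a: string;
  b: string;
}

interface BucketResult {
  pairs: CandidatePair[];
  omittedBuckets: number; // buckets skipped because they exceeded maxBucketSize
}

// Assumed keyword list, deliberately tiny; a real tokenizer would know the full grammar.
const KEYWORDS = new Set(["function", "return", "const", "let", "if", "else", "for", "while"]);

// Assumed normalization: identifiers collapse to "ID" so renamed copies share a fingerprint.
function fingerprint(tokens: string[]): string {
  return tokens
    .map((t) => (/^[A-Za-z_$][\w$]*$/.test(t) && !KEYWORDS.has(t) ? "ID" : t))
    .join(" ");
}

function candidatePairs(units: Unit[], maxBucketSize: number): BucketResult {
  // Group units by fingerprint.
  const buckets = new Map<string, Unit[]>();
  for (const unit of units) {
    const key = fingerprint(unit.tokens);
    const bucket = buckets.get(key);
    if (bucket) bucket.push(unit);
    else buckets.set(key, [unit]);
  }

  const pairs: CandidatePair[] = [];
  let omittedBuckets = 0;
  for (const bucket of buckets.values()) {
    if (bucket.length < 2) continue; // nothing to compare
    if (bucket.length > maxBucketSize) {
      omittedBuckets += 1; // avoid quadratic blow-up, but report the omission
      continue;
    }
    bucket.forEach((left, i) => {
      for (const right of bucket.slice(i + 1)) {
        pairs.push({ a: left.id, b: right.id });
      }
    });
  }
  return { pairs, omittedBuckets };
}
```

Capping the bucket size bounds the pairwise comparison cost, and counting the skipped buckets lets the output say that something was left out instead of silently dropping it.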

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/duplicates.test.ts | New tests for exact/near/renamed duplicates, small-unit filtering, same-file behavior, and CLI JSON output. |
| src/index.ts | Re-export duplicate detection API/types from the package root. |
| src/duplicates.ts | New duplicate detection engine (unit collection, fingerprinting, candidate generation, scoring, sorting, output). |
| src/cli/options.ts | Register new value-taking CLI options used by the duplicates command. |
| src/cli/help.ts | Add duplicates to command list + add command-specific help text. |
| src/cli/duplicates.ts | New CLI handler for building an index from scoped files and emitting duplicate suggestions JSON. |
| src/cli.ts | Wire duplicates into main CLI dispatch and include-root scoping logic. |
| README.md | Document duplicates feature and add example workflow usage. |
| docs/superpowers/plans/2026-05-19-duplicate-detection.md | Design plan describing approach, taxonomy, and rollout phases. |
| docs/library-api.md | Document findDuplicates() usage and options. |
| docs/cli.md | Document codegraph duplicates usage and guidance. |
| codegraph-skill/codegraph/SKILL.md | Add duplicates command to the bundled skill workflow suggestions. |

Comment thread src/duplicates.ts Outdated
Comment thread src/duplicates.ts Outdated
Comment thread src/duplicates.ts Outdated
Comment thread src/duplicates.ts

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Comment thread src/duplicates.ts Outdated
Comment thread src/duplicates.ts Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Comment thread src/duplicates.ts Outdated
lzehrung requested a review from Copilot on May 21, 2026, 17:37

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/duplicates.ts:430

  • shouldKeepUnit() returns true for all units when includeSmall is set. Given that symbol units can be identifier/signature-only (very low token counts), --include-small can unintentionally include lots of tiny symbol/name units and produce false-positive “duplicates” based on names alone. Consider applying includeSmall only to chunk units, or still enforcing a minimal token/line span for kind: "symbol" units even when includeSmall is enabled.
function shouldKeepUnit(unit: DuplicateInternalUnit, includeSmall: boolean, minTokens: number): boolean {
  if (includeSmall) return true;
  return unit.tokenCount >= minTokens;
}
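
One way to read the second suggestion in this comment is sketched below: symbol units keep a small floor even when includeSmall is set, while other units behave as before. The trimmed `DuplicateInternalUnit` type, the `"chunk"` kind name, and the `MIN_SYMBOL_TOKENS` value are assumptions for illustration, not what the PR merged.

```typescript
// Hypothetical, trimmed-down unit type. Only kind "symbol" and tokenCount
// appear in the review comment; "chunk" is an assumed name for chunk units.
interface DuplicateInternalUnit {
  kind: "symbol" | "chunk";
  tokenCount: number;
}

// Assumed floor for symbol units; the value 8 is illustrative only.
const MIN_SYMBOL_TOKENS = 8;

function shouldKeepUnit(
  unit: DuplicateInternalUnit,
  includeSmall: boolean,
  minTokens: number,
): boolean {
  // Symbol units can be identifier/signature-only, so keep a floor for them
  // even under includeSmall. The floor never exceeds the caller's minTokens.
  if (unit.kind === "symbol") {
    const floor = Math.min(MIN_SYMBOL_TOKENS, minTokens);
    if (unit.tokenCount < floor) return false;
  }
  if (includeSmall) return true;
  return unit.tokenCount >= minTokens;
}
```

With this variant, a three-token symbol unit is dropped regardless of includeSmall, whereas a three-token chunk unit is still kept when includeSmall is set.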

Comment thread src/duplicates.ts
Comment thread src/cli/duplicates.ts

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread src/duplicates.ts
Comment thread src/duplicates.ts

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread src/duplicates.ts Outdated
Comment thread src/cli/help.ts

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread src/cli/duplicates.ts
Comment on lines +27 to +33
files: context.files,
...(minConfidence !== undefined ? { minConfidence } : {}),
limit: parsePositiveIntegerOption(context.getOpt("--limit"), "--limit", 50),
minTokens: parsePositiveIntegerOption(context.getOpt("--min-tokens"), "--min-tokens", 40),
maxTokens: parsePositiveIntegerOption(context.getOpt("--max-tokens"), "--max-tokens", 800),
maxBucketSize: parsePositiveIntegerOption(context.getOpt("--max-bucket-size"), "--max-bucket-size", 200),
...(context.hasFlag("--include-same-file") ? { includeSameFile: true } : {}),
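
The excerpt above combines two patterns: a numeric option parser that takes a fallback, and conditional spreads that leave optional keys out entirely when they are not supplied. The sketch below shows how both might work. The body of `parsePositiveIntegerOption` is an assumption inferred only from its call shape (raw value, option name, default), and `DuplicateCliOptions` and `buildOptions` are hypothetical stand-ins for the real handler.

```typescript
// Assumed behavior: undefined falls back to the default, anything that is
// not a positive integer throws. The real helper may behave differently.
function parsePositiveIntegerOption(
  raw: string | undefined,
  name: string,
  fallback: number,
): number {
  if (raw === undefined) return fallback;
  const text = raw.trim();
  if (!/^[1-9]\d*$/.test(text)) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  const value = Number(text);
  if (!Number.isSafeInteger(value)) {
    throw new Error(`${name} is too large: "${raw}"`);
  }
  return value;
}

// Hypothetical subset of the options object built in src/cli/duplicates.ts.
interface DuplicateCliOptions {
  minConfidence?: string;
  limit: number;
  minTokens: number;
  includeSameFile?: boolean;
}

// Hypothetical stand-in for the handler: args and flags replace context.getOpt/hasFlag.
function buildOptions(args: Map<string, string>, flags: Set<string>): DuplicateCliOptions {
  const minConfidence = args.get("--min-confidence");
  return {
    // Spread an empty object when unset, so the key is absent rather than undefined.
    ...(minConfidence !== undefined ? { minConfidence } : {}),
    limit: parsePositiveIntegerOption(args.get("--limit"), "--limit", 50),
    minTokens: parsePositiveIntegerOption(args.get("--min-tokens"), "--min-tokens", 40),
    ...(flags.has("--include-same-file") ? { includeSameFile: true } : {}),
  };
}
```

Omitting the key, instead of setting it to undefined, lets the callee apply its own defaults and keeps the object compatible with stricter optional-property checks.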
Comment thread src/cli/impact.ts
Comment on lines 346 to 352
const indexOpts: BuildOptions = {
  threads: options.threads ?? 0,
  discovery: context.discoveryOptions,
  onProgress: context.progressHandler,
  keepParsed: true,
  ...(context.nativeMode !== "auto" ? { native: context.nativeMode } : {}),
  ...context.workerOpts,

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comment thread src/duplicates.ts
Comment thread src/duplicates.ts Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.

lzehrung merged commit 91b9f34 into main on May 21, 2026
7 checks passed
2 participants