5 changes: 5 additions & 0 deletions README.md
@@ -57,6 +57,7 @@ Use Codegraph when you need fast structural answers about a repo without relying
- Cross-file go-to-definition and find-references support across the shared source-language pipeline.
- Deterministic agent search, bounded explanation packets, portable artifact bundles, and MCP tools across files, symbols, chunks, SQL objects, and graph neighborhoods with stable follow-up handles.
- Semantic chunking for code and text files, including Vue and Svelte single-file component block splitting.
- Duplicate and near-duplicate detection over indexed symbols, semantic chunks, and text chunks.
- AST grep, public API summaries, unresolved import reports, hotspot analysis, cycle detection, and shortest dependency paths.
- PR impact analysis and review bundles that map diffs to changed symbols, impacted code, likely tests, and graph deltas.
- SQL language support for `.sql` files, including statement chunks, object symbols, SQL-to-SQL graph edges, SQL navigation, and statement facts.
@@ -111,6 +112,9 @@ node ./dist/cli.js graph --root . ./src --compact-json --output codegraph.json

# inspect public API surface
node ./dist/cli.js apisurface

# find duplicate and near-duplicate code
node ./dist/cli.js duplicates ./src --min-confidence medium --limit 20
```

If you install the published CLI instead of using a source checkout, replace `node ./dist/cli.js` with `codegraph`.
@@ -190,6 +194,7 @@ The supported package import surface is the root export, `@lzehrung/codegraph`.
## Common workflows

- Repo triage: run `codegraph inspect ./src --limit 20`, then follow with `codegraph hotspots ./src --limit 20` or `codegraph unresolved` to focus the next pass.
- Duplicate cleanup: run `codegraph duplicates ./src --min-confidence medium` before refactors to find candidates for extraction into shared code.
- Symbol navigation: use `codegraph goto <file> <line> <column>` and `codegraph refs --file <file> --line <line> --col <column> --pretty` when a question is about definitions or semantic usages rather than matching strings.
- PR review: run `codegraph impact --base origin/main --head HEAD --pretty` for a ranked map, `codegraph review --base origin/main --head HEAD --summary` for a compact reviewer handoff with actionable candidate tests, or redirect plain `review` output when a downstream tool needs the full JSON bundle.
- Worktree review: run `codegraph impact --base HEAD --head WORKTREE --pretty` for current staged and unstaged tracked-file changes, then `codegraph review --base HEAD --head WORKTREE --summary` for a compact handoff. Use `--head STAGED` to compare `HEAD` against the current index.
7 changes: 7 additions & 0 deletions codegraph-skill/codegraph/SKILL.md
@@ -48,6 +48,7 @@ Then choose the narrowest follow-up command:
- Review handoff: `codegraph review --base HEAD --head WORKTREE --summary`
- Full review JSON: `codegraph review --base origin/main --head HEAD`
- Public API: `codegraph apisurface`
- Duplicate cleanup: `codegraph duplicates --root . ./src --min-confidence medium`
- Chunks: `codegraph chunk <file>`
- Artifact bundle: `codegraph artifact build --root . --out codegraph-out --json`
- MCP server: `codegraph mcp serve --root . --stdio` or `codegraph mcp serve --root . --port 7331`
@@ -210,6 +211,12 @@ For git-provider impact and git-scoped review/index/graph commands, `WORKTREE` c
Reports source dependency cycles; document-only link loops remain graph edges but are filtered from cycle warnings.
- Public API surface:
`codegraph apisurface`
- Duplicate and near-duplicate code:
`codegraph duplicates --root . ./src --min-confidence medium`
Covers indexed symbols, semantic chunks, and text chunks.
A single positional directory becomes the project root unless `--root` is set.
Use `--include-small` for tiny helpers.
Use `--include-same-file` for local clone cleanup.
- Unresolved project imports:
`codegraph unresolved`
Excludes graph-only document/template link edges plus known runtime/package externals: supported-language standard libraries, URL imports, and dependencies declared in nearby manifests such as `package.json`, Python, PHP, Rust, Go, Zig, Ruby, Java/Kotlin, .NET, C/C++, and Swift package manifests.
14 changes: 14 additions & 0 deletions docs/cli.md
@@ -147,6 +147,11 @@ codegraph chunk package.json --text --max-tokens 200
# Override language detection and token limits
codegraph chunk config.yaml --language yaml --min-tokens 100 --max-tokens 300

# Detect duplicate and near-duplicate code units
codegraph duplicates ./src --min-confidence medium --limit 20
codegraph duplicates --root . ./src ./packages/app --include-same-file
codegraph duplicates --help

# Go to definition
codegraph goto <file> <line> <column>

@@ -161,6 +166,15 @@ codegraph grep --query '(function_declaration name: (identifier) @name)'
codegraph grep --pattern 'eval\(' --ignore-case
```

`duplicates` always reports scored exact, renamed, near, and weak clone candidates as JSON.

- It combines indexed symbols, semantic chunks, and text chunks.
- It reports project-relative paths, confidence, clone type, metrics, omission counts, and pair stats.
- A single positional directory becomes the project root unless `--root` is set.
- Use `--root . ./src` for scoped scans with repository-relative paths.
- Use `--include-small` for tiny helpers.
- Use `--include-same-file` for non-overlapping clones inside one file.
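
The flags listed above can be assembled programmatically when a script or agent drives the CLI. The following is a minimal sketch, not part of Codegraph itself: `DuplicatesCliOptions`, `buildDuplicatesArgs`, and `runDuplicates` are hypothetical helper names, and it assumes the `codegraph` binary is on `PATH` and that `--min-confidence` accepts the same `high`, `medium`, and `low` values as the library option. The JSON report is returned as `unknown` because its exact shape is not specified here.

```ts
import { execFileSync } from "node:child_process";

// Hypothetical option bag; only the flag names come from the CLI docs above.
interface DuplicatesCliOptions {
  root?: string;
  paths: string[];
  // Assumption: the CLI accepts the same levels as the library option.
  minConfidence?: "high" | "medium" | "low";
  limit?: number;
  includeSmall?: boolean;
  includeSameFile?: boolean;
}

// Build the argument vector for `codegraph duplicates`.
function buildDuplicatesArgs(options: DuplicatesCliOptions): string[] {
  const args = ["duplicates"];
  if (options.root !== undefined) args.push("--root", options.root);
  args.push(...options.paths);
  if (options.minConfidence !== undefined) {
    args.push("--min-confidence", options.minConfidence);
  }
  if (options.limit !== undefined) args.push("--limit", String(options.limit));
  if (options.includeSmall) args.push("--include-small");
  if (options.includeSameFile) args.push("--include-same-file");
  return args;
}

// Run the CLI and parse stdout as JSON; no report fields are assumed.
function runDuplicates(options: DuplicatesCliOptions): unknown {
  const stdout = execFileSync("codegraph", buildDuplicatesArgs(options), {
    encoding: "utf8",
  });
  return JSON.parse(stdout);
}
```

Under these assumptions, `runDuplicates({ root: ".", paths: ["./src"], minConfidence: "medium", limit: 20 })` would invoke `codegraph duplicates --root . ./src --min-confidence medium --limit 20`. Passing the arguments as an array to `execFileSync` avoids shell quoting issues for paths that contain spaces.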

`search`, `explain`, `artifact`, and `mcp` each support command-specific `--help` output so agents do not have to infer their options from the top-level help. `search` is deterministic and vectorless. It returns ranked results with project-relative stable handles, rank reasons, evidence, graph neighbors, follow-up commands, result counts, per-packet limits, and omission counts. `explain` resolves file paths, symbol names, SQL object names, and search handles, including file/chunk/graph handles, into bounded packets with symbols, dependencies, reverse dependencies, references, snippets, SQL object relation facts, changed-context review tasks/candidate tests, explicit limits, omission counts, and follow-ups. Generated follow-up and suggested-question commands POSIX-shell-quote dynamic arguments when needed. SQL object names resolve by exact name first; unqualified basenames resolve only when unique, so handles or schema-qualified names are preferred. Reference and snippet omission counts are lower bounds after the bounded navigation scan reaches its cap. `artifact build` writes `codegraph.sqlite`, self-describing project-relative `graph.json`, `CODEGRAPH_REPORT.md`, `questions.json`, and `manifest.json` by default; suggested questions use unique IDs backed by stable handles when a handle is available. Use artifact flags to select a subset. `--force` permits non-empty output directories, removes recognizable stale Codegraph artifacts, preserves unrelated operator files, and refuses unrecognized reserved-name collisions. Artifact contents exclude their own output directory and linked outside-root files. `mcp serve` exposes `search`, `get_file`, `get_symbol`, `goto`, `refs`, `deps`, `rdeps`, `path`, `impact`, `review`, `query_sqlite`, and `artifact_build` over stdio by default or Streamable HTTP with `--port <number>`. 
HTTP serves `/mcp`, binds to `127.0.0.1` unless `--host <host>` is passed, validates the Host header, allows loopback Host headers for wildcard binds, and rejects oversized request bodies. MCP file and artifact paths are confined to `--root` after realpath resolution; tools are read-only by default, `query_sqlite` is row- and byte-bounded and rejects synthetic payload functions, and `--allow-build` enables artifact output only. `chunk` uses semantic Tree-sitter chunking for registered source and stylesheet languages, Vue and Svelte block-aware chunking for single-file components, and text chunking for JSON, YAML, and unsupported extensions. Use `--text` to force text chunking.

### Dependency analysis and diagnostics
32 changes: 32 additions & 0 deletions docs/library-api.md
@@ -213,6 +213,38 @@ See the test suites for concrete examples:

The integration examples demonstrate semantic chunking with type-based filtering, text-file chunking for configuration processing, intelligent splitting of large blocks, and metadata useful for embeddings or retrieval pipelines.

## Duplicate detection

`findDuplicates()` scans a built `ProjectIndex` for exact, renamed, near, and weak clone candidates.

- It uses indexed symbols, semantic chunks, and text chunks.
- Results include confidence, score, clone type, metrics, omission counts, and pair stats.
- Paths are project-relative when the index has a project root.

```ts
import { buildProjectIndex, findDuplicates } from "@lzehrung/codegraph";

const root = process.cwd();
const index = await buildProjectIndex(root);
const duplicates = await findDuplicates(index, {
  minConfidence: "medium",
  limit: 20,
});

console.log(duplicates.suggestions);
```

Useful options:

- `minConfidence`: `high`, `medium`, or `low`; default `medium`.
- `includeSameFile`: report non-overlapping clones in the same file.
- `includeSmall`: include units below the default token floor.
- `minTokens` and `maxTokens`: tune unit and fallback chunk bounds.

Tests:

- `tests/duplicates.test.ts`

## Basic index building

Build a full project index and use go-to-definition: