Adds two opt-in tools that help AI agents understand the OpenMC codebase:
1. Repo Map: Tree-sitter based structural overview (~160 lines) showing the most important classes, functions, and relationships, ranked by cross-file usage via PageRank.
2. RAG Semantic Search: Vector-based search across all source code, tests, and documentation using sentence-transformers + LanceDB. Enables finding cross-cutting concerns (e.g., "where are particle seeds initialized") even when naming differs across code paths.
Both tools are fully opt-in via /enable-openmc-index (per-session) and rebuilt via /refresh-openmc-index. No API keys, no cloud services, no settings.json changes required. All generated artifacts live in the gitignored .claude/cache/ directory.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
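The core retrieval idea behind the RAG search can be illustrated with a minimal sketch: embed chunks as vectors, then rank them by cosine similarity to the query vector. The real tool uses sentence-transformers embeddings stored in LanceDB; the toy 3-d vectors, chunk texts, and function names below are hypothetical stand-ins.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, index, k=2):
    # index: list of (chunk_text, embedding) pairs; return top-k texts
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# Toy 3-d "embeddings" standing in for sentence-transformers output
index = [
    ("init_particle_seed()", [0.9, 0.1, 0.0]),
    ("tally reset logic",    [0.1, 0.9, 0.1]),
    ("RNG stream setup",     [0.8, 0.2, 0.1]),
]
print(search([1.0, 0.0, 0.0], index))  # → ['init_particle_seed()', 'RNG stream setup']
```

This is why a query about "particle seeds" can surface RNG setup code even when the two share no identifier text: proximity is measured in embedding space, not by string match.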
Add examples to openmc_search.py --help output. Update enable-openmc-index skill to have the agent run --help to learn the full API rather than duplicating usage docs. Subagent guidance also references --help. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the custom generate_repomap.py (which produced a flat list of
function signatures) and replace with openmc_map.py, a thin wrapper
around aider's RepoMap. This generates contextual, focused code structure
maps using tree-sitter + PageRank, showing condensed class/function
skeletons with elided bodies.
The two tools now serve complementary purposes:
- openmc_search.py: semantic RAG search ("find code related to X")
- openmc_map.py: structural map ("show me the shape of these files")
The map tool generates maps on the fly (no pre-built index needed),
so only the RAG search index needs refreshing after code changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aider's RepoMap excludes chat_fnames from output (since they're already in the chat). We want the opposite: show focus files AND their neighbors. Now passes focus files via mentioned_fnames to boost ranking while keeping them in the output. Also suppresses aider's stderr noise. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aider's design is correct for our use case: when the agent passes focus files, it already has those files in context. The map should show the surrounding context (headers, dependencies, neighbors) not the files themselves. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Agents reflexively pipe commands through head/tail to conserve context, which defeats the purpose of pre-budgeted tools like openmc_search.py and openmc_map.py. Add explicit CLAUDE.md instruction to always read full output from these tools. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… files Add openmc_lsp.py which uses clangd's Language Server Protocol for compiler-accurate symbol resolution — go-to-definition, find-references, and related-file discovery with zero false edges from name collisions. Also suppress ubiquitous utility files (error.h, constants.h, span.h, etc.) from the aider repo map output to improve its signal-to-noise ratio. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add openmc_lsp.py to the enable-openmc-index skill workflow, including prerequisite checks for clangd and compile_commands.json. Update CLAUDE.md to recommend the LSP tool for C++ code navigation over the aider repo map. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LSP tool already searches for clangd dynamically (clangd, clangd-15, clangd-16, etc.). Docs should not pin a specific version. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All three tools are independent; no need to single one out. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite all three doc files with fresh eyes: - Remove redundant/prescriptive content from enable skill - Remove singling out of LSP tool as special/optional - Simplify clangd prerequisite handling (tool has its own error messages) - Update refresh skill to mention LSP also doesn't need refreshing - Simplify CLAUDE.md to avoid implying tools are either/or Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Help agents make informed tool choices by explaining what each tool does under the hood: semantic search uses vector embeddings, the repo map uses tree-sitter name-matching (which creates false edges from common method names), and the LSP tool uses the C++ compiler's type system (zero false edges). Agents should know the repo map's ranking is unreliable for determining which files are truly connected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The aider repo map excludes focus files from output (assumes they're already in context) and shows condensed skeletons of neighboring files. The docs incorrectly described it as showing the focus file's code. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
It does definition, references, symbols, and related — not just "what files reference this symbol." Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explain the mechanics: RAG search uses sentence-transformers embeddings in LanceDB, the repo map uses a tree-sitter reference graph with PageRank fitted to a token budget, and the LSP tool talks to clangd's compiler frontend. Agents need this context to understand when to trust each tool's output and what its blind spots are. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
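The PageRank step the repo map relies on can be sketched in a few lines: treat files as nodes, cross-file references as edges, and iterate the rank update until it settles. The toy file names below are hypothetical; aider's actual implementation builds the graph from tree-sitter tags and fits the ranked output to a token budget.

```python
def pagerank(graph, damping=0.85, iters=50):
    # graph: {node: [nodes it references]}
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if not outs:
                # Dangling node: spread its rank uniformly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
            else:
                for m in outs:
                    new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

# Toy reference graph: both other files reference simulation.cpp
graph = {
    "tally.cpp": ["simulation.cpp"],
    "particle.cpp": ["simulation.cpp", "tally.cpp"],
    "simulation.cpp": [],
}
ranks = pagerank(graph)
# simulation.cpp ranks highest: it is referenced by both other files
```

Because the edges come from name matching rather than type resolution, a common method name like `reset()` can create edges between unrelated subsystems, which is exactly the false-edge caveat described above.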
The repo map's name-matching can surface identically-named functions across unrelated subsystems that may need parallel changes — something the LSP tool would miss. Present the trade-off rather than prescribing one tool over the other. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…matic OpenMC's CMakeLists.txt now enables CMAKE_EXPORT_COMPILE_COMMANDS by default, so compile_commands.json is generated automatically on every cmake build. Remove instructions telling users/agents to pass the flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The first record-building loop (using slow `chunk in code_chunks` identity comparison) was replaced by an index-based approach but never deleted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The TF-IDF+SVD fallback was never exercised since sentence-transformers is in requirements.txt and always installs. Simplifies embeddings.py from 101 lines to 32 and removes the unnecessary ABC/factory pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop tree-sitter-based function/class chunking in favor of simple sliding windows (1000 chars, 25% overlap). This ensures every line of code is searchable — long functions no longer have their tails invisible to the embedding model. Removes tree-sitter, tree-sitter-python, and tree-sitter-cpp dependencies. Index builds in ~5 min on 10 cores. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
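The sliding-window scheme described above can be sketched as follows; the function name and dict layout are hypothetical, but the parameters (1000-character windows, 25% overlap, line-number metadata) match the commit.

```python
def chunk_file(text, size=1000, overlap=0.25):
    """Split text into fixed-size windows with fractional overlap,
    recording 1-based start/end line numbers for each chunk."""
    step = int(size * (1 - overlap))  # advance 750 chars per window
    chunks = []
    pos = 0
    while pos < len(text):
        piece = text[pos:pos + size]
        start_line = text.count("\n", 0, pos) + 1
        end_line = start_line + piece.count("\n")
        chunks.append({"text": piece, "start": start_line, "end": end_line})
        if pos + size >= len(text):
            break  # this window already reaches the end of the file
        pos += step
    return chunks
```

Unlike function/class chunking, every character lands in at least one window, so a 3000-character function body cannot silently fall outside the embedded text.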
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use the official transformers API (TRANSFORMERS_VERBOSITY=error, transformers.logging.disable_progress_bar()) and HuggingFace Hub settings (token=False, local_files_only=True) to suppress load reports, auth warnings, and weight-loading progress bars. Embedding progress bars during indexing are preserved (show_progress_bar=True on .encode() uses sentence-transformers' own progress bar). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
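A minimal sketch of the suppression setup: the environment variable must be exported before transformers is imported, and the progress-bar/offline calls run afterwards. `TRANSFORMERS_VERBOSITY` and `transformers.logging.disable_progress_bar()` are the official API named in the commit; the model name below is a hypothetical example.

```python
import os

# Must be set before `import transformers` to take effect
os.environ["TRANSFORMERS_VERBOSITY"] = "error"

# After importing, disable the weight-loading progress bars:
# import transformers
# transformers.logging.disable_progress_bar()

# Hub settings from the commit, passed at model load time (sketch):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2",
#                             token=False, local_files_only=True)

# Embedding progress bars are kept by opting in per call:
# model.encode(texts, show_progress_bar=True)
```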
- chunker.py docstring: said 50% overlap, actually 25% - indexer.py: remove unused pyarrow import - requirements.txt: remove pygls (LSP tool uses raw JSON-RPC), pyarrow (transitive dep of lancedb), numpy (transitive dep of sentence-transformers) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A diff-only review misses cross-file impacts. The tools (especially LSP references and RAG search) help reviewers understand what else in the codebase depends on or is affected by the changed code. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The agent's trained instinct is to reach for grep/Read, which only finds exact text matches. Add a mandatory demo step in the skill that shows the RAG tool finding cross-cutting results grep would miss, and add guidance in CLAUDE.md to use RAG search before grep when exploring unfamiliar code or checking change impact. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Based on real-world feedback from a PR review session where the agent felt overly compelled to use RAG search for exact symbol lookups that grep handles better. RAG is for semantic discovery across subsystems; grep is for precise symbol tracing. Don't force one when the other is the right tool. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The references and definition commands used first-non-whitespace position when no column was given, landing on the return type (e.g., `int`) instead of the function name (e.g., `openmc_run`). Added find_symbol_on_line() which uses clangd's document symbols to locate the actual symbol name, with a keyword-skipping fallback for lines without symbol definitions. Added Step 5 to the enable-openmc-index skill demonstrating LSP's type-accurate references using Tally::reset() — where grep returns 62 mixed hits across 20 files but LSP resolves exactly the 10 files that reference this specific class's reset(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
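The keyword-skipping fallback can be sketched as below. The function name, and the token set it skips, are hypothetical; the actual tool first asks clangd for document symbols and only falls back to this heuristic on lines without a symbol definition.

```python
import re

# C++ keywords/types that commonly precede a function name at line start
LEADING_TOKENS = {"int", "void", "bool", "double", "float", "static",
                  "inline", "const", "virtual", "extern", "auto", "unsigned"}

def fallback_symbol_column(line):
    """Return the 0-based column of the first identifier that is not a
    leading keyword/type, e.g. `openmc_run` in `int openmc_run()`."""
    for match in re.finditer(r"[A-Za-z_]\w*", line):
        if match.group(0) not in LEADING_TOKENS:
            return match.start()
    # No identifier found: point at the first non-blank character
    return len(line) - len(line.lstrip())

print(fallback_symbol_column("int openmc_run()"))  # → 4, the "o" of openmc_run
```

Anchoring on the symbol name rather than the return type is what lets clangd resolve references for the intended function instead of the type keyword.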
Pull request overview
This PR adds a local, MCP-exposed RAG semantic search toolchain intended to help AI coding agents navigate the OpenMC repository by meaning-based retrieval, and documents how contributors/agents should use it.
Changes:
- Registers an MCP server (openmc-code-tools) via .mcp.json and provides bootstrap/start scripts to run it with an isolated Python venv.
- Implements local indexing + semantic search over OpenMC code/docs using sentence-transformers embeddings stored in LanceDB.
- Adds developer documentation and agent guidance (Devguide + AGENTS.md + CLAUDE.md) describing the tools and workflow.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| docs/source/devguide/index.rst | Adds the new devguide page to the toctree. |
| docs/source/devguide/agentic-tools.rst | Documents MCP-based agent tools and the RAG semantic search. |
| CLAUDE.md | Claude Code-facing guidance for using the RAG tools (first-call behavior). |
| AGENTS.md | General agent/developer documentation for the new MCP tools. |
| .mcp.json | Registers the MCP server command/args at repo root. |
| .gitignore | Ignores generated/cache artifacts under .claude/cache/. |
| .claude/tools/start_server.sh | Bootstraps a venv and launches the MCP server. |
| .claude/tools/requirements.txt | Declares Python dependencies for the MCP server + RAG tooling. |
| .claude/tools/rag/openmc_search.py | Implements query-time semantic search and “related file” search. |
| .claude/tools/rag/indexer.py | Implements index build/rebuild pipeline over code/docs. |
| .claude/tools/rag/embeddings.py | Wraps sentence-transformers and configures HF/transformers env behavior. |
| .claude/tools/rag/chunker.py | Splits files into overlapping chunks with line number metadata. |
| .claude/tools/openmc_mcp_server.py | Exposes openmc_rag_search / openmc_rag_rebuild via MCP and manages session state. |
Description
This PR introduces a new Retrieval-Augmented Generation (RAG) search index tool (similar to what Cursor provides) that agents can use when working on OpenMC. The tool runs locally and is implemented in Python. It is configured as an MCP server that is registered with agents as a formal tool via the .mcp.json file in the top OpenMC directory. The RAG search tool works out of the box with Claude Code, and should be automatically discoverable by other agents, though I haven't tried Codex yet. It's possible that other agents may need an added .md or .json file to register the tool.
Motivation & Impact
The reason I've made this tool is that I've found Claude Code to be very powerful but very narrow. For instance, when I ask it to review code, it typically reviews the diff (and potentially any files the diff touches), but is blind to how these changes may impact other regions of the code. It generally lacks any global vision of what is going on and is very surgical about what it chooses to look at. I think this agentic strategy works great for well-constrained problems with clear victory conditions and thorough test cases to pass. However, it fails at more open-ended tasks like code review, where you really need some level of global awareness of the code.
To remedy this, the agent can use the RAG tool to search for more general terms or concepts that may appear across multiple files without sharing identical function names. Basically, I think this tool helps to widen the context that is available to the agent without having it try to tokenize the entire repository.
Implementation
I've done a lot of experimentation balancing result quality vs. indexing time, and have put some thought into the chunking process. I tested some of the larger models, but the next model up took closer to 15 minutes to index, which gets annoying, and I don't think the increased quality is really needed just to navigate the repo. The small model (from Hugging Face, via the sentence-transformers Python package) seems to be sufficient.
This PR was mostly the work of Claude Code, though I did read through and add a lot of comments/cleanup to the Python implementation so it would be human readable.
Limitations and Future Work
The tool uses an MCP interface, so it should be portable to other agents. However, I haven't tested it with Codex yet, so Codex users would be welcome to add any needed .md files/pointers etc. to get the tools registered there.
The RAG just covers the C++ codebase, python codebase, tests (python files), and docs. In the future, I'd like to either extend the current RAG tool (or add a second one) that can index a large set of example input decks for larger problems, so as to provide more context to the agent for how larger OpenMC inputs are composed and organized. The agents are already decent at this, but these sorts of RAG tools can still be a nice value add.
I have also been experimenting with a Language Server Protocol (LSP) C++ search tool, similar to what IDEs with C++ comprehension (e.g., VS Code IntelliSense) provide. It works well, but ultimately it only speeds up searching operations (which are already fast) and saves a little context, so it is probably not worth the added complexity and additional Python imports.
Checklist