Fix C++ parser: pointer returns, field parsing, destructor macros, and namespace disambiguation#8
Fix C++ parser: pointer returns, field parsing, destructor macros, and namespace disambiguation#8Codeturion wants to merge 6 commits intomasterfrom
Conversation
Regex-based parser for .h/.hpp/.hxx/.h++ files that captures public API: classes, structs, unions, methods, constructors, destructors, operator overloads, free functions, fields, enums, typedefs, and using aliases. Key features: - Access specifier state machine (public/protected/private sections) - Doxygen doc comment extraction (@brief, @param, @return) - Template declaration handling - Export macro stripping (SFML_API, etc.) - Namespace and class nesting with brace-depth tracking - FQN uses :: separator (C++ convention) Also fixes db.py to support :: in namespace splitting and FTS escaping.
- Fix enum forward declarations (e.g. `enum Foo;`) incorrectly setting
pending_enum, which consumed subsequent lines containing `{`
(root cause of ImVec2 constructor capture failure)
- Cancel pending_enum if next line isn't a standalone `{` brace
- Fix inline method body content polluting operator signatures
by truncating at body `{` before searching for trailing qualifiers
- Add _find_body_brace helper for paren-aware `{` detection
- Remove all debug prints
Tested on SFML (4684 records), fmt (1729), imgui (4766),
bullet3 (10890), opencv (34435).
- Skip lines starting with ':' (initializer list) and ',' with '('
(continuation) to prevent false method captures
- Skip function matches where ')' is followed by '=' assignment
(macro-declared variables like B3_ATTRIBUTE_ALIGNED16)
- Preserves = 0, = default, = delete correctly
Verified: 0 false positives across 183,379 records from 9 codebases.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a robust C++ header parser designed to automatically extract public API definitions from C++ source files. This enhancement significantly expands the tool's language support, enabling it to process complex C++ constructs, Doxygen comments, and various C++-specific syntax elements. The accompanying database and search logic updates ensure seamless integration and discoverability of the newly parsed C++ API information. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive, regex-based C++ header parser, which is a significant and complex addition. The supporting changes in db.py to handle C++'s :: namespace separator are appropriate. My review of the new parser in src/codesurface/parsers/cpp.py identified a few issues, primarily related to state management when parsing multi-line constructs. These could lead to incorrect parsing results. I've provided detailed comments and suggestions to address these points. Overall, this is an impressive piece of work on a challenging parsing task.
There was a problem hiding this comment.
Pull request overview
Adds first-class C++ header indexing support to CodeSurface by introducing a regex/state-machine parser for C++ headers and updating search/indexing to better handle ::-qualified names.
Changes:
- Introduces a new
CppParserthat extracts public API declarations from common C++ header extensions. - Auto-registers the C++ parser in the parsers registry.
- Updates DB search tokenization and FTS query escaping to better handle
::namespaces and:characters.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
src/codesurface/parsers/cpp.py |
New C++ header parser that emits API records for types/members with :: FQNs and Doxygen extraction. |
src/codesurface/parsers/__init__.py |
Registers CppParser under the cpp language key. |
src/codesurface/db.py |
Adjusts namespace token splitting and FTS escaping to accommodate C++ :: and :. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ipping - Fix multi-line signature handling: move _collect_signature before conditional blocks so end_i is always set, accumulate brace depth across consumed lines, advance i to end_i+1 - Add param types to method/free function FQN for overload disambiguation - Fix @returns tag parsing to match longest tag first, preventing accidental stripping of 's' from descriptions like "success" - Fix _look_back_for_doc to require /** marker, not scan to file top - Fix _count_braces to skip /* */ inline comments - Fix template brace accumulation to use += instead of recalculating - Optimize parse_directory to walk tree once instead of per-extension - Normalize _extract_param_types separator to comma without space - Fix server.py get_class to split on both . and :: for C++ FQNs - Remove unused parent_class variable
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Parser fixes (cpp.py): - Fix _FUNC_RE to handle pointer/ref return types with Type *name( style (Godot, bullet3, OpenCV use this convention extensively) - Add ALL_CAPS macro qualifiers to _FUNC_RE leading quals (_FORCE_INLINE_, etc.) - Fix _FIELD_RE to handle Type*Name fields with no space before name - Fix _DTOR_RE to allow export macros before ~ (IMGUI_API ~ImDrawList()) - Fix trailing qualifier check: const no longer matches constexpr/consteval, and const followed by a type name is not swallowed as a trailing qualifier - Add [[nodiscard]] attribute support in _CLASS_RE and _FORWARD_DECL_RE - Add export macro suffix support (CV_EXPORTS_W, etc.) across all regexes - Add MACRO(class) with inheritance pattern (_BARE_NAME_INHERIT_RE) - Add const overload FQN disambiguation (method const suffix) - Add constructor trailing qualifiers (= default, = delete, noexcept) - Add operator++/-- support in _OPERATOR_RE - Strip inline comments from trailing qualifiers to prevent leak - Reject copyright/license block comments as doc comments (max 40 lines) DB fixes (db.py): - Add namespace-aware get_class_members() with optional namespace filter - Add get_class_namespaces() for discovering class name collisions - Expand FTS5 special character escaping (commas, brackets, etc.) Server fixes (server.py): - Support namespace-qualified get_class queries (e.g., "cv::Mat") - Auto-disambiguate when multiple namespaces share a class name - Add _pick_primary_namespace() heuristic preferring non-thirdparty paths - Show disambiguation note with alternative qualified names Tested against 5 real codebases: bullet3, godot, opencv, imgui, nlohmann/json. Total records recovered: +16,097 across all codebases.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # If no namespace filter was given, check for ambiguity | ||
| if not ns_filter: | ||
| namespaces = db.get_class_namespaces(_conn, short_name) | ||
| if len(namespaces) > 1: | ||
| # Pick the most likely namespace: prefer non-empty, non-thirdparty | ||
| # Show disambiguation notice | ||
| ns_filter = _pick_primary_namespace(namespaces, members) | ||
| if ns_filter is not None: | ||
| members = db.get_class_members(_conn, short_name, namespace=ns_filter) | ||
|
|
There was a problem hiding this comment.
When multiple namespaces exist, get_class may auto-select a namespace by setting ns_filter, but no message is added to the output indicating an automatic disambiguation happened (the later "also found" note is only shown when ns_filter is falsy). Consider adding an explicit note like "Ambiguous class name; showing ::." when _pick_primary_namespace selects a namespace so users aren’t silently shown a potentially unintended class.
There was a problem hiding this comment.
Good catch — when _pick_primary_namespace auto-selects, the "also found in other namespaces" note at line 419 is skipped because ns_filter is now set. The selected namespace is still visible in the output header, but an explicit disambiguation note would be clearer. Will add this.
| # @return or \return or @returns or \returns | ||
| if (line.startswith("@return") or line.startswith("\\return")): | ||
| # Match longest tag first to avoid stripping 's' from description | ||
| for tag in ("@returns", "\\returns", "@return", "\\return"): | ||
| if line.startswith(tag): | ||
| rest = line[len(tag):].strip() | ||
| break | ||
| returns = rest |
There was a problem hiding this comment.
_parse_doxygen_lines claims to support @returns/\\returns, but the conditional only checks @return/\\return. As a result, lines starting with @returns will be skipped and returns stays empty. Update the condition to include @returns and \\returns (or just match all four tags in the outer if).
There was a problem hiding this comment.
This is actually a false positive — "@returns".startswith("@return") is True in Python, so the outer if catches both @return and @returns. The inner for loop then matches the longest tag first (@returns before @return), so the full tag is stripped correctly. No change needed.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- db.py: Change `if namespace:` to `if namespace is not None:` so filtering by global namespace (empty string) works correctly - cpp.py: Move leading *& from param name token back to type in _extract_param_types, so `int *p` produces type `int*` not `int`
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # @return or \return or @returns or \returns | ||
| if (line.startswith("@return") or line.startswith("\\return")): | ||
| # Match longest tag first to avoid stripping 's' from description | ||
| for tag in ("@returns", "\\returns", "@return", "\\return"): | ||
| if line.startswith(tag): | ||
| rest = line[len(tag):].strip() | ||
| break | ||
| returns = rest | ||
| i += 1 | ||
| continue |
There was a problem hiding this comment.
The condition only checks @return / \return, so lines that start with @returns / \returns will never enter this block (they’ll be skipped by the generic “starts with @” handler). Update the condition to also accept @returns and \returns.
| for ch in '."-*():,;{}[]!@#$%^&+|\\~`': | ||
| q = q.replace(ch, " ") | ||
| terms = [t for t in q.split() if t] |
There was a problem hiding this comment.
Replacing characters via a loop repeatedly scans the string (O(k·n) per query, where k is the number of escaped characters). Since this runs on every search, consider switching to a single-pass approach (e.g., str.translate with a translation table, or a precompiled regex) to reduce per-query overhead.
Summary
_FUNC_REto handleType *name(pointer/reference return style used by Godot, bullet3, and OpenCV — this was the biggest gap, recovering +11,500 records across codebases_FIELD_REto parseType*Name;fields (no space before name)_DTOR_REto allow export macros before~(e.g.,IMGUI_API ~ImDrawList())constno longer matchesconstexpror swallows next declarations starting withconst Type&[[nodiscard]]attribute support, export macro suffixes (CV_EXPORTS_W),MACRO(class)inheritance,operator++/--, constructor qualifiers (= default/= delete), const overload FQN disambiguationget_classwithns::ClassNamesyntax and auto-disambiguationRecord gains
Test plan
_FUNC_REregex unit tests (all pointer/ref styles)_FIELD_REregex unit tests_DTOR_REregex unit tests_TRAILING_CONST_REunit testsNode::get_child,SceneTree::get_root, bullet3btRigidBody::getInvInertiaTensorWorld, ImGuiImVec2(float,float)constructor,ImDrawListdestructor,ImGuiIO.Fontsfield