Fix C++ parser: pointer returns, field parsing, destructor macros, and namespace disambiguation#8

Open
Codeturion wants to merge 6 commits into master from dev/cpp

Conversation

Owner

@Codeturion Codeturion commented Mar 2, 2026

Summary

  • Fix _FUNC_RE to handle Type *name( pointer/reference return style used by Godot, bullet3, and OpenCV — this was the biggest gap, recovering +11,500 records across codebases
  • Fix _FIELD_RE to parse Type*Name; fields (no space before name)
  • Fix _DTOR_RE to allow export macros before ~ (e.g., IMGUI_API ~ImDrawList())
  • Fix trailing qualifier check so const no longer matches constexpr and no longer swallows the next declaration when that declaration starts with const Type&
  • Add [[nodiscard]] attribute support, export macro suffixes (CV_EXPORTS_W), MACRO(class) inheritance, operator++/--, constructor qualifiers (= default/= delete), const overload FQN disambiguation
  • Reject copyright/license block comments from leaking into doc summaries
  • Add namespace-aware get_class with ns::ClassName syntax and auto-disambiguation
  • Expand FTS5 special character escaping (commas, brackets, etc.)
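To make the first bullet concrete, here is a minimal sketch of a pattern that accepts the pointer or reference marker on either side of the space. `FUNC_RE_SKETCH` and `match_func` are hypothetical names for illustration; the PR's actual `_FUNC_RE` is not shown in this conversation and handles many more cases (qualifiers, macros, templates).

```python
import re

# Hypothetical, simplified stand-in for _FUNC_RE (not the PR's actual pattern).
# Pointer/reference markers may sit next to the type ("Node* get"), next to
# the name ("Node *get"), or between both with no spaces ("Node*get").
FUNC_RE_SKETCH = re.compile(
    r"""^\s*
        (?P<ret>(?:[\w:<>]+\s+)*[\w:<>]+)   # return type tokens, e.g. "const Node"
        (?P<ptr>\s*[*&]+\s*|\s+)            # pointer/ref markers, or plain whitespace
        (?P<name>[A-Za-z_]\w*)              # function name
        \s*\(                               # opening paren of the parameter list
    """,
    re.VERBOSE,
)


def match_func(line):
    """Return (return_type, name) for a declaration line, or None."""
    m = FUNC_RE_SKETCH.match(line)
    if not m:
        return None
    # Normalize so the marker is always attached to the type.
    return m.group("ret") + m.group("ptr").strip(), m.group("name")


print(match_func("Node *get_child(int p_index) const;"))
print(match_func("Node* get_child(int p_index) const;"))
```

Both calls yield the same normalized return type, which is the property the fix is after. The sketch deliberately ignores statements such as `return foo(bar);`, which a real parser has to filter separately.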

Record gains

| Codebase | Before | After | Gain |
|---|---|---|---|
| bullet3 | 10,770 | 12,831 | +2,061 |
| godot | 130,910 | 140,577 | +9,667 |
| opencv | 38,756 | 43,023 | +4,267 |
| imgui | 4,780 | 4,840 | +60 |
| json | 1,045 | 1,087 | +42 |
| Total | 186,261 | 202,358 | +16,097 |

Test plan

  • 15/15 _FUNC_RE regex unit tests (all pointer/ref styles)
  • 9/9 _FIELD_RE regex unit tests
  • 6/6 _DTOR_RE regex unit tests
  • 19/19 _TRAILING_CONST_RE unit tests
  • Verified against real source files: Godot Node::get_child, SceneTree::get_root, bullet3 btRigidBody::getInvInertiaTensorWorld, ImGui ImVec2(float,float) constructor, ImDrawList destructor, ImGuiIO.Fonts field
  • Full parse of all 5 codebases with no errors
  • Namespace disambiguation tested with synthetic and real data

Regex-based parser for .h/.hpp/.hxx/.h++ files that captures public API:
classes, structs, unions, methods, constructors, destructors, operator
overloads, free functions, fields, enums, typedefs, and using aliases.

Key features:
- Access specifier state machine (public/protected/private sections)
- Doxygen doc comment extraction (@brief, @param, @return)
- Template declaration handling
- Export macro stripping (SFML_API, etc.)
- Namespace and class nesting with brace-depth tracking
- FQN uses :: separator (C++ convention)

Also fixes db.py to support :: in namespace splitting and FTS escaping.
- Fix enum forward declarations (e.g. `enum Foo;`) incorrectly setting
  pending_enum, which consumed subsequent lines containing `{`
  (root cause of ImVec2 constructor capture failure)
- Cancel pending_enum if next line isn't a standalone `{` brace
- Fix inline method body content polluting operator signatures
  by truncating at body `{` before searching for trailing qualifiers
- Add _find_body_brace helper for paren-aware `{` detection
- Remove all debug prints
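The paren-aware `{` detection mentioned above can be sketched roughly as follows. This is an assumption about how such a helper might work, not the PR's `_find_body_brace`; the function names below are hypothetical, and string literals and comments are ignored for brevity.

```python
def find_body_brace(signature):
    """Return the index of the first '{' outside parentheses, or -1."""
    depth = 0
    for i, ch in enumerate(signature):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif ch == "{" and depth == 0:
            # A brace inside the parameter list (e.g. a "= {}" default
            # argument) has depth > 0 and is skipped.
            return i
    return -1


def strip_inline_body(signature):
    """Cut an inline method body off so qualifiers can be searched safely."""
    idx = find_body_brace(signature)
    return signature if idx == -1 else signature[:idx].rstrip()


print(strip_inline_body("bool operator==(const ImVec2& o) const { return x == o.x; }"))
```

Truncating before looking for trailing qualifiers keeps tokens from the body, such as a `const` local, out of the recorded signature.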

Tested on SFML (4684 records), fmt (1729), imgui (4766),
bullet3 (10890), opencv (34435).
- Skip lines starting with ':' (initializer list) and ',' with '('
  (continuation) to prevent false method captures
- Skip function matches where ')' is followed by '=' assignment
  (macro-declared variables like B3_ATTRIBUTE_ALIGNED16)
- Preserves = 0, = default, = delete correctly

Verified: 0 false positives across 183,379 records from 9 codebases.
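One way to tell a macro-declared variable from a real declaration is to inspect what follows the closing `)`. The sketch below is an illustration under that assumption; `is_assignment_after_paren` and `_ALLOWED_EQ_RE` are hypothetical names, not identifiers from the PR.

```python
import re

# After the closing ')' of a candidate function match, "= 0", "= default"
# and "= delete" are legitimate specifiers. Any other "=" suggests a
# variable that is being assigned (hypothetical rule for this sketch).
_ALLOWED_EQ_RE = re.compile(r"\s*=\s*(?:0|default|delete)\s*;?\s*$")


def is_assignment_after_paren(tail):
    """tail is the text that follows the closing ')' of the match."""
    if not tail.lstrip().startswith("="):
        return False
    return _ALLOWED_EQ_RE.match(tail) is None


print(is_assignment_after_paren(" = 0;"))
print(is_assignment_after_paren(" = btVector3(0, 0, 0);"))
```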
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust C++ header parser designed to automatically extract public API definitions from C++ source files. This enhancement significantly expands the tool's language support, enabling it to process complex C++ constructs, Doxygen comments, and various C++-specific syntax elements. The accompanying database and search logic updates ensure seamless integration and discoverability of the newly parsed C++ API information.

Highlights

  • New C++ Header Parser: A new regex-based C++ header parser has been added, capable of extracting public API declarations from .h, .hpp, .hxx, and .h++ files. It tracks brace-depth scope, namespace nesting, and class/struct/union hierarchy with access specifier state.
  • Comprehensive API Capture: The parser captures a wide range of C++ public API elements including classes, structs, unions, methods, constructors, destructors, operator overloads, free functions, public fields, enums, typedefs, and using aliases.
  • Doxygen Integration: Doxygen documentation comments (/** */ blocks and /// lines) are now extracted, specifically capturing @brief, @param, and @return tags.
  • Advanced C++ Feature Handling: The parser correctly handles template declarations, export macros (e.g., SFML_API, CV_EXPORTS), and uses :: as the FQN (Fully Qualified Name) separator, aligning with C++ conventions.
  • Database and Search Enhancements: The db.py utility has been updated to support :: in namespace splitting and to correctly escape :: characters for Full-Text Search (FTS) queries, accommodating the new C++ FQN format.

Changelog

  • src/codesurface/db.py
    • Updated namespace splitting logic to correctly handle :: as a separator for C++ fully qualified names.
    • Modified the FTS escaping function to include : as a character to be replaced, improving search accuracy for C++ symbols.
  • src/codesurface/parsers/__init__.py
    • Added an import statement for the new CppParser.
    • Registered the CppParser under the 'cpp' identifier, making it available for use.
  • src/codesurface/parsers/cpp.py
    • Added a new C++ header parser module (cpp.py) that implements BaseParser.
    • Implemented comprehensive regex patterns and state-tracking logic to parse C++ header files and extract public API declarations.
    • Included functionality to skip specific build/vendor/test directories and generated files.
    • Developed Doxygen comment extraction for @brief, @param, and @return tags.
    • Incorporated logic to handle C++ namespaces, classes, structs, unions, enums, typedefs, using aliases, methods, constructors, destructors, operator overloads, and public fields.
    • Provided helper functions for brace/paren/angle counting, signature collection, and FQN construction.

Activity

  • The pull request author, Codeturion, has provided a detailed summary of the changes, including the parser's capabilities and extensive testing results across nine different codebases.
  • A comprehensive test plan was outlined and marked as completed, verifying various aspects of the parser's accuracy, such as constructor/destructor/operator parsing, access specifier tracking, enum value capture, handling of qualifiers, and Doxygen extraction.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive, regex-based C++ header parser, which is a significant and complex addition. The supporting changes in db.py to handle C++'s :: namespace separator are appropriate. My review of the new parser in src/codesurface/parsers/cpp.py identified a few issues, primarily related to state management when parsing multi-line constructs. These could lead to incorrect parsing results. I've provided detailed comments and suggestions to address these points. Overall, this is an impressive piece of work on a challenging parsing task.

Copilot AI left a comment

Pull request overview

Adds first-class C++ header indexing support to CodeSurface by introducing a regex/state-machine parser for C++ headers and updating search/indexing to better handle ::-qualified names.

Changes:

  • Introduces a new CppParser that extracts public API declarations from common C++ header extensions.
  • Auto-registers the C++ parser in the parsers registry.
  • Updates DB search tokenization and FTS query escaping to better handle :: namespaces and : characters.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| src/codesurface/parsers/cpp.py | New C++ header parser that emits API records for types/members with :: FQNs and Doxygen extraction. |
| src/codesurface/parsers/__init__.py | Registers CppParser under the cpp language key. |
| src/codesurface/db.py | Adjusts namespace token splitting and FTS escaping to accommodate C++ :: and :. |

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.


…ipping

- Fix multi-line signature handling: move _collect_signature before
  conditional blocks so end_i is always set, accumulate brace depth
  across consumed lines, advance i to end_i+1
- Add param types to method/free function FQN for overload disambiguation
- Fix @returns tag parsing to match longest tag first, preventing
  accidental stripping of 's' from descriptions like "success"
- Fix _look_back_for_doc to require /** marker, not scan to file top
- Fix _count_braces to skip /* */ inline comments
- Fix template brace accumulation to use += instead of recalculating
- Optimize parse_directory to walk tree once instead of per-extension
- Normalize _extract_param_types separator to comma without space
- Fix server.py get_class to split on both . and :: for C++ FQNs
- Remove unused parent_class variable
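Splitting a class query on either separator could look like the sketch below. `split_fqn` is a hypothetical helper written for illustration; how `server.py` actually performs the split is not shown in this conversation.

```python
def split_fqn(name):
    """Split a class query into (namespace, short_name).

    Handles C++ '::'-separated names as well as '.'-separated ones.
    The namespace is None when the query is unqualified.
    """
    sep = "::" if "::" in name else "."
    head, found, short_name = name.rpartition(sep)
    return (head if found else None), short_name


print(split_fqn("cv::Mat"))
print(split_fqn("pkg.Widget"))
print(split_fqn("Mat"))
```

Using `rpartition` keeps nested namespaces intact, so `a::b::C` splits into `a::b` and `C`.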

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


Parser fixes (cpp.py):
- Fix _FUNC_RE to handle pointer/ref return types with Type *name( style
  (Godot, bullet3, OpenCV use this convention extensively)
- Add ALL_CAPS macro qualifiers to _FUNC_RE leading quals (_FORCE_INLINE_, etc.)
- Fix _FIELD_RE to handle Type*Name fields with no space before name
- Fix _DTOR_RE to allow export macros before ~ (IMGUI_API ~ImDrawList())
- Fix trailing qualifier check: const no longer matches constexpr/consteval,
  and const followed by a type name is not swallowed as a trailing qualifier
- Add [[nodiscard]] attribute support in _CLASS_RE and _FORWARD_DECL_RE
- Add export macro suffix support (CV_EXPORTS_W, etc.) across all regexes
- Add MACRO(class) with inheritance pattern (_BARE_NAME_INHERIT_RE)
- Add const overload FQN disambiguation (method const suffix)
- Add constructor trailing qualifiers (= default, = delete, noexcept)
- Add operator++/-- support in _OPERATOR_RE
- Strip inline comments from trailing qualifiers to prevent leak
- Reject copyright/license block comments as doc comments (max 40 lines)
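The trailing qualifier fix can be illustrated with a word boundary plus a lookahead. This is a sketch under the assumption that the check runs on the text right after the closing `)`; `TRAILING_CONST_SKETCH` is a hypothetical stand-in, since the PR's `_TRAILING_CONST_RE` is not shown here.

```python
import re

# \b stops "const" from matching the prefix of "constexpr"/"consteval".
# The lookahead accepts "const" only when what follows still looks like the
# end of the same declaration, not the start of the next ("const Type&").
TRAILING_CONST_SKETCH = re.compile(
    r"^\s*const\b(?=\s*(?:$|[;{=]|noexcept\b|override\b|final\b))"
)


def has_trailing_const(tail):
    """tail is the text right after the closing ')' of a method."""
    return TRAILING_CONST_SKETCH.match(tail) is not None


print(has_trailing_const(" const override;"))
print(has_trailing_const(" constexpr int f();"))
print(has_trailing_const(" const Vector3& other"))
```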

DB fixes (db.py):
- Add namespace-aware get_class_members() with optional namespace filter
- Add get_class_namespaces() for discovering class name collisions
- Expand FTS5 special character escaping (commas, brackets, etc.)

Server fixes (server.py):
- Support namespace-qualified get_class queries (e.g., "cv::Mat")
- Auto-disambiguate when multiple namespaces share a class name
- Add _pick_primary_namespace() heuristic preferring non-thirdparty paths
- Show disambiguation note with alternative qualified names

Tested against 5 real codebases: bullet3, godot, opencv, imgui, nlohmann/json.
Total records recovered: +16,097 across all codebases.
@Codeturion Codeturion changed the title Add C++ header parser Fix C++ parser: pointer returns, field parsing, destructor macros, and namespace disambiguation Mar 3, 2026
@Codeturion Codeturion requested a review from Copilot March 3, 2026 17:47

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


Comment on lines +390 to +399
# If no namespace filter was given, check for ambiguity
if not ns_filter:
    namespaces = db.get_class_namespaces(_conn, short_name)
    if len(namespaces) > 1:
        # Pick the most likely namespace: prefer non-empty, non-thirdparty
        # Show disambiguation notice
        ns_filter = _pick_primary_namespace(namespaces, members)
        if ns_filter is not None:
            members = db.get_class_members(_conn, short_name, namespace=ns_filter)

Copilot AI Mar 3, 2026

When multiple namespaces exist, get_class may auto-select a namespace by setting ns_filter, but no message is added to the output indicating an automatic disambiguation happened (the later "also found" note is only shown when ns_filter is falsy). Consider adding an explicit note like "Ambiguous class name; showing <namespace>::<class>." when _pick_primary_namespace selects a namespace so users aren’t silently shown a potentially unintended class.

Owner Author

Good catch — when _pick_primary_namespace auto-selects, the "also found in other namespaces" note at line 419 is skipped because ns_filter is now set. The selected namespace is still visible in the output header, but an explicit disambiguation note would be clearer. Will add this.

Comment on lines +1327 to +1334
# @return or \return or @returns or \returns
if (line.startswith("@return") or line.startswith("\\return")):
    # Match longest tag first to avoid stripping 's' from description
    for tag in ("@returns", "\\returns", "@return", "\\return"):
        if line.startswith(tag):
            rest = line[len(tag):].strip()
            break
    returns = rest

Copilot AI Mar 3, 2026

_parse_doxygen_lines claims to support @returns/\\returns, but the conditional only checks @return/\\return. As a result, lines starting with @returns will be skipped and returns stays empty. Update the condition to include @returns and \\returns (or just match all four tags in the outer if).

Owner Author

This is actually a false positive — "@returns".startswith("@return") is True in Python, so the outer if catches both @return and @returns. The inner for loop then matches the longest tag first (@returns before @return), so the full tag is stripped correctly. No change needed.
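The reply above rests on two facts about Python strings that are easy to check in isolation. The sketch below reuses the tag tuple from the quoted snippet; `strip_return_tag` is a hypothetical wrapper added only to make the behaviour testable.

```python
RETURN_TAGS = ("@returns", "\\returns", "@return", "\\return")  # longest first


def strip_return_tag(line):
    """Return the description after a Doxygen return tag, or None."""
    # The outer check uses the shorter spelling, which is a prefix of the
    # longer one, so it lets both "@return" and "@returns" through.
    if not (line.startswith("@return") or line.startswith("\\return")):
        return None
    for tag in RETURN_TAGS:
        if line.startswith(tag):
            return line[len(tag):].strip()
    return None


print("@returns".startswith("@return"))
print(strip_return_tag("@returns success flag"))
print(strip_return_tag("@return success flag"))
```

If the shorter tags came first in the tuple, `@returns success flag` would lose only `@return` and leave a stray `s` in front of the description, which is why the order matters.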

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


- db.py: Change `if namespace:` to `if namespace is not None:` so
  filtering by global namespace (empty string) works correctly
- cpp.py: Move leading *& from param name token back to type in
  _extract_param_types, so `int *p` produces type `int*` not `int`
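The difference between the two checks in the db.py fix shows up only for the empty string. The sketch below illustrates it with a hypothetical query builder; the table and column names are assumptions, not the schema used by db.py.

```python
def build_member_query(class_name, namespace=None):
    """Build (sql, params) for a class-member lookup.

    Table and column names are assumptions made for this sketch.
    namespace=None means "any namespace"; namespace="" means the global one.
    """
    sql = "SELECT * FROM api_records WHERE class_name = ?"
    params = [class_name]
    if namespace is not None:  # `if namespace:` would silently drop the "" filter
        sql += " AND namespace = ?"
        params.append(namespace)
    return sql, params


print(build_member_query("Mat"))
print(build_member_query("Mat", ""))
```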
@Codeturion Codeturion requested a review from Copilot March 3, 2026 19:02

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


Comment on lines +1327 to +1336
# @return or \return or @returns or \returns
if (line.startswith("@return") or line.startswith("\\return")):
    # Match longest tag first to avoid stripping 's' from description
    for tag in ("@returns", "\\returns", "@return", "\\return"):
        if line.startswith(tag):
            rest = line[len(tag):].strip()
            break
    returns = rest
    i += 1
    continue

Copilot AI Mar 3, 2026

The condition only checks @return / \return, so lines that start with @returns / \returns will never enter this block (they’ll be skipped by the generic “starts with @” handler). Update the condition to also accept @returns and \returns.

Comment on lines +272 to 274
for ch in '."-*():,;{}[]!@#$%^&+|\\~`':
    q = q.replace(ch, " ")
terms = [t for t in q.split() if t]

Copilot AI Mar 3, 2026

Replacing characters via a loop repeatedly scans the string (O(k·n) per query, where k is the number of escaped characters). Since this runs on every search, consider switching to a single-pass approach (e.g., str.translate with a translation table, or a precompiled regex) to reduce per-query overhead.
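A single-pass variant along the lines suggested here could use `str.maketrans` and `str.translate`. The character set is copied from the quoted diff; `fts_terms` and the module-level names are hypothetical.

```python
# Same character set as the loop in the quoted diff.
_FTS_SPECIAL = '."-*():,;{}[]!@#$%^&+|\\~`'
# Build the table once at import time so each query does one translate pass.
_FTS_TABLE = str.maketrans({ch: " " for ch in _FTS_SPECIAL})


def fts_terms(q):
    """Replace FTS5 special characters with spaces, then split into terms."""
    return q.translate(_FTS_TABLE).split()


print(fts_terms("cv::Mat"))
print(fts_terms("Node::get_child(int)"))
```

`str.split()` with no arguments already discards empty strings, so the extra `if t` filter from the loop version is not needed.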
