Skip to content

Fix indexing crashes on deep or non-ASCII sources#1

Merged
etr merged 1 commit intoetr:mainfrom
deepskyblue86:main
Apr 4, 2026
Merged

Fix indexing crashes on deep or non-ASCII sources#1
etr merged 1 commit intoetr:mainfrom
deepskyblue86:main

Conversation

@deepskyblue86
Copy link
Copy Markdown
Contributor

Avoid UTF-8 slicing panics when truncating doc comments, and switch
tree-sitter node traversal from recursive to iterative DFS to prevent
stack overflows on deep syntax trees seen in large codebases such as
the Linux kernel.

Add truncate_doc regression tests for non-ASCII input.

Avoid UTF-8 slicing panics when truncating doc comments, and switch
tree-sitter node traversal from recursive to iterative DFS to prevent
stack overflows on deep syntax trees seen in large codebases such as
the Linux kernel.

Add truncate_doc regression tests for non-ASCII input.
Copy link
Copy Markdown
Owner

@etr etr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Groundwork PR Review

Agent Score Verdict Critical Major Minor
Code Quality 88 approve 0 0 4
Test Quality 76 request-changes 0 2 2
Security 88 approve 0 0 2
Performance 84 approve 0 0 3
Code Simplifier 82 approve 0 0 3
Architecture 97 approve 0 0 2
Housekeeper -- error -- -- --
Design -- skipped -- -- --

General Findings

[major] missing-test (from: test-quality)

No test covers a doc string composed entirely of multi-byte characters (e.g. CJK or emoji) that exceeds MAX_DOC_COMMENT_LEN with no ASCII space. The NBSP test exercises only the "no space found" fallback with 199 ASCII chars before the NBSP -- it does not exercise truncation of a purely non-ASCII string. A string like '\u{4e2d}'.repeat(250) would take the hard-cut branch, and nothing verifies it produces a valid, non-panicking result.

Recommendation: Add a test such as:

let s = "\u{4e2d}".repeat(MAX_DOC_COMMENT_LEN + 10);
let out = truncate_doc(&s);
assert!(out.ends_with("..."));
assert_eq!(out.chars().count(), MAX_DOC_COMMENT_LEN + 3);

[major] missing-test (from: test-quality)

No test exercises the word-boundary truncation path (rfind(' ') succeeds) when the input contains multi-byte characters. The existing truncate_doc_long_truncates_at_last_space test uses only ASCII "word " repetition. If a multi-byte character appears before the last space in the 200-char prefix, the rfind and slice must still be safe.

Recommendation: Add a test mixing ASCII and multi-byte content before the word boundary, e.g. a string starting with '\u{00e9}'.repeat(50) followed by " rest " repeated to exceed the limit, then assert the output ends with "..." and does not include a truncated multi-byte sequence.


[minor] insecure-design (from: security)

find_identifier_in_declarator (line 1213) remains recursive and walks all named children as a fallback. The four primary walk functions were converted to iterative DFS by this PR, but this helper was missed. Declarator nesting is typically shallow (2-5 levels), so the risk is low but non-zero for adversarial C/C++ source.

Recommendation: Consider converting to iterative in a follow-up, consistent with the pattern applied here.


Reviewed by 6 agents | 8 findings (5 inline, 3 general)

if is_container(kind, lang) {
for i in (0..node.child_count()).rev() {
if let Some(child) = node.child(i as u32) {
stack.push((child, Some(new_scope.clone())));
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] memory-allocation (from: performance, code-simplifier)

walk_node clones the scope String for every child of a container node. A container with N direct children allocates N String copies of the symbol name. Previously the scope was an &str reference with zero allocation cost.

Recommendation: Consider wrapping scope in Rc<str> or Arc<str> so all children of a container share one heap allocation via a reference-count increment instead of N independent copies.

let mut chars = s.chars();
let prefix: String = chars.by_ref().take(max).collect();
if chars.next().is_none() {
return s.to_string();
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] code-elegance (from: code-quality, performance)

prefix is already allocated on line 516 containing the first max chars. On this short-path branch, prefix contains exactly what the caller needs, but it is discarded and s.to_string() allocates a second String.

Recommendation: Replace return s.to_string(); with return prefix; to reuse the existing allocation.

let truncated = &s[..MAX_DOC_COMMENT_LEN];
if let Some(pos) = truncated.rfind(' ') {
format!("{}...", &s[..pos])
if let Some(pos) = prefix.rfind(' ') {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] unicode-handling (from: security)

rfind(' ') matches only ASCII space (U+0020). Non-ASCII Unicode whitespace characters (e.g. U+2009 THIN SPACE, U+00A0 NO-BREAK SPACE) won't match, falling through to the hard-truncation branch.

Recommendation: Use prefix.rfind(char::is_whitespace) to handle the full Unicode whitespace category, consistent with the char-based approach used elsewhere in this function.

@@ -1796,34 +1796,39 @@ pub fn extract_references(tree: &Tree, source: &str, file: &str, lang: Lang) ->
}

/// Recursively walk a node tree collecting references.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] documentation (from: code-quality, architecture)

Doc comment says "Recursively walk a node tree" but the implementation now uses iterative DFS with an explicit stack.

Recommendation: Update to "Walk a node tree with an explicit stack (iterative DFS)" to match the implementation.

assert!(s.len() > MAX_DOC_COMMENT_LEN);
let out = truncate_doc(&s);
assert!(out.ends_with("..."));
assert!(out.chars().count() <= MAX_DOC_COMMENT_LEN + 4);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] test-precision (from: code-quality, test-quality, architecture)

The assertion uses MAX_DOC_COMMENT_LEN + 4 but the maximum possible output is MAX_DOC_COMMENT_LEN + 3 (up to max prefix chars + 3 chars for "..."). The slack of +1 weakens the assertion's ability to catch an off-by-one regression.

Recommendation: Tighten to <= MAX_DOC_COMMENT_LEN + 3.

@etr etr merged commit 1092677 into etr:main Apr 4, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants