Fix indexing crashes on deep or non-ASCII sources#1
Conversation
Avoid UTF-8 slicing panics when truncating doc comments, and switch tree-sitter node traversal from recursive to iterative DFS to prevent stack overflows on deep syntax trees seen in large codebases such as the Linux kernel. Add truncate_doc regression tests for non-ASCII input.
etr
left a comment
There was a problem hiding this comment.
Groundwork PR Review
| Agent | Score | Verdict | Critical | Major | Minor |
|---|---|---|---|---|---|
| Code Quality | 88 | approve | 0 | 0 | 4 |
| Test Quality | 76 | request-changes | 0 | 2 | 2 |
| Security | 88 | approve | 0 | 0 | 2 |
| Performance | 84 | approve | 0 | 0 | 3 |
| Code Simplifier | 82 | approve | 0 | 0 | 3 |
| Architecture | 97 | approve | 0 | 0 | 2 |
| Housekeeper | -- | error | -- | -- | -- |
| Design | -- | skipped | -- | -- | -- |
General Findings
[major] missing-test (from: test-quality)
No test covers a doc string composed entirely of multi-byte characters (e.g. CJK or emoji) that exceeds MAX_DOC_COMMENT_LEN with no ASCII space. The NBSP test exercises only the "no space found" fallback with 199 ASCII chars before the NBSP -- it does not exercise truncation of a purely non-ASCII string. A string like '\u{4e2d}'.repeat(250) would take the hard-cut branch, and nothing verifies it produces a valid, non-panicking result.
Recommendation: Add a test such as:
let s = "\u{4e2d}".repeat(MAX_DOC_COMMENT_LEN + 10);
let out = truncate_doc(&s);
assert!(out.ends_with("..."));
assert_eq!(out.chars().count(), MAX_DOC_COMMENT_LEN + 3);[major] missing-test (from: test-quality)
No test exercises the word-boundary truncation path (rfind(' ') succeeds) when the input contains multi-byte characters. The existing truncate_doc_long_truncates_at_last_space test uses only ASCII "word " repetition. If a multi-byte character appears before the last space in the 200-char prefix, the rfind and slice must still be safe.
Recommendation: Add a test mixing ASCII and multi-byte content before the word boundary, e.g. a string starting with '\u{00e9}'.repeat(50) followed by " rest " repeated to exceed the limit, then assert the output ends with "..." and does not include a truncated multi-byte sequence.
[minor] insecure-design (from: security)
find_identifier_in_declarator (line 1213) remains recursive and walks all named children as a fallback. The four primary walk functions were converted to iterative DFS by this PR, but this helper was missed. Declarator nesting is typically shallow (2-5 levels), so the risk is low but non-zero for adversarial C/C++ source.
Recommendation: Consider converting to iterative in a follow-up, consistent with the pattern applied here.
Reviewed by 6 agents | 8 findings (5 inline, 3 general)
| if is_container(kind, lang) { | ||
| for i in (0..node.child_count()).rev() { | ||
| if let Some(child) = node.child(i as u32) { | ||
| stack.push((child, Some(new_scope.clone()))); |
There was a problem hiding this comment.
[minor] memory-allocation (from: performance, code-simplifier)
walk_node clones the scope String for every child of a container node. A container with N direct children allocates N String copies of the symbol name. Previously the scope was an &str reference with zero allocation cost.
Recommendation: Consider wrapping scope in Rc<str> or Arc<str> so all children of a container share one heap allocation via a reference-count increment instead of N independent copies.
| let mut chars = s.chars(); | ||
| let prefix: String = chars.by_ref().take(max).collect(); | ||
| if chars.next().is_none() { | ||
| return s.to_string(); |
There was a problem hiding this comment.
[minor] code-elegance (from: code-quality, performance)
prefix is already allocated on line 516 containing the first max chars. On this short-path branch, prefix contains exactly what the caller needs, but it is discarded and s.to_string() allocates a second String.
Recommendation: Replace return s.to_string(); with return prefix; to reuse the existing allocation.
| let truncated = &s[..MAX_DOC_COMMENT_LEN]; | ||
| if let Some(pos) = truncated.rfind(' ') { | ||
| format!("{}...", &s[..pos]) | ||
| if let Some(pos) = prefix.rfind(' ') { |
There was a problem hiding this comment.
[minor] unicode-handling (from: security)
rfind(' ') matches only ASCII space (U+0020). Non-ASCII Unicode whitespace characters (e.g. U+2009 THIN SPACE, U+00A0 NO-BREAK SPACE) won't match, falling through to the hard-truncation branch.
Recommendation: Use prefix.rfind(char::is_whitespace) to handle the full Unicode whitespace category, consistent with the char-based approach used elsewhere in this function.
| @@ -1796,34 +1796,39 @@ pub fn extract_references(tree: &Tree, source: &str, file: &str, lang: Lang) -> | |||
| } | |||
|
|
|||
| /// Recursively walk a node tree collecting references. | |||
There was a problem hiding this comment.
[minor] documentation (from: code-quality, architecture)
Doc comment says "Recursively walk a node tree" but the implementation now uses iterative DFS with an explicit stack.
Recommendation: Update to "Walk a node tree with an explicit stack (iterative DFS)" to match the implementation.
| assert!(s.len() > MAX_DOC_COMMENT_LEN); | ||
| let out = truncate_doc(&s); | ||
| assert!(out.ends_with("...")); | ||
| assert!(out.chars().count() <= MAX_DOC_COMMENT_LEN + 4); |
There was a problem hiding this comment.
[minor] test-precision (from: code-quality, test-quality, architecture)
The assertion uses MAX_DOC_COMMENT_LEN + 4 but the maximum possible output is MAX_DOC_COMMENT_LEN + 3 (up to max prefix chars + 3 chars for "..."). The slack of +1 weakens the assertion's ability to catch an off-by-one regression.
Recommendation: Tighten to <= MAX_DOC_COMMENT_LEN + 3.
Avoid UTF-8 slicing panics when truncating doc comments, and switch
tree-sitter node traversal from recursive to iterative DFS to prevent
stack overflows on deep syntax trees seen in large codebases such as
the Linux kernel.
Add truncate_doc regression tests for non-ASCII input.