Fix indexing crashes on deep or non-ASCII sources by deepskyblue86 · Pull Request #1 · etr/wonk

deepskyblue86 · 2026-04-03T09:01:07Z

Avoid UTF-8 slicing panics when truncating doc comments, and switch
tree-sitter node traversal from recursive to iterative DFS to prevent
stack overflows on deep syntax trees seen in large codebases such as
the Linux kernel.

Add truncate_doc regression tests for non-ASCII input.

Avoid UTF-8 slicing panics when truncating doc comments, and switch tree-sitter node traversal from recursive to iterative DFS to prevent stack overflows on deep syntax trees seen in large codebases such as the Linux kernel. Add truncate_doc regression tests for non-ASCII input.

etr

Groundwork PR Review

Agent	Score	Verdict	Critical	Major	Minor
Code Quality	88	approve	0	0	4
Test Quality	76	request-changes	0	2	2
Security	88	approve	0	0	2
Performance	84	approve	0	0	3
Code Simplifier	82	approve	0	0	3
Architecture	97	approve	0	0	2
Housekeeper	--	error	--	--	--
Design	--	skipped	--	--	--

General Findings

[major] missing-test (from: test-quality)

No test covers a doc string composed entirely of multi-byte characters (e.g. CJK or emoji) that exceeds MAX_DOC_COMMENT_LEN with no ASCII space. The NBSP test exercises only the "no space found" fallback with 199 ASCII chars before the NBSP -- it does not exercise truncation of a purely non-ASCII string. A string like '\u{4e2d}'.repeat(250) would take the hard-cut branch, and nothing verifies it produces a valid, non-panicking result.

Recommendation: Add a test such as:

let s = "\u{4e2d}".repeat(MAX_DOC_COMMENT_LEN + 10);
let out = truncate_doc(&s);
assert!(out.ends_with("..."));
assert_eq!(out.chars().count(), MAX_DOC_COMMENT_LEN + 3);

[major] missing-test (from: test-quality)

No test exercises the word-boundary truncation path (rfind(' ') succeeds) when the input contains multi-byte characters. The existing truncate_doc_long_truncates_at_last_space test uses only ASCII "word " repetition. If a multi-byte character appears before the last space in the 200-char prefix, the rfind and slice must still be safe.

Recommendation: Add a test mixing ASCII and multi-byte content before the word boundary, e.g. a string starting with '\u{00e9}'.repeat(50) followed by " rest " repeated to exceed the limit, then assert the output ends with "..." and does not include a truncated multi-byte sequence.

[minor] insecure-design (from: security)

find_identifier_in_declarator (line 1213) remains recursive and walks all named children as a fallback. The four primary walk functions were converted to iterative DFS by this PR, but this helper was missed. Declarator nesting is typically shallow (2-5 levels), so the risk is low but non-zero for adversarial C/C++ source.

Recommendation: Consider converting to iterative in a follow-up, consistent with the pattern applied here.

Reviewed by 6 agents | 8 findings (5 inline, 3 general)

etr · 2026-04-03T22:08:07Z

src/indexer.rs

+            if is_container(kind, lang) {
+                for i in (0..node.child_count()).rev() {
+                    if let Some(child) = node.child(i as u32) {
+                        stack.push((child, Some(new_scope.clone())));


[minor] memory-allocation (from: performance, code-simplifier)

walk_node clones the scope String for every child of a container node. A container with N direct children allocates N String copies of the symbol name. Previously the scope was an &str reference with zero allocation cost.

Recommendation: Consider wrapping scope in Rc<str> or Arc<str> so all children of a container share one heap allocation via a reference-count increment instead of N independent copies.

etr · 2026-04-03T22:08:07Z

src/indexer.rs

+    let mut chars = s.chars();
+    let prefix: String = chars.by_ref().take(max).collect();
+    if chars.next().is_none() {
        return s.to_string();


[minor] code-elegance (from: code-quality, performance)

prefix is already allocated on line 516 containing the first max chars. On this short-path branch, prefix contains exactly what the caller needs, but it is discarded and s.to_string() allocates a second String.

Recommendation: Replace return s.to_string(); with return prefix; to reuse the existing allocation.

etr · 2026-04-03T22:08:07Z

src/indexer.rs

-    let truncated = &s[..MAX_DOC_COMMENT_LEN];
-    if let Some(pos) = truncated.rfind(' ') {
-        format!("{}...", &s[..pos])
+    if let Some(pos) = prefix.rfind(' ') {


[minor] unicode-handling (from: security)

rfind(' ') matches only ASCII space (U+0020). Non-ASCII Unicode whitespace characters (e.g. U+2009 THIN SPACE, U+00A0 NO-BREAK SPACE) won't match, falling through to the hard-truncation branch.

Recommendation: Use prefix.rfind(char::is_whitespace) to handle the full Unicode whitespace category, consistent with the char-based approach used elsewhere in this function.

etr · 2026-04-03T22:08:07Z

src/indexer.rs

@@ -1796,34 +1796,39 @@ pub fn extract_references(tree: &Tree, source: &str, file: &str, lang: Lang) ->
 }

 /// Recursively walk a node tree collecting references.


[minor] documentation (from: code-quality, architecture)

Doc comment says "Recursively walk a node tree" but the implementation now uses iterative DFS with an explicit stack.

Recommendation: Update to "Walk a node tree with an explicit stack (iterative DFS)" to match the implementation.

etr · 2026-04-03T22:08:07Z

src/indexer.rs

+        assert!(s.len() > MAX_DOC_COMMENT_LEN);
+        let out = truncate_doc(&s);
+        assert!(out.ends_with("..."));
+        assert!(out.chars().count() <= MAX_DOC_COMMENT_LEN + 4);


[minor] test-precision (from: code-quality, test-quality, architecture)

The assertion uses MAX_DOC_COMMENT_LEN + 4 but the maximum possible output is MAX_DOC_COMMENT_LEN + 3 (up to max prefix chars + 3 chars for "..."). The slack of +1 weakens the assertion's ability to catch an off-by-one regression.

Recommendation: Tighten to <= MAX_DOC_COMMENT_LEN + 3.

etr requested changes Apr 3, 2026

View reviewed changes

etr approved these changes Apr 4, 2026

View reviewed changes

etr merged commit 1092677 into etr:main Apr 4, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix indexing crashes on deep or non-ASCII sources#1

Fix indexing crashes on deep or non-ASCII sources#1
etr merged 1 commit intoetr:mainfrom
deepskyblue86:main

deepskyblue86 commented Apr 3, 2026

Uh oh!

etr left a comment

Uh oh!

etr Apr 3, 2026

Uh oh!

etr Apr 3, 2026

Uh oh!

etr Apr 3, 2026

Uh oh!

etr Apr 3, 2026

Uh oh!

etr Apr 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -1796,34 +1796,39 @@ pub fn extract_references(tree: &Tree, source: &str, file: &str, lang: Lang) ->
		}

		/// Recursively walk a node tree collecting references.

Conversation

deepskyblue86 commented Apr 3, 2026

Uh oh!

etr left a comment

Choose a reason for hiding this comment

Groundwork PR Review

General Findings

Uh oh!

etr Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

etr Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

etr Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

etr Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

etr Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants