Windows: tree-sitter definitions pass fails on ALL files when repo_path contains non-ASCII (CJK / Cyrillic / Arabic / accented Latin / etc.) characters #636

Description

@maxniu1

Bug Description

On Windows, when the repo_path passed to index_repository contains Chinese (CJK) characters in directory or file names, the definitions pass (tree-sitter AST parsing) fails on every file: the pipeline log reports defs=0, calls=0, imports=0, errors=N. The resulting knowledge graph contains only File/Folder/Project nodes, with no Function, Class, Method, or Variable nodes, which makes the index effectively useless for code intelligence.

This is more severe than issue #571 (project name truncation for CJK paths): that issue is cosmetic, while this one prevents all code-level analysis.

Steps to Reproduce

  1. On Windows, create or index a project whose path contains Chinese characters, e.g.:
    F:/项目资产-ProjectAssets/智能体/智能体原型/
    
  2. Run:
    codebase-memory-mcp cli index_repository '{"repo_path":"F:/项目资产-ProjectAssets/智能体/智能体原型"}'
    
  3. Observe the pipeline log — the definitions pass shows:
    pass=definitions defs=0 calls=0 imports=0 errors=41
    
  4. get_architecture returns only File/Folder nodes, zero Function/Class/Method nodes
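
The steps above can be scripted. This is a minimal sketch, assuming the codebase-memory-mcp binary is on PATH; the helper names (make_cjk_repo, build_index_command) and the one-line core.js are made up for illustration, and only the CLI invocation itself comes from step 2.

```python
import json
import shutil
import subprocess
import tempfile
from pathlib import Path


def make_cjk_repo(root: Path) -> Path:
    """Create a tiny repo under directories whose names contain CJK characters."""
    repo = root / "项目资产-ProjectAssets" / "智能体"
    repo.mkdir(parents=True)
    # Hypothetical one-function file; any parseable source file should do.
    (repo / "core.js").write_text("function hello() { return 1; }\n", encoding="utf-8")
    return repo


def build_index_command(repo_path: str) -> list[str]:
    """Build the argv for the index_repository call shown in step 2."""
    # ensure_ascii=False keeps the CJK characters literal instead of \uXXXX escapes.
    payload = json.dumps({"repo_path": repo_path}, ensure_ascii=False)
    return ["codebase-memory-mcp", "cli", "index_repository", payload]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        repo = make_cjk_repo(Path(tmp))
        cmd = build_index_command(repo.as_posix())
        print(cmd)
        # Assumption: the binary is installed; skip the run if it is not found.
        if shutil.which(cmd[0]):
            subprocess.run(cmd, check=False)
```

Passing the command as an argv list avoids shell quoting differences between git-bash and cmd.exe, so the path is the only variable under test.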

Expected Behavior

Files at paths with Chinese characters should be parsed by tree-sitter identically to ASCII-only paths. A 41-file TypeScript/JS/CSS project should produce, e.g., defs=200+ with errors=0.

Actual Behavior (Verbose Pipeline Log)

pass=discover files=41
pass=structure nodes=60 edges=58           ← File/Folder tree works fine
pass=definitions defs=0 calls=0 imports=0 errors=41  ← ALL files fail
pass=calls total=0 resolved=0 unresolved=0 errors=41
pass=semantic inherits=0 decorates=0 implements=0 errors=41
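
For anyone triaging similar reports, the failure signature in this log can be detected mechanically. A rough sketch, assuming the log lines keep the pass=<name> key=value layout shown above; parse_pass_line and all_files_failed are illustrative helpers, not part of codebase-memory-mcp.

```python
import re

# Matches "pass=<name>" followed by one or more "key=<int>" counters.
PASS_RE = re.compile(r"^pass=(\w+)((?:\s+\w+=\d+)+)")


def parse_pass_line(line: str):
    """Return (pass_name, {counter: value}) or None if the line is not a pass line."""
    m = PASS_RE.match(line.strip())
    if not m:
        return None
    counters = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)", m.group(2))}
    return m.group(1), counters


def all_files_failed(log: str) -> bool:
    """True when the definitions pass found no defs and errored once per discovered file."""
    passes = dict(p for p in map(parse_pass_line, log.splitlines()) if p)
    files = passes.get("discover", {}).get("files")
    definitions = passes.get("definitions")
    if files is None or definitions is None:
        return False
    return files > 0 and definitions.get("defs") == 0 and definitions.get("errors") == files
```

Trailing annotations such as the "←" notes above are ignored, because the pattern stops at the last key=value pair.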

Root Cause Verified

Identical file content, copied to an ASCII-only path, indexes correctly:

| Test | Path | defs | calls | errors |
| --- | --- | --- | --- | --- |
| ASCII path | C:/tmp/core-test.js | 30 | 103 | 0 |
| Chinese path | F:/...智能体.../core.js | 0 | 0 | 41 |

The 417-line JS file is byte-identical in both locations. The only variable is whether the parent path contains CJK characters. Languages tested: JavaScript, TypeScript (.ts/.tsx), Python, CSS. All fail identically when the path has CJK characters.
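
The byte-identical claim can be checked by hashing both copies. A small sketch using only the standard library; the helper names and the core-test.js target name are illustrative.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the raw bytes so encoding or newline handling cannot hide a difference."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copy_to_ascii_path(src: Path, ascii_dir: Path) -> Path:
    """Copy src into an ASCII-only directory and return the new path."""
    ascii_dir.mkdir(parents=True, exist_ok=True)
    dst = ascii_dir / "core-test.js"
    shutil.copyfile(src, dst)
    return dst


def same_bytes(a: Path, b: Path) -> bool:
    """True when both files have identical content."""
    return sha256_of(a) == sha256_of(b)
```

If same_bytes returns True for the two copies and only the CJK-path copy fails to index, file content is ruled out as the cause.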

Environment

  • OS: Windows 10 (64-bit)
  • codebase-memory-mcp version: v0.8.1
  • Filesystem: NTFS (file names stored as UTF-16)
  • Shell tested: git-bash (MSYS2); same result via a direct Python subprocess call

Suspected Cause

Likely a Windows-specific encoding issue. The structure pass (directory enumeration via FindFirstFileW/FindNextFileW) works because it natively uses wide-char UTF-16 paths. But the definitions pass probably handles file paths as plain char* and opens the files with fopen or a similar narrow-character API, which on Windows interprets the path in the ANSI code page (e.g., GBK on zh-CN systems) rather than as UTF-8. The open would then fail on a garbled path, so tree-sitter never receives the file contents for AST parsing.
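
To make the suspected mechanism concrete, the sketch below simulates what a narrow-character API would see if it were handed UTF-8 path bytes and interpreted them in an ANSI code page. It only illustrates the hypothesis and is not taken from the codebase-memory-mcp sources; misread_as_ansi and path_survives are made-up names.

```python
def misread_as_ansi(path: str, codepage: str = "gbk") -> str:
    """Decode the UTF-8 bytes of `path` as if they were in `codepage`.

    This mimics a narrow-character file API receiving UTF-8 bytes while the
    system ANSI code page is something else (e.g. GBK or cp1252).
    """
    return path.encode("utf-8").decode(codepage, errors="replace")


def path_survives(path: str, codepage: str = "gbk") -> bool:
    """True when the misinterpreted path still names the same file."""
    return misread_as_ansi(path, codepage) == path


if __name__ == "__main__":
    for p in ("C:/tmp/core-test.js", "F:/智能体/core.js"):
        print(p, "->", misread_as_ansi(p), "| survives:", path_survives(p))
```

ASCII-only paths come through unchanged because ASCII bytes mean the same thing in UTF-8, GBK, and cp1252. Any path with non-ASCII characters turns into a different string, which matches the observation that only the CJK path fails.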

On Linux/macOS, UTF-8 char* paths work natively because the file APIs accept UTF-8 byte strings directly. This would explain why the definitions-pass failure was not caught on those platforms, even though related CJK-path issues such as #571 were reported.

Metadata

Assignees

No one assigned

    Labels

    • bug: Something isn't working
    • parsing/quality: Graph extraction bugs, false positives, missing edges
    • priority/high: Needs near-term maintainer attention; high-impact bug, regression, safety issue, or release blocker.
    • windows: Windows-specific issues
