Conversation
Migrated from FalkorDB/code-graph-backend PR #59. Original issue: FalkorDB/code-graph-backend#51 Resolves #540 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| def analyze_sources(self, path: Path, ignore: list[str], graph: Graph) -> None: | ||
| path = path.resolve() | ||
| files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) | ||
| files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) + list(path.rglob("*.js")) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 1 day ago
In general, to fix this you should ensure that any filesystem path derived from user input is constrained to a safe root directory and normalized before use. That means: (1) define a base directory under which all analyses must occur; (2) resolve the user-supplied path relative to that base; (3) reject anything whose normalized/real path escapes that base (e.g., via .. or absolute paths); and (4) only then pass the safe resolved path into the rest of the code. This prevents a client from causing the server to traverse or operate on arbitrary parts of the filesystem.
For this codebase, the best minimally invasive fix is to introduce such validation in SourceAnalyzer.analyze_local_folder, since that is the point where the untrusted path: str first enters the analyzer layer and gets converted to a Path. We can:
- Decide on a safe root for analysis, e.g. an environment-variable-controlled base like
CODE_GRAPH_PROJECTS_ROOT, defaulting to the current working directory if unset. This keeps behavior similar while allowing operators to constrain where analyses can occur. - In
analyze_local_folder, convert both the configured base path and the userpathinto absolute, resolvedPathobjects using.resolve(). - If
pathis not under the base (check viarelative_toor a simple prefix check), log an error and raise an exception instead of proceeding. - Pass the resolved, safe
Pathtoanalyze_sourcesso that downstream calls (path.rglob(...)) operate only within the validated directory tree.
This change only touches api/analyzers/source_analyzer.py, keeps the external API of analyze_local_folder unchanged, and preserves existing functionality for valid paths that lie under the configured base directory. We will add a small helper method inside SourceAnalyzer to encapsulate the “ensure path under base” logic and call it from analyze_local_folder. We do not need new imports beyond what already exists.
| @@ -1,6 +1,7 @@ | ||
| from contextlib import nullcontext | ||
| from pathlib import Path | ||
| from typing import Optional | ||
| import os | ||
|
|
||
| from api.entities.entity import Entity | ||
| from api.entities.file import File | ||
| @@ -184,6 +185,34 @@ | ||
| # Second pass analysis of the source code | ||
| self.second_pass(graph, files, path) | ||
|
|
||
| def _resolve_and_validate_path(self, user_path: str) -> Path: | ||
| """ | ||
| Resolve a user-supplied path against a safe root and ensure it does not escape. | ||
|
|
||
| The safe root can be configured via the CODE_GRAPH_PROJECTS_ROOT environment | ||
| variable; if unset, the current working directory is used. | ||
| """ | ||
| base_dir_env = os.environ.get("CODE_GRAPH_PROJECTS_ROOT") | ||
| if base_dir_env: | ||
| base_dir = Path(base_dir_env) | ||
| else: | ||
| base_dir = Path.cwd() | ||
|
|
||
| base_dir = base_dir.resolve() | ||
| candidate = Path(user_path) | ||
| if not candidate.is_absolute(): | ||
| candidate = base_dir / candidate | ||
| candidate = candidate.resolve() | ||
|
|
||
| try: | ||
| # Ensure candidate is within base_dir | ||
| candidate.relative_to(base_dir) | ||
| except ValueError: | ||
| logging.error(f"Requested path '{candidate}' is outside of allowed base directory '{base_dir}'") | ||
| raise ValueError("Requested path is not allowed") | ||
|
|
||
| return candidate | ||
|
|
||
| def analyze_local_folder(self, path: str, g: Graph, ignore: Optional[list[str]] = []) -> None: | ||
| """ | ||
| Analyze path. | ||
| @@ -195,8 +224,11 @@ | ||
|
|
||
| logging.info(f"Analyzing local folder {path}") | ||
|
|
||
| # Resolve and validate the provided path against a safe root | ||
| safe_path = self._resolve_and_validate_path(path) | ||
|
|
||
| # Analyze source files | ||
| self.analyze_sources(Path(path), ignore, g) | ||
| self.analyze_sources(safe_path, ignore, g) | ||
|
|
||
| logging.info("Done analyzing path") | ||
|
|
| def analyze_sources(self, path: Path, ignore: list[str], graph: Graph) -> None: | ||
| path = path.resolve() | ||
| files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) | ||
| files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) + list(path.rglob("*.js")) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 1 day ago
General approach: treat user-provided paths as relative or confined within a known safe root directory on the server, and validate/normalize them before using them as a root for filesystem traversal. Use Path.resolve() and then verify that the resolved path is within a configured base directory (for example, the directory where repositories are stored). Reject or error out if the requested path escapes this root. This prevents a client from causing analysis of arbitrary directories like /etc or /.
Best concrete fix here: centralize the trust boundary in SourceAnalyzer.analyze_local_folder by:
- Introducing a function that returns the safe root directory (e.g., from an environment variable or a default like
REPOS_ROOTunder the project). Since we must not assume wider project structure, we’ll use an environment variableCODEGRAPH_ROOTwith a reasonable default (current working directory). - In
analyze_local_folder, convert the incomingpathstring into aPath, resolve it, and then check that it is contained within the safe root by comparingPath.is_relative_to(Python 3.9+) or an equivalent prefix check. - If the check fails, log an error and raise an exception; if it passes, proceed to call
analyze_sourceswith the resolved safePath. - Optionally, apply similar confinement in
analyze_local_repository, which also takes a user-controlledpath, before interacting withpygit2.Repository.
This confines all downstream uses, including the rglob on line 180, without changing the external API signatures or the functional behavior for legitimate, in-root paths. The only edits needed are within api/analyzers/source_analyzer.py: adding imports for os (for environment variable) if needed, a helper to get/validate the root, and modifications to analyze_local_folder (and analyze_local_repository) to perform normalization and the “within root” check.
| def analyze_sources(self, path: Path, ignore: list[str], graph: Graph) -> None: | ||
| path = path.resolve() | ||
| files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) | ||
| files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) + list(path.rglob("*.js")) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 1 day ago
General approach: normalize and validate the user-supplied path before using it for filesystem traversal. Optionally (and recommended), restrict analysis to live under a configured safe root directory. Even if we cannot see configuration here, we can at least normalize and refuse paths that do not resolve to directories or that attempt to climb above an optional root.
Best concrete fix within the shown code:
- In
SourceAnalyzer.analyze_local_folder, convert the incomingpath: strto a normalized, absolutePathusingPath(path).resolve(strict=True)inside atryblock. - Optionally support an environment variable
CODEGRAPH_REPOS_ROOT(a safe base directory). If set, ensure the requested path is inside that root. This mirrors the “safe-root” pattern in the background section and doesn’t change existing behavior when the env var is unset. - Pass this validated
Pathintoanalyze_sourcesinstead of creating a newPathfrom the raw string. - Keep the public function signature unchanged to avoid breaking callers.
Concretely, in api/analyzers/source_analyzer.py:
- Modify
analyze_local_folderso that:- It resolves
pathwithbase_path = Path(path).resolve(strict=True). - If
CODEGRAPH_REPOS_ROOTis set in the environment, also resolves that toroot = Path(os.environ["CODEGRAPH_REPOS_ROOT"]).resolve(strict=True)and verifiesstr(base_path).startswith(str(root)). If not, it raisesValueErroror logs and returns early. - It calls
self.analyze_sources(base_path, ignore, g)instead ofPath(path).
- It resolves
This adds robust validation at the boundary where untrusted data enters the analyzer, addresses all CodeQL variants that flow through this method, and preserves existing functionality when no root is configured.
We will need to add an import os at the top of api/analyzers/source_analyzer.py to access environment variables.
| @@ -18,6 +18,7 @@ | ||
| from multilspy.multilspy_logger import MultilspyLogger | ||
|
|
||
| import logging | ||
| import os | ||
| # Configure logging | ||
| logging.basicConfig(level=logging.DEBUG, format='%(filename)s - %(asctime)s - %(levelname)s - %(message)s') | ||
|
|
||
| @@ -193,10 +194,29 @@ | ||
| ignore (List(str)): List of paths to skip | ||
| """ | ||
|
|
||
| logging.info(f"Analyzing local folder {path}") | ||
| try: | ||
| base_path = Path(path).resolve(strict=True) | ||
| except FileNotFoundError: | ||
| logging.error("Path '%s' does not exist or is not accessible", path) | ||
| return | ||
|
|
||
| safe_root = os.environ.get("CODEGRAPH_REPOS_ROOT") | ||
| if safe_root: | ||
| try: | ||
| root_path = Path(safe_root).resolve(strict=True) | ||
| except FileNotFoundError: | ||
| logging.error("Configured CODEGRAPH_REPOS_ROOT '%s' does not exist", safe_root) | ||
| return | ||
| base_path_str = str(base_path) | ||
| root_path_str = str(root_path) | ||
| if not base_path_str.startswith(root_path_str.rstrip(os.sep) + os.sep) and base_path_str != root_path_str: | ||
| logging.error("Path '%s' is outside of allowed root '%s'", base_path, root_path) | ||
| return | ||
|
|
||
| logging.info(f"Analyzing local folder {base_path}") | ||
|
|
||
| # Analyze source files | ||
| self.analyze_sources(Path(path), ignore, g) | ||
| self.analyze_sources(base_path, ignore, g) | ||
|
|
||
| logging.info("Done analyzing path") | ||
|
|
| def analyze_sources(self, path: Path, ignore: list[str], graph: Graph) -> None: | ||
| path = path.resolve() | ||
| files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) | ||
| files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) + list(path.rglob("*.js")) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 1 day ago
In general, to fix uncontrolled path usage you must (1) normalize the user-provided path, and (2) enforce that it lies within an allowed base directory (or otherwise belongs to an allow‑list) before using it. For directory‑wide operations like rglob, define a server‑side base directory (e.g., from an environment variable or a constant), resolve both base and user path, and then verify that the user path is a subpath of the base. If the check fails, reject the request.
For this codebase, the most direct and non‑disruptive fix is inside SourceAnalyzer.analyze_sources, because that is where Path(path) is used to traverse the filesystem with rglob. We can:
- Define an allowed root directory for analysis using an environment variable such as
CODE_GRAPH_BASE_DIR(defaulting to the current working directory or another reasonable root). - Resolve both the base directory and the user‑supplied
path. - Check that
pathis equal to the base dir or is a subdirectory of it. In Python ≥3.9 this can be done safely withPath.is_relative_to; we can implement a small helper usingPath.relative_toif needed. - If the check fails, raise a clear exception instead of proceeding.
This keeps all existing behavior for callers who already point to directories under the chosen base, and restricts path traversal to safe locations. The only file that needs code changes is api/analyzers/source_analyzer.py. We will:
- Import
osto read the environment variable. - In
analyze_sources, computebase_dir = Path(os.getenv("CODE_GRAPH_BASE_DIR", ".")).resolve()and validate that the resolvedpathis underbase_dirbefore doingrglob. - Log an error and raise
ValueError(or similar) if the check fails.
No changes are needed in tests/index.py or api/index.py to achieve the core mitigation, since their use of SourceAnalyzer will automatically be constrained by this validation.
| @@ -1,6 +1,7 @@ | ||
| from contextlib import nullcontext | ||
| from pathlib import Path | ||
| from typing import Optional | ||
| import os | ||
|
|
||
| from api.entities.entity import Entity | ||
| from api.entities.file import File | ||
| @@ -176,7 +177,16 @@ | ||
| self.second_pass(graph, files, path) | ||
|
|
||
| def analyze_sources(self, path: Path, ignore: list[str], graph: Graph) -> None: | ||
| # Resolve the target path and enforce that it lies within an allowed base directory. | ||
| path = path.resolve() | ||
| base_dir_env = os.getenv("CODE_GRAPH_BASE_DIR", ".") | ||
| base_dir = Path(base_dir_env).resolve() | ||
| try: | ||
| # This will raise ValueError if 'path' is not inside 'base_dir'. | ||
| path.relative_to(base_dir) | ||
| except ValueError: | ||
| logging.error("Refusing to analyze path '%s' outside of base directory '%s'", path, base_dir) | ||
| raise ValueError(f"Path '{path}' is not allowed for analysis") | ||
| files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) + list(path.rglob("*.js")) | ||
| # First pass analysis of the source code | ||
| self.first_pass(path, files, ignore, graph) |
Migrated from falkordb/code-graph-backend#59
Summary
Add support for JavaScript code analysis using tree-sitter.
Changes:
JavaScriptAnalyzerclass using tree-sitter for JavaScriptsource_analyzer.pyto include JavaScripttree-sitter-javascriptdependencyResolves #540
Originally authored by @gkorland in falkordb/code-graph-backend#59