Add in-repository Tree-sitter parser and grammar for Bash++ #25
NatnaelTaddese wants to merge 6 commits into
Conversation
|
From our conversation before, we were talking about adding an extension for the Zed editor. Is tree-sitter absolutely necessary for this? That's a truly massive amount of added complexity: maintaining an entirely new parser on top of the one the compiler actually uses. I'm not willing to try to maintain feature parity between two distinct parsers; that's too much work to keep alive over time.
Basic syntax highlighting can be done with simple regular expressions (as evidenced by the TextMate grammar used by the VSCode extension). For anything more advanced (requiring the kind of deeper semantic understanding provided by a full parser), we can eventually add Semantic Tokens support to the language server.
tree-sitter-bashpp/src/parser.c is literally 674,618 lines of code. The entire source tree currently is fewer than 20,000 lines of code, and that provides a complete compiler, language server, standard library, and editor extension for VSCode. How can one feature add more than half a million lines? Clearly we need to re-think this.
If our goal is to add an extension for Zed, let's start by looking at the Zed documentation to see how extensions are structured in that editor. I really doubt that we need to write a new parser for this.
|
From my brief glance at the Zed documentation, it seems that unfortunately tree-sitter might in fact be required for syntax highlighting in that editor. That's an odd design choice, but it's their editor. If there's an alternative (simpler) route available to us, we should take it, but that remains to be seen.
At least we could more properly review this patch if we separated the source code from the generated code. I believe the 600,000+ line file was auto-generated by tree-sitter? Would you mind editing this patch to strip it down to just the source code? After that we could do a better review.
Also, maybe there's a way to restrict the tree-sitter grammar to a basic set of regular expressions to recognize token types, rather than a full parser requiring feature parity? I've never used tree-sitter before, so I'm not 100% sure what's possible or conventional.
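To make the "basic set of regular expressions to recognize token types" idea concrete, here is a minimal sketch in plain JavaScript. It is not Tree-sitter grammar code and is not taken from this PR: the rule table, the token type names, and the `tokenize` helper are all hypothetical, and only the `@`-keywords themselves come from the PR description. It is also deliberately naive about Bash (for example, it treats every `#` as the start of a comment).

```javascript
// Hypothetical token classes for a shallow, highlighting-only pass.
// Only the @-keywords are taken from the PR description; the rest is assumed.
const TOKEN_RULES = [
  { type: "comment", pattern: /#.*/y },
  { type: "keyword", pattern: /@(?:this|super|new|delete|nullptr|typeof)\b/y },
  { type: "reference", pattern: /@[A-Za-z_][A-Za-z0-9_.]*/y },
  { type: "whitespace", pattern: /\s+/y },
  // Fallback: any run of other characters, or a lone '@' or '#'.
  { type: "text", pattern: /[^\s@#]+|[@#]/y },
];

// Walk the source left to right, trying each rule at the current offset.
// The sticky flag (y) forces each pattern to match exactly at lastIndex.
function tokenize(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    for (const { type, pattern } of TOKEN_RULES) {
      pattern.lastIndex = pos;
      const match = pattern.exec(source);
      if (match) {
        if (type !== "whitespace") {
          tokens.push({ type, text: match[0] });
        }
        pos += match[0].length;
        break;
      }
    }
  }
  return tokens;
}
```

For example, `tokenize("@new Foo # make one")` yields a `keyword` token for `@new`, a `text` token for `Foo`, and a `comment` token for the rest of the line. Whether Tree-sitter can be restricted to this kind of flat token recognition is the open question raised in the comment above; the sketch only illustrates the regular-expression side of it.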
|
I agree that this PR has grown beyond what is reasonable for the original goal of adding Bash++ support to Zed. One clarification about the diff size: the approximately 674,000-line Excluding all generated artifacts (
I also reviewed Zed's current extension documentation more carefully. Zed requires each language definition to specify a Tree-sitter grammar and does not directly consume the existing TextMate grammar. However, Bash++ does not necessarily need a complete Tree-sitter parser: a smaller Zed extension can start by reusing Zed's existing Bash grammar for
A Bash++ Tree-sitter grammar could still have value beyond Zed. The same grammar could support Neovim, Helix, Emacs, structural selection, folding, code outlines, text objects, and other editor-independent tooling. That broader potential was why I initially kept the parser in this repository and tested it against the compiler's examples and test sources.
Yes, One caveat is that Zed compiles a grammar's generated
Tree-sitter rules can use regular expressions for lexical tokens, so I also agree that we should investigate a deliberately shallow, highlighting-oriented grammar instead of promising full compiler-parser parity. The goal would be to reuse the upstream Bash grammar, recognize only the Bash++ constructs needed for highlighting, and rely on Tree-sitter error recovery plus the language server for deeper semantic behavior.
There is a tradeoff: inheriting
I think a revision around two principles would be nice:
|
Wow. Clearly I have a fair bit to learn about the Zed editor.
Well, at any rate, anything required by the editor at runtime should be generated by the build procedure rather than bundled into the source tree; the principle is that only source code should exist in the source tree. Let's just start with removing generated code and go from there.
It looks like this might end up being a somewhat longer conversation re: tree-sitter and the cost/benefit of substantial added complexity. You and I can meet over coffee at some point to discuss.
|
Sounds good. I have removed the generated parser files and pushed the update. I agree that the broader maintenance cost and intended scope deserve a separate discussion. I’d be happy to meet over coffee (tea, actually; I don't like coffee) and work through the tradeoffs before we decide how far the Tree-sitter integration should go.
Checks
[✓] The code follows the general style of this project
[✓] All tests are passing within this branch (make test)
[✓] Appropriate regression tests have been added (if new functionality is being introduced)
[ ] No new tests are required
Description
This PR adds a complete in-repository Tree-sitter parser for Bash++ under tree-sitter-bashpp/.
The parser extends the pinned tree-sitter-bash 0.25.1 grammar rather than reimplementing Bash. This follows the project principle of processing as little Bash as possible while providing structured syntax nodes for Bash++ features.
The named nodes and fields introduced here establish the public syntax-tree interface for future highlighting, editor, and language-server integrations.
Parser infrastructure
The new parser directory includes:
- tree-sitter.json
The following tooling is pinned:
- tree-sitter-bash 0.25.1
- tree-sitter-cli 0.25.10
Generated parser files are committed so downstream editor integrations can compile the parser without requiring Node.js or regenerating the grammar.
Supported declarations
The grammar exposes named nodes and fields for:
Supported expressions
The grammar supports:
- @this and @super references
- @new
- @delete
- @nullptr
- @typeof
Bash++ interpolation is recognized inside:
Bash compatibility
The parser inherits upstream Bash grammar behavior and adds only the syntax required for Bash++.
The adapted external scanner handles cases requiring token-boundary awareness or lookahead, including:
Compatibility coverage also includes Bash arrays, ${array[@]}, C-style arithmetic loops, conditionals, pipelines, redirections, and compound commands.
Build integration
The following optional root Make targets are added:
Tree-sitter is deliberately excluded from the default compiler build.
make test-tree-sitter performs:
- .bpp parsing
A dedicated GitHub Actions workflow runs these checks when the parser or Bash++ source fixtures change.
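This PR reports that source files parse without ERROR or MISSING nodes. As a rough illustration of how such a check can work, the sketch below scans a parse tree that has been dumped as S-expression text, where Tree-sitter writes such nodes as `(ERROR ...)` and `(MISSING ...)`. The `findParseProblems` helper and the sample tree are hypothetical and are not the script used in this PR.

```javascript
// Hypothetical helper: report ERROR and MISSING markers in an
// S-expression dump of a Tree-sitter parse tree.
function findParseProblems(sexp) {
  const problems = [];
  // Match an opening parenthesis immediately followed by the marker name.
  const marker = /\((ERROR|MISSING)\b/g;
  for (const match of sexp.matchAll(marker)) {
    problems.push({ kind: match[1], offset: match.index });
  }
  return problems;
}

// A file counts as cleanly parsed only when no markers are found.
function parsesCleanly(sexp) {
  return findParseProblems(sexp).length === 0;
}
```

Expected recovery cases such as parser-errors-1.bpp and parser-errors-2.bpp would be asserted the other way around: a check of this kind should find at least one marker in their trees.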
Test results
- .bpp files parse without ERROR or MISSING nodes
- parser-errors-1.bpp and parser-errors-2.bpp are verified as expected recovery cases
- make test-tree-sitter passes
- make test suite passes on Linux
Repository-wide parsing covers:
- examples/
- test-suite/
- wiki/_includes/code/snippets/
Licensing
New parser work is licensed under GPL-3.0-or-later.
Adapted tree-sitter-bash material retains its MIT notice in:
- tree-sitter-bashpp/THIRD_PARTY_LICENSES/tree-sitter-bash-MIT.txt
Review guidance
Most of the diff consists of generated files. The primary files for manual review are:
- tree-sitter-bashpp/grammar.js
- tree-sitter-bashpp/src/scanner.c
- tree-sitter-bashpp/test/corpus/
- tree-sitter-bashpp/test/parse-repository-files.sh
- tree-sitter-bashpp/Makefile
- makefile
- .github/workflows/tree-sitter.yml
The generated files can be verified by running:
Scope
This PR intentionally excludes:
These integrations can be developed separately against the named nodes and fields introduced by this parser.
License Agreement
I hereby confirm that the work submitted in this pull request is my own and I agree that my contributions will be licensed under the same license as the project, which is the GNU General Public License v3.0 or later (GPL-3.0-or-later).