Add in-repository Tree-sitter parser and grammar for Bash++ #25

Open
NatnaelTaddese wants to merge 6 commits into rail5:main from NatnaelTaddese:feature/tree-sitter

@NatnaelTaddese

Description

This PR adds a complete in-repository Tree-sitter parser for Bash++ under tree-sitter-bashpp/.

The parser extends the pinned tree-sitter-bash 0.25.1 grammar rather than reimplementing Bash. This follows the project principle of processing as little Bash as possible while providing structured syntax nodes for Bash++ features.

The named nodes and fields introduced here establish the public syntax-tree interface for future highlighting, editor, and language-server integrations.

Parser infrastructure

The new parser directory includes:

  • Grammar metadata and tree-sitter.json
  • Locked npm dependencies
  • ESLint configuration
  • Generated C parser sources
  • An adapted external scanner
  • Parser documentation
  • Corpus and repository-wide tests
  • GPL-3.0-or-later licensing and the required upstream MIT notice

The following tooling is pinned:

  • tree-sitter-bash 0.25.1
  • tree-sitter-cli 0.25.10

Generated parser files are committed so downstream editor integrations can compile the parser without requiring Node.js or regenerating the grammar.

Supported declarations

The grammar exposes named nodes and fields for:

  • Classes and inheritance
  • Public, private, protected, and virtual modifiers
  • Data members
  • Methods and parameters
  • Constructors and destructors
  • Object declarations
  • Pointer declarations
  • Static, dynamic, and one-time includes

Supported expressions

The grammar supports:

  • Object references
  • @this and @super references
  • Braced and indexed references
  • Address expressions
  • Pointer dereferences
  • Object assignments
  • @new
  • @delete
  • @nullptr
  • @typeof
  • Dynamic casts
  • Supershells

Bash++ interpolation is recognized inside:

  • Double-quoted strings
  • Heredocs
  • Arithmetic expressions
  • Test expressions
  • Variable assignments
  • Command arguments and substitutions

Bash compatibility

The parser inherits upstream Bash grammar behavior and adds only the syntax required for Bash++.

The adapted external scanner handles cases requiring token-boundary awareness or lookahead, including:

  • Distinguishing object and pointer declarations from references
  • Object-assignment targets and operators
  • Reference termination
  • Newline handling
  • Bash++ interpolation inside heredocs

Compatibility coverage also includes Bash arrays, ${array[@]}, C-style arithmetic loops, conditionals, pipelines, redirections, and compound commands.

Build integration

The following optional root Make targets are added:

make tree-sitter
make test-tree-sitter
make clean-tree-sitter

Tree-sitter is deliberately excluded from the default compiler build.

make test-tree-sitter performs:

  1. Parser generation
  2. Grammar linting
  3. Corpus testing
  4. Repository-wide .bpp parsing
  5. Recovery-fixture validation
  6. Generated-file reproducibility checking
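
Step 6 can be pictured as a before/after comparison of the generated sources. The sketch below is a minimal stand-in, not the Makefile's actual recipe: it checksums a directory, runs a regeneration command, and fails if anything changed. The helper names `snapshot` and `is_reproducible` are hypothetical.

```shell
#!/bin/sh
# Print a checksum line for every file under a directory, in a stable order.
snapshot() {
    (cd "$1" && find . -type f | LC_ALL=C sort | while IFS= read -r f; do
        cksum "$f"
    done)
}

# Usage: is_reproducible DIR COMMAND [ARGS...]
# Returns 0 when COMMAND leaves the checksums of all files under DIR unchanged,
# 1 when a file was added, removed, or modified, and 2 when COMMAND itself fails.
is_reproducible() {
    dir=$1
    shift
    before=$(snapshot "$dir")
    "$@" || return 2
    after=$(snapshot "$dir")
    [ "$before" = "$after" ]
}
```

Assuming the Tree-sitter CLI is on `PATH`, a call could look like `is_reproducible tree-sitter-bashpp/src sh -c 'cd tree-sitter-bashpp && tree-sitter generate'`.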

A dedicated GitHub Actions workflow runs these checks when the parser or Bash++ source fixtures change.

Test results

  • 14 Tree-sitter corpus tests pass
  • 157 valid repository .bpp files parse without ERROR or MISSING nodes
  • parser-errors-1.bpp and parser-errors-2.bpp are verified as expected recovery cases
  • Regenerating the tracked parser sources produces no diff
  • make test-tree-sitter passes
  • The existing compiler make test suite passes on Linux

Repository-wide parsing covers:

  • examples/
  • test-suite/
  • wiki/_includes/code/snippets/
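
The "no ERROR or MISSING nodes" criterion can be sketched as a filter over the S-expression that `tree-sitter parse` prints. This is an illustrative stand-in for test/parse-repository-files.sh, whose actual contents are not reproduced here. The function name `tree_has_errors` is hypothetical, and the pattern assumes error nodes are printed as `(ERROR ...)` and `(MISSING ...)`.

```shell
# Read a Tree-sitter S-expression on stdin.
# Exit status 0 if it contains an ERROR or MISSING node, 1 otherwise.
tree_has_errors() {
    grep -Eq '\((ERROR|MISSING)[ )]'
}
```

A valid file would be expected to make `tree-sitter parse file.bpp | tree_has_errors` return non-zero, while the two recovery fixtures would be expected to make it return zero.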

Licensing

New parser work is licensed under GPL-3.0-or-later.

Adapted tree-sitter-bash material retains its MIT notice in:

tree-sitter-bashpp/THIRD_PARTY_LICENSES/tree-sitter-bash-MIT.txt

Review guidance

Most of the diff consists of generated files. The primary files for manual review are:

  • tree-sitter-bashpp/grammar.js
  • tree-sitter-bashpp/src/scanner.c
  • tree-sitter-bashpp/test/corpus/
  • tree-sitter-bashpp/test/parse-repository-files.sh
  • tree-sitter-bashpp/Makefile
  • makefile
  • .github/workflows/tree-sitter.yml

The generated files can be verified by running:

make test-tree-sitter

Scope

This PR intentionally excludes:

  • Syntax-highlighting queries
  • Zed extension configuration
  • Language bindings
  • LSP integration
  • Changes to the existing compiler parser

These integrations can be developed separately against the named nodes and fields introduced by this parser.

License Agreement

I hereby confirm that the work submitted in this pull request is my own and I agree that my contributions will be licensed under the same license as the project, which is the GNU General Public License v3.0 or later (GPL-3.0-or-later).

@rail5

rail5 commented Jun 13, 2026

Owner

In our earlier conversation, we were talking about adding an extension for the Zed editor. Is tree-sitter absolutely necessary for this?

That's a truly massive amount of added complexity -- maintaining an entirely new parser on top of the one the compiler actually uses. I'm not willing to try to maintain feature parity between two distinct parsers; that's too much work to keep alive over time.

Basic syntax highlighting can be done with simple regular expressions (as evidenced by the TextMate grammar used by the VSCode extension). For anything more advanced (requiring the kind of deeper semantic understanding provided by a full parser), we can eventually add Semantic Tokens support to the language server.

tree-sitter-bashpp/src/parser.c is literally 674,618 lines of code. The entire source tree is currently fewer than 20,000 lines of code, and that provides a complete compiler, language server, standard library, and editor extension for VSCode. How can one feature add more than half a million lines? Clearly we need to rethink this.

If our goal is to add an extension for Zed, let's start by looking at the Zed documentation to see how extensions are structured in that editor. I really doubt that we need to write a new parser for this.

@rail5

rail5 commented Jun 13, 2026

Owner

From my brief glance at the Zed documentation, it seems that unfortunately tree-sitter might in fact be required for syntax highlighting in that editor. That's an odd design choice, but it's their editor. If there's an alternative (simpler) route available to us, we should take it, but that remains to be seen.

At least we could more properly review this patch if we separated the source code from the generated code. I believe the 600,000+ line file was auto-generated by tree-sitter?

Would you mind editing this patch to strip it down to just the source code? After that we could do a better review.

Also, maybe there's a way to restrict the tree-sitter grammar to a basic set of regular expressions that recognize token types, rather than a full parser requiring feature parity? I've never used tree-sitter before, so I'm not 100% sure what's possible or conventional.

@NatnaelTaddese

Author

I agree that this PR has grown beyond what is reasonable for the original goal of adding Bash++ support to Zed.

One clarification about the diff size: the approximately 674,000-line src/parser.c file is generated by tree-sitter generate; it is not handwritten code. Tree-sitter projects commonly commit this generated C parser so consumers can compile it without Node.js, npm dependencies, or the Tree-sitter CLI. The CI check regenerates it to ensure the committed output is reproducible.

Excluding all generated artifacts (parser.c, grammar.json, node-types.json, generated headers, and the lockfile), the PR contains:

  • 2,687 additions
  • 2 deletions
  • Core maintained parser code: 1,898 lines
    • grammar.js: 500
    • scanner.c: 1,398
  • Remaining lines are tests, CI, metadata, licensing, and documentation.
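
As a quick check, these figures are self-consistent: the two core files sum to 1,898 lines, which leaves 789 of the 2,687 additions for tests, CI, metadata, licensing, and documentation. The variable names below are only for illustration.

```shell
# Line counts quoted in this comment.
grammar_js=500
scanner_c=1398
additions=2687

# Core maintained parser code, then everything else in the additions.
core=$((grammar_js + scanner_c))
remaining=$((additions - core))

echo "core: $core"            # core: 1898
echo "remaining: $remaining"  # remaining: 789
```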

I also reviewed Zed's current extension documentation more carefully. Zed requires each language definition to specify a Tree-sitter grammar and does not directly consume the existing TextMate grammar. However, Bash++ does not necessarily need a complete Tree-sitter parser: a smaller Zed extension can start by reusing Zed's existing Bash grammar for .bpp files.

A Bash++ Tree-sitter grammar could still have value beyond Zed. The same grammar could support Neovim, Helix, Emacs, structural selection, folding, code outlines, text objects, and other editor-independent tooling. That broader potential was why I initially kept the parser in this repository and tested it against the compiler's examples and test sources.

@NatnaelTaddese

Author

> Would you mind editing this patch to strip it down to just the source code? After that we could do a better review

Yes, src/parser.c is generated by tree-sitter generate. I can remove it, along with the other generated parser artifacts, so this PR contains only the maintained source, tests, documentation, and build configuration. That reduces the reviewable patch to 2,687 added lines, including 500 lines of grammar.js, 1,398 lines of scanner code inherited and adapted from tree-sitter-bash, and the test corpus.

One caveat is that Zed compiles a grammar's generated src/parser.c directly into WebAssembly, so generated output will eventually need to exist somewhere that the Zed extension can fetch. It does not need to be part of this review, though; it could be generated later in a dedicated grammar repository or introduced separately after the maintained source is accepted.

Tree-sitter rules can use regular expressions for lexical tokens, so I also agree that we should investigate a deliberately shallow, highlighting-oriented grammar instead of promising full compiler-parser parity. The goal would be to reuse the upstream Bash grammar, recognize only the Bash++ constructs needed for highlighting, and rely on Tree-sitter error recovery plus the language server for deeper semantic behavior.
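
To illustrate how far plain regular expressions get for the keyword-like constructs, here is a rough shell sketch. It is not Tree-sitter grammar syntax; it only picks out the `@`-prefixed keywords named in the PR description. The function name `bpp_keywords` and the sample input are made up for illustration.

```shell
# Print each Bash++ keyword token found on stdin, one per line.
# Stage 1 extracts every @identifier; stage 2 keeps only the known keywords.
bpp_keywords() {
    grep -oE '@[A-Za-z_][A-Za-z0-9_]*' |
        grep -xE '@(this|super|new|delete|nullptr|typeof)'
}

# Sample input: three keywords and one ordinary object reference (@obj).
printf '%s\n' '@new Foo' 'echo @this.name' '@delete @obj' | bpp_keywords
```

A token-level pass like this cannot tell a keyword from the same text inside a comment or a single-quoted string, which is the kind of context the inherited Bash grammar would supply.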

There is a tradeoff: inheriting tree-sitter-bash already produces about 350,000 lines of generated C before any Bash++ additions, and extending it currently increases that to about 675,000 lines. A regex-only grammar written from scratch could be smaller, but it would lose the existing Bash parsing and highlighting that Bash++ needs.

I think a revision built around two principles would make sense:

  1. Keep generated artifacts out of the reviewable patch.
  2. Reduce the maintained grammar to the smallest highlighting-focused extension of Bash that works reliably in Zed, without treating it as an authoritative parser for the language.

@rail5

rail5 commented Jun 13, 2026

Owner

> One caveat is that Zed compiles a grammar's generated src/parser.c directly into WebAssembly

Wow. Clearly I have a fair bit to learn about the Zed editor. Well, at any rate, anything required by the editor at runtime should be generated by the build procedure rather than bundled into the source tree; the principle is that only source code should exist in the source tree.

Let's just start with removing generated code and go from there. It looks like this might end up being a somewhat longer conversation re: tree-sitter and the cost/benefit of substantial added complexity. You and I can meet over coffee at some point to discuss.

@NatnaelTaddese

Author

Sounds good. I have removed the generated parser files and pushed the update.

I agree that the broader maintenance cost and intended scope deserve a separate discussion. I'd be happy to meet over coffee (tea, actually; I don't like coffee) and work through the tradeoffs before we decide how far the Tree-sitter integration should go.
