Skip to content

Split keywords into reserved and non-reserved sets#275

Open
git-hulk wants to merge 4 commits into
masterfrom
reserved-keyword-split
Open

Split keywords into reserved and non-reserved sets#275
git-hulk wants to merge 4 commits into
masterfrom
reserved-keyword-split

Conversation

@git-hulk

Copy link
Copy Markdown
Member

Summary

matchTokenKind previously coerced every keyword token to match TokenKindIdent, so any keyword could silently fill an identifier slot. A missing name before a clause keyword then swallowed the keyword instead of erroring:

Input Before After
SELECT a FROM WHERE b = 1 parsed as SELECT a AS FROM WHERE b = 1 error at WHERE: expected table name
SELECT a, b FROM GROUP BY a parsed as SELECT a, b AS FROM GROUP BY a error at GROUP: expected table name
INSERT INTO SELECT 1 SELECT eaten as table name, confusing error at 1 error at SELECT
SELECT a FROM t JOIN ON a = b ON eaten as table name, confusing error at = error at ON

This over-acceptance is the root of the recurring keyword-as-identifier bug class (#268, #269, #270).

Changes

  • Add a reservedKeywords set (statement starters, clause starters, expression operators) in keyword.go; everything else stays non-reserved.
  • Narrow the keyword→ident coercion in matchTokenKind to non-reserved keywords only.
  • Add parseIdentAnyKeyword for positions where context proves even a reserved keyword is a name, and use it at those sites:
    • after AS (select-item and table aliases — SELECT 2 AS from still works)
    • after a dot in qualified names (db.from, t.from)
    • the lookahead-disambiguated select items from Allow reserved keywords as bare column names in SELECT list #269 (SELECT a, limit FROM t, SELECT case;)
    • query parameters ({end:UInt32})
    • window names (OVER (order), WINDOW order AS (...))
    • GRANT options (WITH GRANT OPTION)
    • gated function names (select-item modifiers EXCEPT/APPLY/REPLACE, INSERT INTO FUNCTION)
  • Bare (no-AS) select-item aliases now only accept non-reserved keywords, fixing the silent mis-parses above.

Keywords that double as ClickHouse function or engine names (IF, LEFT, RIGHT, ANY, MIN, MAX, TRIM, SET, JOIN, ...) are intentionally non-reserved, so if(a, 1, 2), any(b), left(c, 1) and ENGINE = Join(...) keep parsing.

Testing

  • New keyword_test.go with regression tests for rejected mis-parses, non-reserved keywords as identifiers, and reserved keywords in disambiguated positions, plus a consistency check that reservedKeywords ⊆ keywords.
  • Full suite passes including -race -compatible (stateful corpus).

🤖 Generated with Claude Code

matchTokenKind previously coerced every keyword token to match
TokenKindIdent, so any keyword could silently fill an identifier slot.
A missing name before a clause keyword then swallowed the keyword
instead of erroring: `SELECT a FROM WHERE b = 1` parsed FROM as a bare
alias of `a`, `INSERT INTO SELECT 1` ate SELECT as a table name. This
over-acceptance is the root of the recurring keyword-as-identifier bug
class (#268, #269, #270).

Introduce reservedKeywords - statement starters, clause starters and
expression operators - and narrow the coercion in matchTokenKind to
non-reserved keywords only. Positions where context proves even a
reserved keyword is a name use the new parseIdentAnyKeyword:

- after AS (select-item and table aliases)
- after a dot in qualified names (`db.from`, `t.from`)
- lookahead-disambiguated select items from #269
- query parameters (`{end:UInt32}`)
- window names (`OVER (order)`, `WINDOW order AS (...)`)
- GRANT options (`WITH GRANT OPTION`)
- gated function names (select-item modifiers, INSERT INTO FUNCTION)

Non-reserved keywords (DATE, KEY, FIRST, IF, LEFT, ANY, ...) keep
working as identifiers everywhere, including keywords that double as
ClickHouse function or engine names.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a3c35ea6d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread parser/parser_query.go Outdated
git-hulk and others added 3 commits June 11, 2026 11:14
The WINDOW definition side accepted `WINDOW order AS (...)` but the
unparenthesized reference `OVER order` still went through
matchTokenKind(TokenKindIdent), which rejects reserved keywords. After
OVER a bare token can only be a window name - no clause keyword can
legally follow - so accept any keyword there, mirroring the definition
side.

Addresses Codex review on PR #275.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
TokenKind stays a string (its values are woven into the public AST and
error messages), so multi-kind membership checks get an internal bit
set instead: kindClass with classIdent/classKeyword bits and a
matchTokenKindIn(classIdent | classKeyword) helper, replacing the
repeated two-comparison ident-or-keyword checks. Unlike matchTokenKind,
matchTokenKindIn is a raw kind check with no keyword-to-identifier
coercion.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
matchTokenKind(TokenKindIdent, TokenKindKeyword) now expresses the
ident-or-any-keyword membership check directly, so the kindClass bit
set and matchTokenKindIn helper are no longer needed. Existing
single-kind call sites are unchanged.

Raw kind equality is checked for all kinds before the keyword-to-ident
coercion arm, so multi-kind calls that include TokenKindKeyword match
keyword tokens without the ToUpper/reserved-set lookup; allocs/op stay
at the master baseline (verified with BenchmarkParseSQLFiles).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
git-hulk

This comment was marked as off-topic.

@git-hulk git-hulk left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keyword reservation needs to preserve valid function-style operator calls.

Comment thread parser/keyword.go
// RIGHT, ANY, MIN, MAX, TRIM, SET, JOIN...) must stay non-reserved.
var reservedKeywords = NewSet(
KeywordAlter,
KeywordAnd,

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting AND in the reserved set makes matchTokenKind(TokenKindIdent) reject and(...) at expression start, so valid ClickHouse/function-style queries that this repo already keeps in parser/testdata/benchdata/posthog_huge_*.sql no longer parse. Repro on this head: go test -run '^$' -bench 'BenchmarkParseComplexQueries/testdata/benchdata/posthog_huge_0.sql$' -benchtime=1x ./parser fails at if(and(...)), while the base commit parses it. OR and IN have the same issue (SELECT or(a,b), SELECT in(x, tuple(1))). Please either keep operator function names out of reservedKeywords or add a keyword+( function path before the reserved-identifier rejection.

@git-hulk

Copy link
Copy Markdown
Member Author

The keyword-function and reserved-keyword alias regressions need to be addressed.


parser/keyword.go:275

Adding callable operator keywords to reservedKeywords blocks them from the normal function-call path because parseColumnExpr only enters parseIdentOrFunction when matchTokenKind(TokenKindIdent) is true. This head now rejects queries that base parsed, such as SELECT and(a, b) FROM t, SELECT or(a, b) FROM t, SELECT in(1, [1]), SELECT like(s, '%a%') FROM t, and SELECT ilike(s, '%a%') FROM t; the repo's bench SQL also contains functional and(...), or(...), and in(...) forms. Please keep reserved operator keywords callable when they are followed by (, or avoid marking these callable names reserved in identifier/function-call dispatch.


parser/parser_column.go:853

AS already disambiguates the next token as an alias here, but this path still calls parseIdent, so reserved keyword aliases inside expression lists now fail. Base parsed SELECT (1 AS from) and SELECT sum(x AS from) FROM t; this head errors at from.

		alias, err = p.parseAnyKeyword()

parser/parser_table.go:980

This ORDER BY ... AS <alias> path still rejects reserved keyword aliases after the global parseIdent contract change. Base parsed SELECT a FROM t ORDER BY x AS from, while this head errors at from; after the existing AS SELECT guard above, the token is an alias name.

		alias, err = p.parseAnyKeyword()

parser/parser_query.go:1203

Non-parenthesized CTE aliases after AS can no longer be reserved keywords because this still uses parseIdent. Base parsed WITH 1 AS from SELECT from, but this head errors at the alias; after AS, this should use the same explicit keyword-as-name helper as SELECT aliases.

	name, err := p.parseAnyKeyword()

@git-hulk git-hulk left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address the reserved-keyword alias regression in generic expression lists.

The following comments reference lines outside the diff:


parser/parser_column.go:853

This still uses parseIdent after AS for generic expression-list aliases, so reserved-keyword aliases now fail in contexts like SELECT tuple(1 AS from); even though the top-level SELECT 1 AS from path was updated. After AS the token is unambiguously an alias here too.

		alias, err = p.parseAnyKeyword()

@git-hulk git-hulk left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several keyword-as-name regressions need to be addressed.

The following comments reference lines outside the diff:


parser/parser_table.go:546

This AS ident branch still rejects reserved keywords even though AS has already disambiguated the next token as the source table or table-function name. On this PR, CREATE TABLE t AS from.table ENGINE=Memory now fails at from; the base commit parsed it, and the new helper comment explicitly lists after-AS positions as safe for reserved keyword names.

			ident, err := p.parseAnyKeyword()

parser/parser_column.go:1121

This hand-rolled dotted JSON path bypasses the new reserved-keyword handling for member names after a dot. As a result, head rejects valid paths that base accepted, e.g. CREATE TABLE t (x JSON(a.from String)) ENGINE=Memory and JSON(SKIP a.from), even though after-dot identifiers are one of the positions this PR intends to keep accepting.

		ident, err := p.parseAnyKeyword()

parser/parse_system.go:1304

Grant sources have the same missed after-dot case: GRANT SELECT ON db.from TO john now fails because parseIdentOrStar no longer accepts reserved keywords, while the token after db. can only be the table name or *. Please special-case keyword tokens here with parseAnyKeyword() while preserving the existing * support.

@git-hulk git-hulk left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address the GRANT source parser inconsistency for reserved keywords after dotted database names.

The following comments reference lines outside the diff:


parser/parse_system.go:1304

After the dot this is the same disambiguated position handled by tryParseDotIdentOrString, but this separate GRANT path still calls parseIdentOrStar, so reserved table names are rejected. For example, GRANT SELECT ON db.from TO john now fails with expected <ident> or *, but got "<keyword>", while SELECT * FROM db.from, DROP TABLE db.from, and CREATE TABLE db.from ... parse successfully. Keep the existing * support, but parse keyword tokens after the dot as identifiers here too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant