Split keywords into reserved and non-reserved sets#275
Conversation
matchTokenKind previously coerced every keyword token to match TokenKindIdent, so any keyword could silently fill an identifier slot. A missing name before a clause keyword then swallowed the keyword instead of erroring: `SELECT a FROM WHERE b = 1` parsed FROM as a bare alias of `a`, `INSERT INTO SELECT 1` ate SELECT as a table name. This over-acceptance is the root of the recurring keyword-as-identifier bug class (#268, #269, #270). Introduce reservedKeywords - statement starters, clause starters and expression operators - and narrow the coercion in matchTokenKind to non-reserved keywords only. Positions where context proves even a reserved keyword is a name use the new parseIdentAnyKeyword: - after AS (select-item and table aliases) - after a dot in qualified names (`db.from`, `t.from`) - lookahead-disambiguated select items from #269 - query parameters (`{end:UInt32}`) - window names (`OVER (order)`, `WINDOW order AS (...)`) - GRANT options (`WITH GRANT OPTION`) - gated function names (select-item modifiers, INSERT INTO FUNCTION) Non-reserved keywords (DATE, KEY, FIRST, IF, LEFT, ANY, ...) keep working as identifiers everywhere, including keywords that double as ClickHouse function or engine names. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a3c35ea6d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The WINDOW definition side accepted `WINDOW order AS (...)` but the unparenthesized reference `OVER order` still went through matchTokenKind(TokenKindIdent), which rejects reserved keywords. After OVER a bare token can only be a window name - no clause keyword can legally follow - so accept any keyword there, mirroring the definition side. Addresses Codex review on PR #275. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
TokenKind stays a string (its values are woven into the public AST and error messages), so multi-kind membership checks get an internal bit set instead: kindClass with classIdent/classKeyword bits and a matchTokenKindIn(classIdent | classKeyword) helper, replacing the repeated two-comparison ident-or-keyword checks. Unlike matchTokenKind, matchTokenKindIn is a raw kind check with no keyword-to-identifier coercion. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
matchTokenKind(TokenKindIdent, TokenKindKeyword) now expresses the ident-or-any-keyword membership check directly, so the kindClass bit set and matchTokenKindIn helper are no longer needed. Existing single-kind call sites are unchanged. Raw kind equality is checked for all kinds before the keyword-to-ident coercion arm, so multi-kind calls that include TokenKindKeyword match keyword tokens without the ToUpper/reserved-set lookup; allocs/op stay at the master baseline (verified with BenchmarkParseSQLFiles). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
git-hulk
left a comment
There was a problem hiding this comment.
Keyword reservation needs to preserve valid function-style operator calls.
| // RIGHT, ANY, MIN, MAX, TRIM, SET, JOIN...) must stay non-reserved. | ||
| var reservedKeywords = NewSet( | ||
| KeywordAlter, | ||
| KeywordAnd, |
There was a problem hiding this comment.
Putting AND in the reserved set makes matchTokenKind(TokenKindIdent) reject and(...) at expression start, so valid ClickHouse/function-style queries that this repo already keeps in parser/testdata/benchdata/posthog_huge_*.sql no longer parse. Repro on this head: go test -run '^$' -bench 'BenchmarkParseComplexQueries/testdata/benchdata/posthog_huge_0.sql$' -benchtime=1x ./parser fails at if(and(...)), while the base commit parses it. OR and IN have the same issue (SELECT or(a,b), SELECT in(x, tuple(1))). Please either keep operator function names out of reservedKeywords or add a keyword+( function path before the reserved-identifier rejection.
|
The keyword-function and reserved-keyword alias regressions need to be addressed.
Adding callable operator keywords to
This
Non-parenthesized CTE aliases after |
git-hulk
left a comment
There was a problem hiding this comment.
Address the reserved-keyword alias regression in generic expression lists.
The following comments reference lines outside the diff:
parser/parser_column.go:853
This still uses parseIdent after AS for generic expression-list aliases, so reserved-keyword aliases now fail in contexts like SELECT tuple(1 AS from); even though the top-level SELECT 1 AS from path was updated. After AS the token is unambiguously an alias here too.
alias, err = p.parseAnyKeyword()
git-hulk
left a comment
There was a problem hiding this comment.
Several keyword-as-name regressions need to be addressed.
The following comments reference lines outside the diff:
parser/parser_table.go:546
This AS ident branch still rejects reserved keywords even though AS has already disambiguated the next token as the source table or table-function name. On this PR, CREATE TABLE t AS from.table ENGINE=Memory now fails at from; the base commit parsed it, and the new helper comment explicitly lists after-AS positions as safe for reserved keyword names.
ident, err := p.parseAnyKeyword()
parser/parser_column.go:1121
This hand-rolled dotted JSON path bypasses the new reserved-keyword handling for member names after a dot. As a result, head rejects valid paths that base accepted, e.g. CREATE TABLE t (x JSON(a.from String)) ENGINE=Memory and JSON(SKIP a.from), even though after-dot identifiers are one of the positions this PR intends to keep accepting.
ident, err := p.parseAnyKeyword()
parser/parse_system.go:1304
Grant sources have the same missed after-dot case: GRANT SELECT ON db.from TO john now fails because parseIdentOrStar no longer accepts reserved keywords, while the token after db. can only be the table name or *. Please special-case keyword tokens here with parseAnyKeyword() while preserving the existing * support.
git-hulk
left a comment
There was a problem hiding this comment.
Address the GRANT source parser inconsistency for reserved keywords after dotted database names.
The following comments reference lines outside the diff:
parser/parse_system.go:1304
After the dot this is the same disambiguated position handled by tryParseDotIdentOrString, but this separate GRANT path still calls parseIdentOrStar, so reserved table names are rejected. For example, GRANT SELECT ON db.from TO john now fails with expected <ident> or *, but got "<keyword>", while SELECT * FROM db.from, DROP TABLE db.from, and CREATE TABLE db.from ... parse successfully. Keep the existing * support, but parse keyword tokens after the dot as identifiers here too.
Summary
matchTokenKindpreviously coerced every keyword token to matchTokenKindIdent, so any keyword could silently fill an identifier slot. A missing name before a clause keyword then swallowed the keyword instead of erroring:SELECT a FROM WHERE b = 1SELECT a AS FROM WHERE b = 1WHERE: expected table nameSELECT a, b FROM GROUP BY aSELECT a, b AS FROM GROUP BY aGROUP: expected table nameINSERT INTO SELECT 1SELECTeaten as table name, confusing error at1SELECTSELECT a FROM t JOIN ON a = bONeaten as table name, confusing error at=ONThis over-acceptance is the root of the recurring keyword-as-identifier bug class (#268, #269, #270).
Changes
reservedKeywordsset (statement starters, clause starters, expression operators) inkeyword.go; everything else stays non-reserved.matchTokenKindto non-reserved keywords only.parseIdentAnyKeywordfor positions where context proves even a reserved keyword is a name, and use it at those sites:AS(select-item and table aliases —SELECT 2 AS fromstill works)db.from,t.from)SELECT a, limit FROM t,SELECT case;){end:UInt32})OVER (order),WINDOW order AS (...))WITH GRANT OPTION)EXCEPT/APPLY/REPLACE,INSERT INTO FUNCTION)Keywords that double as ClickHouse function or engine names (
IF,LEFT,RIGHT,ANY,MIN,MAX,TRIM,SET,JOIN, ...) are intentionally non-reserved, soif(a, 1, 2),any(b),left(c, 1)andENGINE = Join(...)keep parsing.Testing
keyword_test.gowith regression tests for rejected mis-parses, non-reserved keywords as identifiers, and reserved keywords in disambiguated positions, plus a consistency check thatreservedKeywords ⊆ keywords.-race -compatible(stateful corpus).🤖 Generated with Claude Code