Optimize MySQL lexer (2× speedup) by JanJakes · Pull Request #424 · WordPress/sqlite-database-integration

JanJakes · 2026-06-06T08:38:38Z

Summary

A standalone, lexer-only subset of the parser-performance work (#378 / #373 / #375), based directly on trunk and independent of those branches. It's worth landing separately because it's a clean, low-risk win that carries no per-process startup cost.

Lexer-only throughput on the 69,577-query MySQL server corpus (MacBook Pro M4, PHP 8.5), measured with the run-lexer-benchmark.php added in this PR (best pass after warmup), with trunk and this branch run back-to-back:

| Config | Trunk | This branch | Speedup |
| --- | --- | --- | --- |
| Lex-only, no JIT | 105,181 QPS | 183,093 QPS | 1.74× |
| Lex-only, warm tracing JIT | 184,241 QPS | 403,986 QPS | 2.19× |

Lexer benchmark CI job

I added a new "Lexer benchmark" CI job that triggers on any change to the lexer-related files, runs the base and PR benchmarks on the same runner, posts the results as a comment, and keeps that comment updated (see below).

The speedups here on the CI seem to be more modest than locally, but noisy CI workers may skew the results.

Optimizations

Approximate cumulative contribution to the lexer throughput improvements:

- Inline leading-whitespace skipping (~+17% no-JIT / +36% JIT). One strspn() at the top of the token loop skips whitespace runs instead of a read_next_token() round-trip per run.
- Catch identifiers/keywords at the top of the dispatch chain (~+19% / +20%). The most common token class is matched first.
- Single-byte operator map (~+8% / +10%). A static byte → token id map for ( ) , ; + ~ % ^ ? { } =.
- Inline the keyword lookup on the hot identifier path (~+3% / +6%). This avoids two method calls per identifier/keyword token; the keyword-only post-processing is reached only for actual keywords.
- Skip the parent constructor in WP_MySQL_Token (~+5%). The token subclass assigns its fields directly.
- Structural wins from #375 ([codex] Speed up MySQL lexing and parsing). Cache strlen($sql) once, use strpos()-based comment-end and quote scans, and inline next_token() + get_token() into remaining_tokens().

Two changes from earlier drafts were measured as neutral-to-negative once the dispatch chain was reordered, and were dropped: the strspn()→byte-comparison swaps (slightly slower under warm JIT, and they add code) and the whitespace unroll for `--` comments (~0%).

Relationship to #378 / #373 / #375

These are lexer-side optimizations from the consolidated #378 branch, isolated onto trunk. They're independent of the parser/grammar preprocessing and add no startup cost, so they can land first. The parser-side work (which does carry a per-process grammar-build cost) can be evaluated separately.

github-actions · 2026-06-06T11:16:35Z

🤖 Lexer benchmark

Changes to lexer-related files were detected and triggered a benchmark:

| Config | Base (QPS) | This PR (QPS) | Speedup |
| --- | --- | --- | --- |
| no JIT | 45,182 | 71,930 | 1.59× |
| tracing JIT | 118,464 | 161,410 | 1.36× |

Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally.

To reproduce locally:

cd packages/mysql-on-sqlite && composer run bench-lexer

run-lexer-benchmark.php timed a single pass, which is too noisy to compare a change against. Rework it into a reliable throughput benchmark that the lexer optimisations in this branch can be measured against:

- Load through src/load.php (parity with run-parser-benchmark.php) so a loaded native extension is benchmarked via the same public WP_MySQL_Lexer wrapper.
- Warm up with discarded passes (heating opcache, the tracing JIT, and CPU caches), then run N timed passes over the whole corpus.
- Headline the best pass: lexing is deterministic and CPU-bound, so outside interference can only slow a pass down, making the fastest pass the most reproducible estimate of intrinsic cost and the most stable basis for a before/after comparison. Median and best-vs-worst spread are reported too so a noisy machine is obvious.
- Detect and report the active config (opcache / tracing JIT) and the implementation (php / native-extension), and warn when opcache.jit is set but the JIT did not actually activate.
- Add --iterations / --warmup; keep --json (headline kept as "qps").

Add a `bench-lexer` script to the mysql-on-sqlite package's composer.json that runs the benchmark twice, without and with the tracing JIT, so both configurations are measured with one `composer run bench-lexer` (JIT is a start-up setting that cannot be toggled mid-process).
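The warmup, timed-pass, and best-pass reporting described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in run-lexer-benchmark.php: the function names `run_passes()` and `summarize_passes()` are hypothetical, and expressing the spread as a best/worst ratio is an assumption about how that figure might be reported.

```php
<?php
/**
 * Run $warmup discarded passes, then $iterations timed passes.
 * Returns the throughput (queries per second) of each timed pass.
 * Hypothetical helper; $pass is any callable that lexes the whole corpus once.
 */
function run_passes( callable $pass, int $query_count, int $warmup, int $iterations ): array {
	for ( $i = 0; $i < $warmup; $i++ ) {
		$pass(); // Discarded: heats opcache, the JIT, and CPU caches.
	}

	$qps = array();
	for ( $i = 0; $i < $iterations; $i++ ) {
		$start = hrtime( true ); // Monotonic clock, in nanoseconds.
		$pass();
		$elapsed_ns = max( 1, hrtime( true ) - $start );
		$qps[]      = $query_count / ( $elapsed_ns / 1e9 );
	}
	return $qps;
}

/**
 * Summarize per-pass throughput: best pass as the headline "qps",
 * plus median and best/worst spread (assumed here to be a ratio).
 */
function summarize_passes( array $qps ): array {
	sort( $qps );
	$n      = count( $qps );
	$mid    = intdiv( $n, 2 );
	$median = ( 1 === $n % 2 ) ? $qps[ $mid ] : ( $qps[ $mid - 1 ] + $qps[ $mid ] ) / 2;

	return array(
		'qps'    => $qps[ $n - 1 ],          // Fastest pass.
		'median' => $median,
		'spread' => $qps[ $n - 1 ] / $qps[0], // 1.0 means no variation between passes.
	);
}
```

A spread close to 1.0 suggests a quiet machine; a large spread is the signal that the numbers should be re-measured.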

On pull requests that touch the lexer (or the benchmark tool), run run-lexer-benchmark.php for both the base commit and the PR head on the same runner, without and with the tracing JIT, and post the before/after numbers as a single comment that updates in place on every push. The job is informational, not gating: hosted CI runners are too noisy for absolute-throughput thresholds. Measuring base and head back-to-back on the same runner cancels the runner's absolute speed, so the same-runner speedup ratio is the meaningful signal. Only the source tree is swapped to the base commit; the PR's benchmark tool is reused for both sides so they are timed identically.
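To see why the same-runner ratio is the useful signal, here is a small sketch that recomputes the speedups from the QPS figures in the bot comment above. The `speedup()` helper is hypothetical and is not part of the CI job.

```php
<?php
/**
 * Format head/base throughput as a speedup ratio with two decimals.
 * Hypothetical helper for illustration only.
 */
function speedup( float $base_qps, float $head_qps ): string {
	return sprintf( '%.2f', $head_qps / $base_qps );
}

// Figures taken from the CI comment: base vs. this PR.
$no_jit = speedup( 45182, 71930 );   // no JIT
$jit    = speedup( 118464, 161410 ); // tracing JIT

// A runner that is uniformly half as fast scales both sides equally,
// so the ratio is unchanged even though the absolute QPS halves.
$slow_runner = speedup( 45182 * 0.5, 71930 * 0.5 );
```

Here `$no_jit` and `$slow_runner` come out identical, which is the cancellation the commit message relies on.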

Apply the structural lexer optimisations from PR #375:

- Cache strlen($sql) once in $sql_length instead of recomputing it on each EOF/bounds check.
- Use strpos($sql, '*/', $pos) instead of a manual scan loop in read_comment_content().
- In read_quoted_text(), use strpos() to find the next quote, dropping the separate end-of-input check that followed the strcspn() scan.
- Inline next_token() + get_token() in remaining_tokens() so the hot loop builds tokens directly.

The #375 strspn()->byte-comparison swaps are intentionally not included: once the dispatch chain is reordered by later commits those checks are off the hot path and strspn() is marginally faster than the inline comparisons, so the swaps were net-neutral-to-negative while adding code.

Co-authored-by: Adam Zieliński <adam@adamziel.com>
Adapted from #375
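The strpos()-based comment scan can be illustrated with a short sketch. The function name `find_comment_end()` is hypothetical, and treating an unterminated comment as running to end-of-input is an assumption made for this example, not necessarily what read_comment_content() does.

```php
<?php
/**
 * Return the offset just past the closing comment delimiter,
 * searching from $pos. One C-side strpos() call replaces a
 * byte-by-byte PHP loop.
 */
function find_comment_end( string $sql, int $pos ): int {
	$end = strpos( $sql, '*/', $pos );
	if ( false === $end ) {
		// Assumption: an unterminated comment extends to the end of input.
		return strlen( $sql );
	}
	return $end + 2; // Skip past the two-byte delimiter.
}
```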

Both next_token() and remaining_tokens() previously paid a read_next_token() function call per whitespace run only to recognise and skip the resulting WHITESPACE token. A single unguarded strspn() at the top of each loop iteration absorbs the run inline, saving the call overhead for ~one whitespace run per real token across millions of tokens. The strspn() call is unguarded because an unconditional strspn() (which returns 0 in a single C-side call when nothing matches) is faster than gating it on a five-arm '$byte === ...' precheck.
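A minimal sketch of the unguarded strspn() skip, assuming a five-byte whitespace set (space, tab, LF, CR, form feed). The constant and function names are hypothetical; the actual lexer inlines the call at the top of its loops rather than wrapping it in a function.

```php
<?php
// Assumed whitespace set; the lexer's real mask may differ.
const SKETCH_WHITESPACE_MASK = " \t\n\r\f";

/**
 * Return the offset of the first non-whitespace byte at or after $pos.
 * strspn() returns 0 when the byte at $pos is not whitespace, so no
 * precheck on the current byte is needed.
 */
function skip_whitespace( string $sql, int $pos ): int {
	return $pos + strspn( $sql, SKETCH_WHITESPACE_MASK, $pos );
}

$sql = "SELECT  \n\t id FROM t";
$pos = skip_whitespace( $sql, 6 ); // Start right after "SELECT".
$next_word = substr( $sql, $pos, 2 );
```

The whole run of mixed whitespace after `SELECT` is consumed in one call, which is the round-trip the commit removes.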

ASCII letters and UTF-8 multibyte start bytes account for most token-start bytes on the MySQL corpus. They previously fell into the catch-all `else` at the bottom of read_next_token() after walking every operator arm in between. The new branch sits at the top of the elseif chain and dispatches them directly. The `next_byte !== "'"` guard keeps the x'..', n'..' and similar specials on their dedicated branches. `_` and `$` starters stay on the catch-all so the UNDERSCORE_CHARSET lookup still fires.
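The condition for the new top-of-chain branch might look roughly like the sketch below. The function name is hypothetical and the exact expression in read_next_token() may differ; only the `$byte > "\x7F"` and `next_byte !== "'"` checks are taken from the commit messages.

```php
<?php
/**
 * Decide whether a token-start byte can take the identifier/keyword
 * fast path. Hypothetical helper; the lexer inlines this condition.
 */
function takes_identifier_fast_path( string $byte, ?string $next_byte ): bool {
	$is_ascii_letter = ( $byte >= 'a' && $byte <= 'z' ) || ( $byte >= 'A' && $byte <= 'Z' );
	$is_multibyte    = $byte > "\x7F"; // UTF-8 multi-byte starter.

	// A following single quote may start a special literal (x'..', n'..'),
	// so those cases are left to their dedicated branches.
	return ( $is_ascii_letter || $is_multibyte ) && "'" !== $next_byte;
}
```

Note that `_` and `$` fail the letter test, so they still reach the catch-all branch as described above.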

The ASCII bytes `(`, `)`, `,`, `;`, `+`, `~`, `%`, `^`, `?`, `{`, `}`, and `=` each map to a unique single-byte token type with no lookahead. A static array + isset() arm dispatches them in one lookup, ahead of the per-byte elseif chain, and the now-shadowed individual arms further down the chain are removed so the table is the single source of truth for these tokens. `*` and `|` are deliberately excluded because their token type depends on context (in_mysql_comment for `*/`, SQL_MODE_PIPES_AS_CONCAT for `||`).
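A sketch of the table-plus-isset() dispatch. The token ids below are made up for illustration and the function name is hypothetical; the lexer uses its own token constants.

```php
<?php
/**
 * Look up the token id for a single-byte operator at $pos,
 * or return null so the caller can fall through to the elseif chain.
 */
function single_byte_token_id( string $sql, int $pos ): ?int {
	// Hypothetical ids. '*' and '|' are intentionally absent because
	// their token type depends on context.
	static $map = array(
		'(' => 1,
		')' => 2,
		',' => 3,
		';' => 4,
		'+' => 5,
		'~' => 6,
		'%' => 7,
		'^' => 8,
		'?' => 9,
		'{' => 10,
		'}' => 11,
		'=' => 12,
	);

	$byte = $sql[ $pos ] ?? null; // null at end of input.
	if ( null !== $byte && isset( $map[ $byte ] ) ) {
		return $map[ $byte ];
	}
	return null;
}
```

Keeping the excluded bytes out of the table means a single hash lookup never produces a wrong token for context-dependent operators.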

Three review-noted spots that were terse in the code:

- The remaining_tokens() loop guard now spells out why both EOF and `null === token_type && bytes_already_read > 0` are needed (EOF on clean end-of-input vs invalid byte mid-stream, with the `> 0` guard letting the very first iteration through).
- The identifier/keyword fast path now explains `$byte > "\x7F"` (UTF-8 multi-byte starter; MySQL identifiers allow U+0080-U+FFFF) and `next_byte !== "'"` (only single quotes form the special hex/bin/n-char literal starters; `"` never does, regardless of SQL mode).

No behavior change.

Token construction is on the lexer hot path; bypassing the `WP_Parser_Token::__construct()` indirection and assigning the four properties directly removes one method call per token. Requires `$input` on `WP_Parser_Token` to be `protected` instead of `private` so the subclass can write to it.

Co-authored-by: Adam Zieliński <adam@adamziel.com>
Adapted from #375
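The pattern can be sketched with two stand-in classes. The class names, property names, and the `get_bytes()` method below are hypothetical and chosen for illustration; they are not the definitions of `WP_Parser_Token` or `WP_MySQL_Token`.

```php
<?php
// Stand-in for the base token class (hypothetical shape).
class Sketch_Parser_Token {
	public $id;
	public $start;
	public $length;
	protected $input; // Must be protected, not private, for the subclass to write it.

	public function __construct( int $id, int $start, int $length, string $input ) {
		$this->id     = $id;
		$this->start  = $start;
		$this->length = $length;
		$this->input  = $input;
	}

	public function get_bytes(): string {
		return substr( $this->input, $this->start, $this->length );
	}
}

// Stand-in for the lexer's token subclass.
class Sketch_MySQL_Token extends Sketch_Parser_Token {
	public function __construct( int $id, int $start, int $length, string $input ) {
		// No parent::__construct() call: assign the four fields directly,
		// saving one method call per constructed token.
		$this->id     = $id;
		$this->start  = $start;
		$this->length = $length;
		$this->input  = $input;
	}
}

$token = new Sketch_MySQL_Token( 1, 7, 2, 'SELECT id' );
```

The trade-off is that the subclass now duplicates the parent's assignments, so the two constructors have to be kept in sync by hand.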

The identifier/keyword branch handles the single largest share of tokens (~17% identifiers plus all keywords, ~35-45% of the corpus). It called two methods per token: get_current_token_bytes() to extract the token string and determine_identifier_or_keyword_type() to classify it.

Inline that fast path into read_next_token(): extract the bytes and do the strtoupper + TOKENS lookup directly, returning the identifier without any call when it is not a keyword (the common case). The post-lookup keyword logic (version gating, function-call lookahead, high-NOT precedence, synonyms) moves to a new resolve_keyword_type() that is reached only for actual keywords; determine_identifier_or_keyword_type() now delegates to it for its other caller.

Lex-only throughput on the MySQL server corpus: +5-6% under tracing JIT, +3.4% without (best-of-seven, ABAB-confirmed). The keyword path's measured ceiling was ~10% (JIT), most of which is the irreducible substr/strtoupper/hash-lookup work that remains.
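The shape of the inlined lookup, as a rough sketch. The token ids, the three-entry keyword table, and both function names are placeholders for illustration; in particular, `sketch_resolve_keyword_type()` only stands in for the keyword post-processing and does none of the real work.

```php
<?php
const SKETCH_IDENTIFIER = 1; // Hypothetical token id.

/**
 * Placeholder for the keyword-only post-processing (version gating,
 * function-call lookahead, synonyms). Here it just returns the id.
 */
function sketch_resolve_keyword_type( int $keyword_id ): int {
	return $keyword_id;
}

/**
 * Classify the word at [$start, $start + $length) as identifier or keyword.
 */
function classify_word( string $sql, int $start, int $length ): int {
	// Tiny illustrative table; the lexer's TOKENS table is far larger.
	static $keywords = array(
		'SELECT' => 100,
		'FROM'   => 101,
		'WHERE'  => 102,
	);

	$upper = strtoupper( substr( $sql, $start, $length ) );
	if ( ! isset( $keywords[ $upper ] ) ) {
		// Common case: plain identifier, returned without any further call.
		return SKETCH_IDENTIFIER;
	}
	// Only actual keywords pay for the extra resolution step.
	return sketch_resolve_keyword_type( $keywords[ $upper ] );
}
```

The substr(), strtoupper(), and hash lookup remain on every word, which matches the note above that most of the keyword path's cost is irreducible.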

JanJakes changed the title ~~Lexer performance: ~1.8× faster MySQL lexer (no startup cost)~~ Optimize MySQL lexer (1.8× speedup) Jun 6, 2026

JanJakes force-pushed the lexer-performance branch 2 times, most recently from d4a5156 to 8e72371 Compare June 6, 2026 11:15

JanJakes force-pushed the lexer-performance branch 2 times, most recently from 49b362e to e1ed138 Compare June 6, 2026 11:31

JanJakes changed the title ~~Optimize MySQL lexer (1.8× speedup)~~ Optimize MySQL lexer (2× speedup) Jun 6, 2026

JanJakes force-pushed the lexer-performance branch 2 times, most recently from c8b7218 to 9929691 Compare June 6, 2026 11:54

JanJakes force-pushed the lexer-performance branch 2 times, most recently from ae31e13 to 3845f79 Compare June 6, 2026 12:03

JanJakes and others added 8 commits June 6, 2026 14:12

JanJakes force-pushed the lexer-performance branch from 3845f79 to 12a78da Compare June 6, 2026 12:12

JanJakes requested a review from adamziel June 6, 2026 12:15

JanJakes marked this pull request as ready for review June 6, 2026 12:15

JanJakes wants to merge 9 commits into trunk from lexer-performance