Optimize MySQL lexer (2× speedup) #424

Open: JanJakes wants to merge 9 commits into trunk from lexer-performance

JanJakes commented Jun 6, 2026

Summary

A standalone, lexer-only subset of the parser-performance work (#378 / #373 / #375), based directly on trunk. It is worth landing separately because it is a clean, low-risk win that carries no per-process startup cost.

Lexer-only throughput on the 69,577-query MySQL server corpus (MacBook Pro M4, PHP 8.5), measured with the run-lexer-benchmark.php script added in this PR (best pass after warmup), trunk vs. this branch back-to-back:

| Config | Trunk | This branch | Speedup |
|---|---|---|---|
| Lex-only, no JIT | 105,181 QPS | 183,093 QPS | 1.74× |
| Lex-only, warm tracing JIT | 184,241 QPS | 403,986 QPS | 2.19× |

Lexer benchmark CI job

I added a new "Lexer benchmark" CI job that triggers on any change to the lexer-related files, runs the base vs. PR benchmarks on the same CI runner, posts a comment with the results, and keeps that comment updated on every push (see below).

The speedups measured on CI are more modest than the local ones, but CI workers are noisy and may skew the results.

Optimizations

Approximate cumulative contribution to the lexer throughput improvements:

  • Inline leading-whitespace skipping (~+17% no-JIT / +36% JIT). One strspn() at the top of the token loop skips whitespace runs instead of a read_next_token() round-trip per run.
  • Catch identifiers/keywords at the top of the dispatch chain (~+19% / +20%). The most common token class is matched first.
  • Single-byte operator map (~+8% / +10%). A static byte → token id map for ( ) , ; + ~ % ^ ? { } =.
  • Inline the keyword lookup on the hot identifier path (~+3% / +6%). This avoids two method calls per identifier/keyword token; the keyword-only post-processing is reached only for actual keywords.
  • Skip the parent constructor in WP_MySQL_Token (~+5%). The token subclass assigns its fields directly.
  • Structural wins from #375 ([codex] Speed up MySQL lexing and parsing). Cache strlen($sql) once, use strpos()-based comment-end and quote scans, and inline next_token() + get_token() in remaining_tokens().

Two changes from earlier drafts were measured as neutral-to-negative once the dispatch chain was reordered, and were dropped: the strspn() → byte-comparison swaps (slightly slower under warm JIT, and they add code) and the `--`-comment whitespace unroll (~0%).

Relationship to #378 / #373 / #375

These are lexer-side optimizations from the consolidated #378 branch, isolated onto trunk. They're independent of the parser/grammar preprocessing and add no startup cost, so they can land first. The parser-side work (which does carry a per-process grammar-build cost) can be evaluated separately.

JanJakes changed the title from "Lexer performance: ~1.8× faster MySQL lexer (no startup cost)" to "Optimize MySQL lexer (1.8× speedup)" on Jun 6, 2026
JanJakes force-pushed the lexer-performance branch 2 times, most recently from d4a5156 to 8e72371, on June 6, 2026 11:15
github-actions Bot commented Jun 6, 2026

🤖 Lexer benchmark

Changes to lexer-related files were detected and triggered a benchmark:

| Config | Base (QPS) | This PR (QPS) | Speedup |
|---|---|---|---|
| no JIT | 45,182 | 71,930 | 1.59× |
| tracing JIT | 118,464 | 161,410 | 1.36× |

Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally.

To reproduce locally:

cd packages/mysql-on-sqlite && composer run bench-lexer

JanJakes force-pushed the lexer-performance branch 2 times, most recently from 49b362e to e1ed138, on June 6, 2026 11:31
JanJakes changed the title from "Optimize MySQL lexer (1.8× speedup)" to "Optimize MySQL lexer (2× speedup)" on Jun 6, 2026
JanJakes force-pushed the lexer-performance branch 2 times, most recently from c8b7218 to 9929691, on June 6, 2026 11:54

run-lexer-benchmark.php timed a single pass, which is too noisy to compare a
change against. Rework it into a reliable throughput benchmark that the lexer
optimisations in this branch can be measured against:

- Load through src/load.php (parity with run-parser-benchmark.php) so a loaded
  native extension is benchmarked via the same public WP_MySQL_Lexer wrapper.
- Warm up with discarded passes (heating opcache, the tracing JIT, and CPU
  caches), then run N timed passes over the whole corpus.
- Headline the best pass: lexing is deterministic and CPU-bound, so outside
  interference can only slow a pass down, making the fastest pass the most
  reproducible estimate of intrinsic cost and the most stable basis for a
  before/after comparison. Median and best-vs-worst spread are reported too so
  a noisy machine is obvious.
- Detect and report the active config (opcache / tracing JIT) and the
  implementation (php / native-extension), and warn when opcache.jit is set but
  the JIT did not actually activate.
- Add --iterations / --warmup; keep --json (headline kept as "qps").

Add a `bench-lexer` script to the mysql-on-sqlite package's composer.json that
runs the benchmark twice — without and with the tracing JIT — so both
configurations are measured with one `composer run bench-lexer` (JIT is a
start-up setting that cannot be toggled mid-process).
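
For illustration, the warmup / timed-passes / best-pass scheme described in this commit message could be sketched roughly as follows. This is not the code in run-lexer-benchmark.php: the function name `bench_best_of()`, its parameters, and the result keys other than "qps" are hypothetical, and the workload is a stand-in closure rather than the lexer.

```php
<?php
// Hypothetical helper illustrating best-of-N timing; not taken from the PR.
function bench_best_of( callable $pass, int $query_count, int $warmup = 2, int $iterations = 5 ): array {
	// Warmup passes are run and discarded so opcache, the JIT, and CPU caches are hot.
	for ( $i = 0; $i < $warmup; $i++ ) {
		$pass();
	}

	// Time each full pass with the monotonic nanosecond clock.
	$seconds = array();
	for ( $i = 0; $i < $iterations; $i++ ) {
		$start = hrtime( true );
		$pass();
		// Clamp to 1 ns so a degenerate pass cannot cause a division by zero.
		$seconds[] = max( ( hrtime( true ) - $start ) / 1e9, 1e-9 );
	}

	sort( $seconds );
	$best   = $seconds[0];
	$worst  = $seconds[ count( $seconds ) - 1 ];
	$median = $seconds[ intdiv( count( $seconds ), 2 ) ]; // Upper median for even counts.

	return array(
		'qps'        => $query_count / $best,   // Headline: fastest pass.
		'median_qps' => $query_count / $median,
		'spread'     => $worst / $best,         // 1.0 means perfectly stable passes.
	);
}

// Usage with a toy workload standing in for "lex the whole corpus".
$queries = array_fill( 0, 1000, 'SELECT 1' );
$result  = bench_best_of(
	function () use ( $queries ) {
		foreach ( $queries as $sql ) {
			strtoupper( $sql );
		}
	},
	count( $queries )
);
printf( "best: %.0f QPS, median: %.0f QPS, spread: %.2fx\n", $result['qps'], $result['median_qps'], $result['spread'] );
```

Because interference can only add time to a deterministic pass, the fastest pass is never slower than the median, so a large gap between the two is a quick signal that the machine was noisy.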
JanJakes force-pushed the lexer-performance branch 2 times, most recently from ae31e13 to 3845f79, on June 6, 2026 12:03
JanJakes and others added 8 commits on June 6, 2026 14:12:

On pull requests that touch the lexer (or the benchmark tool), run
run-lexer-benchmark.php for both the base commit and the PR head on the same
runner, without and with the tracing JIT, and post the before/after numbers as
a single comment that updates in place on every push.

The job is informational, not gating: hosted CI runners are too noisy for
absolute-throughput thresholds. Measuring base and head back-to-back on the
same runner cancels the runner's absolute speed, so the same-runner speedup
ratio is the meaningful signal. Only the source tree is swapped to the base
commit; the PR's benchmark tool is reused for both sides so they are timed
identically.
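
As a small worked example of why the same-runner ratio is the useful signal, the speedup column can be recomputed from the QPS values in the benchmark comment above. The helper name `speedup_ratio()` is hypothetical and only illustrates the arithmetic.

```php
<?php
// Hypothetical helper: speedup of the PR head relative to the base commit.
function speedup_ratio( float $base_qps, float $head_qps ): float {
	return $head_qps / $base_qps;
}

// Values from the CI benchmark comment.
printf( "no JIT:      %.2f×\n", speedup_ratio( 45182, 71930 ) );   // 1.59×
printf( "tracing JIT: %.2f×\n", speedup_ratio( 118464, 161410 ) ); // 1.36×

// A runner that is uniformly half as fast halves both measurements,
// so the ratio stays the same even though the absolute QPS changes.
printf( "slow runner: %.2f×\n", speedup_ratio( 45182 / 2, 71930 / 2 ) ); // 1.59×
```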
Apply the structural lexer optimisations from PR #375:

- Cache strlen($sql) once in $sql_length instead of recomputing it on each
  EOF/bounds check.
- Use strpos($sql, '*/', $pos) instead of a manual scan loop in
  read_comment_content().
- In read_quoted_text(), use strpos() to find the next quote, dropping the
  separate end-of-input check that followed the strcspn() scan.
- Inline next_token() + get_token() in remaining_tokens() so the hot loop
  builds tokens directly.

The #375 strspn()->byte-comparison swaps are intentionally not included: once
the dispatch chain is reordered by later commits those checks are off the hot
path and strspn() is marginally faster than the inline comparisons, so the
swaps were net-neutral-to-negative while adding code.

Co-authored-by: Adam Zieliński <adam@adamziel.com>
Adapted from #375
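
A minimal sketch of the strpos()-based comment-end scan, assuming a simplified stand-in for read_comment_content(). The function name `find_comment_end()` and the handling of an unterminated comment (run to end of input) are assumptions made for this example, not the lexer's actual behavior.

```php
<?php
// Hypothetical stand-in for the scan in read_comment_content().
// $pos must point just past the opening "/*" and must not exceed strlen($sql).
function find_comment_end( string $sql, int $pos ): int {
	// One C-level search replaces a byte-by-byte PHP loop.
	$end = strpos( $sql, '*/', $pos );

	// Assumption for this sketch: an unterminated comment runs to end of input.
	if ( false === $end ) {
		return strlen( $sql );
	}

	// Return the offset just past the closing "*/".
	return $end + 2;
}

$sql = 'SELECT /* hi */ 1';
echo find_comment_end( $sql, 9 ), "\n"; // 15
```

Starting the search after the opener matters: in an input such as "/*/", a search that began at the "/*" itself would wrongly treat the opener's "*" and the following "/" as the terminator.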
Both next_token() and remaining_tokens() previously paid a
read_next_token() function call per whitespace run only to
recognise and skip the resulting WHITESPACE token. A single
unguarded strspn() at the top of each loop iteration absorbs
the run inline, saving the call overhead for ~one whitespace
run per real token across millions of tokens.

The strspn() call is unguarded because an unconditional strspn()
(which returns 0 in a single C-side call when nothing matches)
is faster than gating it on a five-arm '$byte === ...' precheck.
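
The unguarded strspn() skip can be illustrated with a toy token loop. This is a sketch under assumptions: the five-byte whitespace set below is a guess at the lexer's set, `token_offsets()` is a hypothetical name, and a "token" here is just a run of non-whitespace bytes rather than a real MySQL token.

```php
<?php
// Assumed whitespace set for this sketch; the real lexer's set may differ.
const SKETCH_WHITESPACE = " \t\n\r\f";

// Returns the byte offsets at which each toy token starts.
function token_offsets( string $sql ): array {
	$length  = strlen( $sql ); // Cached once instead of per bounds check.
	$pos     = 0;
	$offsets = array();

	while ( true ) {
		// Unguarded skip: strspn() returns 0 when the next byte is not whitespace,
		// so no per-byte precheck is needed before calling it.
		$pos += strspn( $sql, SKETCH_WHITESPACE, $pos );
		if ( $pos >= $length ) {
			break;
		}

		$offsets[] = $pos;

		// Toy tokenization: consume everything up to the next whitespace byte.
		$pos += strcspn( $sql, SKETCH_WHITESPACE, $pos );
	}

	return $offsets;
}

print_r( token_offsets( "  SELECT   1 " ) ); // Offsets 2 and 11.
```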
ASCII letters and UTF-8 multibyte start bytes account for most
token-start bytes on the MySQL corpus. They previously fell into
the catch-all `else` at the bottom of read_next_token() after
walking every operator arm in between. The new branch sits at
the top of the elseif chain and dispatches them directly.

The `next_byte !== "'"` guard keeps the x'..', n'..' and similar
specials on their dedicated branches. `_` and `$` starters stay
on the catch-all so the UNDERSCORE_CHARSET lookup still fires.
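
The fast-path condition described above might look roughly like this when pulled out into a predicate. The function name `is_identifier_fast_path()` is hypothetical, and the real branch is written inline in read_next_token() rather than as a helper.

```php
<?php
// Hypothetical predicate mirroring the described fast-path guard.
function is_identifier_fast_path( string $byte, ?string $next_byte ): bool {
	// ASCII letters start most identifiers and keywords.
	$is_ascii_letter = ( $byte >= 'a' && $byte <= 'z' ) || ( $byte >= 'A' && $byte <= 'Z' );

	// Bytes above 0x7F start a UTF-8 multibyte sequence.
	$is_utf8_start = $byte > "\x7F";

	// A following single quote means a special literal such as x'..' or n'..',
	// which stays on its dedicated branch.
	return ( $is_ascii_letter || $is_utf8_start ) && "'" !== $next_byte;
}

var_dump( is_identifier_fast_path( 'S', 'E' ) );       // bool(true)
var_dump( is_identifier_fast_path( 'x', "'" ) );       // bool(false)
var_dump( is_identifier_fast_path( '_', 'u' ) );       // bool(false)
var_dump( is_identifier_fast_path( "\xC3", "\xA9" ) ); // bool(true)
```

Leaving `_` and `$` out of the predicate is what keeps them on the catch-all branch.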
The ASCII bytes (, ), ',', ;, +, ~, %, ^, ?, {, }, and = each map to a
unique single-byte token type with no lookahead. A static array + isset()
arm dispatches them in one lookup, ahead of the per-byte elseif chain, and
the now-shadowed individual arms further down the chain are removed so the
table is the single source of truth for these tokens.

'*' and '|' are deliberately excluded because their token type depends on
context (in_mysql_comment for '*/', SQL_MODE_PIPES_AS_CONCAT for '||').
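
A sketch of the static array + isset() dispatch for these twelve bytes. The class name and the token names used as values are placeholders invented for this example; the real lexer maps to its own token ids.

```php
<?php
// Hypothetical class; the token names are placeholders, not the lexer's ids.
final class Sketch_Operator_Map {
	const SINGLE_BYTE_TOKENS = array(
		'(' => 'OPEN_PAR',
		')' => 'CLOSE_PAR',
		',' => 'COMMA',
		';' => 'SEMICOLON',
		'+' => 'PLUS',
		'~' => 'BITWISE_NOT',
		'%' => 'MOD',
		'^' => 'BITWISE_XOR',
		'?' => 'PARAM_MARKER',
		'{' => 'OPEN_CURLY',
		'}' => 'CLOSE_CURLY',
		'=' => 'EQUAL',
	);

	// Returns the token for a context-free single-byte operator, or null
	// when the byte has to go through the regular elseif chain.
	public static function lookup( string $byte ): ?string {
		if ( isset( self::SINGLE_BYTE_TOKENS[ $byte ] ) ) {
			return self::SINGLE_BYTE_TOKENS[ $byte ];
		}
		return null;
	}
}

var_dump( Sketch_Operator_Map::lookup( '(' ) ); // string(8) "OPEN_PAR"
var_dump( Sketch_Operator_Map::lookup( '*' ) ); // NULL: context-dependent, not in the map.
```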
Three review-noted spots that were terse in the code:

- The remaining_tokens() loop guard now spells out why both EOF
  and `null === token_type && bytes_already_read > 0` are needed
  (EOF on clean end-of-input vs invalid byte mid-stream, with
  the `> 0` guard letting the very first iteration through).

- The identifier/keyword fast path now explains `$byte > "\x7F"`
  (UTF-8 multi-byte starter; MySQL identifiers allow U+0080-U+FFFF)
  and `next_byte !== "'"` (only single quotes form the special
  hex/bin/n-char literal starters; `"` never does, regardless of
  SQL mode).

No behavior change.

Token construction is on the lexer hot path; bypassing the
`WP_Parser_Token::__construct()` indirection and assigning the four
properties directly removes one method call per token.

Requires `$input` on `WP_Parser_Token` to be `protected` instead of
`private` so the subclass can write to it.

Co-authored-by: Adam Zieliński <adam@adamziel.com>

Adapted from #375
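
The constructor change can be sketched with two stand-in classes. The class names, property names, and the `get_bytes()` accessor are assumptions made for illustration; only the idea (assign the four properties directly, with `$input` made `protected`) comes from the commit message.

```php
<?php
// Hypothetical stand-in for WP_Parser_Token.
class Sketch_Parser_Token {
	public $id;
	public $start;
	public $length;
	protected $input; // protected (not private) so the subclass can assign it.

	public function __construct( int $id, int $start, int $length, string $input ) {
		$this->id     = $id;
		$this->start  = $start;
		$this->length = $length;
		$this->input  = $input;
	}

	public function get_bytes(): string {
		return substr( $this->input, $this->start, $this->length );
	}
}

// Hypothetical stand-in for WP_MySQL_Token.
class Sketch_MySQL_Token extends Sketch_Parser_Token {
	public function __construct( int $id, int $start, int $length, string $input ) {
		// Direct assignment instead of parent::__construct(): one method call fewer per token.
		$this->id     = $id;
		$this->start  = $start;
		$this->length = $length;
		$this->input  = $input;
	}
}

$token = new Sketch_MySQL_Token( 1, 7, 5, 'SELECT users' );
echo $token->get_bytes(), "\n"; // users
```

The trade-off is that the subclass now duplicates the parent's assignments, so the two constructors have to be kept in sync by hand.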
The identifier/keyword branch handles the single largest share of tokens
(~17% identifiers plus all keywords, ~35-45% of the corpus). It called two
methods per token: get_current_token_bytes() to extract the token string and
determine_identifier_or_keyword_type() to classify it.

Inline that fast path into read_next_token(): extract the bytes and do the
strtoupper + TOKENS lookup directly, returning the identifier without any call
when it is not a keyword (the common case). The post-lookup keyword logic
(version gating, function-call lookahead, high-NOT precedence, synonyms) moves
to a new resolve_keyword_type() that is reached only for actual keywords;
determine_identifier_or_keyword_type() now delegates to it for its other caller.

Lex-only throughput on the MySQL server corpus: +5-6% under tracing JIT,
+3.4% without (best-of-seven, ABAB-confirmed). The keyword path's measured
ceiling was ~10% (JIT), most of which is the irreducible substr/strtoupper/
hash-lookup work that remains.
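
A simplified sketch of the inlined lookup: identifiers return immediately after one hash lookup, and only real keywords reach the slower post-processing. The keyword table, the numeric ids, and the function names below are placeholders; the real resolve_keyword_type() also handles version gating, function-call lookahead, and synonyms, which are omitted here.

```php
<?php
// Placeholder keyword table and ids for this sketch only.
const SKETCH_IDENTIFIER = 0;
const SKETCH_KEYWORDS   = array(
	'SELECT' => 1,
	'FROM'   => 2,
	'NOT'    => 3,
);

// Stand-in for the keyword-only post-processing (version gating, lookahead, ...).
function sketch_resolve_keyword_type( int $keyword_type ): int {
	return $keyword_type;
}

// Classifies the word at [$start, $start + $length) as identifier or keyword.
function sketch_classify_word( string $sql, int $start, int $length ): int {
	// Inline extraction and case folding: no helper method calls on the hot path.
	$upper = strtoupper( substr( $sql, $start, $length ) );

	// Common case: not a keyword, so return the identifier type right away.
	if ( ! isset( SKETCH_KEYWORDS[ $upper ] ) ) {
		return SKETCH_IDENTIFIER;
	}

	// Only actual keywords pay for the extra resolution step.
	return sketch_resolve_keyword_type( SKETCH_KEYWORDS[ $upper ] );
}

var_dump( sketch_classify_word( 'select id', 0, 6 ) ); // int(1)
var_dump( sketch_classify_word( 'select id', 7, 2 ) ); // int(0)
```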
JanJakes force-pushed the lexer-performance branch from 3845f79 to 12a78da on June 6, 2026 12:12
JanJakes requested a review from adamziel on June 6, 2026 12:15
JanJakes marked this pull request as ready for review on June 6, 2026 12:15