Optimize MySQL lexer (2× speedup) #424
Open
JanJakes wants to merge 9 commits into
🤖 Lexer benchmark
Changes to lexer-related files were detected and triggered a benchmark:
Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally. To reproduce locally:
run-lexer-benchmark.php timed a single pass, which is too noisy to compare a change against. Rework it into a reliable throughput benchmark that the lexer optimisations in this branch can be measured against:

- Load through src/load.php (parity with run-parser-benchmark.php) so a loaded native extension is benchmarked via the same public WP_MySQL_Lexer wrapper.
- Warm up with discarded passes (heating opcache, the tracing JIT, and CPU caches), then run N timed passes over the whole corpus.
- Headline the best pass: lexing is deterministic and CPU-bound, so outside interference can only slow a pass down, making the fastest pass the most reproducible estimate of intrinsic cost and the most stable basis for a before/after comparison. Median and best-vs-worst spread are reported too so a noisy machine is obvious.
- Detect and report the active config (opcache / tracing JIT) and the implementation (php / native-extension), and warn when opcache.jit is set but the JIT did not actually activate.
- Add --iterations / --warmup; keep --json (headline kept as "qps").

Add a `bench-lexer` script to the mysql-on-sqlite package's composer.json that runs the benchmark twice, without and with the tracing JIT, so both configurations are measured with one `composer run bench-lexer` (JIT is a start-up setting that cannot be toggled mid-process).
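The warm-up, timed-passes, and best/median/worst reporting described above can be sketched roughly as follows. This is a minimal illustration, not the code of run-lexer-benchmark.php: `bench_passes()` is a hypothetical helper, and the workload is a stand-in for lexing the corpus.

```php
<?php
// Hypothetical helper for illustration; not the code of run-lexer-benchmark.php.
function bench_passes( callable $pass, int $warmup, int $iterations ): array {
    if ( $iterations < 1 ) {
        throw new InvalidArgumentException( 'At least one timed pass is required.' );
    }

    // Warm-up passes: run and discard.
    for ( $i = 0; $i < $warmup; $i++ ) {
        $pass();
    }

    // Timed passes over the whole workload.
    $times = array();
    for ( $i = 0; $i < $iterations; $i++ ) {
        $start = hrtime( true ); // Monotonic clock, nanoseconds.
        $pass();
        $times[] = ( hrtime( true ) - $start ) / 1e9; // Seconds.
    }

    sort( $times );
    $count  = count( $times );
    $middle = intdiv( $count, 2 );
    $median = 0 === $count % 2
        ? ( $times[ $middle - 1 ] + $times[ $middle ] ) / 2
        : $times[ $middle ];

    return array(
        'best'   => $times[0],
        'median' => $median,
        'worst'  => $times[ $count - 1 ],
    );
}

// Stand-in workload (an assumption): the real tool lexes the query corpus here.
$queries = array( 'SELECT 1', 'SELECT a FROM t WHERE b = 2' );
$stats   = bench_passes(
    function () use ( $queries ) {
        foreach ( $queries as $query ) {
            strtoupper( $query );
        }
    },
    2,
    5
);
printf( "best %.9fs, median %.9fs, worst %.9fs\n", $stats['best'], $stats['median'], $stats['worst'] );
```

Reporting the best pass alongside the median and worst makes a noisy machine visible: a large gap between best and worst suggests the run was disturbed.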
On pull requests that touch the lexer (or the benchmark tool), run run-lexer-benchmark.php for both the base commit and the PR head on the same runner, without and with the tracing JIT, and post the before/after numbers as a single comment that updates in place on every push. The job is informational, not gating: hosted CI runners are too noisy for absolute-throughput thresholds. Measuring base and head back-to-back on the same runner cancels the runner's absolute speed, so the same-runner speedup ratio is the meaningful signal. Only the source tree is swapped to the base commit; the PR's benchmark tool is reused for both sides so they are timed identically.
Apply the structural lexer optimisations from PR #375:

- Cache strlen($sql) once in $sql_length instead of recomputing it on each EOF/bounds check.
- Use strpos($sql, '*/', $pos) instead of a manual scan loop in read_comment_content().
- In read_quoted_text(), use strpos() to find the next quote, dropping the separate end-of-input check that followed the strcspn() scan.
- Inline next_token() + get_token() in remaining_tokens() so the hot loop builds tokens directly.

The #375 strspn()->byte-comparison swaps are intentionally not included: once the dispatch chain is reordered by later commits those checks are off the hot path and strspn() is marginally faster than the inline comparisons, so the swaps were net-neutral-to-negative while adding code.

Co-authored-by: Adam Zieliński <adam@adamziel.com>
Adapted from #375
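As a rough illustration of the strpos()-based comment scan, here is a sketch using a hypothetical free function, `skip_comment_content()`. The real read_comment_content() is a lexer method whose exact body is not shown here, and treating an unterminated comment as running to the end of input is an assumption.

```php
<?php
// Hypothetical stand-in for the comment scan; not the lexer's actual method.
// Returns the offset just past the comment terminator, or the input length
// when the comment is never closed (an assumed fallback).
function skip_comment_content( string $sql, int $pos, int $sql_length ): int {
    // One C-side search replaces a byte-by-byte PHP loop.
    $end = strpos( $sql, '*/', $pos );
    return false === $end ? $sql_length : $end + 2;
}

$sql        = 'SELECT /* hint */ 1';
$sql_length = strlen( $sql ); // Computed once and reused for every bounds check.

$after = skip_comment_content( $sql, 9, $sql_length ); // 9 = first byte after "/*".
echo substr( $sql, $after ), "\n";
```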
Both next_token() and remaining_tokens() previously paid a read_next_token() function call per whitespace run only to recognise and skip the resulting WHITESPACE token. A single unguarded strspn() at the top of each loop iteration absorbs the run inline, saving the call overhead for ~one whitespace run per real token across millions of tokens. The strspn() call is unguarded because an unconditional strspn() (which returns 0 in a single C-side call when nothing matches) is faster than gating it on a five-arm '$byte === ...' precheck.
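A minimal sketch of the unguarded strspn() skip, assuming a whitespace set of space, tab, newline, carriage return, and form feed (the set the lexer actually uses is not stated here). `split_on_whitespace()` is a hypothetical helper that stands in for the token loop.

```php
<?php
// Assumed whitespace set; the lexer's real mask may differ.
const WHITESPACE_MASK = " \t\n\r\f";

// Hypothetical stand-in for the token loop: returns the non-whitespace chunks.
function split_on_whitespace( string $sql ): array {
    $sql_length = strlen( $sql );
    $pos        = 0;
    $chunks     = array();

    while ( true ) {
        // Unguarded: strspn() simply returns 0 when the byte at $pos is not whitespace.
        $pos += strspn( $sql, WHITESPACE_MASK, $pos );
        if ( $pos >= $sql_length ) {
            break;
        }

        // A real lexer would dispatch on the byte here; this sketch just reads to the next whitespace.
        $run      = strcspn( $sql, WHITESPACE_MASK, $pos );
        $chunks[] = substr( $sql, $pos, $run );
        $pos     += $run;
    }

    return $chunks;
}

print_r( split_on_whitespace( "SELECT  1\n FROM t " ) );
```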
ASCII letters and UTF-8 multibyte start bytes account for most token-start bytes on the MySQL corpus. They previously fell into the catch-all `else` at the bottom of read_next_token() after walking every operator arm in between. The new branch sits at the top of the elseif chain and dispatches them directly. The `next_byte !== "'"` guard keeps the x'..', n'..' and similar specials on their dedicated branches. `_` and `$` starters stay on the catch-all so the UNDERSCORE_CHARSET lookup still fires.
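The condition guarding such a fast path might look like the sketch below. `takes_identifier_fast_path()` is a hypothetical name, and the exact expression in read_next_token() may differ; only the `$byte > "\x7F"` and `next_byte !== "'"` parts come from the description.

```php
<?php
// Hypothetical predicate; the real check is inlined in read_next_token().
function takes_identifier_fast_path( string $byte, ?string $next_byte ): bool {
    $is_ascii_letter = ( $byte >= 'a' && $byte <= 'z' ) || ( $byte >= 'A' && $byte <= 'Z' );
    $is_multibyte    = $byte > "\x7F"; // UTF-8 multi-byte start byte.

    // A following single quote may start x'..', n'..' and similar literals,
    // so those stay on their dedicated branches.
    return ( $is_ascii_letter || $is_multibyte ) && "'" !== $next_byte;
}

var_dump( takes_identifier_fast_path( 's', 'e' ) ); // Letter followed by a letter.
var_dump( takes_identifier_fast_path( 'x', "'" ) ); // Possible hex literal starter.
var_dump( takes_identifier_fast_path( '_', 'u' ) ); // Left to the catch-all branch.
```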
The ASCII bytes (, ), ',', ;, +, ~, %, ^, ?, {, }, and = each map to a
unique single-byte token type with no lookahead. A static array + isset()
arm dispatches them in one lookup, ahead of the per-byte elseif chain, and
the now-shadowed individual arms further down the chain are removed so the
table is the single source of truth for these tokens.
'*' and '|' are deliberately excluded because their token type depends on
context (in_mysql_comment for '*/', SQL_MODE_PIPES_AS_CONCAT for '||').
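A static lookup table with an isset() check could be sketched like this. The class name and the string token names are placeholders chosen for readability, not the lexer's real constants.

```php
<?php
// Hypothetical class; token names are illustrative placeholders.
final class Single_Byte_Tokens {
    private static $map = array(
        '(' => 'OPEN_PAR',
        ')' => 'CLOSE_PAR',
        ',' => 'COMMA',
        ';' => 'SEMICOLON',
        '+' => 'PLUS',
        '~' => 'BITWISE_NOT',
        '%' => 'MOD',
        '^' => 'BITWISE_XOR',
        '?' => 'PARAM_MARKER',
        '{' => 'OPEN_CURLY',
        '}' => 'CLOSE_CURLY',
        '=' => 'EQUAL',
    );

    // One hash lookup decides both "is it in the table?" and "which type?".
    public static function lookup( string $byte ): ?string {
        return isset( self::$map[ $byte ] ) ? self::$map[ $byte ] : null;
    }
}

var_dump( Single_Byte_Tokens::lookup( '(' ) );
var_dump( Single_Byte_Tokens::lookup( '*' ) ); // Context-dependent, so not in the table.
```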
Three review-noted spots that were terse in the code:

- The remaining_tokens() loop guard now spells out why both EOF and `null === token_type && bytes_already_read > 0` are needed (EOF on clean end-of-input vs invalid byte mid-stream, with the `> 0` guard letting the very first iteration through).
- The identifier/keyword fast path now explains `$byte > "\x7F"` (UTF-8 multi-byte starter; MySQL identifiers allow U+0080-U+FFFF) and `next_byte !== "'"` (only single quotes form the special hex/bin/n-char literal starters; `"` never does, regardless of SQL mode).

No behavior change.
Token construction is on the lexer hot path; bypassing the `WP_Parser_Token::__construct()` indirection and assigning the four properties directly removes one method call per token. Requires `$input` on `WP_Parser_Token` to be `protected` instead of `private` so the subclass can write to it.

Co-authored-by: Adam Zieliński <adam@adamziel.com>
Adapted from #375
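The pattern can be sketched with two hypothetical classes, `Base_Token` and `Fast_Token`, standing in for WP_Parser_Token and its subclass. The property names (id, start, length, input) are an assumption about what the four properties are.

```php
<?php
// Hypothetical stand-in for the base token class.
class Base_Token {
    public $id;
    public $start;
    public $length;
    protected $input; // Protected (not private) so a subclass can assign it.

    public function __construct( int $id, int $start, int $length, string $input ) {
        $this->id     = $id;
        $this->start  = $start;
        $this->length = $length;
        $this->input  = $input;
    }

    public function get_bytes(): string {
        return substr( $this->input, $this->start, $this->length );
    }
}

// Hypothetical stand-in for the subclass on the hot path.
class Fast_Token extends Base_Token {
    public function __construct( int $id, int $start, int $length, string $input ) {
        // Assign directly instead of calling parent::__construct():
        // one fewer method call per token.
        $this->id     = $id;
        $this->start  = $start;
        $this->length = $length;
        $this->input  = $input;
    }
}

$token = new Fast_Token( 1, 7, 3, 'SELECT foo' );
echo $token->get_bytes(), "\n";
```

With a `private` `$input` on the base class, the subclass assignment would create a separate property on the object and `get_bytes()` would not see it, which is why the visibility change is needed.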
The identifier/keyword branch handles the single largest share of tokens (~17% identifiers plus all keywords, ~35-45% of the corpus). It called two methods per token: get_current_token_bytes() to extract the token string and determine_identifier_or_keyword_type() to classify it.

Inline that fast path into read_next_token(): extract the bytes and do the strtoupper + TOKENS lookup directly, returning the identifier without any call when it is not a keyword (the common case). The post-lookup keyword logic (version gating, function-call lookahead, high-NOT precedence, synonyms) moves to a new resolve_keyword_type() that is reached only for actual keywords; determine_identifier_or_keyword_type() now delegates to it for its other caller.

Lex-only throughput on the MySQL server corpus: +5-6% under tracing JIT, +3.4% without (best-of-seven, ABAB-confirmed). The keyword path's measured ceiling was ~10% (JIT), most of which is the irreducible substr/strtoupper/hash-lookup work that remains.
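The shape of that split, an inline lookup with a slow path reserved for real keywords, might look like the following sketch. `Mini_Lexer`, its two-entry keyword table, and the string type names are hypothetical; the `resolve_keyword_type()` body here is a placeholder for the version gating, lookahead, and synonym logic, which is not reproduced.

```php
<?php
// Hypothetical miniature of the identifier/keyword fast path.
final class Mini_Lexer {
    const IDENTIFIER = 'IDENTIFIER';

    // Stand-in for the TOKENS table; the real one is far larger.
    private static $tokens = array(
        'SELECT' => 'SELECT_SYMBOL',
        'NOT'    => 'NOT_SYMBOL',
    );

    public function classify( string $sql, int $start, int $length ): string {
        // Inlined: extract the bytes and look them up without a method call.
        $upper = strtoupper( substr( $sql, $start, $length ) );
        if ( ! isset( self::$tokens[ $upper ] ) ) {
            return self::IDENTIFIER; // Common case: no further call.
        }

        // Only actual keywords pay for the extra call.
        return $this->resolve_keyword_type( self::$tokens[ $upper ] );
    }

    private function resolve_keyword_type( string $type ): string {
        // Placeholder: the real method applies the post-lookup keyword rules.
        return $type;
    }
}

$lexer = new Mini_Lexer();
echo $lexer->classify( 'select foo', 0, 6 ), "\n";
echo $lexer->classify( 'select foo', 7, 3 ), "\n";
```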
Summary
A standalone, lexer-only subset of the parser-performance work (#378 / #373 / #375), based on `trunk` on its own. It's worth landing separately because it's a clean, low-risk win that carries no per-process startup cost.

Lexer-only throughput on the 69,577-query MySQL server corpus (MB Pro M4, PHP 8.5), measured with the `run-lexer-benchmark.php` added in this PR (best pass after warmup), `trunk` vs this branch back-to-back:

Lexer benchmark CI job
I added a new "Lexer benchmark" CI job that triggers on any changes to the lexer-related files, runs base vs. PR benchmarks on the CI, posts a comment with informative results, and keeps it updated (see below).
The speedups here on the CI seem to be more modest than locally, but noisy CI workers may skew the results.
Optimizations
Approximate cumulative contribution to the lexer throughput improvements:
- `strspn()` at the top of the token loop skips whitespace runs instead of a `read_next_token()` round-trip per run.
- `static` byte → token id map for `( ) , ; + ~ % ^ ? { } =`.
- `WP_MySQL_Token` (~+5%). The token subclass assigns its fields directly.
- `strlen($sql)` once, `strpos()`-based comment-end and quote scans, inlined `remaining_tokens()`.

Two changes from earlier drafts were measured as neutral-to-negative once the dispatch chain was reordered and were dropped: the `strspn()` → byte-comparison swaps (slightly slower under warm JIT, and they add code) and the `--`-comment whitespace unroll (~0%).

Relationship to #378 / #373 / #375

These are lexer-side optimizations from the consolidated #378 branch, isolated onto `trunk`. They're independent of the parser/grammar preprocessing and add no startup cost, so they can land first. The parser-side work (which does carry a per-process grammar-build cost) can be evaluated separately.