diff --git a/WIP.md b/WIP.md
index 3bd29986..ef72dfa5 100644
--- a/WIP.md
+++ b/WIP.md
@@ -433,7 +433,7 @@ From `docs/`:
 - `check.bat` -- link check (offline Lychee against `_site/`).
 - `book.bat` -- renders the PDF from `_site-pdf/book.html` via `pagedjs-cli` into `_pdf/book.pdf`. Run `build.bat` first to populate `_site-pdf/`.
 
-The HTML whitespace compression that wraps every page's render chain is handled by `_plugins/html-compress.rb` rather than the just-the-docs theme's `vendor/compress.html` Liquid layout -- see [_plugins/html-compress.md](docs/_plugins/html-compress.md) for the full writeup. The Liquid layout's per-page cost in the profile was ~2.4s of Liquid filter dispatch (a `split: " " | join: " "` over the outside-of-`<pre>` content, lowering to a per-page Array allocation of every whitespace-delimited token across 837 pages -- millions of small `String` objects). The layout is short-circuited via `compress_html.ignore.envs: all` in `_config.yml`; it then outputs a bare `{{ content }}` and the plugin takes over at `:pages, :post_render` / `:documents, :post_render` with `priority :high`, doing the same pre-block-protected whitespace collapse via `content.split(PRE_BLOCK_RE).each { |s| s.split(" ").join(" ") }` in C-implemented Ruby. The `priority :high` annotation places this hook before offlinify and pdfify (both `:normal`) so they see the compressed bytes. Pages whose layout chain doesn't reach `vendor/compress` are gated out via a `:site, :pre_render` precompute that walks `site.layouts[name].data["layout"]` for every layout key and marks the entire compress-reaching chain (default → table_wrappers → vendor/compress) -- jekyll-redirect-from stubs, the SCSS-derived CSS pages, `assets/js/zzzz-search-data.json`, and `book.html` (which uses the minimal `book-combined` layout that has no parent) all stay un-gated and pass through verbatim, matching exactly what the Liquid layout would have processed. Output is byte-identical to the layout-based version: a recursive `diff -rq` of `_site/` against a vendor/compress.html baseline reports zero differences across all ~840 HTML pages, 290 redirect stubs, every CSS / JSON / SVG / image asset. The plugin's correctness depended on two non-obvious details that broke an earlier cut -- the layout-chain walk has to compare against the layout *key* (`"vendor/compress"`) rather than `layout.name` (which carries the `.html` extension), and the per-segment `split(" ").join(" ")` strips trailing whitespace that the Liquid layout's *template* re-adds via its trailing-newline source character, so the plugin captures `content.end_with?("\n")` before the split and re-appends a `\n` after the join. Both regressions surfaced as nonzero `diff -rq` counts during development and are flagged in the plugin's header comment and [_plugins/html-compress.md](docs/_plugins/html-compress.md).
+The HTML whitespace compression that wraps every page's render chain is handled by `_plugins/html-compress.rb` rather than the just-the-docs theme's `vendor/compress.html` Liquid layout -- see [_plugins/html-compress.md](docs/_plugins/html-compress.md) for the full writeup. The Liquid layout's per-page cost in the profile was ~2.4s of Liquid filter dispatch (a `split: " " | join: " "` over the outside-of-`<pre>` content, lowering to a per-page Array allocation of every whitespace-delimited token across 837 pages -- millions of small `String` objects). The layout is short-circuited via `compress_html.ignore.envs: all` in `_config.yml`; it then outputs a bare `{{ content }}` and the plugin takes over at `:pages, :post_render` / `:documents, :post_render` with `priority :normal`, doing the same pre-block-protected whitespace collapse via `content.split(PRE_BLOCK_RE).each { |s| s.split(" ").join(" ") }` in C-implemented Ruby. The `:normal` priority is the *middle* tier of a three-level convention across the site's `:post_render` hooks: mutators (`book-href-rewrite`) run at `:high`, this cleanup pass at `:normal`, readers (`pdfify`, `offlinify`) at `:low`. The invariant "compress runs after every mutator and before every reader" therefore holds by construction; no downstream plugin has to be whitespace-aware. Pages whose layout chain doesn't reach `vendor/compress` are gated out via a `:site, :pre_render` precompute that walks `site.layouts[name].data["layout"]` for every layout key and marks the entire compress-reaching chain (default → table_wrappers → vendor/compress) -- jekyll-redirect-from stubs, the SCSS-derived CSS pages, and `assets/js/zzzz-search-data.json` all stay un-gated and pass through verbatim. `book.html` (which uses the minimal `book-combined` layout that has no parent) is *also* outside that chain but is explicitly added to the compress-eligible set at the end of the precompute, so the same whitespace collapse runs on it; this saves paged.js's render-time `WhiteSpaceFilter` ~37k DOM mutations (~28k `textContent` overwrites + ~9k `removeChild` calls) at the cost of ~480 ms once per Jekyll build. Output is byte-identical to the layout-based version for the 837 `vendor/compress`-reaching pages: a recursive `diff -rq` of `_site/` against a vendor/compress.html baseline reports zero differences across all ~840 HTML pages, 290 redirect stubs, every CSS / JSON / SVG / image asset. The plugin's correctness depended on two non-obvious details that broke an earlier cut -- the layout-chain walk has to compare against the layout *key* (`"vendor/compress"`) rather than `layout.name` (which carries the `.html` extension), and the per-segment `split(" ").join(" ")` strips trailing whitespace that the Liquid layout's *template* re-adds via its trailing-newline source character, so the plugin captures `content.end_with?("\n")` before the split and re-appends a `\n` after the join. Both regressions surfaced as nonzero `diff -rq` counts during development and are flagged in the plugin's header comment and [_plugins/html-compress.md](docs/_plugins/html-compress.md).
 
 ### Profiling the build
 
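Reviewer sketch for the WIP.md paragraph above: a minimal, self-contained illustration of the pre-block-protected whitespace collapse with trailing-newline restore. It is not the plugin's code. `PRE_BLOCK_SKETCH_RE` and `compress_sketch!` are hypothetical names, and the regex is an assumption, since the real `PRE_BLOCK_RE` is not shown in this diff.

```ruby
# Hypothetical pattern (assumption): the capture group makes String#split
# keep each matched <pre>...</pre> block in the result array.
PRE_BLOCK_SKETCH_RE = %r{(<pre\b.*?</pre>)}mi

# Collapse whitespace runs outside <pre> blocks, mutating `content` in place.
def compress_sketch!(content)
  # split(" ").join(" ") drops trailing whitespace, so remember the newline.
  had_trailing_newline = content.end_with?("\n")

  # With one capture group, odd indices are always the captured <pre> blocks.
  pieces = content.split(PRE_BLOCK_SKETCH_RE)
  collapsed = pieces.each_with_index.map do |piece, i|
    i.odd? ? piece : piece.split(" ").join(" ")
  end.join

  collapsed << "\n" if had_trailing_newline
  content.replace(collapsed) # same String object the caller passed in
end

# Unary + yields a mutable copy even under frozen_string_literal.
html = +"<p>a   b</p>\n\n<pre>  keep\n  this </pre>\n  <p>c</p>\n"
compress_sketch!(html)
puts html.inspect
```

The in-place `replace` mirrors the design described in html-compress.md: the caller keeps its reference to `page.output`, so nothing has to be reassigned after the hook runs.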
diff --git a/docs/_plugins/book-href-rewrite.rb b/docs/_plugins/book-href-rewrite.rb
index add4b3ef..28acdc26 100644
--- a/docs/_plugins/book-href-rewrite.rb
+++ b/docs/_plugins/book-href-rewrite.rb
@@ -372,7 +372,11 @@ def self.process(page)
   end
 end
 
-Jekyll::Hooks.register :pages, :post_render do |page|
+# :high so this MUTATOR runs before html-compress (priority :normal).
+# Otherwise the landing-heading strip leaves a double-space run that
+# no downstream pass cleans up. See html-compress.rb's priority
+# convention comment for the full layering.
+Jekyll::Hooks.register :pages, :post_render, priority: :high do |page|
   next unless page.path == "book.html"
   BookHrefRewrite.process(page)
 end
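Reviewer sketch for the priority change above: a small simulation of why the mutator has to run before compress and the reader after it. This is not Jekyll's hook registry. `HookSketch`, the hook names, and the `<h1>Landing</h1>` markup are hypothetical; only the 30/20/10 values and the descending order come from html-compress.md.

```ruby
# Numeric tiers as quoted in html-compress.md.
PRIORITY = { high: 30, normal: 20, low: 10 }.freeze

# Hypothetical stand-in for a hook registry (assumption, not Jekyll's code).
class HookSketch
  def initialize
    @hooks = []
  end

  def register(name, priority:, &block)
    @hooks << [PRIORITY.fetch(priority), @hooks.size, name, block]
  end

  # Runs hooks in descending priority; registration order breaks ties.
  # Returns the names in the order they ran.
  def run(output)
    ordered = @hooks.sort_by { |prio, seq, _name, _block| [-prio, seq] }
    ordered.each { |_prio, _seq, _name, block| block.call(output) }
    ordered.map { |_prio, _seq, name, _block| name }
  end
end

hooks = HookSketch.new
seen_by_reader = nil

# Registered in the "wrong" order on purpose: priority decides, not load order.
hooks.register(:reader, priority: :low) { |out| seen_by_reader = out.dup }
hooks.register(:compress, priority: :normal) { |out| out.replace(out.split(" ").join(" ")) }
hooks.register(:mutator, priority: :high) { |out| out.gsub!("<h1>Landing</h1>", "") }

page_output = +"<p>intro</p> <h1>Landing</h1> <p>body</p>"
order = hooks.run(page_output)
puts order.inspect
puts seen_by_reader
```

If the mutator were registered at `:normal` after compress instead, the strip would leave two adjacent spaces that nothing downstream collapses, which is the regression the comment in this hunk describes.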
diff --git a/docs/_plugins/html-compress.md b/docs/_plugins/html-compress.md
index 114e3889..7430862a 100644
--- a/docs/_plugins/html-compress.md
+++ b/docs/_plugins/html-compress.md
@@ -1,6 +1,6 @@
 # HtmlCompress
 
-`_plugins/html-compress.rb` runs the HTML whitespace compression that wraps every page's render chain — the same job just-the-docs's vendor/compress.html Liquid layout was doing, but in Ruby instead of Liquid filters. Output is byte-identical to the layout-based version (verified by recursive diff of every file in `_site/` against a vendor/compress.html baseline). The Liquid layout is short-circuited to a `{{ content }}` passthrough via `compress_html.ignore.envs: all` in `_config.yml`; the plugin then runs at `:pages, :post_render` / `:documents, :post_render` with `priority :high`, so the compressed bytes are what offlinify and Jekyll's writer see.
+`_plugins/html-compress.rb` runs the HTML whitespace compression that wraps every page's render chain — the same job just-the-docs's vendor/compress.html Liquid layout was doing, but in Ruby instead of Liquid filters. Output is byte-identical to the layout-based version for the 837 vendor/compress-reaching pages (verified by recursive diff of every file in `_site/` against a vendor/compress.html baseline). The Liquid layout is short-circuited to a `{{ content }}` passthrough via `compress_html.ignore.envs: all` in `_config.yml`; the plugin then runs at `:pages, :post_render` / `:documents, :post_render` with `priority :normal` as the *cleanup* step in a three-tier `:high` → `:normal` → `:low` ordering (mutators → compress → readers — see [Hook priority convention](#hook-priority-convention) below). It also picks up one page the original layout didn't process, `book.html`, via an explicit `book-combined` addition to the compress-eligible set — see [book.html inclusion](#bookhtml-inclusion).
 
 This file sits in `_plugins/` for the same reasons as `offlinify.md` and `pdfify.md`: it lives next to the code it documents, and Jekyll's `_plugins/` folder is plugin-only territory, so this Markdown never gets rendered into the public site.
 
@@ -74,7 +74,7 @@ page.md   (layout: default)
         └── vendor/compress.html (no layout)
 ```
 
-Pages that don't use any of these layouts — jekyll-redirect-from stubs, the SCSS-derived CSS pages, `assets/js/zzzz-search-data.json`, `book.html` (which uses the minimal `book-combined` layout that has no parent) — were left untouched by the layout. The plugin has to match that gating, otherwise it would compress files that compress.html doesn't, breaking byte-identity.
+Pages that don't use any of these layouts — jekyll-redirect-from stubs, the SCSS-derived CSS pages, `assets/js/zzzz-search-data.json` — were left untouched by the layout. The plugin has to match that gating, otherwise it would compress files that compress.html doesn't, breaking byte-identity. `book.html` (which uses the minimal `book-combined` layout that has no parent) was originally in this list, but is now explicitly added to the compress-eligible set — see [book.html inclusion](#bookhtml-inclusion).
 
 The gate is precomputed once at `:site, :pre_render`:
 
@@ -114,20 +114,42 @@ Jekyll::Hooks.register :site, :pre_render do |site|
   HtmlCompress.precompute_compress_layouts!(site)
 end
 
-Jekyll::Hooks.register :pages, :post_render, priority: :high do |page|
+Jekyll::Hooks.register :pages, :post_render, priority: :normal do |page|
   next unless page.output.is_a?(String)
   next unless HtmlCompress.compress?(page)
   HtmlCompress.compress!(page.output)
 end
 
-Jekyll::Hooks.register :documents, :post_render, priority: :high do |doc|
+Jekyll::Hooks.register :documents, :post_render, priority: :normal do |doc|
   next unless doc.output.is_a?(String)
   next unless HtmlCompress.compress?(doc)
   HtmlCompress.compress!(doc.output)
 end
 ```
 
-The `priority: :high` is what places the plugin *before* `offlinify.rb` and `pdfify.rb` in the per-page render-hook order — both of those use the default `:normal` priority and rely on reading the final compressed `page.output`. Jekyll runs `:post_render` hooks in descending priority, so `:high` (30) fires before `:normal` (20). Without the priority annotation the order would be insertion-order across all `.rb` files in `_plugins/`, which is not a stable contract.
+## Hook priority convention
+
+The `priority: :normal` is the middle tier of a three-level ordering for `:pages, :post_render` and `:documents, :post_render` hooks across the plugin set. Jekyll runs hooks in descending priority (`:high` (30) → `:normal` (20) → `:low` (10)), and the three tiers carry distinct roles:
+
+| Tier | Role | Plugins |
+| --- | --- | --- |
+| `:high` (30) | **Mutators.** Modify `page.output` so the final bytes reflect this pass. | `book-href-rewrite` (chapter href rewrites + landing-heading strip on `book.html`). |
+| `:normal` (20) | **Compress.** The cleanup pass. Sandwiched between mutators and readers so any whitespace runs left behind by a mutator's `gsub` get collapsed before any reader captures the bytes. | `html-compress` (this plugin). |
+| `:low` (10) | **Readers.** Snapshot or consume `page.output` after the cleanup pass. | `pdfify` (captures `book.html` for the PDF pipeline), `offlinify` (per-page href / src rewrites + write to `_site-offline/`). |
+
+The layering was originally implicit: the plugin sat at `:high` next to no other priority-annotated `:post_render` hooks. That worked until `book-href-rewrite` joined the set at default `:normal`. Its landing-heading strip ran *after* compress, removing the landing-heading blocks but leaving the (already-collapsed) single-space runs on either side adjacent -- producing literal `> <` blobs in three chapter openings that paged.js's WhiteSpaceFilter then had to handle at render time. Promoting `book-href-rewrite` to `:high` and demoting compress to `:normal` makes the invariant "compress is the last cleanup step among mutators" hold by construction; demoting the readers to `:low` makes "readers see the final compressed output" hold by construction. Future plugins choose their tier by their role and the ordering composes automatically.
+
+The full priority story is documented as a comment block above the `Jekyll::Hooks.register` calls in [`html-compress.rb`](html-compress.rb); each of the four affected plugins (this one, `book-href-rewrite`, `pdfify`, `offlinify`) carries a one-line note pointing back to that block.
+
+## book.html inclusion
+
+The layout-chain walk above only marks layouts that reach `vendor/compress`. `book.html` uses the minimal `book-combined` layout, which has no parent, so the walk never reaches it and the page was originally skipped (matching the layout's behaviour). After investigation of paged.js's per-render `WhiteSpaceFilter` work (see [`perf/README.md`](../../perf/README.md)) showed it doing ~37k DOM mutations at render time to handle whitespace text nodes that *would* have been collapsed if the page had been compressed at Jekyll build time, the precompute was extended to mark `book-combined` explicitly:
+
+```ruby
+@compress_layouts << "book-combined" if site.layouts.key?("book-combined")
+```
+
+at the end of `precompute_compress_layouts!`. Output: `book.html` now passes through `compress!` once per build (~480 ms of additional `String#split` work on the ~5.5 MB document), saving roughly the same wall-clock at paged.js render time (~28k `textContent` overwrites + ~9k `removeChild` calls eliminated).
Net is approximately wall-clock-neutral for full builds, and a small net win for incremental Jekyll workflows that skip the PDF (`also_build_pdf: false`) — the compress cost is paid once per Jekyll build, the render saving is paid every PDF build, and decoupling the two is the structural improvement. ## Verification @@ -157,6 +179,6 @@ In source order in [`html-compress.rb`](html-compress.rb): - `precompute_compress_layouts!(site)` — `:site, :pre_render` entry. Walks every layout chain via `data["layout"]`, marks each layout on the path as compress-ending the moment the walk hits `vendor/compress`. Idempotent; the resulting `@compress_layouts` set persists across builds in `jekyll serve` and gets rebuilt fresh each `:pre_render`. -- `compress?(page)` — gate check. Returns `true` when the page's `data["layout"]` is in `@compress_layouts`. Pages without a layout (jekyll-redirect-from stubs, SCSS-derived CSS, JSON-via-page-rendering, `book.html` via `book-combined`) return `false` and skip the compression entirely. +- `compress?(page)` — gate check. Returns `true` when the page's `data["layout"]` is in `@compress_layouts`. Pages without a layout (jekyll-redirect-from stubs, SCSS-derived CSS, JSON-via-page-rendering) return `false` and skip the compression entirely. `book.html` (which uses `book-combined`, a minimal layout with no parent) used to land here too; it is now explicitly added to the set by `precompute_compress_layouts!` — see [book.html inclusion](#bookhtml-inclusion). - `compress!(content)` — the actual compression, in place. Captures the trailing-newline state, splits by `PRE_BLOCK_RE` with the capture group so pre bodies are preserved in the result array, runs `split(" ").join(" ")` on every outside-of-pre segment, joins, restores the trailing newline if needed, then mutates the input string via `String#replace`. 
The `replace` is what lets us hand back the same string object the caller passed in — Jekyll's writer reads `page.output` after `:post_render`, so in-place mutation is the cheapest way to update what gets written. diff --git a/docs/_plugins/html-compress.rb b/docs/_plugins/html-compress.rb index 7603a58d..ddbb1949 100644 --- a/docs/_plugins/html-compress.rb +++ b/docs/_plugins/html-compress.rb @@ -72,6 +72,12 @@ def self.precompute_compress_layouts!(site) cur_name = cur ? cur.data["layout"] : nil end end + # book-combined is a minimal layout with no parent, so the walk + # above doesn't reach it. Compressing its only consumer (book.html) + # at Jekyll time saves paged.js's WhiteSpaceFilter ~37k DOM + # mutations and ~300-400 ms once per render -- see + # perf/README.md "WhiteSpaceFilter that wasn't" section. + @compress_layouts << "book-combined" if site.layouts.key?("book-combined") end # True when `page` (or document) uses a layout chain ending in @@ -117,16 +123,43 @@ def self.compress!(content) HtmlCompress.precompute_compress_layouts!(site) end -# Run before offlinify (default :normal priority) so the offline-tree -# rewrites see the compressed page.output, and before Jekyll's -# `:site, :post_write` writes _site/ for the same reason. -Jekyll::Hooks.register :pages, :post_render, priority: :high do |page| +# Priority convention for :pages, :post_render hooks in this site: +# +# :high = MUTATORS. Plugins that modify page.output. Run first so +# their mutations are visible to compress and downstream +# readers. Examples: book-href-rewrite (landing heading +# strip + in-book href rewrites). +# +# :normal = COMPRESS. This plugin. The cleanup pass, sandwiched +# between mutators and readers so any whitespace runs left +# behind by a mutator's gsub get collapsed before anyone +# reads the final bytes. +# +# :low = READERS. Plugins that snapshot or consume page.output +# after all mutations and the compress pass. Run last so +# they see final output. 
Examples: pdfify (captures +# book.html for the PDF pipeline), offlinify (rewrites +# root-absolute hrefs and writes to _site-offline/). +# +# Without this layering, a mutator running after compress leaves +# adjacent whitespace runs that no downstream pass collapses; a +# reader running before compress captures uncompressed bytes. Both +# regressions surfaced when book-href-rewrite (default :normal) ran +# after html-compress (originally :high) -- its 3 landing-heading +# strips left double-space artifacts that paged.js's WhiteSpaceFilter +# had to handle at render time. +# +# Offlinify also runs at :site, :post_write (a later phase entirely), +# where it always sees the final compressed bytes regardless of +# per-page priority. The :low designation here governs its per-page +# capture hook specifically. +Jekyll::Hooks.register :pages, :post_render, priority: :normal do |page| next unless page.output.is_a?(String) next unless HtmlCompress.compress?(page) HtmlCompress.compress!(page.output) end -Jekyll::Hooks.register :documents, :post_render, priority: :high do |doc| +Jekyll::Hooks.register :documents, :post_render, priority: :normal do |doc| next unless doc.output.is_a?(String) next unless HtmlCompress.compress?(doc) HtmlCompress.compress!(doc.output) diff --git a/docs/_plugins/offlinify.rb b/docs/_plugins/offlinify.rb index ab5032e1..782038f5 100644 --- a/docs/_plugins/offlinify.rb +++ b/docs/_plugins/offlinify.rb @@ -1443,11 +1443,13 @@ def self.decode(path) Offlinify.setup(site) end -Jekyll::Hooks.register :pages, :post_render do |page| +# :low so these READERS see page.output after html-compress (:normal) +# has run. See html-compress.rb's priority convention. 
+Jekyll::Hooks.register :pages, :post_render, priority: :low do |page| Offlinify.process_page(page) end -Jekyll::Hooks.register :documents, :post_render do |doc| +Jekyll::Hooks.register :documents, :post_render, priority: :low do |doc| Offlinify.process_page(doc) end diff --git a/docs/_plugins/pdfify.rb b/docs/_plugins/pdfify.rb index 49eebeac..f3feabb9 100644 --- a/docs/_plugins/pdfify.rb +++ b/docs/_plugins/pdfify.rb @@ -287,7 +287,9 @@ def self.copy_file(src, dst) Pdfify.setup(site) end -Jekyll::Hooks.register :pages, :post_render do |page| +# :low so this READER captures page.output after html-compress +# (:normal) has run. See html-compress.rb's priority convention. +Jekyll::Hooks.register :pages, :post_render, priority: :low do |page| Pdfify.maybe_capture(page) end diff --git a/docs/lib/paged.browser.js b/docs/lib/paged.browser.js index 8d212d9e..83c9e53a 100644 --- a/docs/lib/paged.browser.js +++ b/docs/lib/paged.browser.js @@ -2486,29 +2486,22 @@ } addResizeObserver(contents) { - let wrapper = this.wrapper; - let prevHeight = wrapper.getBoundingClientRect().height; - this.ro = new ResizeObserver(entries => { - - if (!this.listening) { - return; - } - requestAnimationFrame(() => { - for (let entry of entries) { - const cr = entry.contentRect; - - if (cr.height > prevHeight) { - this.checkOverflowAfterResize(contents); - prevHeight = wrapper.getBoundingClientRect().height; - } else if (cr.height < prevHeight) { // TODO: calc line height && (prevHeight - cr.height) >= 22 - this.checkUnderflowAfterResize(contents); - prevHeight = cr.height; - } - } - }); - }); - - this.ro.observe(wrapper); + // [PATCH: disable-resize-observer] The RO existed to catch + // post-layout content reflow -- late-loading fonts, image + // dimensions resolving after layout, etc. -- by re-running + // findBreakToken whenever the wrapper grew/shrunk after + // renderTo returned. 
Our pipeline navigates with + // `waitUntil: "load"` and uses embedded fonts; nothing + // resizes after layout. The `_onOverflow` rescue path + // (Chunker.addPage line 3296) only fires while + // `!chunker.rendered`, and would emit a console.warn + // before re-rendering, so a regression would be loud. + // Disabling the RO removes a per-page allocation plus the + // stream of async findBreakToken / gBCR calls its callback + // would otherwise drive after every page's renderTo. + // checkUnderflowAfterResize is already gated by an absent + // _onUnderflow (see README "Attempt C"); checkOverflowAfterResize + // was the only live consumer. } checkOverflowAfterResize(contents) { @@ -3008,10 +3001,21 @@ } - async flow(content, renderTo) { + // [PATCH: sync-chain] flow() is now synchronous. All five await + // sites turn into sync calls: + // - beforeParsed / afterParsed / afterRendered hooks: handlers + // on our pipeline are all sync, so _assertSync guards them + // the same way the per-page hot path does. + // - loadFonts: now a sync assert (throws if any face isn't + // loaded; page.goto waitUntil:'load' ensures they are). + // - render: now a plain sync function. + // This was the last load-bearing await in the bundle. With + // flow() sync, the entire per-render call chain executes + // without yielding to a microtask boundary. 
+ flow(content, renderTo) { let parsed; - await this.hooks.beforeParsed.trigger(content, this); + _assertSync(this.hooks.beforeParsed.trigger(content, this), "beforeParsed"); parsed = new ContentParser(content); @@ -3029,20 +3033,20 @@ this.emit("rendering", parsed); - await this.hooks.afterParsed.trigger(parsed, this); + _assertSync(this.hooks.afterParsed.trigger(parsed, this), "afterParsed"); - await this.loadFonts(); + this.loadFonts(); - let rendered = await this.render(parsed, this.breakToken); + let rendered = this.render(parsed, this.breakToken); while (rendered.canceled) { this.start(); - rendered = await this.render(parsed, this.breakToken); + rendered = this.render(parsed, this.breakToken); } this.rendered = true; this.pagesArea.style.setProperty("--pagedjs-page-count", this.total); - await this.hooks.afterRendered.trigger(this.pages, this); + _assertSync(this.hooks.afterRendered.trigger(this.pages, this), "afterRendered"); this.emit("rendered", this.pages); @@ -3079,12 +3083,11 @@ // } // } - // [PATCH: sync-chain] *layout is a sync generator now, so - // renderer.next() returns synchronously -- no per-page await. - // render() itself stays `async` because callers (flow()) await - // it and other once-per-render awaits in flow() (loadFonts, - // beforeParsed / afterParsed / afterRendered) still need it. - async render(parsed, startAt) { + // [PATCH: sync-chain] render() is now plain sync. *layout is a + // sync generator (renderer.next() returns synchronously), and + // flow() no longer awaits this call -- the entire per-render + // chain (preview -> flow -> render) is sync end to end. + render(parsed, startAt) { let renderer = this.layout(parsed, startAt); let result; @@ -3374,7 +3377,12 @@ } */ - async clonePage(originalPage) { + // [PATCH: sync-chain] clonePage is now synchronous. 
Only caller + // is the Footnotes handler (line ~31625) which is itself gated + // out for documents without `[data-note='footnote']` -- dead + // path on our content but kept sync-clean for consistency with + // the rest of the per-page hook surface. + clonePage(originalPage) { let lastPage = this.pages[this.pages.length - 1]; let page = new Page(this.pagesArea, this.pageTemplate, false, this.hooks); @@ -3386,7 +3394,7 @@ page.index(this.total); - await this.hooks.beforePageLayout.trigger(page, undefined, undefined, this); + _assertSync(this.hooks.beforePageLayout.trigger(page, undefined, undefined, this), "beforePageLayout"); this.emit("page", page); for (const className of originalPage.element.classList) { @@ -3395,27 +3403,32 @@ } } - await this.hooks.afterPageLayout.trigger(page.element, page, undefined, this); - await this.hooks.finalizePage.trigger(page.element, page, undefined, this); + _assertSync(this.hooks.afterPageLayout.trigger(page.element, page, undefined, this), "afterPageLayout"); + _assertSync(this.hooks.finalizePage.trigger(page.element, page, undefined, this), "finalizePage"); this.emit("renderedPage", page); } + // [PATCH: sync-chain] loadFonts is now a synchronous assertion. + // Upstream walked document.fonts and kicked off fontFace.load() + // for any not-yet-loaded face, returning a Promise.all. Our + // headless pipeline drives `page.goto(url, { waitUntil: "load" })` + // before paged.js runs, which settles document.fonts.ready -- + // every face is already in state "loaded" by the time we get + // here. The walk is a safety check: if a face is still loading + // (or hit an error), pipeline assumptions are broken and we + // should fail loudly rather than silently re-asyncify. 
loadFonts() { - let fontPromises = []; (document.fonts || []).forEach((fontFace) => { if (fontFace.status !== "loaded") { - let fontLoaded = fontFace.load().then((r) => { - return fontFace.family; - }, (r) => { - console.warn("Failed to preload font-family:", fontFace.family); - return fontFace.family; - }); - fontPromises.push(fontLoaded); + throw new Error( + "paged.js (forked): font-face '" + fontFace.family + + "' is not yet loaded (status=" + fontFace.status + + "). The headless pipeline expects every font to be " + + "loaded before PagedPolyfill.preview() runs; ensure " + + "page.goto uses { waitUntil: 'load' } or 'networkidle0'." + ); } }); - return Promise.all(fontPromises).catch((err) => { - console.warn(err); - }); } destroy() { @@ -26508,16 +26521,22 @@ - // parse - async parse(text) { + // [PATCH: sync-chain] parse() is now synchronous. Upstream awaited + // the three Polisher.hooks.{beforeTreeParse, beforeTreeWalk, + // afterTreeWalk} triggers; with our pipeline registering no async + // handlers for any of them, the awaits were pure microtask + // boundaries. _assertSync throws if anyone ever does register a + // thenable-returning handler -- same safety pattern the chunker's + // per-page hot path uses. 
+ parse(text) { this.text = text; - await this.hooks.beforeTreeParse.trigger(this.text, this); + _assertSync(this.hooks.beforeTreeParse.trigger(this.text, this), "beforeTreeParse"); // send to csstree this.ast = csstree.parse(this._text); - await this.hooks.beforeTreeWalk.trigger(this.ast); + _assertSync(this.hooks.beforeTreeWalk.trigger(this.ast), "beforeTreeWalk"); // Replace urls this.replaceUrls(this.ast); @@ -26532,7 +26551,7 @@ this.rules(this.ast); this.atrules(this.ast); - await this.hooks.afterTreeWalk.trigger(this.ast, this); + _assertSync(this.hooks.afterTreeWalk.trigger(this.ast, this), "afterTreeWalk"); // return ast return this.ast; @@ -27487,28 +27506,30 @@ } `; - async function request(url, options={}) { - return new Promise(function(resolve, reject) { - let request = new XMLHttpRequest(); - - request.open(options.method || "get", url, true); - - for (let i in options.headers) { - request.setRequestHeader(i, options.headers[i]); - } - - request.withCredentials = options.credentials === "include"; - - request.onload = () => { - // Chrome returns a status code of 0 for local files - const status = request.status === 0 && url.startsWith("file://") ? 200 : request.status; - resolve(new Response(request.responseText, {status})); - }; - - request.onerror = reject; - - request.send(options.body || null); - }); + // [PATCH: sync-chain] Synchronous XHR returning body text directly. + // Upstream paged.js used async XHR + Promise + Response wrapper to + // keep the interactive-browser main thread responsive while + // stylesheets loaded. Our headless pipeline doesn't share that + // constraint: every stylesheet is a local file:// URL, fetches are + // sub-ms, and we want the polisher's stylesheet ingestion off the + // microtask queue so the whole render chain stays sync. Both + // callers (Polisher.add / convertViaSheet) only ever consumed + // response.text(), which is itself async per spec -- returning the + // text directly skips that boundary too. 
Throws on HTTP error. + function request(url, options={}) { + let req = new XMLHttpRequest(); + req.open(options.method || "get", url, false); + for (let i in options.headers) { + req.setRequestHeader(i, options.headers[i]); + } + req.withCredentials = options.credentials === "include"; + req.send(options.body || null); + // Chrome returns status 0 for successful local-file loads. + const status = req.status === 0 && url.startsWith("file://") ? 200 : req.status; + if (status < 200 || status >= 300) { + throw new Error("paged.js (forked): request " + url + " failed with status " + status); + } + return req.responseText; } class Polisher { @@ -27545,53 +27566,43 @@ return this.styleSheet; } - async add() { - let fetched = []; - let urls = []; - - for (var i = 0; i < arguments.length; i++) { - let f; - - if (typeof arguments[i] === "object") { - for (let url in arguments[i]) { - let obj = arguments[i]; - f = new Promise(function(resolve, reject) { - urls.push(url); - resolve(obj[url]); - }); + // [PATCH: sync-chain] add() is now synchronous. Upstream collected + // every input as a Promise (Promise.all + then-chain), even when + // inputs were inline {url:text} objects with no fetch needed. + // With request() returning text directly and convertViaSheet now + // sync, we just walk the arguments once and feed each to the + // pipeline. Same return semantics: the converted-and-inserted + // text of the last stylesheet. 
+ add() { + let text = ""; + for (let i = 0; i < arguments.length; i++) { + let arg = arguments[i]; + if (typeof arg === "object") { + for (let url in arg) { + text = this.convertViaSheet(arg[url], url); + this.insert(text); } } else { - urls.push(arguments[i]); - f = request(arguments[i]).then((response) => { - return response.text(); - }); + let url = arg; + let cssStr = request(url); + text = this.convertViaSheet(cssStr, url); + this.insert(text); } - - - fetched.push(f); } - - return await Promise.all(fetched) - .then(async (originals) => { - let text = ""; - for (let index = 0; index < originals.length; index++) { - text = await this.convertViaSheet(originals[index], urls[index]); - this.insert(text); - } - return text; - }); + return text; } - async convertViaSheet(cssStr, href) { + // [PATCH: sync-chain] convertViaSheet is now synchronous. + // sheet.parse is sync; request() now returns body text directly + // (sync XHR + responseText, no Response wrapper). + convertViaSheet(cssStr, href) { let sheet = new Sheet(href, this.hooks); - await sheet.parse(cssStr); + sheet.parse(cssStr); // Insert the imported sheets first for (let url of sheet.imported) { - let str = await request(url).then((response) => { - return response.text(); - }); - let text = await this.convertViaSheet(str, url); + let str = request(url); + let text = this.convertViaSheet(str, url); this.insert(text); } @@ -32452,12 +32463,31 @@ TargetText ]; + // [PATCH: whitespace-filter-opt-in] Default off because our Jekyll + // pipeline runs html-compress on `book.html` (see _plugins/html- + // compress.rb's three-tier hook ordering: book-combined is in the + // compress-eligible set), so inter-element whitespace is already + // collapsed by the time paged.js sees the document. The filter + // would visit every text node in the parsed DOM (~181 k callbacks + // on the 1651-page book) and -- post-compression -- find essentially + // nothing to mutate. 
A paired cpu-profile A/B (3+3 runs, see + // perf/README.md) showed the no-op walk still costs ~600 ms of CPU + // per render: ~125 ms direct (filterTree / filterEmpty self) plus + // ~480 ms indirect (gBCR + downstream Blink layout / style work that + // runs cheaper when V8's IC + Blink scheduler aren't being churned + // by 181 k C++->JS callback dispatches). The cost is small per call + // but compounds because the walk lives inside the same microtask + // continuation as the per-page render loop. Set + // `window.PagedConfig.runWhitespaceFilter = true` before + // PagedPolyfill.preview() if processing a document whose source + // HTML wasn't compressed at build time. class WhiteSpaceFilter extends Handler { constructor(chunker, polisher, caller) { super(chunker, polisher, caller); } filter(content) { + if (!(typeof window !== "undefined" && window.PagedConfig && window.PagedConfig.runWhitespaceFilter)) return; filterTree(content, (node) => { return this.filterEmpty(node); @@ -33062,9 +33092,18 @@ }); } - async preview(content, stylesheets, renderTo) { + // [PATCH: sync-chain] preview() is now synchronous end-to-end. + // beforePreview / afterPreview hooks are once-per-render so the + // _assertSync guard is the same shape as the chunker's per-page + // hot path uses. polisher.add and chunker.flow are sync above. + // External callers (perf/measure.mjs, docs/render-book.mjs) now + // call this without `await` -- the page.evaluate IIFE wrapping + // the call is also sync, so the entire script execution runs + // inside one EvaluateScript frame instead of being scheduled + // across multiple microtask continuations. 
+ preview(content, stylesheets, renderTo) { - await this.hooks.beforePreview.trigger(content, renderTo); + _assertSync(this.hooks.beforePreview.trigger(content, renderTo), "beforePreview"); if (!content) { content = this.wrapContent(); @@ -33078,12 +33117,12 @@ this.handlers = this.initializeHandlers(); - await this.polisher.add(...stylesheets); + this.polisher.add(...stylesheets); let startTime = performance.now(); // Render flow - let flow = await this.chunker.flow(content, renderTo); + let flow = this.chunker.flow(content, renderTo); let endTime = performance.now(); @@ -33092,7 +33131,7 @@ this.emit("rendered", flow); - await this.hooks.afterPreview.trigger(flow.pages); + _assertSync(this.hooks.afterPreview.trigger(flow.pages), "afterPreview"); return flow; } diff --git a/docs/render-book.mjs b/docs/render-book.mjs index 7ad9509a..e7ad9bfc 100644 --- a/docs/render-book.mjs +++ b/docs/render-book.mjs @@ -99,10 +99,20 @@ const browser = await puppeteer.launch({ // tagged-pdf and outline launch flags are added by puppeteer 22+ // automatically in ChromeLauncher.defaultArgs(), so we don't repeat // them here. + // + // --disable-gpu + --disable-software-rasterizer: shrinks the GPU + // process from ~100 MB to ~16 MB (Chromium keeps a stub even with + // these flags -- only --in-process-gpu kills it entirely, but that + // serialises GPU work onto the main thread and costs ~15 s on the + // render+generate wall clock). With just the disable pair the + // renderer is also ~120 MB lighter and generate runs ~5 s faster + // (Skia skips a GPU init path). PDF output is byte-identical. args: [ '--no-sandbox', '--disable-dev-shm-usage', '--allow-file-access-from-files', + '--disable-gpu', + '--disable-software-rasterizer', ], }); @@ -158,13 +168,19 @@ try { } // Render -- paged.js per-page layout. 
+ // PagedPolyfill.preview() is fully synchronous in our forked bundle + // (the entire chain preview -> chunker.flow -> render -> *layout is + // now sync; loadFonts is a sync assertion that page.goto's + // waitUntil:'load' already satisfied; stylesheets are loaded via + // synchronous XHR). Inner IIFE is a plain sync arrow; outer await + // is just the CDP round-trip puppeteer needs to ferry the result. const tRender = Date.now(); - await page.evaluate(async () => { + await page.evaluate(() => { if (!window.PagedPolyfill) { throw new Error('paged.js bundle did not expose window.PagedPolyfill'); } try { - await window.PagedPolyfill.preview(); + window.PagedPolyfill.preview(); } catch (err) { // Unwrap the undecorated ProgressEvent paged.js throws on fetch // failures so the message includes the offending URL. diff --git a/perf/.gitignore b/perf/.gitignore index fbca2253..df01c96d 100644 --- a/perf/.gitignore +++ b/perf/.gitignore @@ -1 +1,3 @@ results/ +ab-css/ +ab-css-*/ diff --git a/perf/CHROMIUM.md b/perf/CHROMIUM.md new file mode 100644 index 00000000..a7e4a4f1 --- /dev/null +++ b/perf/CHROMIUM.md @@ -0,0 +1,431 @@ +# Chromium-internal approaches to parallel PDF generation + +A separate document because none of this is shipped or even partially +implemented. It records the research we did into Chromium-internal +approaches to faster / parallel PDF emission, with honest cost +estimates and the reasons each was rejected. Kept as a reference for +two scenarios: + +1. The book grows large enough that the 70 s build becomes a CI + bottleneck again (3000+ pages, or CI runtime tightens). +2. Someone independently rediscovers the same ideas and wants to know + why we didn't pursue them. + +For the perf work that *did* land (the `--disable-gpu` flag pair, the +memory probes, the GC-pass investigation), see [README.md](README.md). 
+ +## What the public APIs don't expose + +The shortest version: **Skia's drawing stream and HarfBuzz's shape +results never leave the renderer process via any documented API**. +That's the wall behind every approach below. + +What's documented and works from JS / CDP: + +- `Range.getClientRects()`, `Element.getBoundingClientRect()` -- per + line-fragment bounding boxes. Box-level, not glyph-level. +- CDP `DOMSnapshot.captureSnapshot` -- the full layout tree as JSON + with each text node's `textBoxes[]` (bounds + text-fragment offsets). + Run-level granularity. +- `CanvasRenderingContext2D.measureText()` -- `TextMetrics` for text + *about to be drawn*, not text already laid out. +- `document.fonts` (`FontFaceSet`) -- load state, not glyph positions. + +What is *not* exposed anywhere: + +- The HarfBuzz shaping result -- the character-to-glyph mapping with + ligatures, contextual substitutions, kerning all applied. Lives in + Blink's `blink::ShapeResult` / `ShapeResultView` (~50 MB in the + renderer for our book, visible in the memory-infra dump). +- Per-glyph x-positions (`SkTextBlob`). +- Font binaries / subsets (security/copyright concerns). +- The accessibility structure tree that becomes the tagged-PDF + structure tree. + +What is internally serialized but invisible from outside: + +- `cc::PaintRecord` / `SkPicture` -- the renderer's full draw stream, + containing every `SkTextBlob` with its glyph IDs and positions. + Serialized for Mojo transfer renderer → PrintCompositor (see below); + could be intercepted with dynamic instrumentation. +- The tagged-PDF structure tree -- traveled separately through Mojo + to PrintCompositor; same intercept-by-hook story. + +## How the print path actually works + +Inside one PDF render (`Page.printToPDF`): + +1. 
**Renderer (where paged.js lives)** -- Blink lays out the document + via LayoutNG; the paint pass produces a `cc::PaintRecord` + containing every draw op as `SkPaint` + `SkTextBlob` + `SkPath` + + `SkImage` plus the accessibility structure tree. +2. **Mojo IPC** -- the `PaintRecord` is serialized (Skia's documented + `SkPicture` byte format, ~50 MB on our book) and sent over a Mojo + channel to the PrintCompositor utility process. The structure tree + travels via a separate Mojo message. +3. **PrintCompositor utility process** (`chrome.exe --type=utility + --utility-sub-type=printing.mojom.PrintCompositor`) -- deserializes + the picture into Skia, calls `SkPDFDocument` to emit PDF bytes, + merges the structure tree on top, returns the PDF bytes via Mojo. +4. **Browser process** -- receives the PDF, forwards over the + DevTools/CDP channel to puppeteer over a WebSocket. +5. **Node (us)** -- receives the bytes from puppeteer. + +Cost shape on the 1651-page book, with the shipped `--disable-gpu` +flag pair: + +| stage | typical wall clock | peak memory | +| ----- | ------------------ | ----------- | +| render (Blink layout + paged.js) | ~10 s | renderer ~1.3 GB | +| Mojo transfer renderer → PrintCompositor | <100 ms | (briefly +50 MB browser IPC buffer) | +| PrintCompositor → PDF | ~35 s | utility process ~300-500 MB | +| PDF transfer back | <500 ms | browser process spikes (PDF is in flight) | +| pdf-lib outline + metadata | ~5 s | Node ~100 MB | + +The 35 s `SkPDF` step is single-threaded Skia walking the layout tree +and emitting PDF objects per the SkPDF design (see "Memory: where the +renderer's 1.9 GB goes" in README.md for the per-allocator breakdown +of that growth). + +## Chromium's binary boundary + +`chrome.dll` is a single ~283 MB blob containing essentially all of +Chromium: Blink, V8, Skia, Mojo, services, PrintCompositor, +everything. The launcher `chrome.exe` is a 4 MB shim that loads +`chrome.dll` and calls `ChromeMain`. 
+ +A PE export-table dump (see `perf/probe-idle-browser.mjs` for the +measurement that surfaced this) shows **chrome.dll exports exactly +six functions**: + +``` +ChromeMain # main entry point +CrashForExceptionInNonABICompliantCodeRange # crash helper +GetHandleVerifier # sandbox handle check +IsSandboxedProcess # sandbox query +RelaunchChromeBrowserWithNewCommandLineIfNeeded # relauncher +sqlite3_dbdata_init # accidental third-party leak +``` + +Out of probably millions of internal C++ functions, six are reachable +from outside via `LoadLibrary` + `GetProcAddress`. PrintCompositor, +Mojo, Skia, Blink, V8 -- none are exported. The binary is opaque by +design; Chromium isn't built as a library for third-party embedders. + +**CEF** (Chromium Embedded Framework, which the docs ship a reference +for in `docs/Reference/CEF/`) exists exactly because of this gap. +CEF is a deliberately-stable C/C++ API wrapper on top of Chromium +internals, with a single stable ABI per major version. The CEF +maintainers do the work of (a) building Chromium with the right +configs, (b) exposing necessary internals through a stable wrapper, +and (c) keeping the wrapper compatible across Chromium upgrades. + +## Idle process tree baseline + +Measured by [probe-idle-browser.mjs](probe-idle-browser.mjs) -- a +fresh puppeteer.launch + about:blank only, no work: + +| process | private | +| ------- | ------- | +| browser (the parent) | 40-46 MB | +| renderer (initial about:blank target) | 20-23 MB | +| gpu-process (stub, post `--disable-gpu`) | 15-16 MB | +| utility:network.mojom.NetworkService | 17 MB | +| utility:storage.mojom.StorageService | 11 MB | +| crashpad-handler x 2 | 2 MB each | +| **total tree** | **~125-180 MB** | + +The "browser process at 1,113 MB" figure in earlier memory probes was +specific to the PDF-transit phase -- the browser process buffers the +41 MB PDF + the tagged structure tree as they flow from PrintCompositor +to the browser to puppeteer's CDP channel. 
It is not the steady-state +cost. + +## Approach A: patch and upstream a Chromium flag + +The highest-leverage candidate: a CDP/flag-level change that either +skips PrintCompositor for single-renderer documents or adds streaming +output. Concrete entry points for research: + +- Skia source: + -- commit log against the Skia revision pinned in our Chromium + build. +- Skia Gerrit reviews-in-flight: + filtered by `src/pdf/`. +- Chromium printing tree: `chromium/src/printing/`, + `components/printing/`, `chrome/browser/printing/`. +- crbug.com: searches like `component:Internals>Printing performance` + or `component:Internals>Skia>PDF`. +- Dev mailing lists: `chromium-dev@chromium.org`, + `skia-discuss@googlegroups.com` (Google Groups archives). + +Plausibly upstreamable patches: + +1. `Page.printToPDF({ singleRenderer: true })` -- skip PrintCompositor + when the document doesn't span multiple frames. Saves ~450 MB + peak + ~5-10 s in our pipeline. +2. CDP method that emits the renderer's `SkPicture` directly. Unlocks + external pipelines. +3. Streaming `Page.printToPDF` output. Lets us overlap `process` + (pdf-lib outline / metadata) with `generate`. + +**Rejected because** our early estimates overstated the gains. The +generate phase is ~35 s with the shipped flag pair, peak memory is +~2.4 GB. Saving ~450 MB of PrintCompositor or shaving 5-10 s of +generate isn't worth the upstreaming overhead (RFC, review cycles, +Chromium release cadence, plus carrying a patch until the upstream +lands). + +## Approach B: port SkPDF to JS + +Skia's PDF backend (`src/pdf/` in Skia, ~30 k LOC of C++) consumes an +`SkCanvas` draw stream and emits PDF bytes. Porting it to JS is a +real project but the work it does isn't where the time goes -- Skia +is well-optimized. **The hard problem is not Skia. It's getting +Blink's draw stream out to feed into the port.** + +CanvasKit (`canvaskit-wasm` on npm) is Skia compiled to WASM and +includes `SkDocument::MakePDFDocument`.
In principle: load an +`SkPicture` into CanvasKit, replay it into the PDF document's canvas, +serialize. The same input problem still applies -- the `SkPicture` +isn't accessible from JS land without a Chromium-side intervention. +CanvasKit's PDF surface is also materially less battle-tested than +native SkPDF and lacks the tagged-PDF API. + +**Rejected because** the port alone doesn't unblock anything and the +real bottleneck (data extraction) is identical to approaches C-E. + +## Approach C: Frida + Mojo emulation in Node + +Architecture: + +1. Frida-hook the renderer process, intercept `SkPicture::serialize` + to capture the serialized picture bytes during `Page.printToPDF`. +2. Slice the picture by page bounds using `SkBBoxHierarchy` / + `SkPicture::playback` with a clipping canvas. +3. Spawn N PrintCompositor utility processes from Node, talking to + each over Mojo to send a sub-picture and receive a PDF slice. +4. Concatenate slices with raw-byte xref rewriting. + +The blocker is step 3. Mojo has three sub-layers: + +- **Transport** -- Win32 named pipes. One end inherited by the child + via the `PROC_THREAD_ATTRIBUTE_HANDLE_LIST` Win32 attribute, + command-line arg `--mojo-platform-channel-handle=`. +- **Wire protocol** -- framed messages with version headers, + attachment references, multiplexed message pipes. +- **Bindings** -- `.mojom` interface files (e.g., + `components/services/print_compositor/public/mojom/print_compositor.mojom`) + compiled to marshaling stubs. + +The handshake the browser-process normally does to bring up a +PrintCompositor utility: + +1. Spawn the child with the `--type` / `--utility-sub-type` args plus + the inheritable pipe handle. +2. Send the Mojo "invitation" message containing the primordial + message pipe handle. +3. Once the child has resolved the invitation, send a binding request + for the named `printing.mojom.PrintCompositor` attachment. +4. 
Call methods on the resulting remote (e.g., + `PrepareForDocumentToPdf`, `CompositePage`, `FinishDocumentToPdf`), + each method being a structured Mojo message with mojom-encoded + payload and shared-memory regions for the large blobs. + +Implementing all of this in Node, against unstable Chromium internal +interfaces, is the cost: + +| component | effort | +| --------- | ------ | +| Win32 process spawn with inherited handles (Win32 FFI) | 1 week | +| Named pipe + cross-process handle transfer | 1 week | +| Mojo channel framing (read/write headers, multiplex) | 2-3 weeks | +| Mojo invitation protocol | 1-2 weeks | +| `.mojom` parser + JS codegen, or hand-written stubs | 2-3 weeks | +| Shared-memory region encoding | 1 week | +| PrintCompositor-specific marshaling | 1-2 weeks | +| Tagged-PDF tree capture + slicing | 2-3 weeks | +| SkPicture slicing by page bounds | 1-2 weeks | +| Integration + Chromium-version drift debugging | 3-4 weeks | +| **total** | **15-22 weeks** | + +Plus ongoing maintenance every Chromium upgrade -- internal +interfaces have no stability guarantees because they're build-time +contracts between Chromium components. + +**Rejected because** the engineering cost dwarfs the wall-clock +savings, and the maintenance is permanent. + +## Approach D: Frida + CanvasKit-WASM in workers + +Avoids Mojo by using Skia directly. Architecture: + +1. Frida-hook to capture the SkPicture bytes (same as C). +2. Slice the picture by page bounds (same as C). +3. Spawn N Node `worker_threads`, each loads CanvasKit-WASM, + deserializes its sub-picture, calls `SkDocument::MakePDFDocument`, + emits a sub-PDF. +4. Concatenate. + +Cost is smaller than approach C because no Mojo plumbing, but two +issues: + +- **CanvasKit's PDF surface diverges from native SkPDF.** Font + subsetting, image encoding, color-space handling have known gaps + and quirks. Plan on 1-2 weeks of debugging diverging output before + matching native SkPDF closely enough for production. 
+- **Tagged PDF is missing.** CanvasKit's `SkDocument` doesn't expose + Skia's tagging API; the structure tree would have to be applied + separately, derived from the DOM in our own code. Probably 2-4 + weeks to rebuild. + +Total: **6-10 weeks**, with output-fidelity risk. + +**Rejected because** of the tagged-PDF gap (accessibility is +non-negotiable) and the divergence risk against the production +Chromium SkPDF baseline. + +## Approach E: helper binary linking Chromium components + +Architecture: build a small DLL/EXE that statically links against +`//mojo/core/embedder`, `//components/services/print_compositor`, +and `//cc/paint`. The helper exports C-style functions Node calls +via FFI: + +- `helper_init()` -- start a Mojo node, set up the embedder. +- `helper_emit_pdf(skp_bytes, ax_tree_bytes, page_range, out_pdf*)` -- + spawn or reuse a PrintCompositor, send the inputs, return the PDF. + +GN file is short: + +```gn +shared_library("printcomp_helper") { + sources = [ "helper.cc" ] + deps = [ + "//mojo/core/embedder", + "//components/services/print_compositor", + "//cc/paint", + "//base", + ] +} +``` + +The helper does all the Mojo plumbing using Chromium's own Mojo +library, so we avoid reimplementing Mojo in Node. Node handles +SkPicture slicing (a pure data problem) and PDF concatenation. + +### Checkout and build cost (corrected) + +The "Chromium build is 50 GB and 6 hours" rule of thumb refers to +the full-history `fetch chromium`. 
For a single-purpose helper, +with `gclient sync --no-history --shallow` and targeted GN builds: + +| step | estimate | +| ---- | -------- | +| depot_tools install + Visual Studio Build Tools + Win SDK 10 (if not already set up) | half day, one-time | +| Shallow `gclient sync` for selected DEPS | 30-90 min | +| Disk footprint after shallow sync | ~20-30 GB (not 50) | +| First `ninja printcomp_helper` with `is_debug=false symbol_level=1` | 30-90 min (~1500-2500 TUs vs ~50,000 for full Chromium) | +| Incremental rebuild (touched `helper.cc`) | 5-15 min | +| Output DLL size | ~80-150 MB (statically-linked Skia, base, mojo, abseil, icu) | +| Per-Chromium-upgrade re-sync + rebuild | 1 hour if interfaces stable, up to a day if a signature changed | + +So the **initial commitment is more like a Saturday afternoon than a +quarter** -- the 15-22 weeks figure from approach C drops to **4-6 +weeks for the full pipeline** (helper + Frida extraction + SkPicture +slicing + AX tree slicing + Node orchestration + PDF concat). + +### A potentially smaller variant: Skia-only helper + +If tagged PDF were acceptable to drop, the helper could skip +`//components/services/print_compositor` and link only against +`//third_party/skia`. The build shrinks to ~800-1200 TUs, ~20-40 min +first build, helper DLL ~30-50 MB. The PDF emit path becomes a direct +`SkDocument::MakePDF` call. + +**Rejected because** tagged PDF is non-negotiable. Documented here +because it's the simplest viable Chromium-internal architecture if +the accessibility requirement ever changes. + +### Why approach E was still rejected + +The 4-6 week full-project estimate is a fair cost for the gains: + +- Render once, extract SkPicture (~10 s). +- Kill the original Chromium (frees ~1.4 GB renderer). +- Run N PrintCompositor helpers in parallel (~11 s wall clock for N=4 + at ~45/4 s each). +- Concat (~3 s). +- **End-to-end: ~26 s vs current ~70 s, peak ~2 GB.** + +Actual ~44 s wall-clock save with comparable peak memory.
Worth doing +if the engineering budget exists. + +What pushes it off the table for now: + +1. **Maintenance against Chromium version churn.** Mojo's + `printing.mojom.PrintCompositor` interface signature changes + between Chromium milestones. We'd be re-syncing + rebuilding + + retesting on every Puppeteer Chromium bump (every few months). +2. **CI build pipeline complexity.** Helper.dll has to be pre-built + and shipped as a release artifact -- can't be built fresh in + GitHub Actions every PR because the sync + build is ~45-90 min on + a CI-class machine. +3. **The savings aren't urgent.** A 70 s build is fine on CI. A + ~26 s build would be nicer, but the 44 s difference doesn't change + any developer workflow we have. + +If item 3 changes (book grows past ~3000 pages, or CI gains a hard +runtime cap), approach E becomes the right answer. + +## Cost summary + +| approach | engineering | tagged PDF | output fidelity | binary | maintenance | +| -------- | ----------- | ---------- | --------------- | ------ | ----------- | +| A (upstream patch) | weeks-months of RFC + review | works | identical | none (official) | none after merge | +| B (port SkPDF alone) | doesn't unblock | n/a | n/a | n/a | n/a | +| C (Frida + Mojo in Node) | 15-22 weeks | works | identical | small | high (Mojo internals) | +| D (Frida + CanvasKit workers) | 6-10 weeks | requires rebuild | divergence risk | medium | medium | +| E (helper binary) | 4-6 weeks | works | identical | 80-150 MB | per Chromium upgrade | +| E-slim (Skia-only helper) | 3-4 weeks | broken | divergence on tags | 30-50 MB | per Chromium upgrade | + +## What would change the calculus + +- **Book grows past ~3000 pages.** Generate time scales roughly + linearly in Skia; at 3000 pages the single-process pipeline is + ~70-90 s generate alone, ~100-120 s total. Approach E pays off. 
+- **CI runner downsized.** If peak memory has to stay under ~1.5 GB, + any current single-Chromium path is in trouble; approach E with + the renderer killed mid-pipeline is the only fit. +- **Chromium ships streaming `Page.printToPDF`.** A long-standing + feature request that would let us overlap `generate` and + `process`. If it lands upstream, our pipeline benefits without any + patch work and approach E loses its remaining edge. +- **CEF adds tagged-PDF support.** Currently a gap; if filled, the + helper-binary architecture could route through CEF's stable API + instead of raw Chromium internals, collapsing the maintenance cost. + +## Tooling notes for future investigators + +If you do come back to this: + +- [perf/probe-idle-browser.mjs](probe-idle-browser.mjs) gives the + idle baseline (~125-180 MB tree) and was the data behind the + corrected memory math here. +- [perf/probe-memory.mjs](probe-memory.mjs) + sample-mem.ps1 gives + the working pipeline's per-process tree at peak. +- [perf/probe-renderer-mem.mjs](probe-renderer-mem.mjs) + + analyze-mem-trace.mjs gives the per-allocator breakdown inside the + renderer via memory-infra dumps. +- [perf/diff-blink-classes.mjs](diff-blink-classes.mjs) compares + Blink object class counts between two memory-infra dumps -- useful + for verifying that a code change is or isn't affecting layout-state + count. +- [perf/analyze-heap-snapshot.mjs](analyze-heap-snapshot.mjs) parses + V8 heap snapshots from the `--heap-snapshot` extension to + probe-renderer-mem.mjs. + +For exploring Chromium internals: +(searches and cross-refs the source). The `printing/` and +`components/services/print_compositor/` directories are the entry +points to the print pipeline. diff --git a/perf/README.md b/perf/README.md index 135d2de2..81e67b6f 100644 --- a/perf/README.md +++ b/perf/README.md @@ -5,28 +5,41 @@ paged.js + headless Chromium + pdf-lib (see `docs/book.bat`, which invokes `docs/render-book.mjs`). 
The pipeline was historically driven by `pagedjs-cli`; we replaced that with our own thin driver after the investigations in this folder, so we control pdf-lib's parseSpeed -without patching upstream (see "Profiling pdf-lib's load" below). -As the book has grown we noticed **quadratic** wall-clock behaviour: -time-per-page goes up as later pages are laid out, so doubling the -page count roughly quadruples the total render time. - -This folder holds the tools used to investigate that. +without patching upstream (see *Profiling pdf-lib's load* in +[notes/01-baseline-and-detach.md](notes/01-baseline-and-detach.md)). +As the book grew we found **quadratic** wall-clock behaviour -- +time-per-page grew with page count -- and chased it through ~22 +sub-investigations, recorded in [`notes/`](notes/). + +This folder holds the tools used to investigate that. The README is +the operational reference: what each tool does, how to run it, and +what shape the output takes. The narrative -- baselines, each landed +optimisation, what was tried and failed -- lives split across the +seven phase files in [`notes/`](notes/). The current state is summarised +at the bottom of this file. ## Profiling `paged.browser.js`: canonical command The command we reach for whenever CPU-profiling paged.js: ``` -node measure.mjs --detach-pages --no-timing --render-only --cpu-profile --cpu-sampling 100 +node measure.mjs --render-only --cpu-profile --cpu-sampling 100 ``` -(`run.bat` forwards the same args.) Flag rationale: +(`run.bat` forwards the same args.) Two defaults match what most +profiling work needs: + +- **detach-pages is on.** It's the shipping fix; matching production + is the right baseline for any profiling work. Pass + `--no-detach-pages` for an A/B against the original O(n²) quadratic. +- **timing is off.** The `timing-handler.js` per-page `console.log` + relay costs ~2 % of render self-time on the 1638-page book and + muddies bottom-up profile tables. 
Pass `--timing` when you want the + per-page CSV + first/last-quartile summary; otherwise `timing.csv` + is empty and `summary.txt` says so. + +Flag rationale: -- `--detach-pages` -- inject the shipping fix. The profile reflects - what production actually pays, not the old O(n^2) baseline. -- `--no-timing` -- skip the per-page `console.log` relay from - `timing-handler.js`. The relay costs ~2 % of render self-time on - the 1638-page book and muddies the bottom-up view. - `--render-only` -- bail out after `PagedPolyfill.preview()` returns. Skips meta extraction, `parseOutline`, `page.pdf`, and the pdf-lib roundtrip / incremental writer. ~47 s saved per run @@ -40,78 +53,77 @@ node measure.mjs --detach-pages --no-timing --render-only --cpu-profile --cpu-sa / `grep-profile.mjs`. - `--cpu-sampling 100` -- 100 us sampling, 10x denser than the 1 ms default. Resolves frames in paged.js's sub-millisecond inner loops - where most remaining cost lives (see "Looking past `finalizePage`" - and later sections). Larger profile file in return. + where most remaining cost lives (see *Looking past `finalizePage`* + in [notes/02-finalizepage.md](notes/02-finalizepage.md) and later + phase files). Larger profile file in return. Drop `--render-only` whenever you need to also measure generate / process (e.g. confirming a fix doesn't shift cost into `page.pdf()` or pdf-lib), or to write `book.pdf` for behavioural verification. -The rest of this README is the long-form narrative -- baseline -findings, each landed optimisation, and the residual hotspots. - -## The plan - -The render pipeline has three phases, matching what `pagedjs-cli` -historically showed as its three spinners: - -1. **Rendering** -- `PagedPolyfill.preview()` does all the per-page - layout work inside headless Chromium. -2. **Generating** -- `page.pdf()` asks Chromium to serialize the - laid-out DOM into PDF bytes, after a small `parseOutline` DOM - walk. -3. 
**Processing** -- `pdf-lib` loads Chromium's PDF, attaches the
-   outline and metadata, and re-serialises.
-
-All three can grow super-linearly. So the harness times all three
-separately and produces a phase breakdown.
-
-Two-step investigation, cheapest first:
-
-1. **Per-page timing + phase breakdown** -- the cheap pass. Hook
-   paged.js's `beforePageLayout` / `afterPageLayout` for the
-   per-page render curve, and wall-clock the generate and process
-   phases from Node. If render's per-page cost grows with page index
-   that's an `O(n^2)` render; if generate or process dominate, the
-   bottleneck is downstream of paged.js.
-
-2. **CPU profile of headless Chromium** -- the deep pass, only if
-   step 1 doesn't already point at a culprit. Attach the Chrome
-   DevTools Performance panel (or save a CPU profile via the CDP
-   `Profiler` domain) and look for the hot function. Typical paged.js
-   suspects in render: `Chunker`, `Layout`, cross-reference
-   resolution, or a handler that walks the entire document on every
-   page. Generate / process bottlenecks usually point at Chromium's
-   PDF writer or `pdf-lib`'s outline / save path.
-
-Step 1 is what's wired up here. Step 2 will reuse the same harness --
-adding `page.tracing.start()` / `page.tracing.stop()` for a
-DevTools-compatible trace is a few lines.
-
 ## What's in this folder
 
+The harness and core probes:
+
 | File | Role |
 | --- | --- |
-| `package.json` | Pins `puppeteer` + `pdf-lib` + `html-entities` (the same direct deps `docs/` uses). |
-| `measure.mjs` | Puppeteer harness. Drives the same flow as `docs/render-book.mjs` (loads the vendored paged.js bundle, runs `PagedPolyfill.preview()`, calls `page.pdf()`, then either the pdf-lib roundtrip or the incremental writer), with optional CPU profiling, in-page handler injection, and DOM-accessor instrumentation. |
-| `timing-handler.js` | `Paged.Handler` that records per-page wall time + heap into `window.__pagedTiming` and streams a line per page to the console. Always injected. |
-| `detach-pages.js` | `Paged.Handler` that hides each completed page from the layout tree (registered against `finalizePage`). The fix. Injected by `--detach-pages` and by `docs/book.bat`. |
+| `measure.mjs` | Puppeteer harness. Drives the same flow as `docs/render-book.mjs` (loads the vendored paged.js bundle, runs `PagedPolyfill.preview()`, calls `page.pdf()`, then either the pdf-lib roundtrip or the incremental writer), with optional CPU profiling, in-page handler injection, and DOM-accessor instrumentation. Auto-pins to a fixed core mask on Windows via `pin-cpu.mjs` (see below) for stable measurements; pass `--no-affinity` to opt out. |
+| `pin-cpu.mjs` | Shared shim used by `measure.mjs`, `profile-load.mjs`, `profile-roundtrip.mjs`, and `ab-css.mjs`. On Windows, auto-relaunches the parent Node process under `start /affinity 0x5500 /high` (cores 4-7 physical, thread 0 each, on an 8C16T AMD Ryzen 7) so puppeteer's Chromium children inherit the mask + priority at spawn time. Reduces single-run CPU sample-time variance from ~15-25 % on a stock dev box to ~3 %. No-op on non-Windows; opt out per-invocation with `--no-affinity` or `PERF_PINNED=1`; override mask with `PERF_AFFINITY=`. |
+| `timing-handler.js` | `Paged.Handler` that records per-page wall time + heap into `window.__pagedTiming` and streams a line per page to the console. Injected when `--timing` is passed; off by default because the per-page console relay costs ~2 % of render self-time. |
+| `detach-pages.js` | `Paged.Handler` that hides each completed page from the layout tree (registered against `finalizePage`). The shipping fix. Injected by default (both by `measure.mjs` and by `docs/book.bat`); pass `--no-detach-pages` to measure the pre-fix baseline. |
 | `instrument-flush-ops.js` | Wraps `getComputedStyle`, `getBoundingClientRect`, and the `offsetWidth` / `clientWidth` / `scrollWidth` family with counters + per-call timing. Injected by `--instrument`. |
+| `instrument-detach.js` | Counters around `detach-pages.js`'s removeChild / restore cycle. |
 | `time-hooks.js` | Wraps every task registered to `chunker.hooks.*` and `polisher.hooks.*` with a wall-clock timer. Tells you which handler's hook method is eating render time, per page. Injected by `--time-hooks`. |
 | `instrument-clones.js` | Wraps `Layout.prototype.append` to tag every source-walker clone, then walks each finalized page at `finalizePage` counting tagged survivors. Reports total appendCalls vs. survivors and the per-page overshoot distribution -- the share of clones rolled back by `removeOverflow`. Requires a one-line `window.PagedLayout = Layout` patch near the bottom of `docs/lib/paged.browser.js` (it's a private class otherwise). Injected by `--clone-count`. |
 | `incremental-pdf.mjs` | Replaces the pdf-lib load+save roundtrip with a PDF 1.7 §7.5.6 incremental update appended to Chrome's bytes. Used by `--incremental`. |
 | `test-incremental.mjs` | Smoke test for `incremental-pdf.mjs`: renders a tiny probe page, runs the writer, verifies the result parses (via pdf-lib re-load) and that outline + metadata land correctly. |
-| `profile-load.mjs` | Standalone profiler for `PDFDocument.load`. Runs the load on a chosen PDF with a chosen `parseSpeed`; intended to be run under `node --cpu-prof`. |
-| `profile-roundtrip.mjs` | Times the full pdf-lib `load + save` roundtrip across the three `parseSpeed` / `objectsPerTick` settings on a chosen PDF. |
-| `probe-chrome-outline.mjs` | Renders a synthetic multi-level h1..h6 document via Chrome's `outline: true` and dumps the resulting `/Outlines` tree. Quick check that the CDP flag is wired correctly in the local Chromium / puppeteer combo. |
-| `compare-outlines.mjs` | Diffs two PDFs' `/Outlines` trees by `(depth, title, target page)`. Used to verify whether Chrome's native outline matches the injected one. |
-| `probe-outline-exclusions.mjs` | Tests which per-element attributes / styles (aria-hidden, role=presentation, hidden, display:none, CSS bookmark-level, ...) make Chrome drop a heading from its outline. |
+| `run.bat` | Windows wrapper. On first run, runs `npm install` against the repo-root `package.json` (which pins `puppeteer` / `pdf-lib` / `html-entities` -- the same direct deps `docs/` uses; consolidated to repo root in commit `3da85e8`, May 2026, so `node_modules` is shared). Then invokes `node measure.mjs`. |
+| `results/` | Output, one timestamped subfolder per run. Git-ignored. |
+
+Profile / trace analysis (point at files produced by `--cpu-profile`
+or `--tracing`):
+
+| File | Role |
+| --- | --- |
 | `analyze-profile.mjs` | Bottom-up self-time analyzer for `.cpuprofile` files. Same shape as DevTools' Performance bottom-up view, in the terminal. |
+| `analyze-trace.mjs` | Bottom-up self-time analyzer for Chrome traces (`trace.json` from `--tracing`). Computes per-event self-time on the renderer's main thread (`CrRendererMain` by default) by walking nested `X`-phase events. Cracks the cpu profile's `(program)` bucket open into named Blink / V8 events (`Layout`, `RecalcStyle`, `RunMicrotasks`, `V8.GC_*`, ...). Operates on the Blink trace events only -- ignores any embedded V8 cpu samples (`Profile` / `ProfileChunk`). |
+| `analyze-hybrid.mjs` | Bottom-up analyzer that *combines* the V8 cpu samples and the Blink trace events from a hybrid `trace.json`. Builds a `[JS root..leaf] ++ [Blink outer..inner]` stack at each sample (filtering V8's virtual frames and JS-entry wrapper events) and prints either top-N self-time mixing JS function names with Blink/V8 event names, or `--callees