feat: generate llms-full.txt with complete docs content #476

Open: sandroqdb wants to merge 3 commits into main from feat/llms-full-txt

@sandroqdb sandroqdb commented Jul 3, 2026

What

Add scripts/generate-llms-full.js, which generates static/llms-full.txt — the complete text of all documentation as a single file, served at https://questdb.com/docs/llms-full.txt.

Why

The website's llms.txt (questdb.io repo) has long advertised a full docs corpus at /docs/llms-full.txt, but no such file was ever generated — the URL 404s. questdb/questdb.io#2923 fixes the links on its side and auto-detects this file at build time, so the two PRs can merge in any order.

How

  • Walks documentation/sidebars.js in the same order as the llms.txt generator. Top-level categories become # sections; loose docs before the first category form an Overview section; loose docs after a category (changelog) get their own title-labeled section. Category link: {type: 'doc'} pages are included exactly where llms.txt lists them (shared subtreeContainsDoc gate). Docs listed in several sidebar positions render once.
  • Each doc renders through the existing plugins/raw-markdown/convert-components pipeline with the remote-repo-example plugin's data, so <RemoteRepoExample /> shows real code and output matches the per-page .md endpoints. Each doc entry consists of a ## title, a Source: line with the canonical markdown URL, and the body with headings bumped by 2. Some docs carry body H1s, so bumping by 1 would put them at H2, the doc-delimiter level.
  • New scripts/lib/docs-urls.js (canonical URLs, mirrors the raw-markdown plugin exactly, incl. an introduction→index.md safety net) and scripts/lib/sidebar-utils.js — both shared by the two generators.
  • Build resilience: remote example data is only needed for llms-full.txt, so a failed fetch is retried once and the build then falls back to placeholder examples for that run. A GitHub flake therefore cannot fail the docs deploy.
  • Wired into prebuild, gitignored like llms.txt / reference-full.md. Frontmatter via gray-matter (existing dep).
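
The retry-then-degrade behavior from the build-resilience bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `loadWithRetry`, its options, and the empty-object fallback are hypothetical names and shapes, and it assumes that empty example data is what makes the components render their placeholder.

```javascript
// Hypothetical helper (not from the PR): try a loader, retry a fixed number
// of times, then fall back to placeholder data instead of throwing.
async function loadWithRetry(loadFn, { retries = 1, fallback = {}, log = console.warn } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      // Success on any attempt returns the real data.
      return { data: await loadFn(), degraded: false };
    } catch (err) {
      log(`remote example fetch failed (attempt ${attempt + 1}/${retries + 1}): ${err.message}`);
    }
  }
  // Every attempt failed: keep the build alive with placeholder data.
  return { data: fallback, degraded: true };
}
```

A caller would check `degraded` to log that the generated file contains placeholder examples for this build, rather than aborting the prebuild step.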

Review history

Two high-effort multi-agent review rounds; every confirmed finding fixed:

  • R1: category-link docs dropped · RemoteRepoExample placeholders · duplicate doc bodies · URL-logic duplication/divergence · hand-rolled frontmatter regex.
  • R2: phantom doc boundaries from body H1s · duplicate/mislabeled sections (two "Getting Started", changelog under Troubleshooting) · prebuild hard-failing on transient GitHub errors · introduction URL fallback · cross-generator link-doc ordering · bare headers for fully-deduped sections.

Known accepted trade-offs (noted, deliberately out of scope): the MDX→md conversion runs both at prebuild and in the plugin's postBuild (as generate-reference-full.js already does). The render pipeline and URL logic are shared between the two generator scripts, but the plugin itself doesn't consume the shared modules yet. Consolidating generation into the plugin's postBuild is the right future cleanup, but it means touching the code path that produces every production .md page.

Verified locally (fence-aware where relevant)

  • 352 real H2 doc headers == 352 Source: lines — every doc boundary is real, no phantoms.
  • 15 content sections, exactly matching the sidebar: one Getting Started, Overview leading, Documentation changelog standalone. No empty sections.
  • cookbook/sql/finance.md present; vpin.md exactly once; zero "Example not found" placeholders (real Java/Python/… example code verified).
  • llms.txt: URL set = production + exactly the one previously-missing category-link doc; byte-identical before/after the shared-module refactor.
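
The header/Source check in the first bullet could be reproduced with a fence-aware counter along these lines. This is a sketch under assumptions: `countDocBoundaries` is a hypothetical name, and the fence handling is a simplified reading of CommonMark (backtick or tilde fences, indented at most three spaces), not the script used for the verification above.

```javascript
// Hypothetical checker (not from the PR): count "## " headers and "Source: "
// lines that sit outside fenced code blocks.
function countDocBoundaries(text) {
  const openFence = /^ {0,3}(`{3,}|~{3,})/;
  const closeFence = /^ {0,3}(`{3,}|~{3,})\s*$/;
  let fence = null; // marker of the currently open fence, or null
  let h2 = 0;
  let source = 0;

  for (const line of text.split('\n')) {
    if (fence === null) {
      const m = line.match(openFence);
      if (m) {
        fence = m[1];
        continue;
      }
      if (/^## /.test(line)) h2++;
      else if (/^Source: /.test(line)) source++;
    } else {
      // A fence closes only on the same character, at least as long as the opener.
      const m = line.match(closeFence);
      if (m && m[1][0] === fence[0] && m[1].length >= fence.length) fence = null;
    }
  }
  return { h2, source };
}
```

Equal `h2` and `source` counts indicate that every doc header has a matching Source: line and that no code sample contributed a phantom boundary.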

🤖 Generated with Claude Code

The website's llms.txt has long advertised a full documentation corpus
at /docs/llms-full.txt, but no such file was ever generated — the URL
404s. This adds scripts/generate-llms-full.js, which walks the sidebar
(same order as llms.txt) and concatenates every doc's full markdown
content into static/llms-full.txt, served at /docs/llms-full.txt.

MDX processing (partials, component conversion, import stripping,
heading bumping) reuses plugins/raw-markdown/convert-components, so the
output matches the per-page .md endpoints exactly. Each doc entry
carries a Source: line pointing at its canonical markdown URL.

Output on current content: 354 docs, 2.69 MB. Wired into prebuild and
gitignored like the other generated files.

Companion to questdb/questdb.io#2923, which repairs the llms.txt link;
once this deploys, the Full Documentation Content link can be restored
there.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions Bot commented Jul 3, 2026

🚀 Build success!

Latest successful preview: https://preview-476--questdb-documentation.netlify.app/docs/

Commit SHA: 5d99349

📦 Build generates a preview & updates link on each commit.

sandroqdb and others added 2 commits July 3, 2026 16:26
…logic

Fixes from high-effort review of the first revision:
- Docs attached to a category only via link: {type: 'doc'} were silently
  dropped; both generators now include them (llms.txt gains the one doc
  the sidebar only references that way, cookbook/sql/finance/index).
- Pass the remote-repo-example plugin's data to convertAllComponents so
  <RemoteRepoExample /> renders real code instead of its 'Example not
  found' fallback.
- Doc ids listed in multiple sidebar positions are rendered once in
  llms-full.txt (4 duplicate entries skipped, logged); doc count now
  reflects rendered docs only.
- Extract canonical-URL construction into scripts/lib/docs-urls.js,
  mirroring plugins/raw-markdown/index.js exactly (fixes latent
  multi-segment relative-slug divergence) and shared by both the
  llms.txt and llms-full.txt generators. Verified: URL set identical to
  production llms.txt except the one added doc.
- Parse frontmatter with gray-matter (existing dep, same as the
  raw-markdown plugin) instead of a hand-rolled regex.
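
As an illustration of the first and third fixes above, a sidebar walk that picks up category link docs and renders each doc id once might look like this. It is a simplified sketch: `collectDocIds` is a hypothetical name, the item shapes follow the standard Docusaurus sidebar format, and the placement of a category's link doc ahead of its children is an assumption rather than the generators' actual gating logic.

```javascript
// Hypothetical walker (not from the PR): collect doc ids in sidebar order,
// including docs attached only via a category's `link: {type: 'doc'}`,
// and record repeats instead of emitting them twice.
function collectDocIds(sidebarItems) {
  const seen = new Set();
  const ordered = [];
  const skipped = [];

  const add = (id) => {
    if (seen.has(id)) {
      skipped.push(id); // already rendered at an earlier sidebar position
      return;
    }
    seen.add(id);
    ordered.push(id);
  };

  const visit = (item) => {
    if (typeof item === 'string') {
      add(item); // shorthand doc id
    } else if (item.type === 'doc') {
      add(item.id);
    } else if (item.type === 'category') {
      // Assumption: the category's link doc is listed before its children.
      if (item.link && item.link.type === 'doc') add(item.link.id);
      (item.items || []).forEach(visit);
    }
    // Other item types (external links, refs) carry no doc body and are ignored.
  };

  sidebarItems.forEach(visit);
  return { ordered, skipped };
}
```

The `skipped` list is what a generator would log as duplicate entries, while `ordered.length` gives the count of docs actually rendered.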

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…lience

- Bump body headings by 2 (H1->H3) instead of 1: introduction.md and
  changelog.mdx carry body H1s that landed at H2 — the per-doc delimiter
  level — creating phantom doc boundaries. Verified fence-aware: 352
  real H2 doc headers == 352 Source lines.
- Fix section labeling: loose top-level docs before the first category
  form an 'Overview' section (no more duplicate 'Getting Started'
  headers), and loose docs after a category (changelog) get their own
  title-labeled section instead of folding into the preceding category.
- Never fail the docs build on a GitHub flake: remote example data is
  only used for llms-full.txt, so loadContent gets one retry and then
  degrades to placeholder examples for that build instead of aborting
  the whole deploy.
- Gate category link docs with subtreeContainsDoc (moved to shared
  scripts/lib/sidebar-utils.js, used by both generators) so llms-full
  orders them identically to llms.txt; buffer section bodies so a
  section whose docs all rendered elsewhere emits no bare header.
- docs-urls: restore the introduction -> index.md fallback as a safety
  net against slug-extraction failure; document that a trailing slash
  in a slug is deliberately not stripped (the raw-markdown plugin writes
  '<slug>.md' verbatim, so stripping would link a path it never writes).

llms.txt output verified byte-identical before/after the shared-walker
refactor.
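
The heading bump from the first bullet above can be sketched as a line-by-line transform. This is an illustration only: `bumpHeadings` is a hypothetical name, and skipping fenced code blocks and capping at H6 are assumptions, not a description of how the convert-components pipeline implements it.

```javascript
// Hypothetical transform (not from the PR): shift ATX headings down by `by`
// levels, leaving lines inside fenced code blocks untouched.
function bumpHeadings(markdown, by = 2) {
  const openFence = /^ {0,3}(`{3,}|~{3,})/;
  const closeFence = /^ {0,3}(`{3,}|~{3,})\s*$/;
  let fence = null; // marker of the currently open fence, or null

  return markdown
    .split('\n')
    .map((line) => {
      if (fence !== null) {
        const m = line.match(closeFence);
        if (m && m[1][0] === fence[0] && m[1].length >= fence.length) fence = null;
        return line; // code lines such as "# comment" are not headings
      }
      const open = line.match(openFence);
      if (open) {
        fence = open[1];
        return line;
      }
      const heading = line.match(/^(#{1,6})([ \t].*)?$/);
      if (!heading) return line;
      // Assumption: levels deeper than H6 are clamped to H6.
      const level = Math.min(6, heading[1].length + by);
      return '#'.repeat(level) + (heading[2] || '');
    })
    .join('\n');
}
```

With `by = 2`, a body H1 lands at H3, below the H2 level reserved for per-doc delimiters.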

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>