From be15b483a2285e00355913cd6be419460503f774 Mon Sep 17 00:00:00 2001
From: Andreas Stefl <stefl.andreas@gmail.com>
Date: Sat, 13 Jun 2026 17:39:16 +0200
Subject: [PATCH] docs: consolidate per-module STATUS/PLAN notes into AGENTS.md

Merge the untracked per-module STATUS.md + PLAN.md (and the PDF
PLAN-stage0.md subplan) into a single AGENTS.md per module, so the agent
notes are tracked in git and auto-loaded by the repo's agent-instruction
discovery convention (which looks for AGENTS.md, not AGENT.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 AGENTS.md                                     | 163 +++++++
 src/odr/internal/oldms/AGENTS.md              |  64 +++
 src/odr/internal/oldms/presentation/AGENTS.md | 363 ++++++++++++++
 src/odr/internal/oldms/spreadsheet/AGENTS.md  | 183 +++++++
 src/odr/internal/oldms/text/AGENTS.md         | 369 ++++++++++++++
 src/odr/internal/pdf/AGENTS.md                | 450 ++++++++++++++++++
 6 files changed, 1592 insertions(+)
 create mode 100644 AGENTS.md
 create mode 100644 src/odr/internal/oldms/AGENTS.md
 create mode 100644 src/odr/internal/oldms/presentation/AGENTS.md
 create mode 100644 src/odr/internal/oldms/spreadsheet/AGENTS.md
 create mode 100644 src/odr/internal/oldms/text/AGENTS.md
 create mode 100644 src/odr/internal/pdf/AGENTS.md
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 00000000..dbfb6a00
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,163 @@
+# AGENTS.md — OpenDocument.core
+
+Orientation for AI agents working in this repo. Summarises the architecture, the
+conventions, and where to find things. For user-facing docs see
+[`README.md`](README.md) and [`docs/`](docs/README.md).
+
+## What this is
+
+`odr` (a.k.a. `odrcore`) is a **C++20 library that decodes documents and renders
+them to HTML**. It reads many formats (ODF, OOXML, legacy MS, PDF, CSV, …) behind
+one abstract document model and a generic HTML renderer. It is the backend for
+OpenDocument.droid / .ios.
+
+Build system: **CMake + Conan**. Language standard: **C++20** (`CMakeLists.txt`).
+
+## Big picture: how a file becomes HTML
+
+```
+bytes ─▶ magic/open_strategy ─▶ DecodedFile ─▶ Document ─▶ ElementAdapter ─▶ html::translate ─▶ HtmlService
+        (detect FileType +      (per engine)   (per       (tree of           (generic renderer,
+         DecoderEngine)                          format)    elements)          walks public API)
+```
+
+1. **Detection** — `internal/magic.cpp` (+ `internal/libmagic`) sniffs the file;
+   `internal/open_strategy.cpp` picks a `FileType` and a `DecoderEngine` and
+   constructs the matching `abstract::DecodedFile`.
+2. **Decode** — a document file yields an `abstract::Document` (the engine's
+   subclass of `internal::Document`).
+3. **Element tree** — a `Document` exposes a root `ElementIdentifier` plus an
+   `abstract::ElementAdapter`. The public, value-semantics handles
+   (`Element`, `Slide`, `Paragraph`, `Text`, `Frame`, …) in
+   `src/odr/document_element.hpp` are thin wrappers that delegate to the adapter.
+4. **Render** — `internal/html/` walks that public element API and writes HTML.
+   Entry point: `odr::html::translate(...)` → `HtmlService` (paginated fragments;
+   `bring_offline` materialises files).
+
+### The element-adapter pattern (every document engine follows it)
+
+Two pieces per engine:
+
+- An **`ElementRegistry`**: a flat `std::vector<Element>` (id = index + 1) where
+  each `Element` holds `parent`/`first_child`/`last_child`/`prev`/`next` ids and a
+  `type`, plus side `unordered_map`s for per-type payloads (text strings, frame
+  anchors, …). Builders are `create_element` / `create_*_element` /
+  `append_child`. See `oldms/text/doc_element_registry.*` or
+  `oldms/presentation/ppt_element_registry.*` for the minimal version.
+- An **`ElementAdapter`**: one class that implements `abstract::ElementAdapter`
+  (tree navigation by id) and, by multiple inheritance, the per-element-type
+  adapters it supports (`SlideAdapter`, `ParagraphAdapter`, `TextAdapter`,
+  `FrameAdapter`, …). The `*_adapter(id)` methods return `this` when the element
+  is of that type, else `nullptr`. See `oldms/presentation/ppt_document.cpp` for a
+  compact example.
+
+`ElementType` is the shared enum in `src/odr/document_element.hpp` (`root`,
+`slide`, `paragraph`, `text`, `line_break`, `frame`, `table*`, `sheet*`, …).
+
+## Directory map
+
+| Path | What |
+|------|------|
+| `src/odr/*.hpp` | **Public API**: `file.hpp`, `document.hpp`, `document_element.hpp`, `html.hpp`, `style.hpp`, `quantity.hpp` (`Measure`), `odr.hpp`. |
+| `src/odr/internal/abstract/` | Core interfaces: `File`/`DecodedFile`, `Document` + `ElementAdapter` (and all per-element adapters), `Filesystem`, `Archive`, `HtmlService`. |
+| `src/odr/internal/common/` | Reusable impls: `Path`/`AbsPath`, base `Document`, filesystem, `style`, table cursor/range, temp files. |
+| `src/odr/internal/util/` | Helpers: `byte_stream_util` (POD reads), `string_util` (`split`, `u16string_to_string`), `stream_util`, `document_util`, `xml_util`. |
+| `src/odr/internal/magic.*`, `open_strategy.*` | File-type detection and the open/dispatch logic. |
+| `src/odr/internal/html/` | Generic HTML renderer (`document.cpp`, `document_element.cpp`, `document_style.cpp`). |
+| `src/odr/internal/cfb/`, `zip/` | Container formats (Compound File Binary, ZIP). |
+| `src/odr/internal/odf/` | OpenDocument (odt/ods/odp/odg). |
+| `src/odr/internal/ooxml/` | OOXML (docx/pptx/xlsx); subdirs `text`/`presentation`/`spreadsheet`. |
+| `src/odr/internal/oldms/` | **Legacy MS binary** (.doc/.ppt/.xls); subdirs `text`/`presentation`/`spreadsheet`. |
+| `src/odr/internal/oldms_wvware/` | Alternative .doc decoder via wvWare. |
+| `src/odr/internal/pdf/`, `pdf_poppler/` | PDF (own parser + poppler/pdf2htmlEX path). |
+| `src/odr/internal/{csv,json,text,svm}/` | Smaller formats. |
+| `cli/src/` | CLI tools: `translate`, `back_translate`, `meta`, `server`. |
+| `test/src/` | GoogleTest suites; data in `test/data` (git submodules, see below). |
+| `offline/documentation/MS-*/` | Vendored Microsoft spec text (PDF + extracted markdown), see [Specs](#specs). |
+| `docs/design/README.md` | High-level design rationale. |
+
+## Build & test
+
+A configured build dir already exists (`cmake-build-debug`, also `…-release`,
+`…-relwithdebinfo`). Typical loop:
+
+```bash
+# library
+cmake --build cmake-build-debug --target odr
+# tests (the ODR_TEST option is on in this build dir)
+cmake --build cmake-build-debug --target odr_test
+./cmake-build-debug/test/odr_test --gtest_filter='OldMs.*'
+# CLI (renders a file to a directory of HTML)
+cmake --build cmake-build-debug --target translate
+```
+
+Notable CMake options (`CMakeLists.txt`): `ODR_TEST`, `ODR_CLI`,
+`ODR_WITH_PDF2HTMLEX`, `ODR_WITH_WVWARE`, `ODR_WITH_LIBMAGIC`, `ODR_CLANG_TIDY`.
+A new `.cpp` must be added to the `ODR_SOURCE_FILES` list in `CMakeLists.txt`.
+
+**Test data lives in git submodules** under `test/data/input/odr-public`,
+`…/odr-private`, and `test/data/reference-output/*`.
+
+## Conventions
+
+- **Formatting**: clang-format, LLVM-based (`.clang-format`); run `scripts/format`
+  (or rely on the git hook from `scripts/setup`). `clang-tidy` config in
+  `.clang-tidy`; CI enforces both (`.github/workflows/format.yml`, `tidy.yml`).
+- **Error handling — fail fast**: where the spec/format dictates what to expect,
+  **throw** on unexpected input (`std::runtime_error`, or the typed exceptions in
+  `src/odr/exceptions.hpp`) rather than silently degrading. Only **pass through**
+  (return empty / skip) values that are genuinely *optional* or *not yet
+  modelled*.
+- **Public API**: value semantics; immutable handles; iterators only for
+  immutable traversal (`docs/design/README.md`).
+- **Byte parsing**: read POD structs via `util::byte_stream::read`; this assumes
+  host byte order matches the file's (little-endian) — big-endian is a known
+  not-yet-handled gap in the binary engines.
+- Match the **surrounding file's** style, includes, and idioms; mirror a sibling
+  engine when adding a format (the `oldms/text` `.doc` impl is the reference the
+  `.ppt` impl was modelled on).
+- **Comments — keep them minimal**: a function/struct doc comment is at most a
+  couple of terse lines stating the key point (what it does, stream/ownership
+  preconditions, the spec section it implements, e.g. `[MS-PPT] 2.3.2`). Don't
+  restate the code or spell out every case; cite the spec instead of paraphrasing
+  it. The detailed design rationale belongs in the per-module `AGENTS.md`, not in
+  source comments.
+
+## Adding / extending a document format
+
+1. Detection: extend `magic`/`open_strategy` to map the bytes to a `FileType`
+   (+ `DecoderEngine`) and construct your `DecodedFile`.
+2. For documents: subclass `internal::Document`; in its constructor build an
+   `ElementRegistry` and an `ElementAdapter` (see the pattern above).
+3. Implement the per-element adapters you can populate; the **generic HTML
+   renderer then works for free**.
+4. Register the format's factory (e.g. `oldms_file.cpp::document()` switches on
+   `file_type()`), add sources to `CMakeLists.txt`, and add a GoogleTest.
+
+## Legacy Microsoft binary formats (`oldms`)
+
+Container handling (CFB) already exists; each format is a small module under
+`oldms/` mirroring `oldms/text` (`.doc`). Spec references in
+`src/odr/internal/oldms/README.md`.
+
+- **`.doc`** (`oldms/text`): working, visible-text extraction.
+- **`.ppt`** (`oldms/presentation`): implemented — slides resolved via the
+  persist directory (the only spec-defined read path), each slide's text boxes
+  modelled as positioned `frame`s. **Read its docs before touching it**:
+  [`oldms/presentation/AGENTS.md`](src/odr/internal/oldms/presentation/AGENTS.md)
+  — what's implemented and **why** (persist-directory resolution, no scan
+  fallback, sequential `ChildCursor` reading without `tellg`, fail-fast error
+  handling, the two-text-locations finding, endianness), the open work (frame
+  refinements, smaller shortcomings), and the verified `[MS-PPT]`/`[MS-ODRAW]`
+  drawing-tree map.
+- **`.xls`** (`oldms/spreadsheet`): working, visible cell-text extraction
+  (BIFF8). See [`oldms/spreadsheet/AGENTS.md`](src/odr/internal/oldms/spreadsheet/AGENTS.md).
+
+## Specs
+
+Vendored Microsoft Open Specifications live under
+`offline/documentation/<NAME>/<NAME>-<date>/`, both as `original.pdf` and an
+extracted `docling-from-docx.md` (grep-friendly). Available: **MS-PPT**,
+**MS-ODRAW** (Office Art / Escher drawing records), **MS-DOC**, **MS-XLS**,
+**MS-CFB** (container), **MS-OFFCRYPTO** (encryption). Cite section numbers from
+these when implementing binary parsing.
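As an illustration of the registry half of the element-adapter pattern described in the `AGENTS.md` above, here is a minimal sketch of a flat element store. It is not the repo's code: `SketchRegistry`, `SketchElement`, `SketchType`, `SketchId`, and `at` are hypothetical names, and treating id 0 as "no element" is an assumption that follows from the `id = index + 1` layout in the notes. The side maps for per-type payloads are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for the shared ElementType enum and element ids.
enum class SketchType { root, slide, frame, paragraph, text };
using SketchId = std::size_t;
constexpr SketchId null_id = 0; // assumption: 0 means "no element"

struct SketchElement {
  SketchType type{};
  SketchId parent{null_id};
  SketchId first_child{null_id};
  SketchId last_child{null_id};
  SketchId prev{null_id};
  SketchId next{null_id};
};

class SketchRegistry {
public:
  // Appends an element and returns its id (index + 1, so 0 stays free).
  SketchId create_element(SketchType type) {
    SketchElement element;
    element.type = type;
    m_elements.push_back(element);
    return m_elements.size();
  }

  // Fails fast on ids that were never handed out.
  SketchElement &at(SketchId id) {
    if (id == null_id || id > m_elements.size()) {
      throw std::out_of_range("unknown element id");
    }
    return m_elements[id - 1];
  }

  // Links child as the last child of parent and updates the sibling ids.
  void append_child(SketchId parent_id, SketchId child_id) {
    assert(parent_id != child_id);
    SketchElement &parent = at(parent_id);
    SketchElement &child = at(child_id);
    child.parent = parent_id;
    child.prev = parent.last_child;
    child.next = null_id;
    if (parent.last_child != null_id) {
      at(parent.last_child).next = child_id;
    } else {
      parent.first_child = child_id;
    }
    parent.last_child = child_id;
  }

private:
  std::vector<SketchElement> m_elements;
};
```

Creating a root and appending two slides hands out ids 1, 2, and 3, with the root's `first_child` and `last_child` pointing at the first and last slide. An adapter built on top would navigate purely by these ids.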
diff --git a/src/odr/internal/oldms/AGENTS.md b/src/odr/internal/oldms/AGENTS.md
new file mode 100644
index 00000000..ba7cfcde
--- /dev/null
+++ b/src/odr/internal/oldms/AGENTS.md
@@ -0,0 +1,64 @@
+# Legacy MS Office (`oldms/`) — shared status & conventions
+
+What the binary legacy-format modules share. Each format's own status, design
+notes, and open work live with its module; this file holds the conventions they
+build on and the one piece of open work common to all three. Spec links are in
+[`README.md`](README.md); the PDFs are under `offline/documentation/`.
+
+| Module                           | Format                 | Agent doc                       |
+|----------------------------------|------------------------|---------------------------------|
+| [`text/`](text/)                 | `.doc` (Word)          | [text/AGENTS.md](text/AGENTS.md)        |
+| [`presentation/`](presentation/) | `.ppt` (PowerPoint)    | [presentation/AGENTS.md](presentation/AGENTS.md) |
+| [`spreadsheet/`](spreadsheet/)   | `.xls` (Excel / BIFF8) | [spreadsheet/AGENTS.md](spreadsheet/AGENTS.md) |
+
+## Shared conventions
+
+All three modules follow the same approach; the per-format docs cover only what
+is specific to each format.
+
+- **CFB container.** Each format is a `[MS-CFB]` compound file; container
+  handling already existed in the engine. Each module reads its stream(s)
+  sequentially.
+- **Byte-copy structs.** Fixed-layout spec structures are `#pragma pack(1)`
+  structs in the `*_structs.hpp` headers, with the spec's field names and
+  `[MS-*]` section citations, guarded by `static_assert(sizeof ...)`, filled by
+  copying the file's bytes straight in.
+- **Bit-fields mirror the spec tables.** Sub-byte fields are declared as
+  bit-fields in the spec's order (LSB-first): `FibBase`/`Sprm` (`.doc`),
+  `RecordHeader` (`.ppt`), `RkNumber`/`UnicodeStringFlags` (`.xls`).
+- **Little-endian, LSB-first hosts only.** The byte copy interprets bytes in the
+  host's byte order and bit-fields in the host's allocation order. See below.
+- **Fail early on malformed input**; records/structures that are merely *not
+  modelled* are skipped.
+
+## Endianness and bit order: little-endian host assumed (shared open work)
+
+All three modules read multi-byte fields and UTF-16 code units in the host's
+byte order with no swap, and their bit-field structs assume LSB-first
+allocation.
+
+The file side is fixed: `[MS-DOC]`, `[MS-PPT]`, `[MS-XLS]` and the `[MS-CFB]`
+container all store little-endian unconditionally — there is no big-endian
+variant — so no runtime detection is needed. Only the host varies, and that is
+known at compile time (`std::endian::native`). Conveniently, GCC/Clang switch
+bit-field allocation to MSB-first exactly on big-endian targets, so byte order
+and bit order flip together and one compile-time guard covers both. (The flip
+follows from each ABI keeping declaration order equal to memory order: the first
+declared field lands in the first byte either way.)
+
+**Fix if a non-little-endian target matters**: give each struct in the
+`*_structs.hpp` headers a fixup function, applied right after the raw byte copy,
+that byte-swaps the multi-byte fields and re-places the bit-field values. It has
+to be per struct: a blind byte swap cannot fix bit-fields, because the transform
+needs the field widths. On little-endian hosts every fixup compiles to a no-op.
+
+**Rejected alternative**, for the record: `#if`-mirrored bit-field declarations
+(the Linux `iphdr` pattern). Reversing declaration order repositions fields
+within the allocation unit but cannot change how the unit's bytes are assembled,
+so any field that straddles a byte boundary — `Sprm.ispmd` (9 bits),
+`FcCompressed.fc` (30 bits), `RkNumber.num` (30 bits), `RecordHeader.recInstance`
+(12 bits) — ends up in non-contiguous bits on a big-endian reader; only fixing
+the data can express that. The pattern pays off only for zero-copy in-place
+access (mapped packets/pages), which these formats rule out anyway: CFB streams
+are fragmented into sectors and `.xls` records into `CONTINUE` chunks, so structs
+are always assembled by copying — the fixup point is structural.
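A minimal sketch of what such a per-struct fixup could look like, using a simplified copy of the `.ppt` `RecordHeader`. `RecordHeaderSketch`, `HostOrder`, `swap16`, `swap32`, and `fixup` are hypothetical names, not code from the repo. The host order is a template parameter here only so that both branches can be exercised on any machine; the notes above would take it from `std::endian::native`. The big-endian branch assumes the MSB-first bit-field allocation described above.

```cpp
#include <cstdint>

#pragma pack(push, 1)
struct RecordHeaderSketch {
  std::uint16_t recVer : 4;
  std::uint16_t recInstance : 12;
  std::uint16_t recType;
  std::uint32_t recLen;
};
#pragma pack(pop)
static_assert(sizeof(RecordHeaderSketch) == 8, "header must be 8 bytes");

enum class HostOrder { little, big };

constexpr std::uint16_t swap16(std::uint16_t v) {
  return static_cast<std::uint16_t>((v >> 8) | (v << 8));
}

constexpr std::uint32_t swap32(std::uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u) |
         (v << 24);
}

// Applied right after the raw byte copy. On a little-endian host the body is
// discarded at compile time, so the call does nothing.
template <HostOrder Host> void fixup(RecordHeaderSketch &h) {
  if constexpr (Host == HostOrder::big) {
    // Rebuild the 16-bit unit as the big-endian host read it (recVer in the
    // top 4 bits), swap its bytes, then split it LSB-first as the spec does.
    const auto unit_as_read =
        static_cast<std::uint16_t>((h.recVer << 12) | h.recInstance);
    const std::uint16_t unit = swap16(unit_as_read);
    h.recVer = unit & 0xFu;
    h.recInstance = static_cast<std::uint16_t>(unit >> 4);
    // Plain multi-byte fields only need a byte swap.
    h.recType = swap16(h.recType);
    h.recLen = swap32(h.recLen);
  }
}
```

For the file bytes `0F 00 E8 03 10 00 00 00`, a big-endian host would first see `recVer = 0x0`, `recInstance = 0xF00`, `recType = 0xE803`, and `recLen = 0x10000000`. `fixup<HostOrder::big>` turns that back into `0xF`, `0x000`, `0x03E8`, and `16`. The 12-bit `recInstance` is the straddling case: its bits come from both bytes of the unit, which is why the transform needs the field widths rather than a plain swap.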
diff --git a/src/odr/internal/oldms/presentation/AGENTS.md b/src/odr/internal/oldms/presentation/AGENTS.md
new file mode 100644
index 00000000..223c3f8a
--- /dev/null
+++ b/src/odr/internal/oldms/presentation/AGENTS.md
@@ -0,0 +1,363 @@
+# `.ppt` (PowerPoint) support — status, design & open work
+
+What the `oldms/presentation/` module does **today**, the **design decisions**
+behind it, and the **open work**. Shared `oldms/` conventions are in
+[`../AGENTS.md`](../AGENTS.md).
+
+**Scope.** Extract the **visible text of each slide, positioned in its text
+boxes**, and expose it through the abstract document model so the generic HTML
+renderer lays each slide out as positioned frames. No character/paragraph
+styles, master/notes pages, images, charts, tables, or animations.
+
+**Specs.** `offline/documentation/MS-PPT/` (the PowerPoint stream) and
+`MS-ODRAW/` (the Office Art / Escher drawing records). CFB container handling
+already existed in the engine.
+
+---
+
+## What works
+
+- `.ppt` is detected and decoded to a `Document` (presentation), one `slide`
+  per presentation slide **in presentation order**.
+- Each slide's on-slide **text boxes** become positioned `frame`s; their text is
+  split into paragraphs / line breaks.
+- A text box that stores no inline text but an `OutlineTextRefAtom` (the common
+  PowerPoint placeholder representation) is resolved against the slide's text in
+  the `SlideListWithTextContainer`, so placeholder/body text is not lost.
+- The generic HTML renderer produces one page per slide with each text box
+  absolutely positioned (verified: `position:absolute;left:…;top:…` with the
+  decoded coordinates).
+
+## Module layout (mirrors `../text`)
+
+| File (`oldms/presentation/`)     | Role                                                |
+|----------------------------------|-----------------------------------------------------|
+| `ppt_structs.hpp`                | `#pragma pack(1)` PODs (`RecordHeader`, atom bodies, `Anchor`) + `static_assert` sizes + the `RecordType` / `SlideListInstance` enums |
+| `ppt_io.{hpp,cpp}`               | `read(...)` helpers over `std::istream` (text atoms, the anchor rect, fixed structs) |
+| `ppt_parser.{hpp,cpp}`           | `parse_tree(registry, files)` → walks the stream and builds the element tree |
+| `ppt_element_registry.{hpp,cpp}` | Flat element store (copy of `doc_element_registry`) + text & frame side-payloads |
+| `ppt_document.{hpp,cpp}`         | `internal::Document` subclass + the `ElementAdapter` |
+
+`ElementRegistry` is a `vector<Element>` (id = index + 1) with
+parent/child/sibling ids and side maps for the text and frame payloads;
+`create_element` / `create_text_element` / `create_frame_element` /
+`append_child` are the only builders.
+
+## Pipeline: how a `.ppt` becomes the element tree
+
+1. **Wiring.** `LegacyMicrosoftFile` already detected `.ppt` (the
+   `/PowerPoint Document` stream → `FileType::legacy_powerpoint_presentation`,
+   `DocumentType::presentation`) and `open_strategy` routed it here; the
+   `legacy_powerpoint_presentation` case in `LegacyMicrosoftFile::document()`
+   returns `presentation::Document`.
+2. **Resolve slides (persist directory).** `parse_tree` opens both required
+   streams and hands them to `collect_slides(current_user, document)`, following
+   the `[MS-PPT]` reading algorithm: read `CurrentUserAtom` from `/Current User`
+   → walk the `UserEditAtom` chain newest→oldest, building the persist object
+   directory (newest offset per id wins) → resolve the **live**
+   `DocumentContainer` via `docPersistIdRef` → walk the slide list's
+   `SlidePersistAtom`s **in presentation order**, resolving each `persistIdRef`
+   to its `SlideContainer`. See *Design decisions* for why this is the only read
+   path.
+3. **Read text boxes per slide.** For each `SlideContainer` the parser descends
+   the drawing and reads its text boxes (with positions) — see [Text boxes
+   (frames)](#text-boxes-frames).
+4. **Build the tree.** `parse_tree` makes one `slide`, one `frame` per text box
+   (storing its anchor), and `build_paragraphs` hangs the box's text off the
+   frame:
+
+   ```
+   root  (ElementType::root)
+   └── slide              (ElementType::slide)        one per slide, in order
+       └── frame          (ElementType::frame)        one per on-slide text box
+           └── paragraph  (ElementType::paragraph)    split on 0x0D
+               ├── text       (ElementType::text)
+               └── line_break (ElementType::line_break)  for 0x0B in a paragraph
+   ```
+5. **Render.** HTML works through the generic renderer via the public `Slide` /
+   `Frame` / `Paragraph` / `Text` API and our adapters.
+
+## Text boxes (frames)
+
+A `.ppt` slide is a *drawing of shapes*; each text box / placeholder is a shape
+with its own position. `collect_slides` returns, per slide, the on-slide text
+boxes in shape (z) order, each becoming a `frame`.
+
+Per slide the parser descends `SlideContainer → DrawingContainer (0x040C) →
+OfficeArtDgContainer (0xF002) → OfficeArtSpgrContainer (0xF003)` and walks the
+`OfficeArtSpContainer` (0xF004) shapes. For each shape it reads:
+- the **optional** `OfficeArtClientAnchor` (0xF010) → `read_client_anchor`
+  (`SmallRectStruct`/`RectStruct`, master units = 1/576 inch), and
+- the text in its `OfficeArtClientTextbox` (0xF00D).
+
+Shapes with no text are dropped, so the group shape and pictures disappear.
+`FrameAdapter` returns `anchor_type = at_page` and `x/y/width/height` as Measures
+(master units / 576 → inches); a shape without an anchor yields a frame with no
+position.
+
+**First cut (current):** only **top-level** shapes — direct children of the root
+`OfficeArtSpgrContainer`, whose anchors are already in the slide's master-unit
+system. Nested-group coordinate transforms, non-grouped shapes, and
+master-placeholder geometry inheritance are deferred — see [open
+work](#1-frame-refinements). The verified record map of the drawing tree is in
+[Reference](#reference-the-drawing-tree).
+
+## Adapters
+
+`ppt_document.cpp` implements the generic `ElementAdapter` (tree navigation,
+copied from `doc_document.cpp`) plus `SlideAdapter` / `FrameAdapter` /
+`ParagraphAdapter` / `TextAdapter` / `LineBreakAdapter`:
+- `FrameAdapter`: `anchor_type = at_page`; `x/y/width/height` from the frame's
+  anchor (or empty when absent); `z_index` / `style` empty.
+- `SlideAdapter`: `slide_page_layout` → hardcoded 10"×7.5" (4:3); `slide_name` →
+  empty; `slide_master_page` → `null_element_id`.
+- `paragraph_text_style` / `text_style` set `font_size = 11pt` so empty
+  paragraphs still have height.
+- `Document::is_editable()` → `false`; `save(...)` → throws
+  `UnsupportedOperation`.
+
+## Binary format reference
+
+Every record starts with an 8-byte `RecordHeader`:
+
+```
+RecordHeader {
+  uint16 recVer : 4 ;          // 0xF marks a container
+  uint16 recInstance : 12 ;
+  uint16 recType ;
+  uint32 recLen ;              // bytes of body that follow the header
+}
+```
+
+`recVer == 0xF` marks a **container** (body is a sequence of records); otherwise
+it's an **atom** with `recLen` bytes of payload.
+
+| Record                 | Type   | Kind      | Purpose                                  |
+|------------------------|--------|-----------|------------------------------------------|
+| `CurrentUserAtom`      | 0x0FF6 | atom      | in `/Current User`; newest edit offset   |
+| `UserEditAtom`         | 0x0FF5 | atom      | edit chain + persist directory offset    |
+| `PersistDirectoryAtom` | 0x1772 | atom      | persist id → stream offset               |
+| `DocumentContainer`    | 0x03E8 | container | top-level document                       |
+| `SlideListWithText`    | 0x0FF0 | container | per-list slide refs (+ optional outline) |
+| `SlidePersistAtom`     | 0x03F3 | atom      | one per slide; `persistIdRef` + order    |
+| `SlideContainer`       | 0x03EE | container | a slide (drawing + placeholders)         |
+| `MainMaster`           | 0x03F8 | container | master slide (skipped)                   |
+| `Notes`                | 0x03F0 | container | notes page (skipped)                     |
+| `TextHeaderAtom`       | 0x0F9F | atom      | type of the text block that follows      |
+| `TextCharsAtom`        | 0x0FA0 | atom      | UTF-16 text (two bytes per code unit)    |
+| `TextBytesAtom`        | 0x0FA8 | atom      | "compressed" text: one byte per char     |
+
+The Office Art drawing records (`RT_Drawing` 0x040C and `0xF00*`/`0xF010`) used
+for text boxes are listed with the full drawing-tree map in
+[Reference](#reference-the-drawing-tree).
+
+### Text decoding
+
+- `TextCharsAtom`: `recLen / 2` UTF-16 code units → `u16string_to_string`.
+- `TextBytesAtom`: each byte is the low byte of a UTF-16 code unit (high byte 0x00).
+- In-text control characters: `0x0D` = paragraph break, `0x0B` = vertical tab =
+  manual line break — split on these like `doc_parser`. `0x09` (tab) kept; other
+  control characters dropped (`clean_text`).
+
+---
+
+## Design decisions
+
+**Slide resolution is persist-directory based (the single spec path).** The
+persist directory gives correct slide **ordering** for incrementally-saved files
+(where stream order ≠ presentation order) and picks the **live**
+`DocumentContainer` rather than the first one in the stream. Verified on
+`slides.ppt`: `/Current User` → `offsetToCurrentEdit=11646` → `UserEditAtom`
+(`docPersistIdRef=1`, `offsetPersistDirectory=11606`, `offsetLastEdit=0`) → 2
+slides in order with correct text.
+
+**No scan/heuristic fallback — spec-justified.** Both `/Current User` (§2.1.1)
+and `/PowerPoint Document` (§2.1.2) are *required* streams, every conformant
+file has at least one `UserEditAtom` + `PersistDirectoryAtom`, and the reading
+algorithm has no alternative branch. An earlier draft kept a stream-scan
+fallback (first `DocumentContainer`, every `SlideContainer` in stream order,
+plus an outline-vs-container "more text wins" heuristic); it was **removed** —
+unreachable for conformant files and able to silently serve *wrong* results (a
+stale `DocumentContainer`, wrong slide order). `collect_slides` returns an empty
+presentation only for the one *optional* structure: a document with no
+presentation slide list (§2.4.1). Every mandatory structure that can't be
+resolved — empty edit chain, unresolved `docPersistIdRef`, a slide
+`persistIdRef` not in the directory — **throws**.
+
+**Two places hold slide text — and they are not equivalent.**
+- The **outline** (`SlideListWithTextContainer`, §2.4.14.3) is **optional**
+  (`DocumentContainer.slideList`, §2.4.1). When present it carries, per slide,
+  the title/body **placeholder** text only — free text boxes are *never* in it.
+- The **`SlideContainer`** (§2.5.1) is the authoritative source: on-slide text
+  lives in the drawing's `ClientTextbox` records.
+
+In LibreOffice-exported `.ppt` the outline is **empty** (verified on
+`slides.ppt`: the `0x0FF0` lists hold zero text atoms), so there we read each
+slide's text from its `SlideContainer`. But PowerPoint-authored placeholders
+commonly carry **no inline text** in the `SlideContainer` and instead an
+`OutlineTextRefAtom` (§2.9.78) pointing, by index, at the *i*-th `TextHeaderAtom`
+block of that slide in the `SlideListWithTextContainer`. So we read the outline
+too: `read_slide_list_text` collects, per slide (keyed by `persistIdRef`), the
+ordered list of its `TextHeaderAtom` texts, and `gather_text` resolves an
+`OutlineTextRefAtom` box against it. On-slide `ClientTextbox` text still wins
+when present.
+
+**`RT_SlideListWithText` recInstance disambiguates three lists.**
+`MasterListWithTextContainer` (§2.4.14.1), `SlideListWithTextContainer`
+(§2.4.14.3) and `NotesListWithTextContainer` (§2.4.14.6) share `recType =
+RT_SlideListWithText` (0x0FF0); only `recInstance` tells them apart:
+
+| recInstance | container                     | meaning             |
+|-------------|-------------------------------|---------------------|
+| `0x000`     | `SlideListWithTextContainer`  | presentation slides |
+| `0x001`     | `MasterListWithTextContainer` | masters             |
+| `0x002`     | `NotesListWithTextContainer`  | notes               |
+
+An early draft had Slides/Master swapped, making the lookup read the *master*
+list; fixed in `ppt_structs.hpp` (`SlideListInstance`).
+
+**Sequential reading, no `tellg`.** The CFB-backed stream's `tellg()` returns
+bogus values (it broke an early offset-tracking `read_children`). The parser
+never depends on `tellg`: the caller `seekg`s to known offsets (from the persist
+directory or a parent record), and child records are walked **forward** with a
+`ChildCursor` — `read` header → `read`/recurse/`ignore` body — tracking the
+bytes left in the container. A record that overruns its container throws,
+keeping nested containers in sync or failing loudly.
+
+**Fail early on malformed input.** Where the spec dictates what to expect,
+unexpected input **throws** (matches the sibling `.doc` parser). We **throw** on:
+a missing required stream; a wrong record type (`read_header` — so a truncated
+read, whose garbage type won't match, also throws); a record that overruns its
+container (`ChildCursor`); a missing **mandatory** child record — the
+`DrawingContainer` / `OfficeArtDgContainer` / `OfficeArtSpgrContainer` of a
+slide (`require_child`); an `OfficeArtClientAnchor` whose `recLen` is neither 8
+nor 16; a non-decreasing (looping) `UserEditAtom` chain, an empty chain, an
+unresolved `docPersistIdRef`, or a slide `persistIdRef` not in the persist
+directory. We **pass through** (no throw) for values we don't model or that are
+optional: an absent presentation slide list (0 slides), a shape with no
+`OfficeArtClientAnchor` (unpositioned frame), nested groups and non-`Sp` records
+in a group, and any non-text / unrecognised child record.
+
+**Endianness.** Host byte order / LSB-first bit-fields assumed; shared `oldms/`
+assumption, see [`../AGENTS.md`](../AGENTS.md). For `.ppt`: every record field is
+read in host byte order (see the note in `ppt_io.hpp`), and the `RecordHeader`
+recVer/recInstance bit-fields assume LSB-first allocation.
+
+## Tests
+
+- `ppt_empty` — `odr-public/ppt/empty.ppt`: 1 slide.
+- `ppt_slides` — `odr-public/ppt/slides.ppt`: 2 slides, 2 positioned frames each
+  (all `at_page` with `x/y/width/height`), distinct vertical positions, exact
+  per-box text.
+
+Committing the `slides.ppt` fixture and wiring reference-output HTML are open
+items (see 2.2 and 2.3 under Open work).
+
+## Out of scope
+
+Character/paragraph styles, fonts and colours; master and notes slides;
+images/charts/tables and non-text shapes; animations/transitions; and
+encrypted/obfuscated presentations.
+
+---
+
+# Open work
+
+## 1. Frame refinements
+
+The first cut reads only **top-level** shapes — direct children of the root
+`OfficeArtSpgrContainer` — whose anchors are already in the slide's master-unit
+coordinate system. The refinements below raise fidelity; each is optional and
+independent.
+
+- **1.1 Nested groups.** A shape nested inside a sub-group has its anchor
+  expressed in **that group's** coordinate system, defined by the group's
+  `OfficeArtFSPGR` (0xF009, `recVer 0x1`, `recLen 16`: `xLeft, yTop, xRight,
+  yBottom`), not in slide units. To support it: recurse into nested
+  `OfficeArtSpgrContainer` (0xF003), and for each descendant map its anchor from
+  the group's `[xLeft..xRight] × [yTop..yBottom]` onto the group shape's own
+  anchor rect in the parent, composing transforms down the nesting, before the
+  `/576` conversion.
+- **1.2 Non-grouped shapes.** `OfficeArtDgContainer` (0xF002) also has an
+  optional direct `shape` (`OfficeArtSpContainer`, §2.2.13) for a shape not in a
+  group — the current walk only iterates the `OfficeArtSpgrContainer`. Rare in
+  real files, but read that child too for completeness.
+- **1.3 Optional / inherited anchor.** A shape without an
+  `OfficeArtClientAnchor` (0xF010) currently yields a frame with no position.
+  PowerPoint placeholders often omit the anchor and inherit geometry from the
+  matching placeholder shape on the **master slide** (resolve via
+  `OfficeArtClientData.placeholderAtom` → the master's placeholder).
+- **1.4 Origin / sign sanity check.** Field order and units are spec-confirmed
+  (top/left/right/bottom; master units = 1/576 inch) and verified on
+  `slides.ppt`. Still worth confirming the origin (top-left of the slide) and
+  non-negative values on a second, independently produced real file.
+
+## 2. Smaller shortcomings
+
+- **2.1 Slide size is hardcoded.** `slide_page_layout` returns a fixed 10"×7.5"
+  (`ppt_document.cpp`). The real size is `DocumentAtom.slideSize`
+  (`RT_DocumentAtom` 0x03E9, the first child of the `DocumentContainer`) — a
+  `PointStruct` in master units (`/576` → inches). Read it and feed the page
+  layout; fall back to 10"×7.5" only if absent.
+- **2.2 Reference-output HTML not wired.** `html_output_test` has no `ppt` case.
+  Add reference HTML under
+  `test/data/reference-output/odr-public/output/ppt/...` and wire it in (needs
+  the `OpenDocument.test.output` submodule).
+- **2.3 Fixture not committed.** `test/data/input/odr-public/ppt/slides.ppt`
+  exists only in the local `odr-public` submodule working tree. It must be
+  committed/pushed to the `OpenDocument.test` repo and the submodule pointer
+  bumped, or CI can't see it (so `ppt_slides` would fail there).
+- **2.4 No `OutlineTextRefAtom` fixture.** `OutlineTextRefAtom` resolution is
+  implemented but **unexercised by any committed fixture** — all three current
+  `.ppt` files are LibreOffice-authored with an empty outline (`grep` for the
+  `00 00 9E 0F 04 00 00 00` header finds none). A PowerPoint-authored `.ppt`
+  whose placeholders use the outline indirection is needed to regression-test
+  the path. Pairs with §2.3.
+- **2.5 Auto-field metacharacters dropped.** Slide-number / date / header /
+  footer placeholders are separate records (`RT_*MetaCharAtom`) interleaved with
+  the text; we ignore them, so e.g. a slide-number placeholder yields nothing.
+  Low priority for "visible text only".
+- **2.6 `slide_name` is empty.** Could return `"Slide N"` (index-based) so the
+  HTML page/tab has a label, matching how other formats name pages.
+- **2.7 Endianness** — shared `oldms/` shortcoming; see [`../AGENTS.md`](../AGENTS.md).
+
+## Reference: the drawing tree
+
+Inside each `SlideContainer` (0x03EE) is the Office Art (Escher) drawing that
+holds the slide's text boxes:
+
+```
+SlideContainer (0x03EE)                            [MS-PPT] 2.5.1
+└─ drawing = DrawingContainer (RT_Drawing, 0x040C) [MS-PPT] 2.5.13
+   └─ OfficeArtDgContainer (0xF002)                [MS-ODRAW] 2.2.13
+      └─ OfficeArtSpgrContainer (0xF003)           shape group       [MS-ODRAW] 2.2.16
+         ├─ OfficeArtSpContainer (0xF004)          shape #1 (text box) [MS-ODRAW] 2.2.14
+         │  ├─ OfficeArtFSPGR        (0xF009)      group bounds (group shape only) [MS-ODRAW] 2.2.38
+         │  ├─ OfficeArtFSP          (0xF00A)      shape id/flags    [MS-ODRAW] 2.2.40
+         │  ├─ OfficeArtFOPT         (0xF00B)      shape properties  [MS-ODRAW] 2.2.9
+         │  ├─ OfficeArtClientAnchor (0xF010)      POSITION + SIZE   [MS-PPT] 2.7.1
+         │  ├─ OfficeArtClientData   (0xF011)      placeholderAtom: title/body/… [MS-PPT] 2.7.3
+         │  └─ OfficeArtClientTextbox(0xF00D)      the box's text    [MS-PPT] 2.9.76
+         │     ├─ TextHeaderAtom (0x0F9F)
+         │     └─ TextCharsAtom/TextBytesAtom (0x0FA0/0x0FA8)
+         └─ OfficeArtSpContainer (0xF004)          shape #2 …
+```
+
+- The `OfficeArt*` container/shape records are `[MS-ODRAW]`; the
+  `DrawingContainer` and the *client* records (`0xF00D` textbox, `0xF010`
+  anchor, `0xF011` data) are `[MS-PPT]`. `[MS-ODRAW]` §2.2.14 defers
+  `clientAnchor`/`clientData`/`clientTextbox` to the host app.
+- **`OfficeArtSpContainer` (0xF004) child order** per `[MS-ODRAW]` §2.2.14:
+  `shapeGroup?` (`OfficeArtFSPGR`, group shapes only), `shapeProp`
+  (`OfficeArtFSP`, 16 B), `shapePrimaryOptions?` (`OfficeArtFOPT`), …,
+  **`clientAnchor?`**, `clientData?`, `clientTextbox?`. The parser matches by
+  recType, so order only documents what to expect.
+- **Anchor body** (`OfficeArtClientAnchor`, atom, `recLen == 8` or `16`), field
+  order **top, left, right, bottom** (y, x, x, y):
+  - `recLen == 8` → `SmallRectStruct` (`[MS-PPT]` 2.12.8): four **signed 2-byte**.
+  - `recLen == 16` → `RectStruct` (`[MS-PPT]` 2.12.7): four **signed 4-byte**.
+
+  `width = right - left`, `height = bottom - top`; master units → inches = `/576`.
+- The first child `OfficeArtSpContainer` of the root spgr is the **group shape**
+  itself (holds the `OfficeArtFSPGR`, has no `clientTextbox`); the parser drops
+  it implicitly because it has no text.
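+
+The anchor-body rules above can be sketched as follows. This is an
+illustration under stated assumptions, not the parser's actual code:
+`AnchorInches`, `read_i16_le`, `read_i32_le` and `decode_anchor` are
+hypothetical names, and bytes are read explicitly as little-endian rather than
+through the packed structs the module uses:
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <optional>
+
+// Hypothetical result type: position and size in inches.
+struct AnchorInches {
+  double x, y, width, height;
+};
+
+inline std::int16_t read_i16_le(const std::uint8_t *p) {
+  return static_cast<std::int16_t>(
+      static_cast<std::uint16_t>(p[0] | (p[1] << 8)));
+}
+
+inline std::int32_t read_i32_le(const std::uint8_t *p) {
+  const std::uint32_t v = std::uint32_t{p[0]} | (std::uint32_t{p[1]} << 8) |
+                          (std::uint32_t{p[2]} << 16) |
+                          (std::uint32_t{p[3]} << 24);
+  return static_cast<std::int32_t>(v);
+}
+
+// Decodes an OfficeArtClientAnchor body; field order is top, left, right,
+// bottom. Returns nullopt for any recLen other than 8 or 16.
+inline std::optional<AnchorInches> decode_anchor(const std::uint8_t *body,
+                                                 std::size_t rec_len) {
+  std::int32_t top = 0, left = 0, right = 0, bottom = 0;
+  if (rec_len == 8) { // SmallRectStruct: four signed 2-byte values
+    top = read_i16_le(body);
+    left = read_i16_le(body + 2);
+    right = read_i16_le(body + 4);
+    bottom = read_i16_le(body + 6);
+  } else if (rec_len == 16) { // RectStruct: four signed 4-byte values
+    top = read_i32_le(body);
+    left = read_i32_le(body + 4);
+    right = read_i32_le(body + 8);
+    bottom = read_i32_le(body + 12);
+  } else {
+    return std::nullopt;
+  }
+  constexpr double master_units_per_inch = 576.0;
+  return AnchorInches{left / master_units_per_inch,
+                      top / master_units_per_inch,
+                      (right - left) / master_units_per_inch,
+                      (bottom - top) / master_units_per_inch};
+}
+```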
diff --git a/src/odr/internal/oldms/spreadsheet/AGENTS.md b/src/odr/internal/oldms/spreadsheet/AGENTS.md
new file mode 100644
index 00000000..25b1349b
--- /dev/null
+++ b/src/odr/internal/oldms/spreadsheet/AGENTS.md
@@ -0,0 +1,183 @@
+# `.xls` (Excel / BIFF8) support — status, design & open work
+
+What the `oldms/spreadsheet/` module does **today**, the **design decisions**
+behind it, and the **open work**. Shared `oldms/` conventions are in
+[`../AGENTS.md`](../AGENTS.md).
+
+**Scope.** Extract the **visible cell text** of every worksheet and expose it
+through the abstract document model so the generic HTML renderer produces a plain
+table per sheet. Every cell value is rendered as a *string* — no styles,
+number/date formats, merged cells, drawings, or charts.
+
+**Specs.** `[MS-XLS]` (the record stream, the SST, the cell records) and
+`[MS-CFB]` for the container. Section numbers are cited inline below and in code.
+
+---
+
+## What works
+
+- `.xls` is detected (`/Workbook` stream) and decoded to a `Document`
+  (spreadsheet): one `sheet` element per worksheet, with `sheet_cell` →
+  `paragraph` → `text` elements for every non-empty cell.
+- **All BIFF8 cell value kinds** become display text: SST strings (`LabelSst`),
+  inline strings (`Label`), numbers (`RK`, `MulRk`, `Number`), booleans/errors
+  (`BoolErr`), and **cached formula results** (`Formula` + `String` for string
+  results; numeric/boolean/error results from the `FormulaValue`).
+- **SST `CONTINUE` splitting** is handled, including a split *mid-string* where
+  the continuation re-declares the character encoding (§2.5.293).
+- Sheet `dimensions` come from the `Dimensions` record; `content` is the tight
+  extent of the non-empty cells (what the HTML renderer uses by default).
+- The generic HTML renderer produces one table per sheet
+  (`html::translate_sheet`), with column letters and row numbers.
+
+Verified against `[MS-XLS]`: the record stream (§2.1.4), BOF/substream layout
+(§2.4.21), `BoundSheet8` (§2.4.28), `SST`/`Continue` (§2.4.265/.58),
+`XLUnicodeRichExtendedString` (§2.5.293), `RkNumber` (§2.5.217: bit 0 = `fX100`,
+bit 1 = `fInt`), `FormulaValue` (§2.5.133), `Dimensions` (§2.4.90).
+
+## Module layout (sibling of `../text`, `../presentation`)
+
+| File (`oldms/spreadsheet/`)        | Role                                              |
+|------------------------------------|---------------------------------------------------|
+| `xls_structs.hpp`                  | `#pragma pack(1)` PODs for the record bodies + `static_assert` sizes + record type enum |
+| `xls_io.{hpp,cpp}`                 | `BiffReader` (record walker with transparent `CONTINUE` hopping; the `[MS-XLS]` string readers and `expect_bof` are methods), RK decoding, number formatting |
+| `xls_parser.{hpp,cpp}`             | `parse_tree(registry, files)` → globals (BoundSheet8 + SST) then one pass per sheet substream |
+| `xls_element_registry.{hpp,cpp}`   | Flat element store + `Sheet` (name, dimensions, cell position map) and `SheetCell` (position) payloads |
+| `xls_document.{hpp,cpp}`           | `internal::Document` subclass + the `ElementAdapter` |
+
+## Pipeline: how a `.xls` becomes the element tree
+
+1. **Wiring.** `LegacyMicrosoftFile::parse_meta` detects the `/Workbook` stream
+   → `FileType::legacy_excel_worksheets`, `DocumentType::spreadsheet`, and
+   `document()` returns `spreadsheet::Document`.
+2. **Globals substream.** `/Workbook` is a flat sequence of `(u16 type, u16
+   size, body)` records. The first substream (after its `BOF`, which must
+   declare BIFF8 = `vers 0x0600`) holds, per sheet, a `BoundSheet8` (name +
+   absolute offset of the sheet's `BOF`; only `dt == worksheet` is kept) and the
+   `SST` — all shared string constants, deduplicated.
+3. **SST / CONTINUE.** A record body is capped at 8224 bytes; the SST payload
+   spills into `Continue` records, and the split can fall *inside* a string.
+   `BiffReader`'s body accessors hop into a following `CONTINUE` transparently
+   (throwing if the next record is anything else); character data additionally
+   re-reads a fresh flags byte at each hop, since the continuation re-declares
+   compressed (1 byte/char) vs UTF-16 for the remainder. Formatting runs
+   (`cRun`·4 bytes) and phonetic data (`cbExtRst` bytes) are read and skipped.
+4. **Sheet substreams.** For each kept `BoundSheet8`, seek to its `BOF` and scan
+   records until `EOF`: `Dimensions` → sheet extents; `LabelSst` / `Label` /
+   `RK` / `MulRk` / `Number` / `BoolErr` → one cell each; `Formula` → the cached
+   result in its `FormulaValue` (an Xnum double unless `fExprO == 0xFFFF`, then
+   string/bool/error/blank — a string result follows in a `String` record,
+   matched via a pending-cell marker). `Blank` / `MulBlank` carry no text and
+   are ignored.
+5. **Tree.** Each non-empty cell becomes `sheet_cell → paragraph → text` (the
+   cell's rendered string). Cells hang off their sheet by `parent_id` only —
+   they are *not* in the sibling chain (mirrors `ooxml/spreadsheet`); lookup goes
+   through the sheet's `(column,row) → id` map, which also tracks the tight
+   `content` extent.
+6. **Render.** `html::translate_sheet` walks the sheet purely through the public
+   `Sheet` / `SheetCell` API, which delegates to our adapter.
+
+### Value formatting
+
+- **RK numbers** (§2.5.217): low 2 bits are flags — bit 0 `fX100` (divide by
+  100), bit 1 `fInt` (30-bit signed integer vs the *high 30 bits* of an IEEE
+  double, rest zero).
+- Numbers are formatted with `%.15g` (≈ Excel's "General": up to 15 significant
+  digits, no trailing zeros, integers without a decimal point).
+- Booleans → `TRUE`/`FALSE`; error codes (BErr, §2.5.10) → `#DIV/0!`, `#VALUE!`,
+  `#REF!`, `#NAME?`, `#NUM!`, `#N/A`, `#NULL!`.
+- Dates are **not** decoded: a date cell shows its raw serial number unless the
+  file stored it as a string (number-format handling is open work).
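+
+A hedged sketch of the RK and `%.15g` rules above. `decode_rk` and
+`format_general` are illustrative names and not necessarily the signatures in
+`xls_io`; the sketch uses shifts and masks instead of the `RkNumber` bit-field
+struct, and assumes an IEEE-754 `double` with the same byte order as
+`uint64_t`:
+
+```cpp
+#include <cstdint>
+#include <cstdio>
+#include <cstring>
+#include <string>
+
+// Decodes a raw 32-bit RK value (already assembled from its on-disk bytes).
+inline double decode_rk(std::uint32_t rk) {
+  const bool f_x100 = (rk & 0x1u) != 0; // bit 0: divide the result by 100
+  const bool f_int = (rk & 0x2u) != 0;  // bit 1: 30-bit signed integer
+  double value = 0.0;
+  if (f_int) {
+    // Clear the flag bits, reinterpret as signed, and drop the two low bits.
+    value = static_cast<std::int32_t>(rk & 0xFFFFFFFCu) / 4;
+  } else {
+    // The upper 30 bits are the high bits of an IEEE double; the rest is 0.
+    const std::uint64_t bits = static_cast<std::uint64_t>(rk & 0xFFFFFFFCu)
+                               << 32;
+    std::memcpy(&value, &bits, sizeof value);
+  }
+  return f_x100 ? value / 100.0 : value;
+}
+
+// Formats like the "General" approximation described above.
+inline std::string format_general(double value) {
+  char buffer[32];
+  std::snprintf(buffer, sizeof buffer, "%.15g", value);
+  return buffer;
+}
+```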
+
+## Adapters
+
+`xls_document.cpp` implements the generic `ElementAdapter` plus `SheetAdapter` /
+`SheetCellAdapter` / `ParagraphAdapter` / `TextAdapter`:
+- `sheet_name` / `sheet_dimensions` → from the registry payload;
+  `sheet_content(range)` → the tight content extent, clamped to `range`.
+- `sheet_cell(col,row)` → map lookup, `null_element_id` for empties;
+  `sheet_first_shape` → none.
+- All `*_style(...)` → `{}`; `sheet_cell_value_type` → `ValueType::string`
+  (every value is pre-rendered text); `sheet_cell_span` → `{1,1}`.
+- `paragraph_text_style` / `text_style` set `font_size = 11pt` so empty
+  paragraphs have height (same hack as the `.doc`/`.ppt` modules).
+- `Document::is_editable()` → `false`; `save(...)` → `UnsupportedOperation`.
+
+## Design decisions
+
+- **Fail early on malformed input** (matches the sibling modules): missing or
+  non-BIFF8 `BOF`, a non-`CONTINUE` record where a body continuation is
+  required, an out-of-range SST index, a malformed `MulRk` body, an unknown
+  `FormulaValue` type, and truncated streams all **throw**. Records that are
+  merely *not modelled* are skipped.
+- **Pre-rendered text instead of typed values.** Cell values are converted to
+  display strings at parse time; the model exposes `ValueType::string` only.
+  Typed values would require XF/number-format plumbing — deliberately deferred.
+- **Endianness/bit order**: bytes are copied straight into native
+  integers/doubles and bit-field structs (`RkNumber`, `UnicodeStringFlags`, flag
+  fields of `BoundSheet8Fixed`/`FormulaFixed`) — little-endian, LSB-first hosts
+  only; shared `oldms/` assumption, see [`../AGENTS.md`](../AGENTS.md).
+
+## Tests
+
+- `xls_string_split_across_continue` — a string split mid-character-data with an
+  encoding switch at the boundary.
+- `xls_rich_string_runs_across_continue` — formatting-run skip across a
+  `CONTINUE` (no flags byte there) + correct position for the next string.
+- `xls_decode_rk` — all four RK flag combinations + number formatting; the
+  inputs are raw on-disk encodings, so it also pins the `RkNumber` bit-field
+  layout.
+- `xls_empty` / `xls_file_example_10` / `xls_file_example_5000` — real fixtures:
+  sheet names, dimensions, content extents, string/number cells; the 5000-row
+  file exercises SST `CONTINUE` handling on real data.
+- HTML output: `html_output_test` no longer skips `legacy_excel_worksheets`;
+  reference output lives under
+  `test/data/reference-output/{odr-public,odr-private}/output/xls/`.
+
+---
+
+# Open work
+
+Roughly ordered by value.
+
+## 1. Number & date formatting (the biggest visible gap)
+
+Cells currently show raw values: a date cell renders as its serial number (e.g.
+`43023` instead of `15/10/2017`) and numbers ignore their format codes. Fix by
+following the format chain:
+- Each cell record carries an `ixfe` (currently discarded — the parser already
+  reads it). It indexes the `XF` records (0x00E0) in the globals substream;
+  `XF.ifmt` picks a number format: a built-in id (0–163, the table is in
+  `[MS-XLS]` §2.4.126 `Format`) or a `Format` record (0x041E) with a format
+  string.
+- MVP: keep `ixfe` per cell, parse `XF`/`Format`, and special-case the date/time
+  formats (built-in ids 14–22, 45–47 + anything containing `y/m/d/h`) to convert
+  the serial date into a sensible string. The serial counts days with
+  1 = 1900-01-01 and the fractional part = time, but it also counts the
+  non-existent 1900-02-29 as serial 60, so from serial 61 on the effective
+  epoch is 1899-12-30 (which is how `43023` becomes 2017-10-15). Mind the
+  workbook's 1904 flag in `Date1904` (0x0022), where serial 0 = 1904-01-01.
+  Full custom-format rendering is a rabbit hole; approximate first.
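+
+A possible starting point for the serial-date conversion, as a sketch only.
+`Date`, `civil_from_days` and `serial_to_date` are hypothetical names; the
+time-of-day fraction is ignored. It accounts for Excel's phantom 1900-02-29
+(serial 60 in the 1900 system) and for the 1904 system, where serial 0 is
+1904-01-01:
+
+```cpp
+// Hypothetical calendar date type.
+struct Date {
+  int year;
+  unsigned month;
+  unsigned day;
+};
+
+// Days since 1970-01-01 -> proleptic Gregorian date (the well-known
+// "civil from days" integer algorithm).
+inline Date civil_from_days(long long z) {
+  z += 719468;
+  const long long era = (z >= 0 ? z : z - 146096) / 146097;
+  const unsigned doe = static_cast<unsigned>(z - era * 146097);
+  const unsigned yoe =
+      (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
+  const long long y = static_cast<long long>(yoe) + era * 400;
+  const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
+  const unsigned mp = (5 * doy + 2) / 153;
+  const unsigned d = doy - (153 * mp + 2) / 5 + 1;
+  const unsigned m = mp < 10 ? mp + 3 : mp - 9;
+  return {static_cast<int>(y + (m <= 2 ? 1 : 0)), m, d};
+}
+
+// `serial` is the integer part of the cell value; `date1904` mirrors the
+// workbook's Date1904 flag.
+inline Date serial_to_date(long long serial, bool date1904) {
+  if (date1904) {
+    return civil_from_days(serial - 24107); // serial 0 = 1904-01-01
+  }
+  if (serial == 60) {
+    return {1900, 2, 29}; // Excel's phantom leap day, shown as-is
+  }
+  // Before the phantom day the epoch is 1899-12-31, after it 1899-12-30.
+  return civil_from_days(serial - (serial < 60 ? 25568 : 25569));
+}
+```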
+
+## 2. Coverage gaps
+
+- **Merged cells**: `MergeCells` record (0x00E5) → `sheet_cell_span` /
+  `sheet_cell_is_covered` (the adapter stubs are in place).
+- **Styles**: fonts (`Font`, 0x0031), fills/borders from `XF` →
+  `sheet_cell_style` / `text_style`; column widths (`ColInfo`, 0x007D) and row
+  heights (`Row`, 0x0208) → `sheet_column_style` / `sheet_row_style`.
+- **Hidden rows/columns** (`Row.fDyZero`, `ColInfo.fHidden`).
+- **Typed cell values**: expose numeric/bool/date `ValueType`s instead of
+  pre-rendered strings (needed for anything smarter than HTML text).
+- **Encrypted workbooks**: a `FilePass` record (0x002F) in the globals substream
+  means the rest of the stream is encrypted (`[MS-OFFCRYPTO]`); currently it
+  parses as garbage or throws, but it should report password-protected.
+- **BIFF5/BIFF7** (`BOF.vers != 0x0600`): currently throws; older `.xls` files
+  exist in the wild (no SST — `Label` records carry the strings inline).
+- **Drawings/charts/images** (`MsoDrawing`/`Obj`/chart substreams) — likely
+  never worth it for text extraction.
+
+## 3. Smaller shortcomings
+
+- **Endianness/bit order** — shared `oldms/` shortcoming, see
+  [`../AGENTS.md`](../AGENTS.md).
+- `RString` (0x00D6, rich inline string cell) is rare and currently skipped.
+- A `Formula` string result is matched to the *immediately following* `String`
+  record via a pending-cell marker; an intervening `SharedFmla`/`Array`/`Table`
+  record is tolerated only because unknown records are skipped — not validated.
diff --git a/src/odr/internal/oldms/text/AGENTS.md b/src/odr/internal/oldms/text/AGENTS.md
new file mode 100644
index 00000000..a934eab7
--- /dev/null
+++ b/src/odr/internal/oldms/text/AGENTS.md
@@ -0,0 +1,369 @@
+# `.doc` (Word) support — status, design & open work
+
+What the `oldms/text/` module does **today**, the **design decisions** behind
+it, and the **open work**. Shared `oldms/` conventions are in
+[`../AGENTS.md`](../AGENTS.md).
+
+**Scope.** Extract the **visible text of the main document body**, split into
+paragraphs and manual line breaks, and expose it through the abstract document
+model so the generic HTML renderer lays it out as a flat run of paragraphs. No
+character/paragraph styles, no headers/footers/footnotes/endnotes/annotations,
+no tables, frames, images, or fields beyond showing their result text.
+
+**Specs.** `[MS-DOC]` (the FIB, the Clx / piece table, text decoding) and
+`[MS-CFB]` for the container. Section numbers are cited inline below.
+
+---
+
+## What works
+
+- `.doc` is detected (`/WordDocument` stream) and decoded to a `Document`
+  (text), one flat element tree under the root.
+- The **main document body** (the first `ccpText` characters) is read from the
+  piece table, decoded (compressed 8-bit *or* UTF-16), split into paragraphs /
+  manual line breaks, with a `page_break` element at each end-of-section /
+  manual page break (`0x0C`).
+- Field codes are resolved to their **result** text (the instruction part is
+  hidden); anchor/control characters are stripped.
+- The generic HTML renderer produces the body as a sequence of paragraphs.
+
+Verified against `[MS-DOC]`: the read path matches *Retrieving Text* (§2.4.1,
+steps 1–6), the FIB version map (§2.5.1), the Clx / Pcdt / Prc lead bytes
+(§2.9.38/.178/.209), `FcCompressed` incl. the `0x82–0x9F` byte map (§2.9.73),
+and the field characters (§2.8.25).
+
+## Module layout (sibling of `../presentation`)
+
+| File (`oldms/text/`)              | Role                                                |
+|-----------------------------------|-----------------------------------------------------|
+| `doc_structs.hpp`                 | `#pragma pack(1)` PODs (`FibBase`, the `FibRgFcLcb97/2000/2002/2003/2007` chain, `Sprm`, `FcCompressed`, `Pcd`) + `static_assert` sizes + the `PlcPcdMap` piece-table view + `ParsedFib` |
+| `doc_io.{hpp,cpp}`                | `read(...)` helpers over `std::istream`: the variable-length FIB, the Clx walk, string decoding (compressed / UTF-16) |
+| `doc_helper.{hpp,cpp}`            | `CharacterIndex` (the decoded piece table) + `read_character_index` |
+| `doc_parser.{hpp,cpp}`           | `parse_tree(registry, files)` → reads the body text and builds the element tree, incl. `clean_text` (field & control-char handling) |
+| `doc_element_registry.{hpp,cpp}` | Flat element store (id = vector index) + a text side-payload |
+| `doc_document.{hpp,cpp}`         | `internal::Document` subclass + the `ElementAdapter` |
+
+`ElementRegistry` is a `vector<Element>` (id = index) with parent/child/sibling
+ids and a side map for the text payload; `create_element` / `create_text_element`
+/ `append_child` are the only builders.
+
+## Pipeline: how a `.doc` becomes the element tree
+
+1. **Wiring.** `LegacyMicrosoftFile::parse_meta` detects the `/WordDocument`
+   stream → `FileType::legacy_word_document`, `DocumentType::text`, and
+   `document()` returns `text::Document`.
+2. **Read the FIB.** `parse_tree` opens `/WordDocument` and reads the **File
+   Information Block** (§2.5.1). The FIB is variable-length and self-describing:
+   a fixed `FibBase` (32 B) followed by four counted arrays — `csw`·uint16
+   (`fibRgW`), `cslw`·uint32 (`fibRgLw`), `cbRgFcLcb`·`FcLcb` (`fibRgFcLcb`),
+   `cswNew`·uint16 (`fibRgCswNew`). `read(ParsedFib&)` reads each count,
+   validates it covers the struct we model, then `ignore`s any surplus.
+3. **Pick the FIB version.** The effective `nFib` is `fibRgCswNew.nFibNew` when
+   `cswNew > 0`, else `FibBase.nFib`. `type_dispatch_FibRgFcLcb` maps it
+   (`nFib97 … nFib2007`) to the right `FibRgFcLcb*` layout and `memcpy`s the raw
+   `fibRgFcLcb` bytes into it. We only read `clx` out of it, but the whole
+   versioned struct is modelled so the offset is correct.
+4. **Locate & read the Clx (piece table).** The table stream is `/1Table` or
+   `/0Table` per `FibBase.fWhichTblStm`. The Clx (§2.9.38) lives at
+   `fibRgFcLcb->clx.fc`. `read_Clx` walks it: leading `Prc` entries (lead
+   `0x01`) are skipped, then the `Pcdt` (lead `0x02`) carries the `PlcPcd` — the
+   piece table mapping CP ranges to byte offsets in `/WordDocument`.
+   `read_character_index` turns it into a `CharacterIndex`.
+5. **Concatenate the body text.** Pieces come in ascending CP order;
+   `parse_tree` clamps each to the remaining `ccpText` budget (so only the main
+   body is taken), seeks to each piece's `data_offset`, decodes it.
+6. **Build the tree.** Split the body on `0x0D` (paragraph mark) — dropping the
+   trailing empty paragraph from the body's guard mark — then each paragraph on
+   `0x0C` (end-of-section / manual page break) and each segment on `0x0B`
+   (manual line break):
+
+   ```
+   root  (ElementType::root)
+   ├── paragraph  (ElementType::paragraph)    split on 0x0D, then 0x0C
+   │   ├── text       (ElementType::text)     clean_text(...) of the run
+   │   └── line_break (ElementType::line_break)  for 0x0B in a paragraph
+   └── page_break (ElementType::page_break)   one per 0x0C boundary
+   ```
+7. **Render.** HTML works through the generic renderer via the public
+   `Paragraph` / `Text` / `LineBreak` API and our adapters.
+
+## The piece table (`CharacterIndex`)
+
+A `.doc` stores text in **pieces** rather than one contiguous run: the `PlcPcd`
+is `n+1` ascending CP boundaries (`aCP`) followed by `n` `Pcd` structures
+(`aData`). `PlcPcdMap` is a zero-copy view over the raw `plcPcd` bytes computing
+`n = (cb - 4) / (4 + sizeof(Pcd))`, exposing `aCP(i)` / `aData(i)`.
+
+Each `Pcd` holds an `FcCompressed`:
+- `fCompressed == 0` → **UTF-16**, `data_offset = fc`, 2 bytes per CP.
+- `fCompressed == 1` → **compressed** (one byte per CP), `data_offset = fc / 2`.
+
+`read_character_index` records, per piece, `(start_cp, length_cp, data_offset,
+is_compressed)`; `CharacterIndex::Iterator` derives `length_cp` from adjacent CP
+boundaries and `data_length` from the compression flag. `append` enforces
+ascending CP order (throws otherwise).
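+
+The arithmetic above, as a small sketch. The names (`piece_count`,
+`PieceLocation`, `decode_fc_compressed`, `byte_offset_of_cp`) are hypothetical,
+and the raw `FcCompressed` value is split with masks (30-bit `fc`, then
+`fCompressed` in bit 30) rather than through the bit-field struct in
+`doc_structs.hpp`:
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+
+constexpr std::size_t kPcdSize = 8; // a Pcd is 8 bytes on disk
+
+// Number of pieces in a PlcPcd of `cb` bytes: n+1 CPs (4 bytes each) + n Pcds.
+constexpr std::size_t piece_count(std::size_t cb) {
+  return cb < 4 ? 0 : (cb - 4) / (4 + kPcdSize);
+}
+
+struct PieceLocation {
+  std::uint32_t data_offset;  // byte offset of the piece in /WordDocument
+  std::uint32_t bytes_per_cp; // 1 if compressed, 2 if UTF-16
+};
+
+// Splits a raw 32-bit FcCompressed into a byte offset and a character width.
+constexpr PieceLocation decode_fc_compressed(std::uint32_t raw) {
+  const std::uint32_t fc = raw & 0x3FFFFFFFu;
+  const bool compressed = ((raw >> 30) & 1u) != 0;
+  return compressed ? PieceLocation{fc / 2, 1} : PieceLocation{fc, 2};
+}
+
+// Byte offset of character `cp` inside the piece starting at `piece_start_cp`.
+constexpr std::uint32_t byte_offset_of_cp(PieceLocation piece,
+                                          std::uint32_t piece_start_cp,
+                                          std::uint32_t cp) {
+  return piece.data_offset + (cp - piece_start_cp) * piece.bytes_per_cp;
+}
+```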
+
+### Text decoding
+
+- **Uncompressed**: `length_cp` UTF-16 code units → `u16string_to_string`.
+- **Compressed**: each byte is one code point (§2.9.73 / §2.4.1 step 6). Bytes
+  `0x82–0x9F` are remapped via `uncompress_char` (the Windows-1252 "smart
+  quotes" block, e.g. `0x92 → U+2019`, `0x96 → U+2013`); every other byte maps
+  to the code point of the same value (`U+0000`–`U+00FF`) and is UTF-8-encoded,
+  so `0xA0–0xFF` round-trip (e.g. `0xE9 → "é"`).
+- **In-text control characters** (`clean_text`):
+  - `0x0D` paragraph mark, `0x0C` end-of-section / manual page break, `0x0B`
+    manual line break are consumed by the caller's splits and never reach
+    `clean_text`. A `0x0C` boundary emits a `page_break` (§2.8.26).
+  - `0x13`/`0x14`/`0x15` delimit a **field**: instruction (begin→separator)
+    hidden, result (separator→end) shown. The separator `0x14` is optional
+    (§2.8.25); a separator-less field is hidden up to its `0x15` end. Nesting is
+    tracked with a per-field stack.
+  - `0x09` tab kept; `0x1E` non-breaking hyphen → `-`; `0x1F` optional hyphen
+    dropped; all other control characters `< 0x20` (picture/OLE `0x01`, footnote
+    ref `0x02`, cell mark `0x07`, …) dropped.
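+
+The field rule can be sketched with a stack of "separator seen" flags. This is
+an illustration, not the actual `clean_text`: `strip_fields` is a hypothetical
+name, it handles only `0x13`/`0x14`/`0x15`, and it assumes a character is
+visible only when every enclosing field is in its result part:
+
+```cpp
+#include <string>
+#include <vector>
+
+// Removes field instructions and field marks, keeping field results.
+inline std::string strip_fields(const std::string &in) {
+  // One entry per open field: true once its separator (0x14) has been seen.
+  std::vector<bool> in_result;
+  std::string out;
+  for (const char c : in) {
+    switch (c) {
+    case 0x13: // field begin: the instruction part starts (hidden)
+      in_result.push_back(false);
+      break;
+    case 0x14: // field separator: the result part starts (shown)
+      if (!in_result.empty()) {
+        in_result.back() = true;
+      }
+      break;
+    case 0x15: // field end
+      if (!in_result.empty()) {
+        in_result.pop_back();
+      }
+      break;
+    default: {
+      bool visible = true;
+      for (const bool result_part : in_result) {
+        visible = visible && result_part;
+      }
+      if (visible) {
+        out.push_back(c);
+      }
+    }
+    }
+  }
+  return out;
+}
+```
+
+A separator-less field never flips its flag, so everything up to its `0x15`
+stays hidden, matching the rule above.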
+
+## Adapters
+
+`doc_document.cpp` implements the generic `ElementAdapter` plus
+`TextRootAdapter` / `ParagraphAdapter` / `SpanAdapter` / `TextAdapter` /
+`LineBreakAdapter`:
+- `text_root_page_layout` / `text_root_first_master_page` → empty.
+- `paragraph_style` / `span_style` / `line_break_style` → empty (`TODO`).
+- `paragraph_text_style` / `text_style` set `font_size = 11pt` so empty
+  paragraphs still have height (same hack as the PPT module; removed when
+  character formatting lands — see open work).
+- `Document::is_editable()` → `true` and `is_savable(encrypted)` →
+  `!encrypted`, but `save(...)` and `text_set_content(...)` throw
+  `UnsupportedOperation` — read-only in practice.
+
+## Binary format reference (FIB)
+
+The FIB is the root of every `.doc`, at offset 0 of `/WordDocument`:
+
+```
+FibBase        32 B fixed   (wIdent, nFib, flags incl. fWhichTblStm/fEncrypted, …)
+csw            uint16       count of the following uint16 array
+fibRgW         csw·uint16
+cslw           uint16       count of the following uint32 array
+fibRgLw        cslw·uint32  (holds ccpText at uint16 indices 6–7)
+cbRgFcLcb      uint16       count of the following FcLcb (8-byte) array
+fibRgFcLcb     cbRgFcLcb·FcLcb   (holds clx → the piece table)
+cswNew         uint16       count of the following uint16 array
+fibRgCswNew    cswNew·uint16     (nFibNew overrides FibBase.nFib when present)
+```
+
+`ccpText` (count of CPs in the main body) is read out of `fibRgLw` as a
+little-endian uint32 spanning uint16 indices 6–7 (byte offset 12, i.e. the
+fourth uint32); it is signed and MUST be ≥ 0, so a value with the sign bit set
+**throws** (§2.5.5). `nFib` values handled: `nFib97` (0x00C1), `nFib2000`
+(0x00D9), `nFib2002` (0x0101), `nFib2003` (0x010C), `nFib2007` (0x0112). A
+value **above** `nFib2007` falls back to the `FibRgFcLcb2007` layout; a value
+below `nFib97` **throws**.
+
+---
+
+## Design decisions
+
+**Main body only, via the `ccpText` budget.** `/WordDocument` interleaves the
+body with headers, footnotes, annotations, etc.; the FIB's `ccp*` counts
+partition the CP space. We take only the first `ccpText` CPs by clamping each
+piece to the remaining budget and stopping when exhausted.
+
+**Self-describing FIB read — forward-compatible.** `read(ParsedFib&)` trusts the
+on-disk counts rather than a fixed layout: it reads what we model and `ignore`s
+the surplus. A FIB from a newer Word that appends fields still parses — the
+version dispatch picks the matching `FibRgFcLcb*` (or `FibRgFcLcb2007` for a
+newer-than-2007 `nFib`), and the `FcLcb` block is copied **clamped** to
+`min(sizeof(layout), cbRgFcLcb·8)`, so extra trailing entries are ignored and a
+shorter block leaves the remainder zero (the `clx`/`fcClx` we need lives in the
+`FibRgFcLcb97` base, always covered). The `csw`/`cslw` counts must still cover
+the arrays we read, else they throw.
+
+**Fail early on malformed input** (matches the sibling `.ppt` parser). We
+**throw** on: an `nFib` below `nFib97` or an unknown `nFibNew` (newer-than-2007
+`nFib` does **not** throw — it uses the 2007 layout); a `ccpText` with the sign
+bit set (§2.5.5); a `csw`/`cslw` count too small to cover the array we read; an
+unexpected lead byte while walking the Clx (anything other than `0x01`/`0x02`);
+a piece table whose CP boundaries are not ascending; a compressed byte outside
+`0x00–0xFF` or an early EOF while decoding. We **pass through** for things we
+don't model: text after the main body, the `Prc` formatting runs, and every
+control/field character `clean_text` drops.
+
+**Endianness.** Host byte order / LSB-first bit-fields assumed; shared `oldms/`
+assumption, analysis and fix plan in [`../AGENTS.md`](../AGENTS.md).
+
+## Tests
+
+- `OldMs.doc_read_string_compressed` — the compressed (1-byte-per-CP) decoder
+  against the §2.9.73 byte map: ASCII passthrough, the `0x82–0x9F` remap, the
+  `0xA0–0xFF` UTF-8 round-trip.
+
+The FIB-robustness behaviours (negative `ccpText` rejected, newer-than-2007
+`nFib` falling back to the 2007 layout) and the `0x0C` page-break emission are
+**not yet unit-tested**; there is also **no assertion-based render test** over a
+real `.doc` fixture (unlike the `.ppt` cases).
+
+---
+
+# Open work
+
+## 1. Character (font) formatting → the IR (the next feature)
+
+**Goal.** Extract per-run character properties (font name, size, bold, italic,
+underline, strikethrough, colour, highlight) and surface them through the
+abstract model's `TextStyle`, so the HTML renderer styles text instead of
+emitting one flat 11pt run. This replaces the `font_size = 11pt` placeholder in
+`doc_document.cpp`.
+
+`TextStyle` (`src/odr/style.hpp`) maps almost 1:1 onto the `.doc` character
+SPRMs:
+
+| `TextStyle` field   | SPRM (opcode)            | operand → value                                            |
+|---------------------|--------------------------|------------------------------------------------------------|
+| `font_size`         | `sprmCHps` (0x4A43)      | u16 **half-points** → `Measure(hps/2.0, pt)` (default 20 = 10pt) |
+| `font_weight`       | `sprmCFBold` (0x0835)    | `ToggleOperand` → `FontWeight::bold` when on               |
+| `font_style`        | `sprmCFItalic` (0x0836)  | `ToggleOperand` → `FontStyle::italic` when on              |
+| `font_underline`    | `sprmCKul` (0x2A3E)      | `Kul` value, `0x00` = none → `bool`                        |
+| `font_line_through` | `sprmCFStrike` (0x0837)  | `ToggleOperand` → `bool`                                   |
+| `font_color`        | `sprmCCv` (0x6870)       | `COLORREF` → `Color`; legacy `sprmCIco` (0x2A42) is a palette index |
+| `background_color`  | `sprmCHighlight` (0x2A0C)| `Ico` highlight index → `Color`                            |
+| `font_name`         | `sprmCRgFtc0` (0x4A4F)   | s16 index into `SttbfFfn` → font name (intern it; see below) |
+
+`font_name` is a `const char *`, so the resolved name needs stable storage —
+intern it in the `ElementRegistry` (e.g. a `std::deque<std::string>` whose
+elements never move) and hand out the pointer.
+
+**How `[MS-DOC]` stores & retrieves character properties** — the authoritative
+algorithm is **Direct Character Formatting** (§2.4.6.2), which reuses the
+*Retrieving Text* walk we already have:
+1. For a character at `cp`, run *Retrieving Text* (§2.4.1) to get its byte
+   offset `fc` in `/WordDocument` and the owning `Pcd` (we already compute both).
+2. Read the **`PlcBteChpx`** (§2.8.5) at `fcPlcfBteChpx`/`lcbPlcfBteChpx` in the
+   table stream — a PLC keyed by **stream offset**: `aFC[n+1]` boundaries +
+   `aPnBteChpx[n]` (`PnFkpChpx`, 4 bytes each).
+3. Find the largest `i` with `aFC[i] ≤ fc`; read a **`ChpxFkp`** (§2.9.33) at
+   `aPnBteChpx[i].pn * 512` in `/WordDocument` (a fixed 512-byte page: `rgfc`
+   run boundaries, parallel `rgb` offsets, `crun` in the last byte).
+4. Find the largest `j` with `rgfc[j] ≤ fc`; the `Chpx` (§2.9.32) lives at
+   `rgb[j] * 2` within the page. `Chpx.grpprl` is an array of **`Prl`** = `Sprm`
+   (2 bytes) + operand.
+5. Append the `Pcd.Prm` modifications (§2.9.214–216): a `Prm0` (inline) or
+   `Prm1` (index) carrying extra SPRMs for this run.
+
+`Prl`/`Sprm` is already modelled in `doc_structs.hpp` (`Sprm` with
+`ispmd/fSpec/sgc/spra` and `operand_size()`); a **character** property is a SPRM
+with `sgc == 2`. Walk each `Chpx.grpprl` by reading a 2-byte `Sprm` then
+`operand_size()` operand bytes (note `spra == 6` is length-prefixed/variable),
+keeping only the opcodes above.
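+
+For orientation, the same split done with shifts and masks. This is a sketch
+independent of the `Sprm` struct in `doc_structs.hpp`, with hypothetical names
+(`SprmFields`, `split_sprm`, `fixed_operand_size`, `is_character_sprm`); the
+bit layout assumed is `ispmd` (9 bits), `fSpec` (1), `sgc` (3), `spra` (3),
+from least to most significant:
+
+```cpp
+#include <cstdint>
+
+struct SprmFields {
+  std::uint16_t ispmd; // property id within its group
+  bool f_spec;         // needs special handling
+  std::uint8_t sgc;    // group: 1 paragraph, 2 character, 3 picture, ...
+  std::uint8_t spra;   // operand size class
+};
+
+constexpr SprmFields split_sprm(std::uint16_t opcode) {
+  return {static_cast<std::uint16_t>(opcode & 0x01FF),
+          ((opcode >> 9) & 0x1) != 0,
+          static_cast<std::uint8_t>((opcode >> 10) & 0x7),
+          static_cast<std::uint8_t>((opcode >> 13) & 0x7)};
+}
+
+// Operand size in bytes for a given spra; -1 means variable (spra == 6), in
+// which case the operand carries its own length prefix.
+constexpr int fixed_operand_size(std::uint8_t spra) {
+  switch (spra) {
+  case 0: // toggle
+  case 1:
+    return 1;
+  case 2:
+  case 4:
+  case 5:
+    return 2;
+  case 3:
+    return 4;
+  case 7:
+    return 3;
+  default:
+    return -1;
+  }
+}
+
+constexpr bool is_character_sprm(std::uint16_t opcode) {
+  return split_sprm(opcode).sgc == 2;
+}
+```
+
+For example, `sprmCHps` (0x4A43) splits into `sgc == 2`, `spra == 2`, i.e. a
+character property with a 2-byte operand.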
+
+**First cut — direct formatting only.** Implement §2.4.6.2 (`Chpx.grpprl` +
+`Pcd.Prm`) and map the table's SPRMs. Captures the common case: bold/italic/
+size/font/colour applied directly to runs. Resolve `sprmCRgFtc0` by reading
+**`SttbfFfn`** (§2.9.286) once at `fcSttbfFfn`/`lcbSttbfFfn` (an STTB of `FFN`
+records; `FFN.xszFfn` is the UTF-16 font name) and indexing it. Drop the
+hardcoded 11pt; use 10pt (the `sprmCHps` default of 20 half-points).
+
+**Full fidelity — styles (later).** *Determining Formatting Properties*
+(§2.4.6.6) layers, in order: document defaults → `STSH` (§2.4.6.5,
+`fcStshf`/`lcbStshf`) paragraph- and character-style `grpprl`s resolved via the
+paragraph's `istd` → table-style props → direct paragraph → direct character.
+The first cut skips the STSH layer, so style-dependent props fall back to
+defaults; wiring the STSH closes that gap.
+
+**Wiring to the abstract model.** Today `parse_tree` concatenates all body
+pieces into one `body_text` and emits one `text` per paragraph. Per-run styling
+needs run boundaries, expressed in `/WordDocument` byte offsets
+(`ChpxFkp.rgfc`, `PlcBteChpx.aFC`) — so:
+1. **Keep the FC↔text mapping.** While concatenating, retain each piece's
+   `data_offset` and compression so any character's source `fc` is recoverable
+   (the `CharacterIndex` already holds this; thread it through instead of
+   discarding it after building `body_text`).
+2. **Split paragraphs into runs.** Within a paragraph, cut at every `ChpxFkp`
+   run boundary inside it, resolve each run's `TextStyle` once, and emit a
+   **`span`** (`ElementType::span`, already wired via `SpanAdapter`) per run,
+   with the `text` element(s) as its children. Paragraph/line-break splitting
+   stays as-is.
+3. **Store the style.** Add a `TextStyle` side-map to `ElementRegistry` keyed by
+   span id (mirror the text side-payload, and the frame-payload pattern in
+   `presentation`) plus the font-name intern store. `SpanAdapter::span_style`
+   returns the stored style; `text_style` / `paragraph_text_style` then return
+   `{}` (or the paragraph mark's run style) instead of the 11pt hack.
+
+## 2. Coverage gaps
+
+- **Only the main document body.** `parse_tree` stops at the `ccpText` budget,
+  so headers/footers, footnotes, endnotes, comments/annotations, and text boxes
+  — each its own CP range after the body (`ccpFtn`, `ccpHdd`, `ccpAtn`, … in
+  FibRgLw97, located via the matching `plcf*` in the table stream) — are
+  dropped. Extending coverage means walking the later CP ranges and their
+  `Plcf*` structures.
+- **Tables.** Cell text renders as plain paragraphs: the end-of-cell mark `0x07`
+  is dropped by `clean_text` and row/cell structure (§2.4.3, `sprmPFInTable` /
+  `sprmPTtp` / the `TC`/`TAP` tables) is unmodelled. Reconstruct table structure
+  from the paragraph properties to emit real `table`/`row`/`cell` elements.
+  Paragraph-level formatting (alignment, indent, spacing) via `PlcBtePapx` →
+  `PapxFkp` belongs here too, alongside the character work.
+- **Fields show only the cached result.** `clean_text` keeps the field *result*
+  and drops the *instruction* (§2.8.25); page numbers, dates, refs show their
+  last-saved value and are never evaluated. Acceptable for "visible text".
+- **Images / OLE / drawn objects.** The anchor characters (`0x01` inline
+  picture, `0x08` floating picture, OLE) are dropped. No image extraction; would
+  require `PlcfSpa` / the Office Art (`dggInfo`) drawing data.
+- **Encrypted / obfuscated documents.** `FibBase.fEncrypted` / `fObfuscated` are
+  parsed but not acted on; `decrypt` throws `UnsupportedOperation`.
+  XOR-obfuscated and `[MS-OFFCRYPTO]`-encrypted `.doc` are unsupported.
+
+## 3. Smaller shortcomings
+
+- **Endianness.** Shared `oldms/` shortcoming — see [`../AGENTS.md`](../AGENTS.md).
+  For `.doc`: every field is read in host byte order, and the
+  `FibBase`/`Sprm`/`FcCompressed` bit-fields in `doc_structs.hpp` assume
+  LSB-first allocation.
+
+## Reference: the read path
+
+```
+WordDocument stream
+└─ FIB @ 0                                   [MS-DOC] §2.5.1
+   ├─ FibBase (32 B): fWhichTblStm, fEncrypted, nFib
+   ├─ csw·u16  fibRgW
+   ├─ cslw·u32 fibRgLw  → ccpText (idx 6–7)  §2.5.5
+   ├─ cbRgFcLcb·FcLcb fibRgFcLcb → clx.fc    §2.5.7 (version by nFib)
+   └─ cswNew·u16 fibRgCswNew → nFibNew
+
+Table stream (/1Table or /0Table per fWhichTblStm)   §1.4
+└─ Clx @ clx.fc                               §2.9.38
+   ├─ RgPrc: 0..n Prc (lead 0x01, skipped)    §2.9.209
+   └─ Pcdt  (lead 0x02)                        §2.9.178
+      └─ PlcPcd: aCp[n+1] + aPcd[n] (Pcd)      §2.8.35 / §2.9.177
+         └─ Pcd.fc = FcCompressed              §2.9.73
+            ├─ fCompressed=0 → UTF-16 @ fc
+            └─ fCompressed=1 → 8-bit @ fc/2 (+ 0x82–0x9F map)
+
+Retrieving Text algorithm: §2.4.1 (steps 1–6, matches parse_tree)
+Field characters 0x13/0x14/0x15: §2.8.25
+```
+
+Character-formatting path (open work §1), keyed by `/WordDocument` byte offset
+`fc`:
+
+```
+Table stream
+├─ PlcBteChpx @ fcPlcfBteChpx                  §2.8.5
+│  └─ aFC[n+1] (stream offsets) + aPnBteChpx[n] (PnFkpChpx, 4 B)
+├─ SttbfFfn @ fcSttbfFfn  (font names, FFN.xszFfn)   §2.9.286
+└─ STSH @ fcStshf  (styles — full fidelity only)     §2.4.6.5
+
+WordDocument stream
+└─ ChpxFkp @ aPnBteChpx[i].pn * 512  (512-byte page)  §2.9.33
+   ├─ rgfc[crun+1] run boundaries (stream offsets)
+   ├─ rgb[crun] → Chpx @ rgb[j]*2 within page
+   └─ crun (last byte)
+      └─ Chpx = cb + grpprl(Prl[])               §2.9.32
+         └─ Prl = Sprm (2 B) + operand            §2.2.x
+            └─ character SPRMs have sgc == 2; + Pcd.Prm  §2.9.214–216
+
+Direct Character Formatting: §2.4.6.2  (Determining Formatting Properties: §2.4.6.6)
+Font SPRMs: CHps 0x4A43, CFBold 0x0835, CFItalic 0x0836, CKul 0x2A3E,
+            CFStrike 0x0837, CCv 0x6870, CHighlight 0x2A0C, CRgFtc0 0x4A4F
+```
diff --git a/src/odr/internal/pdf/AGENTS.md b/src/odr/internal/pdf/AGENTS.md
new file mode 100644
index 00000000..1cfa284f
--- /dev/null
+++ b/src/odr/internal/pdf/AGENTS.md
@@ -0,0 +1,450 @@
+# In-house PDF support (`pdf/`) — status, design & roadmap
+
+What the `pdf/` module does **today**, the **design decisions** behind it, and
+the **staged roadmap** for turning it into a faithful renderer. Reference links
+(web resources; offline spec docs are planned) live in [`README.md`](README.md).
+
+This is the `DecoderEngine::odr` path for PDF; the sibling `../pdf_poppler/`
+module (poppler / pdf2htmlEX, behind `ODR_WITH_PDF2HTMLEX`) is the
+production-quality alternative engine.
+
+**Scope today.** Parse the PDF object/file structure (classic cross-reference
+tables, cross-reference streams, object streams, hybrid files), build the page
+tree with fonts and annotations, tokenize page content streams into graphics
+operators, and emit a **proof-of-concept HTML rendering**: absolutely positioned
+text spans per `Tj`, pages sized from `MediaBox`. Encrypted files are decrypted
+(RC4, AES-128, AES-256). No graphics, no images, no font files. Experimental and
+not production-quality — the HTML path still contains debug `std::cout` output.
+
+---
+
+## What works
+
+- `.pdf` is detected by file magic and opened as `PdfFile`
+  (`DecoderEngine::odr`); `is_decodable()` returns `false` and `file_meta()`
+  carries only the file type. All parsing is lazy, on HTML request.
+- **Object syntax**: null, booleans, integers/reals, names (incl. `#xx`
+  escapes), literal strings (`\` and `\ooo` escapes), hex strings, arrays,
+  dictionaries, indirect references (`n g R`) — standalone and nested.
+- **File structure**: header, `n g obj … endobj`, `stream` payloads (via
+  `/Length`, with a scan-to-`endstream` fallback), classic `xref` tables,
+  `trailer`, `startxref`, `%%EOF`; both sequential reading (`read_entry`) and
+  random access via the xref table. **Incremental updates**: `startxref` found
+  by scanning the file tail, then the `Prev` chain is followed (cycle-guarded),
+  merging xref tables so the newest entry for each object wins.
+- **Cross-reference streams, object streams, hybrid files** (PDF 1.5+): each
+  trailer-chain section may be a classic table or a cross-reference stream
+  (`/W`/`/Index`/`/Size`, decoded via the filter framework, entry types 0/1/2;
+  unknown types treated as absent). Xref entries are a tagged union
+  (`FreeEntry`/`UsedEntry`/`CompressedEntry`); compressed objects are read from
+  their object stream (`/N`/`/First` header, decoded once and cached per
+  stream). Hybrid files follow the `XRefStm`-before-`Prev` lookup order.
+  Lenient where the wild demands: `/Type /XRef` only warns, references to free
+  or absent objects resolve to null with a `Logger` warning, `n g obj` need not
+  end with a newline.
+- **Page tree**: `Catalog` → `Pages` (recursive) → `Page` with per-page
+  `Resources` (fonts only) and `Annots` (raw dictionary only). Objects cached by
+  reference (`DocumentParser::m_objects`).
+- **Inherited page attributes**: the inheritable set per ISO 32000-1 Table 30
+  (`Resources`, `MediaBox`, `CropBox`, `Rotate`), resolved by threading an
+  accumulator down the `Pages` recursion (no `Parent` walk). Each `Page` carries
+  the resolved `media_box`/`crop_box`/`rotate` and its resolved `resources`.
+  Lenience: `CropBox` defaults to `MediaBox`, `Rotate` normalized to
+  {0,90,180,270}, a `MediaBox` missing everywhere falls back to US Letter, a
+  missing `Resources` to an empty dict — all with a `Logger` warning.
+- **Stream filters** (`pdf_filter`): `/Filter` and `/DecodeParms` honoured,
+  including chains and the inline-image abbreviations — FlateDecode and
+  LZWDecode (both with TIFF and PNG predictors), ASCIIHexDecode, ASCII85Decode,
+  RunLengthDecode. Image codecs (DCTDecode, JPXDecode, CCITTFaxDecode,
+  JBIG2Decode) are deliberately not decoded: `decode()` stops and hands back the
+  still-encoded payload for stage 4; `read_decoded_stream` treats them as an
+  error. The `Crypt` filter passes through only as `Identity`.
+- **Encryption** (`pdf_encryption`): the standard security handler. An
+  `Authenticator` parses `/Encrypt` and authenticates the password (user then
+  owner; the empty password is tried first, so owner-locked files open
+  transparently), producing a `Decryptor` that decrypts object strings and
+  streams. RC4 (V 1/2, R 2/3, 40–128 bit),
+  AES-128 crypt filters (V 4, R 4 — `StdCF` with `V2`/`AESV2`, `Identity`,
+  honouring `StmF`/`StrF`) and AES-256 (V 5, R 6, AESV3) are all supported,
+  including owner-only files and `EncryptMetadata false`. Streams are decrypted
+  before `/Filter` decoding; cross-reference streams and object-stream members
+  are left untouched. The user password is never retained: once `authenticate`
+  succeeds, the derived key lives only inside the `Decryptor` (no accessor), and
+  `PdfFile` carries the whole authenticated `Decryptor` forward — from the
+  encryption probe to the render parse — so the HTML service unlocks the
+  document without re-deriving the key. Permission bits (`/P`) are recorded, not
+  enforced.
+- **Fonts / text mapping**: a font's `ToUnicode` CMap stream is decoded and
+  parsed; `bfchar` mappings with 1-byte glyph codes and single UTF-16 units are
+  applied. Unmapped glyphs pass through as their byte value.
+- **Content streams**: the full graphics-operator vocabulary is tokenized;
+  `GraphicsState` executes a subset (state stack `q`/`Q`, matrices `cm`/`Tm`,
+  line parameters, text state `Tc`/`Tw`/`Tz`/`TL`/`Tf`/`Tr`/`Ts`, text
+  positioning `Td`/`TD`, grey/RGB/CMYK colors, glyph metrics `d0`/`d1`). Unknown
+  operators are logged to stderr and skipped.
+- **HTML**: one `document.html` view; each page is a `div` sized from `MediaBox`
+  (points → inches), each `Tj` becomes an absolutely positioned `span` at the
+  text-state offset with `font-size` from `Tf` and the CMap-translated text.
+  `TJ`/`'`/`"` are recognized but only printed to stdout, not rendered.
+
+## Module layout
+
+| File (`pdf/`)                          | Role                                                  |
+|----------------------------------------|-------------------------------------------------------|
+| `pdf_object.{hpp,cpp}`                 | Object model: `Object` (`std::any`-based variant), `Array`, `Dictionary`, `Name`, `StandardString`/`HexString`, `ObjectReference`; `to_stream`/`to_string` dumping |
+| `pdf_object_parser.{hpp,cpp}`          | Tokenizer over `std::streambuf`: whitespace/lines, numbers, names, strings, arrays, dictionaries, references |
+| `pdf_file_object.{hpp,cpp}`            | File-structure entries: `Header`, `IndirectObject`, `Trailer`, `Xref` (tagged-union entries, `append`/`merge_hybrid`), `StartXref`, `Eof`, the `Entry` any-holder; `parse_xref_stream_table` and the `ObjectStream` payload wrapper |
+| `pdf_file_parser.{hpp,cpp}`            | File-level reads on top of `ObjectParser`: indirect objects, xref, trailer, startxref, stream payloads, `seek_start_xref` |
+| `pdf_filter.{hpp,cpp}`                 | Stream filter framework: `decode()` over the `/Filter`/`/DecodeParms` chain; ASCIIHex/ASCII85/LZW/Flate/RunLength decoders, TIFF/PNG predictors; image codecs returned undecoded (`DecodeResult::stopped_at_filter`) |
+| `pdf_document_parser.{hpp,cpp}`        | `parse_document()`: xref/trailer chain → catalog → page tree; lazy object reads with cache; (deep) reference resolution |
+| `pdf_encryption.{hpp,cpp}`             | Standard security handler: `Authenticator` (parse `/Encrypt`, authenticate password → `Decryptor`) and `Decryptor` (decrypt strings/streams; RC4, AES-128, AES-256), plus a `standard_security` namespace of pure key/password algorithms for known-answer tests |
+| `pdf_document.hpp`                     | `Document`: arena of `Element`s + `catalog` pointer |
+| `pdf_document_element.hpp`             | Element structs: `Catalog`, `Pages`, `Page`, `Annotation`, `Resources`, `Font` |
+| `pdf_cmap.{hpp,cpp}`                   | `CMap`: 1-byte glyph → UTF-16 `bfchar` map + string translation |
+| `pdf_cmap_parser.{hpp,cpp}`            | `ToUnicode` CMap stream parser (`begincodespacerange`/`beginbfchar`/`beginbfrange`; only `bfchar` applied) |
+| `pdf_graphics_operator.hpp`            | `GraphicsOperatorType` enum (full operator set) + `GraphicsOperator` (type + `Object` arguments) |
+| `pdf_graphics_operator_parser.{hpp,cpp}` | Content-stream tokenizer: arguments then operator name |
+| `pdf_graphics_state.{hpp,cpp}`         | `GraphicsState`: stack of `State` (general/path/text/color), `execute(op)` for the modelled subset |
+| `pdf_file.{hpp,cpp}`                   | `abstract::PdfFile` wrapper; probes encryption at construction and implements `password_encrypted()`/`decrypt()`, carrying the authenticated `Decryptor` (not the password) so rendering needs no re-derivation |
+
+Consumers outside the module: `open_strategy.cpp` (detection / engine
+selection) and `html/pdf_file.cpp` (`create_pdf_service`).
+
+## Pipeline: how a `.pdf` becomes HTML
+
+1. **Wiring.** `open_strategy` maps `FileType::portable_document_format` to
+   `PdfFile`; `DecoderEngine::poppler` (or the unknown-file-type fallback) can
+   yield a `PopplerPdfFile` instead when built with `ODR_WITH_PDF2HTMLEX`.
+   `html::translate(PdfFile)` picks the matching HTML service.
+2. **Locate the xref.** `seek_start_xref` seeks to `EOF − 64`, scans for
+   `startxref`; `read_start_xref` yields the most recent xref offset.
+   (`read_header` exists but `parse_document` does not call it — the `%PDF-`
+   header is only checked by magic detection earlier.)
+3. **Walk the trailer chain.** `read_xref_section` dispatches: a classic table
+   (`read_xref` + `read_trailer`) or a cross-reference stream (an indirect
+   object whose dictionary doubles as the trailer dict; payload decoded via the
+   filter framework, entries via `parse_xref_stream_table`). A trailer `XRefStm`
+   (hybrid file) is read next and fills entries the classic table lacks or marks
+   free (`merge_hybrid`). Sections merge into the accumulated table
+   (`std::map::insert` keeps the first/newest entry), then `Prev` is followed
+   (cycle-guarded). The first/newest trailer provides `Root`.
+4. **Build the page tree.** `parse_catalog` → `parse_pages` recurses over
+   `Kids` (dispatching on `Type`). Each `Page` keeps its raw dictionary, its
+   `Contents` reference(s), parsed `Resources` (the `Font` table; each font's
+   `ToUnicode` CMap is parsed if present) and `Annots` (raw). `read_object`
+   dispatches on the xref entry kind: used → seek + `read_indirect_object`;
+   compressed → owning object stream decoded once, cached, member parsed from
+   the cached payload; free/absent → null with a warning. Parsed objects cached
+   by reference.
+5. **Decode content.** Per page (depth-first), the `Contents` streams are read,
+   decoded through their `/Filter` chain (`read_decoded_stream`), concatenated
+   with a newline between streams.
+6. **Execute and emit.** `GraphicsOperatorParser` tokenizes; `GraphicsState`
+   updates the state stack. `T*` advances the text offset by `size + leading`;
+   `Tj` emits a positioned `span` using `state.text.offset` and the `Tf` size,
+   glyphs translated through the font's CMap. The text and transform matrices
+   are tracked but **not applied** to positioning.
+
+---
+
+## Design decisions
+
+**Stream-based parsing with seeks, lazy object access.** Everything is parsed
+off a `std::istream`/`std::streambuf` — no full-file buffer. Random access
+(xref lookups, stream payloads) seeks; sequential tokenizing uses
+single-character peek/bump (`geti`/`getc`/`bumpc`). Objects are parsed only when
+referenced, and parsed `IndirectObject`s are cached by reference, so shared
+objects are read once. Positions are `std::uint32_t` (files ≥ 4 GiB are out of
+scope).
+
+**`std::any`-based object model.** `Object` holds its value in `std::any` with
+typed `is_*`/`as_*` accessors (mirrors `oldms/`'s `Entry`). Pro: one value type
+throughout parser, document elements, and operator arguments. Con: RTTI lookups,
+no exhaustive matching, and easy accidental copies; `resolve_object_copy`
+exists because rvalue access proved fiddly (see the `TODO why rvalue not
+working?` in `pdf_document_parser.cpp`).
+
+**References are recognized by lookahead.** `n g R` is plain integers until the
+`R` appears, so `read_array`/`read_dictionary` patch references after the fact.
+A standalone `read_object` therefore returns the *id* integer of a reference —
+only array/dictionary contexts and `read_object_reference` assemble real
+references. Works for well-formed files; a known sharp edge (`TODO this seems
+hacky`).
+
+**Element tree as an arena.** `Document` owns all elements
+(`vector<unique_ptr<Element>>`); `Catalog`/`Pages`/`Page`/… hold raw non-owning
+pointers plus their original dictionary (`Element::object`), so unmodelled keys
+stay inspectable. Navigation is by typed `is_<T>()`/`as_<T>()` accessors over
+`kids` — thin `dynamic_cast` wrappers mirroring `Object`'s `is_*`/`as_*`
+surface (the former `Type` tag enum was dropped in favour of RTTI).
+
+**Fail early on malformed structure, tolerate unknown content.** Structural
+surprises **throw** `std::runtime_error` (missing `obj`/`endobj`/`stream`/
+`endstream`/`xref`/`startxref`, unexpected characters, an unresolvable
+`/Length`, an unknown page-tree element type, stream exhaustion). Unknown
+**content** is tolerated: unrecognized operators logged and skipped, unmodelled
+operators ignored by `execute`, annotations keep their raw dictionary, CMap
+`codespacerange`/`bfrange` parsed past without effect. References to free/absent
+objects resolve to null with a warning; unknown xref-stream entry types treated
+as absent (ISO 32000-1 §7.5.8.3).
+
+**Debug output still in place.** `html/pdf_file.cpp`, `pdf_graphics_state.cpp`,
+`pdf_graphics_operator_parser.cpp` and `pdf_cmap_parser.cpp` print diagnostics
+(and one leftover `"hi"` breakpoint marker) to stdout/stderr instead of
+`Logger`. Proof-of-concept residue; should move to `Logger` or be removed.
+`DocumentParser` itself takes an optional `Logger &` (default `Logger::null()`)
+and routes its warnings through it — new diagnostics should do the same.
+
+---
+
+## Tests
+
+- `test/src/internal/pdf/pdf_filter.cpp` — **assertion-based**, all inputs
+  inline strings: every decoder, predictors, chains, image-codec stop,
+  `Crypt`/unknown-filter errors.
+- `test/src/internal/pdf/pdf_file_object.cpp` — **assertion-based**, inline
+  only: cross-reference-stream entry decoding (field widths incl. 0, type
+  default, big-endian fields, subsections, unknown types, error paths),
+  `ObjectStream` header parsing and member lookup, `Xref::append` /
+  `Xref::merge_hybrid` precedence.
+- `test/src/internal/pdf/pdf_encryption.cpp` — **assertion-based**, inline
+  vectors only: the standard security handler across R 2 (RC4-40), R 3
+  (RC4-128), R 4 (AES-128/AESV2, incl. `EncryptMetadata false` and an
+  owner-locked file) and R 6 (AES-256). Vectors come from the real fixtures and
+  from `qpdf --encrypt` output frozen as literals — decrypting back to a known
+  marker, so no test is circular and no fixture file ships.
+  `crypto_util_test.cpp` covers the new MD5/RC4/SHA-384/512 primitives against
+  public standard vectors.
+- `test/src/internal/pdf/pdf_document_parser.cpp` — **assertion-based**
+  whole-file tests over mini-PDFs assembled by the test-only
+  `pdf_test_file_builder.{hpp,cpp}` (computes xref offsets/`startxref`, so tests
+  show only the dictionaries; classic-table and uncompressed-xref-stream
+  variants), plus inherited-page-attribute coverage (a multi-level `Pages` tree:
+  per-page resolved `MediaBox`/`CropBox`/`Rotate`/`Resources`, override vs.
+  inheritance, the `CropBox` ← `MediaBox` default, the missing-`MediaBox`
+  US-Letter lenience). End-to-end: the classic fixture
+  `odr-public/pdf/style-various-1.pdf`, plus decryption of
+  `odr-public/pdf/Casio_WVA-M650-7AJF.pdf` (RC4, empty password) and
+  `odr-private/pdf/encrypted_fontfile3_opentype.pdf` (AES-256; skipped when the
+  private submodule is absent). The `odr-private` xref-stream/objstm/hybrid
+  fixtures (`basic_text.pdf`, `geneve_1564.pdf`, `test_fail.pdf`, `Kayla….pdf`,
+  `svg_background…issue402.pdf`, `Core_v5.1.pdf`, `onepage.pdf`) were verified
+  manually but are not pinned in unit tests. Also still contains the original
+  print-everything smoke test.
+- `test/src/internal/pdf/pdf_file_parser.cpp` — sequential `read_entry` walk
+  (smoke) + assertion-based xref/trailer/root navigation over
+  `style-various-1.pdf`.
+
+No assertion-based coverage of the tokenizer (escapes, references, hex strings),
+the CMap, or the HTML output.
+
+---
+
+# Roadmap
+
+Goal: faithful read-only HTML for common real-world PDFs through the odr engine,
+so the poppler/pdf2htmlEX engine becomes optional rather than required. Stages
+are ordered by what they unlock; 0–2 are roughly sequential, 3 and 4 are
+independent, 5 builds on whatever pages already render. Each stage gets its own
+detailed design before implementation.
+
+## Stage 0 — file-format compatibility (prerequisite) — **mostly done**
+
+Modern producers write PDF 1.5+ structures the original parser rejected.
+Cross-reference/object streams + hybrid files, the filter framework (incl. PNG
+predictors), inherited page attributes, and encryption (RC4 / AES-128 / AES-256)
+are **all implemented** (see *What works*). The main remaining piece:
+
+**Xref recovery for broken files** (post-stage-0; the WP2 code left room):
+- Trigger: any structural throw during xref-chain walking or a failed object
+  lookup (`startxref` missing/garbage, offsets wrong).
+- Recovery: a single forward scan for `n g obj` line starts (the existing
+  sequential `read_entry` machinery is most of this), building a synthetic
+  `Xref` (last definition of an id wins), collecting `trailer` dicts and
+  `/Type /Catalog` objects as `Root` candidates; objstm members indexed by
+  scanning recovered object streams.
+- Tests fit inline strings well: the scan ignores xref offsets, so a broken
+  mini-PDF needs no offset bookkeeping — write a literal with a garbage
+  `startxref`, duplicate ids, or a missing trailer, and assert what got rebuilt.
+  Real-world fixture: `odr-private/pdf/order-EK52VKL0.pdf` — an HTTP response
+  accidentally saved as `.pdf` (starts with `HTTP/1.0 200 OK`).
+
+Remaining encryption edge cases (deferred until a real file needs them):
+per-stream `/Crypt` filter `Name` overrides, the `EncryptMetadata false`
+metadata-stream `Identity` special case, and `Perms` validation (ISO 32000-2
+Algorithm 13); the public-key security handler and R 5 are out of scope.
+
+## Stage 1 — text extraction: the code → Unicode chain
+
+PDF strings are **character codes**; per font, walk this chain and record
+per-code Unicode (or "unknown", which stage 3 handles):
+
+1. **`ToUnicode` CMap** — extend the existing `CMap`: `bfrange`,
+   `codespacerange` (multi-byte codes), multi-character targets.
+2. **Simple fonts**: `/Encoding` base (WinAnsi/MacRoman/Standard) +
+   `/Differences` → glyph names → Unicode via the Adobe Glyph List (incl.
+   `uniXXXX`/`uXXXXXX` names).
+3. **Composite (Type0/CID) fonts**: `Identity-H/V` plus the predefined CMaps
+   (CJK); map CID → Unicode via the CID system info where defined.
+4. **Embedded font fallback** (needs stage 3's font *reading*): reverse the
+   TrueType `cmap`; read glyph names from Type1 `CharStrings` keys/CFF charsets.
+5. Nothing applies → mark the run "no Unicode" for stage 3's re-encoding.
+
+`/ActualText` (tagged PDFs, ligatures) overrides the whole chain for extraction.
+
+## Stage 2 — text positioning & metrics
+
+Independent of Unicode work; fixes layout even with today's partial CMaps.
+
+- Apply the full transform: text matrix × CTM (both tracked in `GraphicsState`
+  but never applied), text rise, horizontal scaling.
+- **Glyph advances**: `/Widths` + `/MissingWidth` (simple), `/W` + `/DW` (CID),
+  char/word spacing, the numeric adjustments in `TJ` — so `TJ`, `'`, `"` finally
+  render and `Tj` runs land correctly.
+- **Form XObjects** (`Do` on a `/Form`): recursive content-stream execution with
+  scoped `/Resources` and the form matrix. Many producers put most page content
+  inside forms, and tiling patterns (stage 4) and annotation appearances
+  (stage 5) run on the same machinery — a structural prerequisite.
+- **Text render modes** (`Tr`): mode 3 (invisible text, OCR-over-scan) must stay
+  selectable but unpainted; stroke/clip modes (1–2, 4–7) need graceful
+  degradation.
+- **Space inference**: PDFs routinely encode no spaces; insert them from
+  glyph-gap heuristics (as pdf2htmlEX does) so copy/paste and search work.
+- Layout side of bidi (RTL run ordering) and vertical writing (Identity-V/CJK).
+- HTML mapping decision: per-run spans with CSS `transform` (cheap, breaks on
+  heavy kerning) vs. per-glyph positioning (exact, verbose) — likely per-run
+  with a kerning threshold that splits runs, like pdf2htmlEX.
+
+## Stage 3 — fonts in HTML
+
+Needed for visual fidelity regardless of text extraction.
+
+**Decision (2026-06): in-house, no FontForge.** pdf.js proves complete font
+conversion is doable without a font library; pdf2htmlEX uses FontForge at the
+cost of a notoriously heavy build. No trimmed off-the-shelf alternative does
+what we need (FreeType/stb_truetype are read-only; hb-subset can only subset
+along the *existing* `cmap`, so it cannot inject the PUA mappings below).
+Expected ~5–8k lines of focused C++ — on the order of an `oldms/` module.
+Reading (SFNT tables, CFF charsets) is the easy part and is needed by stage 1.4
+anyway.
+
+**Architecture: IR for facts, pass-through for glyphs.** No glyph-level font IR:
+decompiling and recompiling outlines is the FontForge model — loses hinting,
+risks fidelity, and with one output format (SFNT) the M×N payoff never
+materializes. Glyph data (outlines, hinting, charstrings) passes through
+byte-for-byte; even Type1 → Type2 charstrings is a direct sibling-format
+translation. What *is* shared: a thin `FontProgram`-style interface — per-flavor
+readers producing the facts every consumer needs (glyph count, glyph → Unicode,
+advance widths, units-per-em, name, bbox, symbolic flag) with raw bytes kept
+alongside. Stage 1.4 reads Unicode from it, the OTF wrap synthesizes
+`head`/`hhea`/`hmtx`/`OS/2` from it, the re-encoder assigns PUA code points from
+its glyph count.
+
+**Intermediate milestone — fonts as first-class library citizens.** Before
+wiring fonts into PDF output, ship them standalone:
+- **Font files as a `DecodedFile` type** (precedent: `SvmFile`, `ImageFile`):
+  `FileType` entries + magic detection (SFNT `0x00010000`/`OTTO`/`wOFF`), a
+  `FontFile` category, and `html::translate(FontFile)` emitting a **specimen
+  page** — name/metrics header plus a glyph grid, font served via `@font-face`.
+  Keep the UI at "specimen page"; no font-editor scope creep.
+- The glyph grid must show **every** glyph, including ones no `cmap` reaches —
+  which forces building the PUA re-encoding, table-directory rebuild, and OTF
+  wrap *first*, against a directly viewable deliverable with font-only tests.
+- **In parallel: PDF as a container.** Expose embedded fonts as an
+  `abstract::Filesystem` (`/fonts/F1.ttf`, …) and reuse the filesystem HTML
+  service (as for ZIP/CFB). Doubles as the corpus harvester.
+
+Sub-stages, ordered by corpus frequency, each independently useful:
+1. **TrueType** (`FontFile2`, CIDFontType2 — bulk of modern PDFs): serve nearly
+   as-is via `@font-face`; implement the `cmap` rewrite (format-4/12 subtable,
+   splice the table directory, recompute `head.checkSumAdjustment`).
+2. **Bare CFF** (`FontFile3`/Type1C): wrap into an OTF container by synthesizing
+   the ~8 required tables; take advance widths from `/Widths`/`/W` rather than
+   interpreting charstrings.
+3. **Type1** (`FontFile` — older docs, pdfTeX/academic PDFs): `eexec`
+   decryption, Type1 → Type2 charstrings, build a CFF, reuse sub-stage 2. The
+   hardest single piece but precisely specified (Adobe T1 spec; pdf.js as
+   reference).
+4. **Type3** (drawing procedures, no font file — scientific plots) → SVG glyphs
+   reusing stage 4's path rendering; plus **non-embedded fonts**: substitute the
+   standard 14 + common names with CSS fallbacks + metrics from `/Widths`.
+
+Mechanisms and guards:
+- **Re-encoding for unmapped glyphs** (the general workaround): rewrite the
+  `cmap` so deterministic PUA code points (`U+E000 + glyph index`) map to the
+  glyphs, emit those in the HTML, mark such runs non-extractable
+  (`user-select: none`, `aria-hidden`). Display correct; copy/search knowingly
+  garbage. Option: re-encode *all* fonts this way (pdf2htmlEX's choice) for one
+  uniform pipeline.
+- **Broken-font long tail**: real embedded fonts are routinely malformed, and
+  browsers run web fonts through a sanitizer (OTS) that silently rejects them.
+  Regenerating the table directory (which the re-encode/wrap does anyway) covers
+  most of it; start strict, add repair heuristics as real files demand. CI gate:
+  run **OTS** over every produced font (test-time only); optionally FreeType as
+  a second oracle. Neither ships in the product.
+
+## Stage 4 — graphics
+
+**Decision (2026-06): SVG generation, no rasterizer.** pdf2htmlEX uses poppler
+to render non-text into a per-page background image; we generate SVG instead —
+serialization, not rendering. pdf.js proves the full PDF graphics model needs no
+native renderer. The PDF and SVG imaging models are close cousins (PostScript
+heritage), so the mapping is mostly mechanical. Trade-off: pdf2htmlEX gets the
+long tail right for free via poppler, while our fidelity is bounded by operator
+coverage — countered by the test oracle below. The rasterized-background
+fallback is **rejected**: it reintroduces exactly the renderer dependency this
+stage exists to avoid.
+
+- Vector content → inline SVG per page, layered under the text spans: paths,
+  fill rules, stroke parameters, transforms; clipping → nested `<clipPath>`;
+  tiling patterns → `<pattern>` (form-XObject machinery from stage 2);
+  axial/radial shadings (types 2/3) → `linearGradient`/`radialGradient`.
+- **Images**: `DCTDecode` → `<img>` JPEG pass-through; Flate/LZW raster → PNG
+  encode; inline images (`BI`/`ID`/`EI` — currently not even tokenized correctly
+  past `ID`); image masks and SMasks later.
+- **SVG residue** — where no 1:1 primitive exists; all at generation time, never
+  rasterization: mesh/function shadings (types 1, 4–7) → tessellate into small
+  flat polygons (pdf.js's approach); color spaces
+  (Separation/DeviceN/Indexed/Lab/ICC) → convert to RGB when emitting (sample
+  tint transforms, approximate ICC as sRGB, ignore overprint); transparency:
+  `CA`/`ca` → `opacity`, soft masks → `<mask>`, blend modes → `mix-blend-mode`;
+  isolated/knockout groups don't map cleanly — punt (rare).
+- **Renderer as test oracle, not dependency** (parallels stage 3's OTS gate):
+  render corpus fixtures with poppler or pdf.js in CI, screenshot our output,
+  perceptual-diff.
+
+## Stage 5 — interaction & navigation
+
+Builds on whatever pages render; needs stage 0 plus destinations from the page
+tree, little else.
+
+- **Links**: URI actions and internal `GoTo` destinations (incl. named) as `<a>`
+  overlays.
+- **Annotation appearances**: render `/AP` appearance streams (form XObjects
+  again) for highlights, stamps, form-field appearances; AcroForm
+  *interactivity* stays out of scope (read-only).
+- **Document outline** (`/Outlines`) → navigation anchors/sidebar.
+- **Optional content groups** (layers): honour default visibility; no toggle UI.
+- **Metadata** (`/Info`, XMP) into `file_meta()`.
+- **Output scaling**: monolithic HTML vs. per-page lazy loading for large
+  documents (check what odr's HTML service model already provides first).
+
+## Cross-cutting (any time)
+
+- Route diagnostics through `Logger` instead of stdout/stderr; drop the leftover
+  debug code (incl. the `"hi"` marker) in `html/pdf_file.cpp`.
+- Grow a corpus: `odr-public` fixtures, the PDF101 "nasty files" collection
+  linked in `README.md`; assertion-based tests per stage.
+- Spec docs offline under `offline/documentation/PDF/` (ISO 32000-1:2008, ISO
+  32000-2:2020, Adobe PDF Reference 1.7, with markdown conversions); still to
+  do: fold them into `README.md` in place of the web links.
+
+## Other known gaps
+
+- **Linearized files** are not handled specially (the tail-first read usually
+  still works, but hint streams are ignored).
+- **CMap coverage**: only single-byte `bfchar`; `bfrange`/`codespacerange`
+  skipped, multi-byte codes unsupported, fonts without `ToUnicode` fall back to
+  identity bytes (stage 1).
+- **Annotations** are collected but their content is not interpreted (stage 5).
+- Revisit the reference-by-lookahead parsing and `read_stream(-1)` fallback.