From be15b483a2285e00355913cd6be419460503f774 Mon Sep 17 00:00:00 2001 From: Andreas Stefl Date: Sat, 13 Jun 2026 17:39:16 +0200 Subject: [PATCH] docs: consolidate per-module STATUS/PLAN notes into AGENTS.md Converge the untracked per-module STATUS.md + PLAN.md (and the PDF PLAN-stage0.md subplan) into a single AGENTS.md per module, so the agent notes are tracked in git and auto-loaded by the repo's agent-instruction discovery convention (which looks for AGENTS.md, not AGENT.md). Co-Authored-By: Claude Opus 4.8 --- AGENTS.md | 163 +++++++ src/odr/internal/oldms/AGENTS.md | 64 +++ src/odr/internal/oldms/presentation/AGENTS.md | 363 ++++++++++++++ src/odr/internal/oldms/spreadsheet/AGENTS.md | 183 +++++++ src/odr/internal/oldms/text/AGENTS.md | 369 ++++++++++++++ src/odr/internal/pdf/AGENTS.md | 450 ++++++++++++++++++ 6 files changed, 1592 insertions(+) create mode 100644 AGENTS.md create mode 100644 src/odr/internal/oldms/AGENTS.md create mode 100644 src/odr/internal/oldms/presentation/AGENTS.md create mode 100644 src/odr/internal/oldms/spreadsheet/AGENTS.md create mode 100644 src/odr/internal/oldms/text/AGENTS.md create mode 100644 src/odr/internal/pdf/AGENTS.md diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 00000000..dbfb6a00 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,163 @@ +# AGENTS.md — OpenDocument.core + +Orientation for AI agents working in this repo. Summarises the architecture, the +conventions, and where to find things. For user-facing docs see +[`README.md`](README.md) and [`docs/`](docs/README.md). + +## What this is + +`odr` (a.k.a. `odrcore`) is a **C++20 library that decodes documents and renders +them to HTML**. It reads many formats (ODF, OOXML, legacy MS, PDF, CSV, …) behind +one abstract document model and a generic HTML renderer. It is the backend for +OpenDocument.droid / .ios. + +Build system: **CMake + Conan**. Language standard: **C++20** (`CMakeLists.txt`). 
+ +## Big picture: how a file becomes HTML + +``` +bytes ─▶ magic/open_strategy ─▶ DecodedFile ─▶ Document ─▶ ElementAdapter ─▶ html::translate ─▶ HtmlService + (detect FileType + (per engine) (per (tree of (generic renderer, + DecoderEngine) format) elements) walks public API) +``` + +1. **Detection** — `internal/magic.cpp` (+ `internal/libmagic`) sniffs the file; + `internal/open_strategy.cpp` picks a `FileType` and a `DecoderEngine` and + constructs the matching `abstract::DecodedFile`. +2. **Decode** — a document file yields an `abstract::Document` (the engine's + subclass of `internal::Document`). +3. **Element tree** — a `Document` exposes a root `ElementIdentifier` plus an + `abstract::ElementAdapter`. The public, value-semantics handles + (`Element`, `Slide`, `Paragraph`, `Text`, `Frame`, …) in + `src/odr/document_element.hpp` are thin wrappers that delegate to the adapter. +4. **Render** — `internal/html/` walks that public element API and writes HTML. + Entry point: `odr::html::translate(...)` → `HtmlService` (paginated fragments; + `bring_offline` materialises files). + +### The element-adapter pattern (every document engine follows it) + +Two pieces per engine: + +- An **`ElementRegistry`**: a flat `std::vector` (id = index + 1) where + each `Element` holds `parent`/`first_child`/`last_child`/`prev`/`next` ids and a + `type`, plus side `unordered_map`s for per-type payloads (text strings, frame + anchors, …). Builders are `create_element` / `create_*_element` / + `append_child`. See `oldms/text/doc_element_registry.*` or + `oldms/presentation/ppt_element_registry.*` for the minimal version. +- An **`ElementAdapter`**: one class that implements `abstract::ElementAdapter` + (tree navigation by id) and, by multiple inheritance, the per-element-type + adapters it supports (`SlideAdapter`, `ParagraphAdapter`, `TextAdapter`, + `FrameAdapter`, …). The `*_adapter(id)` methods return `this` when the element + is of that type, else `nullptr`. 
See `oldms/presentation/ppt_document.cpp` for a + compact example. + +`ElementType` is the shared enum in `src/odr/document_element.hpp` (`root`, +`slide`, `paragraph`, `text`, `line_break`, `frame`, `table*`, `sheet*`, …). + +## Directory map + +| Path | What | +|------|------| +| `src/odr/*.hpp` | **Public API**: `file.hpp`, `document.hpp`, `document_element.hpp`, `html.hpp`, `style.hpp`, `quantity.hpp` (`Measure`), `odr.hpp`. | +| `src/odr/internal/abstract/` | Core interfaces: `File`/`DecodedFile`, `Document` + `ElementAdapter` (and all per-element adapters), `Filesystem`, `Archive`, `HtmlService`. | +| `src/odr/internal/common/` | Reusable impls: `Path`/`AbsPath`, base `Document`, filesystem, `style`, table cursor/range, temp files. | +| `src/odr/internal/util/` | Helpers: `byte_stream_util` (POD reads), `string_util` (`split`, `u16string_to_string`), `stream_util`, `document_util`, `xml_util`. | +| `src/odr/internal/magic.*`, `open_strategy.*` | File-type detection and the open/dispatch logic. | +| `src/odr/internal/html/` | Generic HTML renderer (`document.cpp`, `document_element.cpp`, `document_style.cpp`). | +| `src/odr/internal/cfb/`, `zip/` | Container formats (Compound File Binary, ZIP). | +| `src/odr/internal/odf/` | OpenDocument (odt/ods/odp/odg). | +| `src/odr/internal/ooxml/` | OOXML (docx/pptx/xlsx); subdirs `text`/`presentation`/`spreadsheet`. | +| `src/odr/internal/oldms/` | **Legacy MS binary** (.doc/.ppt/.xls); subdirs `text`/`presentation`/`spreadsheet`. | +| `src/odr/internal/oldms_wvware/` | Alternative .doc decoder via wvWare. | +| `src/odr/internal/pdf/`, `pdf_poppler/` | PDF (own parser + poppler/pdf2htmlEX path). | +| `src/odr/internal/{csv,json,text,svm}/` | Smaller formats. | +| `cli/src/` | CLI tools: `translate`, `back_translate`, `meta`, `server`. | +| `test/src/` | GoogleTest suites; data in `test/data` (git submodules, see below). 
| +| `offline/documentation/MS-*/` | Vendored Microsoft spec text (PDF + extracted markdown), see [Specs](#specs). | +| `docs/design/README.md` | High-level design rationale. | + +## Build & test + +A configured build dir already exists (`cmake-build-debug`, also `…-release`, +`…-relwithdebinfo`). Typical loop: + +```bash +# library +cmake --build cmake-build-debug --target odr +# tests (the ODR_TEST option is on in this build dir) +cmake --build cmake-build-debug --target odr_test +./cmake-build-debug/test/odr_test --gtest_filter='OldMs.*' +# CLI (renders a file to a directory of HTML) +cmake --build cmake-build-debug --target translate +``` + +Notable CMake options (`CMakeLists.txt`): `ODR_TEST`, `ODR_CLI`, +`ODR_WITH_PDF2HTMLEX`, `ODR_WITH_WVWARE`, `ODR_WITH_LIBMAGIC`, `ODR_CLANG_TIDY`. +A new `.cpp` must be added to the `ODR_SOURCE_FILES` list in `CMakeLists.txt`. + +**Test data lives in git submodules** under `test/data/input/odr-public`, +`…/odr-private`, and `test/data/reference-output/*`. + +## Conventions + +- **Formatting**: clang-format, LLVM-based (`.clang-format`); run `scripts/format` + (or rely on the git hook from `scripts/setup`). `clang-tidy` config in + `.clang-tidy`; CI enforces both (`.github/workflows/format.yml`, `tidy.yml`). +- **Error handling — fail fast**: where the spec/format dictates what to expect, + **throw** on unexpected input (`std::runtime_error`, or the typed exceptions in + `src/odr/exceptions.hpp`) rather than silently degrading. Only **pass through** + (return empty / skip) values that are genuinely *optional* or *not yet + modelled*. +- **Public API**: value semantics; immutable handles; iterators only for + immutable traversal (`docs/design/README.md`). +- **Byte parsing**: read POD structs via `util::byte_stream::read`; this assumes + host byte order matches the file's (little-endian) — big-endian is a known + not-yet-handled gap in the binary engines. 
+- Match the **surrounding file's** style, includes, and idioms; mirror a sibling + engine when adding a format (the `oldms/text` `.doc` impl is the reference the + `.ppt` impl was modelled on). +- **Comments — keep them minimal**: a function/struct doc comment is at most a + couple of terse lines stating the key point (what it does, stream/ownership + preconditions, the spec section it implements, e.g. `[MS-PPT] 2.3.2`). Don't + restate the code or spell out every case; cite the spec instead of paraphrasing + it. The detailed design rationale belongs in the per-module `AGENTS.md`, not in + source comments. + +## Adding / extending a document format + +1. Detection: extend `magic`/`open_strategy` to map the bytes to a `FileType` + (+ `DecoderEngine`) and construct your `DecodedFile`. +2. For documents: subclass `internal::Document`; in its constructor build an + `ElementRegistry` and an `ElementAdapter` (see the pattern above). +3. Implement the per-element adapters you can populate; the **generic HTML + renderer then works for free**. +4. Register the format's factory (e.g. `oldms_file.cpp::document()` switches on + `file_type()`), add sources to `CMakeLists.txt`, and add a GoogleTest. + +## Legacy Microsoft binary formats (`oldms`) + +Container handling (CFB) already exists; each format is a small module under +`oldms/` mirroring `oldms/text` (`.doc`). Spec references in +`src/odr/internal/oldms/README.md`. + +- **`.doc`** (`oldms/text`): working, visible-text extraction. +- **`.ppt`** (`oldms/presentation`): implemented — slides resolved via the + persist directory (the only spec-defined read path), each slide's text boxes + modelled as positioned `frame`s. 
**Read its docs before touching it**: + [`oldms/presentation/AGENTS.md`](src/odr/internal/oldms/presentation/AGENTS.md) + — what's implemented and **why** (persist-directory resolution, no scan + fallback, sequential `ChildCursor` reading without `tellg`, fail-fast error + handling, the two-text-locations finding, endianness), the open work (frame + refinements, smaller shortcomings), and the verified `[MS-PPT]`/`[MS-ODRAW]` + drawing-tree map. +- **`.xls`** (`oldms/spreadsheet`): working, visible cell-text extraction + (BIFF8). See [`oldms/spreadsheet/AGENTS.md`](src/odr/internal/oldms/spreadsheet/AGENTS.md). + +## Specs + +Vendored Microsoft Open Specifications live under +`offline/documentation//-/`, both as `original.pdf` and an +extracted `docling-from-docx.md` (grep-friendly). Available: **MS-PPT**, +**MS-ODRAW** (Office Art / Escher drawing records), **MS-DOC**, **MS-XLS**, +**MS-CFB** (container), **MS-OFFCRYPTO** (encryption). Cite section numbers from +these when implementing binary parsing. diff --git a/src/odr/internal/oldms/AGENTS.md b/src/odr/internal/oldms/AGENTS.md new file mode 100644 index 00000000..ba7cfcde --- /dev/null +++ b/src/odr/internal/oldms/AGENTS.md @@ -0,0 +1,64 @@ +# Legacy MS Office (`oldms/`) — shared status & conventions + +What the binary legacy-format modules share. Each format's own status, design +notes, and open work live with its module; this file holds the conventions they +build on and the one piece of open work common to all three. Spec links are in +[`README.md`](README.md), the PDFs under `offline/documentation/`. 
+ +| Module | Format | Agent doc | +|----------------------------------|------------------------|---------------------------------| +| [`text/`](text/) | `.doc` (Word) | [text/AGENTS.md](text/AGENTS.md) | +| [`presentation/`](presentation/) | `.ppt` (PowerPoint) | [presentation/AGENTS.md](presentation/AGENTS.md) | +| [`spreadsheet/`](spreadsheet/) | `.xls` (Excel / BIFF8) | [spreadsheet/AGENTS.md](spreadsheet/AGENTS.md) | + +## Shared conventions + +All three modules follow the same approach; the per-format docs cover only what +is specific to each format. + +- **CFB container.** Each format is a `[MS-CFB]` compound file; container + handling already existed in the engine. Each module reads its stream(s) + sequentially. +- **Byte-copy structs.** Fixed-layout spec structures are `#pragma pack(1)` + structs in the `*_structs.hpp` headers, with the spec's field names and + `[MS-*]` section citations, guarded by `static_assert(sizeof ...)`, filled by + copying the file's bytes straight in. +- **Bit-fields mirror the spec tables.** Sub-byte fields are declared as + bit-fields in the spec's order (LSB-first): `FibBase`/`Sprm` (`.doc`), + `RecordHeader` (`.ppt`), `RkNumber`/`UnicodeStringFlags` (`.xls`). +- **Little-endian, LSB-first hosts only.** The byte copy interprets bytes in the + host's byte order and bit-fields in the host's allocation order. See below. +- **Fail early on malformed input**; records/structures that are merely *not + modelled* are skipped. + +## Endianness and bit order: little-endian host assumed (shared open work) + +All three modules read multi-byte fields and UTF-16 code units in the host's +byte order with no swap, and their bit-field structs assume LSB-first +allocation. + +The file side is fixed: `[MS-DOC]`, `[MS-PPT]`, `[MS-XLS]` and the `[MS-CFB]` +container all store little-endian unconditionally — there is no big-endian +variant — so no runtime detection is needed. 
Only the host varies, and that is +known at compile time (`std::endian::native`). Conveniently, GCC/Clang switch +bit-field allocation to MSB-first exactly on big-endian targets, so byte order +and bit order flip together and one compile-time guard covers both. (The flip is +each ABI keeping declaration order equal to memory order: the first declared +field lands in the first byte either way.) + +**Fix if a non-little-endian target matters**: give each struct in the +`*_structs.hpp` headers a fixup function, applied right after the raw byte copy, +that byte-swaps the multi-byte fields and re-places the bit-field values. It has +to be per struct — a blind byte swap cannot fix bit-fields, the transform needs +the field widths. On little-endian hosts every fixup compiles to a no-op. + +**Rejected alternative**, for the record: `#if`-mirrored bit-field declarations +(the Linux `iphdr` pattern). Reversing declaration order repositions fields +within the allocation unit but cannot change how the unit's bytes are assembled, +so any field that straddles a byte boundary — `Sprm.ispmd` (9 bits), +`FcCompressed.fc` (30 bits), `RkNumber.num` (30 bits), `RecordHeader.recInstance` +(12 bits) — ends up in non-contiguous bits on a big-endian reader; only fixing +the data can express that. The pattern pays off only for zero-copy in-place +access (mapped packets/pages), which these formats rule out anyway: CFB streams +are fragmented into sectors and `.xls` records into `CONTINUE` chunks, so structs +are always assembled by copying — the fixup point is structural. diff --git a/src/odr/internal/oldms/presentation/AGENTS.md b/src/odr/internal/oldms/presentation/AGENTS.md new file mode 100644 index 00000000..223c3f8a --- /dev/null +++ b/src/odr/internal/oldms/presentation/AGENTS.md @@ -0,0 +1,363 @@ +# `.ppt` (PowerPoint) support — status, design & open work + +What the `oldms/presentation/` module does **today**, the **design decisions** +behind it, and the **open work**. 
Shared `oldms/` conventions are in +[`../AGENTS.md`](../AGENTS.md). + +**Scope.** Extract the **visible text of each slide, positioned in its text +boxes**, and expose it through the abstract document model so the generic HTML +renderer lays each slide out as positioned frames. No character/paragraph +styles, master/notes pages, images, charts, tables, or animations. + +**Specs.** `offline/documentation/MS-PPT/` (the PowerPoint stream) and +`MS-ODRAW/` (the Office Art / Escher drawing records). CFB container handling +already existed in the engine. + +--- + +## What works + +- `.ppt` is detected and decoded to a `Document` (presentation), one `slide` + per presentation slide **in presentation order**. +- Each slide's on-slide **text boxes** become positioned `frame`s; their text is + split into paragraphs / line breaks. +- A text box that stores no inline text but an `OutlineTextRefAtom` (the common + PowerPoint placeholder representation) is resolved against the slide's text in + the `SlideListWithTextContainer`, so placeholder/body text is not lost. +- The generic HTML renderer produces one page per slide with each text box + absolutely positioned (verified: `position:absolute;left:…;top:…` with the + decoded coordinates). 
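The paragraph / line-break split named in the second bullet can be sketched as follows. This is a minimal illustration, not the module's code: `Paragraph` and `split_paragraphs` are hypothetical names, the input is assumed to be already-decoded UTF-8, and the control characters follow the module's own text-decoding notes (`0x0D` ends a paragraph, `0x0B` is a manual line break, `0x09` is kept, other control characters are dropped). A trailing `0x0D` produces an empty final paragraph here, which is an assumption of the sketch.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical model: a paragraph is the list of its lines, where lines are
// separated by manual line breaks (0x0B).
using Paragraph = std::vector<std::string>;

// Splits decoded text-box text into paragraphs (on 0x0D) and lines (on 0x0B).
// Tabs are kept; every other control character below 0x20 is dropped.
std::vector<Paragraph> split_paragraphs(std::string_view text) {
  std::vector<Paragraph> paragraphs{Paragraph(1)};
  for (const char c : text) {
    if (c == '\r') {
      paragraphs.push_back(Paragraph(1)); // 0x0D: start a new paragraph
    } else if (c == '\v') {
      paragraphs.back().emplace_back(); // 0x0B: new line, same paragraph
    } else if (c == '\t' || static_cast<unsigned char>(c) >= 0x20) {
      paragraphs.back().back().push_back(c); // printable byte or tab
    }
    // any other control character is dropped
  }
  return paragraphs;
}
```

In the element tree each outer entry would correspond to a `paragraph`, each inner string to a `text`, and each boundary between inner strings to a `line_break`.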
+ +## Module layout (mirrors `../text`) + +| File (`oldms/presentation/`) | Role | +|----------------------------------|-----------------------------------------------------| +| `ppt_structs.hpp` | `#pragma pack(1)` PODs (`RecordHeader`, atom bodies, `Anchor`) + `static_assert` sizes + the `RecordType` / `SlideListInstance` enums | +| `ppt_io.{hpp,cpp}` | `read(...)` helpers over `std::istream` (text atoms, the anchor rect, fixed structs) | +| `ppt_parser.{hpp,cpp}` | `parse_tree(registry, files)` → walks the stream and builds the element tree | +| `ppt_element_registry.{hpp,cpp}` | Flat element store (copy of `doc_element_registry`) + text & frame side-payloads | +| `ppt_document.{hpp,cpp}` | `internal::Document` subclass + the `ElementAdapter` | + +`ElementRegistry` is a `vector` (id = index) with parent/child/sibling +ids and side maps for the text and frame payloads; `create_element` / +`create_text_element` / `create_frame_element` / `append_child` are the only +builders. + +## Pipeline: how a `.ppt` becomes the element tree + +1. **Wiring.** `LegacyMicrosoftFile` already detected `.ppt` (the `/PowerPoint + Document` stream → `FileType::legacy_powerpoint_presentation`, + `DocumentType::presentation`) and `open_strategy` routed it here; the + `legacy_powerpoint_presentation` case in `LegacyMicrosoftFile::document()` + returns `presentation::Document`. +2. **Resolve slides (persist directory).** `parse_tree` opens both required + streams and hands them to `collect_slides(current_user, document)`, following + the `[MS-PPT]` reading algorithm: read `CurrentUserAtom` from `/Current User` + → walk the `UserEditAtom` chain newest→oldest, building the persist object + directory (newest offset per id wins) → resolve the **live** + `DocumentContainer` via `docPersistIdRef` → walk the slide list's + `SlidePersistAtom`s **in presentation order**, resolving each `persistIdRef` + to its `SlideContainer`. See *Design decisions* for why this is the only read + path. +3. 
**Read text boxes per slide.** For each `SlideContainer` the parser descends + the drawing and reads its text boxes (with positions) — see [Text boxes + (frames)](#text-boxes-frames). +4. **Build the tree.** `parse_tree` makes one `slide`, one `frame` per text box + (storing its anchor), and `build_paragraphs` hangs the box's text off the + frame: + + ``` + root (ElementType::root) + └── slide (ElementType::slide) one per slide, in order + └── frame (ElementType::frame) one per on-slide text box + └── paragraph (ElementType::paragraph) split on 0x0D + ├── text (ElementType::text) + └── line_break (ElementType::line_break) for 0x0B in a paragraph + ``` +5. **Render.** HTML works through the generic renderer via the public `Slide` / + `Frame` / `Paragraph` / `Text` API and our adapters. + +## Text boxes (frames) + +A `.ppt` slide is a *drawing of shapes*; each text box / placeholder is a shape +with its own position. `collect_slides` returns, per slide, the on-slide text +boxes in shape (z) order, each becoming a `frame`. + +Per slide the parser descends `SlideContainer → DrawingContainer (0x040C) → +OfficeArtDgContainer (0xF002) → OfficeArtSpgrContainer (0xF003)` and walks the +`OfficeArtSpContainer` (0xF004) shapes. For each shape it reads: +- the **optional** `OfficeArtClientAnchor` (0xF010) → `read_client_anchor` + (`SmallRectStruct`/`RectStruct`, master units = 1/576 inch), and +- the text in its `OfficeArtClientTextbox` (0xF00D). + +Shapes with no text are dropped, so the group shape and pictures disappear. +`FrameAdapter` returns `anchor_type = at_page` and `x/y/width/height` as Measures +(master units / 576 → inches); a shape without an anchor yields a frame with no +position. + +**First cut (current):** only **top-level** shapes — direct children of the root +`OfficeArtSpgrContainer`, whose anchors are already in the slide's master-unit +system. 
Nested-group coordinate transforms, non-grouped shapes, and +master-placeholder geometry inheritance are deferred — see [open +work](#1-frame-refinements). The verified record map of the drawing tree is in +[Reference](#reference-the-drawing-tree). + +## Adapters + +`ppt_document.cpp` implements the generic `ElementAdapter` (tree navigation, +copied from `doc_document.cpp`) plus `SlideAdapter` / `FrameAdapter` / +`ParagraphAdapter` / `TextAdapter` / `LineBreakAdapter`: +- `FrameAdapter`: `anchor_type = at_page`; `x/y/width/height` from the frame's + anchor (or empty when absent); `z_index` / `style` empty. +- `SlideAdapter`: `slide_page_layout` → hardcoded 10"×7.5" (4:3); `slide_name` → + empty; `slide_master_page` → `null_element_id`. +- `paragraph_text_style` / `text_style` set `font_size = 11pt` so empty + paragraphs still have height. +- `Document::is_editable()` → `false`; `save(...)` → throws + `UnsupportedOperation`. + +## Binary format reference + +Every record starts with an 8-byte `RecordHeader`: + +``` +RecordHeader { + uint16 recVer : 4 ; // 0xF marks a container + uint16 recInstance : 12 ; + uint16 recType ; + uint32 recLen ; // bytes of body that follow the header +} +``` + +`recVer == 0xF` marks a **container** (body is a sequence of records); otherwise +it's an **atom** with `recLen` bytes of payload. 
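The header layout above can also be decoded with explicit shifts, which depends on neither host byte order nor bit-field allocation order. The module itself byte-copies into a packed bit-field struct; the sketch below is only an illustration of the layout, and `DecodedHeader`, `HeaderBytes`, `decode_header` and `is_container` are hypothetical names, not identifiers from `ppt_structs.hpp`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Plain-field form of the 8-byte RecordHeader (no bit-fields).
struct DecodedHeader {
  std::uint8_t rec_ver;       // low 4 bits of the first 16-bit word
  std::uint16_t rec_instance; // high 12 bits of the first 16-bit word
  std::uint16_t rec_type;
  std::uint32_t rec_len;      // bytes of body that follow the header
};

using HeaderBytes = std::array<std::uint8_t, 8>;

// Little-endian 16-bit read at byte offset i.
constexpr std::uint16_t u16_le(const HeaderBytes &b, std::size_t i) {
  return static_cast<std::uint16_t>(b[i] | (b[i + 1] << 8));
}

// Little-endian 32-bit read at byte offset i.
constexpr std::uint32_t u32_le(const HeaderBytes &b, std::size_t i) {
  return static_cast<std::uint32_t>(u16_le(b, i)) |
         (static_cast<std::uint32_t>(u16_le(b, i + 2)) << 16);
}

constexpr DecodedHeader decode_header(const HeaderBytes &b) {
  const std::uint16_t first = u16_le(b, 0);
  return DecodedHeader{static_cast<std::uint8_t>(first & 0xF),
                       static_cast<std::uint16_t>(first >> 4), u16_le(b, 2),
                       u32_le(b, 4)};
}

// recVer == 0xF marks a container; anything else is an atom.
constexpr bool is_container(const DecodedHeader &h) { return h.rec_ver == 0xF; }

// Illustrative bytes: recVer 0xF, recInstance 0, recType 0x03EE
// (SlideContainer); the recLen of 0x10 is made up for the example.
static_assert(is_container(decode_header(
    HeaderBytes{0x0F, 0x00, 0xEE, 0x03, 0x10, 0x00, 0x00, 0x00})));
```

Applied to the header bytes `00 00 9E 0F 04 00 00 00` quoted in the open-work notes, this yields an atom with `recType` 0x0F9E and `recLen` 4.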
+ +| Record | Type | Kind | Purpose | +|------------------------|--------|-----------|------------------------------------------| +| `CurrentUserAtom` | 0x0FF6 | atom | in `/Current User`; newest edit offset | +| `UserEditAtom` | 0x0FF5 | atom | edit chain + persist directory offset | +| `PersistDirectoryAtom` | 0x1772 | atom | persist id → stream offset | +| `DocumentContainer` | 0x03E8 | container | top-level document | +| `SlideListWithText` | 0x0FF0 | container | per-list slide refs (+ optional outline) | +| `SlidePersistAtom` | 0x03F3 | atom | one per slide; `persistIdRef` + order | +| `SlideContainer` | 0x03EE | container | a slide (drawing + placeholders) | +| `MainMaster` | 0x03F8 | container | master slide (skipped) | +| `Notes` | 0x03F0 | container | notes page (skipped) | +| `TextHeaderAtom` | 0x0F9F | atom | type of the text block that follows | +| `TextCharsAtom` | 0x0FA0 | atom | UTF-16 text (two bytes per code unit) | +| `TextBytesAtom` | 0x0FA8 | atom | "compressed" text: one byte per char | + +The Office Art drawing records (`RT_Drawing` 0x040C and `0xF00*`/`0xF010`) used +for text boxes are listed with the full drawing-tree map in +[Reference](#reference-the-drawing-tree). + +### Text decoding + +- `TextCharsAtom`: `recLen / 2` UTF-16 code units → `u16string_to_string`. +- `TextBytesAtom`: each byte is one character value (0x00–0xFF). +- In-text control characters: `0x0D` = paragraph break, `0x0B` = vertical tab = + manual line break — split on these like `doc_parser`. `0x09` (tab) kept; other + control characters dropped (`clean_text`). + +--- + +## Design decisions + +**Slide resolution is persist-directory based (the single spec path).** The +persist directory gives correct slide **ordering** for incrementally-saved files +(where stream order ≠ presentation order) and picks the **live** +`DocumentContainer` rather than the first one in the stream. 
Verified on +`slides.ppt`: `/Current User` → `offsetToCurrentEdit=11646` → `UserEditAtom` +(`docPersistIdRef=1`, `offsetPersistDirectory=11606`, `offsetLastEdit=0`) → 2 +slides in order with correct text. + +**No scan/heuristic fallback — spec-justified.** Both `/Current User` (§2.1.1) +and `/PowerPoint Document` (§2.1.2) are *required* streams, every conformant +file has at least one `UserEditAtom` + `PersistDirectoryAtom`, and the reading +algorithm has no alternative branch. An earlier draft kept a stream-scan +fallback (first `DocumentContainer`, every `SlideContainer` in stream order, +plus an outline-vs-container "more text wins" heuristic); it was **removed** — +unreachable for conformant files and able to silently serve *wrong* results (a +stale `DocumentContainer`, wrong slide order). `collect_slides` returns an empty +presentation only for the one *optional* structure: a document with no +presentation slide list (§2.4.1). Every mandatory structure that can't be +resolved — empty edit chain, unresolved `docPersistIdRef`, a slide +`persistIdRef` not in the directory — **throws**. + +**Two places hold slide text — and they are not equivalent.** +- The **outline** (`SlideListWithTextContainer`, §2.4.14.3) is **optional** + (`DocumentContainer.slideList`, §2.4.1). When present it carries, per slide, + the title/body **placeholder** text only — free text boxes are *never* in it. +- The **`SlideContainer`** (§2.5.1) is the authoritative source: on-slide text + lives in the drawing's `ClientTextbox` records. + +In LibreOffice-exported `.ppt` the outline is **empty** (verified on +`slides.ppt`: the `0x0FF0` lists hold zero text atoms), so there we read each +slide's text from its `SlideContainer`. But PowerPoint-authored placeholders +commonly carry **no inline text** in the `SlideContainer` and instead an +`OutlineTextRefAtom` (§2.9.78) pointing, by index, at the *i*-th `TextHeaderAtom` +block of that slide in the `SlideListWithTextContainer`. 
So we read the outline +too: `read_slide_list_text` collects, per slide (keyed by `persistIdRef`), the +ordered list of its `TextHeaderAtom` texts, and `gather_text` resolves an +`OutlineTextRefAtom` box against it. On-slide `ClientTextbox` text still wins +when present. + +**`RT_SlideListWithText` recInstance disambiguates three lists.** +`MasterListWithTextContainer` (§2.4.14.1), `SlideListWithTextContainer` +(§2.4.14.3) and `NotesListWithTextContainer` (§2.4.14.6) share `recType = +RT_SlideListWithText` (0x0FF0); only `recInstance` tells them apart: + +| recInstance | container | meaning | +|-------------|-------------------------------|---------------------| +| `0x000` | `SlideListWithTextContainer` | presentation slides | +| `0x001` | `MasterListWithTextContainer` | masters | +| `0x002` | `NotesListWithTextContainer` | notes | + +An early draft had Slides/Master swapped, making the lookup read the *master* +list; fixed in `ppt_structs.hpp` (`SlideListInstance`). + +**Sequential reading, no `tellg`.** The CFB-backed stream's `tellg()` returns +bogus values (it broke an early offset-tracking `read_children`). The parser +never depends on `tellg`: the caller `seekg`s to known offsets (from the persist +directory or a parent record), and child records are walked **forward** with a +`ChildCursor` — `read` header → `read`/recurse/`ignore` body — tracking the +bytes left in the container. A record that overruns its container throws, +keeping nested containers in sync or failing loudly. + +**Fail early on malformed input.** Where the spec dictates what to expect, +unexpected input **throws** (matches the sibling `.doc` parser). 
We **throw** on: +a missing required stream; a wrong record type (`read_header` — so a truncated +read, whose garbage type won't match, also throws); a record that overruns its +container (`ChildCursor`); a missing **mandatory** child record — the +`DrawingContainer` / `OfficeArtDgContainer` / `OfficeArtSpgrContainer` of a +slide (`require_child`); an `OfficeArtClientAnchor` whose `recLen` is neither 8 +nor 16; a non-decreasing (looping) `UserEditAtom` chain, an empty chain, an +unresolved `docPersistIdRef`, or a slide `persistIdRef` not in the persist +directory. We **pass through** (no throw) for values we don't model or that are +optional: an absent presentation slide list (0 slides), a shape with no +`OfficeArtClientAnchor` (unpositioned frame), nested groups and non-`Sp` records +in a group, and any non-text / unrecognised child record. + +**Endianness.** Host byte order / LSB-first bit-fields assumed; shared `oldms/` +assumption, see [`../AGENTS.md`](../AGENTS.md). For `.ppt`: every record field is +read in host byte order (see the note in `ppt_io.hpp`), and the `RecordHeader` +recVer/recInstance bit-fields assume LSB-first allocation. + +## Tests + +- `ppt_empty` — `odr-public/ppt/empty.ppt`: 1 slide. +- `ppt_slides` — `odr-public/ppt/slides.ppt`: 2 slides, 2 positioned frames each + (all `at_page` with `x/y/width/height`), distinct vertical positions, exact + per-box text. + +The non-empty fixture `slides.ppt` and reference-output HTML wiring are open +items (see below). + +## Out of scope + +Character/paragraph styles, fonts and colours; master and notes slides; +images/charts/tables and non-text shapes; animations/transitions; and +encrypted/obfuscated presentations. + +--- + +# Open work + +## 1. Frame refinements + +The first cut reads only **top-level** shapes — direct children of the root +`OfficeArtSpgrContainer` — whose anchors are already in the slide's master-unit +coordinate system. 
The refinements below raise fidelity; each is optional and +independent. + +- **1.1 Nested groups.** A shape nested inside a sub-group has its anchor + expressed in **that group's** coordinate system, defined by the group's + `OfficeArtFSPGR` (0xF009, `recVer 0x1`, `recLen 16`: `xLeft, yTop, xRight, + yBottom`), not in slide units. To support it: recurse into nested + `OfficeArtSpgrContainer` (0xF003), and for each descendant map its anchor from + the group's `[xLeft..xRight] × [yTop..yBottom]` onto the group shape's own + anchor rect in the parent, composing transforms down the nesting, before the + `/576` conversion. +- **1.2 Non-grouped shapes.** `OfficeArtDgContainer` (0xF002) also has an + optional direct `shape` (`OfficeArtSpContainer`, §2.2.13) for a shape not in a + group — the current walk only iterates the `OfficeArtSpgrContainer`. Rare in + real files, but read that child too for completeness. +- **1.3 Optional / inherited anchor.** A shape without an + `OfficeArtClientAnchor` (0xF010) currently yields a frame with no position. + PowerPoint placeholders often omit the anchor and inherit geometry from the + matching placeholder shape on the **master slide** (resolve via + `OfficeArtClientData.placeholderAtom` → the master's placeholder). +- **1.4 Origin / sign sanity check.** Field order and units are spec-confirmed + (top/left/right/bottom; master units = 1/576 inch) and verified on + `slides.ppt`. Still worth confirming the origin (top-left of the slide) and + non-negative values on a second, independently produced real file. + +## 2. Smaller shortcomings + +- **2.1 Slide size is hardcoded.** `slide_page_layout` returns a fixed 10"×7.5" + (`ppt_document.cpp`). The real size is `DocumentAtom.slideSize` + (`RT_DocumentAtom` 0x03E9, the first child of the `DocumentContainer`) — a + `PointStruct` in master units (`/576` → inches). Read it and feed the page + layout; fall back to 10"×7.5" only if absent. 
+- **2.2 Reference-output HTML not wired.** `html_output_test` has no `ppt` case. + Add reference HTML under + `test/data/reference-output/odr-public/output/ppt/...` and wire it in (needs + the `OpenDocument.test.output` submodule). +- **2.3 Fixture not committed.** `test/data/input/odr-public/ppt/slides.ppt` + exists only in the local `odr-public` submodule working tree. It must be + committed/pushed to the `OpenDocument.test` repo and the submodule pointer + bumped, or CI can't see it (so `ppt_slides` would fail there). +- **2.4 No `OutlineTextRefAtom` fixture.** `OutlineTextRefAtom` resolution is + implemented but **unexercised by any committed fixture** — all three current + `.ppt` files are LibreOffice-authored with an empty outline (`grep` for the + `00 00 9E 0F 04 00 00 00` header finds none). A PowerPoint-authored `.ppt` + whose placeholders use the outline indirection is needed to regression-test + the path. Pairs with §2.3. +- **2.5 Auto-field metacharacters dropped.** Slide-number / date / header / + footer placeholders are separate records (`RT_*MetaCharAtom`) interleaved with + the text; we ignore them, so e.g. a slide-number placeholder yields nothing. + Low priority for "visible text only". +- **2.6 `slide_name` is empty.** Could return `"Slide N"` (index-based) so the + HTML page/tab has a label, matching how other formats name pages. +- **2.7 Endianness** — shared `oldms/` shortcoming; see [`../AGENTS.md`](../AGENTS.md). 
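The coordinate mapping described in item 1.1 and the `/576` conversion used in 2.1 can be sketched as below. This is a hedged illustration of the arithmetic only: `Rect`, `map_to_parent` and `to_inches` are hypothetical names, the use of `double` is a simplification, and throwing on a zero-sized group rectangle is an assumption in the spirit of the fail-fast convention, not something the spec text quoted here mandates.

```cpp
#include <stdexcept>

// A rectangle in some coordinate space (left/top/right/bottom).
struct Rect {
  double left;
  double top;
  double right;
  double bottom;
};

constexpr double master_units_per_inch = 576.0;

// Maps `child`, expressed in the group's own coordinate space `group_bounds`
// (the OfficeArtFSPGR rect), onto `group_anchor`, the group shape's rect in
// its parent's coordinate space.
Rect map_to_parent(const Rect &child, const Rect &group_bounds,
                   const Rect &group_anchor) {
  const double bounds_w = group_bounds.right - group_bounds.left;
  const double bounds_h = group_bounds.bottom - group_bounds.top;
  if (bounds_w == 0 || bounds_h == 0) {
    throw std::runtime_error("degenerate group bounds"); // assumption: fail fast
  }
  const double anchor_w = group_anchor.right - group_anchor.left;
  const double anchor_h = group_anchor.bottom - group_anchor.top;
  const auto map_x = [&](const double x) {
    return group_anchor.left + (x - group_bounds.left) * anchor_w / bounds_w;
  };
  const auto map_y = [&](const double y) {
    return group_anchor.top + (y - group_bounds.top) * anchor_h / bounds_h;
  };
  return Rect{map_x(child.left), map_y(child.top), map_x(child.right),
              map_y(child.bottom)};
}

// Master units (1/576 inch) to inches.
double to_inches(const double master_units) {
  return master_units / master_units_per_inch;
}
```

For deeper nesting the same function would be applied repeatedly from the innermost group outward, and `to_inches` only once on the final slide-space rectangle.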
+ +## Reference: the drawing tree + +Inside each `SlideContainer` (0x03EE) is the Office Art (Escher) drawing that +holds the slide's text boxes: + +``` +SlideContainer (0x03EE) [MS-PPT] 2.5.1 +└─ drawing = DrawingContainer (RT_Drawing, 0x040C) [MS-PPT] 2.5.13 + └─ OfficeArtDgContainer (0xF002) [MS-ODRAW] 2.2.13 + └─ OfficeArtSpgrContainer (0xF003) shape group [MS-ODRAW] 2.2.16 + ├─ OfficeArtSpContainer (0xF004) shape #1 (text box) [MS-ODRAW] 2.2.14 + │ ├─ OfficeArtFSPGR (0xF009) group bounds (group shape only) [MS-ODRAW] 2.2.38 + │ ├─ OfficeArtFSP (0xF00A) shape id/flags [MS-ODRAW] 2.2.40 + │ ├─ OfficeArtFOPT (0xF00B) shape properties [MS-ODRAW] 2.2.9 + │ ├─ OfficeArtClientAnchor (0xF010) POSITION + SIZE [MS-PPT] 2.7.1 + │ ├─ OfficeArtClientData (0xF011) placeholderAtom: title/body/… [MS-PPT] 2.7.3 + │ └─ OfficeArtClientTextbox(0xF00D) the box's text [MS-PPT] 2.9.76 + │ ├─ TextHeaderAtom (0xF9F) + │ └─ TextCharsAtom/TextBytesAtom (0xFA0/0xFA8) + └─ OfficeArtSpContainer (0xF004) shape #2 … +``` + +- The `OfficeArt*` container/shape records are `[MS-ODRAW]`; the + `DrawingContainer` and the *client* records (`0xF00D` textbox, `0xF010` + anchor, `0xF011` data) are `[MS-PPT]`. `[MS-ODRAW]` §2.2.14 defers + `clientAnchor`/`clientData`/`clientTextbox` to the host app. +- **`OfficeArtSpContainer` (0xF004) child order** per `[MS-ODRAW]` §2.2.14: + `shapeGroup?` (`OfficeArtFSPGR`, group shapes only), `shapeProp` + (`OfficeArtFSP`, 16 B), `shapePrimaryOptions?` (`OfficeArtFOPT`), …, + **`clientAnchor?`**, `clientData?`, `clientTextbox?`. The parser matches by + recType, so order only documents what to expect. +- **Anchor body** (`OfficeArtClientAnchor`, atom, `recLen == 8` or `16`), field + order **top, left, right, bottom** (y, x, x, y): + - `recLen == 8` → `SmallRectStruct` (`[MS-PPT]` 2.12.8): four **signed 2-byte**. + - `recLen == 16` → `RectStruct` (`[MS-PPT]` 2.12.7): four **signed 4-byte**. 
+ + `width = right - left`, `height = bottom - top`; master units → inches = `/576`. +- The first child `OfficeArtSpContainer` of the root spgr is the **group shape** + itself (holds the `OfficeArtFSPGR`, has no `clientTextbox`); the parser drops + it implicitly because it has no text. diff --git a/src/odr/internal/oldms/spreadsheet/AGENTS.md b/src/odr/internal/oldms/spreadsheet/AGENTS.md new file mode 100644 index 00000000..25b1349b --- /dev/null +++ b/src/odr/internal/oldms/spreadsheet/AGENTS.md @@ -0,0 +1,183 @@ +# `.xls` (Excel / BIFF8) support — status, design & open work + +What the `oldms/spreadsheet/` module does **today**, the **design decisions** +behind it, and the **open work**. Shared `oldms/` conventions are in +[`../AGENTS.md`](../AGENTS.md). + +**Scope.** Extract the **visible cell text** of every worksheet and expose it +through the abstract document model so the generic HTML renderer produces a plain +table per sheet. Every cell value is rendered as a *string* — no styles, +number/date formats, merged cells, drawings, or charts. + +**Specs.** `[MS-XLS]` (the record stream, the SST, the cell records) and +`[MS-CFB]` for the container. Section numbers are cited inline below and in code. + +--- + +## What works + +- `.xls` is detected (`/Workbook` stream) and decoded to a `Document` + (spreadsheet): one `sheet` element per worksheet, with `sheet_cell` → + `paragraph` → `text` elements for every non-empty cell. +- **All BIFF8 cell value kinds** become display text: SST strings (`LabelSst`), + inline strings (`Label`), numbers (`RK`, `MulRk`, `Number`), booleans/errors + (`BoolErr`), and **cached formula results** (`Formula` + `String` for string + results; numeric/boolean/error results from the `FormulaValue`). +- **SST `CONTINUE` splitting** is handled, including a split *mid-string* where + the continuation re-declares the character encoding (§2.5.293). 
+- Sheet `dimensions` come from the `Dimensions` record; `content` is the tight + extent of the non-empty cells (what the HTML renderer uses by default). +- The generic HTML renderer produces one table per sheet + (`html::translate_sheet`), with column letters and row numbers. + +Verified against `[MS-XLS]`: the record stream (§2.1.4), BOF/substream layout +(§2.4.21), `BoundSheet8` (§2.4.28), `SST`/`Continue` (§2.4.265/.58), +`XLUnicodeRichExtendedString` (§2.5.293), `RkNumber` (§2.5.217: bit 0 = `fX100`, +bit 1 = `fInt`), `FormulaValue` (§2.5.133), `Dimensions` (§2.4.90). + +## Module layout (sibling of `../text`, `../presentation`) + +| File (`oldms/spreadsheet/`) | Role | +|------------------------------------|---------------------------------------------------| +| `xls_structs.hpp` | `#pragma pack(1)` PODs for the record bodies + `static_assert` sizes + record type enum | +| `xls_io.{hpp,cpp}` | `BiffReader` (record walker with transparent `CONTINUE` hopping; the `[MS-XLS]` string readers and `expect_bof` are methods), RK decoding, number formatting | +| `xls_parser.{hpp,cpp}` | `parse_tree(registry, files)` → globals (BoundSheet8 + SST) then one pass per sheet substream | +| `xls_element_registry.{hpp,cpp}` | Flat element store + `Sheet` (name, dimensions, cell position map) and `SheetCell` (position) payloads | +| `xls_document.{hpp,cpp}` | `internal::Document` subclass + the `ElementAdapter` | + +## Pipeline: how a `.xls` becomes the element tree + +1. **Wiring.** `LegacyMicrosoftFile::parse_meta` detects the `/Workbook` stream + → `FileType::legacy_excel_worksheets`, `DocumentType::spreadsheet`, and + `document()` returns `spreadsheet::Document`. +2. **Globals substream.** `/Workbook` is a flat sequence of `(u16 type, u16 + size, body)` records. 
The first substream (after its `BOF`, which must + declare BIFF8 = `vers 0x0600`) holds, per sheet, a `BoundSheet8` (name + + absolute offset of the sheet's `BOF`; only `dt == worksheet` is kept) and the + `SST` — all shared string constants, deduplicated. +3. **SST / CONTINUE.** A record body is capped at 8224 bytes; the SST payload + spills into `Continue` records, and the split can fall *inside* a string. + `BiffReader`'s body accessors hop into a following `CONTINUE` transparently + (throwing if the next record is anything else); character data additionally + re-reads a fresh flags byte at each hop, since the continuation re-declares + compressed (1 byte/char) vs UTF-16 for the remainder. Formatting runs + (`cRun`·4 bytes) and phonetic data (`cbExtRst` bytes) are read and skipped. +4. **Sheet substreams.** For each kept `BoundSheet8`, seek to its `BOF` and scan + records until `EOF`: `Dimensions` → sheet extents; `LabelSst` / `Label` / + `RK` / `MulRk` / `Number` / `BoolErr` → one cell each; `Formula` → the cached + result in its `FormulaValue` (an Xnum double unless `fExprO == 0xFFFF`, then + string/bool/error/blank — a string result follows in a `String` record, + matched via a pending-cell marker). `Blank` / `MulBlank` carry no text and + are ignored. +5. **Tree.** Each non-empty cell becomes `sheet_cell → paragraph → text` (the + cell's rendered string). Cells hang off their sheet by `parent_id` only — + they are *not* in the sibling chain (mirrors `ooxml/spreadsheet`); lookup goes + through the sheet's `(column,row) → id` map, which also tracks the tight + `content` extent. +6. **Render.** `html::translate_sheet` walks the sheet purely through the public + `Sheet` / `SheetCell` API, which delegates to our adapter. + +### Value formatting + +- **RK numbers** (§2.5.217): low 2 bits are flags — bit 0 `fX100` (divide by + 100), bit 1 `fInt` (30-bit signed integer vs the *high 30 bits* of an IEEE + double, rest zero). 
+- Numbers are formatted with `%.15g` (≈ Excel's "General": up to 15 significant + digits, no trailing zeros, integers without a decimal point). +- Booleans → `TRUE`/`FALSE`; error codes (BErr, §2.5.10) → `#DIV/0!`, `#VALUE!`, + `#REF!`, `#NAME?`, `#NUM!`, `#N/A`, `#NULL!`. +- Dates are **not** decoded: a date cell shows its raw serial number unless the + file stored it as a string (number-format handling is open work). + +## Adapters + +`xls_document.cpp` implements the generic `ElementAdapter` plus `SheetAdapter` / +`SheetCellAdapter` / `ParagraphAdapter` / `TextAdapter`: +- `sheet_name` / `sheet_dimensions` → from the registry payload; + `sheet_content(range)` → the tight content extent, clamped to `range`. +- `sheet_cell(col,row)` → map lookup, `null_element_id` for empties; + `sheet_first_shape` → none. +- All `*_style(...)` → `{}`; `sheet_cell_value_type` → `ValueType::string` + (every value is pre-rendered text); `sheet_cell_span` → `{1,1}`. +- `paragraph_text_style` / `text_style` set `font_size = 11pt` so empty + paragraphs have height (same hack as the `.doc`/`.ppt` modules). +- `Document::is_editable()` → `false`; `save(...)` → `UnsupportedOperation`. + +## Design decisions + +- **Fail early on malformed input** (matches the sibling modules): missing or + non-BIFF8 `BOF`, a non-`CONTINUE` record where a body continuation is + required, an out-of-range SST index, a malformed `MulRk` body, an unknown + `FormulaValue` type, and truncated streams all **throw**. Records that are + merely *not modelled* are skipped. +- **Pre-rendered text instead of typed values.** Cell values are converted to + display strings at parse time; the model exposes `ValueType::string` only. + Typed values would require XF/number-format plumbing — deliberately deferred. 
+- **Endianness/bit order**: bytes are copied straight into native + integers/doubles and bit-field structs (`RkNumber`, `UnicodeStringFlags`, flag + fields of `BoundSheet8Fixed`/`FormulaFixed`) — little-endian, LSB-first hosts + only; shared `oldms/` assumption, see [`../AGENTS.md`](../AGENTS.md). + +## Tests + +- `xls_string_split_across_continue` — a string split mid-character-data with an + encoding switch at the boundary. +- `xls_rich_string_runs_across_continue` — formatting-run skip across a + `CONTINUE` (no flags byte there) + correct position for the next string. +- `xls_decode_rk` — all four RK flag combinations + number formatting; the + inputs are raw on-disk encodings, so it also pins the `RkNumber` bit-field + layout. +- `xls_empty` / `xls_file_example_10` / `xls_file_example_5000` — real fixtures: + sheet names, dimensions, content extents, string/number cells; the 5000-row + file exercises SST `CONTINUE` handling on real data. +- HTML output: `html_output_test` no longer skips `legacy_excel_worksheets`; + reference output lives under + `test/data/reference-output/{odr-public,odr-private}/output/xls/`. + +--- + +# Open work + +Roughly ordered by value. + +## 1. Number & date formatting (the biggest visible gap) + +Cells currently show raw values: a date cell renders as its serial number (e.g. +`43023` instead of `15/10/2017`) and numbers ignore their format codes. Fix by +following the format chain: +- Each cell record carries an `ixfe` (currently discarded — the parser already + reads it). It indexes the `XF` records (0x00E0) in the globals substream; + `XF.ifmt` picks a number format: a built-in id (0–163, the table is in + [MS-XLS] 2.4.126 `Format`) or a `Format` record (0x041E) with a format string. 
+- MVP: keep `ixfe` per cell, parse `XF`/`Format`, and special-case the date/time + formats (built-in ids 14–22, 45–47 + anything containing `y/m/d/h`) to convert + the serial date (days since 1899-12-31; the count includes the nonexistent + 1900-02-29, so serials ≥ 61 are effectively days since 1899-12-30; fractional + part = time; mind the workbook's 1904 flag in `Date1904`, 0x0022) into a + sensible string. Full custom-format rendering is a rabbit hole; approximate + first. + +## 2. Coverage gaps + +- **Merged cells**: `MergeCells` record (0x00E5) → `sheet_cell_span` / + `sheet_cell_is_covered` (the adapter stubs are in place). +- **Styles**: fonts (`Font`, 0x0031), fills/borders from `XF` → + `sheet_cell_style` / `text_style`; column widths (`ColInfo`, 0x007D) and row + heights (`Row`, 0x0208) → `sheet_column_style` / `sheet_row_style`. +- **Hidden rows/columns** (`Row.fDyZero`, `ColInfo.fHidden`). +- **Typed cell values**: expose numeric/bool/date `ValueType`s instead of + pre-rendered strings (needed for anything smarter than HTML text). +- **Encrypted workbooks**: a `FilePass` record (0x002F) in the globals substream + means the rest of the stream is encrypted ([MS-OFFCRYPTO]) — currently it + parses as garbage or throws; should report password-protected. +- **BIFF5/BIFF7** (`BOF.vers != 0x0600`): currently throws; older `.xls` files + exist in the wild (no SST — `Label` records carry the strings inline). +- **Drawings/charts/images** (`MsoDrawing`/`Obj`/chart substreams) — likely + never worth it for text extraction. + +## 3. Smaller shortcomings + +- **Endianness/bit order** — shared `oldms/` shortcoming, see + [`../AGENTS.md`](../AGENTS.md). +- `RString` (0x00D6, rich inline string cell) is rare and currently skipped. +- A `Formula` string result is matched to the *immediately following* `String` + record via a pending-cell marker; an intervening `SharedFmla`/`Array`/`Table` + record is tolerated only because unknown records are skipped — not validated.
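
For open work §1 (number & date formatting), the serial-to-date step might be sketched as follows. This is an illustration under stated assumptions, not existing module code: `CivilDate`, `civil_from_days`, `serial_to_date` and `serial_to_seconds` are hypothetical names, the `date1904` flag is assumed to come from the `Date1904` record, and serial 60 is mapped to the nonexistent 1900-02-29 that the 1900 date system counts.

```cpp
#include <cmath>
#include <cstdint>

struct CivilDate {
  int year;
  unsigned month; // 1..12
  unsigned day;   // 1..31
};

// Days since 1970-01-01 -> proleptic Gregorian date, using the widely used
// era-based "civil from days" arithmetic (eras of 400 years = 146097 days).
constexpr CivilDate civil_from_days(std::int64_t z) {
  z += 719468; // shift the epoch to 0000-03-01
  const std::int64_t era = (z >= 0 ? z : z - 146096) / 146097;
  const unsigned doe = static_cast<unsigned>(z - era * 146097); // day of era
  const unsigned yoe =
      (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365; // year of era
  const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
  const unsigned mp = (5 * doy + 2) / 153; // month index, March == 0
  const unsigned day = doy - (153 * mp + 2) / 5 + 1;
  const unsigned month = mp < 10 ? mp + 3 : mp - 9;
  const int year = static_cast<int>(yoe + era * 400) + (month <= 2 ? 1 : 0);
  return {year, month, day};
}

// Whole-day part of a serial -> calendar date.
constexpr CivilDate serial_to_date(std::int64_t serial, bool date1904) {
  if (date1904) {
    return civil_from_days(serial - 24107); // serial 0 == 1904-01-01
  }
  if (serial == 60) {
    return {1900, 2, 29}; // phantom leap day of the 1900 date system
  }
  // Before the phantom day the epoch is 1899-12-31, after it 1899-12-30.
  return civil_from_days(serial - (serial < 60 ? 25568 : 25569));
}

// Fractional part of a serial -> seconds since midnight (rounded).
inline int serial_to_seconds(double serial) {
  const double fraction = serial - std::floor(serial);
  return static_cast<int>(std::lround(fraction * 86400.0)) % 86400;
}
```

With this, the example serial `43023` maps to 2017-10-15 in the 1900 system. Formatting the result according to the cell's format string is left out on purpose.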
diff --git a/src/odr/internal/oldms/text/AGENTS.md b/src/odr/internal/oldms/text/AGENTS.md new file mode 100644 index 00000000..a934eab7 --- /dev/null +++ b/src/odr/internal/oldms/text/AGENTS.md @@ -0,0 +1,369 @@ +# `.doc` (Word) support — status, design & open work + +What the `oldms/text/` module does **today**, the **design decisions** behind +it, and the **open work**. Shared `oldms/` conventions are in +[`../AGENTS.md`](../AGENTS.md). + +**Scope.** Extract the **visible text of the main document body**, split into +paragraphs and manual line breaks, and expose it through the abstract document +model so the generic HTML renderer lays it out as a flat run of paragraphs. No +character/paragraph styles, no headers/footers/footnotes/endnotes/annotations, +no tables, frames, images, or fields beyond showing their result text. + +**Specs.** `[MS-DOC]` (the FIB, the Clx / piece table, text decoding) and +`[MS-CFB]` for the container. Section numbers are cited inline below. + +--- + +## What works + +- `.doc` is detected (`/WordDocument` stream) and decoded to a `Document` + (text), one flat element tree under the root. +- The **main document body** (the first `ccpText` characters) is read from the + piece table, decoded (compressed 8-bit *or* UTF-16), split into paragraphs / + manual line breaks, with a `page_break` element at each end-of-section / + manual page break (`0x0C`). +- Field codes are resolved to their **result** text (the instruction part is + hidden); anchor/control characters are stripped. +- The generic HTML renderer produces the body as a sequence of paragraphs. + +Verified against `[MS-DOC]`: the read path matches *Retrieving Text* (§2.4.1, +steps 1–6), the FIB version map (§2.5.1), the Clx / Pcdt / Prc lead bytes +(§2.9.38/.178/.209), `FcCompressed` incl. the `0x82–0x9F` byte map (§2.9.73), +and the field characters (§2.8.25). 
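
The field handling listed above (instruction hidden, result shown, control characters filtered) can be illustrated with a small sketch. It is not the module's `clean_text`: `strip_fields` is a hypothetical name, the input is assumed to be already decoded to UTF-8, and silently ignoring a stray separator or end mark is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Keeps field results, hides field instructions and filters control
// characters. Every character handled specially here is < 0x20 and thus a
// single byte in UTF-8, so a byte-wise walk is sufficient.
inline std::string strip_fields(const std::string &in) {
  std::string out;
  std::vector<bool> in_result; // one flag per open field (the nesting stack)
  std::size_t hidden = 0;      // open fields still in their instruction part
  for (const char ch : in) {
    const auto c = static_cast<unsigned char>(ch);
    if (c == 0x13) { // field begin: instruction starts
      in_result.push_back(false);
      ++hidden;
    } else if (c == 0x14) { // field separator: result starts
      if (!in_result.empty() && !in_result.back()) {
        in_result.back() = true;
        --hidden;
      }
    } else if (c == 0x15) { // field end
      if (!in_result.empty()) {
        if (!in_result.back()) {
          --hidden; // separator-less field: hidden up to its end
        }
        in_result.pop_back();
      }
    } else if (hidden > 0) {
      // inside at least one instruction part: drop the character
    } else if (c == 0x1E) {
      out.push_back('-'); // non-breaking hyphen
    } else if (c >= 0x20 || c == 0x09) {
      out.push_back(ch); // ordinary text and tabs are kept
    }
  }
  return out;
}
```

Text is emitted only while no enclosing field is in its instruction part, so a field nested inside another field's instruction stays hidden even though it has a result of its own.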
+ +## Module layout (sibling of `../presentation`) + +| File (`oldms/text/`) | Role | +|-----------------------------------|-----------------------------------------------------| +| `doc_structs.hpp` | `#pragma pack(1)` PODs (`FibBase`, the `FibRgFcLcb97/2000/2002/2003/2007` chain, `Sprm`, `FcCompressed`, `Pcd`) + `static_assert` sizes + the `PlcPcdMap` piece-table view + `ParsedFib` | +| `doc_io.{hpp,cpp}` | `read(...)` helpers over `std::istream`: the variable-length FIB, the Clx walk, string decoding (compressed / UTF-16) | +| `doc_helper.{hpp,cpp}` | `CharacterIndex` (the decoded piece table) + `read_character_index` | +| `doc_parser.{hpp,cpp}` | `parse_tree(registry, files)` → reads the body text and builds the element tree, incl. `clean_text` (field & control-char handling) | +| `doc_element_registry.{hpp,cpp}` | Flat element store (id = vector index) + a text side-payload | +| `doc_document.{hpp,cpp}` | `internal::Document` subclass + the `ElementAdapter` | + +`ElementRegistry` is a `vector` (id = index) with parent/child/sibling +ids and a side map for the text payload; `create_element` / `create_text_element` +/ `append_child` are the only builders. + +## Pipeline: how a `.doc` becomes the element tree + +1. **Wiring.** `LegacyMicrosoftFile::parse_meta` detects the `/WordDocument` + stream → `FileType::legacy_word_document`, `DocumentType::text`, and + `document()` returns `text::Document`. +2. **Read the FIB.** `parse_tree` opens `/WordDocument` and reads the **File + Information Block** (§2.5.1). The FIB is variable-length and self-describing: + a fixed `FibBase` (32 B) followed by four counted arrays — `csw`·uint16 + (`fibRgW`), `cslw`·uint32 (`fibRgLw`), `cbRgFcLcb`·`FcLcb` (`fibRgFcLcb`), + `cswNew`·uint16 (`fibRgCswNew`). `read(ParsedFib&)` reads each count, + validates it covers the struct we model, then `ignore`s any surplus. +3. **Pick the FIB version.** The effective `nFib` is `fibRgCswNew.nFibNew` when + `cswNew > 0`, else `FibBase.nFib`. 
`type_dispatch_FibRgFcLcb` maps it + (`nFib97 … nFib2007`) to the right `FibRgFcLcb*` layout and `memcpy`s the raw + `fibRgFcLcb` bytes into it. We only read `clx` out of it, but the whole + versioned struct is modelled so the offset is correct. +4. **Locate & read the Clx (piece table).** The table stream is `/1Table` or + `/0Table` per `FibBase.fWhichTblStm`. The Clx (§2.9.38) lives at + `fibRgFcLcb->clx.fc`. `read_Clx` walks it: leading `Prc` entries (lead + `0x01`) are skipped, then the `Pcdt` (lead `0x02`) carries the `PlcPcd` — the + piece table mapping CP ranges to byte offsets in `/WordDocument`. + `read_character_index` turns it into a `CharacterIndex`. +5. **Concatenate the body text.** Pieces come in ascending CP order; + `parse_tree` clamps each to the remaining `ccpText` budget (so only the main + body is taken), seeks to each piece's `data_offset`, decodes it. +6. **Build the tree.** Split the body on `0x0D` (paragraph mark) — dropping the + trailing empty paragraph from the body's guard mark — then each paragraph on + `0x0C` (end-of-section / manual page break) and each segment on `0x0B` + (manual line break): + + ``` + root (ElementType::root) + ├── paragraph (ElementType::paragraph) split on 0x0D, then 0x0C + │ ├── text (ElementType::text) clean_text(...) of the run + │ └── line_break (ElementType::line_break) for 0x0B in a paragraph + └── page_break (ElementType::page_break) one per 0x0C boundary + ``` +7. **Render.** HTML works through the generic renderer via the public + `Paragraph` / `Text` / `LineBreak` API and our adapters. + +## The piece table (`CharacterIndex`) + +A `.doc` stores text in **pieces** rather than one contiguous run: the `PlcPcd` +is `n+1` ascending CP boundaries (`aCP`) followed by `n` `Pcd` structures +(`aData`). `PlcPcdMap` is a zero-copy view over the raw `plcPcd` bytes computing +`n = (cb - 4) / (4 + sizeof(Pcd))`, exposing `aCP(i)` / `aData(i)`. 
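
The layout arithmetic behind such a view can be sketched as below. This is not the module's `PlcPcdMap`: `PlcPcdView` and its members are hypothetical, `sizeof(Pcd)` is taken as 8 bytes, and the strict size check is an assumption added for the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

constexpr std::size_t kCpSize = 4;  // one aCP entry (a 4-byte CP)
constexpr std::size_t kPcdSize = 8; // one aData entry (sizeof(Pcd))

// Non-owning view over raw PlcPcd bytes: n + 1 CPs followed by n Pcds.
// The vector must outlive the view.
class PlcPcdView {
public:
  explicit PlcPcdView(const std::vector<std::uint8_t> &bytes)
      : m_bytes(&bytes), m_n(0) {
    const std::size_t cb = bytes.size();
    if (cb < kCpSize || (cb - kCpSize) % (kCpSize + kPcdSize) != 0) {
      throw std::invalid_argument("PlcPcd: size does not fit n+1 CPs, n Pcds");
    }
    m_n = (cb - kCpSize) / (kCpSize + kPcdSize);
  }

  // Number of pieces n.
  std::size_t size() const { return m_n; }

  // CP boundary i, valid for i in [0, n].
  std::uint32_t a_cp(std::size_t i) const {
    if (i > m_n) {
      throw std::out_of_range("PlcPcd: aCP index");
    }
    return load_u32_le(i * kCpSize);
  }

  // Byte offset of Pcd i inside the PlcPcd, valid for i in [0, n).
  std::size_t pcd_offset(std::size_t i) const {
    if (i >= m_n) {
      throw std::out_of_range("PlcPcd: aData index");
    }
    return (m_n + 1) * kCpSize + i * kPcdSize;
  }

private:
  // Explicit little-endian load, independent of host byte order.
  std::uint32_t load_u32_le(std::size_t offset) const {
    const std::uint8_t *p = m_bytes->data() + offset;
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
  }

  const std::vector<std::uint8_t> *m_bytes;
  std::size_t m_n;
};
```

For a 28-byte `plcPcd` this gives `n = (28 - 4) / 12 = 2`: three CP boundaries at byte offsets 0, 4 and 8, then two `Pcd`s at offsets 12 and 20.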
+ +Each `Pcd` holds an `FcCompressed`: +- `fCompressed == 0` → **UTF-16**, `data_offset = fc`, 2 bytes per CP. +- `fCompressed == 1` → **compressed** (one byte per CP), `data_offset = fc / 2`. + +`read_character_index` records, per piece, `(start_cp, length_cp, data_offset, +is_compressed)`; `CharacterIndex::Iterator` derives `length_cp` from adjacent CP +boundaries and `data_length` from the compression flag. `append` enforces +ascending CP order (throws otherwise). + +### Text decoding + +- **Uncompressed**: `length_cp` UTF-16 code units → `u16string_to_string`. +- **Compressed**: each byte is one code point (§2.9.73 / §2.4.1 step 6). Bytes + `0x82–0x9F` are remapped via `uncompress_char` (the Windows-1252 "smart + quotes" block — e.g. `0x92 → U+2019`, `0x96 → U+2013`); every other byte `b` + is code point `U+00b` and UTF-8-encoded, so `0xA0–0xFF` round-trip (e.g. + `0xE9 → "é"`). +- **In-text control characters** (`clean_text`): + - `0x0D` paragraph mark, `0x0C` end-of-section / manual page break, `0x0B` + manual line break are consumed by the caller's splits and never reach + `clean_text`. A `0x0C` boundary emits a `page_break` (§2.8.26). + - `0x13`/`0x14`/`0x15` delimit a **field**: instruction (begin→separator) + hidden, result (separator→end) shown. The separator `0x14` is optional + (§2.8.25); a separator-less field is hidden up to its `0x15` end. Nesting is + tracked with a per-field stack. + - `0x09` tab kept; `0x1E` non-breaking hyphen → `-`; `0x1F` optional hyphen + dropped; all other control characters `< 0x20` (picture/OLE `0x01`, footnote + ref `0x02`, cell mark `0x07`, …) dropped. + +## Adapters + +`doc_document.cpp` implements the generic `ElementAdapter` plus +`TextRootAdapter` / `ParagraphAdapter` / `SpanAdapter` / `TextAdapter` / +`LineBreakAdapter`: +- `text_root_page_layout` / `text_root_first_master_page` → empty. +- `paragraph_style` / `span_style` / `line_break_style` → empty (`TODO`). 
+- `paragraph_text_style` / `text_style` set `font_size = 11pt` so empty + paragraphs still have height (same hack as the PPT module; removed when + character formatting lands — see open work). +- `Document::is_editable()` → `true` and `is_savable(encrypted)` → + `!encrypted`, but `save(...)` and `text_set_content(...)` throw + `UnsupportedOperation` — read-only in practice. + +## Binary format reference (FIB) + +The FIB is the root of every `.doc`, at offset 0 of `/WordDocument`: + +``` +FibBase 32 B fixed (wIdent, nFib, flags incl. fWhichTblStm/fEncrypted, …) +csw uint16 count of the following uint16 array +fibRgW csw·uint16 +cslw uint16 count of the following uint32 array +fibRgLw cslw·uint32 (holds ccpText at uint16 indices 6–7) +cbRgFcLcb uint16 count of the following FcLcb (8-byte) array +fibRgFcLcb cbRgFcLcb·FcLcb (holds clx → the piece table) +cswNew uint16 count of the following uint16 array +fibRgCswNew cswNew·uint16 (nFibNew overrides FibBase.nFib when present) +``` + +`ccpText` (count of CPs in the main body) is read out of `fibRgLw` as a +little-endian uint32 spanning indices 6–7; it is signed and MUST be ≥ 0, so a +value with the sign bit set **throws** (§2.5.5). `nFib` values handled: `nFib97` +(0x00C1), `nFib2000` (0x00D9), `nFib2002` (0x0101), `nFib2003` (0x010C), +`nFib2007` (0x0112). A value **above** `nFib2007` falls back to the +`FibRgFcLcb2007` layout; a value below `nFib97` **throws**. + +--- + +## Design decisions + +**Main body only, via the `ccpText` budget.** `/WordDocument` interleaves the +body with headers, footnotes, annotations, etc.; the FIB's `ccp*` counts +partition the CP space. We take only the first `ccpText` CPs by clamping each +piece to the remaining budget and stopping when exhausted. + +**Self-describing FIB read — forward-compatible.** `read(ParsedFib&)` trusts the +on-disk counts rather than a fixed layout: it reads what we model and `ignore`s +the surplus. 
A FIB from a newer Word that appends fields still parses — the +version dispatch picks the matching `FibRgFcLcb*` (or `FibRgFcLcb2007` for a +newer-than-2007 `nFib`), and the `FcLcb` block is copied **clamped** to +`min(sizeof(layout), cbRgFcLcb·8)`, so extra trailing entries are ignored and a +shorter block leaves the remainder zero (the `clx`/`fcClx` we need lives in the +`FibRgFcLcb97` base, always covered). The `csw`/`cslw` counts must still cover +the arrays we read, else they throw. + +**Fail early on malformed input** (matches the sibling `.ppt` parser). We +**throw** on: an `nFib` below `nFib97` or an unknown `nFibNew` (newer-than-2007 +`nFib` does **not** throw — it uses the 2007 layout); a `ccpText` with the sign +bit set (§2.5.5); a `csw`/`cslw` count too small to cover the array we read; an +unexpected lead byte while walking the Clx (anything other than `0x01`/`0x02`); +a piece table whose CP boundaries are not ascending; a compressed byte outside +`0x00–0xFF` or an early EOF while decoding. We **pass through** for things we +don't model: text after the main body, the `Prc` formatting runs, and every +control/field character `clean_text` drops. + +**Endianness.** Host byte order / LSB-first bit-fields assumed; shared `oldms/` +assumption, analysis and fix plan in [`../AGENTS.md`](../AGENTS.md). + +## Tests + +- `OldMs.doc_read_string_compressed` — the compressed (1-byte-per-CP) decoder + against the §2.9.73 byte map: ASCII passthrough, the `0x82–0x9F` remap, the + `0xA0–0xFF` UTF-8 round-trip. + +The FIB-robustness behaviours (negative `ccpText` rejected, newer-than-2007 +`nFib` falling back to the 2007 layout) and the `0x0C` page-break emission are +**not yet unit-tested**; there is also **no assertion-based render test** over a +real `.doc` fixture (unlike the `.ppt` cases). + +--- + +# Open work + +## 1. 
Character (font) formatting → the IR (the next feature) + +**Goal.** Extract per-run character properties (font name, size, bold, italic, +underline, strikethrough, colour, highlight) and surface them through the +abstract model's `TextStyle`, so the HTML renderer styles text instead of +emitting one flat 11pt run. This replaces the `font_size = 11pt` placeholder in +`doc_document.cpp`. + +`TextStyle` (`src/odr/style.hpp`) maps almost 1:1 onto the `.doc` character +SPRMs: + +| `TextStyle` field | SPRM (opcode) | operand → value | +|---------------------|--------------------------|------------------------------------------------------------| +| `font_size` | `sprmCHps` (0x4A43) | u16 **half-points** → `Measure(hps/2.0, pt)` (default 20 = 10pt) | +| `font_weight` | `sprmCFBold` (0x0835) | `ToggleOperand` → `FontWeight::bold` when on | +| `font_style` | `sprmCFItalic` (0x0836) | `ToggleOperand` → `FontStyle::italic` when on | +| `font_underline` | `sprmCKul` (0x2A3E) | `Kul` value, `0x00` = none → `bool` | +| `font_line_through` | `sprmCFStrike` (0x0837) | `ToggleOperand` → `bool` | +| `font_color` | `sprmCCv` (0x6870) | `COLORREF` → `Color`; legacy `sprmCIco` (0x2A42) is a palette index | +| `background_color` | `sprmCHighlight` (0x2A0C)| `Ico` highlight index → `Color` | +| `font_name` | `sprmCRgFtc0` (0x4A4F) | s16 index into `SttbfFfn` → font name (intern it; see below) | + +`font_name` is a `const char *`, so the resolved name needs stable storage — +intern it in the `ElementRegistry` (e.g. a `std::deque` whose +elements never move) and hand out the pointer. + +**How `[MS-DOC]` stores & retrieves character properties** — the authoritative +algorithm is **Direct Character Formatting** (§2.4.6.2), which reuses the +*Retrieving Text* walk we already have: +1. For a character at `cp`, run *Retrieving Text* (§2.4.1) to get its byte + offset `fc` in `/WordDocument` and the owning `Pcd` (we already compute both). +2. 
Read the **`PlcBteChpx`** (§2.8.5) at `fcPlcfBteChpx`/`lcbPlcfBteChpx` in the + table stream — a PLC keyed by **stream offset**: `aFC[n+1]` boundaries + + `aPnBteChpx[n]` (`PnFkpChpx`, 4 bytes each). +3. Find the largest `i` with `aFC[i] ≤ fc`; read a **`ChpxFkp`** (§2.9.33) at + `aPnBteChpx[i].pn * 512` in `/WordDocument` (a fixed 512-byte page: `rgfc` + run boundaries, parallel `rgb` offsets, `crun` in the last byte). +4. Find the largest `j` with `rgfc[j] ≤ fc`; the `Chpx` (§2.9.32) lives at + `rgb[j] * 2` within the page. `Chpx.grpprl` is an array of **`Prl`** = `Sprm` + (2 bytes) + operand. +5. Append the `Pcd.Prm` modifications (§2.9.214–216): a `Prm0` (inline) or + `Prm1` (index) carrying extra SPRMs for this run. + +`Prl`/`Sprm` is already modelled in `doc_structs.hpp` (`Sprm` with +`ispmd/fSpec/sgc/spra` and `operand_size()`); a **character** property is a SPRM +with `sgc == 2`. Walk each `Chpx.grpprl` by reading a 2-byte `Sprm` then +`operand_size()` operand bytes (note `spra == 6` is length-prefixed/variable), +keeping only the opcodes above. + +**First cut — direct formatting only.** Implement §2.4.6.2 (`Chpx.grpprl` + +`Pcd.Prm`) and map the table's SPRMs. Captures the common case: bold/italic/ +size/font/colour applied directly to runs. Resolve `sprmCRgFtc0` by reading +**`SttbfFfn`** (§2.9.286) once at `fcSttbfFfn`/`lcbSttbfFfn` (an STTB of `FFN` +records; `FFN.xszFfn` is the UTF-16 font name) and indexing it. Drop the +hardcoded 11pt; use 10pt (the `sprmCHps` default of 20 half-points). + +**Full fidelity — styles (later).** *Determining Formatting Properties* +(§2.4.6.6) layers, in order: document defaults → `STSH` (§2.4.6.5, +`fcStshf`/`lcbStshf`) paragraph- and character-style `grpprl`s resolved via the +paragraph's `istd` → table-style props → direct paragraph → direct character. +The first cut skips the STSH layer, so style-dependent props fall back to +defaults; wiring the STSH closes that gap. 
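
How the opcodes in the table decompose into `ispmd/fSpec/sgc/spra` can be shown with a short sketch. It does not reproduce the `Sprm` struct in `doc_structs.hpp`: `SprmFields`, `decode_sprm`, `is_character_sprm` and `half_points_to_points` are hypothetical names, and the operand sizes per `spra` are as recalled from the `[MS-DOC]` `Sprm` definition, so check them against the spec before relying on them.

```cpp
#include <cstdint>

struct SprmFields {
  std::uint16_t ispmd; // bits 0-8: property id within its group
  bool f_spec;         // bit 9: the fSpec flag
  std::uint8_t sgc;    // bits 10-12: property group (2 == character)
  std::uint8_t spra;   // bits 13-15: operand size class
};

// Splits a 16-bit SPRM opcode with shifts and masks (no C++ bit-fields, so
// the result does not depend on the compiler's bit-field allocation order).
constexpr SprmFields decode_sprm(std::uint16_t opcode) {
  return {static_cast<std::uint16_t>(opcode & 0x01FF),
          ((opcode >> 9) & 0x1) != 0,
          static_cast<std::uint8_t>((opcode >> 10) & 0x7),
          static_cast<std::uint8_t>((opcode >> 13) & 0x7)};
}

// Fixed operand size in bytes for a given spra. Returns -1 for spra == 6
// (variable length) and for values that do not fit in three bits.
constexpr int operand_size(std::uint8_t spra) {
  switch (spra) {
  case 0: // ToggleOperand
  case 1:
    return 1;
  case 2:
  case 4:
  case 5:
    return 2;
  case 3:
    return 4;
  case 7:
    return 3;
  default:
    return -1;
  }
}

constexpr bool is_character_sprm(std::uint16_t opcode) {
  return decode_sprm(opcode).sgc == 2;
}

// sprmCHps operand (half-points) -> points.
constexpr double half_points_to_points(std::uint16_t half_points) {
  return half_points / 2.0;
}
```

For example, `sprmCHps` (0x4A43) decodes to `sgc == 2`, `spra == 2`, so a 2-byte operand follows, and the default operand of 20 half-points is 10pt. `sprmCFBold` (0x0835) has `spra == 0`, a 1-byte toggle operand.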
+ +**Wiring to the abstract model.** Today `parse_tree` concatenates all body +pieces into one `body_text` and emits one `text` per paragraph. Per-run styling +needs run boundaries, expressed in `/WordDocument` byte offsets +(`ChpxFkp.rgfc`, `PlcBteChpx.aFC`) — so: +1. **Keep the FC↔text mapping.** While concatenating, retain each piece's + `data_offset` and compression so any character's source `fc` is recoverable + (the `CharacterIndex` already holds this; thread it through instead of + discarding it after building `body_text`). +2. **Split paragraphs into runs.** Within a paragraph, cut at every `ChpxFkp` + run boundary inside it, resolve each run's `TextStyle` once, and emit a + **`span`** (`ElementType::span`, already wired via `SpanAdapter`) per run, + with the `text` element(s) as its children. Paragraph/line-break splitting + stays as-is. +3. **Store the style.** Add a `TextStyle` side-map to `ElementRegistry` keyed by + span id (mirror the text side-payload, and the frame-payload pattern in + `presentation`) plus the font-name intern store. `SpanAdapter::span_style` + returns the stored style; `text_style` / `paragraph_text_style` then return + `{}` (or the paragraph mark's run style) instead of the 11pt hack. + +## 2. Coverage gaps + +- **Only the main document body.** `parse_tree` stops at the `ccpText` budget, + so headers/footers, footnotes, endnotes, comments/annotations, and text boxes + — each its own CP range after the body (`ccpFtn`, `ccpHdd`, `ccpAtn`, … in + FibRgLw97, located via the matching `plcf*` in the table stream) — are + dropped. Extending coverage means walking the later CP ranges and their + `Plcf*` structures. +- **Tables.** Cell text renders as plain paragraphs: the end-of-cell mark `0x07` + is dropped by `clean_text` and row/cell structure (§2.4.3, `sprmPFInTable` / + `sprmPTtp` / the `TC`/`TAP` tables) is unmodelled. Reconstruct table structure + from the paragraph properties to emit real `table`/`row`/`cell` elements. 
+ Paragraph-level formatting (alignment, indent, spacing) via `PlcBtePapx` → + `PapxFkp` belongs here too, alongside the character work. +- **Fields show only the cached result.** `clean_text` keeps the field *result* + and drops the *instruction* (§2.8.25); page numbers, dates, refs show their + last-saved value and are never evaluated. Acceptable for "visible text". +- **Images / OLE / drawn objects.** The anchor characters (`0x01` inline + picture, `0x08` floating picture, OLE) are dropped. No image extraction; would + require `PlcfSpa` / the Office Art (`dggInfo`) drawing data. +- **Encrypted / obfuscated documents.** `FibBase.fEncrypted` / `fObfuscated` are + parsed but not acted on; `decrypt` throws `UnsupportedOperation`. + XOR-obfuscated and `[MS-OFFCRYPTO]`-encrypted `.doc` are unsupported. + +## 3. Smaller shortcomings + +- **Endianness.** Shared `oldms/` shortcoming — see [`../AGENTS.md`](../AGENTS.md). + For `.doc`: every field is read in host byte order, and the + `FibBase`/`Sprm`/`FcCompressed` bit-fields in `doc_structs.hpp` assume + LSB-first allocation. 
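
One possible direction for the endianness item, shown only as a sketch and not necessarily the fix plan in `../AGENTS.md`: load integers byte by byte and unpack bit-fields with shifts and masks. `load_u16_le`, `load_u32_le`, `FcCompressedFields`, `unpack_fc_compressed` and `data_offset` are hypothetical helpers; the `FcCompressed` layout assumed here is `fc` in the low 30 bits and `fCompressed` in bit 30.

```cpp
#include <cstdint>

// Explicit little-endian loads: the result is the same on any host.
inline std::uint16_t load_u16_le(const std::uint8_t *p) {
  return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

inline std::uint32_t load_u32_le(const std::uint8_t *p) {
  return static_cast<std::uint32_t>(p[0]) |
         (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) |
         (static_cast<std::uint32_t>(p[3]) << 24);
}

// FcCompressed unpacked from its raw 32-bit value instead of being overlaid
// with a C++ bit-field struct.
struct FcCompressedFields {
  std::uint32_t fc; // low 30 bits
  bool compressed;  // bit 30 (fCompressed)
};

inline FcCompressedFields unpack_fc_compressed(std::uint32_t raw) {
  return {raw & 0x3FFFFFFFu, ((raw >> 30) & 1u) != 0};
}

// Byte offset of the piece's text in /WordDocument: fc for UTF-16 pieces,
// fc / 2 for compressed (one byte per CP) pieces.
inline std::uint32_t data_offset(const FcCompressedFields &fields) {
  return fields.compressed ? fields.fc / 2 : fields.fc;
}
```

The cost is a few shifts per field; in exchange the parsing no longer depends on host byte order or on how the compiler allocates bit-fields.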
+ +## Reference: the read path + +``` +WordDocument stream +└─ FIB @ 0 [MS-DOC] §2.5.1 + ├─ FibBase (32 B): fWhichTblStm, fEncrypted, nFib + ├─ csw·u16 fibRgW + ├─ cslw·u32 fibRgLw → ccpText (idx 6–7) §2.5.5 + ├─ cbRgFcLcb·FcLcb fibRgFcLcb → clx.fc §2.5.7 (version by nFib) + └─ cswNew·u16 fibRgCswNew → nFibNew + +Table stream (/1Table or /0Table per fWhichTblStm) §1.4 +└─ Clx @ clx.fc §2.9.38 + ├─ RgPrc: 0..n Prc (lead 0x01, skipped) §2.9.209 + └─ Pcdt (lead 0x02) §2.9.178 + └─ PlcPcd: aCp[n+1] + aPcd[n] (Pcd) §2.8.35 / §2.9.177 + └─ Pcd.fc = FcCompressed §2.9.73 + ├─ fCompressed=0 → UTF-16 @ fc + └─ fCompressed=1 → 8-bit @ fc/2 (+ 0x82–0x9F map) + +Retrieving Text algorithm: §2.4.1 (steps 1–6, matches parse_tree) +Field characters 0x13/0x14/0x15: §2.8.25 +``` + +Character-formatting path (open work §1), keyed by `/WordDocument` byte offset +`fc`: + +``` +Table stream +├─ PlcBteChpx @ fcPlcfBteChpx §2.8.5 +│ └─ aFC[n+1] (stream offsets) + aPnBteChpx[n] (PnFkpChpx, 4 B) +├─ SttbfFfn @ fcSttbfFfn (font names, FFN.xszFfn) §2.9.286 +└─ STSH @ fcStshf (styles — full fidelity only) §2.4.6.5 + +WordDocument stream +└─ ChpxFkp @ aPnBteChpx[i].pn * 512 (512-byte page) §2.9.33 + ├─ rgfc[crun+1] run boundaries (stream offsets) + ├─ rgb[crun] → Chpx @ rgb[j]*2 within page + └─ crun (last byte) + └─ Chpx = cb + grpprl(Prl[]) §2.9.32 + └─ Prl = Sprm (2 B) + operand §2.2.x + └─ character SPRMs have sgc == 2; + Pcd.Prm §2.9.214–216 + +Direct Character Formatting: §2.4.6.2 (Determining Formatting Properties: §2.4.6.6) +Font SPRMs: CHps 0x4A43, CFBold 0x0835, CFItalic 0x0836, CKul 0x2A3E, + CFStrike 0x0837, CCv 0x6870, CHighlight 0x2A0C, CRgFtc0 0x4A4F +``` diff --git a/src/odr/internal/pdf/AGENTS.md b/src/odr/internal/pdf/AGENTS.md new file mode 100644 index 00000000..1cfa284f --- /dev/null +++ b/src/odr/internal/pdf/AGENTS.md @@ -0,0 +1,450 @@ +# In-house PDF support (`pdf/`) — status, design & roadmap + +What the `pdf/` module does **today**, the **design decisions** behind it, 
and +the **staged roadmap** for turning it into a faithful renderer. Reference links +(web resources; offline spec docs are planned) live in [`README.md`](README.md). + +This is the `DecoderEngine::odr` path for PDF; the sibling `../pdf_poppler/` +module (poppler / pdf2htmlEX, behind `ODR_WITH_PDF2HTMLEX`) is the +production-quality alternative engine. + +**Scope today.** Parse the PDF object/file structure (classic cross-reference +tables, cross-reference streams, object streams, hybrid files), build the page +tree with fonts and annotations, tokenize page content streams into graphics +operators, and emit a **proof-of-concept HTML rendering**: absolutely positioned +text spans per `Tj`, pages sized from `MediaBox`. Encrypted files are decrypted +(RC4, AES-128, AES-256). No graphics, no images, no font files. Experimental and +not production-quality — the HTML path still contains debug `std::cout` output. + +--- + +## What works + +- `.pdf` is detected by file magic and opened as `PdfFile` + (`DecoderEngine::odr`); `is_decodable()` returns `false` and `file_meta()` + carries only the file type. All parsing is lazy, on HTML request. +- **Object syntax**: null, booleans, integers/reals, names (incl. `#xx` + escapes), literal strings (`\` and `\ooo` escapes), hex strings, arrays, + dictionaries, indirect references (`n g R`) — standalone and nested. +- **File structure**: header, `n g obj … endobj`, `stream` payloads (via + `/Length`, with a scan-to-`endstream` fallback), classic `xref` tables, + `trailer`, `startxref`, `%%EOF`; both sequential reading (`read_entry`) and + random access via the xref table. **Incremental updates**: `startxref` found + by scanning the file tail, then the `Prev` chain is followed (cycle-guarded), + merging xref tables so the newest entry for each object wins. 
+- **Cross-reference streams, object streams, hybrid files** (PDF 1.5+): each + trailer-chain section may be a classic table or a cross-reference stream + (`/W`/`/Index`/`Size`, decoded via the filter framework, entry types 0/1/2; + unknown types treated as absent). Xref entries are a tagged union + (`FreeEntry`/`UsedEntry`/`CompressedEntry`); compressed objects are read from + their object stream (`/N`/`/First` header, decoded once and cached per + stream). Hybrid files follow the `XRefStm`-before-`Prev` lookup order. + Lenient where the wild demands: `/Type /XRef` only warns, references to free + or absent objects resolve to null with a `Logger` warning, `n g obj` need not + end with a newline. +- **Page tree**: `Catalog` → `Pages` (recursive) → `Page` with per-page + `Resources` (fonts only) and `Annots` (raw dictionary only). Objects cached by + reference (`DocumentParser::m_objects`). +- **Inherited page attributes**: the inheritable set per spec Table 30 — + `Resources`, `MediaBox`, `CropBox`, `Rotate` — resolved by threading an + accumulator down the `Pages` recursion (no `Parent` walk). Each `Page` carries + the resolved `media_box`/`crop_box`/`rotate` and its resolved `resources`. + Lenience: `CropBox` defaults to `MediaBox`, `Rotate` normalized to + {0,90,180,270}, a `MediaBox` missing everywhere falls back to US Letter, a + missing `Resources` to an empty dict — all with a `Logger` warning. +- **Stream filters** (`pdf_filter`): `/Filter` and `/DecodeParms` honoured, + including chains and the inline-image abbreviations — FlateDecode and + LZWDecode (both with TIFF and PNG predictors), ASCIIHexDecode, ASCII85Decode, + RunLengthDecode. Image codecs (DCTDecode, JPXDecode, CCITTFaxDecode, + JBIG2Decode) are deliberately not decoded: `decode()` stops and hands back the + still-encoded payload for stage 4; `read_decoded_stream` treats them as an + error. The `Crypt` filter passes through only as `Identity`. 
+- **Encryption** (`pdf_encryption`): the standard security handler. An + `Authenticator` parses `/Encrypt` and authenticates the password (user then + owner; the empty password is tried first, so owner-locked files open + transparently), producing a `Decryptor` that decrypts object strings and + streams. RC4 (V 1/2, R 2/3, 40–128 bit), + AES-128 crypt filters (V 4, R 4 — `StdCF` with `V2`/`AESV2`, `Identity`, + honouring `StmF`/`StrF`) and AES-256 (V 5, R 6, AESV3) are all supported, + including owner-only files and `EncryptMetadata false`. Streams are decrypted + before `/Filter` decoding; cross-reference streams and object-stream members + are left untouched. The user password is never retained: once `authenticate` + succeeds, the derived key lives only inside the `Decryptor` (no accessor), and + `PdfFile` carries the whole authenticated `Decryptor` forward — from the + encryption probe to the render parse — so the HTML service unlocks the + document without re-deriving the key. Permission bits (`/P`) are recorded, not + enforced. +- **Fonts / text mapping**: a font's `ToUnicode` CMap stream is decoded and + parsed; `bfchar` mappings with 1-byte glyph codes and single UTF-16 units are + applied. Unmapped glyphs pass through as their byte value. +- **Content streams**: the full graphics-operator vocabulary is tokenized; + `GraphicsState` executes a subset (state stack `q`/`Q`, matrices `cm`/`Tm`, + line parameters, text state `Tc`/`Tw`/`Tz`/`TL`/`Tf`/`Tr`/`Ts`, text + positioning `Td`/`TD`, grey/RGB/CMYK colors, glyph metrics `d0`/`d1`). Unknown + operators are logged to stderr and skipped. +- **HTML**: one `document.html` view; each page is a `div` sized from `MediaBox` + (points → inches), each `Tj` becomes an absolutely positioned `span` at the + text-state offset with `font-size` from `Tf` and the CMap-translated text. + `TJ`/`'`/`"` are recognized but only printed to stdout, not rendered. 
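As an illustration of the font/text mapping described in the list above, here is a small sketch of the 1-byte `bfchar` translation with byte pass-through for unmapped codes. `SimpleCMap` and its members are hypothetical stand-ins, not the actual `CMap` class in `pdf_cmap.{hpp,cpp}`.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the module's CMap: 1-byte glyph code -> one
// UTF-16 code unit, as populated from `beginbfchar` sections.
class SimpleCMap {
public:
  void map_bfchar(std::uint8_t code, char16_t unit) { m_bfchar[code] = unit; }

  // Translate a PDF string of 1-byte glyph codes. Unmapped codes pass
  // through as their byte value, mirroring the behaviour described above.
  std::u16string translate(const std::string &codes) const {
    std::u16string result;
    result.reserve(codes.size());
    for (const char c : codes) {
      const auto code = static_cast<std::uint8_t>(c);
      const auto it = m_bfchar.find(code);
      result.push_back(it != m_bfchar.end() ? it->second
                                            : static_cast<char16_t>(code));
    }
    return result;
  }

private:
  std::unordered_map<std::uint8_t, char16_t> m_bfchar;
};
```

The pass-through only yields sensible text when the font's codes happen to coincide with Latin-1, which is why the roadmap treats the full code → Unicode chain as its own stage.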
+ +## Module layout + +| File (`pdf/`) | Role | +|----------------------------------------|-------------------------------------------------------| +| `pdf_object.{hpp,cpp}` | Object model: `Object` (`std::any`-based variant), `Array`, `Dictionary`, `Name`, `StandardString`/`HexString`, `ObjectReference`; `to_stream`/`to_string` dumping | +| `pdf_object_parser.{hpp,cpp}` | Tokenizer over `std::streambuf`: whitespace/lines, numbers, names, strings, arrays, dictionaries, references | +| `pdf_file_object.{hpp,cpp}` | File-structure entries: `Header`, `IndirectObject`, `Trailer`, `Xref` (tagged-union entries, `append`/`merge_hybrid`), `StartXref`, `Eof`, the `Entry` any-holder; `parse_xref_stream_table` and the `ObjectStream` payload wrapper | +| `pdf_file_parser.{hpp,cpp}` | File-level reads on top of `ObjectParser`: indirect objects, xref, trailer, startxref, stream payloads, `seek_start_xref` | +| `pdf_filter.{hpp,cpp}` | Stream filter framework: `decode()` over the `/Filter`/`/DecodeParms` chain; ASCIIHex/ASCII85/LZW/Flate/RunLength decoders, TIFF/PNG predictors; image codecs returned undecoded (`DecodeResult::stopped_at_filter`) | +| `pdf_document_parser.{hpp,cpp}` | `parse_document()`: xref/trailer chain → catalog → page tree; lazy object reads with cache; (deep) reference resolution | +| `pdf_encryption.{hpp,cpp}` | Standard security handler: `Authenticator` (parse `/Encrypt`, authenticate password → `Decryptor`) and `Decryptor` (decrypt strings/streams; RC4, AES-128, AES-256), plus a `standard_security` namespace of pure key/password algorithms for known-answer tests | +| `pdf_document.hpp` | `Document`: arena of `Element`s + `catalog` pointer | +| `pdf_document_element.hpp` | Element structs: `Catalog`, `Pages`, `Page`, `Annotation`, `Resources`, `Font` | +| `pdf_cmap.{hpp,cpp}` | `CMap`: 1-byte glyph → UTF-16 `bfchar` map + string translation | +| `pdf_cmap_parser.{hpp,cpp}` | `ToUnicode` CMap stream parser (`begincodespacerange`/`beginbfchar`/`beginbfrange`; 
only `bfchar` applied) | +| `pdf_graphics_operator.hpp` | `GraphicsOperatorType` enum (full operator set) + `GraphicsOperator` (type + `Object` arguments) | +| `pdf_graphics_operator_parser.{hpp,cpp}` | Content-stream tokenizer: arguments then operator name | +| `pdf_graphics_state.{hpp,cpp}` | `GraphicsState`: stack of `State` (general/path/text/color), `execute(op)` for the modelled subset | +| `pdf_file.{hpp,cpp}` | `abstract::PdfFile` wrapper; probes encryption at construction and implements `password_encrypted()`/`decrypt()`, carrying the authenticated `Decryptor` (not the password) so rendering needs no re-derivation | + +Consumers outside the module: `open_strategy.cpp` (detection / engine +selection) and `html/pdf_file.cpp` (`create_pdf_service`). + +## Pipeline: how a `.pdf` becomes HTML + +1. **Wiring.** `open_strategy` maps `FileType::portable_document_format` to + `PdfFile`; `DecoderEngine::poppler` (or the unknown-file-type fallback) can + yield a `PopplerPdfFile` instead when built with `ODR_WITH_PDF2HTMLEX`. + `html::translate(PdfFile)` picks the matching HTML service. +2. **Locate the xref.** `seek_start_xref` seeks to `EOF − 64`, scans for + `startxref`; `read_start_xref` yields the most recent xref offset. + (`read_header` exists but `parse_document` does not call it — the `%PDF-` + header is only checked by magic detection earlier.) +3. **Walk the trailer chain.** `read_xref_section` dispatches: a classic table + (`read_xref` + `read_trailer`) or a cross-reference stream (an indirect + object whose dictionary doubles as the trailer dict; payload decoded via the + filter framework, entries via `parse_xref_stream_table`). A trailer `XRefStm` + (hybrid file) is read next and fills entries the classic table lacks or marks + free (`merge_hybrid`). Sections merge into the accumulated table + (`std::map::insert` keeps the first/newest entry), then `Prev` is followed + (cycle-guarded). The first/newest trailer provides `Root`. +4. 
**Build the page tree.** `parse_catalog` → `parse_pages` recurses over + `Kids` (dispatching on `Type`). Each `Page` keeps its raw dictionary, its + `Contents` reference(s), parsed `Resources` (the `Font` table; each font's + `ToUnicode` CMap is parsed if present) and `Annots` (raw). `read_object` + dispatches on the xref entry kind: used → seek + `read_indirect_object`; + compressed → owning object stream decoded once, cached, member parsed from + the cached payload; free/absent → null with a warning. Parsed objects cached + by reference. +5. **Decode content.** Per page (depth-first), the `Contents` streams are read, + decoded through their `/Filter` chain (`read_decoded_stream`), concatenated + with a newline between streams. +6. **Execute and emit.** `GraphicsOperatorParser` tokenizes; `GraphicsState` + updates the state stack. `T*` advances the text offset by `size + leading`; + `Tj` emits a positioned `span` using `state.text.offset` and the `Tf` size, + glyphs translated through the font's CMap. The text and transform matrices + are tracked but **not applied** to positioning. + +--- + +## Design decisions + +**Stream-based parsing with seeks, lazy object access.** Everything is parsed +off a `std::istream`/`std::streambuf` — no full-file buffer. Random access +(xref lookups, stream payloads) seeks; sequential tokenizing uses +single-character peek/bump (`geti`/`getc`/`bumpc`). Objects are parsed only when +referenced, and parsed `IndirectObject`s are cached by reference, so shared +objects are read once. Positions are `std::uint32_t` (files ≥ 4 GiB are out of +scope). + +**`std::any`-based object model.** `Object` holds its value in `std::any` with +typed `is_*`/`as_*` accessors (mirrors `oldms/`'s `Entry`). Pro: one value type +throughout parser, document elements, and operator arguments. 
Con: no exhaustive +matching, RTTI lookups, and accidental copies are easy: `resolve_object_copy` +exists because rvalue access proved fiddly (see the `TODO why rvalue not +working?` in `pdf_document_parser.cpp`). + +**References are recognized by lookahead.** `n g R` is plain integers until the +`R` appears, so `read_array`/`read_dictionary` patch references after the fact. +A standalone `read_object` therefore returns the *id* integer of a reference; +only array/dictionary contexts and `read_object_reference` assemble real +references. Works for well-formed files; a known sharp edge (`TODO this seems +hacky`). + +**Element tree as an arena.** `Document` owns all elements +(`vector<unique_ptr<Element>>`); `Catalog`/`Pages`/`Page`/… hold raw non-owning +pointers plus their original dictionary (`Element::object`), so unmodelled keys +stay inspectable. Navigation is by typed `is_<T>()`/`as_<T>()` accessors over +`kids`: thin `dynamic_cast` wrappers mirroring `Object`'s `is_*`/`as_*` +surface (the former `Type` tag enum was dropped in favour of RTTI). + +**Fail early on malformed structure, tolerate unknown content.** Structural +surprises **throw** `std::runtime_error` (missing `obj`/`endobj`/`stream`/ +`endstream`/`xref`/`startxref`, unexpected characters, an unresolvable +`/Length`, an unknown page-tree element type, stream exhaustion). Unknown +**content** is tolerated: unrecognized operators logged and skipped, unmodelled +operators ignored by `execute`, annotations keep their raw dictionary, CMap +`codespacerange`/`bfrange` parsed past without effect. References to free/absent +objects resolve to null with a warning; unknown xref-stream entry types treated +as absent (7.5.8.3). + +**Debug output still in place.** `html/pdf_file.cpp`, `pdf_graphics_state.cpp`, +`pdf_graphics_operator_parser.cpp` and `pdf_cmap_parser.cpp` print diagnostics +(and one leftover `"hi"` breakpoint marker) to stdout/stderr instead of +`Logger`. Proof-of-concept residue; should move to `Logger` or be removed. 
+`DocumentParser` itself takes an optional `Logger &` (default `Logger::null()`) +and routes its warnings through it — new diagnostics should do the same. + +--- + +## Tests + +- `test/src/internal/pdf/pdf_filter.cpp` — **assertion-based**, all inputs + inline strings: every decoder, predictors, chains, image-codec stop, + `Crypt`/unknown-filter errors. +- `test/src/internal/pdf/pdf_file_object.cpp` — **assertion-based**, inline + only: cross-reference-stream entry decoding (field widths incl. 0, type + default, big-endian fields, subsections, unknown types, error paths), + `ObjectStream` header parsing and member lookup, `Xref::append` / + `Xref::merge_hybrid` precedence. +- `test/src/internal/pdf/pdf_encryption.cpp` — **assertion-based**, inline + vectors only: the standard security handler across R 2 (RC4-40), R 3 + (RC4-128), R 4 (AES-128/AESV2, incl. `EncryptMetadata false` and an + owner-locked file) and R 6 (AES-256). Vectors come from the real fixtures and + from `qpdf --encrypt` output frozen as literals — decrypting back to a known + marker, so no test is circular and no fixture file ships. + `crypto_util_test.cpp` covers the new MD5/RC4/SHA-384/512 primitives against + public standard vectors. +- `test/src/internal/pdf/pdf_document_parser.cpp` — **assertion-based** + whole-file tests over mini-PDFs assembled by the test-only + `pdf_test_file_builder.{hpp,cpp}` (computes xref offsets/`startxref`, so tests + show only the dictionaries; classic-table and uncompressed-xref-stream + variants), plus inherited-page-attribute coverage (a multi-level `Pages` tree: + per-page resolved `MediaBox`/`CropBox`/`Rotate`/`Resources`, override vs. + inheritance, the `CropBox` ← `MediaBox` default, the missing-`MediaBox` + US-Letter lenience). 
End-to-end: the classic fixture + `odr-public/pdf/style-various-1.pdf`, plus decryption of + `odr-public/pdf/Casio_WVA-M650-7AJF.pdf` (RC4, empty password) and + `odr-private/pdf/encrypted_fontfile3_opentype.pdf` (AES-256; skipped when the + private submodule is absent). The `odr-private` xref-stream/objstm/hybrid + fixtures (`basic_text.pdf`, `geneve_1564.pdf`, `test_fail.pdf`, `Kayla….pdf`, + `svg_background…issue402.pdf`, `Core_v5.1.pdf`, `onepage.pdf`) were verified + manually but are not pinned in unit tests. Also still contains the original + print-everything smoke test. +- `test/src/internal/pdf/pdf_file_parser.cpp` — sequential `read_entry` walk + (smoke) + assertion-based xref/trailer/root navigation over + `style-various-1.pdf`. + +No assertion-based coverage of the tokenizer (escapes, references, hex strings), +the CMap, or the HTML output. + +--- + +# Roadmap + +Goal: faithful read-only HTML for common real-world PDFs through the odr engine, +so the poppler/pdf2htmlEX engine becomes optional rather than required. Stages +are ordered by what they unlock; 0–2 are roughly sequential, 3 and 4 are +independent, 5 builds on whatever pages already render. Each stage gets its own +detailed design before implementation. + +## Stage 0 — file-format compatibility (prerequisite) — **mostly done** + +Modern producers write PDF 1.5+ structures the original parser rejected. +Cross-reference/object streams + hybrid files, the filter framework (incl. PNG +predictors), inherited page attributes, and encryption (RC4 / AES-128 / AES-256) +are **all implemented** (see *What works*). The one remaining piece: + +**Xref recovery for broken files** (post-stage-0; the WP2 code left room): +- Trigger: any structural throw during xref-chain walking or a failed object + lookup (`startxref` missing/garbage, offsets wrong). 
+- Recovery: a single forward scan for `n g obj` line starts (the existing + sequential `read_entry` machinery is most of this), building a synthetic + `Xref` (last definition of an id wins), collecting `trailer` dicts and + `/Type /Catalog` objects as `Root` candidates; objstm members indexed by + scanning recovered object streams. +- Tests fit inline strings well: the scan ignores xref offsets, so a broken + mini-PDF needs no offset bookkeeping — write a literal with a garbage + `startxref`, duplicate ids, or a missing trailer, and assert what got rebuilt. + Real-world fixture: `odr-private/pdf/order-EK52VKL0.pdf` — an HTTP response + accidentally saved as `.pdf` (starts with `HTTP/1.0 200 OK`). + +Remaining encryption edge cases (deferred until a real file needs them): +per-stream `/Crypt` filter `Name` overrides, the `EncryptMetadata false` +metadata-stream `Identity` special case, and `Perms` (Algorithm 13) validation; +the public-key security handler and R 5 are out of scope. + +## Stage 1 — text extraction: the code → Unicode chain + +PDF strings are **character codes**; per font, walk this chain and record +per-code Unicode (or "unknown", which stage 3 handles): + +1. **`ToUnicode` CMap** — extend the existing `CMap`: `bfrange`, + `codespacerange` (multi-byte codes), multi-character targets. +2. **Simple fonts**: `/Encoding` base (WinAnsi/MacRoman/Standard) + + `/Differences` → glyph names → Unicode via the Adobe Glyph List (incl. + `uniXXXX`/`uXXXXXX` names). +3. **Composite (Type0/CID) fonts**: `Identity-H/V` plus the predefined CMaps + (CJK); map CID → Unicode via the CID system info where defined. +4. **Embedded font fallback** (needs stage 3's font *reading*): reverse the + TrueType `cmap`; read glyph names from Type1/CFF charstrings. +5. Nothing applies → mark the run "no Unicode" for stage 3's re-encoding. + +`/ActualText` (tagged PDFs, ligatures) overrides the whole chain for extraction. 
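For step 2 of the chain above, a hedged sketch of the `uniXXXX`/`uXXXXXX` glyph-name rule as I understand it from the Adobe Glyph List specification: uppercase hex digits only, no surrogate code points. The function name `unicode_from_glyph_name` is hypothetical. The sketch handles only a single four-digit group after `uni` (the specification also allows several groups for ligatures) and leaves ordinary names such as `Aacute` to a separate AGL table lookup.

```cpp
#include <optional>
#include <string_view>

namespace {

// Parse uppercase hex digits; AGL-style names do not use lowercase hex.
std::optional<char32_t> parse_upper_hex(std::string_view digits) {
  char32_t value = 0;
  for (const char c : digits) {
    if (c >= '0' && c <= '9') {
      value = value * 16 + static_cast<char32_t>(c - '0');
    } else if (c >= 'A' && c <= 'F') {
      value = value * 16 + static_cast<char32_t>(c - 'A' + 10);
    } else {
      return std::nullopt;
    }
  }
  return value;
}

// A Unicode scalar value: in range and not a surrogate.
bool is_scalar_value(char32_t cp) {
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

} // namespace

// Hypothetical helper: returns the code point encoded in a `uniXXXX` or
// `uXXXX`..`uXXXXXX` glyph name, or nullopt if the name has another form.
std::optional<char32_t> unicode_from_glyph_name(std::string_view name) {
  // Drop a variant suffix such as ".alt" before interpreting the name.
  name = name.substr(0, name.find('.'));

  std::string_view digits;
  if (name.size() == 7 && name.substr(0, 3) == "uni") {
    digits = name.substr(3);
  } else if (name.size() >= 5 && name.size() <= 7 && name[0] == 'u') {
    digits = name.substr(1);
  } else {
    return std::nullopt;
  }

  const std::optional<char32_t> cp = parse_upper_hex(digits);
  if (!cp || !is_scalar_value(*cp)) {
    return std::nullopt;
  }
  return cp;
}
```

A caller would try this rule only after the AGL name table misses, and would record "unknown" when both fail so that stage 3's re-encoding can take over.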
+ +## Stage 2 — text positioning & metrics + +Independent of Unicode work; fixes layout even with today's partial CMaps. + +- Apply the full transform: text matrix × CTM (both tracked in `GraphicsState` + but never applied), text rise, horizontal scaling. +- **Glyph advances**: `/Widths` + `/MissingWidth` (simple), `/W` + `/DW` (CID), + char/word spacing, the numeric adjustments in `TJ` — so `TJ`, `'`, `"` finally + render and `Tj` runs land correctly. +- **Form XObjects** (`Do` on a `/Form`): recursive content-stream execution with + scoped `/Resources` and the form matrix. Many producers put most page content + inside forms, and tiling patterns (stage 4) and annotation appearances + (stage 5) run on the same machinery — a structural prerequisite. +- **Text render modes** (`Tr`): mode 3 (invisible text, OCR-over-scan) must stay + selectable but unpainted; stroke/clip modes (1–2, 4–7) need graceful + degradation. +- **Space inference**: PDFs routinely encode no spaces; insert them from + glyph-gap heuristics (as pdf2htmlEX does) so copy/paste and search work. +- Layout side of bidi (RTL run ordering) and vertical writing (Identity-V/CJK). +- HTML mapping decision: per-run spans with CSS `transform` (cheap, breaks on + heavy kerning) vs. per-glyph positioning (exact, verbose) — likely per-run + with a kerning threshold that splits runs, like pdf2htmlEX. + +## Stage 3 — fonts in HTML + +Needed for visual fidelity regardless of text extraction. + +**Decision (2026-06): in-house, no FontForge.** pdf.js proves complete font +conversion is doable without a font library; pdf2htmlEX uses FontForge at the +cost of a notoriously heavy build. No trimmed off-the-shelf alternative does +what we need (FreeType/stb_truetype are read-only; hb-subset can only subset +along the *existing* `cmap`, so it cannot inject the PUA mappings below). +Expected ~5–8k lines of focused C++ — on the order of an `oldms/` module. 
+Reading (SFNT tables, CFF charsets) is the easy part and is needed by stage 1.4 +anyway. + +**Architecture: IR for facts, pass-through for glyphs.** No glyph-level font IR: +decompiling and recompiling outlines is the FontForge model — loses hinting, +risks fidelity, and with one output format (SFNT) the M×N payoff never +materializes. Glyph data (outlines, hinting, charstrings) passes through +byte-for-byte; even Type1 → Type2 charstrings is a direct sibling-format +translation. What *is* shared: a thin `FontProgram`-style interface — per-flavor +readers producing the facts every consumer needs (glyph count, glyph → Unicode, +advance widths, units-per-em, name, bbox, symbolic flag) with raw bytes kept +alongside. Stage 1.4 reads Unicode from it, the OTF wrap synthesizes +`head`/`hhea`/`hmtx`/`OS/2` from it, the re-encoder assigns PUA code points from +its glyph count. + +**Intermediate milestone — fonts as first-class library citizens.** Before +wiring fonts into PDF output, ship them standalone: +- **Font files as a `DecodedFile` type** (precedent: `SvmFile`, `ImageFile`): + `FileType` entries + magic detection (SFNT `0x00010000`/`OTTO`/`wOFF`), a + `FontFile` category, and `html::translate(FontFile)` emitting a **specimen + page** — name/metrics header plus a glyph grid, font served via `@font-face`. + Keep the UI at "specimen page"; no font-editor scope creep. +- The glyph grid must show **every** glyph, including ones no `cmap` reaches — + which forces building the PUA re-encoding, table-directory rebuild, and OTF + wrap *first*, against a directly viewable deliverable with font-only tests. +- **In parallel: PDF as a container.** Expose embedded fonts as an + `abstract::Filesystem` (`/fonts/F1.ttf`, …) and reuse the filesystem HTML + service (as for ZIP/CFB). Doubles as the corpus harvester. + +Sub-stages, ordered by corpus frequency, each independently useful: +1. 
**TrueType** (`FontFile2`, CIDFontType2 — bulk of modern PDFs): serve nearly + as-is via `@font-face`; implement the `cmap` rewrite (format-4/12 subtable, + splice the table directory, recompute `head.checkSumAdjustment`). +2. **Bare CFF** (`FontFile3`/Type1C): wrap into an OTF container by synthesizing + the ~8 required tables; take advance widths from `/Widths`/`/W` rather than + interpreting charstrings. +3. **Type1** (`FontFile` — older docs, pdfTeX/academic PDFs): `eexec` + decryption, Type1 → Type2 charstrings, build a CFF, reuse sub-stage 2. The + hardest single piece but precisely specified (Adobe T1 spec; pdf.js as + reference). +4. **Type3** (drawing procedures, no font file — scientific plots) → SVG glyphs + reusing stage 4's path rendering; plus **non-embedded fonts**: substitute the + standard 14 + common names with CSS fallbacks + metrics from `/Widths`. + +Mechanisms and guards: +- **Re-encoding for unmapped glyphs** (the general workaround): rewrite the + `cmap` so deterministic PUA code points (`U+E000 + glyph index`) map to the + glyphs, emit those in the HTML, mark such runs non-extractable + (`user-select: none`, `aria-hidden`). Display correct; copy/search knowingly + garbage. Option: re-encode *all* fonts this way (pdf2htmlEX's choice) for one + uniform pipeline. +- **Broken-font long tail**: real embedded fonts are routinely malformed, and + browsers run web fonts through a sanitizer (OTS) that silently rejects them. + Regenerating the table directory (which the re-encode/wrap does anyway) covers + most of it; start strict, add repair heuristics as real files demand. CI gate: + run **OTS** over every produced font (test-time only); optionally FreeType as + a second oracle. Neither ships in the product. + +## Stage 4 — graphics + +**Decision (2026-06): SVG generation, no rasterizer.** pdf2htmlEX uses poppler +to render non-text into a per-page background image; we generate SVG instead — +serialization, not rendering. 
pdf.js proves the full PDF graphics model needs no +native renderer. The PDF and SVG imaging models are close cousins (PostScript +heritage), so the mapping is mostly mechanical. Trade-off: pdf2htmlEX gets the +long tail right for free via poppler, while our fidelity is bounded by operator +coverage, countered by the test oracle below. The rasterized-background +fallback is **rejected**: it reintroduces exactly the renderer dependency this +stage exists to avoid. + +- Vector content → inline SVG per page, layered under the text spans: paths, + fill rules, stroke parameters, transforms; clipping → nested ``; + tiling patterns → `<pattern>` (form-XObject machinery from stage 2); + axial/radial shadings (types 2/3) → `linearGradient`/`radialGradient`. +- **Images**: `DCTDecode` → `` JPEG pass-through; Flate/LZW raster → PNG + encode; inline images (`BI`/`ID`/`EI`, currently not even tokenized correctly + past `ID`); image masks and SMasks later. +- **SVG residue** (where no 1:1 primitive exists; all at generation time, never + rasterization): mesh/function shadings (types 1, 4–7) → tessellate into small + flat polygons (pdf.js's approach); color spaces + (Separation/DeviceN/Indexed/Lab/ICC) → convert to RGB when emitting (sample + tint transforms, approximate ICC as sRGB, ignore overprint); transparency: + `CA`/`ca` → `opacity`, soft masks → `<mask>`, blend modes → `mix-blend-mode`; + isolated/knockout groups don't map cleanly: punt (rare). +- **Renderer as test oracle, not dependency** (parallels stage 3's OTS gate): + render corpus fixtures with poppler or pdf.js in CI, screenshot our output, + perceptual-diff. + +## Stage 5 — interaction & navigation + +Builds on whatever pages render; needs stage 0 plus destinations from the page +tree, little else. + +- **Links**: URI actions and internal `GoTo` destinations (incl. named) as `<a>` + overlays. 
+- **Annotation appearances**: render `/AP` appearance streams (form XObjects + again) for highlights, stamps, form-field appearances; AcroForm + *interactivity* stays out of scope (read-only). +- **Document outline** (`/Outlines`) → navigation anchors/sidebar. +- **Optional content groups** (layers): honor default visibility; no toggle UI. +- **Metadata** (`/Info`, XMP) into `file_meta()`. +- **Output scaling**: monolithic HTML vs. per-page lazy loading for large + documents (check what odr's HTML service model already provides first). + +## Cross-cutting (any time) + +- Route diagnostics through `Logger` instead of stdout/stderr; drop the leftover + debug code (incl. the `"hi"` marker) in `html/pdf_file.cpp`. +- Grow a corpus: `odr-public` fixtures, the PDF101 "nasty files" collection + linked in `README.md`; assertion-based tests per stage. +- Spec docs offline under `offline/documentation/PDF/` (ISO 32000-1:2008, ISO + 32000-2:2020, Adobe PDF Reference 1.7, with markdown conversions); still to + do: fold them into `README.md` in place of the web links. + +## Other known gaps + +- **Linearized files** are not handled specially (the tail-first read usually + still works, but hint streams are ignored). +- **CMap coverage**: only single-byte `bfchar`; `bfrange`/`codespacerange` + skipped, multi-byte codes unsupported, fonts without `ToUnicode` fall back to + identity bytes (stage 1). +- **Annotations** are collected but their content is not interpreted (stage 5). +- Revisit the reference-by-lookahead parsing and `read_stream(-1)` fallback.
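For the reference-by-lookahead item in the list above, a token-level sketch of what patching `n g R` after the fact can look like. This is an illustration under assumptions, not the current `read_array`/`read_dictionary` code: `Token`, `TokenKind` and `collapse_references` are hypothetical names, and a real revision might instead look ahead in the tokenizer so that a standalone object read also yields a proper reference.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical token model: integers, the `R` keyword, and the assembled
// reference that replaces an `id generation R` triple.
enum class TokenKind { integer, ref_keyword, reference };

struct Token {
  TokenKind kind;
  std::int64_t value = 0;      // integer value, or object id of a reference
  std::int64_t generation = 0; // only meaningful for a reference
};

// Collapse every `int int R` triple into one reference token. An `R` that
// is not preceded by two non-negative integers is kept unchanged.
std::vector<Token> collapse_references(const std::vector<Token> &tokens) {
  std::vector<Token> out;
  out.reserve(tokens.size());
  for (const Token &token : tokens) {
    const std::size_t n = out.size();
    const bool patchable =
        token.kind == TokenKind::ref_keyword && n >= 2 &&
        out[n - 2].kind == TokenKind::integer &&
        out[n - 1].kind == TokenKind::integer &&
        out[n - 2].value >= 0 && out[n - 1].value >= 0;
    if (!patchable) {
      out.push_back(token);
      continue;
    }
    // Copy the two integers before removing them from the output.
    const Token reference{TokenKind::reference, out[n - 2].value,
                          out[n - 1].value};
    out.pop_back();
    out.pop_back();
    out.push_back(reference);
  }
  return out;
}
```

With this shape the array `[12 0 R 7]` becomes one reference followed by the integer `7`, regardless of the container it was read from.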