Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 6 additions & 6 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -78,17 +78,17 @@ Two pieces per engine:

## Build & test

A configured build dir already exists (`cmake-build-debug`, also `…-release`,
`…-relwithdebinfo`). Typical loop:
A configured build dir already exists (`cmake-build-relwithdebinfo`, also `…-debug`,
`…-release`). Typical loop:

```bash
# library
cmake --build cmake-build-debug --target odr
cmake --build cmake-build-relwithdebinfo --target odr
# tests (the ODR_TEST option is on in this build dir)
cmake --build cmake-build-debug --target odr_test
./cmake-build-debug/test/odr_test --gtest_filter='OldMs.*'
cmake --build cmake-build-relwithdebinfo --target odr_test
./cmake-build-relwithdebinfo/test/odr_test --gtest_filter='OldMs.*'
# CLI (renders a file to a directory of HTML)
cmake --build cmake-build-debug --target translate
cmake --build cmake-build-relwithdebinfo --target translate
```

Notable CMake options (`CMakeLists.txt`): `ODR_TEST`, `ODR_CLI`,
Expand Down
3 changes: 2 additions & 1 deletion src/odr/internal/html/pdf_file.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
#include <odr/html.hpp>

#include <odr/internal/abstract/file.hpp>
#include <odr/internal/html/common.hpp>
#include <odr/internal/html/html_service.hpp>
#include <odr/internal/html/html_writer.hpp>
#include <odr/internal/pdf/pdf_document.hpp>
Expand Down Expand Up @@ -143,7 +144,7 @@ class HtmlServiceImpl final : public HtmlService {
o << "bottom:" << offset[1] / 72.0 << "in;";
o << "font-size:" << size << "pt;";
}));
out.write_raw(unicode);
out.write_raw(escape_text(unicode));
out.write_element_end("span");
} else if (op.type ==
pdf::GraphicsOperatorType::show_text_manual_spacing) {
Expand Down
2 changes: 1 addition & 1 deletion src/odr/internal/ooxml/ooxml_util.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,7 @@ ooxml::read_pct_attribute(const pugi::xml_attribute attribute) {
// potentially this should be moved to a table parser

std::string val = attribute.value();
util::string::trim(val);
util::string::trim_inplace(val);

if (val.find('%') != std::string::npos) {
util::string::replace_all(val, "%", "");
Expand Down
69 changes: 40 additions & 29 deletions src/odr/internal/pdf/AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,8 @@ module (poppler / pdf2htmlEX, behind `ODR_WITH_PDF2HTMLEX`) is the
production-quality alternative engine.

**Scope today.** Parse the PDF object/file structure (classic cross-reference
tables, cross-reference streams, object streams, hybrid files), build the page
tables, cross-reference streams, object streams, hybrid files, with a
forward-scan recovery path for broken cross-references), build the page
tree with fonts and annotations, tokenize page content streams into graphics
operators, and emit a **proof-of-concept HTML rendering**: absolutely positioned
text spans per `Tj`, pages sized from `MediaBox`. Encrypted files are decrypted
Expand Down Expand Up @@ -42,6 +43,15 @@ not production-quality — the HTML path still contains debug `std::cout` output
Lenient where the wild demands: `/Type /XRef` only warns, references to free
or absent objects resolve to null with a `Logger` warning, `n g obj` need not
end with a newline.
- **Cross-reference recovery**: when the trailer-chain walk throws (missing or
garbage `startxref`, a broken `Prev` chain) or the document fails to build
(no `/Root`, offsets pointing at the wrong objects), the whole file is
forward-scanned for `n g obj` starts, rebuilding a synthetic xref (last
definition of an id wins). `trailer` dictionaries are collected for `/Root`,
`/Encrypt`, `/ID`; recovered `/Type /ObjStm` members are indexed as
compressed entries; and, when no trailer supplied a `/Root`, a `/Type
/Catalog` object is searched. Handles e.g. an HTTP response saved as `.pdf`
(every offset shifted by the header).
- **Page tree**: `Catalog` → `Pages` (recursive) → `Page` with per-page
`Resources` (fonts only) and `Annots` (raw dictionary only). Objects cached by
reference (`DocumentParser::m_objects`).
Expand Down Expand Up @@ -186,7 +196,9 @@ surprises **throw** `std::runtime_error` (missing `obj`/`endobj`/`stream`/
operators ignored by `execute`, annotations keep their raw dictionary, CMap
`codespacerange`/`bfrange` parsed past without effect. References to free/absent
objects resolve to null with a warning; unknown xref-stream entry types treated
as absent (7.5.8.3).
as absent (7.5.8.3). A structural throw in the cross-reference layer is not
fatal, though: it is caught once and the file is forward-scanned to rebuild the
table (*Cross-reference recovery* above) before giving up.

**Debug output still in place.** `html/pdf_file.cpp`, `pdf_graphics_state.cpp`,
`pdf_graphics_operator_parser.cpp` and `pdf_cmap_parser.cpp` print diagnostics
Expand Down Expand Up @@ -222,11 +234,16 @@ and routes its warnings through it — new diagnostics should do the same.
variants), plus inherited-page-attribute coverage (a multi-level `Pages` tree:
per-page resolved `MediaBox`/`CropBox`/`Rotate`/`Resources`, override vs.
inheritance, the `CropBox` ← `MediaBox` default, the missing-`MediaBox`
US-Letter lenience). End-to-end: the classic fixture
US-Letter lenience), plus cross-reference-recovery coverage (inline broken
mini-PDFs: garbage prepended, a bad `startxref`, no trailer at all → catalog
scan, a duplicate id → last definition wins, a page tree living in an object
stream). End-to-end: the classic fixture
`odr-public/pdf/style-various-1.pdf`, plus decryption of
`odr-public/pdf/Casio_WVA-M650-7AJF.pdf` (RC4, empty password) and
`odr-private/pdf/encrypted_fontfile3_opentype.pdf` (AES-256; skipped when the
private submodule is absent). The `odr-private` xref-stream/objstm/hybrid
private submodule is absent), and recovery of the real
`odr-private/pdf/order-EK52VKL0.pdf` (an HTTP response saved as `.pdf`;
likewise skipped when absent). The `odr-private` xref-stream/objstm/hybrid
fixtures (`basic_text.pdf`, `geneve_1564.pdf`, `test_fail.pdf`, `Kayla….pdf`,
`svg_background…issue402.pdf`, `Core_v5.1.pdf`, `onepage.pdf`) were verified
manually but are not pinned in unit tests. Also still contains the original
Expand All @@ -248,31 +265,16 @@ are ordered by what they unlock; 0–2 are roughly sequential, 3 and 4 are
independent, 5 builds on whatever pages already render. Each stage gets its own
detailed design before implementation.

## Stage 0 — file-format compatibility (prerequisite) — **mostly done**

Modern producers write PDF 1.5+ structures the original parser rejected.
Cross-reference/object streams + hybrid files, the filter framework (incl. PNG
predictors), inherited page attributes, and encryption (RC4 / AES-128 / AES-256)
are **all implemented** (see *What works*). The one remaining piece:

**Xref recovery for broken files** (post-stage-0; the WP2 code left room):
- Trigger: any structural throw during xref-chain walking or a failed object
lookup (`startxref` missing/garbage, offsets wrong).
- Recovery: a single forward scan for `n g obj` line starts (the existing
sequential `read_entry` machinery is most of this), building a synthetic
`Xref` (last definition of an id wins), collecting `trailer` dicts and
`/Type /Catalog` objects as `Root` candidates; objstm members indexed by
scanning recovered object streams.
- Tests fit inline strings well: the scan ignores xref offsets, so a broken
mini-PDF needs no offset bookkeeping — write a literal with a garbage
`startxref`, duplicate ids, or a missing trailer, and assert what got rebuilt.
Real-world fixture: `odr-private/pdf/order-EK52VKL0.pdf` — an HTTP response
accidentally saved as `.pdf` (starts with `HTTP/1.0 200 OK`).

Remaining encryption edge cases (deferred until a real file needs them):
per-stream `/Crypt` filter `Name` overrides, the `EncryptMetadata false`
metadata-stream `Identity` special case, and `Perms` (Algorithm 13) validation;
the public-key security handler and R 5 are out of scope.
## Stage 0 — file-format compatibility (prerequisite) — **done**

The prerequisite for everything below: read the structures modern producers
write that the original parser rejected, so a real-world `.pdf` reaches the page
tree at all. All of it has landed (see *What works*): the stream-filter
framework (incl. PNG predictors), PDF 1.5+ cross-reference/object streams and
hybrid files, inherited page attributes, encryption (RC4 / AES-128 / AES-256),
and last-resort cross-reference recovery for broken files. Remaining odds and
ends are folded into *Other known gaps* below; the staged renderer work now
builds on a parser that opens the common corpus.

## Stage 1 — text extraction: the code → Unicode chain

Expand Down Expand Up @@ -441,6 +443,15 @@ tree, little else.

## Other known gaps

- **Encryption edge cases** (deferred from stage 0 until a real file needs
them): per-stream `/Crypt` filter `Name` overrides, the `EncryptMetadata
false` metadata-stream `Identity` special case, and `Perms` (Algorithm 13)
validation. The public-key security handler and revision 5 are out of scope.
- **Recovery limitations** (deferred from stage 0): when several `/Type
/Catalog` objects survive, the first in id order is picked rather than the
newest; the constructor-triggered recovery path cannot decode object streams
in an *encrypted* broken file (no decryptor yet), so such members go
unindexed. Both are edge cases beyond the corpus seen so far.
- **Linearized files** are not handled specially (the tail-first read usually
still works, but hint streams are ignored).
- **CMap coverage**: only single-byte `bfchar`; `bfrange`/`codespacerange`
Expand Down
Loading
Loading