PDF: cross-reference recovery and robustness for real-world files by andiwand · Pull Request #529 · opendocument-app/OpenDocument.core

andiwand · 2026-06-14T08:57:44Z

Completes the last open piece of stage 0 (file-format compatibility) for the in-house PDF parser: last-resort cross-reference recovery for broken files, plus a batch of parsing/robustness fixes that let several real-world PDFs render. Built on top of #526 (the AGENTS.md consolidation), so review that first / merge it before this.

1. ObjectParser parsing fixes

read_unsigned_integer reports a missing number instead of silently returning 0 (via read_unsigned_integer_and_count, which throws when no digit is present). peek_unsigned_integer tests for a leading digit without consuming.
read_integer_or_real parses the sign once for the whole number and treats both the integer and fractional parts as optional. Fixes negative reals (-1.5 was -0.5) and sign-led fractions (-.5), and counts fractional digits directly instead of diffing stream positions (tellg returns -1 at eof, which mis-scaled the fraction).

2. Cross-reference recovery

When the trailer-chain walk throws (missing/garbage startxref, broken Prev chain) or the document fails to build (no /Root, wrong offsets), the file is forward-scanned for n g obj starts and a synthetic xref is rebuilt (last definition wins).

recover_xref() — forward scan recording object positions, skipping stream … endstream bodies, collecting trailer dictionaries for /Root, /Encrypt, /ID. The header match reuses ObjectParser over a zero-copy util::stream::ViewStreamBuf.
index_object_streams() — indexes recovered /Type /ObjStm members as compressed entries (a direct entry wins).
recover_root() — when no trailer supplies /Root, scans recovered objects for /Type /Catalog.
Runs at most once (m_recovered guard). Six inline broken-mini-PDF tests, plus the real order-EK52VKL0.pdf HTTP-response fixture.

3. Render more real-world files

/ri rendering intent is a name, not a number (color_rendering_intent now stored via as_name()). Then four crashes on valid PDFs, each an over-strict assumption:

/Annots entries may be inline annotation dictionaries, not only indirect references (12.5.2).
a /Contents reference may point at an array of content streams, not only a single stream (7.7.3.3) — resolve then flatten.
endobj/stream/endstream keyword matching tolerated no trailing space (endobj \n) or CRLF — compare the trimmed keyword.
inline image data (BI … ID <bytes> EI) was tokenized as operators, corrupting the parse and losing the text font; the bytes from ID to EI are now skipped.

Supporting util change: string::trim/ltrim/rtrim return a trimmed copy, with *_inplace companions that mutate.

With these, the html_output odr-engine skip list for PDFs is emptied: Core_v5.1.pdf, invalid_unicode_issue477.pdf, pdf.pdf and with_form.pdf all render.

Docs

pdf/AGENTS.md: records recovery under What works / Tests / Scope today and the Fail early note; collapses the finished stage 0 to a summary.

🤖 Generated with Claude Code

`read_unsigned_integer` now reports a missing number rather than silently returning 0: it delegates to a new `read_unsigned_integer_and_count`, which throws when no digit is present. `peek_unsigned_integer` lets callers test for a leading digit without consuming. `read_integer_or_real` is reworked to parse the sign once for the whole number and to treat both the integer and fractional parts as optional. This fixes negative reals (`-1.5` was `-0.5`) and sign-led fractions (`-.5`), and counts fractional digits directly instead of diffing stream positions (`tellg` returns -1 once the digits run to eof, which mis-scaled the fraction). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

When the cross-reference table/stream cannot be parsed, scan the whole file for `n g obj` headers to rebuild the xref, recover the trailer, index object streams, and fall back to a catalog scan when the trailer has no /Root. The header scan reuses ObjectParser (via the new `peek_unsigned_integer` / `read_unsigned_integer`) over a zero-copy `util::stream::ViewStreamBuf` so each candidate line is parsed with the same number grammar as the rest of the parser rather than a hand-rolled scanner. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The `ri` operator takes a name, not a number; store color_rendering_intent as a string read via as_name() instead of as_real(). Narrow the odr-engine skip list in the html_meta test from all of odr-private down to the two files that still throw (Core_v5.1.pdf, invalid_unicode_issue477 .pdf), so the in-house PDF path is exercised against the rest of the corpus. Note: two newly-exercised files (odr-private/pdf/pdf.pdf "key not found" and with_form.pdf "expected stream") still fail and are left visible for follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

`ltrim`/`rtrim`/`trim` now return a trimmed copy, with `*_inplace` companions that mutate. Update the one mutating caller (ooxml `read_pct_attribute`) to the in-place form. This gives the PDF parser a value-returning `trim` for keyword comparisons. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Four crashes on valid PDFs, each from an over-strict assumption: - parse_page assumed every /Annots entry is an indirect reference; inline annotation dictionaries are equally valid (12.5.2). - parse_page assumed a /Contents *reference* points straight at a stream, but it may reference an array of content streams (7.7.3.3). Resolve the reference first and flatten an array. - read_indirect_object / read_stream compared the whole line to "endobj" / "stream", which fails on a trailing space ("endobj \n") or CRLF; compare the trimmed keyword instead. - the graphics operator parser tokenized inline image data: after `ID` the raw bytes were read as operators (corrupting the parse and losing the text font, then crashing on the font lookup). Skip from `ID` to the `EI` terminator. With these, Core_v5.1.pdf, invalid_unicode_issue477.pdf, pdf.pdf and with_form.pdf render, so they are removed from the html_output skip list. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed17ae6279

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Generalize the PDF parser's trim_line into reusable trim_view/ltrim_view/ rtrim_view utilities that return a string_view subrange of the input, so callers needing the leading offset can recover it via the data-pointer difference. A CharPredicate parameter (defaulting to is_ascii_space) lets callers supply their own whitespace set; the owning trims now delegate to the views and accept the same predicate, and the PDF recovery path passes ObjectParser::is_whitespace. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A compact object that inlines its dictionary and the `stream` token on one line (`N G obj<<...>>stream`) was matched and recorded by the forward scan, which then continued straight into the stream data without skipping it. Any object-shaped bytes in that body (e.g. a line starting `1 0 obj`) were then treated as candidate headers and could overwrite a real recovered entry. Detect the `stream` keyword on a word boundary (which also subsumes the bare `stream` line) and skip the body to `endstream` in this case too. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add a byte-level `skip_past(marker)` that advances the cursor just past the next occurrence of a marker (KMP-based, so it stays correct for markers with internal repetition and may straddle line breaks). Use it in two places that previously hand-rolled the scan: - xref recovery skips a stream body by scanning past `endstream` directly, replacing the line-oriented read_line/find loop and its eofbit bookkeeping. - inline-image data skipping now scans for `EI` and only checks the trailing boundary afterwards, dropping the bespoke char-by-char state machine. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

andiwand force-pushed the refactor-agent-md-files branch from f0d04b0 to be15b48 Compare June 14, 2026 09:01

andiwand force-pushed the pdf-xref-recovery branch from 27a7f29 to e526e23 Compare June 14, 2026 09:05

andiwand commented Jun 14, 2026

View reviewed changes

Comment thread src/odr/internal/pdf/pdf_document_parser.cpp Outdated

Comment thread src/odr/internal/pdf/pdf_document_parser.cpp Outdated

Base automatically changed from refactor-agent-md-files to main June 14, 2026 09:44

andiwand force-pushed the pdf-xref-recovery branch from e526e23 to b92b2d8 Compare June 14, 2026 10:16

andiwand and others added 3 commits June 14, 2026 13:37

andiwand force-pushed the pdf-xref-recovery branch from b92b2d8 to 2b62dae Compare June 14, 2026 11:39

andiwand and others added 3 commits June 14, 2026 13:40

update agents md to hopefully prefer relwithdebinfo

5d2450a

andiwand changed the title ~~PDF: cross-reference recovery for broken files~~ PDF: cross-reference recovery and robustness for real-world files Jun 14, 2026

andiwand marked this pull request as ready for review June 14, 2026 12:37

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

Comment thread src/odr/internal/pdf/pdf_document_parser.cpp Outdated

andiwand and others added 3 commits June 14, 2026 14:55

andiwand enabled auto-merge (squash) June 14, 2026 13:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PDF: cross-reference recovery and robustness for real-world files#529

PDF: cross-reference recovery and robustness for real-world files#529
andiwand wants to merge 9 commits into
mainfrom
pdf-xref-recovery

andiwand commented Jun 14, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andiwand commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. ObjectParser parsing fixes

2. Cross-reference recovery

3. Render more real-world files

Docs

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

andiwand commented Jun 14, 2026 •

edited

Loading