Skip to content

PDF: cross-reference recovery and robustness for real-world files#529

Open
andiwand wants to merge 9 commits into
mainfrom
pdf-xref-recovery
Open

PDF: cross-reference recovery and robustness for real-world files#529
andiwand wants to merge 9 commits into
mainfrom
pdf-xref-recovery

Conversation

@andiwand

@andiwand andiwand commented Jun 14, 2026

Copy link
Copy Markdown
Member

Completes the last open piece of stage 0 (file-format compatibility) for the in-house PDF parser: last-resort cross-reference recovery for broken files, plus a batch of parsing/robustness fixes that let several real-world PDFs render. Built on top of #526 (the AGENTS.md consolidation), so review that first / merge it before this.

1. ObjectParser parsing fixes

  • read_unsigned_integer reports a missing number instead of silently returning 0 (via read_unsigned_integer_and_count, which throws when no digit is present). peek_unsigned_integer tests for a leading digit without consuming.
  • read_integer_or_real parses the sign once for the whole number and treats both the integer and fractional parts as optional. Fixes negative reals (-1.5 was -0.5) and sign-led fractions (-.5), and counts fractional digits directly instead of diffing stream positions (tellg returns -1 at eof, which mis-scaled the fraction).

2. Cross-reference recovery

When the trailer-chain walk throws (missing/garbage startxref, broken Prev chain) or the document fails to build (no /Root, wrong offsets), the file is forward-scanned for n g obj starts and a synthetic xref is rebuilt (last definition wins).

  • recover_xref() — forward scan recording object positions, skipping stream … endstream bodies, collecting trailer dictionaries for /Root, /Encrypt, /ID. The header match reuses ObjectParser over a zero-copy util::stream::ViewStreamBuf.
  • index_object_streams() — indexes recovered /Type /ObjStm members as compressed entries (a direct entry wins).
  • recover_root() — when no trailer supplies /Root, scans recovered objects for /Type /Catalog.
  • Runs at most once (m_recovered guard). Six inline broken-mini-PDF tests, plus the real order-EK52VKL0.pdf HTTP-response fixture.

3. Render more real-world files

/ri rendering intent is a name, not a number (color_rendering_intent now stored via as_name()). Then four crashes on valid PDFs, each an over-strict assumption:

  • /Annots entries may be inline annotation dictionaries, not only indirect references (12.5.2).
  • a /Contents reference may point at an array of content streams, not only a single stream (7.7.3.3) — resolve then flatten.
  • endobj/stream/endstream keyword matching tolerated no trailing space (endobj \n) or CRLF — compare the trimmed keyword.
  • inline image data (BI … ID <bytes> EI) was tokenized as operators, corrupting the parse and losing the text font; the bytes from ID to EI are now skipped.

Supporting util change: string::trim/ltrim/rtrim return a trimmed copy, with *_inplace companions that mutate.

With these, the html_output odr-engine skip list for PDFs is emptied: Core_v5.1.pdf, invalid_unicode_issue477.pdf, pdf.pdf and with_form.pdf all render.

Docs

pdf/AGENTS.md: records recovery under What works / Tests / Scope today and the Fail early note; collapses the finished stage 0 to a summary.

🤖 Generated with Claude Code

@andiwand andiwand force-pushed the refactor-agent-md-files branch from f0d04b0 to be15b48 Compare June 14, 2026 09:01
@andiwand andiwand force-pushed the pdf-xref-recovery branch from 27a7f29 to e526e23 Compare June 14, 2026 09:05
Comment thread src/odr/internal/pdf/pdf_document_parser.cpp Outdated
Comment thread src/odr/internal/pdf/pdf_document_parser.cpp Outdated
Base automatically changed from refactor-agent-md-files to main June 14, 2026 09:44
@andiwand andiwand force-pushed the pdf-xref-recovery branch from e526e23 to b92b2d8 Compare June 14, 2026 10:16
andiwand and others added 3 commits June 14, 2026 13:37
`read_unsigned_integer` now reports a missing number rather than silently
returning 0: it delegates to a new `read_unsigned_integer_and_count`, which
throws when no digit is present. `peek_unsigned_integer` lets callers test for
a leading digit without consuming.

`read_integer_or_real` is reworked to parse the sign once for the whole number
and to treat both the integer and fractional parts as optional. This fixes
negative reals (`-1.5` was `-0.5`) and sign-led fractions (`-.5`), and counts
fractional digits directly instead of diffing stream positions (`tellg`
returns -1 once the digits run to eof, which mis-scaled the fraction).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When the cross-reference table/stream cannot be parsed, scan the whole file
for `n g obj` headers to rebuild the xref, recover the trailer, index object
streams, and fall back to a catalog scan when the trailer has no /Root.

The header scan reuses ObjectParser (via the new `peek_unsigned_integer` /
`read_unsigned_integer`) over a zero-copy `util::stream::ViewStreamBuf` so each
candidate line is parsed with the same number grammar as the rest of the
parser rather than a hand-rolled scanner.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The `ri` operator takes a name, not a number; store color_rendering_intent as
a string read via as_name() instead of as_real().

Narrow the odr-engine skip list in the html_meta test from all of odr-private
down to the two files that still throw (Core_v5.1.pdf, invalid_unicode_issue477
.pdf), so the in-house PDF path is exercised against the rest of the corpus.

Note: two newly-exercised files (odr-private/pdf/pdf.pdf "key not found" and
with_form.pdf "expected stream") still fail and are left visible for follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andiwand andiwand force-pushed the pdf-xref-recovery branch from b92b2d8 to 2b62dae Compare June 14, 2026 11:39
andiwand and others added 3 commits June 14, 2026 13:40
`ltrim`/`rtrim`/`trim` now return a trimmed copy, with `*_inplace` companions
that mutate. Update the one mutating caller (ooxml `read_pct_attribute`) to the
in-place form. This gives the PDF parser a value-returning `trim` for keyword
comparisons.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Four crashes on valid PDFs, each from an over-strict assumption:

- parse_page assumed every /Annots entry is an indirect reference; inline
  annotation dictionaries are equally valid (12.5.2).
- parse_page assumed a /Contents *reference* points straight at a stream, but
  it may reference an array of content streams (7.7.3.3). Resolve the reference
  first and flatten an array.
- read_indirect_object / read_stream compared the whole line to "endobj" /
  "stream", which fails on a trailing space ("endobj \n") or CRLF; compare the
  trimmed keyword instead.
- the graphics operator parser tokenized inline image data: after `ID` the raw
  bytes were read as operators (corrupting the parse and losing the text font,
  then crashing on the font lookup). Skip from `ID` to the `EI` terminator.

With these, Core_v5.1.pdf, invalid_unicode_issue477.pdf, pdf.pdf and
with_form.pdf render, so they are removed from the html_output skip list.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andiwand andiwand changed the title PDF: cross-reference recovery for broken files PDF: cross-reference recovery and robustness for real-world files Jun 14, 2026
@andiwand andiwand marked this pull request as ready for review June 14, 2026 12:37

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed17ae6279

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/odr/internal/pdf/pdf_document_parser.cpp Outdated
andiwand and others added 3 commits June 14, 2026 14:55
Generalize the PDF parser's trim_line into reusable trim_view/ltrim_view/
rtrim_view utilities that return a string_view subrange of the input, so
callers needing the leading offset can recover it via the data-pointer
difference. A CharPredicate parameter (defaulting to is_ascii_space) lets
callers supply their own whitespace set; the owning trims now delegate to
the views and accept the same predicate, and the PDF recovery path passes
ObjectParser::is_whitespace.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A compact object that inlines its dictionary and the `stream` token on one
line (`N G obj<<...>>stream`) was matched and recorded by the forward scan,
which then continued straight into the stream data without skipping it. Any
object-shaped bytes in that body (e.g. a line starting `1 0 obj`) were then
treated as candidate headers and could overwrite a real recovered entry.

Detect the `stream` keyword on a word boundary (which also subsumes the bare
`stream` line) and skip the body to `endstream` in this case too.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a byte-level `skip_past(marker)` that advances the cursor just past the
next occurrence of a marker (KMP-based, so it stays correct for markers with
internal repetition and may straddle line breaks).

Use it in two places that previously hand-rolled the scan:
- xref recovery skips a stream body by scanning past `endstream` directly,
  replacing the line-oriented read_line/find loop and its eofbit bookkeeping.
- inline-image data skipping now scans for `EI` and only checks the trailing
  boundary afterwards, dropping the bespoke char-by-char state machine.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andiwand andiwand enabled auto-merge (squash) June 14, 2026 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant