PDF: cross-reference recovery and robustness for real-world files#529
Open
andiwand wants to merge 9 commits into
Open
PDF: cross-reference recovery and robustness for real-world files#529andiwand wants to merge 9 commits into
andiwand wants to merge 9 commits into
Conversation
f0d04b0 to
be15b48
Compare
27a7f29 to
e526e23
Compare
andiwand
commented
Jun 14, 2026
e526e23 to
b92b2d8
Compare
`read_unsigned_integer` now reports a missing number rather than silently returning 0: it delegates to a new `read_unsigned_integer_and_count`, which throws when no digit is present. `peek_unsigned_integer` lets callers test for a leading digit without consuming. `read_integer_or_real` is reworked to parse the sign once for the whole number and to treat both the integer and fractional parts as optional. This fixes negative reals (`-1.5` was `-0.5`) and sign-led fractions (`-.5`), and counts fractional digits directly instead of diffing stream positions (`tellg` returns -1 once the digits run to eof, which mis-scaled the fraction). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When the cross-reference table/stream cannot be parsed, scan the whole file for `n g obj` headers to rebuild the xref, recover the trailer, index object streams, and fall back to a catalog scan when the trailer has no /Root. The header scan reuses ObjectParser (via the new `peek_unsigned_integer` / `read_unsigned_integer`) over a zero-copy `util::stream::ViewStreamBuf` so each candidate line is parsed with the same number grammar as the rest of the parser rather than a hand-rolled scanner. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The `ri` operator takes a name, not a number; store color_rendering_intent as a string read via as_name() instead of as_real(). Narrow the odr-engine skip list in the html_meta test from all of odr-private down to the two files that still throw (Core_v5.1.pdf, invalid_unicode_issue477 .pdf), so the in-house PDF path is exercised against the rest of the corpus. Note: two newly-exercised files (odr-private/pdf/pdf.pdf "key not found" and with_form.pdf "expected stream") still fail and are left visible for follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
b92b2d8 to
2b62dae
Compare
`ltrim`/`rtrim`/`trim` now return a trimmed copy, with `*_inplace` companions that mutate. Update the one mutating caller (ooxml `read_pct_attribute`) to the in-place form. This gives the PDF parser a value-returning `trim` for keyword comparisons. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Four crashes on valid PDFs, each from an over-strict assumption:
- parse_page assumed every /Annots entry is an indirect reference; inline
annotation dictionaries are equally valid (12.5.2).
- parse_page assumed a /Contents *reference* points straight at a stream, but
it may reference an array of content streams (7.7.3.3). Resolve the reference
first and flatten an array.
- read_indirect_object / read_stream compared the whole line to "endobj" /
"stream", which fails on a trailing space ("endobj \n") or CRLF; compare the
trimmed keyword instead.
- the graphics operator parser tokenized inline image data: after `ID` the raw
bytes were read as operators (corrupting the parse and losing the text font,
then crashing on the font lookup). Skip from `ID` to the `EI` terminator.
With these, Core_v5.1.pdf, invalid_unicode_issue477.pdf, pdf.pdf and
with_form.pdf render, so they are removed from the html_output skip list.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ed17ae6279
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Generalize the PDF parser's trim_line into reusable trim_view/ltrim_view/ rtrim_view utilities that return a string_view subrange of the input, so callers needing the leading offset can recover it via the data-pointer difference. A CharPredicate parameter (defaulting to is_ascii_space) lets callers supply their own whitespace set; the owning trims now delegate to the views and accept the same predicate, and the PDF recovery path passes ObjectParser::is_whitespace. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A compact object that inlines its dictionary and the `stream` token on one line (`N G obj<<...>>stream`) was matched and recorded by the forward scan, which then continued straight into the stream data without skipping it. Any object-shaped bytes in that body (e.g. a line starting `1 0 obj`) were then treated as candidate headers and could overwrite a real recovered entry. Detect the `stream` keyword on a word boundary (which also subsumes the bare `stream` line) and skip the body to `endstream` in this case too. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a byte-level `skip_past(marker)` that advances the cursor just past the next occurrence of a marker (KMP-based, so it stays correct for markers with internal repetition and may straddle line breaks). Use it in two places that previously hand-rolled the scan: - xref recovery skips a stream body by scanning past `endstream` directly, replacing the line-oriented read_line/find loop and its eofbit bookkeeping. - inline-image data skipping now scans for `EI` and only checks the trailing boundary afterwards, dropping the bespoke char-by-char state machine. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Completes the last open piece of stage 0 (file-format compatibility) for the in-house PDF parser: last-resort cross-reference recovery for broken files, plus a batch of parsing/robustness fixes that let several real-world PDFs render. Built on top of #526 (the AGENTS.md consolidation), so review that first / merge it before this.
1. ObjectParser parsing fixes
read_unsigned_integerreports a missing number instead of silently returning0(viaread_unsigned_integer_and_count, which throws when no digit is present).peek_unsigned_integertests for a leading digit without consuming.read_integer_or_realparses the sign once for the whole number and treats both the integer and fractional parts as optional. Fixes negative reals (-1.5was-0.5) and sign-led fractions (-.5), and counts fractional digits directly instead of diffing stream positions (tellgreturns-1at eof, which mis-scaled the fraction).2. Cross-reference recovery
When the trailer-chain walk throws (missing/garbage
startxref, brokenPrevchain) or the document fails to build (no/Root, wrong offsets), the file is forward-scanned forn g objstarts and a synthetic xref is rebuilt (last definition wins).recover_xref()— forward scan recording object positions, skippingstream … endstreambodies, collectingtrailerdictionaries for/Root,/Encrypt,/ID. The header match reuses ObjectParser over a zero-copyutil::stream::ViewStreamBuf.index_object_streams()— indexes recovered/Type /ObjStmmembers as compressed entries (a direct entry wins).recover_root()— when no trailer supplies/Root, scans recovered objects for/Type /Catalog.m_recoveredguard). Six inline broken-mini-PDF tests, plus the realorder-EK52VKL0.pdfHTTP-response fixture.3. Render more real-world files
/rirendering intent is a name, not a number (color_rendering_intentnow stored viaas_name()). Then four crashes on valid PDFs, each an over-strict assumption:/Annotsentries may be inline annotation dictionaries, not only indirect references (12.5.2)./Contentsreference may point at an array of content streams, not only a single stream (7.7.3.3) — resolve then flatten.endobj/stream/endstreamkeyword matching tolerated no trailing space (endobj \n) or CRLF — compare the trimmed keyword.BI … ID <bytes> EI) was tokenized as operators, corrupting the parse and losing the text font; the bytes fromIDtoEIare now skipped.Supporting util change:
string::trim/ltrim/rtrimreturn a trimmed copy, with*_inplacecompanions that mutate.With these, the html_output odr-engine skip list for PDFs is emptied:
Core_v5.1.pdf,invalid_unicode_issue477.pdf,pdf.pdfandwith_form.pdfall render.Docs
pdf/AGENTS.md: records recovery under What works / Tests / Scope today and the Fail early note; collapses the finished stage 0 to a summary.🤖 Generated with Claude Code