Long-doc (PageIndex) images are extracted but never surface in the rendered wiki #166

Description

@designcomputer

Problem

For long documents that route through PageIndex (doc_type: pageindex), images are correctly extracted to wiki/sources/images/<doc_name>/ and referenced with correct wiki-relative paths inside wiki/sources/<doc_name>.json (each page object has "images": [{"path": "sources/images/<doc>/pX_imgY.png"}], and the paths are also inlined as ![image](...) in each page's content).

However, tree_renderer.py's render_summary_md() — which builds wiki/summaries/<doc_name>.md, the actual page a user opens in Obsidian — never reads or embeds any of this. Its per-node renderer (_render_nodes_summary) explicitly strips ![]() syntax found in node["text"] (necessary, since PageIndex's own embedded refs point into a private .openkb/files/{doc_id}/images/... cache that doesn't resolve from the wiki), but never re-inserts the correctly-pathed images that live in the page JSON.
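To make the stripping step concrete, here is a minimal sketch of removing Markdown image syntax from a node's text. It is not the actual code in tree_renderer.py; the regex and the function name strip_image_refs are hypothetical and only illustrate the behaviour described above.

```python
import re

# Matches Markdown image syntax such as ![alt](path); alt text may be empty.
IMAGE_MD = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_image_refs(text: str) -> str:
    """Remove every ![...](...) reference from text (hypothetical helper)."""
    return IMAGE_MD.sub("", text)
```

Stripping alone is what leaves the summary image-free: nothing puts the wiki-relative paths from the page JSON back afterwards.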

Net effect: images are on disk, and technically "referenced" in a JSON data file, but invisible everywhere a human actually browses the vault — not in the summary, not in any concept/entity page, not in index.md. wiki/sources/<doc_name>.json isn't rendered as a wiki page by Obsidian (or anything else), so those references are effectively inert.

Reproduction

  1. Run openkb add on a PDF long enough to trigger PageIndex (pageindex_threshold, default 20 pages), with no PAGEINDEX_API_KEY set, so that it falls back to local pymupdf extraction, which does extract images (see images.py:convert_pdf_to_pages).
  2. Open the resulting wiki/summaries/<doc>.md in Obsidian, or run grep '!\[' wiki/summaries/*.md wiki/concepts/*.md wiki/entities/*.md wiki/index.md.
  3. Observe that there are zero image references anywhere, even though wiki/sources/images/<doc>/ contains real extracted files and wiki/sources/<doc>.json references them correctly.
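The check in steps 2 and 3 can also be scripted. The following is a rough sketch, assuming only the wiki layout described in this issue (summaries/, concepts/, entities/, index.md, sources/images/<doc>/); the function names are hypothetical and not part of openkb.

```python
import re
from pathlib import Path

# Matches Markdown image syntax such as ![image](sources/images/doc/p1_img1.png).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def count_image_refs(wiki_root):
    """Return {wiki-relative .md path: number of image references} for rendered pages."""
    wiki = Path(wiki_root)
    rendered = [wiki / "index.md"]
    for sub in ("summaries", "concepts", "entities"):
        folder = wiki / sub
        if folder.is_dir():
            rendered.extend(sorted(folder.glob("*.md")))
    counts = {}
    for md in rendered:
        if md.is_file():
            text = md.read_text(encoding="utf-8")
            counts[md.relative_to(wiki).as_posix()] = len(IMAGE_REF.findall(text))
    return counts


def count_extracted_images(wiki_root, doc_name):
    """Count the files on disk under sources/images/<doc_name>/."""
    image_dir = Path(wiki_root) / "sources" / "images" / doc_name
    if not image_dir.is_dir():
        return 0
    return sum(1 for p in image_dir.iterdir() if p.is_file())
```

On an affected vault, count_extracted_images returns a non-zero number while every value in count_image_refs is 0.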

Confirmed on openkb 0.4.2 with a 31-page manual (35 images extracted across 21 pages, 0 surfaced in the summary).

Relation to existing issues

Suggested fix

_write_long_doc_artifacts in indexer.py already has the per-page pages list (with images) in scope when it calls render_summary_md; the list just isn't passed through. render_summary_md and _render_nodes_summary could accept that list, build a page_num -> [image paths] map, and embed the images for each node's page range inline. Already-emitted paths should be tracked, the same way duplicate summaries are already collapsed, so that a page split across many sibling nodes doesn't repeat the same figure at each of them.
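A minimal sketch of that idea follows. It is not the patch mentioned below, and several names are assumptions made for illustration: the page-number key "page", the node keys "start_index" and "end_index", and both helper function names. Only the "images": [{"path": ...}] shape comes from the page JSON described above.

```python
def build_page_image_map(pages):
    """Map page number -> ordered list of wiki-relative image paths.

    Assumes each page dict carries its number under "page" (hypothetical key)
    and an optional "images" list of {"path": ...} objects.
    """
    page_images = {}
    for page in pages:
        page_num = page["page"]
        for image in page.get("images", []):
            page_images.setdefault(page_num, []).append(image["path"])
    return page_images


def images_for_node(node, page_images, emitted):
    """Return Markdown image lines for a node's page range, skipping repeats.

    `emitted` is a set shared across all nodes of one document, so a figure on
    a page covered by several sibling nodes is embedded only once.
    "start_index"/"end_index" are assumed, inclusive page bounds.
    """
    lines = []
    for page_num in range(node["start_index"], node["end_index"] + 1):
        for path in page_images.get(page_num, []):
            if path in emitted:
                continue
            emitted.add(path)
            lines.append(f"![image]({path})")
    return lines
```

The renderer would create one emitted set per document, call images_for_node for each node as it walks the tree, and append the returned lines after that node's summary text.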

Happy to share a working patch/diff if useful — implemented and verified this locally against a real ingest (35/35 images now appear in the rendered summary, none duplicated across sibling nodes on the same page).
