Open
Labels
upstream bug (bug outside this package)
Description of the bug
Calling doc.rewrite_images() on a PDF in which the same image xref is referenced from many pages causes a segmentation fault, apparently from a buffer overflow in pdf_rewrite_images, the underlying MuPDF C function.
The PDF attached has ~99 total image references across 39 pages, with a single image xref being reused (shared) on multiple pages. This appears to overflow an internal MuPDF buffer. The crash is deterministic and reproducible.
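To check whether another PDF matches this description, the image references can be tallied per xref. The following is a minimal sketch and not part of the original report: `summarize_image_refs` and `image_xrefs_per_page` are hypothetical helper names. It relies only on `page.get_images(full=True)`, whose first tuple item is the image xref (the same assumption the workaround in this report makes).

```python
from collections import Counter


def summarize_image_refs(pages_xrefs):
    """Tally image references from a per-page list of image xrefs.

    Returns (total_refs, shared), where shared maps each xref that appears
    on more than one page to the number of pages referencing it.
    """
    total = 0
    pages_per_xref = Counter()
    for xrefs in pages_xrefs:
        total += len(xrefs)
        # Count each xref once per page, even if it is listed several times.
        pages_per_xref.update(set(xrefs))
    shared = {x: n for x, n in pages_per_xref.items() if n > 1}
    return total, shared


def image_xrefs_per_page(path):
    """Hypothetical helper: list the image xrefs referenced by each page."""
    import pymupdf  # imported lazily so the tally above works without it

    doc = pymupdf.open(path)
    try:
        return [[img[0] for img in page.get_images(full=True)] for page in doc]
    finally:
        doc.close()
```

For example, `summarize_image_refs(image_xrefs_per_page("shared_image_xref.pdf"))` would return the total reference count together with the xrefs that are shared between pages.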
How to reproduce the bug
import pymupdf
# Open any PDF where the same image xref is shared across many pages
# (e.g., a logo or watermark repeated on every page)
# The test PDF has ~99 image references across 39 pages.
doc = pymupdf.open("shared_image_xref.pdf")
# This segfaults:
doc.rewrite_images(dpi_threshold=150, dpi_target=100, quality=50)
Expected behavior: Images are rewritten/compressed without crashing.
Actual behavior: Segmentation fault (SIGSEGV) / memory corruption.
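Because a segmentation fault terminates the interpreter, it cannot be caught with try/except in the calling process. One way to observe the crash without losing the parent process is to run the call in a child interpreter and inspect its return code. This is a sketch under that assumption; `run_isolated` and `REPRO` are hypothetical names, and the file name is the one used in the reproduction above.

```python
import subprocess
import sys


def run_isolated(code, timeout=120):
    """Run Python source in a child interpreter and report how it ended.

    Returns (ok, returncode). On POSIX, a negative returncode -N means the
    child was terminated by signal N (for example, -11 for SIGSEGV).
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        timeout=timeout,
    )
    return proc.returncode == 0, proc.returncode


# Source for the child process: the reproduction steps from this report.
REPRO = """
import pymupdf
doc = pymupdf.open("shared_image_xref.pdf")
doc.rewrite_images(dpi_threshold=150, dpi_target=100, quality=50)
"""
```

Calling `run_isolated(REPRO)` returns `(False, <negative code>)` on POSIX if the child dies from a signal, and the parent process keeps running either way.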
Workaround
Currently I bypass doc.rewrite_images() entirely and implement image rewriting per xref using lower-level PyMuPDF APIs, though this is probably not ideal:
import sys
import math
import pymupdf


def safe_rewrite_images(doc, dpi_target=None, dpi_threshold=None, quality=None, set_to_gray=False):
    """Workaround for segfault in doc.rewrite_images() with shared image xrefs."""
    if not (dpi_target or quality is not None or set_to_gray):
        return

    # Collect unique image xrefs and their smask info
    xref_info = {}
    for page in doc:
        for img in page.get_images(full=True):
            xref, smask = img[0], img[1]
            if xref > 0:
                xref_info.setdefault(xref, {"smask": smask, "min_dpi": float("inf")})

    # Calculate effective DPI for each xref across all page usages
    for page in doc:
        for info in page.get_image_info(hashes=False, xrefs=True):
            xref = info.get("xref", 0)
            if xref not in xref_info:
                continue
            bbox = info.get("bbox")
            w, h = info.get("width", 0), info.get("height", 0)
            if bbox and w > 0 and h > 0:
                disp_w = abs(bbox[2] - bbox[0])
                disp_h = abs(bbox[3] - bbox[1])
                if disp_w > 0 and disp_h > 0:
                    dpi = min(w / disp_w * 72, h / disp_h * 72)
                    if dpi < xref_info[xref]["min_dpi"]:
                        xref_info[xref]["min_dpi"] = dpi

    effective_threshold = max(dpi_threshold or 0, (dpi_target or 0) + 10) if dpi_target else None

    # Rewrite each image xref individually
    for xref, meta in xref_info.items():
        min_dpi = meta["min_dpi"]
        smask_xref = meta["smask"]
        needs_downscale = bool(
            dpi_target and effective_threshold
            and min_dpi != float("inf")
            and min_dpi > effective_threshold
        )
        if not needs_downscale and quality is None and not set_to_gray:
            continue
        try:
            pix = pymupdf.Pixmap(doc, xref)
            if set_to_gray and pix.colorspace and pix.colorspace.n > 1:
                pix = pymupdf.Pixmap(pymupdf.csGRAY, pix)
            elif pix.alpha:
                pix = pymupdf.Pixmap(pix.colorspace or pymupdf.csRGB, pix)
            if needs_downscale:
                ratio = min_dpi / dpi_target
                shrink_n = max(0, min(7, int(math.log2(ratio))))
                if shrink_n > 0:
                    pix.shrink(shrink_n)
            q = quality if quality is not None else 85
            jpeg_bytes = pix.tobytes("jpeg", jpg_quality=q)
            cs_name = "/DeviceGray" if pix.colorspace and pix.colorspace.n == 1 else "/DeviceRGB"
            smask_entry = f"/SMask {smask_xref} 0 R " if smask_xref else ""
            new_obj = (
                f"<</Type /XObject /Subtype /Image /BitsPerComponent 8"
                f" /ColorSpace {cs_name} /Filter /DCTDecode"
                f" /Height {pix.height} /Width {pix.width}"
                f" {smask_entry}>>"
            )
            doc.update_object(xref, new_obj)
            doc.update_stream(xref, jpeg_bytes, compress=0)
            pix = None
        except Exception as e:
            sys.stderr.write(f"[pymupdf] safe_rewrite_images xref {xref}: {e}\n")
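The DPI and shrink arithmetic in the workaround can be checked in isolation. The sketch below pulls the two formulas out into standalone functions; the names `effective_dpi` and `shrink_exponent` and the sample image dimensions are illustrative assumptions, not part of the report. Because the exponent is rounded down and shrinking only happens in powers of two, the result can stay above dpi_target.

```python
import math


def effective_dpi(px_w, px_h, bbox):
    """Lower of the horizontal and vertical DPI of an image shown in bbox.

    bbox is (x0, y0, x1, y1) in PDF points (1 point = 1/72 inch).
    """
    disp_w = abs(bbox[2] - bbox[0])
    disp_h = abs(bbox[3] - bbox[1])
    return min(px_w / disp_w * 72, px_h / disp_h * 72)


def shrink_exponent(min_dpi, dpi_target):
    """Exponent n for Pixmap.shrink(n), which divides each side by 2**n."""
    ratio = min_dpi / dpi_target
    # Same clamping as the workaround: round down, keep within 0..7.
    return max(0, min(7, int(math.log2(ratio))))


# Illustrative numbers: a 1152 x 864 px image placed in a 288 x 216 pt box.
dpi = effective_dpi(1152, 864, (36, 36, 324, 252))
n = shrink_exponent(dpi, 100)
print(dpi, n, dpi / 2**n)
```

With these numbers the image is displayed at 288 DPI, the exponent is 1, and the shrunken image ends up at 144 DPI rather than the requested 100.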
PDF used:
PyMuPDF version
1.27.1
Operating system
macOS
Python version
3.14