FEAT Add safe_extract_zip helper for remote dataset loaders by francose · Pull Request #1957 · microsoft/PyRIT

francose · 2026-06-08T21:25:03Z

Closes #1879.

so the gap is that the 3 remote dataset loaders (figstep, vlguard, jailbreakv-28k) all call zipfile.ZipFile.extractall() straight on whatever upstream zip they downloaded. extractall doesn't check member paths or sizes or entry types, so if any of those upstream zips ever get tampered with you get classic Zip Slip — single entry named ../../etc/cron.d/x and you've landed a cron job on whoever ran the loader. Symlink entries do the same thing without even needing ...

Adds a new helper pyrit/common/safe_extract.py and swaps the 3 call sites. The helper validates every member's metadata BEFORE writing anything to disk: if any member fails a check, nothing gets written. This matters because otherwise a malicious zip with 99 valid entries + 1 bad one would write 99 files to disk before raising.

6 checks, each kills a different attack class:

realpath stays under destination → Zip Slip (../etc/x)
reject symlink / block / char / fifo / socket entries → symlink path escape
per-file size cap (default 1 GiB) → single-entry decompression DoS
total uncompressed size cap (default 5 GiB) → multi-file zip bomb
compression ratio cap (default 100×) → ratio bomb
file count cap (default 50,000) → inode / hash-table DoS

all caps are kwargs so individual callers can tighten them if they want.
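
For readers following along, the six checks can be sketched roughly as below. This is a minimal illustration, not the code in pyrit/common/safe_extract.py: the names `validate_members` and `extract_validated` are hypothetical, the defaults simply mirror the caps listed above, and edge cases such as drive-letter paths are not handled here.

```python
import io
import os
import stat
import zipfile
from pathlib import Path


class UnsafeArchiveError(Exception):
    """Raised when an archive member fails a safety check."""


def validate_members(members, dest_real, *, max_total_size=5 * 1024**3,
                     max_file_size=1024**3, max_file_count=50_000,
                     max_compression_ratio=100):
    """Inspect metadata only; raise before anything is written to disk."""
    # File count cap.
    if len(members) > max_file_count:
        raise UnsafeArchiveError(f"too many entries: {len(members)}")
    total = 0
    for m in members:
        # Path check: the resolved target must stay under the destination.
        target = Path(os.path.realpath(dest_real / m.filename))
        if not target.is_relative_to(dest_real):
            raise UnsafeArchiveError(f"path escapes destination: {m.filename}")
        # Entry type check: the Unix mode sits in the high 16 bits of external_attr.
        mode = m.external_attr >> 16
        if (stat.S_ISLNK(mode) or stat.S_ISBLK(mode) or stat.S_ISCHR(mode)
                or stat.S_ISFIFO(mode) or stat.S_ISSOCK(mode)):
            raise UnsafeArchiveError(f"disallowed entry type: {m.filename}")
        # Per-file size cap.
        if m.file_size > max_file_size:
            raise UnsafeArchiveError(f"entry too large: {m.filename}")
        # Total uncompressed size cap.
        total += m.file_size
        if total > max_total_size:
            raise UnsafeArchiveError("archive too large when uncompressed")
        # Ratio cap; max(..., 1) avoids dividing by a zero compressed size.
        if m.file_size > max_compression_ratio * max(m.compress_size, 1):
            raise UnsafeArchiveError(f"compression ratio too high: {m.filename}")


def extract_validated(source, dest_dir, **caps):
    """Validate every member first, then extract; returns the resolved path."""
    dest_real = Path(dest_dir).resolve()
    dest_real.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(source) as zf:
        members = zf.infolist()
        validate_members(members, dest_real, **caps)
        for m in members:
            zf.extract(m, dest_real)
    return dest_real
```

Because validation runs over the whole member list before the extraction loop starts, a single bad entry anywhere in the archive means no member is written.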

note remote_dataset_loader.py also has a zipfile.ZipFile() block but it uses zf.open(inner) to stream a known-name file into memory, never writes by member name. different code path, not a Zip Slip site, left it alone.
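
To illustrate why that path is different, here is a minimal sketch of the streaming pattern. The function name `read_member_bytes` is hypothetical and this is not the code in remote_dataset_loader.py. The caller chooses the member name, so names stored in the archive never become filesystem paths. Note that this sketch puts no cap on how much is read into memory.

```python
import io
import zipfile


def read_member_bytes(archive: bytes, inner: str) -> bytes:
    """Return the contents of one named member without extracting to disk."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        # zf.open() streams the member; nothing is written by member name.
        with zf.open(inner) as fh:
            return fh.read()
```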

16 tests in tests/unit/common/test_safe_extract.py covering happy path, dotdot traversal, absolute paths (unix and drive-letter), symlink / block / fifo entries, per-file bomb, total-size bomb, compression ratio bomb, file-count cap, no-partial-write guarantee, both bytes and path sources, dest auto-create, resolved-path return. all green locally.

on scope — Ruslan suggested in the thread that we drop Py 3.10/3.11 and switch to tar (PEP 706 has the data filter built in), which honestly would be cleaner if we could, but Roman flagged 3.11 has plenty of runway left and the upstream datasets aren't ours to convert anyway. app-level defensive extraction is the only fix that covers all supported runtimes today, which is what this does.

romanlutz

Fantastic! Thanks @francose!

romanlutz · 2026-06-10T03:16:42Z

            logger.info(f"Extracting {zip_file_path} to {self.zip_dir}")
-            with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
-                zip_ref.extractall(self.zip_dir)
+            safe_extract_zip(zip_file_path, self.zip_dir)


safe_extract_zip is sync and does real disk I/O (potentially extracting a multi-GB image bundle), but it's being called directly inside async def fetch_dataset_async, which blocks the event loop for the duration of the extraction. The figstep loader added in this PR correctly wraps the call in await asyncio.to_thread(...); this call site (and the matching one in vlguard_dataset.py) should do the same for consistency and to follow the "no blocking I/O in async paths" rule in style-guide.instructions.md (which explicitly calls out zipfile). The pre-existing extractall had the same problem, but since the PR is rewriting these exact lines it's the natural place to fix it.

romanlutz · 2026-06-10T03:16:43Z

            logger.info("Extracting VLGuard test images...")
-            with zipfile.ZipFile(str(zip_path), "r") as zf:
-                zf.extractall(str(cache_dir))
+            safe_extract_zip(zip_path, cache_dir)


Same event-loop concern as the jailbreakv loader: safe_extract_zip is a blocking call but is invoked directly inside _download_dataset_files_async. The download just above is already wrapped via await asyncio.to_thread(_download_sync), so the extraction should follow the same pattern (matching figstep). Suggested fix:

        if zip_path.exists():
            logger.info("Extracting VLGuard test images...")
            await asyncio.to_thread(safe_extract_zip, zip_path, cache_dir)

(or via a small _extract closure if you switch the helper to keyword-only as suggested on safe_extract.py).
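
As a side note, `asyncio.to_thread` forwards keyword arguments to the target function, so a keyword-only helper can be called without a closure. The sketch below uses a hypothetical stand-in, `blocking_extract`, in place of the real helper, and a heartbeat task to show that the event loop keeps running while the blocking call executes in a worker thread.

```python
import asyncio
import time


def blocking_extract(*, source: str, dest_dir: str) -> str:
    """Hypothetical stand-in for a keyword-only, blocking extraction helper."""
    time.sleep(0.05)  # simulates slow disk I/O
    return f"{source} -> {dest_dir}"


async def main() -> tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the loop gets to run during the extraction.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.005)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    # Keyword arguments are passed straight through to blocking_extract.
    result = await asyncio.to_thread(blocking_extract, source="test.zip", dest_dir="cache")
    hb.cancel()
    return result, ticks
```

Running `asyncio.run(main())` returns the helper's result together with a non-zero tick count, which is the behaviour the review comment is asking for.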

romanlutz · 2026-06-10T03:16:43Z

+def safe_extract_zip(
+    source: ZipSource,
+    dest_dir: str | os.PathLike,
+    *,
+    max_total_size: int = DEFAULT_MAX_TOTAL_SIZE,
+    max_file_size: int = DEFAULT_MAX_FILE_SIZE,
+    max_file_count: int = DEFAULT_MAX_FILE_COUNT,
+    max_compression_ratio: int = DEFAULT_MAX_COMPRESSION_RATIO,
+) -> Path:


Per the style guide, functions with more than one parameter should use * after the first arg (or after self/cls) to enforce keyword-only call sites — every other multi-arg helper in pyrit/common/ (get_random_indices, warn_if_set, get_kwarg_param, get_required_value, get_non_required_value) follows this. source and dest_dir are both passed positionally at the three call sites today, which is exactly the readability/typo risk the convention is meant to prevent (safe_extract_zip(zip_file_path, self.zip_dir) reads ambiguously). Suggest:

def safe_extract_zip(
    *,
    source: ZipSource,
    dest_dir: str | os.PathLike,
    max_total_size: int = DEFAULT_MAX_TOTAL_SIZE,
    ...

and updating the three callers accordingly.
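
The effect of the leading `*` can be shown with a small stand-in. `extract_kwonly` is a hypothetical name and its body is a placeholder; only the signature shape matters here.

```python
from pathlib import Path


def extract_kwonly(*, source: str, dest_dir: str) -> Path:
    """Hypothetical stand-in with the suggested keyword-only signature."""
    return Path(dest_dir)


# Named arguments make the roles of the two paths explicit at the call site.
dest = extract_kwonly(source="test.zip", dest_dir="cache")

# A positional call is rejected by Python itself with a TypeError.
try:
    extract_kwonly("test.zip", "cache")
    positional_allowed = True
except TypeError:
    positional_allowed = False
```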

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a defensive ZIP extraction utility and migrates remote dataset loaders to use it, mitigating Zip Slip and zip bomb risks from untrusted archives.

Changes:

Introduce pyrit.common.safe_extract.safe_extract_zip with validation for paths, entry types, size caps, and compression ratio.
Replace zipfile.ZipFile(...).extractall(...) usage in multiple remote dataset loaders with safe_extract_zip.
Add unit tests covering traversal, absolute paths, symlinks/devices/FIFOs, size/count limits, compression ratio bombs, and malformed headers.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/unit/common/test_safe_extract.py | Adds unit tests validating safe ZIP extraction behavior and failure modes. |
| pyrit/common/safe_extract.py | Implements safe ZIP extraction with member validation and caps. |
| pyrit/datasets/seed_datasets/remote/vlguard_dataset.py | Switches VLGuard ZIP extraction to `safe_extract_zip`. |
| pyrit/datasets/seed_datasets/remote/jailbreakv_28k_dataset.py | Switches JailBreakV-28K ZIP extraction to `safe_extract_zip`. |
| pyrit/datasets/seed_datasets/remote/figstep_dataset.py | Switches FigStep ZIP extraction to `safe_extract_zip` in async download flow. |

francose · 2026-06-11T19:32:31Z

+    dest_real = Path(dest_dir).resolve()
+    dest_real.mkdir(parents=True, exist_ok=True)
+
+    with zipfile.ZipFile(source) as zf:
+        members = zf.infolist()
+        try:
+            _validate_members(
+                members,
+                dest_real=dest_real,
+                max_total_size=max_total_size,
+                max_file_size=max_file_size,
+                max_file_count=max_file_count,
+                max_compression_ratio=max_compression_ratio,
+            )
+        except UnsafeArchiveError as exc:
+            logger.warning("safe_extract_zip rejected archive: %s", exc)
+            raise
+        for m in members:
+            zf.extract(m, dest_real)


both threats are real but want to talk through whether temp-dir-then-rename is the right shape here.

TOCTOU angle: dest_dir is always under DB_DATA_PATH (the user's data dir) for all 3 callers. anyone who can plant a symlink inside dest_real between validate and extract already has write access to the user's cache dir, which means local code exec is already on the table. temp-dir doesn't actually shrink the threat model since the writable parent dir is the real exposure.

mid-extract failure (disk full / perms) is a separate, legit concern. but the rewrite has costs worth flagging:

doubles peak disk usage during extract (matters for the multi-GB image datasets)

atomic rename only holds within a single fs, falls back to copy-then-delete cross-fs which races again

breaks the vlguard pattern where cache_dir already contains an hf-downloaded json sitting next to test.zip, a clean replace would clobber it

suggest splitting:

fix the docstring now ("no archive members are written" is honest; "left empty" overpromises)

defer the temp-dir rewrite unless we hit a case where partial extract actually bites

happy to do it if you'd rather close the gap proactively, just flagging the trade-offs.
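
For reference, one possible shape of the staging approach under discussion is sketched below. This is an assumption-laden illustration, not a proposal for the PR: `extract_via_staging` is a hypothetical name, member validation is omitted, and entries are moved one by one so that files already sitting in the destination are kept. Staging inside the destination keeps the renames on one filesystem in the common case, but each rename is atomic only per entry, and `os.replace` raises if a staged directory collides with an existing non-empty directory.

```python
import os
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_via_staging(source, dest_dir):
    """Extract into a staging directory, then move entries into dest_dir."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    # Staging lives inside dest so the later renames avoid a cross-fs copy.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=dest))
    try:
        with zipfile.ZipFile(source) as zf:
            # A real helper would validate members before this call.
            zf.extractall(staging)
        # Move top-level entries individually; siblings in dest are untouched.
        for entry in staging.iterdir():
            os.replace(entry, dest / entry.name)
    finally:
        # On any failure the staging directory is removed and dest is unchanged.
        shutil.rmtree(staging, ignore_errors=True)
    return dest
```

The peak disk usage and per-entry atomicity caveats raised above apply to this sketch as well.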

+        for m in members:
+            zf.extract(m, dest_real)


+``safe_extract_zip`` validates every archive member before writing anything to
+disk. If any member fails validation, the destination directory is left empty.


+    if stat.S_ISLNK(mode) or stat.S_ISBLK(mode) or stat.S_ISCHR(mode) or stat.S_ISFIFO(mode) or stat.S_ISSOCK(mode):
+        raise UnsafeArchiveError(f"disallowed entry type: {m.filename}")


+    idx = raw.rfind(b"PK\x01\x02")
+    struct.pack_into("<I", raw, idx + 20, 0)
+    patched = io.BytesIO(bytes(raw))
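
The hunk above appears to come from a test that patches the archive bytes. A self-contained sketch of the same technique follows; the archive contents are made up for illustration. In a ZIP central directory file header (signature `PK\x01\x02`), the 4-byte little-endian compressed size sits at offset 20, so zeroing it produces an entry whose recorded compressed size is 0. Presumably this exercises the ratio check's handling of a zero denominator, though that is an assumption about the test's intent.

```python
import io
import struct
import zipfile

# Build a one-entry archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.bin", b"\0" * 4096)

raw = bytearray(buf.getvalue())
# Locate the last central directory file header.
idx = raw.rfind(b"PK\x01\x02")
assert idx != -1
# Overwrite the compressed-size field (offset 20) with zero.
struct.pack_into("<I", raw, idx + 20, 0)
patched = io.BytesIO(bytes(raw))

# zipfile reads member metadata from the central directory.
with zipfile.ZipFile(patched) as zf:
    info = zf.infolist()[0]
```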


Signed-off-by: francose <13445813+francose@users.noreply.github.com>

romanlutz reviewed Jun 10, 2026

romanlutz self-assigned this Jun 10, 2026

francose force-pushed the feat/safe-zip-extract branch from bac6c33 to 1f1e792 Compare June 10, 2026 12:37

francose marked this pull request as ready for review June 10, 2026 13:12

Copilot AI review requested due to automatic review settings June 10, 2026 13:12

Copilot AI reviewed Jun 10, 2026

FEAT Add safe_extract_zip helper for defensive remote ZIP extraction

72f553a

Signed-off-by: francose <13445813+francose@users.noreply.github.com>

francose force-pushed the feat/safe-zip-extract branch from 1f1e792 to 72f553a Compare June 11, 2026 19:35

francose wants to merge 1 commit into microsoft:main from francose:feat/safe-zip-extract
