fix: harden streamed TAR and NeMo inspection#1665

Open
mldangelo-oai wants to merge 62 commits into
mainfrom
mdangelo/codex/hf-fp-t38-large-tar-streaming-20260610

@mldangelo-oai

@mldangelo-oai mldangelo-oai commented Jun 11, 2026

Contributor

Summary

Harden TAR/NeMo archive inspection for large raw and compressed TAR-backed model artifacts without whole-archive materialization. Compressed TARs traverse through bounded r| readers; raw TARs retain the seekable path. TAR findings, NeMo Hydra coverage, HDF5 owner/supplemental precedence, link safety, incomplete/no-cache outcomes, and finding exit precedence are preserved.

Final exact-head repairs

  • Strict gzip/bzip2/xz concatenated-member TAR reading validates compressed stream endings and bounds physical zero padding; gzip uses zlib gzip-member validation so CRC/ISIZE/FHCRC failures and late zlib errors fail closed.
  • Proven non-HDF5 compressed TARs whose bounded NeMo metadata probe exhausts hand off to TarScanner, preserving earlier/reachable root-config findings; HDF5 overlap remains owner-first with TAR supplemental analysis.
  • Bare .gz/.bz2/.xz TARs route by structure before wrapper fallback, while file-type validation accepts only a structurally proven TAR with the matching physical codec, avoiding wrong-codec S901 false positives.
  • Declared compressed wrappers with matching structurally valid codec headers retain compressed ownership across SafeTensors header collisions, so gzip+SafeTensors polyglots fail closed through compressed supplemental scanning.
  • Raw TAR dispatch first proves a valid raw TAR header before considering wrapper magic, and raw TAR EOF rejects bounded nonzero tails while allowing valid sparse zero padding.
  • Removed replay-only NeMo assignments; typed the modified compressed-TAR fixture under tmp_path; documented generic TAR handoff compatibility.
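
As an illustration of the strict gzip behavior described in the first bullet, here is a minimal sketch using only the standard-library `zlib` module. The function name `read_gzip_members_strict` and the `max_output` / `max_padding` parameters are hypothetical and are not taken from this PR; the sketch only shows the general technique of letting zlib validate each gzip member and bounding trailing zero padding.

```python
import gzip
import zlib


def read_gzip_members_strict(data: bytes, max_output: int, max_padding: int = 4096) -> bytes:
    """Decompress concatenated gzip members, failing closed on any defect."""
    out = bytearray()
    rest = data
    while rest:
        # Anything after the last member must be a bounded run of zero bytes.
        if not any(rest):
            if len(rest) > max_padding:
                raise ValueError("zero padding exceeds limit")
            break
        # wbits=31 selects gzip framing, so zlib itself checks the member
        # header and the CRC32/ISIZE trailer and raises zlib.error on mismatch.
        decomp = zlib.decompressobj(wbits=31)
        try:
            # Ask for at most one byte more than the remaining allowance.
            chunk = decomp.decompress(rest, max_output - len(out) + 1)
        except zlib.error as exc:
            raise ValueError(f"corrupt gzip member: {exc}") from exc
        out += chunk
        if len(out) > max_output:
            raise ValueError("decompressed size exceeded limit")
        if not decomp.eof:
            raise ValueError("truncated gzip member")
        rest = decomp.unused_data
    return bytes(out)


# Usage: two concatenated members followed by 16 bytes of zero padding.
sample = gzip.compress(b"config: a\n") + gzip.compress(b"weights\n") + bytes(16)
print(read_gzip_members_strict(sample, max_output=1024))
```

Nonzero bytes after the padding are handed back to zlib as a would-be member and fail the header check, so they are rejected rather than ignored.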

Scope note

The replacement remains additive and does not rewrite PR history or main. It is not the requested ~1200 net production-line compact rewrite: current exact diff versus origin/main is production Python +2296/-464 (net +1832). The current behavior is validated, but further simplification is still warranted before merge if the compact-shape requirement is mandatory.

Validation

Exact local head: cfd24355.

  • uv run ruff check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run ruff format --check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=1 (before final tiny bare-wrapper validation repair: 19,676 passed, 1,401 skipped)
  • Exact final focused TAR scanner: 184 passed, 1 opt-in artifact test skipped.
  • Exact final TAR/NeMo/compressed/filetype affected surface before final tiny bare-wrapper validation repair: 1,928 passed, 12 optional-dependency/opt-in skips.
  • Final gzip+SafeTensors routing repair: exact failing test 1 passed; adjacent core routing 5 passed; adjacent file-detection 7 passed; changed-scope Ruff format/check and mypy clean.

Exact-head GitHub Actions and required human approval are pending; do not merge until both are green.

Contributor Author

@codex review

@github-actions

github-actions Bot commented Jun 11, 2026

Contributor

Workflow run and artifacts

Performance Benchmarks

Compared 13 shared benchmarks with a regression threshold of 15%.
Status: 0 regressions, 0 improved, 13 stable, 0 new, 0 missing.
Aggregate shared-benchmark median: 4.415s -> 4.368s (-1.1%).

| Workload | Benchmark | Target | Size | Files | Baseline | Current | Change | Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nested-payload-review | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_base64] | nested_base64 | 98 B | 1 | 291.1us | 299.3us | +2.8% | stable |
| rejected-basic-auth-candidates | tests/benchmarks/test_scan_benchmarks.py::test_rejected_basic_auth_candidates_scan_linearly | - | 371.1 KiB | 1 | 2.577s | 2.512s | -2.5% | stable |
| padded-multi-stream-upload | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_padded_multi_stream_upload | multi_stream_padded | 4.1 KiB | 1 | 349.2us | 342.5us | -1.9% | stable |
| suspicious-pickle-intake | tests/benchmarks/test_scan_benchmarks.py::test_scan_suspicious_pickle_intake | suspicious-intake | 183.8 KiB | 4 | 148.42ms | 150.88ms | +1.7% | stable |
| clean-training-checkpoint | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_clean_training_checkpoint | safe_large | 278.2 KiB | 1 | 111.63ms | 113.39ms | +1.6% | stable |
| warm-cache-rescan | tests/benchmarks/test_scan_benchmarks.py::test_scan_warm_cached_repository_rescan | release-candidate | 547.3 KiB | 32 | 152.76ms | 154.78ms | +1.3% | stable |
| duplicate-heavy-registry | tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_registry_snapshot | registry-snapshot | 915.2 KiB | 13 | 571.78ms | 578.75ms | +1.2% | stable |
| nested-payload-review | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_hex] | nested_hex | 130 B | 1 | 305.9us | 303.2us | -0.9% | stable |
| mixed-model-repository | tests/benchmarks/test_scan_benchmarks.py::test_scan_release_candidate_repository | release-candidate | 547.3 KiB | 32 | 632.24ms | 636.92ms | +0.7% | stable |
| single-checkpoint-preflight | tests/benchmarks/test_scan_benchmarks.py::test_scan_single_checkpoint_before_load | single_checkpoint.pkl | 183.0 KiB | 1 | 103.25ms | 103.72ms | +0.5% | stable |
| nested-payload-review | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_raw] | nested_raw | 78 B | 1 | 272.6us | 273.4us | +0.3% | stable |
| chunked-upload-stream | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_chunked_upload_stream | chunked_stream | 278.2 KiB | 1 | 116.16ms | 116.37ms | +0.2% | stable |
| direct-malicious-upload | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_direct_malicious_upload | malicious_reduce | 52 B | 1 | 221.3us | 220.9us | -0.2% | stable |
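
For readers checking the Change and Status columns, the arithmetic can be sketched as follows. The report states only the 15% regression threshold; treating a drop of more than 15% as "improved" is an assumption here, and `classify_change` is a hypothetical helper, not the benchmark tool's own code.

```python
def classify_change(baseline: float, current: float, threshold_pct: float = 15.0) -> tuple[float, str]:
    """Return (percent change, status) for one baseline/current pair in the same unit."""
    change_pct = (current - baseline) / baseline * 100.0
    if change_pct > threshold_pct:
        status = "regression"
    elif change_pct < -threshold_pct:
        # Assumption: the threshold is applied symmetrically for improvements.
        status = "improved"
    else:
        status = "stable"
    return round(change_pct, 1), status


# Aggregate median from the report: 4.415s -> 4.368s.
print(classify_change(4.415, 4.368))
```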

Comment thread modelaudit/scanners/tar_scanner.py Fixed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a2cd64e965

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/scanners/tar_scanner.py Outdated
@mldangelo-oai

Contributor Author

@codex address these exact-head independent-review blockers before merge:

  • High: core TAR routing traverses and decompresses the stream before the new scanner guard. A 32 KiB gzip TAR with a 32 MiB member returned more than 67 MB (about 64 MiB) from gzip reads despite a 64 KiB cap; the pinned 1.25 GB artifact was decompressed four full times (5.38 GB).
  • High: PAX/GNU extension headers are materialized inside tar.next() before limits are checked. An 8 KiB gzip TAR with an 8 MiB PAX pathname caused about 67 MB (about 64 MiB) of peak allocation under a 64 KiB limit.
  • High: max_total_size is only diagnosed after all TAR members are extracted and dispatched. Ten 32 KiB members were all scanned under a 64 KiB total budget; nested archives reset the budget.
  • Medium: unsafe/rejected links are still published with scan_status="link_validated".

Please carry one shared bounded work/decompression budget through routing, preflight, extraction, and nested dispatch; bound raw extension headers before tarfile materializes them; and add instrumented end-to-end regressions. Independent report: /Users/mdangelo/modelaudit-pr-swarm/outputs/pr-1665-independent-review-20260611.md.

Contributor Author

@codex review the current head after the TAR link inventory review fix. Please verify unsafe symlink/hardlink members are recorded as rejected while safe links remain link_validated, and that sparse compatibility handling stays unchanged.

@chatgpt-codex-connector


Codex Review: Didn't find any major issues. Swish!

@mldangelo-oai

Contributor Author

PR #1665 independent re-review

Findings

1. High — compressed-TAR limits still run after an unbounded routing traversal

The current head still routes TAR content before TarScanner preflight. core.scan_file() calls format detection at modelaudit/core.py:3620-3649; compressed/TAR detection calls _detect_tar_route() at modelaudit/utils/file/detection.py:6076-6108; and _detect_tar_route() opens and iterates the entire archive at modelaudit/utils/file/detection.py:3796-3824. The configured decompression limit is not consulted until TarScanner._preflight_tar_archive() calls tar.next() and then checks tar.offset at modelaudit/scanners/tar_scanner.py:405-456.

Exact-head reproduction at 4c0a5be212f19ceb530d2f6980db069abf18af06:

  • Input: 32,720-byte gzip TAR declaring one 32 MiB regular member.
  • Configuration: compressed_max_decompressed_bytes=65,536, permissive ratio, caching disabled.
  • Result: the scanner eventually returned the expected limit failure, but instrumented gzip._GzipReader.read() returned 67,129,344 decompressed bytes before termination—more than 1,000 times the configured cap and approximately two complete logical traversals.

The pushed delta from a2cd64e9655b68a996960aa6290fd75ca1586109 to the current head changes only link inventory handling, tests, and an explanatory comment; it does not alter routing or preflight. The direct-scanner bomb test at tests/scanners/test_tar_scanner.py:1452-1474 verifies only the terminal result. The core-routing test at tests/scanners/test_tar_scanner.py:2265-2285 likewise does not measure work performed before the failure.

Required fix: make TAR routing bounded/header-only or transfer TAR ownership before _detect_tar_route() traverses the stream. Enforce one work/decompression budget across routing, preflight, extraction, and nested dispatch. Add a core.scan_file() regression that instruments the decompressor and asserts bytes returned never exceed a small bounded allowance around the configured cap.
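
One possible shape for the instrumented regression requested above, as a minimal sketch rather than the project's test: `BoundedReader`, `make_gzip_tar`, and `walk_bounded` are hypothetical names, and the counter here measures bytes handed from the gzip layer to `tarfile`, not bytes produced inside `gzip._GzipReader`.

```python
import gzip
import io
import tarfile


class BudgetExceeded(Exception):
    """Raised when a reader is asked for bytes beyond its budget."""


class BoundedReader:
    """Counts decompressed bytes handed to the caller and enforces a cap."""

    def __init__(self, raw, limit: int) -> None:
        self._raw = raw
        self._limit = limit
        self.bytes_returned = 0

    def read(self, size: int = -1) -> bytes:
        remaining = self._limit - self.bytes_returned
        if remaining <= 0:
            raise BudgetExceeded(f"decompressed size exceeded limit ({self._limit})")
        # Never request more than the remaining allowance from the decompressor.
        if size is None or size < 0 or size > remaining:
            size = remaining
        chunk = self._raw.read(size)
        self.bytes_returned += len(chunk)
        return chunk


def make_gzip_tar(member_size: int) -> bytes:
    """Build a small gzip TAR declaring one highly compressible member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("weights.bin")
        info.size = member_size
        tar.addfile(info, io.BytesIO(bytes(member_size)))
    return buf.getvalue()


def walk_bounded(blob: bytes, limit: int) -> tuple[bool, int]:
    """Return (completed, bytes_returned) for a streamed header walk under a cap."""
    reader = BoundedReader(gzip.GzipFile(fileobj=io.BytesIO(blob)), limit)
    try:
        # "r|" is tarfile's forward-only stream mode; it only calls read().
        with tarfile.open(fileobj=reader, mode="r|") as tar:
            for _member in tar:
                pass
        return True, reader.bytes_returned
    except BudgetExceeded:
        return False, reader.bytes_returned
```

A regression built this way would assert on `bytes_returned` as well as on the terminal result, which is the property the existing tests do not measure.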

2. High — PAX/GNU extension metadata is still materialized before any size check

Both TAR passes call tar.next() before inspecting the consumed offset (modelaudit/scanners/tar_scanner.py:412-456 and modelaudit/scanners/tar_scanner.py:628-648). Python tarfile consumes hidden PAX extended headers and GNU longname/longlink records inside tar.next(), including allocating and parsing their bodies, before returning a visible TarInfo. The scanner therefore cannot enforce its decompression or entry-size limits until after metadata allocation has occurred.

Exact-head reproduction:

  • Input: 8,318-byte gzip TAR with an 8 MiB PAX pathname.
  • Configuration: compressed_max_decompressed_bytes=65,536, permissive ratio.
  • Result: the scanner failed with Decompressed size exceeded limit (8396800 > 65536), but only after tracemalloc recorded 67,177,758 bytes of peak allocation.

The sparse regression at tests/scanners/test_tar_scanner.py:1922-1947 synthesizes GNU.sparse.size through normal tarfile writing and validates the post-parse classification. It does not bound allocation or cover adversarial PAX path/linkpath/global headers, malformed PAX lengths, GNU longname/longlink bodies, or real GNU/PAX sparse encodings.

Required fix: inspect and cumulatively bound raw extension-header bodies before delegating them to tarfile, then carry the same metadata/work budget across all passes and nested archives. Add adversarial PAX/GNU tests that assert both fail-closed outcomes and bounded decompressor bytes/peak allocation.
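
A minimal sketch of the pre-check described above, assuming the scanner can look at each raw 512-byte header block before `tarfile` parses it. `check_extension_header` and `parse_size` are hypothetical helpers; the field offsets (size at bytes 124-135, typeflag at byte 156) come from the ustar header layout.

```python
import io
import tarfile

BLOCK = 512
# PAX per-file header, PAX global header, GNU longname, GNU longlink.
EXTENSION_TYPES = {b"x", b"g", b"L", b"K"}


def parse_size(field: bytes) -> int:
    """Decode the 12-byte size field (octal, or GNU base-256 if the high bit is set)."""
    if field[0] & 0x80:
        # Base-256 values (including negative encodings) decode to a large
        # number here, so they fail the budget check instead of slipping through.
        return int.from_bytes(field[1:], "big")
    text = field.split(b"\0", 1)[0].strip()
    return int(text.decode("ascii"), 8) if text else 0


def check_extension_header(header: bytes, remaining_budget: int) -> int:
    """Return the padded body size of an extension header, or 0 for other types.

    Raises ValueError before any body bytes are read when the declared body
    does not fit in the remaining metadata budget.
    """
    if len(header) != BLOCK:
        raise ValueError("truncated TAR header")
    if header[156:157] not in EXTENSION_TYPES:
        return 0
    size = parse_size(header[124:136])
    padded = -(-size // BLOCK) * BLOCK  # round up to whole 512-byte blocks
    if padded > remaining_budget:
        raise ValueError(f"extension body {size} exceeds metadata budget {remaining_budget}")
    return padded


# Usage: a PAX archive whose member name is too long for the ustar name field.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    tar.addfile(tarfile.TarInfo("a" * 4096))
pax_header = buf.getvalue()[:BLOCK]
print(check_extension_header(pax_header, remaining_budget=65536))
```

The caller would subtract the returned size from the shared budget before letting `tarfile` consume the body, so repeated or chained extension headers are bounded cumulatively.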

3. High — max_total_size remains an after-the-fact diagnostic and resets across nested archives

TAR extraction enforces only a per-member size at modelaudit/scanners/tar_scanner.py:293-333. Every regular member is extracted and dispatched at modelaudit/scanners/tar_scanner.py:749-818; nested scans receive fresh copies of the unchanged configuration at modelaudit/scanners/tar_scanner.py:768-804. The aggregate limit is checked only after scan_file() returns at modelaudit/core.py:3178-3198.

Exact-head reproductions:

  • Flat archive: ten 32 KiB members under max_total_size=65,536 produced bytes_scanned=327,680; only afterward did core report Total scan size limit exceeded: 327680 bytes (max: 65536).
  • Nested archive: two nested TARs containing one 40 KiB member each under the same total limit produced bytes_scanned=81,920; both children were scanned before the aggregate failure.

This leaves max_total_size unable to bound extraction, scanner dispatch, or nested decompression work. The only nearby aggregate regression, tests/test_core.py:2182-2206, mocks scan_file() for a directory and proves only that core stops between top-level targets. No test exercises cumulative member or nested-archive work.

Required fix: pass a shared remaining-byte/work budget object through the complete archive tree; reject declared members that cannot fit, decrement actual streamed bytes, and stop before dispatching work that crosses the budget. Preserve earlier security findings and mark the stopped archive inconclusive. Add flat and multi-level nested end-to-end tests with read/dispatch counters.
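
The shared-budget idea can be sketched on a toy archive tree. Everything here is illustrative: `ScanBudget` and `scan_archive` are hypothetical names, and archives are modeled as dicts mapping member names to byte sizes or to nested dicts, not as real TAR files.

```python
from dataclasses import dataclass


class BudgetExhausted(Exception):
    """Raised when a declared member cannot fit in the remaining budget."""


@dataclass
class ScanBudget:
    remaining: int

    def reserve(self, declared_size: int) -> None:
        """Reject a member up front if its declared size cannot fit."""
        if declared_size > self.remaining:
            raise BudgetExhausted(f"{declared_size} > {self.remaining}")

    def consume(self, streamed: int) -> None:
        """Decrement by the bytes actually streamed."""
        self.remaining -= streamed


def scan_archive(members: dict, budget: ScanBudget, scanned: list) -> str:
    """Walk a toy archive; the same budget object is passed to nested archives.

    Returns "complete" or "inconclusive". Names already appended to `scanned`
    are kept, mirroring the requirement to preserve earlier findings.
    """
    for name, entry in members.items():
        if isinstance(entry, dict):
            # Nested archive: share the budget instead of resetting it.
            if scan_archive(entry, budget, scanned) == "inconclusive":
                return "inconclusive"
            continue
        try:
            budget.reserve(entry)
        except BudgetExhausted:
            return "inconclusive"  # stop before dispatching the member
        budget.consume(entry)
        scanned.append(name)
    return "complete"


# Usage: the flat reproduction above, ten 32 KiB members under a 64 KiB budget.
flat = {f"m{i}": 32 * 1024 for i in range(10)}
seen: list = []
print(scan_archive(flat, ScanBudget(65536), seen), seen)
```

With this shape, the nested reproduction (two child archives with one 40 KiB member each) stops after the first child, because the second member's declared size no longer fits.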

Prior-blocker reconciliation

Fixed — rejected-link inventory state

The current commit 4c0a5be212f19ceb530d2f6980db069abf18af06 moves inventory publication after link validation and selects rejected when validation fails (modelaudit/scanners/tar_scanner.py:667-715). Exact-head runtime coverage verified:

  • safe parent-relative symlink: link_validated
  • safe root-relative hardlink: link_validated
  • escaping symlink: rejected
  • escaping hardlink: rejected
  • empty-target symlink: rejected
  • critical absolute hardlink: rejected

The focused suite also passes the new assertions at tests/scanners/test_tar_scanner.py:976-1023 and tests/scanners/test_tar_scanner.py:1082-1115. The prior Codex thread is now outdated but remains unresolved in GitHub; its underlying behavior is fixed. The current code-quality comment thread is resolved.

Test gap: the committed assertions cover rejected escaping symlinks, empty symlinks/hardlinks, and safe symlinks, but do not directly assert inventory status for safe hardlinks, escaping hardlinks, or critical absolute targets. The independent matrix passed; adding those cases would prevent the fixed state from regressing.
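
A compact way to express the six-case matrix as assertions is sketched below. `classify_link` is a hypothetical stand-in for the scanner's validation, written under two assumptions that the PR text does not spell out: symlink targets resolve relative to the member's parent directory, hardlink targets resolve relative to the archive root, and every absolute target is rejected.

```python
import posixpath


def classify_link(member_name: str, target: str, *, hardlink: bool) -> str:
    """Return "link_validated" or "rejected" for a TAR link member."""
    # Empty and absolute targets are rejected outright (assumption).
    if not target or target.startswith("/"):
        return "rejected"
    base = "" if hardlink else posixpath.dirname(member_name)
    resolved = posixpath.normpath(posixpath.join(base, target))
    # A normalized path that still starts with ".." escapes the archive root.
    if resolved == ".." or resolved.startswith("../"):
        return "rejected"
    return "link_validated"


matrix = [
    ("model/link", "../weights.bin", False, "link_validated"),  # safe parent-relative symlink
    ("model/hard", "model/weights.bin", True, "link_validated"),  # safe root-relative hardlink
    ("model/link", "../../etc/passwd", False, "rejected"),  # escaping symlink
    ("model/hard", "../outside", True, "rejected"),  # escaping hardlink
    ("model/link", "", False, "rejected"),  # empty-target symlink
    ("model/hard", "/etc/passwd", True, "rejected"),  # absolute hardlink
]
for name, target, is_hard, expected in matrix:
    assert classify_link(name, target, hardlink=is_hard) == expected
```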

Exact review snapshot

  • PR: fix: harden streamed TAR and NeMo inspection #1665
  • Current reviewed head: 4c0a5be212f19ceb530d2f6980db069abf18af06
  • Previous reviewed head, explicitly not reused as current: a2cd64e9655b68a996960aa6290fd75ca1586109
  • Current base/main SHA: 8d6c4864fe2ea833ceaef1b9803d225afb1e8d69
  • Head commits: a2cd64e9655b68a996960aa6290fd75ca1586109, 4c0a5be212f19ceb530d2f6980db069abf18af06
  • Changed files: CHANGELOG.md, modelaudit/scanners/tar_scanner.py, tests/scanners/test_tar_scanner.py
  • Snapshot refreshed from GitHub at 2026-06-11T04:54:08Z; the head remained unchanged across the review.
  • AGENTS.md scope: repository root only for all changed files; no scoped descendant AGENTS.md applies.

Independent validation

All commands used an extracted source archive of the exact head under /private/tmp; the existing local checkout was not modified.

Focused exact-head suite:

PYTHONDONTWRITEBYTECODE=1 \
PYTHONPATH=/private/tmp/modelaudit-pr1665-4c0 \
PROMPTFOO_DISABLE_TELEMETRY=1 NO_ANALYTICS=1 \
/Users/mdangelo/code/modelaudit/.venv/bin/python -m pytest \
  tests/scanners/test_tar_scanner.py -q -p no:cacheprovider

140 passed, 1 skipped in 32.93s

The skipped test is the opt-in 1.25 GB pinned Hugging Face artifact test at tests/scanners/test_tar_scanner.py:1506.

Additional exact-head checks:

ruff check modelaudit/scanners/tar_scanner.py tests/scanners/test_tar_scanner.py
All checks passed!

ruff format --check modelaudit/scanners/tar_scanner.py tests/scanners/test_tar_scanner.py
2 files already formatted

Custom end-to-end probes:

| Probe | Configured bound | Observed work/result |
| --- | --- | --- |
| Core-routed gzip TAR | 65,536 decompressed bytes | 67,129,344 bytes returned by gzip before failure |
| 8 MiB PAX pathname | 65,536 decompressed bytes | 67,177,758-byte peak allocation before failure |
| Ten 32 KiB members | 65,536 total bytes | 327,680 bytes scanned before aggregate failure |
| Two nested 40 KiB members | 65,536 total bytes | 81,920 bytes scanned before aggregate failure |
| Link status matrix | safe/rejected semantics | all six expected statuses and S406 failures matched |

No additional independent correctness or policy finding survived validation beyond the three unchanged resource-bound blockers above.

Live CI and review state

At the final snapshot:

  • GitHub reported mergeable: MERGEABLE, mergeStateStatus: BLOCKED, and reviewDecision: REVIEW_REQUIRED.
  • No completed check was failing.
  • CodeQL, lint/format, type check, dependency audit, package build, vendored-proto checks, Docker checks, documentation formatting, title validation, and benchmarks had passed.
  • Quick Feedback Python 3.12, Windows Python 3.11, Python 3.10, and Python 3.13 were still in progress.
  • Review threads: one resolved current-head code-quality thread; one unresolved but outdated link-inventory thread whose behavior is now fixed.
  • PR CI still has no end-to-end assertion for pre-routing decompression work, metadata allocation, or cumulative nested TAR budgets.

Merge disposition

REQUEST CHANGES — NOT MERGE READY.

The rejected-link blocker is fixed at the current exact head. The three high-severity resource-bound blockers are unchanged and independently reproduced: compressed TAR routing performs substantial work before the advertised cap, PAX/GNU metadata allocates before limits are checked, and max_total_size does not stop flat or nested archive work. Green focused tests and currently green completed CI jobs do not exercise those boundaries.

Before merge, implement shared end-to-end routing/decompression/metadata/total-work budgets, add instrumented core-level regressions, rerun the focused suite and adversarial probes, and complete exact-head CI.

@chatgpt-codex-connector


Summary

Testing

  • PROMPTFOO_DISABLE_TELEMETRY=1 NO_ANALYTICS=1 uv run pytest tests/scanners/test_tar_scanner.py tests/test_core.py::test_scan_file_size_limit_precedes_zip_preflight tests/scanners/test_compressed_scanner.py -q
  • PROMPTFOO_DISABLE_TELEMETRY=1 NO_ANALYTICS=1 uv run pytest tests/test_core.py -q
  • PROMPTFOO_DISABLE_TELEMETRY=1 NO_ANALYTICS=1 uv run pytest tests/utils/file/test_filetype.py -q
  • uv run ruff format --check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run ruff check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • git diff --check
  • ⚠️ PROMPTFOO_DISABLE_TELEMETRY=1 NO_ANALYTICS=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=1 (warning: manually stopped after unrelated failures appeared before completion; targeted affected suites above passed)

Committed changes on the current branch with commit 80b5740 and created the PR record via make_pr.

@mldangelo-oai

Contributor Author

@codex review

Comment thread modelaudit/scanners/tar_scanner.py Fixed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0311885d1

Comment thread modelaudit/scanners/tar_scanner.py Outdated
Comment thread modelaudit/utils/file/detection.py Outdated
@mldangelo-oai

Contributor Author

@codex review

Comment thread modelaudit/scanners/tar_scanner.py Fixed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e0fb2da481

Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/scanners/tar_scanner.py
Comment thread modelaudit/scanners/tar_scanner.py Outdated
@mldangelo-oai

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ad0802b00

Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/scanners/tar_scanner.py
Comment thread modelaudit/scanners/tar_scanner.py Outdated
@mldangelo-oai

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a9465d57c

Comment thread modelaudit/scanners/tar_scanner.py
Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/scanners/tar_scanner.py
@mldangelo-oai

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6f2ad9c6b2

Comment thread modelaudit/utils/file/detection.py Outdated
@mldangelo-oai

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4300acb118

Comment thread modelaudit/core.py Outdated
@mldangelo-oai mldangelo-oai requested review from a team and agold-oai and removed request for a team June 17, 2026 03:34
@mldangelo-oai

Contributor Author

@codex review

Comment thread modelaudit/scanners/compressed_scanner.py Fixed
Comment thread modelaudit/scanners/compressed_scanner.py Fixed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2e6aafc229

Comment thread modelaudit/scanners/tar_scanner.py Outdated
@mldangelo-oai

Contributor Author

@codex review

@chatgpt-codex-connector


Codex Review: Didn't find any major issues. Swish!

Reviewed commit: 4763b15dfe

@mldangelo-oai

Contributor Author

@codex review

@chatgpt-codex-connector


Codex Review: Didn't find any major issues. More of your lovely PRs please.

Reviewed commit: 67b7f48ca7

…t38-large-tar-streaming-20260610

# Conflicts:
#	modelaudit/core.py
#	modelaudit/utils/file/detection.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 36563ef2b7

Comment thread modelaudit/utils/file/detection.py Outdated
Comment on lines +6853 to +6854
if not allow_incomplete_generic_tar_route and find_hdf5_signature_offset(path) is not None:
    return NEMO_ROUTING_INCONCLUSIVE_FORMAT

P1 Badge Route proven compressed TARs despite HDF5 bytes

When a renamed gzip/bzip2/xz TAR has a first regular member larger than the 64 KiB NeMo routing skip budget, this branch returns nemo_routing_inconclusive solely because find_hdf5_signature_offset() finds a legal-offset HDF5 superblock anywhere in the physical wrapper. A gzip FEXTRA field can place a plausible HDF5 superblock at offset 512 while the stream is still a valid TAR; with a generic .gz/.bin suffix, core then treats the file as an HDF5/NeMo overlap and never invokes TarScanner, so later malicious TAR members that the streaming scanner would inspect are reduced to an operational routing failure. Distinguish top-level HDF5/userblock ownership from HDF5 bytes inside the compressed wrapper, or keep the already-proven compressed TAR routed to tar for generic suffixes.
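
To make the overlap concrete, the sketch below builds a standards-conforming gzip member whose FEXTRA field carries the 8-byte HDF5 signature at physical offset 512. It is an illustration of the format collision only: `gzip_with_fextra_hdf5` is a hypothetical helper, the `HD` subfield ID is arbitrary, and the sketch places just the signature bytes, not a full plausible superblock.

```python
import gzip
import struct
import zlib

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def gzip_with_fextra_hdf5(payload: bytes, signature_offset: int = 512) -> bytes:
    """Build a valid gzip member whose FEXTRA field holds the HDF5 signature."""
    fixed_header_len = 12  # 10-byte fixed header plus the 2-byte XLEN field
    subfield_header_len = 4  # SI1, SI2 and a 2-byte subfield length
    pad = signature_offset - fixed_header_len - subfield_header_len
    if pad < 0:
        raise ValueError("signature offset is inside the gzip header")
    sub_data = bytes(pad) + HDF5_SIGNATURE
    extra = b"HD" + struct.pack("<H", len(sub_data)) + sub_data
    # ID1 ID2 CM=8 FLG=FEXTRA, MTIME=0, XFL=0, OS=255 (unknown), then XLEN.
    header = b"\x1f\x8b\x08\x04" + bytes(4) + b"\x00\xff" + struct.pack("<H", len(extra))
    deflater = zlib.compressobj(wbits=-15)  # raw deflate, no zlib wrapper
    body = deflater.compress(payload) + deflater.flush()
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + extra + body + trailer


blob = gzip_with_fextra_hdf5(b"tar bytes would go here")
assert blob[512:520] == HDF5_SIGNATURE
assert gzip.decompress(blob) == b"tar bytes would go here"
```

A detector that searches the physical wrapper for HDF5 magic sees a hit at a legal offset, while any gzip reader decodes the payload normally, which is why signature presence alone cannot decide ownership.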

AGENTS.md reference: AGENTS.md:L113-L116

Comment thread modelaudit/scanners/tar_scanner.py Outdated
Comment on lines +2499 to +2500
if entry_count > scanner.max_entries:
    return "inconclusive"

P1 Badge Preserve raw TAR user-block scans at entry limits

When an HDF5 user block contains a complete raw TAR prefix, this ownership proof uses the caller's max_tar_entries as a structural cutoff and returns inconclusive. If a root model_config.yaml appears before a later entry that exceeds that configured scan limit, core treats ownership as incomplete and never runs the bounded NeMo/TAR supplemental scan that would have reported the reachable Hydra finding plus the entry-limit failure. Return a scan-limit state here, as the compressed ownership path does, so detections before the limit are preserved.

AGENTS.md reference: AGENTS.md:L113-L116

Comment on lines +225 to +231
while True:
    chunk = source.read(chunk_size)
    if not chunk:
        break
    if any(chunk):
        raise _CorruptStreamError("Invalid compressed stream: nonzero bytes follow zero padding")
    padding_bytes += len(chunk)

P2 Badge Bound accepted compressed user-block padding

For HDF5 user-block supplemental compressed scans, core enables zero-padding acceptance and bounds the source at the HDF5 signature offset, so a tiny compressed payload followed by a large zero-filled user block makes this loop read every padding byte before returning. With large-file limits raised or disabled for real artifacts, a sparse-looking but allocated user block can spend hundreds of MiB or more in padding validation even though those bytes are non-payload; cap accepted padding with a configured limit or fail closed once it is exceeded.
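
A capped variant of the loop quoted above might look like the following sketch. `consume_zero_padding`, `CorruptStreamError`, and the `max_padding` parameter are hypothetical; the real code would take the cap from scanner configuration.

```python
import io


class CorruptStreamError(Exception):
    """Stand-in for the scanner's own corrupt-stream exception."""


def consume_zero_padding(source, max_padding: int, chunk_size: int = 65536) -> int:
    """Read trailing zero padding, failing closed once the cap is exceeded."""
    padding_bytes = 0
    while True:
        # Read at most one byte past the cap so oversized padding is detected
        # without scanning the whole tail.
        chunk = source.read(min(chunk_size, max_padding - padding_bytes + 1))
        if not chunk:
            return padding_bytes
        if any(chunk):
            raise CorruptStreamError("nonzero bytes follow zero padding")
        padding_bytes += len(chunk)
        if padding_bytes > max_padding:
            raise CorruptStreamError("zero padding exceeds configured limit")


# Usage: 100 bytes of padding under a 4 KiB cap.
print(consume_zero_padding(io.BytesIO(bytes(100)), max_padding=4096))
```

The work done is bounded by `max_padding + 1` bytes regardless of how large the user block is.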

AGENTS.md reference: AGENTS.md:L105-L111
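
One way the suggested cap could look is sketched below. It is an assumption-laden illustration, not the project's code: `consume_zero_padding`, `PaddingLimitExceeded`, `CorruptPadding`, and the `max_padding_bytes` parameter are hypothetical names. The loop never reads more than one byte past the cap, so a huge zero-filled region fails closed quickly instead of being read in full.

```python
import io


class CorruptPadding(Exception):
    """Hypothetical error: nonzero bytes were found in the padding region."""


class PaddingLimitExceeded(Exception):
    """Hypothetical error: zero padding exceeded the configured cap."""


def consume_zero_padding(source, max_padding_bytes, chunk_size=64 * 1024):
    """Validate trailing zero padding while reading at most max_padding_bytes + 1 bytes."""
    padding_bytes = 0
    while True:
        remaining = max_padding_bytes - padding_bytes
        # Request one byte beyond the cap so an oversized region is detected
        # without reading the rest of it.
        chunk = source.read(min(chunk_size, remaining + 1))
        if not chunk:
            return padding_bytes
        if any(chunk):
            raise CorruptPadding("nonzero bytes follow zero padding")
        padding_bytes += len(chunk)
        if padding_bytes > max_padding_bytes:
            raise PaddingLimitExceeded("zero padding exceeds configured limit")
```

Whether the limit failure should surface as a scan-limit state or as a hard decode error is a policy choice the sketch leaves to the caller.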

Comment thread modelaudit/core.py Outdated

assert hdf5_signature_offset is not None
supplemental_config = dict(config)
supplemental_config[COMPRESSED_SOURCE_SIZE_LIMIT_CONFIG_KEY] = hdf5_signature_offset

P1 Badge Prove compressed ownership before truncating

When the supplemental route is compressed, this unconditionally caps CompressedScanner at the HDF5 signature offset. For a gzip wrapper whose FEXTRA places a plausible HDF5 superblock at offset 512 while the compressed payload continues afterward, the primary HDF5 route owns the file and the supplemental compressed scan only sees the truncated prefix, reports a decode failure, and never scans the payload that an unbounded compressed scanner would have reached. Prove the compressed stream actually ends at the HDF5 boundary before applying this limit, or fail closed without suppressing the compressed-payload scan.

AGENTS.md reference: AGENTS.md:L113-L116
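
A rough sketch of what "prove the compressed stream actually ends at the HDF5 boundary" could mean for gzip is shown below, using only the standard `zlib` module. The helper names `gzip_member_end_offset` and `prefix_ends_before_boundary` are hypothetical, and the sketch handles a single gzip member held in memory; a real scanner would stream from disk and apply its own output budgets.

```python
import gzip
import zlib


def gzip_member_end_offset(data, chunk_size=64 * 1024):
    """Return the offset just past the first gzip member, or None if it never ends cleanly."""
    decoder = zlib.decompressobj(wbits=31)  # wbits=31 selects gzip framing with trailer checks
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        try:
            decoder.decompress(chunk)  # decoded output is discarded; only the end offset matters
        except zlib.error:
            return None
        if decoder.eof:
            # unused_data holds the bytes of this chunk that follow the gzip trailer.
            return start + len(chunk) - len(decoder.unused_data)
    return None  # input ended before the gzip trailer was reached


def prefix_ends_before_boundary(data, boundary):
    """True only if the member ends at or before boundary and only zero bytes fill the gap."""
    end = gzip_member_end_offset(data)
    return end is not None and end <= boundary and not any(data[end:boundary])
```

Only when such a check succeeds would it be safe to cap the compressed scan at the signature offset; otherwise the payload may legitimately continue past it.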

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2b0c21bd3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/scanners/tar_scanner.py Outdated
while True:
    member = archive.next()
    if member is None:
        return "complete" if scanner._raw_tar_has_complete_end_marker(archive) else "incomplete"

P1 Badge Preserve raw TAR user-block scans at tail padding limits

When an HDF5 user block starts with a valid raw TAR containing model_config.yaml and then has zero padding beyond compressed_max_xz_padding_bytes before the HDF5 superblock, this returns incomplete rather than a scan-limit state. Core treats that as unowned and skips the supplemental NeMo/TAR scan, so a reachable Hydra _target_ before the padding is never reported; classify bounded post-EOF padding exhaustion as a scan-limit/fail-closed condition and merge the prefix scan like entry-limit handling.

AGENTS.md reference: AGENTS.md:L113-L115

Comment thread modelaudit/core.py Outdated
Comment on lines +7063 to +7068
if hdf5_compressed_prefix_ownership == "complete":
    supplemental_config[COMPRESSED_SOURCE_SIZE_LIMIT_CONFIG_KEY] = hdf5_signature_offset
    supplemental_config[ALLOW_ZERO_PADDING_TRAILING_CONFIG_KEY] = True
    supplemental_config[COMPRESSED_PREFIX_OWNERSHIP_CONFIG_KEY] = True
else:
    supplemental_config["cache_enabled"] = False

P1 Badge Preserve compressed user-block payloads at scan limits

When an HDF5 user block contains a compressed payload before excessive zero padding or another compressed scan limit, classify_compressed_prefix_ownership() returns scan_limit, but this branch only applies the HDF5 boundary for complete. The supplemental CompressedScanner then scans the whole .h5 without zero-padding acceptance, fails on the HDF5 body/trailer before dispatching the decompressed payload, and can miss a malicious pickle that was already reachable before the padding; treat scan_limit like the TAR path and run the prefix-bounded compressed scan while preserving the incomplete result.

AGENTS.md reference: AGENTS.md:L113-L115

Comment on lines +2507 to +2508
except (EOFError, OSError, tarfile.TarError, ValueError):
    return "inconclusive"

P1 Badge Preserve raw TAR user-block scans at metadata limits

When an HDF5 user block contains a raw TAR prefix with an early model_config.yaml and a later PAX/GNU metadata record that exceeds max_tar_metadata_bytes, the bounded TarInfo path raises _TarStreamBudgetExceeded, but this catch collapses it to inconclusive. Core then marks TAR ownership incomplete and skips the supplemental NeMo/TAR scan entirely, so the reachable Hydra _target_ finding before the metadata limit is lost; return a scan-limit state for metadata-budget exhaustion so the bounded prefix scan still runs and reports the incomplete coverage.

AGENTS.md reference: AGENTS.md:L113-L115

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd9361b06f

Comment thread modelaudit/core.py Outdated
Comment on lines +7064 to +7069
if hdf5_compressed_prefix_ownership in {"complete", "scan_limit"}:
    supplemental_config[COMPRESSED_SOURCE_SIZE_LIMIT_CONFIG_KEY] = hdf5_signature_offset
    supplemental_config[ALLOW_ZERO_PADDING_TRAILING_CONFIG_KEY] = True
    supplemental_config[COMPRESSED_PREFIX_OWNERSHIP_CONFIG_KEY] = True
    if hdf5_compressed_prefix_ownership == "scan_limit":
        supplemental_config[PRESERVE_LIMITED_PREFIX_PAYLOAD_CONFIG_KEY] = True

P1 Badge Preserve compressed user-block findings after corrupt tails

When a validated HDF5 user block begins with a complete compressed payload, such as a malicious pickle, and then has non-zero user-block bytes before the HDF5 superblock, ownership becomes incomplete; because this block only enables the bounded/preserve path for complete or scan_limit, the supplemental CompressedScanner sees the trailer as a decode error and deletes the temp payload before routing it, so the reachable S201 finding is missed. Preserve the decoded prefix and report the ownership failure afterward.

AGENTS.md reference: AGENTS.md:L113-L115

Comment thread modelaudit/core.py Outdated
Comment on lines +7094 to +7095
elif hdf5_tar_prefix_ownership != "embedded_member":
    supplemental_ownership_inconclusive = True

P1 Badge Preserve raw TAR user-block findings after tail failures

When a validated HDF5 user block starts with a raw TAR/NeMo archive that already contains a malicious model_config.yaml or pickle, adding non-zero bytes after the TAR EOF but before the HDF5 superblock makes classify_raw_tar_prefix_ownership() return incomplete; this branch then suppresses the supplemental NeMo/TAR scan, so reachable findings are replaced by only hdf5_tar_prefix_ownership_incomplete. Run the prefix-bounded scan and merge the ownership failure afterward.

AGENTS.md reference: AGENTS.md:L113-L115

Comment on lines +6882 to +6884
ownership = classify_compressed_tar_prefix_ownership(path, hdf5_signature_offset)
if ownership != "incomplete":
    return NEMO_ROUTING_INCONCLUSIVE_FORMAT

P1 Badge Keep proven HDF5 user-block TARs routed to TAR

When a valid HDF5 user block contains a complete compressed TAR whose first regular member exceeds the 64 KiB NeMo routing skip budget, this returns nemo_routing_inconclusive even though ownership was proved within the HDF5 boundary. In core that inconclusive format is not selected as a supplemental tar/nemo scanner, so later reachable members such as a malicious model_config.yaml or pickle are skipped and only the routing-incomplete result is reported; return tar (or otherwise carry the bounded TAR route) so the streaming scanner preserves those findings.

AGENTS.md reference: AGENTS.md:L113-L115

Comment on lines +1233 to 1236
source_size = self._source_size(path)
size_check_result = self._check_source_size_limit(path, source_size)
if size_check_result:
    return size_check_result

P1 Badge Preserve bounded compressed prefixes despite source caps

Fresh evidence in the current tree is that scan_limit now enables the preserve flag, but this early size check still returns before _decompress_to_tempfiles() can preserve anything. When an HDF5 user block is large (for example >512 MiB of zero padding before the HDF5 superblock) but contains a small malicious gzip/bzip2/xz/zlib payload at the start, ownership is classified as scan_limit, the supplemental compressed scan hits this source-size result, and the reachable payload findings are never merged; skip this source-size short-circuit for bounded prefix ownership or run the preserved-prefix path before reporting the limit.

AGENTS.md reference: AGENTS.md:L113-L115

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0d129e2657

},
)
if compression_codec is not None:
    break

P1 Badge Continue scanning after oversized compressed TAR members

When a gzip/bzip2/xz TAR contains an entry that trips _TarEntryExtractionIncomplete only because it is over max_entry_size, this break stops before later headers even though the aggregate and compressed-stream budget checks above accepted the member. A crafted archive can put an oversized benign file before a malicious pickle or NeMo config and get only an incomplete entry check, so continue draining/skipping the member and keep scanning until a real stream or budget failure occurs.

AGENTS.md reference: AGENTS.md:L113-L115
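
The behaviour requested here can be sketched with the standard library's stream-mode `tarfile` reader. This is a simplified illustration under stated assumptions, not the scanner's actual loop: `scan_compressed_tar` and its return shape are hypothetical, and the aggregate and decompression budgets mentioned in the comment are omitted. The point it shows is that an oversized member is recorded and skipped, and iteration continues to later headers.

```python
import io
import tarfile


def scan_compressed_tar(fileobj, max_entry_size):
    """Hypothetical loop: inspect small members, skip oversized ones, keep walking."""
    inspected, skipped = [], []
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            if member.size > max_entry_size:
                # Record the incomplete coverage, but do not stop: advancing the
                # iterator drains this member's body and reaches the next header.
                skipped.append(member.name)
                continue
            extracted = archive.extractfile(member)
            data = extracted.read() if extracted is not None else b""
            inspected.append((member.name, len(data)))
    return inspected, skipped


def build_tar_gz(entries):
    """Build an in-memory gzip-compressed TAR from (name, payload) pairs."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in entries:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer
```

With this shape, an oversized benign file placed first no longer hides a later `model_config.yaml`; the result carries both the skipped entry and the later member.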

Comment thread modelaudit/core.py
Comment on lines +7070 to +7072
if hdf5_compressed_prefix_ownership != "complete":
    supplemental_config["cache_enabled"] = False
    _mark_inconclusive_scan_outcome(result, _HDF5_COMPRESSED_PREFIX_OWNERSHIP_INCOMPLETE_REASON)

P2 Badge Honor scanner selection before ownership failures

When scanner selection excludes the supplemental user-block scanner (for example --scanners keras_h5 on an HDF5 file whose user block starts with a corrupt gzip payload), this branch still marks the primary Keras result incomplete and unsuccessful before _merge_supplemental_scanner_analysis() has a chance to skip compressed. That makes a scanner the user did not enable affect the exit status; gate these ownership-failure checks on scanner_selection.allows(hdf5_userblock_supplemental_scanner_id) or emit only the selection skip when the supplemental scanner is disabled.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

if _tar_member_materializes_file_content(member):

P2 Badge Bound compressed skips for non-file TAR bodies

When a compressed TAR contains a non-regular member (for example a directory, symlink, or special entry) with a large declared size, this branch skips the body-budget check below because it only runs for members that materialize file content. tarfile still has to consume that body before the next header in stream mode, so routing can decompress an arbitrary amount of data before handing off to TarScanner; apply the same skip budget to any member with a body, or return tar before continuing.

AGENTS.md reference: AGENTS.md:L111-L111

Comment thread modelaudit/core.py Outdated
if hdf5_tar_prefix_ownership in {"complete", "scan_limit"}:
    supplemental_config = dict(config)
    supplemental_config[TAR_SOURCE_SIZE_LIMIT_CONFIG_KEY] = hdf5_signature_offset
elif hdf5_tar_prefix_ownership != "embedded_member":

P1 Badge Preserve TAR analysis for embedded HDF5 members

When a valid raw TAR has a plausible HDF5 superblock inside a regular member, classify_raw_tar_prefix_ownership() returns embedded_member; this branch then skips both the bounded supplemental TAR scan and the ownership failure path. Since scan_file() has already selected keras_h5 whenever find_hdf5_signature_offset() succeeds, any later TAR members (for example a malicious pickle or NeMo config after the embedded HDF5 member) are never inspected even though the outer TAR structure is trusted and routable. Route this state to TAR ownership, or run an unbounded/appropriately bounded supplemental TAR scan instead of suppressing it.

AGENTS.md reference: AGENTS.md:L115-L115

Comment thread modelaudit/core.py Outdated
Comment on lines +6942 to +6943
if validated_hdf5_nemo_overlap:
    result.merge(_make_incomplete_nemo_routing_result(path))

P2 Badge Honor scanner selection for NeMo routing overlaps

When HDF5 owns the file but the user has selected only the HDF5 scanner, an inconclusive NeMo/TAR user-block probe still reaches this unconditional merge and adds nemo_routing_incomplete, making the primary Keras result unsuccessful even though the supplemental NeMo/TAR scanner was not enabled. Gate this fail-closed overlap result on the supplemental scanner being allowed, or emit only the scanner-selection skip as the other HDF5 user-block supplemental paths do.

ownership = classify_compressed_tar_prefix_ownership(path, hdf5_signature_offset)
if ownership == "inconclusive":
    return NEMO_ROUTING_INCONCLUSIVE_FORMAT
return "tar"

P1 Badge Preserve linked NeMo configs after compressed skip limits

When a compressed TAR has a large regular member before a root model_config.yaml symlink, this early handoff routes the archive to generic TAR before the link can be observed. Generic TarScanner only records conservative link-semantics incomplete for a root config symlink and does not follow the safe target, so adding a large first member can turn a reachable Hydra _target_ finding in the linked YAML into a non-actionable inconclusive TAR result; keep enough NeMo link evidence to route to NemoScanner or teach the TAR fallback to preserve the linked config finding.

AGENTS.md reference: AGENTS.md:L115-L115

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a2bec28f1

Comment thread modelaudit/core.py
Comment on lines 6624 to 6627
hdf5_userblock_supplemental_scanner_id = (
    _select_hdf5_userblock_supplemental_scanner_id(path, magic_format, ext, config)
    if scanner_id == "keras_h5" and hdf5_signature_offset not in (None, 0)
    else None

P2 Badge Run selected user-block scanners when Keras is skipped

When scanner selection disables keras_h5 but enables the discovered user-block owner, this supplemental scanner can be dropped before it runs. For example, --scanners compressed on an .h5 file whose user block starts with a gzip-compressed pickle sets hdf5_userblock_supplemental_scanner_id here, but the later active-selection fallback returns a keras_h5 skip result before the supplemental merge, so the allowed compressed payload is never inspected. Carry the allowed user-block scanner into that skip path and run it with the same bounded config.

AGENTS.md reference: AGENTS.md:L136-L136

Comment thread modelaudit/core.py Outdated
Comment on lines +7102 to +7104
if hdf5_tar_prefix_ownership in {"complete", "scan_limit"}:
    supplemental_config = dict(config)
    supplemental_config[TAR_SOURCE_SIZE_LIMIT_CONFIG_KEY] = hdf5_signature_offset

P2 Badge Bound primary TAR selection to the HDF5 user block

When --scanners tar/nemo makes the TAR-family scanner the primary result for an HDF5 file with a complete TAR user block, this supplemental-only source limit is never applied because the later merge is skipped once result.scanner_name already matches the user-block scanner. The primary TAR scan then reads into the HDF5 body and reports raw/compressed trailing data for a bounded benign prefix; apply the same HDF5 boundary before running TAR/NeMo as the selected primary scanner.

AGENTS.md reference: AGENTS.md:L136-L136
