fix: harden streamed TAR and NeMo inspection#1665
Conversation
|
@codex review |
Performance BenchmarksCompared
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a2cd64e965
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex address these exact-head independent-review blockers before merge:
Please carry one shared bounded work/decompression budget through routing, preflight, extraction, and nested dispatch; bound raw extension headers before |
|
@codex review the current head after the TAR link inventory review fix. Please verify unsafe symlink/hardlink members are recorded as rejected while safe links remain link_validated, and that sparse compatibility handling stays unchanged. |
|
Codex Review: Didn't find any major issues. Swish! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
PR #1665 independent re-reviewFindings1. High — compressed-TAR limits still run after an unbounded routing traversalThe current head still routes TAR content before Exact-head reproduction at
The pushed delta from Required fix: make TAR routing bounded/header-only or transfer TAR ownership before 2. High — PAX/GNU extension metadata is still materialized before any size checkBoth TAR passes call Exact-head reproduction:
The sparse regression at Required fix: inspect and cumulatively bound raw extension-header bodies before delegating them to 3. High —
|
| Probe | Configured bound | Observed work/result |
|---|---|---|
| Core-routed gzip TAR | 65,536 decompressed bytes | 67,129,344 bytes returned by gzip before failure |
| 8 MiB PAX pathname | 65,536 decompressed bytes | 67,177,758-byte peak allocation before failure |
| Ten 32 KiB members | 65,536 total bytes | 327,680 bytes scanned before aggregate failure |
| Two nested 40 KiB members | 65,536 total bytes | 81,920 bytes scanned before aggregate failure |
| Link status matrix | safe/rejected semantics | all six expected statuses and S406 failures matched |
No additional independent correctness or policy finding survived validation beyond the three unchanged resource-bound blockers above.
Live CI and review state
At the final snapshot:
- GitHub reported
mergeable: MERGEABLE,mergeStateStatus: BLOCKED, andreviewDecision: REVIEW_REQUIRED. - No completed check was failing.
- CodeQL, lint/format, type check, dependency audit, package build, vendored-proto checks, Docker checks, documentation formatting, title validation, and benchmarks had passed.
- Quick Feedback Python 3.12, Windows Python 3.11, Python 3.10, and Python 3.13 were still in progress.
- Review threads: one resolved current-head code-quality thread; one unresolved but outdated link-inventory thread whose behavior is now fixed.
- PR CI still has no end-to-end assertion for pre-routing decompression work, metadata allocation, or cumulative nested TAR budgets.
Merge disposition
REQUEST CHANGES — NOT MERGE READY.
The rejected-link blocker is fixed at the current exact head. The three high-severity resource-bound blockers are unchanged and independently reproduced: compressed TAR routing performs substantial work before the advertised cap, PAX/GNU metadata allocates before limits are checked, and max_total_size does not stop flat or nested archive work. Green focused tests and currently green completed CI jobs do not exercise those boundaries.
Before merge, implement shared end-to-end routing/decompression/metadata/total-work budgets, add instrumented core-level regressions, rerun the focused suite and adversarial probes, and complete exact-head CI.
Summary
Testing
Committed changes on the current branch with commit |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a0311885d1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e0fb2da481
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ad0802b00
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0a9465d57c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6f2ad9c6b2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4300acb118
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2e6aafc229
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
|
Codex Review: Didn't find any major issues. Swish! Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@codex review |
|
Codex Review: Didn't find any major issues. More of your lovely PRs please. Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
…t38-large-tar-streaming-20260610 # Conflicts: # modelaudit/core.py # modelaudit/utils/file/detection.py
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 36563ef2b7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not allow_incomplete_generic_tar_route and find_hdf5_signature_offset(path) is not None: | ||
| return NEMO_ROUTING_INCONCLUSIVE_FORMAT |
There was a problem hiding this comment.
Route proven compressed TARs despite HDF5 bytes
When a renamed gzip/bzip2/xz TAR has a first regular member larger than the 64 KiB NeMo routing skip budget, this branch returns nemo_routing_inconclusive solely because find_hdf5_signature_offset() finds a legal-offset HDF5 superblock anywhere in the physical wrapper. A gzip FEXTRA field can place a plausible HDF5 superblock at offset 512 while the stream is still a valid TAR; with a generic .gz/.bin suffix, core then treats the file as an HDF5/NeMo overlap and never invokes TarScanner, so later malicious TAR members that the streaming scanner would inspect are reduced to an operational routing failure. Distinguish top-level HDF5/userblock ownership from HDF5 bytes inside the compressed wrapper, or keep the already-proven compressed TAR routed to tar for generic suffixes.
AGENTS.md reference: AGENTS.md:L113-L116
Useful? React with 👍 / 👎.
| if entry_count > scanner.max_entries: | ||
| return "inconclusive" |
There was a problem hiding this comment.
Preserve raw TAR user-block scans at entry limits
When an HDF5 user block contains a complete raw TAR prefix, this ownership proof uses the caller's max_tar_entries as a structural cutoff and returns inconclusive. If a root model_config.yaml appears before a later entry that exceeds that configured scan limit, core treats ownership as incomplete and never runs the bounded NeMo/TAR supplemental scan that would have reported the reachable Hydra finding plus the entry-limit failure. Return a scan-limit state here, as the compressed ownership path does, so detections before the limit are preserved.
AGENTS.md reference: AGENTS.md:L113-L116
Useful? React with 👍 / 👎.
| while True: | ||
| chunk = source.read(chunk_size) | ||
| if not chunk: | ||
| break | ||
| if any(chunk): | ||
| raise _CorruptStreamError("Invalid compressed stream: nonzero bytes follow zero padding") | ||
| padding_bytes += len(chunk) |
There was a problem hiding this comment.
Bound accepted compressed user-block padding
For HDF5 user-block supplemental compressed scans, core enables zero-padding acceptance and bounds the source at the HDF5 signature offset, so a tiny compressed payload followed by a large zero-filled user block makes this loop read every padding byte before returning. With large-file limits raised or disabled for real artifacts, a sparse-looking but allocated user block can spend hundreds of MiB or more in padding validation even though those bytes are non-payload; cap accepted padding with a configured limit or fail closed once it is exceeded.
AGENTS.md reference: AGENTS.md:L105-L111
Useful? React with 👍 / 👎.
|
|
||
| assert hdf5_signature_offset is not None | ||
| supplemental_config = dict(config) | ||
| supplemental_config[COMPRESSED_SOURCE_SIZE_LIMIT_CONFIG_KEY] = hdf5_signature_offset |
There was a problem hiding this comment.
Prove compressed ownership before truncating
When the supplemental route is compressed, this unconditionally caps CompressedScanner at the HDF5 signature offset. For a gzip wrapper whose FEXTRA places a plausible HDF5 superblock at offset 512 while the compressed payload continues afterward, the primary HDF5 route owns the file and the supplemental compressed scan only sees the truncated prefix, reports a decode failure, and never scans the payload that an unbounded compressed scanner would have reached. Prove the compressed stream actually ends at the HDF5 boundary before applying this limit, or fail closed without suppressing the compressed-payload scan.
AGENTS.md reference: AGENTS.md:L113-L116
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f2b0c21bd3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| while True: | ||
| member = archive.next() | ||
| if member is None: | ||
| return "complete" if scanner._raw_tar_has_complete_end_marker(archive) else "incomplete" |
There was a problem hiding this comment.
Preserve raw TAR user-block scans at tail padding limits
When an HDF5 user block starts with a valid raw TAR containing model_config.yaml and then has zero padding beyond compressed_max_xz_padding_bytes before the HDF5 superblock, this returns incomplete rather than a scan-limit state. Core treats that as unowned and skips the supplemental NeMo/TAR scan, so a reachable Hydra _target_ before the padding is never reported; classify bounded post-EOF padding exhaustion as a scan-limit/fail-closed condition and merge the prefix scan like entry-limit handling.
AGENTS.md reference: AGENTS.md:L113-L115
Useful? React with 👍 / 👎.
| if hdf5_compressed_prefix_ownership == "complete": | ||
| supplemental_config[COMPRESSED_SOURCE_SIZE_LIMIT_CONFIG_KEY] = hdf5_signature_offset | ||
| supplemental_config[ALLOW_ZERO_PADDING_TRAILING_CONFIG_KEY] = True | ||
| supplemental_config[COMPRESSED_PREFIX_OWNERSHIP_CONFIG_KEY] = True | ||
| else: | ||
| supplemental_config["cache_enabled"] = False |
There was a problem hiding this comment.
Preserve compressed user-block payloads at scan limits
When an HDF5 user block contains a compressed payload before excessive zero padding or another compressed scan limit, classify_compressed_prefix_ownership() returns scan_limit, but this branch only applies the HDF5 boundary for complete. The supplemental CompressedScanner then scans the whole .h5 without zero-padding acceptance, fails on the HDF5 body/trailer before dispatching the decompressed payload, and can miss a malicious pickle that was already reachable before the padding; treat scan_limit like the TAR path and run the prefix-bounded compressed scan while preserving the incomplete result.
AGENTS.md reference: AGENTS.md:L113-L115
Useful? React with 👍 / 👎.
| except (EOFError, OSError, tarfile.TarError, ValueError): | ||
| return "inconclusive" |
There was a problem hiding this comment.
Preserve raw TAR user-block scans at metadata limits
When an HDF5 user block contains a raw TAR prefix with an early model_config.yaml and a later PAX/GNU metadata record that exceeds max_tar_metadata_bytes, the bounded TarInfo path raises _TarStreamBudgetExceeded, but this catch collapses it to inconclusive. Core then marks TAR ownership incomplete and skips the supplemental NeMo/TAR scan entirely, so the reachable Hydra _target_ finding before the metadata limit is lost; return a scan-limit state for metadata-budget exhaustion so the bounded prefix scan still runs and reports the incomplete coverage.
AGENTS.md reference: AGENTS.md:L113-L115
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fd9361b06f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if hdf5_compressed_prefix_ownership in {"complete", "scan_limit"}: | ||
| supplemental_config[COMPRESSED_SOURCE_SIZE_LIMIT_CONFIG_KEY] = hdf5_signature_offset | ||
| supplemental_config[ALLOW_ZERO_PADDING_TRAILING_CONFIG_KEY] = True | ||
| supplemental_config[COMPRESSED_PREFIX_OWNERSHIP_CONFIG_KEY] = True | ||
| if hdf5_compressed_prefix_ownership == "scan_limit": | ||
| supplemental_config[PRESERVE_LIMITED_PREFIX_PAYLOAD_CONFIG_KEY] = True |
There was a problem hiding this comment.
Preserve compressed user-block findings after corrupt tails
When a validated HDF5 user block begins with a complete compressed payload, such as a malicious pickle, and then has non-zero user-block bytes before the HDF5 superblock, ownership becomes incomplete; because this block only enables the bounded/preserve path for complete or scan_limit, the supplemental CompressedScanner sees the trailer as a decode error and deletes the temp payload before routing it, so the reachable S201 finding is missed. Preserve the decoded prefix and report the ownership failure afterward.
AGENTS.md reference: AGENTS.md:L113-L115
Useful? React with 👍 / 👎.
| elif hdf5_tar_prefix_ownership != "embedded_member": | ||
| supplemental_ownership_inconclusive = True |
There was a problem hiding this comment.
Preserve raw TAR user-block findings after tail failures
When a validated HDF5 user block starts with a raw TAR/NeMo archive that already contains a malicious model_config.yaml or pickle, adding non-zero bytes after the TAR EOF but before the HDF5 superblock makes classify_raw_tar_prefix_ownership() return incomplete; this branch then suppresses the supplemental NeMo/TAR scan, so reachable findings are replaced by only hdf5_tar_prefix_ownership_incomplete. Run the prefix-bounded scan and merge the ownership failure afterward.
AGENTS.md reference: AGENTS.md:L113-L115
Useful? React with 👍 / 👎.
| ownership = classify_compressed_tar_prefix_ownership(path, hdf5_signature_offset) | ||
| if ownership != "incomplete": | ||
| return NEMO_ROUTING_INCONCLUSIVE_FORMAT |
There was a problem hiding this comment.
Keep proven HDF5 user-block TARs routed to TAR
When a valid HDF5 user block contains a complete compressed TAR whose first regular member exceeds the 64 KiB NeMo routing skip budget, this returns nemo_routing_inconclusive even though ownership was proved within the HDF5 boundary. In core that inconclusive format is not selected as a supplemental tar/nemo scanner, so later reachable members such as a malicious model_config.yaml or pickle are skipped and only the routing-incomplete result is reported; return tar (or otherwise carry the bounded TAR route) so the streaming scanner preserves those findings.
AGENTS.md reference: AGENTS.md:L113-L115
Useful? React with 👍 / 👎.
| source_size = self._source_size(path) | ||
| size_check_result = self._check_source_size_limit(path, source_size) | ||
| if size_check_result: | ||
| return size_check_result |
There was a problem hiding this comment.
Preserve bounded compressed prefixes despite source caps
Fresh evidence in the current tree is that scan_limit now enables the preserve flag, but this early size check still returns before _decompress_to_tempfiles() can preserve anything. When an HDF5 user block is large (for example >512 MiB of zero padding before the HDF5 superblock) but contains a small malicious gzip/bzip2/xz/zlib payload at the start, ownership is classified as scan_limit, the supplemental compressed scan hits this source-size result, and the reachable payload findings are never merged; skip this source-size short-circuit for bounded prefix ownership or run the preserved-prefix path before reporting the limit.
AGENTS.md reference: AGENTS.md:L113-L115
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d129e2657
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| }, | ||
| ) | ||
| if compression_codec is not None: | ||
| break |
There was a problem hiding this comment.
Continue scanning after oversized compressed TAR members
When a gzip/bzip2/xz TAR contains an entry that trips _TarEntryExtractionIncomplete only because it is over max_entry_size, this break stops before later headers even though the aggregate and compressed-stream budget checks above accepted the member. A crafted archive can put an oversized benign file before a malicious pickle or NeMo config and get only an incomplete entry check, so continue draining/skipping the member and keep scanning until a real stream or budget failure occurs.
AGENTS.md reference: AGENTS.md:L113-L115
Useful? React with 👍 / 👎.
| if hdf5_compressed_prefix_ownership != "complete": | ||
| supplemental_config["cache_enabled"] = False | ||
| _mark_inconclusive_scan_outcome(result, _HDF5_COMPRESSED_PREFIX_OWNERSHIP_INCOMPLETE_REASON) |
There was a problem hiding this comment.
Honor scanner selection before ownership failures
When scanner selection excludes the supplemental user-block scanner (for example --scanners keras_h5 on an HDF5 file whose user block starts with a corrupt gzip payload), this branch still marks the primary Keras result incomplete and unsuccessful before _merge_supplemental_scanner_analysis() has a chance to skip compressed. That makes a scanner the user did not enable affect the exit status; gate these ownership-failure checks on scanner_selection.allows(hdf5_userblock_supplemental_scanner_id) or emit only the selection skip when the supplemental scanner is disabled.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
modelaudit/modelaudit/utils/file/detection.py
Line 6853 in 9d858fb
When a compressed TAR contains a non-regular member (for example a directory, symlink, or special entry) with a large declared size, this branch skips the body-budget check below because it only runs for members that materialize file content. tarfile still has to consume that body before the next header in stream mode, so routing can decompress an arbitrary amount of data before handing off to TarScanner; apply the same skip budget to any member with a body, or return tar before continuing.
AGENTS.md reference: AGENTS.md:L111-L111
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if hdf5_tar_prefix_ownership in {"complete", "scan_limit"}: | ||
| supplemental_config = dict(config) | ||
| supplemental_config[TAR_SOURCE_SIZE_LIMIT_CONFIG_KEY] = hdf5_signature_offset | ||
| elif hdf5_tar_prefix_ownership != "embedded_member": |
There was a problem hiding this comment.
Preserve TAR analysis for embedded HDF5 members
When a valid raw TAR has a plausible HDF5 superblock inside a regular member, classify_raw_tar_prefix_ownership() returns embedded_member; this branch then skips both the bounded supplemental TAR scan and the ownership failure path. Since scan_file() has already selected keras_h5 whenever find_hdf5_signature_offset() succeeds, any later TAR members (for example a malicious pickle or NeMo config after the embedded HDF5 member) are never inspected even though the outer TAR structure is trusted and routable. Route this state to TAR ownership, or run an unbounded/appropriately bounded supplemental TAR scan instead of suppressing it.
AGENTS.md reference: AGENTS.md:L115-L115
Useful? React with 👍 / 👎.
| if validated_hdf5_nemo_overlap: | ||
| result.merge(_make_incomplete_nemo_routing_result(path)) |
There was a problem hiding this comment.
Honor scanner selection for NeMo routing overlaps
When HDF5 owns the file but the user has selected only the HDF5 scanner, an inconclusive NeMo/TAR user-block probe still reaches this unconditional merge and adds nemo_routing_incomplete, making the primary Keras result unsuccessful even though the supplemental NeMo/TAR scanner was not enabled. Gate this fail-closed overlap result on the supplemental scanner being allowed, or emit only the scanner-selection skip as the other HDF5 user-block supplemental paths do.
Useful? React with 👍 / 👎.
| ownership = classify_compressed_tar_prefix_ownership(path, hdf5_signature_offset) | ||
| if ownership == "inconclusive": | ||
| return NEMO_ROUTING_INCONCLUSIVE_FORMAT | ||
| return "tar" |
There was a problem hiding this comment.
Preserve linked NeMo configs after compressed skip limits
When a compressed TAR has a large regular member before a root model_config.yaml symlink, this early handoff routes the archive to generic TAR before the link can be observed. Generic TarScanner only records conservative link-semantics incomplete for a root config symlink and does not follow the safe target, so adding a large first member can turn a reachable Hydra _target_ finding in the linked YAML into a non-actionable inconclusive TAR result; keep enough NeMo link evidence to route to NemoScanner or teach the TAR fallback to preserve the linked config finding.
AGENTS.md reference: AGENTS.md:L115-L115
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9a2bec28f1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| hdf5_userblock_supplemental_scanner_id = ( | ||
| _select_hdf5_userblock_supplemental_scanner_id(path, magic_format, ext, config) | ||
| if scanner_id == "keras_h5" and hdf5_signature_offset not in (None, 0) | ||
| else None |
There was a problem hiding this comment.
Run selected user-block scanners when Keras is skipped
When scanner selection disables keras_h5 but enables the discovered user-block owner, this supplemental scanner can be dropped before it runs. For example, --scanners compressed on an .h5 file whose user block starts with a gzip-compressed pickle sets hdf5_userblock_supplemental_scanner_id here, but the later active-selection fallback returns a keras_h5 skip result before the supplemental merge, so the allowed compressed payload is never inspected. Carry the allowed user-block scanner into that skip path and run it with the same bounded config.
AGENTS.md reference: AGENTS.md:L136-L136
Useful? React with 👍 / 👎.
| if hdf5_tar_prefix_ownership in {"complete", "scan_limit"}: | ||
| supplemental_config = dict(config) | ||
| supplemental_config[TAR_SOURCE_SIZE_LIMIT_CONFIG_KEY] = hdf5_signature_offset |
There was a problem hiding this comment.
Bound primary TAR selection to the HDF5 user block
When --scanners tar/nemo makes the TAR-family scanner the primary result for an HDF5 file with a complete TAR user block, this supplemental-only source limit is never applied because the later merge is skipped once result.scanner_name already matches the user-block scanner. The primary TAR scan then reads into the HDF5 body and reports raw/compressed trailing data for a bounded benign prefix; apply the same HDF5 boundary before running TAR/NeMo as the selected primary scanner.
AGENTS.md reference: AGENTS.md:L136-L136
Useful? React with 👍 / 👎.
Summary
Harden TAR/NeMo archive inspection for large raw and compressed TAR-backed model artifacts without whole-archive materialization. Compressed TARs traverse through bounded
r|readers; raw TARs retain the seekable path. TAR findings, NeMo Hydra coverage, HDF5 owner/supplemental precedence, link safety, incomplete/no-cache outcomes, and finding exit precedence are preserved.Final exact-head repairs
.gz/.bz2/.xzTARs route by structure before wrapper fallback, while file-type validation accepts only a structurally proven TAR with the matching physical codec, avoiding wrong-codec S901 false positives.tmp_path; documented generic TAR handoff compatibility.Scope note
The replacement remains additive and does not rewrite PR history or main. It is not the requested ~1200 net production-line compact rewrite: current exact diff versus
origin/mainis production Python+2296/-464(net+1832). The current behavior is validated, but further simplification is still warranted before merge if the compact-shape requirement is mandatory.Validation
Exact local head:
cfd24355.uv run ruff check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/uv run ruff format --check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=1(before final tiny bare-wrapper validation repair: 19,676 passed, 1,401 skipped)Exact-head GitHub Actions and required human approval are pending; do not merge until both are green.