fix(onnx): narrow bounded quantized-weight analysis#1670
fix(onnx): narrow bounded quantized-weight analysis#1670mldangelo-oai wants to merge 66 commits into
Conversation
|
@codex review |
Performance BenchmarksCompared
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e16ccb99ec
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e16ccb99ec
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Independent review: promptfoo/modelaudit PR #1670Reviewed at DispositionNO MERGE / REQUEST CHANGES at exact head The branch is current with Exact target and live state
FindingsP1 — Integer operators trust an unrelated name-matched scale, enabling a clean false negative and making normal spec-valid models incompleteLocation:
Reproduction with checker-valid
This is both a false positive/incomplete result for ordinary valid integer operators and a false-negative bypass when an attacker supplies a name-matched zero scale. The new test at Validation: VALIDATED, confidence 100, exact head Required fix: do not infer scales by name. For integer operators, analyze centered integer weights from the operator's actual zero-point input, or trace a scale through real downstream graph edges before using it. Add checker-valid tests for no scale, misleading unused scales, zero/NaN scales, and real ORT graph patterns. P1 — Post-dequantization transforms use a stale quantization axis and silently analyze the wrong valuesLocations: The lineage stores the That ordering is only correct if the quantization axis is remapped by every downstream shape transform. It is not. If the stale axis has a different dimension length, the scan becomes incomplete. If the old and new axes happen to have the same length, the scan silently analyzes incorrect values. Checker-valid adversarial probe:
This can create either false positives or false negatives in the weight-distribution detector. Validation: VALIDATED, confidence 100, exact head Required fix: materialize DQ at its actual lineage position before downstream transforms, or rigorously remap the quantization axis/parameter shape through every supported transform. Add non-square and equal-dimension adversarial tests so shape equality cannot mask wrong-axis behavior. P1 — Per-array bounds allow up to 100 retained float32 copies of one initializer (approximately 10 GiB at defaults)Locations: Each unique Checker-valid resource probe:
The existing group-fanout test at Validation: VALIDATED, confidence 100, exact head Required fix: enforce a cumulative retained-byte budget in P1 — Valid INT4/UINT4 and INT2/UINT2 QDQ weights are skipped clean instead of analyzed or marked incompleteLocations: The new quantized integer allowlist omits Independent ONNX 1.21.0 probes for all four dtypes passed
Validation: VALIDATED, confidence 100, exact head Required fix: support every DQ input type the installed ONNX schema and decoder can safely materialize, including packed INT4/UINT4 and INT2/UINT2, or fail closed with explicit eligible/analyzed coverage metadata. Add benign and malicious tests for each supported packed type. P2 — Aggregate clustering merges non-identical, semantically distinct models and labels them equivalentLocations: The cluster key does not include content hash, tensor digest, actual extreme positions, sign, or another proof of export equivalence. Content hashes are added only as provenance after clustering. Reproduction:
Provenance remains visible, but two distinct causes are collapsed and incorrectly described as equivalent. The added test at Validation: VALIDATED, confidence 98, exact head Required fix: cluster only byte-identical exports by content hash, or add a bounded tensor/value fingerprint that proves semantic equivalence. Keep distinct hashes separate unless equivalence is independently established. P2 — The pinned clean MiniLM export produces 31 new user-visible weight anomaly issues, but the integration test checks coverage onlyLocation: Real pinned artifact:
Head result:
The PR body says sampled flagged tensors match the clean ONNX/safetensors reference and that quantized deltas are consistent with normal quantization, but the added integration test asserts only the 72/72 coverage counts and never asserts the expected anomaly/issue outcome. The feature therefore exposes a large user-visible false-positive/noise surface on its own pinned clean model. Validation: VALIDATED as an operational false-positive/test gap, confidence 90, exact head Required fix: define and assert the expected clean-model issue outcome. If these INFO issues are intentionally retained, document that contract and bound/report them distinctly from suspicious corruption; otherwise calibrate quantized analysis so the clean pinned export is not reported as anomalous. P2 — Required
|
| Thread | Location | Classification | Review result |
|---|---|---|---|
PRRT_kwDOOEde4c6Is2tA |
modelaudit/scanners/onnx_scanner.py:180 |
Accurate, non-blocking cleanup | _QUANTIZED_WEIGHT_TRANSFORM_OPERATORS is unused. This is dead code but not a behavioral finding. |
PRRT_kwDOOEde4c6Is88S |
modelaudit/scanners/onnx_scanner.py:1156 |
Actionable P1 | Confirmed for INT4/UINT4; independent probes also confirm INT2/UINT2. |
PRRT_kwDOOEde4c6Is88U |
modelaudit/scanners/onnx_scanner.py:1355 |
Actionable P1 | Confirmed: an unrelated W_scale=0 suppresses the anomaly and returns success. |
PRRT_kwDOOEde4c6ItD1P |
modelaudit/scanners/onnx_scanner.py:315 |
Actionable P2 policy issue | Confirmed missing [Unreleased] entry. |
No thread is resolved or outdated. There are no approvals. Existing reviews are COMMENTED only.
CI state at exact head
All 29 check runs are terminal: 23 successful and 6 skipped. No check is pending, cancelled, or failed. GitHub reports only CI Success as required, and it passed.
Successful checks:
- Analyze (python)
- Analyze Python
- Analyze GitHub Actions
- Benchmarks (Python 3.11)
- Build and Package
- Build and Test Lightweight Docker Image
- Build and Test TensorFlow Docker Image
- Check Documentation Formatting
- CodeQL
- CI Success
- Dependency Audit
- Detect Changes
- Detect Docker Changes
- Docker CI Success
- Lint and Format
- Quick Feedback (Python 3.12)
- Test Python 3.10
- Test Python 3.13
- Test Vendored TensorFlow Protos
- Type Check
- Validate PR Title
- Verify Vendored Proto Reproducibility
- Windows Tests (Python 3.11)
Skipped checks:
- Build and Test Full Docker Image
- Extras Smoke Test (
matrix.extra) - License Compliance
- Lock File Consistency
- Standalone Pickle Package (
matrix.python-version) - Test NumPy (
matrix.numpy-mode/matrix.python-version)
CI is not sufficient evidence for merge because the reproduced cases are absent from the suite.
Validation evidence
Repository tests and static checks
- Focused new quantized tests: 11 passed, 243 deselected.
- Full ONNX scanner file excluding integration: 253 passed, 1 deselected, 1 warning.
- Full ONNX scanner file without marker exclusion reached 251 passes, then the pinned HF test failed before download because the environment's configured SOCKS proxy lacked
socksio. The exact artifact was subsequently fetched with the approved system transfer path and scanned directly. - Ruff check on all four changed files: passed.
- Ruff format check on all four changed files: passed.
- Mypy on all four changed files: passed.
git diff --check: passed.- Disposable checkout remained clean after validation.
Adversarial, malformed, and resource-bound probes
- Integer-op disconnected scale spoof: reproduced clean false negative.
- Valid integer op without invented scale: reproduced incomplete false positive.
- INT4/UINT4/INT2/UINT2 QDQ: all checker-valid, all skipped clean.
- Post-DQ transpose with equal-sized moved axes: reproduced silently wrong values.
- 100-group QLinearMatMul fan-out: reproduced 100x retained-memory amplification without a coverage gap.
- Non-identical aggregate models: reproduced incorrect equivalence clustering.
- Malformed QDQ scale shape: failed closed as expected.
- External-data QDQ model:
eligible=1,analyzed=0, coverage check present,success=false; failed closed as expected. - Edge-visit limit: failed closed as expected.
- Initializer group 101 limit: failed closed as expected, but does not bound retained bytes for the first 100.
- Self-overwrite malformed lineage: traversal remained bounded in the existing test.
Real pinned Hugging Face QA
The exact MiniLM revision/path named above was downloaded by immutable URL, content-hash verified, and scanned at the reviewed head. Coverage matched the PR claim (72/72/72 with lineages_per_value_limit: 109), while the user-visible anomaly issue count and memory footprint exposed the additional P2 risk documented above.
Full workflow lens summary
- Gatekeeper:
REVIEW; open, non-trivial runtime/security scanner change with explicit verification request. - Policy paths: root
AGENTS.mdapplies to all four changed files; no narrower applicable policy file. - PR summary/history: one commit directly atop current main; no linked issue in the PR body; intent is to replace blanket quantized exclusions with bounded semantic analysis and cluster repeated export findings.
- Diff-only and contextual review: every hunk in
modelaudit/core.py,modelaudit/models.py,modelaudit/scanners/onnx_scanner.py, andtests/scanners/test_onnx_scanner.pyinspected with surrounding call sites and tests. - Code-comment verification: no separate changed-comment contradiction survived validation.
- Policy review x2: missing changelog entry validated; no other unambiguous policy violation.
- Behavior-impact sidecar: valid integer models become incomplete; low-bit models remain cleanly skipped; pinned clean model gains 31 INFO anomaly issues.
- Runtime sidecar: executed focused, malformed, external-data, alias/transform, resource, aggregate, and real pinned HF probes.
- Branch/CI sidecar: current with main; all checks terminal; blocked by review, not branch freshness or CI.
- Repo-surface sidecar: changelog drift validated; no cross-repo contract applies.
Merge recommendation
Do not merge e16ccb99ecee6f7237bf225bd502efbfe451f526.
At minimum, the next head must:
- Remove name-based integer scale inference and use graph-proven lineage.
- Correct quantization-axis semantics through downstream transforms.
- Enforce a cumulative retained-memory bound for quantized copies.
- Support or explicitly fail closed all valid low-bit DQ types, including 2-bit.
- Prevent clustering across non-identical models without equivalence proof.
- Define and test the expected clean pinned-model issue outcome.
- Add the required changelog entry and resolve/reconcile all current threads.
- Re-run focused adversarial QA, the real pinned artifact, and exact-head CI before reconsidering merge.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 10fb037ca9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
|
Codex Review: Didn't find any major issues. You're on a roll. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Independent review of exact head
Validation: the quantized/aggregate ONNX slice had 23 passes; the non-integration ONNX scanner suite had 265 passes and one deselection. Both reproductions were confirmed through CLI/model QA. Exact-head CI is terminal green/skipped and all review threads are resolved, but these behavioral bugs block merge. |
|
Fleet sequencing note: keep the goal active and continue the full review, simplification, conflict resolution, validation, title/body cleanup, and exact-head CI work. Please hold only the final merge until Nightly repair PR #1679 reports MERGED. Immediately afterward, fetch the new |
|
Fleet release: Nightly repair PR #1679 is MERGED as |
Preserve authoritative quantization scale factors and execution precision, fail closed on ambiguous or malformed lineage, and keep anomaly clustering tied to byte-identical exports. Add regression coverage for operand orientation, graph bounds, parameter layouts, aggregation identity, and metadata consistency.
|
Addressed the current-head review blockers in ede50b3 (on top of the additive main merge 91b8739):
Validation: ruff clean (422 files), full mypy clean (477 files), ONNX scanner 341 passed, weight-distribution scanner 130 passed/1 TensorFlow skip, and the pinned MiniLM qint8 regression passed at 72/72/72. Broad-suite stops reproduced unchanged on current main in unrelated static-getattr tests; exact-head CI is running and remains the merge gate. |
|
Final coordination gates for
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ede50b34b3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Keep QuantizeLinear-derived weights eligible for coverage failures and account for copies created by post-dequantization transforms.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 440933e0e4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Fleet sequencing update: post-#1679 After the CI repair lands, #1673 remains the first test-only serial gate. Before this PR's eventual merge, fetch and additively merge the newest |
Trace integer scale paths across bounded bias adds and preserve fail-closed lineage for standard DynamicQuantizeLinear weight outputs.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 15f3b04d70
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Coordinator review gate: four fresh unresolved quantized-ONNX threads remain, including two P1 fail-open lineage cases. Fix or rebut every thread with regressions, reply/resolve, then rerun exact-head gates. Final merge remains held behind #1680. |
|
Corrected the exact-head shared-sidecar expectation at |
|
Codex Review: Didn't find any major issues. Another round soon, please! Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Exact-head CI at |
|
Addressed the three exact-head Codex findings additively at |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f7bdc1866c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Exact-head ONNX CI validated the clustered check count but showed anomaly stats intentionally exclude INFO findings; removed only that over-specific new assertion at |
|
Addressed the next exact-head Codex findings additively at |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 89684f97b3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Exact-head ONNX CI showed the new directory budget fixture used a separately scheduled |
|
Addressed the exact-head consolidated-finding Codex followup additively at |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c967806101
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Exact-head ONNX CI showed my new max-total-size fixture scheduled its sidecar independently rather than reaching the guarded skipped-sidecar branch; dropped only that invalid new fixture at |
|
Addressed the next exact-head Codex findings additively at |
|
Codex Review: Didn't find any major issues. 🎉 Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Exact-head ONNX CI exposed the existing refetched-sidecar expectation and the Codex-requested skipped directory sidecar accounting; fixed additively at |
|
Codex Review: Didn't find any major issues. Hooray! Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Ready for human approval: exact head |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b61ceca490
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if is_dvc_pointer and len(target_files) == 1: | ||
| single_dvc_directory_content_hash = nested_result.content_hash | ||
| elif nested_result.content_hash and nested_result.content_hash not in file_hashes: | ||
| file_hashes.append(nested_result.content_hash) |
There was a problem hiding this comment.
Preserve flat DVC directory hashes
When a .dvc file resolves to multiple directory outputs, this branch appends each nested directory's already-aggregated content_hash and the parent later aggregates those aggregate hashes. compute_aggregate_hash is not associative, so the same two output directories produce a different top-level hash when scanned via one DVC pointer than when scanned directly or flattened as before, which breaks content-based deduplication for multi-output DVC models. The single-directory shortcut above preserves direct-scan parity only for len(target_files) == 1; the multi-directory path should still flatten the nested file/package hashes instead of hashing nested aggregates.
Useful? React with 👍 / 👎.
| onnx_external_data_bytes_scanned += external_data_size | ||
| top_level_hashed_bytes += external_data_size |
There was a problem hiding this comment.
Avoid double-counting self-referential ONNX sidecars
When an ONNX external_data location resolves back to the primary .onnx file itself, the primary file has already been added to top_level_hashed_bytes before this loop, but these lines add the same file's size again as if it were a separate sidecar. That makes direct scans of self-referential ONNX models report doubled scanned bytes and can incorrectly trip max_total_size for a single physical file whose real size is within the limit; skip the external-data accounting when its identity matches the primary model identity.
Useful? React with 👍 / 👎.
Summary
Preserved security cases
Narrowing
+11,806/-4,208across 9 files at3f8b8796to+3,682/-458across 14 files at27493341.tests/scanners/test_onnx_quantized_analysis.py.mainis included additively via merge commit625b85c9; no history rewrite ormainwrite.uv.lockso CI can stay fail-closed.Validation
ruff check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/: passed.ruff format --check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/: passed.mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/: passed (478 source files).pytest tests/test_streaming_scan.py::test_scan_model_streaming_fails_closed_after_max_total_size -q: passed after preserving the streamed hash hook.pytest tests/scanners/test_onnx_quantized_analysis.py -q: skipped locally because this disk-conscious checkout does not have optionalonnxinstalled; exact-head CI must exercise ONNX extras.27493341adds focused regressions for duplicate sidecar hashes, clustered failed checks, and max-size-skipped direct sidecar accounting.Merge gate
Do not merge until exact-head CI is green, an exact-head Codex review is clean, unresolved actionable threads remain zero, and a human approval is present.