
feat: auto-load HF ONNX artifacts on CPU #402

Open
aidamian wants to merge 8 commits into develop from onnx-hf-serving

Conversation

@aidamian
Contributor

@aidamian commented May 9, 2026

No description provided.

@aidamian requested a review from cristibleotiu May 9, 2026 06:22

@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7cba8f217


Comment threads on extensions/serving/default_inference/nlp/th_hf_model_base.py (one outdated, one current).
cristibleotiu and others added 7 commits May 11, 2026 12:57

What changed:
- Make auto ONNX startup opportunistic and fall back to Transformers/PT on ONNX init or warmup failure.
- Keep explicit ONNX runtimes fail-fast while explicit PT skips manifest lookup.
- Gate decoder and tokenizer remote code on global and runtime trust flags.
- Confine manifest-declared artifact paths to the downloaded HF snapshot and filter broad/framework-weight allow patterns.
- Forward runtime metadata consistently for privacy-filter responses and add focused regression coverage.

Why:
- Preserve seamless CPU ONNX when available without breaking Transformers fallback or weakening remote-code/path safety.
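
As a rough illustration of the fallback policy described above, here is a minimal sketch; the helper names (`init_onnx`, `init_pt`, `warmup`) and the config shape are hypothetical, not the actual APIs in th_hf_model_base.py:

```python
import logging

logger = logging.getLogger(__name__)

def load_backend(config, init_onnx, init_pt, warmup):
    """Opportunistic ONNX startup with Transformers/PT fallback (illustrative)."""
    runtime = config.get("runtime", "auto")
    if runtime == "pt":
        return init_pt(config)   # explicit PT: skip the manifest lookup entirely
    try:
        session = init_onnx(config)
        warmup(session)          # a warmup failure also triggers the fallback
        return session
    except Exception as exc:
        if runtime == "onnx":
            raise                # explicit ONNX runtimes stay fail-fast
        logger.warning("auto ONNX startup failed (%s); falling back to PT", exc)
        return init_pt(config)
```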
What changed:
- Require selected ONNX runtime config trust_remote_code=True before executing artifact decoder or tokenizer remote code.
- Add regression coverage proving a top-level manifest trust flag cannot enable runtime code execution by itself.

Why:
- Avoid remote-code trust bypasses from broad manifest metadata; the selected runtime must explicitly opt in.
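
A sketch of that gate, under an assumed manifest shape (the real schema may differ):

```python
def remote_code_allowed(manifest: dict, runtime_name: str) -> bool:
    """Remote decoder/tokenizer code runs only when the *selected runtime's*
    config opts in; a top-level manifest flag alone is never sufficient."""
    runtime_cfg = manifest.get("runtimes", {}).get(runtime_name, {})
    return runtime_cfg.get("trust_remote_code") is True

# The regression case described above: top-level trust must not leak through.
manifest = {
    "trust_remote_code": True,                     # broad metadata: ignored
    "runtimes": {"onnx-cpu": {"trust_remote_code": False}},
}
assert remote_code_allowed(manifest, "onnx-cpu") is False
```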
What changed:
- Added subclass ONNX fallback hooks in the HF serving base.
- Added local privacy-filter ONNX discovery and BIOES/Viterbi span decoding.
- Covered fallback runtime selection and privacy-filter decoder behavior with tests.

Why:
- Allow openai/privacy-filter ONNX artifacts to run without a remote artifact manifest or remote Python decoder code.
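
For the span decoding, something along these lines; this is a greedy BIOES assembly without the Viterbi transition scoring the commit mentions, and the tag format is an assumption:

```python
def bioes_to_spans(tags):
    """Collect (label, start, end) spans from BIOES tags such as
    'B-PER', 'I-PER', 'E-PER', 'S-LOC', 'O' (greedy, no Viterbi)."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "S":                      # single-token span
            spans.append((entity, i, i))
            start, label = None, None
        elif prefix == "B":                    # open a span
            start, label = i, entity
        elif prefix == "E" and label == entity and start is not None:
            spans.append((entity, start, i))   # close the open span
            start, label = None, None
        elif not (prefix == "I" and label == entity):
            start, label = None, None          # 'O' or malformed: reset
    return spans

assert bioes_to_spans(["B-PER", "E-PER", "O", "S-LOC"]) == [("PER", 0, 1), ("LOC", 3, 3)]
```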
What changed:
- Keep HF artifact path traversal checks lexical so valid snapshot symlinks into the cache blob store are accepted.
- Merge exact manifest files with recommended ONNX allow patterns after filtering broad or framework-weight downloads.
- Add regression coverage for both behaviors.

Why:
- Live PR image validation showed Sentinel and privacy-filter ONNX startup falling back because valid HF snapshot files were rejected as escaping the snapshot.
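
The lexical check could look roughly like this (the function name is illustrative); the key point is that `os.path.normpath` collapses `..` segments without touching the filesystem, so snapshot symlinks into the blob store are not resolved and therefore still pass:

```python
import os

def is_within_snapshot(snapshot_dir: str, relative_path: str) -> bool:
    """Lexical traversal check: normalize '..' segments without resolving
    symlinks, so HF snapshot symlinks into the cache blob store survive."""
    base = os.path.normpath(snapshot_dir)
    candidate = os.path.normpath(os.path.join(base, relative_path))
    return candidate == base or candidate.startswith(base + os.sep)

assert is_within_snapshot("/cache/snapshots/abc", "onnx/model.onnx")
assert not is_within_snapshot("/cache/snapshots/abc", "../../blobs/deadbeef")
```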
What changed:
- Temporarily allow ONNX artifact decoders without runtime-level trust_remote_code to inherit global TRUST_REMOTE_CODE=True.
- Keep explicit runtime trust_remote_code=False as a hard block.
- Add a TODO documenting the security concern and declarative decoder replacement path.

Why:
- The current Sentinel ONNX artifact predates runtime-level trust metadata and uses a reviewed contract decoder, so it needs a compatibility path until the artifact moves to declarative decoding.

What changed:
- Split ONNX remote-code trust between tokenizer/model loading and decoder execution.
- Keep tokenizer/model loading tied to runtime-level trust_remote_code.
- Temporarily allow Python decoder execution when global TRUST_REMOTE_CODE=True, even for legacy runtimes that mark ONNX trust_remote_code=False.

Why:
- Current Sentinel ONNX artifacts use trust_remote_code=False for tokenizer/model loading but still declare a Python contract decoder. This keeps the temporary compatibility path narrow until declarative decoding replaces it.
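
Taken together, the two trust commits above amount to a policy along these lines (the flag names and the three-valued runtime flag are assumptions, not the PR's exact signatures):

```python
def resolve_trust(runtime_flag, global_trust):
    """Split trust policy sketch: runtime_flag is True, False, or None
    (absent); global_trust mirrors the global TRUST_REMOTE_CODE setting."""
    # Tokenizer/model loading stays tied to the runtime-level flag.
    load_trust = runtime_flag is True
    # Decoder execution: temporary compatibility path letting legacy
    # artifacts inherit the global opt-in (the TODO in the PR flags this).
    decoder_trust = runtime_flag is True or global_trust
    return load_trust, decoder_trust

# Legacy Sentinel-style runtime: loading stays untrusted, but the reviewed
# contract decoder may still run under the global opt-in.
assert resolve_trust(False, True) == (False, True)
```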
What changed:
- Prepare HF ONNX artifacts in an edge-node-owned materialized cache before creating ONNX Runtime sessions.
- Hardlink resolved HF cache blobs when possible and copy as fallback.
- Preserve runtime relative layout for .onnx and external data sidecars.
- Add regression coverage for symlinked external data files.

Why:
- ONNX Runtime rejects HF snapshot symlinks for external data because resolved sidecars can escape the model directory.
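
A sketch of the hardlink-or-copy materialization, under assumed helper names; the actual cache layout logic in the PR may differ:

```python
import os
import shutil

def materialize_file(src: str, dst: str) -> None:
    """Place one resolved HF cache blob into the edge-node-owned cache:
    hardlink when possible, copy when the link crosses filesystems."""
    real_src = os.path.realpath(src)             # resolve the snapshot symlink
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    try:
        os.link(real_src, dst)                   # same blob, no extra space
    except OSError:
        shutil.copy2(real_src, dst)              # cross-device fallback

def materialize_model(snapshot_dir: str, cache_dir: str, rel_paths) -> None:
    """Keep the relative layout so ONNX Runtime finds the .onnx file's
    external-data sidecars next to it, with no symlinks in between."""
    for rel in rel_paths:
        materialize_file(os.path.join(snapshot_dir, rel),
                         os.path.join(cache_dir, rel))
```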