Sync with Microsoft ONNX Runtime - 30052026 by ai-fw-intg · Pull Request #1113 · intel/onnxruntime

ai-fw-intg · 2026-05-29T20:34:04Z

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

…ove shader key validation to nightly build (microsoft#28674)

### Description

Allow shader code to be dumped to the file specified in the `ORT_WEBGPU_EP_SHADER_DUMP_FILE` environment variable. Previously, shader code was only dumped through verbose logging. Create a new nightly CI pipeline to run the shader key validation test, which is removed from the CI pipeline in microsoft#28642.

### Motivation and Context

Provide more shader dump output options and move the shader key validation test to the nightly build.

…u_inc/cub.cuh" wrapper. (microsoft#28705)

### Description

Replace direct inclusion of `<cub/cub.cuh>` with the `"core/providers/cuda/cu_inc/cub.cuh"` wrapper. The wrapper accounts for a problematic macro definition which causes issues.

### Motivation and Context

Fix pipeline build error.

…mic quantization (microsoft#28228)

## Summary

- Fix `quantize_dynamic(per_channel=True)` so weights quantized per-channel produce a `DequantizeLinear` node with the correct `axis` attribute.
- Stop dropping the channel axis when `quantize_weight_per_channel` populates `QuantizedValue` (was hardcoded to `None`).
- Gate the scalar-scale assertion in `_dequantize_value` on `axis is None` so per-channel scales (1-D tensors) are accepted.

## Motivation

Fixes microsoft#19997. When a model is quantized with `quantize_dynamic(..., per_channel=True)` and a per-channel weight reaches `_dequantize_value` (e.g. via `_dequantize_outputs` when the weight is in the graph outputs), two bugs surface:

1. `quantize_weight_per_channel` stores `QuantizedValue.axis = None` even though it received a real `channel_axis`, so the per-channel information is lost.
2. `_dequantize_value` (a) asserts `scale_init.size == 1`, which fails for a 1-D per-channel scale, and (b) builds the `DequantizeLinear` node without an `axis` attribute, producing an invalid ONNX node when the model is consumed.

PR microsoft#22283 (Nov 2024) softened the assertion against `None`-typed scales but left the underlying axis-propagation bug in place.

## Changes

- `onnxruntime/python/tools/quantization/onnx_quantizer.py`
  - `quantize_weight_per_channel`: pass `channel_axis` (was `None`) into `QuantizedValue`.
  - `_dequantize_value`: only require a scalar scale on the per-tensor path (`axis is None`); forward `axis=quantized_value.axis` to `onnx.helper.make_node("DequantizeLinear", ...)`. `make_node` silently omits the attribute when `axis` is `None`, so the per-tensor path is unchanged.
- `onnxruntime/test/python/quantization/test_quant_issues.py`
  - New regression test `test_dynamic_quantize_per_channel_emits_axis_attribute` that builds a minimal MatMul model with the weight routed to a graph output (to force the `_dequantize_outputs` -> `_dequantize_value` path), runs `quantize_dynamic(per_channel=True)`, and asserts the emitted `DequantizeLinear` has the `axis` attribute and a 1-D multi-element scale initializer.

## Test Plan

- `python -m pytest onnxruntime/test/python/quantization/test_quant_issues.py -xvs`: new test passes; existing test skipped as before.
- `python -m pytest onnxruntime/test/python/quantization/test_op_matmul.py`: 7 passed, 8 skipped (no regression).
- `python -m pytest onnxruntime/test/python/quantization/test_qdq.py -k per_channel`: 1 passed.
- `lintrunner -a` on changed files: clean.
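To illustrate why the `axis` attribute matters, here is a minimal sketch of per-tensor versus per-channel dequantization in plain Python. It is not the quantizer's code: the `dequantize` helper and its list-of-rows representation are assumptions made for illustration. It only mirrors the gating described above, where a scalar scale is required when `axis is None` and a 1-D scale is accepted otherwise.

```python
def dequantize(q, scale, zero_point, axis=None):
    """Dequantize a 2-D integer matrix given as a list of rows.

    Hypothetical helper for illustration only.
    axis=None  -> per-tensor: scale and zero_point are scalars.
    axis=0 / 1 -> per-channel: scale and zero_point are 1-D lists
                  indexed by row (axis=0) or column (axis=1).
    """
    rows, cols = len(q), len(q[0])

    if axis is None:
        # Per-tensor path: a scalar scale is required here, and only here.
        assert not isinstance(scale, (list, tuple)), "per-tensor scale must be scalar"
        return [[(q[r][c] - zero_point) * scale for c in range(cols)]
                for r in range(rows)]

    # Per-channel path: one (scale, zero_point) pair per slice along `axis`.
    assert axis in (0, 1), "this sketch only handles 2-D inputs"
    assert len(scale) == (rows if axis == 0 else cols)
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            ch = r if axis == 0 else c  # channel index selected by axis
            row.append((q[r][c] - zero_point[ch]) * scale[ch])
        out.append(row)
    return out
```

Without an axis, a consumer has no way to know which dimension a 1-D scale applies to, which is why a node carrying per-channel scales but no `axis` attribute is a problem.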

…Def (microsoft#28608)

## Summary

`utils::MakeComputeCapability` is the shared helper used by `utils::CreateSupportedPartitions` to build an `IndexedSubGraph::MetaDef` from a group of supported nodes. When a supported group contains a control-flow op (`Loop`, `If`, `Scan`), `MakeComputeCapability` currently walks only `node->InputDefs()` and silently drops the outer-scope captures (`node->ImplicitInputDefs()`). The captures never enter `meta_def->inputs`, so after `Graph::FinalizeFuseSubGraph` the fused node's `InputDefs()` is missing them. As a result, the EP that owns the fused subgraph has no boundary value-info for the captured tensors and cannot bind them at Compute time.

This PR adds a second loop in `MakeComputeCapability` that walks `node->ImplicitInputDefs()` with the same "produced inside the partition → skip, otherwise add to subgraph inputs" semantics already applied to `InputDefs()`.

## Why this is the right fix

`onnxruntime::Node` partitions inputs into two arrays by design:

- `InputDefs()`: formal operand list as declared in the op's ONNX schema.
- `ImplicitInputDefs()`: outer-scope SSA values referenced from inside body subgraphs of `Loop` / `If` / `Scan`. These are real boundary inputs at runtime (the body kernel reads them) but they don't appear in the op's formal operand list.

`Graph::FinalizeFuseSubGraph` consumes only `meta_def->inputs` to populate the fused node's `InputDefs()` and rewire outer-scope edges. So whatever `MakeComputeCapability` puts in `meta_def->inputs` is what ends up at the fused-node boundary. Omitting `ImplicitInputDefs()` here means the captures are unreachable downstream; there is no other place that can patch them back in.

The fix is intentionally a mirror of the existing `InputDefs()` loop (same `Contains(node_outputs, ...)` produced-inside check, same `ordered_subgraph_inputs.push_back` ordering). The new loop runs after the explicit loop so explicit-operand index ordering in `meta_def->inputs` is preserved (EPs that have implicitly relied on `meta_def->inputs[i].name == node.InputDefs()[i].name` for non-control-flow op groups are not perturbed).

## Scope of impact

Only EPs that consume `utils::MakeComputeCapability` / `utils::CreateSupportedPartitions` are affected. A quick audit:

| EP | Uses `partitioning_utils::MakeComputeCapability`? | Affected by bug? |
|---|---|---|
| Plugin EPs (`EpGraphSupportInfo_AddNodesToFuse` → `PluginExecutionProvider::GetCapability`) | yes, in `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc` | **yes** |
| `internal_testing_ep` (used by ORT's own unit tests) | yes, in `onnxruntime/test/internal_testing_ep/internal_testing_execution_provider.cc` | **yes** |
| TensorRT, MIGraphX, NV-TRT-RTX, VitisAI | no; they build `MetaDef::inputs` themselves and already walk `ImplicitInputDefs()` (e.g. `tensorrt_execution_provider.cc:2084`, `migraphx_execution_provider.cc:735`) | no |
| DML / CPU / CUDA / ROCm / OpenVINO / QNN / CANN / WebGPU / CoreML | don't use it for Loop/If/Scan fusion paths | no |

So the impact is bounded to the plugin EP architecture (ORT 1.23+) and the in-tree testing EP, both of which delegate boundary calculation to this shared helper.

## Reproduction

The bug is reproducible against this repo's `internal_testing_ep`. No external code required.

A minimal repro model with a Loop body that captures an outer-scope tensor `B`:

```python
# build_repro.py: produces a ~1.5 KB onnx
import numpy as np, onnx
from onnx import TensorProto, helper as h, numpy_helper as nph

A = h.make_tensor_value_info("A", TensorProto.FLOAT, ["N", 2, 2])
B = h.make_tensor_value_info("B", TensorProto.FLOAT, [2, 2])
out = h.make_tensor_value_info("v_final", TensorProto.FLOAT, [2, 2])
acc_init = nph.from_array(np.zeros((2, 2), np.float32), name="acc_init")
cond_init = nph.from_array(np.array([1], np.bool_), name="cond_init")
sq_ax = nph.from_array(np.array([0], np.int64), name="sq_ax")

body = h.make_graph(
    nodes=[
        h.make_node("Gather", ["A", "iter"], ["slice"], axis=0),
        h.make_node("Add", ["slice", "B"], ["tmp"]),  # captures outer B
        h.make_node("Add", ["acc_in", "tmp"], ["acc_out"]),
        h.make_node("Identity", ["cond_in"], ["cond_out"]),
    ],
    name="loop_body",
    inputs=[h.make_tensor_value_info("iter", TensorProto.INT64, []),
            h.make_tensor_value_info("cond_in", TensorProto.BOOL, []),
            h.make_tensor_value_info("acc_in", TensorProto.FLOAT, [2, 2])],
    outputs=[h.make_tensor_value_info("cond_out", TensorProto.BOOL, []),
             h.make_tensor_value_info("acc_out", TensorProto.FLOAT, [2, 2])],
)

g = h.make_graph(
    nodes=[
        h.make_node("Shape", ["A"], ["M_1d"], start=0, end=1),
        h.make_node("Squeeze", ["M_1d", "sq_ax"], ["M"]),
        h.make_node("Loop", ["M", "cond_init", "acc_init"], ["v_final"], body=body),
    ],
    name="loop_with_outer_capture",
    inputs=[A, B],
    outputs=[out],
    initializer=[acc_init, cond_init, sq_ax],
)
onnx.save(h.make_model(g, opset_imports=[h.make_opsetid("", 16)]),
          "loop_with_outer_capture.onnx")
```

Observable bug path (against any EP using `CreateSupportedPartitions`, e.g. `InternalTestingExecutionProvider`):

```cpp
// Claim every node (Shape/Squeeze/Constant/Loop) as compiled.
SessionOptions so;
InferenceSession session(so, env);
session.RegisterExecutionProvider(
    std::make_unique<InternalTestingExecutionProvider>(/*supported=*/{...}));
session.Load("loop_with_outer_capture.onnx");
session.Initialize();

// In EP::Compile, iterate fused_node.InputDefs():
//   for (const auto* in : fused_node.InputDefs()) std::cerr << in->Name() << "\n";
// BEFORE this fix: only "A" is printed (Shape(A) makes A explicit;
//   B is consumed only via Loop's ImplicitInputDefs and gets dropped).
// AFTER this fix: both "A" and "B" are printed.
```

A small unit-test fixture exercising the same path can be added to `onnxruntime/test/providers/partitioning_utils_test.cc` following the existing `CheckAllNodesProcessed` pattern, asserting that `result[0]->sub_graph->GetMetaDef()->inputs` contains `B` when the supported group includes the Loop.

## What this PR changes

A single hunk in `onnxruntime/core/providers/partitioning_utils.cc::MakeComputeCapability`, immediately after the existing `for (const auto* input : node->InputDefs()) { ... }`:

```cpp
// Region-bearing ops (Loop/If/Scan) reference outer-scope SSA values via
// ImplicitInputDefs rather than InputDefs. When an EP claims the whole
// control-flow op, those implicit captures must also be in MetaDef::inputs
// so FinalizeFuseSubGraph can rewire the outer-scope edges onto the fused
// node's InputDefs. Without this, plugin EPs that fuse Loop/If/Scan lose
// the captures at the fused-node boundary and cannot resolve them at
// Compute time.
for (const auto* input : node->ImplicitInputDefs()) {
  if (!input->Exists()) {
    continue;
  }
  if (!Contains(node_outputs, input)) {
    if (!Contains(subgraph_inputs, input)) {
      subgraph_inputs.insert(input);
      ordered_subgraph_inputs.push_back(input);
    }
  }
}
```

## Risks / migration

- **No ABI change.** `MakeComputeCapability` signature unchanged. `IndexedSubGraph::MetaDef` schema unchanged.
- **No semantic regression for op groups without control flow.** The new loop only adds elements; for partitions that contain no `Loop` / `If` / `Scan`, `ImplicitInputDefs()` is empty on every node and the new loop is a no-op.
- **Behavior change for plugin EPs that fuse Loop/If/Scan.** Their fused node's `InputDefs()` gains the captures. EPs that were silently fishing out captures via a workaround (e.g. walking the original Loop node's `ImplicitInputDefs()` themselves at Compile time) would see those names show up via the standard fused-node `InputDefs()` API. The audit above shows no in-tree EP that uses `partitioning_utils` had such a workaround: TRT / MIGraphX / etc. roll their own MetaDef without calling `MakeComputeCapability`.

## Validation

- Verified the fix end-to-end against a downstream plugin EP that claims a `Loop` node as part of a fused partition (Loop body captures an outer-scope tensor): without this fix, the EP cannot resolve the captured tensor name at the fused-node boundary; with the fix the captured tensor appears in `fused_node.InputDefs()` and session initialization + the EP's Compile both succeed.
- No `partitioning_utils.cc` changes between `origin/main` and the patch base, so it applies cleanly.
- Existing `onnxruntime_test_all --gtest_filter=PartitioningUtilsTest.*` cases still pass (the fix only adds behavior for control-flow ops; non-control-flow partitions are byte-for-byte identical to before).
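The boundary-input rule described above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the ORT implementation: the `Node` dataclass and `collect_subgraph_inputs` are hypothetical names, initializers are left out for brevity, and the set of values produced inside the group is computed up front.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a graph node (not onnxruntime::Node)."""
    op_type: str
    inputs: List[str]                     # explicit operands (InputDefs)
    outputs: List[str]
    implicit_inputs: List[str] = field(default_factory=list)  # outer-scope captures


def collect_subgraph_inputs(nodes, include_implicit=True):
    """Return the ordered boundary inputs of a fused group of nodes."""
    produced_inside = {out for node in nodes for out in node.outputs}
    seen = set()
    ordered = []

    def add_if_boundary(name):
        if not name:                      # empty name models a missing optional input
            return
        if name in produced_inside or name in seen:
            return                        # produced inside the group, or already recorded
        seen.add(name)
        ordered.append(name)

    for node in nodes:
        for name in node.inputs:          # explicit operands first, preserving order
            add_if_boundary(name)
        if include_implicit:
            for name in node.implicit_inputs:  # then the outer-scope captures
                add_if_boundary(name)
    return ordered


# Mirrors the repro model: the Loop body reads outer-scope tensors A and B.
GROUP = [
    Node("Shape", ["A"], ["M_1d"]),
    Node("Squeeze", ["M_1d"], ["M"]),
    Node("Loop", ["M"], ["v_final"], implicit_inputs=["A", "B"]),
]
```

With `include_implicit=False` the sketch reproduces the pre-fix behavior, where only `A` reaches the boundary; with the default it returns both `A` and `B`.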

This pull request strengthens security checks around loading external tensor data in ONNX Runtime, particularly to prevent malicious models from referencing unsafe file paths or in-memory address markers that could lead to arbitrary file access or unsafe memory dereferencing. The changes introduce stricter validation for external data paths and add explicit rejections for ORT in-memory address markers found in model protobufs, along with new and improved regression tests to verify this behavior.

**Security hardening for external data loading:**

* Added `ValidateExternalFilePathForTensor` to enforce that external data paths are validated for all code paths loading external data (including those outside `Graph::Resolve`), rejecting absolute or directory-escaping paths and passing through only trusted in-memory markers. This is now called in `GetExtDataFromTensorProto` and `LoadExtDataToTensorFromTensorProto` to ensure defense-in-depth.
* Updated the validation logic for sparse tensor sub-tensors with `ValidateSparseSubTensorExternalDataPath`, clarifying the handling of in-memory markers and ensuring only legitimate file paths are accepted.
* Changed `SparseTensorProtoToDenseTensorProto` to use the new sparse sub-tensor validation for both values and indices.

**Model loading and graph construction protections:**

* In `Graph::Graph`, added explicit rejection of ORT in-memory address markers in sparse tensor attributes and initializers when loading from a protobuf, preventing attackers from crafting models that could cause unsafe memory access during sparse-to-dense conversion or initializer resolution.

**Expanded and improved testing:**

* Added new unit tests to verify that absolute and directory-escaping external paths are rejected even when loading tensors directly (not via graph resolution), and that in-memory address markers are not accepted in dense or sparse initializers loaded from protobufs.
* Updated an optimizer initializer test to reflect the new error handling for invalid external data paths.
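As a rough illustration of what "rejecting absolute or directory-escaping paths" means, the sketch below checks a POSIX-style relative location in Python. It is an assumption-laden simplification, not the logic of `ValidateExternalFilePathForTensor`: the function name `is_safe_external_path` is hypothetical, and Windows paths, symlinks, and the in-memory markers mentioned above are deliberately not handled.

```python
import posixpath


def is_safe_external_path(location: str) -> bool:
    """Return True if `location` stays inside the model directory.

    Hypothetical helper: lexical check only, POSIX separators only.
    """
    if not location:
        return False
    if posixpath.isabs(location):
        return False                      # e.g. "/etc/passwd"
    # Collapse "a/../b" style segments before looking for an escape.
    normalized = posixpath.normpath(location)
    if normalized == ".." or normalized.startswith("../"):
        return False                      # climbs above the model directory
    return True
```

A purely lexical check like this one cannot see through symlinks, which is one reason a production validator needs more than this sketch shows.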

microsoft#28695)

## Description

Add a flash attention-style tiled computation path to the CPU GroupQueryAttention operator for quantized KV cache (INT8/INT4). Instead of materializing the full `[B, N, S, T]` attention probability matrix, this processes K/V in L2-cache-sized blocks with online softmax, reducing peak memory from O(S×T) to O(S×Bc) per head where Bc is the KV block size.

Additionally, implements **flash decoding** for the decode phase (S=1): when `batch×heads < threads`, idle threads are repurposed to partition the KV sequence across parallel workers. Each worker computes partial softmax statistics on its KV chunk, then a lightweight reduce phase merges the partials, achieving 2–5x decode speedup for long sequences.

### Motivation

For long-sequence LLM inference with quantized KV cache on CPU:

- **Prefill**: The full attention matrix allocation becomes a significant memory bottleneck. With 16 heads and S=4096, the naive path allocates ~1 GB for attention scores alone. The tiled approach reduces peak memory by 13–24x and latency by 1.2–2.7x.
- **Decode**: When batch size is small relative to available threads, many threads sit idle. Flash decoding partitions the KV sequence across these idle threads, achieving 2–5x speedup for long KV lengths.

## Key Changes

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/flashattn_qkv.cpp` | MLAS kernel: tiled prefill with online softmax, flash decoding (two-phase KV partitioning), and reduce |
| `onnxruntime/core/mlas/inc/mlas_qkv_quant.h` | `MlasFlashAttentionQuantizedKVArgs` struct with `flash_decoding_partials` and `kv_chunk_count` fields |
| `onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h` | `ApplyAttentionQuantizedFlash()` with L2-cache-aware block sizing, KV concat, flash decoding setup |
| `onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc` | Dispatch logic: activates flash path when no softcap/smooth softmax/output_qk |
| `cmake/onnxruntime_mlas.cmake` | Added `flashattn_qkv.cpp` to the MLAS build |
| `docs/contrib_ops/cpu/gqa.md` | Documentation with algorithm details, benchmark results, and reproduction steps |
| `onnxruntime/test/mlas/bench/bench_qkv_quant.cpp` | MLAS-level C++ benchmark (`BM_GQA_Naive` vs `BM_GQA_Flash`) |
| `onnxruntime/test/python/transformers/benchmark_gqa_cpu_flash.py` | Operator-level Python benchmark |

## Algorithm

### Prefill (S > 1): Tiled Flash Attention

Per (batch, head, q_block) tile:

1. **QK GEMM**: `MlasQKGemm` on a block slice of quantized K cache
2. **Causal + local window masking**: Set masked positions to -inf before softmax
3. **Online softmax**: Track running max `m` and sum `l`, rescale accumulated output with `exp(m_old - m_new)`
4. **SV accumulation**: Dequantize V block to FP32, then accumulate weighted V into output

### Decode (S = 1): Flash Decoding

When `sequence_length == 1 && batch_size * num_heads < thread_count && kv_chunk_count > 1`:

**Phase 1 (Parallel KV scan)**: Each idle thread processes a disjoint KV chunk for a (batch, head) pair. For each chunk: compute QK dot products, find local max, compute local softmax sum, and accumulate partial weighted V output. Store per-chunk `(max_score, sum_exp, partial_output[head_size])` into a partials buffer.

**Phase 2 (Reduce)**: One thread per (batch, head) merges all chunk partials using the log-sum-exp trick: find global max, rescale each chunk's sum and partial output, then normalize by global sum.

This is analogous to GPU flash decoding (Dao et al.) but adapted for CPU threading.

### Activation Conditions

Flash path activates when ALL of:

- `ORT_GQA_DISABLE_FLASH_ATTENTION` env var is not set
- `total_sequence_length > 1`
- No softcap, no smooth softmax, no output_qk (attention bias IS supported)

Flash decoding additionally requires:

- `sequence_length == 1` (decode phase)
- `batch_size * num_heads < thread_count` (idle threads available)
- `kv_chunk_count > 1` (enough KV to partition)

## Benchmark Results

Measured on Intel Xeon Platinum 8480C, 96 CPUs, threads=8. MLAS-level C++ benchmark.

### Latency: Prefill (S = T)

Shape: B=1, num_heads=16, kv_num_heads=8, head_size=128.

| Seq Length | Naive (ms) | Flash (ms) | Speedup | Quant |
|---:|---:|---:|---:|:---|
| 512 | 9.9 | 8.1 | 1.2x | per-tensor |
| 1024 | 44.4 | 27.0 | 1.6x | per-tensor |
| 2048 | 190.9 | 116.9 | 1.6x | per-tensor |
| 4096 | 1257.8 | 461.6 | 2.7x | per-tensor |
| 512 | 10.7 | 10.8 | 1.0x | per-channel |
| 1024 | 49.5 | 41.7 | 1.2x | per-channel |
| 2048 | 212.1 | 164.1 | 1.3x | per-channel |
| 4096 | 1223.9 | 607.8 | 2.0x | per-channel |

### Latency: Decode (S = 1, no flash decoding)

Shape: B=1, num_heads=16, kv_num_heads=8, head_size=128. Flash decoding NOT active (batch×heads=16 > threads=8).

| Total Seqlen | Naive (us) | Flash (us) | Speedup | Quant |
|---:|---:|---:|---:|:---|
| 512 | 32 | 22 | 1.4x | per-tensor |
| 1024 | 71 | 47 | 1.5x | per-tensor |
| 2048 | 120 | 87 | 1.4x | per-tensor |
| 4096 | 210 | 174 | 1.2x | per-tensor |
| 512 | 53 | 31 | 1.7x | per-channel |
| 1024 | 86 | 52 | 1.7x | per-channel |
| 2048 | 172 | 97 | 1.8x | per-channel |
| 4096 | 299 | 191 | 1.6x | per-channel |

### Latency: Flash Decoding (S = 1, KV partitioned across threads)

Shape: B=1, num_heads=4, kv_num_heads=4 (MHA), head_size=128. Flash decoding IS active (batch×heads=4 < threads=8).

| Total Seqlen | Naive (us) | Flash (us) | Speedup | Quant |
|---:|---:|---:|---:|:---|
| 512 | 31 | 25 | 1.2x | per-tensor |
| 1024 | 41 | 25 | 1.6x | per-tensor |
| 2048 | 67 | 34 | 2.0x | per-tensor |
| 4096 | 197 | 54 | 3.7x | per-tensor |
| 512 | 25 | 28 | 0.9x | per-channel |
| 1024 | 72 | 27 | 2.7x | per-channel |
| 2048 | 144 | 37 | 3.9x | per-channel |
| 4096 | 304 | 60 | 5.1x | per-channel |

### Peak Memory: Prefill

| Seq Length | Naive Peak | Flash Peak | Memory Reduction |
|---:|---:|---:|---:|
| 2048 (N=16) | +294 MB | +44 MB | 6.7x |
| 4096 (N=16) | +1107 MB | +82 MB | 13.5x |
| 4096 (N=32) | +2131 MB | +87 MB | 24.5x |

**Summary**: Prefill gains 1.2–2.7x latency + 7–24x memory reduction from tiled online softmax. Decode gains 1.2–1.8x from fused dequant+dot alone. Flash decoding adds 2–5x for long sequences when idle threads are available to partition the KV scan.

### How to Reproduce

```bash
# Build ORT
python tools/ci_build/build.py --build_dir build/cpu --config Release \
  --parallel --build_wheel --skip_tests

# MLAS-level C++ benchmark:
cd build/cpu/Release
./onnxruntime_mlas_benchmark \
  --benchmark_filter='BM_GQA_(Naive|Flash)' \
  --benchmark_min_time=0.5s \
  --benchmark_repetitions=3 \
  --benchmark_report_aggregates_only=true
```

## Testing

- All 35 CPU `GroupQueryAttentionTest.*` tests pass (INT8/INT4, per-tensor/per-channel, multi-batch, large head, GQA ratio variants)
- Set `ORT_GQA_DISABLE_FLASH_ATTENTION=1` to verify fallback path still works
- End-to-end verified with `quantized_kv_cache_cpu_demo.py`
- Numerical agreement between flash and naive paths: max diff < 1e-7
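The online-softmax update and the two-phase flash decoding reduce described in the Algorithm section can be sketched in plain Python for a single query vector (the S = 1 case). This is an illustrative model only, not the MLAS kernel: all function names are hypothetical, values are unquantized floats, keys are processed one at a time rather than in L2-sized blocks, and the chunks run sequentially instead of on separate threads.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weight V."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(w[t] * values[t][j] for t in range(len(values))) / total
            for j in range(len(values[0]))]


def chunk_partial(q, keys, values):
    """Phase 1: online softmax over one KV chunk.

    Returns (max_score, sum_exp, partial_output), unnormalized.
    """
    m, l = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = dot(q, k)
        m_new = max(m, s)
        rescale = math.exp(m - m_new)     # exp(m_old - m_new); 0.0 on the first step
        p = math.exp(s - m_new)
        l = l * rescale + p
        acc = [a * rescale + p * x for a, x in zip(acc, v)]
        m = m_new
    return m, l, acc


def merge_partials(partials):
    """Phase 2: combine chunk partials with the log-sum-exp trick."""
    global_max = max(m for m, _, _ in partials)
    total = 0.0
    out = [0.0] * len(partials[0][2])
    for m, l, acc in partials:
        w = math.exp(m - global_max)      # rescale this chunk to the global max
        total += l * w
        out = [o + a * w for o, a in zip(out, acc)]
    return [o / total for o in out]       # normalize by the global sum


def flash_decode(q, keys, values, chunk_size):
    partials = [chunk_partial(q, keys[i:i + chunk_size], values[i:i + chunk_size])
                for i in range(0, len(keys), chunk_size)]
    return merge_partials(partials)
```

Because each chunk keeps only `(max_score, sum_exp, partial_output)`, no buffer ever holds the full score vector, which is where the memory saving in the tiled path comes from.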

…t#28710)

### Description

Add a new `--paths` option to `compile_contributors.py` to limit git history queries using pathspecs. Apply the path filter to both base and target git log collection and log the active path filter in logs.txt.

### Motivation and Context

Allow `compile_contributors.py` to be used for releases where relevant changes are largely limited to a subset of the codebase. E.g., we can limit the paths to WebGPU EP-related files for the WebGPU plugin EP release.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
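For readers unfamiliar with pathspecs, the sketch below shows how a path filter is typically appended to a `git log` invocation after a `--` separator. It does not reproduce `compile_contributors.py`: the function name `build_git_log_cmd` and the `--format=%an` choice are assumptions made for illustration.

```python
def build_git_log_cmd(ref, paths=()):
    """Build an argv list for `git log <ref> [-- <pathspec>...]`.

    Hypothetical helper; the real script may assemble its command differently.
    """
    cmd = ["git", "log", "--format=%an", ref]
    if paths:
        cmd.append("--")                  # separates revisions from pathspecs
        cmd.extend(paths)
    return cmd
```

Calling the same helper once for the base ref and once for the target ref would apply the filter to both history queries, as the description requires.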

### Description

Update CUDA version from 12.8 to 13.0 across all CUDA CI workflow files:

- `linux_cuda_ci.yml`: `--cuda_version`, `--cuda_home`, `--cudnn_home` in build and test jobs
- `linux_cuda_plugin_ci.yml`: same flags
- `windows_cuda.yml`: SDK download URL, PATH entries, `--cuda_home` in build and test jobs
- `windows_cuda_plugin.yml`: same as above

### Motivation and Context

PRs that break CUDA 13 builds pass CI today because all pipelines target CUDA 12.8. This moves CI to CUDA 13.0 for build-time coverage.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>

…28691)

### Description

For ORT 1.27, GPU release artifacts (zip/tgz) now include an explicit CUDA major version suffix to distinguish between CUDA 12 and CUDA 13 builds.

**Before:** `onnxruntime-linux-x64-gpu-1.27.0.tgz`, `onnxruntime-win-x64-gpu-1.27.0.zip`

**After:** `onnxruntime-linux-x64-gpu_cuda12-1.27.0.tgz`, `onnxruntime-win-x64-gpu_cuda13-1.27.0.tgz`, etc.

### Motivation and Context

---------

Co-authored-by: Kusuma Padma Kavya Bandi <kusbandi@microsoft.com>

## Summary

Unifies the NHWC-eligible op allowlist between the bundled CUDA EP and the CUDA plugin EP into a single shared header, adds kernel-miss diagnostics, and expands NHWC test coverage from 4 ops to 11.

## Motivation

The bundled EP (`cuda_execution_provider.cc`) and the plugin EP (`plugin/cuda_ep.cc`) independently maintained their own copies of the NHWC allowlist. This created a maintenance hazard where ops could be added to one but not the other, leading to silent divergence. Additionally, there was no runtime diagnostic when the framework rewrote a node to the NHWC domain but the plugin EP lacked a matching kernel; failures were silent fallbacks to CPU.

## Key Changes

### Shared NHWC Allowlist (`cuda_nhwc_ops.h`)

| Item | Detail |
|------|--------|
| New file | `onnxruntime/core/providers/cuda/cuda_nhwc_ops.h` |
| Contents | `IsNhwcEligibleOnnxOp()`, `IsNhwcEligibleMsOp()`, `IsNhwcEligible()` inline functions |
| Ops covered | AveragePool, BatchNormalization, Conv, ConvTranspose, DepthToSpace, GlobalAveragePool, GlobalMaxPool, GridSample, LRN, MaxPool, SpaceToDepth (+ MS-domain GridSample) |

### Bundled EP Refactor (`cuda_execution_provider.cc`)

- Removed the static `std::unordered_set<std::string_view> cuda_nhwc_onnx_ops` and the inline domain check logic.
- Replaced with a single call to `cuda::IsNhwcEligible(node_domain, node_op_type)`.

### Plugin EP Refactor & Diagnostics (`plugin/cuda_ep.cc`)

- `ShouldConvertDataLayoutForOpImpl`: Replaced ~20 lines of static set + domain checks with a single `cuda::IsNhwcEligible()` call.
- `GetCapabilityImpl`: Added a WARNING-level diagnostic in the `else` branch (kernel not found). When a node in the `com.ms.internal.nhwc` domain has no registered kernel, the log emits the op type, domain, version, and node name, making future NHWC registration gaps immediately visible at session creation.

### Expanded NHWC Test Coverage (`test_cuda_plugin_ep.py`)

- Added `_assert_nhwc_domain_assigned()` helper that verifies NHWC layout transformation occurred by checking for framework-inserted Transpose nodes in the EP's assignment info.
- Added `_run_nhwc_model_test()` helper combining domain assertion + numerical validation.
- Updated 4 existing NHWC tests (Conv, BatchNormalization, MaxPool, AveragePool) to include structural assertions.
- Added 7 new NHWC test methods:
  - `test_nhwc_conv_transpose`
  - `test_nhwc_global_max_pool`
  - `test_nhwc_global_average_pool`
  - `test_nhwc_depth_to_space`
  - `test_nhwc_space_to_depth`
  - `test_nhwc_lrn`
  - `test_nhwc_grid_sample`

## Testing Notes

Run the full CUDA plugin EP test suite with NHWC enabled:

```bash
bash .env/cuda13_plugin.sh --build --install --test_plugin
```

Or run only the NHWC tests directly:

```bash
cd onnxruntime/test/python/transformers
ORT_TEST_CUDA_PLUGIN_EP=1 python -m unittest \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_conv \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_batch_normalization \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_maxpool \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_avgpool \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_conv_transpose \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_global_max_pool \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_global_average_pool \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_depth_to_space \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_space_to_depth \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_lrn \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_grid_sample
```

All 86 tests in the suite pass (11 NHWC + 75 existing), with no regressions.
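The idea of a single shared allowlist can be modeled in Python as one lookup that both EPs would call. This is a sketch of the concept, not a port of `cuda_nhwc_ops.h`: the Python names are hypothetical, the op sets are copied from the "Ops covered" row above, and treating `""` and `"ai.onnx"` as the ONNX domain and `"com.microsoft"` as the MS domain is an assumption about how the header distinguishes domains.

```python
# Op lists taken from the "Ops covered" row above.
NHWC_ONNX_OPS = frozenset({
    "AveragePool", "BatchNormalization", "Conv", "ConvTranspose",
    "DepthToSpace", "GlobalAveragePool", "GlobalMaxPool", "GridSample",
    "LRN", "MaxPool", "SpaceToDepth",
})
NHWC_MS_OPS = frozenset({"GridSample"})

ONNX_DOMAINS = ("", "ai.onnx")            # assumed spellings of the default ONNX domain
MS_DOMAIN = "com.microsoft"               # assumed MS contrib-op domain


def is_nhwc_eligible(domain: str, op_type: str) -> bool:
    """Single source of truth for NHWC eligibility (hypothetical helper)."""
    if domain in ONNX_DOMAINS:
        return op_type in NHWC_ONNX_OPS
    if domain == MS_DOMAIN:
        return op_type in NHWC_MS_OPS
    return False
```

Keeping the sets in one place means adding an op updates every caller at once, which is the divergence hazard the refactor is meant to remove.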

edgchen1 and others added 11 commits May 28, 2026 13:34

Merge remote-tracking branch 'origin/master' into sync_msft_30052026

a51d493

ai-fw-intg requested review from Jaswanth51, ankitm3k, jatinwadhwa921 and vthaniel May 29, 2026 20:34

ai-fw-intg wants to merge 11 commits into ovep-develop from sync_msft_30052026

ai-fw-intg commented May 29, 2026


9 participants
