Sync with Microsoft ONNX Runtime - 30052026 #1113

Open
ai-fw-intg wants to merge 11 commits into ovep-develop from sync_msft_30052026
@ai-fw-intg
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

edgchen1 and others added 11 commits May 28, 2026 13:34
…ove shader key validation to nightly build (microsoft#28674)

### Description

Allow shader code to be dumped to the file specified in the
`ORT_WEBGPU_EP_SHADER_DUMP_FILE` environment variable. Previously,
shader code was only dumped by verbose logging.

Create a new nightly CI pipeline to run the shader key validation test. That
test was removed from the CI pipeline in microsoft#28642.

### Motivation and Context

More shader dump output options. Moving shader key validation test.
…u_inc/cub.cuh" wrapper. (microsoft#28705)

### Description

Replace direct inclusion of `<cub/cub.cuh>` with
`"core/providers/cuda/cu_inc/cub.cuh"` wrapper. The wrapper accounts for
a problematic macro definition which causes issues.

### Motivation and Context

Fix pipeline build error.
…mic quantization (microsoft#28228)

## Summary
- Fix `quantize_dynamic(per_channel=True)` so weights quantized
per-channel produce a `DequantizeLinear` node with the correct `axis`
attribute.
- Stop dropping the channel axis when `quantize_weight_per_channel`
populates `QuantizedValue` (was hardcoded to `None`).
- Gate the scalar-scale assertion in `_dequantize_value` on `axis is
None` so per-channel scales (1-D tensors) are accepted.

## Motivation
Fixes microsoft#19997.

When a model is quantized with `quantize_dynamic(..., per_channel=True)`
and a per-channel weight reaches `_dequantize_value` (e.g. via
`_dequantize_outputs` when the weight is in the graph outputs), two bugs
surface:

1. `quantize_weight_per_channel` stores `QuantizedValue.axis = None`
even though it received a real `channel_axis`, so the per-channel
information is lost.
2. `_dequantize_value` (a) asserts `scale_init.size == 1`, which fails
for a 1-D per-channel scale, and (b) builds the `DequantizeLinear` node
without an `axis` attribute, producing an invalid ONNX node when the
model is consumed.

PR microsoft#22283 (Nov 2024) softened the assertion against `None`-typed scales
but left the underlying axis-propagation bug in place.

## Changes
- `onnxruntime/python/tools/quantization/onnx_quantizer.py`
- `quantize_weight_per_channel`: pass `channel_axis` (was `None`) into
`QuantizedValue`.
- `_dequantize_value`: only require a scalar scale on the per-tensor
path (`axis is None`); forward `axis=quantized_value.axis` to
`onnx.helper.make_node("DequantizeLinear", ...)`. `make_node` silently
omits the attribute when `axis` is `None`, so the per-tensor path is
unchanged.
- `onnxruntime/test/python/quantization/test_quant_issues.py`
- New regression test
`test_dynamic_quantize_per_channel_emits_axis_attribute` that builds a
minimal MatMul model with the weight routed to a graph output (to force
the `_dequantize_outputs` -> `_dequantize_value` path), runs
`quantize_dynamic(per_channel=True)`, and asserts the emitted
`DequantizeLinear` has the `axis` attribute and a 1-D multi-element
scale initializer.

## Test Plan
- `python -m pytest
onnxruntime/test/python/quantization/test_quant_issues.py -xvs` — new
test passes; existing test skipped as before.
- `python -m pytest
onnxruntime/test/python/quantization/test_op_matmul.py` — 7 passed, 8
skipped (no regression).
- `python -m pytest onnxruntime/test/python/quantization/test_qdq.py -k
per_channel` — 1 passed.
- `lintrunner -a` on changed files: clean.
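
To illustrate why the `axis` attribute matters on the per-channel path, here is a minimal pure-Python sketch of `DequantizeLinear` semantics on a 2-D weight. It is not the quantizer's code: the helper name `dequantize_linear` and the sample values are assumptions made for illustration only.

```python
def dequantize_linear(q, scale, zero_point, axis=None):
    """Dequantize a 2-D integer matrix given as a list of rows.

    axis=None  -> per-tensor: `scale` and `zero_point` are scalars.
    axis=0 / 1 -> per-channel: `scale` and `zero_point` are 1-D lists whose
                  length equals the size of that axis.
    """
    if axis is None:
        return [[(v - zero_point) * scale for v in row] for row in q]
    if axis not in (0, 1):
        raise ValueError("this sketch only handles 2-D inputs")
    out = []
    for i, row in enumerate(q):
        out_row = []
        for j, v in enumerate(row):
            c = i if axis == 0 else j  # channel index along `axis`
            out_row.append((v - zero_point[c]) * scale[c])
        out.append(out_row)
    return out


weight_q = [[10, 20], [30, 40]]

# Per-tensor: one scalar scale for the whole matrix.
per_tensor = dequantize_linear(weight_q, 0.5, 0)

# Per-channel along axis=1: one scale per column. Without `axis`, a consumer
# has no way to know which dimension the 1-D scale applies to.
per_channel = dequantize_linear(weight_q, [0.5, 0.25], [0, 0], axis=1)

print(per_tensor)   # [[5.0, 10.0], [15.0, 20.0]]
print(per_channel)  # [[5.0, 5.0], [15.0, 10.0]]
```

The same 1-D scale gives different results depending on the axis it is applied along, which is why a node carrying a multi-element scale needs the attribute.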
…Def (microsoft#28608)

## Summary

`utils::MakeComputeCapability` is the shared helper used by
`utils::CreateSupportedPartitions` to build an
`IndexedSubGraph::MetaDef` from a group of supported nodes. When a
supported group contains a control-flow op (`Loop`, `If`, `Scan`),
`MakeComputeCapability` currently walks only `node->InputDefs()` and
silently drops the outer-scope captures (`node->ImplicitInputDefs()`).
The captures never enter `meta_def->inputs`, so after
`Graph::FinalizeFuseSubGraph` the fused node's `InputDefs()` is missing
them — the EP that owns the fused subgraph has no boundary value-info
for the captured tensors and cannot bind them at Compute time.

This PR adds a second loop in `MakeComputeCapability` that walks
`node->ImplicitInputDefs()` with the same "produced inside the partition
→ skip, otherwise add to subgraph inputs" semantics already applied to
`InputDefs()`.

## Why this is the right fix

`onnxruntime::Node` partitions inputs into two arrays by design:

- `InputDefs()` — formal operand list as declared in the op's ONNX
schema.
- `ImplicitInputDefs()` — outer-scope SSA values referenced from inside
body subgraphs of `Loop` / `If` / `Scan`. These are real boundary inputs
at runtime (the body kernel reads them) but they don't appear in the
op's formal operand list.

`Graph::FinalizeFuseSubGraph` consumes only `meta_def->inputs` to
populate the fused node's `InputDefs()` and rewire outer-scope edges. So
whatever `MakeComputeCapability` puts in `meta_def->inputs` is what ends
up at the fused-node boundary. Omitting `ImplicitInputDefs()` here means
the captures are unreachable downstream — there is no other place that
can patch them back in.

The fix is intentionally a mirror of the existing `InputDefs()` loop
(same `Contains(node_outputs, ...)` produced-inside check, same
`ordered_subgraph_inputs.push_back` ordering). The new loop runs after
the explicit loop so explicit-operand index ordering in
`meta_def->inputs` is preserved (EPs that have implicitly relied on
`meta_def->inputs[i].name == node.InputDefs()[i].name` for
non-control-flow op groups are not perturbed).
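
The boundary-input rule described above can be sketched in a few lines of Python. This is a toy model, not the ORT implementation: the `Node` dataclass and `boundary_inputs` function are hypothetical, nodes are assumed to arrive in topological order, and initializers are skipped purely to keep the output short.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    inputs: list                      # explicit operands (like InputDefs)
    outputs: list
    implicit_inputs: list = field(default_factory=list)  # outer-scope captures


def boundary_inputs(nodes, initializers=(), include_implicit=True):
    """Return the ordered list of values a fused group needs from outside."""
    produced, seen, ordered = set(), set(), []

    def visit(names):
        for name in names:
            if not name or name in initializers:
                continue
            # Skip values produced inside the group; add each outside value once.
            if name not in produced and name not in seen:
                seen.add(name)
                ordered.append(name)

    for node in nodes:
        visit(node.inputs)
        if include_implicit:
            visit(node.implicit_inputs)   # the step the fix adds
        produced.update(node.outputs)
    return ordered


# Main-graph nodes of a Loop model whose body captures outer-scope A and B.
group = [
    Node("Shape", ["A"], ["M_1d"]),
    Node("Squeeze", ["M_1d", "sq_ax"], ["M"]),
    Node("Loop", ["M", "cond_init", "acc_init"], ["v_final"],
         implicit_inputs=["A", "B"]),
]
inits = {"sq_ax", "cond_init", "acc_init"}

before = boundary_inputs(group, inits, include_implicit=False)
after = boundary_inputs(group, inits, include_implicit=True)
print(before)  # ['A']
print(after)   # ['A', 'B']
```

Because implicit inputs are visited after a node's explicit inputs, a value that is both explicit and captured (here `A`) keeps the position it got from the explicit pass.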

## Scope of impact

Only EPs that consume `utils::MakeComputeCapability` /
`utils::CreateSupportedPartitions` are affected. A quick audit:

| EP | Uses `partitioning_utils::MakeComputeCapability`? | Affected by bug? |
|---|---|---|
| Plugin EPs (`EpGraphSupportInfo_AddNodesToFuse` → `PluginExecutionProvider::GetCapability`) | yes, in `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc` | **yes** |
| `internal_testing_ep` (used by ORT's own unit tests) | yes, in `onnxruntime/test/internal_testing_ep/internal_testing_execution_provider.cc` | **yes** |
| TensorRT, MIGraphX, NV-TRT-RTX, VitisAI | no; they build `MetaDef::inputs` themselves and already walk `ImplicitInputDefs()` (e.g. `tensorrt_execution_provider.cc:2084`, `migraphx_execution_provider.cc:735`) | no |
| DML / CPU / CUDA / ROCm / OpenVINO / QNN / CANN / WebGPU / CoreML | don't use it for Loop/If/Scan fusion paths | no |

So the impact is bounded to the plugin EP architecture (ORT 1.23+) and
the in-tree testing EP — both of which delegate boundary calculation to
this shared helper.

## Reproduction

The bug is reproducible against this repo's `internal_testing_ep`. No
external code required.

A minimal repro model with a Loop body that captures an outer-scope
tensor `B`:

```python
# build_repro.py — produces a ~1.5 KB onnx
import numpy as np, onnx
from onnx import TensorProto, helper as h, numpy_helper as nph

A   = h.make_tensor_value_info("A", TensorProto.FLOAT, ["N", 2, 2])
B   = h.make_tensor_value_info("B", TensorProto.FLOAT, [2, 2])
out = h.make_tensor_value_info("v_final", TensorProto.FLOAT, [2, 2])

acc_init  = nph.from_array(np.zeros((2, 2), np.float32), name="acc_init")
cond_init = nph.from_array(np.array([1], np.bool_), name="cond_init")
sq_ax     = nph.from_array(np.array([0], np.int64), name="sq_ax")

body = h.make_graph(
    nodes=[
        h.make_node("Gather", ["A", "iter"], ["slice"], axis=0),
        h.make_node("Add", ["slice", "B"], ["tmp"]),     # captures outer B
        h.make_node("Add", ["acc_in", "tmp"], ["acc_out"]),
        h.make_node("Identity", ["cond_in"], ["cond_out"]),
    ],
    name="loop_body",
    inputs=[h.make_tensor_value_info("iter", TensorProto.INT64, []),
            h.make_tensor_value_info("cond_in", TensorProto.BOOL, []),
            h.make_tensor_value_info("acc_in", TensorProto.FLOAT, [2, 2])],
    outputs=[h.make_tensor_value_info("cond_out", TensorProto.BOOL, []),
             h.make_tensor_value_info("acc_out", TensorProto.FLOAT, [2, 2])],
)

g = h.make_graph(
    nodes=[
        h.make_node("Shape", ["A"], ["M_1d"], start=0, end=1),
        h.make_node("Squeeze", ["M_1d", "sq_ax"], ["M"]),
        h.make_node("Loop", ["M", "cond_init", "acc_init"], ["v_final"], body=body),
    ],
    name="loop_with_outer_capture",
    inputs=[A, B], outputs=[out],
    initializer=[acc_init, cond_init, sq_ax],
)
onnx.save(h.make_model(g, opset_imports=[h.make_opsetid("", 16)]),
          "loop_with_outer_capture.onnx")
```

Observable bug path (against any EP using `CreateSupportedPartitions`,
e.g. `InternalTestingExecutionProvider`):

```cpp
// Claim every main-graph node of the repro model (Shape/Squeeze/Loop) as compiled.
SessionOptions so;
InferenceSession session(so, env);
session.RegisterExecutionProvider(
    std::make_unique<InternalTestingExecutionProvider>(/*supported=*/{...}));
session.Load("loop_with_outer_capture.onnx");
session.Initialize();

// In EP::Compile, iterate fused_node.InputDefs():
//   for (const auto* in : fused_node.InputDefs()) std::cerr << in->Name() << "\n";
// BEFORE this fix: only "A" is printed (Shape(A) makes A explicit;
// B is consumed only via Loop's ImplicitInputDefs and gets dropped).
// AFTER this fix:  both "A" and "B" are printed.
```

A small unit-test fixture exercising the same path can be added to
`onnxruntime/test/providers/partitioning_utils_test.cc` following the
existing `CheckAllNodesProcessed` pattern, asserting that
`result[0]->sub_graph->GetMetaDef()->inputs` contains `B` when the
supported group includes the Loop.

## What this PR changes

A single hunk in
`onnxruntime/core/providers/partitioning_utils.cc::MakeComputeCapability`,
immediately after the existing `for (const auto* input :
node->InputDefs()) { ... }`:

```cpp
// Region-bearing ops (Loop/If/Scan) reference outer-scope SSA values via
// ImplicitInputDefs rather than InputDefs. When an EP claims the whole
// control-flow op, those implicit captures must also be in MetaDef::inputs
// so FinalizeFuseSubGraph can rewire the outer-scope edges onto the fused
// node's InputDefs. Without this, plugin EPs that fuse Loop/If/Scan lose
// the captures at the fused-node boundary and cannot resolve them at
// Compute time.
for (const auto* input : node->ImplicitInputDefs()) {
  if (!input->Exists()) {
    continue;
  }
  if (!Contains(node_outputs, input)) {
    if (!Contains(subgraph_inputs, input)) {
      subgraph_inputs.insert(input);
      ordered_subgraph_inputs.push_back(input);
    }
  }
}
```

## Risks / migration

- **No ABI change.** `MakeComputeCapability` signature unchanged.
`IndexedSubGraph::MetaDef` schema unchanged.
- **No semantic regression for op groups without control flow.** The new
loop only adds elements; for partitions that contain no `Loop` / `If` /
`Scan`, `ImplicitInputDefs()` is empty on every node and the new loop is
a no-op.
- **Behavior change for plugin EPs that fuse Loop/If/Scan.** Their fused
node's `InputDefs()` gains the captures. EPs that were silently fishing
out captures via a workaround (e.g. walking the original Loop node's
`ImplicitInputDefs()` themselves at Compile time) would see those names
show up via the standard fused-node `InputDefs()` API. Audit above shows
no in-tree EP that uses `partitioning_utils` had such a workaround — TRT
/ MIGraphX / etc. roll their own MetaDef without calling
`MakeComputeCapability`.

## Validation

- Verified the fix end-to-end against a downstream plugin EP that claims
a `Loop` node as part of a fused partition (Loop body captures an
outer-scope tensor): without this fix, the EP cannot resolve the
captured tensor name at the fused-node boundary; with the fix the
captured tensor appears in `fused_node.InputDefs()` and session
initialization + the EP's Compile both succeed.
- No `partitioning_utils.cc` changes between `origin/main` and the patch
base, so it applies cleanly.
- Existing `onnxruntime_test_all --gtest_filter=PartitioningUtilsTest.*`
cases still pass (the fix only adds behavior for control-flow ops;
non-control-flow partitions are byte-for-byte identical to before).
This pull request strengthens security checks around loading external
tensor data in ONNX Runtime, particularly to prevent malicious models
from referencing unsafe file paths or in-memory address markers that
could lead to arbitrary file access or unsafe memory dereferencing. The
changes introduce stricter validation for external data paths and add
explicit rejections for ORT in-memory address markers found in model
protobufs, along with new and improved regression tests to verify this
behavior.

**Security hardening for external data loading:**

* Added `ValidateExternalFilePathForTensor` to enforce that external
data paths are validated for all code paths loading external data
(including those outside `Graph::Resolve`), rejecting absolute or
directory-escaping paths and passing through only trusted in-memory
markers. This is now called in `GetExtDataFromTensorProto` and
`LoadExtDataToTensorFromTensorProto` to ensure defense-in-depth.
[[1]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406R1568-R1596)
[[2]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406R1760-R1762)
* Updated the validation logic for sparse tensor sub-tensors with
`ValidateSparseSubTensorExternalDataPath`, clarifying the handling of
in-memory markers and ensuring only legitimate file paths are accepted.
* Changed `SparseTensorProtoToDenseTensorProto` to use the new sparse
sub-tensor validation for both values and indices.
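
As a rough illustration of the kind of check described above (rejecting absolute and directory-escaping locations), here is a small Python sketch. It is not a port of `ValidateExternalFilePathForTensor`: the function name `is_safe_external_path` is hypothetical, the check is purely lexical, and it ignores in-memory markers and symlinks, both of which a real implementation has to consider.

```python
import posixpath


def is_safe_external_path(location: str) -> bool:
    """Return True if `location` lexically stays inside the model directory."""
    if not location:
        return False
    # Treat backslashes as separators so Windows-style input is covered too.
    normalized = location.replace("\\", "/")
    # Reject POSIX absolute paths, UNC paths, and drive-letter paths.
    if normalized.startswith("/") or (len(normalized) > 1 and normalized[1] == ":"):
        return False
    # Collapse "." and ".." components, then reject anything that escapes upward.
    collapsed = posixpath.normpath(normalized)
    return not (collapsed == ".." or collapsed.startswith("../"))


for candidate in ["weights.bin", "data/weights.bin", "/etc/passwd",
                  "../secret.bin", "data/../../secret.bin", "C:\\Windows\\x.bin"]:
    print(candidate, is_safe_external_path(candidate))
```

Running the same check on every code path that opens external data, rather than only during graph resolution, is the defense-in-depth idea the change describes.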

**Model loading and graph construction protections:**

* In `Graph::Graph`, added explicit rejection of ORT in-memory address
markers in sparse tensor attributes and initializers when loading from a
protobuf, preventing attackers from crafting models that could cause
unsafe memory access during sparse-to-dense conversion or initializer
resolution.
[[1]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR1268-R1282)
[[2]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR1322-R1331)
[[3]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR1373-R1380)

**Expanded and improved testing:**

* Added new unit tests to verify that absolute and directory-escaping
external paths are rejected even when loading tensors directly (not via
graph resolution), and that in-memory address markers are not accepted
in dense or sparse initializers loaded from protobufs.
[[1]](diffhunk://#diff-d75ec5db9cc4642f78b6ff568aff6d10398fc211b0fb7c862d3ec88738e3eda6R1156-R1217)
[[2]](diffhunk://#diff-1d3978c99d95a56af0f2603bdd0b10cf02bdc1cecbd4fe5db353a8c8388696efR1365-R1484)
* Updated an optimizer initializer test to reflect the new error
handling for invalid external data paths.
microsoft#28695)

## Description

Add a flash attention-style tiled computation path to the CPU
GroupQueryAttention operator for quantized KV cache (INT8/INT4). Instead
of materializing the full `[B, N, S, T]` attention probability matrix,
this processes K/V in L2-cache-sized blocks with online softmax —
reducing peak memory from O(S×T) to O(S×Bc) per head where Bc is the KV
block size.

Additionally, implements **flash decoding** for the decode phase (S=1):
when `batch×heads < threads`, idle threads are repurposed to partition
the KV sequence across parallel workers. Each worker computes partial
softmax statistics on its KV chunk, then a lightweight reduce phase
merges the partials — achieving 2–5x decode speedup for long sequences.

### Motivation

For long-sequence LLM inference with quantized KV cache on CPU:
- **Prefill**: The full attention matrix allocation becomes a
significant memory bottleneck. With 16 heads and S=4096, the naive path
allocates ~1 GB for attention scores alone. The tiled approach reduces
peak memory by 13–24x and latency by 1.2–2.7x.
- **Decode**: When batch size is small relative to available threads,
many threads sit idle. Flash decoding partitions the KV sequence across
these idle threads, achieving 2–5x speedup for long KV lengths.
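
The "~1 GB" figure for the naive path follows from the shape of the FP32 score matrix. The short calculation below reproduces it; the KV block size of 256 used for the tiled estimate is a hypothetical value chosen for illustration, and the tiled number is an upper bound that assumes every head holds one score tile at the same time.

```python
def attention_scores_bytes(batch, num_heads, q_len, kv_len, bytes_per_elem=4):
    """Size of a [batch, num_heads, q_len, kv_len] score buffer in bytes."""
    return batch * num_heads * q_len * kv_len * bytes_per_elem


# Naive prefill: full [B, N, S, T] matrix with B=1, N=16, S=T=4096, FP32.
naive_bytes = attention_scores_bytes(1, 16, 4096, 4096)
print(naive_bytes / 2**30)        # 1.0 (GiB)

# Tiled prefill: only an S x Bc slice per head (Bc=256 is an assumption).
tiled_bytes = attention_scores_bytes(1, 16, 4096, 256)
print(naive_bytes // tiled_bytes)  # 16
```

The measured reductions reported in the benchmark section differ from this ratio because peak memory also includes buffers other than the score matrix.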

## Key Changes

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/flashattn_qkv.cpp` | MLAS kernel: tiled prefill with online softmax, flash decoding (two-phase KV partitioning), and reduce |
| `onnxruntime/core/mlas/inc/mlas_qkv_quant.h` | `MlasFlashAttentionQuantizedKVArgs` struct with `flash_decoding_partials` and `kv_chunk_count` fields |
| `onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h` | `ApplyAttentionQuantizedFlash()` with L2-cache-aware block sizing, KV concat, flash decoding setup |
| `onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc` | Dispatch logic: activates flash path when no softcap/smooth softmax/output_qk |
| `cmake/onnxruntime_mlas.cmake` | Added `flashattn_qkv.cpp` to the MLAS build |
| `docs/contrib_ops/cpu/gqa.md` | Documentation with algorithm details, benchmark results, and reproduction steps |
| `onnxruntime/test/mlas/bench/bench_qkv_quant.cpp` | MLAS-level C++ benchmark (`BM_GQA_Naive` vs `BM_GQA_Flash`) |
| `onnxruntime/test/python/transformers/benchmark_gqa_cpu_flash.py` | Operator-level Python benchmark |

## Algorithm

### Prefill (S > 1): Tiled Flash Attention

Per (batch, head, q_block) tile:
1. **QK GEMM** — `MlasQKGemm` on a block slice of quantized K cache
2. **Causal + local window masking** — Set masked positions to -inf
before softmax
3. **Online softmax** — Track running max `m` and sum `l`, rescale
accumulated output with `exp(m_old - m_new)`
4. **SV accumulation** — Dequantize V block to FP32, then accumulate
weighted V into output
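
The online-softmax steps above can be sketched for a single query row in plain Python. This is a minimal illustration of the technique, not the MLAS kernel: it uses FP32-style lists instead of quantized K/V, omits causal and local-window masking, and the function names are made up for this example.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block_size):
    """Process K/V in blocks while tracking running max `m` and sum `l`."""
    m = -math.inf                       # running max of scores seen so far
    l = 0.0                             # running sum of exp(score - m)
    acc = [0.0] * len(values[0])        # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)   # exp(m_old - m_new); 0.0 on first block
        l *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.0], [-0.5, 0.7], [0.2, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [0.5, 0.5]]

ref = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
print(max(abs(a - b) for a, b in zip(ref, tiled)) < 1e-9)
```

Only one block of scores is alive at a time, which is where the O(S×Bc) memory bound comes from.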

### Decode (S = 1): Flash Decoding

When `sequence_length == 1 && batch_size * num_heads < thread_count &&
kv_chunk_count > 1`:

**Phase 1 — Parallel KV scan**: Each idle thread processes a disjoint KV
chunk for a (batch, head) pair. For each chunk: compute QK dot products,
find local max, compute local softmax sum, and accumulate partial
weighted V output. Store per-chunk `(max_score, sum_exp,
partial_output[head_size])` into a partials buffer.

**Phase 2 — Reduce**: One thread per (batch, head) merges all chunk
partials using the log-sum-exp trick: find global max, rescale each
chunk's sum and partial output, then normalize by global sum.

This is analogous to GPU flash decoding (Dao et al.) but adapted for CPU
threading.
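
The two phases can be sketched in Python as follows. This is an illustration of the partial-softmax merge only: the chunks are processed sequentially here rather than on separate threads, K/V are plain float lists rather than quantized buffers, and all function names are assumptions made for this example.

```python
import math


def chunk_partial(q, keys, values):
    """Phase 1: per-chunk (max_score, sum_exp, partial_output)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    partial = [sum(e * v[d] for e, v in zip(exps, values)) for d in range(dim)]
    return m, sum(exps), partial


def reduce_partials(partials):
    """Phase 2: rescale each chunk to the global max, then normalize."""
    global_max = max(m for m, _, _ in partials)
    total = 0.0
    out = [0.0] * len(partials[0][2])
    for m, sum_exp, partial in partials:
        w = math.exp(m - global_max)
        total += w * sum_exp
        out = [o + w * x for o, x in zip(out, partial)]
    return [o / total for o in out]


def flash_decode(q, keys, values, num_chunks):
    size = -(-len(keys) // num_chunks)   # ceiling division
    partials = [chunk_partial(q, keys[i:i + size], values[i:i + size])
                for i in range(0, len(keys), size)]
    return reduce_partials(partials)


q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.0], [-0.5, 0.7], [0.2, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [0.5, 0.5]]

single = flash_decode(q, keys, values, num_chunks=1)   # plain softmax attention
split = flash_decode(q, keys, values, num_chunks=3)
print(max(abs(a - b) for a, b in zip(single, split)) < 1e-9)
```

Each chunk's partial depends only on its own KV slice, so phase 1 is embarrassingly parallel and the reduce step touches just `num_chunks` small records per (batch, head) pair.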

### Activation Conditions

Flash path activates when ALL of:
- `ORT_GQA_DISABLE_FLASH_ATTENTION` env var is not set
- `total_sequence_length > 1`
- No softcap, no smooth softmax, no output_qk (attention bias IS
supported)

Flash decoding additionally requires:
- `sequence_length == 1` (decode phase)
- `batch_size * num_heads < thread_count` (idle threads available)
- `kv_chunk_count > 1` (enough KV to partition)
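
The two sets of conditions can be written as simple predicates. The sketch below is a paraphrase of the lists above, not the dispatch code in `group_query_attention.cc`: the function names, parameter names, and the choice to treat any presence of the environment variable as "disabled" are assumptions.

```python
import os


def use_flash_path(total_sequence_length, softcap=0.0, smooth_softmax=False,
                   output_qk=False, env=None):
    """True when the tiled flash path would be eligible (sketch)."""
    env = os.environ if env is None else env
    if "ORT_GQA_DISABLE_FLASH_ATTENTION" in env:
        return False
    if total_sequence_length <= 1:
        return False
    return softcap == 0.0 and not smooth_softmax and not output_qk


def use_flash_decoding(sequence_length, batch_size, num_heads,
                       thread_count, kv_chunk_count):
    """True when idle threads can be used to partition the KV scan (sketch)."""
    return (sequence_length == 1
            and batch_size * num_heads < thread_count
            and kv_chunk_count > 1)


# kv_chunk_count=4 is a made-up value; the other numbers mirror the two
# decode benchmark shapes (16 heads vs 4 heads, 8 threads).
print(use_flash_decoding(1, 1, 16, 8, kv_chunk_count=4))  # False
print(use_flash_decoding(1, 1, 4, 8, kv_chunk_count=4))   # True
```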

## Benchmark Results

Measured on Intel Xeon Platinum 8480C, 96 CPUs, threads=8. MLAS-level
C++ benchmark.

### Latency — Prefill (S = T)

Shape: B=1, num_heads=16, kv_num_heads=8, head_size=128.

| Seq Length | Naive (ms) | Flash (ms) | Speedup | Quant |
|---:|---:|---:|---:|:---|
| 512 | 9.9 | 8.1 | 1.2x | per-tensor |
| 1024 | 44.4 | 27.0 | 1.6x | per-tensor |
| 2048 | 190.9 | 116.9 | 1.6x | per-tensor |
| 4096 | 1257.8 | 461.6 | 2.7x | per-tensor |
| 512 | 10.7 | 10.8 | 1.0x | per-channel |
| 1024 | 49.5 | 41.7 | 1.2x | per-channel |
| 2048 | 212.1 | 164.1 | 1.3x | per-channel |
| 4096 | 1223.9 | 607.8 | 2.0x | per-channel |

### Latency — Decode (S = 1, no flash decoding)

Shape: B=1, num_heads=16, kv_num_heads=8, head_size=128. Flash decoding
NOT active (batch×heads=16 > threads=8).

| Total Seqlen | Naive (us) | Flash (us) | Speedup | Quant |
|---:|---:|---:|---:|:---|
| 512 | 32 | 22 | 1.4x | per-tensor |
| 1024 | 71 | 47 | 1.5x | per-tensor |
| 2048 | 120 | 87 | 1.4x | per-tensor |
| 4096 | 210 | 174 | 1.2x | per-tensor |
| 512 | 53 | 31 | 1.7x | per-channel |
| 1024 | 86 | 52 | 1.7x | per-channel |
| 2048 | 172 | 97 | 1.8x | per-channel |
| 4096 | 299 | 191 | 1.6x | per-channel |

### Latency — Flash Decoding (S = 1, KV partitioned across threads)

Shape: B=1, num_heads=4, kv_num_heads=4 (MHA), head_size=128. Flash
decoding IS active (batch×heads=4 < threads=8).

| Total Seqlen | Naive (us) | Flash (us) | Speedup | Quant |
|---:|---:|---:|---:|:---|
| 512 | 31 | 25 | 1.2x | per-tensor |
| 1024 | 41 | 25 | 1.6x | per-tensor |
| 2048 | 67 | 34 | 2.0x | per-tensor |
| 4096 | 197 | 54 | 3.7x | per-tensor |
| 512 | 25 | 28 | 0.9x | per-channel |
| 1024 | 72 | 27 | 2.7x | per-channel |
| 2048 | 144 | 37 | 3.9x | per-channel |
| 4096 | 304 | 60 | 5.1x | per-channel |

### Peak Memory — Prefill

| Seq Length | Naive Peak | Flash Peak | Memory Reduction |
|---:|---:|---:|---:|
| 2048 (N=16) | +294 MB | +44 MB | 6.7x |
| 4096 (N=16) | +1107 MB | +82 MB | 13.5x |
| 4096 (N=32) | +2131 MB | +87 MB | 24.5x |

**Summary**: Prefill gains up to 2.7x in latency (1.2–2.7x per-tensor,
1.0–2.0x per-channel) and a 7–24x memory reduction from tiled online
softmax. Decode gains 1.2–1.8x from fused dequant+dot alone. Flash
decoding adds 2–5x for long sequences (total sequence length of 2048 or
more) when idle threads are available to partition the KV scan.

### How to Reproduce

```bash
# Build ORT
python tools/ci_build/build.py --build_dir build/cpu --config Release \
  --parallel --build_wheel --skip_tests

# MLAS-level C++ benchmark:
cd build/cpu/Release
./onnxruntime_mlas_benchmark \
  --benchmark_filter='BM_GQA_(Naive|Flash)' \
  --benchmark_min_time=0.5s \
  --benchmark_repetitions=3 \
  --benchmark_report_aggregates_only=true
```

## Testing

- All 35 CPU `GroupQueryAttentionTest.*` tests pass (INT8/INT4,
per-tensor/per-channel, multi-batch, large head, GQA ratio variants)
- Set `ORT_GQA_DISABLE_FLASH_ATTENTION=1` to verify fallback path still
works
- End-to-end verified with `quantized_kv_cache_cpu_demo.py`
- Numerical agreement between flash and naive paths: max diff < 1e-7
…t#28710)

### Description

Add a new `--paths` option to `compile_contributors.py` to limit git
history queries using pathspecs.

Apply the path filter to both base and target git log collection and log
the active path filter in logs.txt.
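
For readers unfamiliar with pathspecs, the sketch below shows how a path filter is typically appended to a `git log` invocation. It does not reproduce `compile_contributors.py`: the function name, the `--format=%an` choice, and the example revision range and path are assumptions for illustration.

```python
def build_git_log_args(revision_range, paths=None):
    """Build a `git log` argument list, optionally limited by pathspecs."""
    args = ["git", "log", "--format=%an", revision_range]
    if paths:
        # "--" separates revisions from pathspecs so neither is misparsed.
        args.append("--")
        args.extend(paths)
    return args


unfiltered = build_git_log_args("v1.0.0..v1.1.0")
filtered = build_git_log_args("v1.0.0..v1.1.0",
                              ["onnxruntime/core/providers/webgpu"])
print(" ".join(filtered))
```

Applying the same `paths` list to both the base and the target query keeps the two commit sets comparable.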

### Motivation and Context

Allow `compile_contributors.py` to be used for releases where relevant
changes are largely limited to a subset of the codebase. E.g., we can
limit the paths to WebGPU EP-related files for the WebGPU plugin EP
release.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description

Update CUDA version from 12.8 to 13.0 across all CUDA CI workflow files:

- `linux_cuda_ci.yml` — `--cuda_version`, `--cuda_home`, `--cudnn_home`
in build and test jobs
- `linux_cuda_plugin_ci.yml` — same flags
- `windows_cuda.yml` — SDK download URL, PATH entries, `--cuda_home` in
build and test jobs
- `windows_cuda_plugin.yml` — same as above

### Motivation and Context

PRs that break CUDA 13 builds pass CI today because all pipelines target
CUDA 12.8. This moves CI to CUDA 13.0 for build-time coverage.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
…28691)

### Description

For ORT 1.27, GPU release artifacts (zip/tgz) now include an explicit
CUDA major version suffix to distinguish between CUDA 12 and CUDA 13
builds.

**Before:** `onnxruntime-linux-x64-gpu-1.27.0.tgz`,
`onnxruntime-win-x64-gpu-1.27.0.zip`
**After:** `onnxruntime-linux-x64-gpu_cuda12-1.27.0.tgz`,
`onnxruntime-win-x64-gpu_cuda13-1.27.0.zip`, etc.
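
The naming pattern can be expressed as a small helper. This is only a sketch of the pattern visible in the examples above; the function `gpu_artifact_name` and its parameters are hypothetical and are not taken from the packaging pipeline.

```python
def gpu_artifact_name(platform, version, extension, cuda_major=None):
    """Compose a GPU artifact file name, with an optional CUDA major suffix."""
    suffix = f"_cuda{cuda_major}" if cuda_major is not None else ""
    return f"onnxruntime-{platform}-gpu{suffix}-{version}.{extension}"


old_name = gpu_artifact_name("linux-x64", "1.27.0", "tgz")
new_name = gpu_artifact_name("linux-x64", "1.27.0", "tgz", cuda_major=12)
print(old_name)  # onnxruntime-linux-x64-gpu-1.27.0.tgz
print(new_name)  # onnxruntime-linux-x64-gpu_cuda12-1.27.0.tgz
```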



### Motivation and Context

---------

Co-authored-by: Kusuma Padma Kavya Bandi <kusbandi@microsoft.com>
## Summary

Unifies the NHWC-eligible op allowlist between the bundled CUDA EP and
the CUDA plugin EP into a single shared header, adds kernel-miss
diagnostics, and expands NHWC test coverage from 4 ops to 11.

## Motivation

The bundled EP (`cuda_execution_provider.cc`) and the plugin EP
(`plugin/cuda_ep.cc`) independently maintained their own copies of the
NHWC allowlist. This created a maintenance hazard where ops could be
added to one but not the other, leading to silent divergence.
Additionally, there was no runtime diagnostic when the framework rewrote
a node to the NHWC domain but the plugin EP lacked a matching kernel —
failures were silent fallbacks to CPU.

## Key Changes

### Shared NHWC Allowlist (`cuda_nhwc_ops.h`)

| Item | Detail |
|------|--------|
| New file | `onnxruntime/core/providers/cuda/cuda_nhwc_ops.h` |
| Contents | `IsNhwcEligibleOnnxOp()`, `IsNhwcEligibleMsOp()`, `IsNhwcEligible()` inline functions |
| Ops covered | AveragePool, BatchNormalization, Conv, ConvTranspose, DepthToSpace, GlobalAveragePool, GlobalMaxPool, GridSample, LRN, MaxPool, SpaceToDepth (+ MS-domain GridSample) |

### Bundled EP Refactor (`cuda_execution_provider.cc`)

- Removed the static `std::unordered_set<std::string_view>
cuda_nhwc_onnx_ops` and the inline domain check logic.
- Replaced with a single call to `cuda::IsNhwcEligible(node_domain,
node_op_type)`.
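
The idea of a single shared allowlist can be sketched in Python. The op names come from the table above; everything else is an assumption for illustration, including the function name `is_nhwc_eligible` and the use of `""` and `"com.microsoft"` as the ONNX and MS domain strings. The real helpers are C++ inline functions in `cuda_nhwc_ops.h`.

```python
ONNX_DOMAIN = ""               # assumed spelling of the default ONNX domain
MS_DOMAIN = "com.microsoft"    # assumed spelling of the MS contrib domain

NHWC_ONNX_OPS = frozenset({
    "AveragePool", "BatchNormalization", "Conv", "ConvTranspose",
    "DepthToSpace", "GlobalAveragePool", "GlobalMaxPool", "GridSample",
    "LRN", "MaxPool", "SpaceToDepth",
})
NHWC_MS_OPS = frozenset({"GridSample"})


def is_nhwc_eligible(domain, op_type):
    """Single source of truth that both EP variants would consult."""
    if domain == ONNX_DOMAIN:
        return op_type in NHWC_ONNX_OPS
    if domain == MS_DOMAIN:
        return op_type in NHWC_MS_OPS
    return False


print(is_nhwc_eligible("", "Conv"))                    # True
print(is_nhwc_eligible("com.microsoft", "GridSample"))  # True
print(is_nhwc_eligible("com.microsoft", "Conv"))        # False
```

With one definition, adding an op to the list changes the behavior of both callers at once, which removes the divergence risk described in the motivation.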

### Plugin EP Refactor & Diagnostics (`plugin/cuda_ep.cc`)

- `ShouldConvertDataLayoutForOpImpl`: Replaced ~20 lines of static set +
domain checks with a single `cuda::IsNhwcEligible()` call.
- `GetCapabilityImpl`: Added a WARNING-level diagnostic in the `else`
branch (kernel not found). When a node in the `com.ms.internal.nhwc`
domain has no registered kernel, the log emits the op type, domain,
version, and node name — making future NHWC registration gaps
immediately visible at session creation.

### Expanded NHWC Test Coverage (`test_cuda_plugin_ep.py`)

- Added `_assert_nhwc_domain_assigned()` helper that verifies NHWC
layout transformation occurred by checking for framework-inserted
Transpose nodes in the EP's assignment info.
- Added `_run_nhwc_model_test()` helper combining domain assertion +
numerical validation.
- Updated 4 existing NHWC tests (Conv, BatchNormalization, MaxPool,
AveragePool) to include structural assertions.
- Added 7 new NHWC test methods:
  - `test_nhwc_conv_transpose`
  - `test_nhwc_global_max_pool`
  - `test_nhwc_global_average_pool`
  - `test_nhwc_depth_to_space`
  - `test_nhwc_space_to_depth`
  - `test_nhwc_lrn`
  - `test_nhwc_grid_sample`

## Testing Notes

Run the full CUDA plugin EP test suite with NHWC enabled:

```bash
bash .env/cuda13_plugin.sh --build --install --test_plugin
```

Or run only the NHWC tests directly:

```bash
cd onnxruntime/test/python/transformers
ORT_TEST_CUDA_PLUGIN_EP=1 python -m unittest \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_conv \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_batch_normalization \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_maxpool \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_avgpool \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_conv_transpose \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_global_max_pool \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_global_average_pool \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_depth_to_space \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_space_to_depth \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_lrn \
  test_cuda_plugin_ep.TestCudaPluginEP.test_nhwc_grid_sample
```

All 86 tests in the suite pass (11 NHWC + 75 existing), with no
regressions.
9 participants