
Sync with Microsoft ONNX Runtime - 07042026#1028

Open
ai-fw-intg wants to merge 10 commits into ovep-develop from sync_msft_07042026

Conversation

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

hariharans29 and others added 10 commits April 6, 2026 00:42
This pull request introduces a new documentation page,
`PartitioningWithAnnotationsAndMemoryConstraints.md`, which explains
advanced ONNX Runtime features for partitioning model graphs across
devices with explicit control. The doc covers how to annotate model
layers for device assignment, collect per-node memory statistics, and
enforce GPU memory budgets during partitioning. These features enable
precise control over device placement and memory usage for large models.

The most important changes are:

**New Documentation: Advanced Partitioning Features**
* Adds a comprehensive guide
(`PartitioningWithAnnotationsAndMemoryConstraints.md`) describing how to
use ONNX Runtime’s layer annotation and memory constraint features for
graph partitioning.

**Layer Assignment via Annotations**
* Explains how to annotate ONNX model nodes with `layer_ann` metadata,
including manual annotation and automated annotation using Olive’s
`CaptureLayerAnnotations` pass.
* Provides configuration examples for mapping annotation patterns to
devices at runtime using the `session.layer_assignment_settings` session
option.
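As a minimal sketch of the mapping idea: the option key `session.layer_assignment_settings` comes from the doc above, but the value format, the `"pattern=device"` pair syntax, and the trailing-`.*` prefix-wildcard convention below are all assumptions for illustration, not ONNX Runtime's actual grammar.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical illustration: parse a mapping of annotation patterns to
// devices from a single config string, e.g. "encoder.*=GPU;decoder.*=CPU".
// The real value format of session.layer_assignment_settings may differ.
std::map<std::string, std::string> ParseLayerAssignments(const std::string& value) {
  std::map<std::string, std::string> assignments;
  std::istringstream stream(value);
  std::string entry;
  while (std::getline(stream, entry, ';')) {
    auto eq = entry.find('=');
    if (eq == std::string::npos) continue;  // skip malformed entries
    assignments[entry.substr(0, eq)] = entry.substr(eq + 1);
  }
  return assignments;
}

// Match a node's layer_ann value against a pattern, treating a trailing
// ".*" as a prefix wildcard (an assumed convention for this sketch).
bool PatternMatches(const std::string& pattern, const std::string& annotation) {
  if (pattern.size() >= 2 && pattern.compare(pattern.size() - 2, 2, ".*") == 0) {
    const std::string prefix = pattern.substr(0, pattern.size() - 2);
    return annotation.compare(0, prefix.size(), prefix) == 0;
  }
  return pattern == annotation;
}
```

At runtime such a mapping would be consulted once per annotated node during partitioning; exact-match patterns handle individually pinned layers.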

**Capacity-Aware Partitioning**
* Details a two-phase workflow for profiling per-node memory usage and
then enforcing a memory budget with the
`session.resource_cuda_partitioning_settings` session option.
* Covers both profiling-based and ad-hoc (estimation-only) approaches
for memory-constrained partitioning.
(docs/annotated_partitioning/PartitioningWithAnnotationsAndMemoryConstraints.md)
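The two-phase workflow above (profile per-node memory, then partition under a budget) can be illustrated with a toy greedy split. This is only a sketch of the budget idea; ONNX Runtime's actual capacity-aware partitioner behind `session.resource_cuda_partitioning_settings` is more sophisticated, and the `NodeCost` type here is invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Per-node memory statistic, as might be collected in the profiling phase.
struct NodeCost {
  std::string name;
  uint64_t bytes;
};

// Greedy capacity-aware split: keep nodes on the GPU until the byte budget
// is exhausted, then spill the remainder to CPU. Illustration only.
std::pair<std::vector<std::string>, std::vector<std::string>>
PartitionByBudget(const std::vector<NodeCost>& nodes, uint64_t budget_bytes) {
  std::vector<std::string> gpu_nodes, cpu_nodes;
  uint64_t used = 0;
  for (const auto& node : nodes) {
    if (used + node.bytes <= budget_bytes) {
      used += node.bytes;
      gpu_nodes.push_back(node.name);
    } else {
      cpu_nodes.push_back(node.name);  // over budget: fall back to CPU
    }
  }
  return {gpu_nodes, cpu_nodes};
}
```

In the ad-hoc (estimation-only) approach described above, the `bytes` figures would come from size estimates rather than a profiling run.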

This is a follow-up to microsoft#27595.
### Description
Increase version number to 1.26.0. The rel-1.25.0 release branch has
been cut.

### Changes
- VERSION_NUMBER: 1.25.0 → 1.26.0
- ORT_API_VERSION: 25 → 26 (header + C API struct rename)
- Python, JS, docs version strings updated via update_version.py
- C# NativeTrainingMethods ORT_API_VERSION: 23 → 26
- samples/cxx/README.md example paths updated
- docs/Versioning.md example updated

### Motivation and Context
Per release process: bump main branch version immediately after cutting
the release branch.
Proposal for CausalConvWithState and LinearAttention onnxruntime custom operators.
This follows the proposal in onnx/onnx#7767.
…osoft#27901)

Add ORT_ENFORCE checks in the SVMRegressor constructor to validate that
coefficients, support_vectors, and rho attribute array sizes are
consistent with the declared n_supports dimension. Without this
validation, a crafted model with undersized arrays causes the GEMM inner
loop to read past buffer boundaries.

This mirrors the existing validation already present in SVMClassifier.

- Validate rho is non-empty (accessed as rho_[0] in LINEAR mode, passed
to GEMM as bias in SVC mode)
- Validate coefficients.size() >= vector_count_ in SVC mode
- Validate feature_count_ > 0 after support_vectors division
- Add two unit tests for undersized coefficients and support_vectors
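The checks above can be sketched as a plain predicate. The real change uses `ORT_ENFORCE` (which throws on failure) inside the `SVMRegressor` constructor; the function name and bool-returning shape below are assumptions for a self-contained illustration.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the attribute-size validation described in the commit: returns
// true only when the coefficients, support_vectors, and rho array sizes are
// consistent with the declared n_supports dimension.
bool SvmRegressorAttributesValid(size_t n_supports,
                                 size_t coefficients_size,
                                 size_t support_vectors_size,
                                 size_t rho_size) {
  if (rho_size == 0) return false;  // rho_[0] is read in LINEAR mode
  if (n_supports > 0) {
    if (coefficients_size < n_supports) return false;  // GEMM reads n_supports coefficients
    if (support_vectors_size % n_supports != 0) return false;
    if (support_vectors_size / n_supports == 0) return false;  // feature_count_ must be > 0
  }
  return true;
}
```

Rejecting such models at session creation turns a silent out-of-bounds GEMM read into a load-time error.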

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…alues (microsoft#27789)

### Description
Fixes a heap out-of-bounds write (underflow) in the `Attention` contrib
operator's `PrepareMask` function. Negative values in the 1D
`mask_index` tensor were used directly as loop start indices without
bounds checking, allowing writes at negative offsets before the `p_mask`
buffer.

In `PrepareMask()` (`attention_helper.h`), `end_position` is read from
`mask_index[b_i]` and used as the starting index in a write loop with no
lower-bound validation. When `end_position` is negative, the loop writes
`mask_filter_value` at negative offsets, causing a heap buffer underflow. By
contrast, `start_position` was partially clamped via `std::min()` but
likewise lacked a lower bound.
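The fix amounts to clamping the raw `mask_index` value into a valid range before using it as a loop bound. The helper names below are invented for this sketch; the real change lives in `PrepareMask()` in `attention_helper.h`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Clamp a raw mask_index value to [0, sequence_length] so a negative value
// can no longer address memory before the mask buffer.
int64_t ClampMaskPosition(int64_t raw_position, int64_t sequence_length) {
  return std::min(std::max<int64_t>(raw_position, 0), sequence_length);
}

// Write filter_value from the clamped start position to the end of the mask,
// mirroring the write loop the commit describes (start index from mask_index).
void FillMaskFrom(std::vector<float>& mask, int64_t raw_start, float filter_value) {
  const int64_t start = ClampMaskPosition(raw_start, static_cast<int64_t>(mask.size()));
  std::fill(mask.begin() + start, mask.end(), filter_value);
}
```

With the clamp in place, a crafted model carrying negative mask indices degrades to a fully (or partially) filtered mask instead of corrupting the heap.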


…27752)

For the WebGPU EP:
+ ONNX RotaryEmbedding op
+ ONNX RMSNorm op
+ Reshape → opset 25
+ Transpose → opset 24
### Description

Extends the CUDA Transpose kernel registration from opset 23 to opset
25.

- **`transpose.cc`**: Cap existing opset 23 kernel to versioned `(23,
24)`, add new non-versioned kernel at opset 25
- **`cuda_execution_provider.cc`**: Update forward declarations and
`BuildKernelCreateInfo` entries to match; add new `// Opset 25` section
- **`docs/OperatorKernels.md`**: Update CUDA Transpose entry from `23+`
to `25+` with new `[23, 24]` versioned range

No functional or type constraint changes — the kernel implementation is
identical across these opsets.
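The registration change above boils down to a range rule: opsets 23–24 resolve to the newly capped versioned kernel, and 25 onward resolves to the new non-versioned entry. The toy resolver below only illustrates that matching rule; it is not ONNX Runtime's registry code, and the returned strings are invented labels.

```cpp
#include <cassert>
#include <string>

// Toy illustration of versioned-kernel resolution after this change: a
// kernel registered for (23, 24) serves only that inclusive range, while
// the non-versioned opset-25 kernel serves 25 and onward.
std::string ResolveTransposeKernel(int opset) {
  if (opset >= 23 && opset <= 24) return "Transpose(23,24)";
  if (opset >= 25) return "Transpose(25+)";
  return "older versioned kernel";  // opsets below 23 handled by earlier entries
}
```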

### Motivation and Context

CUDA EP's Transpose registration stopped at opset 23 while the ONNX spec
defines it through opset 25. This is one of the P1 gaps tracked in
microsoft#27729, following the same pattern as microsoft#27728.

### Limitations

This PR does not add support for the new data types introduced for Transpose:
- int2 (opset 25)
- float8e8m0 (opset 24)
- float4e2m1 (opset 23)
- float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz, uint4, int4
(opset 21)

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
8 participants