[MLAS] Update the NHWC sans transposes path to also support Depthwise convolutions #28565

Open
orlmon01 wants to merge 3 commits into microsoft:main from orlmon01:depthwise

@orlmon01
Contributor

Description

PR #26834 added a path that lets MLAS run NHWC convolutions without transposes.
This PR extends that path to depthwise convolutions.

What changed:

  • The shared NHWC capability gate in onnxruntime/core/mlas/lib/convolve.cpp:1348 no longer requires GroupCount == 1. It now allows GroupCount > 1 only when the op is true depthwise, meaning filters_per_group == 1.
  • The NHWC transformer in onnxruntime/core/optimizer/nhwc_transformer.cc:162 now passes the real group value and computes filter_count per group instead of hard-coding group 1. This is what lets grouped depthwise Conv/FusedConv nodes be rewritten to com.microsoft.NhwcFusedConv.
  • The KleidiAI execution path in onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp:553 learned how to handle grouped NHWC tensors by:
    • gathering one group’s channels out of interleaved NHWC input into a temporary contiguous buffer,
    • running the existing per-group kernel,
    • scattering that group’s output channels back into interleaved NHWC output.
  • Tests were added for a working NHWC depthwise case in onnxruntime/test/contrib_ops/fused_conv_test.cc:466, and the transformer tests in onnxruntime/test/optimizer/nhwc_transformer_test.cc:416 were updated to verify both the new positive case and the expected skip cases.
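
The grouping rule described for the capability gate can be sketched as a small predicate. This is only an illustration of the rule as stated in this description, not the MLAS code: the names `FiltersPerGroup` and `NhwcPathAllowsGrouping` are hypothetical, and the real gate checks other conditions (padding, for example) that are not modelled here. It assumes the total filter count is the number of output channels, so the per-group count is that total divided by the group count.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: filters handled by each group, assuming the total
// filter count is the number of output channels of the convolution.
std::size_t FiltersPerGroup(std::size_t total_filter_count, std::size_t group_count) {
    assert(group_count > 0 && total_filter_count % group_count == 0);
    return total_filter_count / group_count;
}

// Hypothetical predicate (not the MLAS signature): models only the grouping
// part of the NHWC capability check described above.
bool NhwcPathAllowsGrouping(std::size_t group_count, std::size_t total_filter_count) {
    if (group_count == 0 || total_filter_count % group_count != 0) {
        return false;  // malformed grouping, never eligible
    }
    if (group_count == 1) {
        return true;  // ordinary convolution: the only case accepted before this PR
    }
    // Grouped convolutions are accepted only when each group has one filter.
    return FiltersPerGroup(total_filter_count, group_count) == 1;
}
```

Under these assumptions, a 64-channel depthwise convolution (group 64, 64 filters) passes, while group 64 with 128 filters (multiplier 2) or group 2 with 64 filters (grouped but not depthwise) is rejected, which matches the skip cases the transformer tests are said to cover.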

Benchmarks were also added to compare the new NHWC path against the default NCHW path.
Sample output:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                                                     Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:1/Cpg:64/Fpg:64/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time                508509 ns       508507 ns         1374
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:1/Cpg:128/Fpg:128/I:28/28/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time              700573 ns       700386 ns          997
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:64/Cpg:1/Fpg:1/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time                6471094 ns      6471114 ns          132
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:72/Cpg:1/Fpg:1/I:48/80/K:3/3/P:1/1/1/1/S:2/2/D:1/1/real_time                3768969 ns      3767797 ns          217
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:1/Cpg:64/Fpg:64/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time       414198 ns       414197 ns         1688
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:1/Cpg:128/Fpg:128/I:28/28/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time     652454 ns       652454 ns         1074
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:64/Cpg:1/Fpg:1/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time       6032947 ns      6032940 ns          117
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:72/Cpg:1/Fpg:1/I:48/80/K:3/3/P:1/1/1/1/S:2/2/D:1/1/real_time       3022041 ns      3018352 ns          227
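
To read the sample output as ratios, each NCHW time can be divided by the matching NHWC time. The sketch below does that arithmetic with the real-time figures copied from the table; the struct and function names are made up for illustration. On these numbers the NHWC path comes out roughly 1.23x and 1.07x faster for the two dense shapes, and roughly 1.07x and 1.25x faster for the two depthwise shapes. This is a single sample run, so run-to-run variance is not captured.

```cpp
#include <cassert>
#include <cstdio>

// One NCHW/NHWC pair from the sample output (real time, in nanoseconds).
struct BenchPair {
    const char* shape;
    double nchw_ns;
    double nhwc_ns;
};

// Speedup of the NHWC path over the NCHW baseline; values above 1 favour NHWC.
double Speedup(const BenchPair& p) {
    assert(p.nhwc_ns > 0.0);
    return p.nchw_ns / p.nhwc_ns;
}

// Figures copied from the benchmark table above.
const BenchPair kPairs[] = {
    {"G:1 Cpg:64 Fpg:64 I:56x56 S:1", 508509.0, 414198.0},
    {"G:1 Cpg:128 Fpg:128 I:28x28 S:1", 700573.0, 652454.0},
    {"G:64 Cpg:1 Fpg:1 I:56x56 S:1", 6471094.0, 6032947.0},
    {"G:72 Cpg:1 Fpg:1 I:48x80 S:2", 3768969.0, 3022041.0},
};

// Print one line per shape with its NCHW/NHWC time ratio.
void PrintSpeedups() {
    for (const BenchPair& p : kPairs) {
        std::printf("%-34s %.2fx\n", p.shape, Speedup(p));
    }
}
```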

orlmon01 added 2 commits May 19, 2026 14:30
* Allow for NHWC Depthwise convolutions when groups are values other than 1
* Added verification tests
* Changed the fallback / skip tests to now check for asymmetric padding, non-depthwise grouped conv, and multiplier > 1

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@orlmon01
Contributor Author

@microsoft-github-policy-service agree company="Arm"

Contributor

Copilot AI left a comment


Pull request overview

Expands the existing MLAS/KleidiAI NHWC “no-transpose” convolution fast path to support true depthwise convolutions (grouped conv where filters-per-group == 1), and wires that capability through the NHWC transformer plus adds test/benchmark coverage.

Changes:

  • Relax MLAS NHWC capability gating to allow GroupCount > 1 only for true depthwise (FilterCount-per-group == 1).
  • Update NHWC transformer filtering to pass the real group count and compute per-group filter count.
  • Extend KleidiAI NHWC execution to handle grouped NHWC tensors via per-group gather/compute/scatter, plus add unit tests and a benchmark comparison.
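
The per-group gather and scatter steps can be sketched on plain float buffers. This is a minimal illustration of the indexing for interleaved NHWC data, not the KleidiAI code: `GatherGroup` and `ScatterGroup` are hypothetical names, the batch dimension is omitted, and the per-group kernel that would run between the two calls is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Copy the channels of group g out of interleaved NHWC data (pixels x
// total_channels) into a contiguous pixels x channels_per_group buffer.
std::vector<float> GatherGroup(const std::vector<float>& nhwc, std::size_t pixels,
                               std::size_t total_channels,
                               std::size_t channels_per_group, std::size_t g) {
    assert(nhwc.size() == pixels * total_channels);
    assert((g + 1) * channels_per_group <= total_channels);
    std::vector<float> group_data(pixels * channels_per_group);
    for (std::size_t p = 0; p < pixels; ++p) {
        const float* src = nhwc.data() + p * total_channels + g * channels_per_group;
        std::copy_n(src, channels_per_group, group_data.data() + p * channels_per_group);
    }
    return group_data;
}

// Write a contiguous per-group buffer back into the channel slots of group g
// in interleaved NHWC data, leaving the other groups' channels untouched.
void ScatterGroup(const std::vector<float>& group_data, std::size_t pixels,
                  std::size_t total_channels, std::size_t channels_per_group,
                  std::size_t g, std::vector<float>& nhwc) {
    assert(group_data.size() == pixels * channels_per_group);
    assert(nhwc.size() == pixels * total_channels);
    assert((g + 1) * channels_per_group <= total_channels);
    for (std::size_t p = 0; p < pixels; ++p) {
        float* dst = nhwc.data() + p * total_channels + g * channels_per_group;
        std::copy_n(group_data.data() + p * channels_per_group, channels_per_group, dst);
    }
}
```

For a depthwise convolution each group holds a single channel, so every gather and scatter moves one float per pixel with a stride of the total channel count.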

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File: Description
onnxruntime/core/mlas/lib/convolve.cpp: Updates NHWC capability gate to allow depthwise grouped convs.
onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp: Implements grouped-NHWC handling by gathering/scattering channels per group.
onnxruntime/core/optimizer/nhwc_transformer.cc: Passes group count + per-group filter count into the NHWC fast-path capability check.
onnxruntime/core/providers/cpu/nn/conv.h: Broadens KleidiAI fast-path compilation guard to MLAS_TARGET_ARM64.
onnxruntime/core/providers/cpu/nn/conv.cc: Same guard update for KleidiAI fast-path code.
onnxruntime/test/optimizer/nhwc_transformer_test.cc: Adds/updates tests validating depthwise enablement and expected skip cases.
onnxruntime/test/contrib_ops/fused_conv_test.cc: Adds an NHWC depthwise FusedConv correctness test (conditionally enabled).
onnxruntime/test/mlas/bench/bench_sconv.cpp: Adds benchmark cases comparing NCHW baseline vs NHWC KleidiAI fast path, including depthwise shapes.


Comment on lines 193 to 198
const auto group = node.GetAttributeInt("group").value_or(1);
if (group != 1) {
if (group <= 0) {
return false;
}
const auto group_count = narrow<size_t>(group);

Comment on lines 559 to +567
for (size_t g = 0; g < groups; ++g) {
    const float* input_group = in;
    std::vector<float> input_group_buffer;
    if (grouped_channels_last) {
        input_group_buffer.resize(ih * iw * ci);
        for (size_t pixel = 0; pixel < ih * iw; ++pixel) {
            const float* src = input_base + pixel * input_channels_total + g * ci;
            std::copy_n(src, ci, input_group_buffer.data() + pixel * ci);
        }

Comment on lines +156 to +160
if (rank <= 0) throw std::invalid_argument("Kernel rank must be greater than 0!");
if (batch_size <= 0) throw std::invalid_argument("Batch size must be greater than 0!");
if (groups <= 0) throw std::invalid_argument("Group count must be greater than 0!");
if (input_channels_per_group <= 0) throw std::invalid_argument("input_channels_per_group must be greater than 0!");
if (output_channels_per_group <= 0) throw std::invalid_argument("output_channels_per_group must be greater than 0!");