Add managed surface for AVX-512 BF16 hardware intrinsics (draft) by jamesburton · Pull Request #129326 · dotnet/runtime

jamesburton · 2026-06-12T13:33:18Z

This is a DRAFT. Filed early to confirm ISA grouping and shape with @tannergooding via #129323 before completing the JIT-side codegen wiring. Please see the issue for the full design rationale, hardware table (including Strix Halo as an AMD reference device), and the A/B/C options I outlined for the grouping.

Closes #129323 (will be left open until confirmed).

What this adds

Wires up the public Avx512Bf16 / Avx512Bf16.VL / Avx512Bf16.X64 classes in System.Runtime.Intrinsics.X86, exposing VDPBF16PS, VCVTNE2PS2BF16, and VCVTNEPS2BF16 on the classic AVX-512-BF16 gate rather than only via AVX10v1. Direct sibling of the AvxVnni.V512 work landed in PR #128365.

Key design choices (mirrors #128365 patterns)

New ISA group AVX512_BF16 with implication AVX512_BF16 -> AVX512v3 and implication AVX10v1 -> AVX512_BF16. This deliberately does not fold BF16 into AVX512v3 because v3-without-BF16 is a real shipping configuration (Intel Tiger Lake / Rocket Lake / Ice Lake-U). Folding would regress those parts' v3 surface. Preserves Tanner's "broadest overall support" principle. AMD Zen 5 (Strix Halo) and Sapphire Rapids+ enable the classic gate; AVX10v1 hardware picks up BF16 via the AVX10v1 -> AVX512_BF16 implication.
BFloat16 element type from existing System.BFloat16 BCL primitive. No Vector*<ushort> interim.
Existing R2R IDs (72, 73) preserved — re-pointed from AVX10v1 to AVX512_BF16. R2R image compatibility preserved.
JIT dispatch in Compiler::lookupInstructionSet (returns InstructionSet_AVX512_BF16 for the Avx512Bf16 class-name). VLVersionOfIsa and X64VersionOfIsa cases added. Mirrors AVXVNNIINT / AVXVNNIINT_V512 pattern Tanner endorsed.
HARDWARE_INTRINSIC rows under AVX512_BF16 with simdSize=-1 and HW_Flag_SpecialCodeGen|HW_Flag_SpecialImport matching the AVXVNNIINT precedent — avoids the fixed-simdSize=64 trap Copilot kept flagging on Add AvxVnni.V512 hardware intrinsics #128365.
CPUID bit detection in cpufeatures.c (subleaf 1 EAX bit 5) gated on AVX512v3 being enabled, emitting a new XArchIntrinsicConstants_Avx512Bf16 constant.
New configs EXTERNAL_EnableAVX512_BF16 (retail) / EnableAVX512_BF16 (JIT). Existing config-comment tokens preserved.
codeman.cpp gate — plain shape, no fallback branches (Tanner explicitly rejected codeman-side fallbacks on the Add AvxVnni.V512 hardware intrinsics #128365 codeman.cpp:1430 thread).
JIT-EE GUID regenerated via gen.bat: 2d316351-72be-474a-9a53-d204621127e2.
Tests: CpuId.cs and SmokeTests/HardwareIntrinsics/Program.cs — only the Avx512Bf16 lines uncommented (Avx512Bitalg / Avx512Vpopcntdq / Avx512Fp16 left commented because their managed types still don't exist — direct application of the Add AvxVnni.V512 hardware intrinsics #128365 lesson where over-eager uncommenting caused 11 NativeAOT build failures).

Still TODO (after the issue confirms direction)

JIT-side special codegen for VDPBF16PS / VCVTNE2PS2BF16 / VCVTNEPS2BF16 in hwintrinsiccodegenxarch.cpp and lsraxarch.cpp (needed before the intrinsic actually executes — currently the HARDWARE_INTRINSIC rows use INS_invalid slots).
Hand-written tests in src/tests/JIT/HardwareIntrinsics/X86_Avx/Avx512Bf16/.

These are deliberately out of scope for the draft. Once @tannergooding confirms the ISA grouping in #129323 they can be wired in a follow-up commit on this branch.

Files touched

File	Why
`InstructionSetDesc.txt`	Fill in managed name, re-point parent from `AVX10v1` to `AVX512_BF16`, add implications
`hwintrinsic.cpp` (IsaRangeArray)	New entry to match shifted enum
`hwintrinsiclistxarch.h`	`HARDWARE_INTRINSIC` rows with `SpecialCodeGen` flags
`hwintrinsicxarch.cpp`	`lookupInstructionSet` dispatch; `VLVersionOfIsa` / `X64VersionOfIsa` cases
`cpufeatures.h` / `cpufeatures.c`	New constant + CPUID detection
`HardwareIntrinsicHelpers.cs`	AOT-path constant + InstructionSet mapping
`codeman.cpp`	Gate the new InstructionSet on the constant + config
`clrconfigvalues.h` / `jitconfigvalues.h`	New config flags
`Avx512Bf16.cs` / `Avx512Bf16.PlatformNotSupported.cs`	Public API surface
`System.Private.CoreLib.Shared.projitems`	Include new files
`System.Runtime.Intrinsics.cs` (ref)	Public surface in the ref assembly
`CpuId.cs`, `SmokeTests/Program.cs`	Tests
`corinfoinstructionset.h`, `CorInfoInstructionSet.cs`, `ReadyToRunInstructionSetHelper.cs`, `jiteeversionguid.h`	Generated by `gen.bat` — do not hand-edit

🤖 Generated with Claude Code

@tannergooding

Wires up the public Avx512Bf16 / Avx512Bf16.VL / Avx512Bf16.X64 classes in System.Runtime.Intrinsics.X86, exposing VDPBF16PS, VCVTNE2PS2BF16, and VCVTNEPS2BF16 on the classic AVX-512-BF16 gate rather than only via AVX10v1. This is a direct sibling of the AvxVnni.V512 work landed in PR dotnet#128365 — the issue and reasoning are tracked at dotnet#129323 and the design follows the patterns @tannergooding endorsed on the VNNI redesign. Key choices, called out for review: - New ISA group `AVX512_BF16` with `implication AVX512_BF16 -> AVX512v3` and `implication AVX10v1 -> AVX512_BF16`. This deliberately does NOT fold BF16 into AVX512v3 because v3-without-BF16 is a real shipping configuration on Intel Tiger Lake / Rocket Lake / Ice Lake-U; folding would regress those parts' v3 surface. The implication chain preserves Tanner's "broadest overall support" principle. AMD Zen 5 (Strix Halo) and Sapphire Rapids+ enable the classic gate directly; AVX10v1 hardware picks up BF16 via the AVX10v1 -> AVX512_BF16 implication. - `BFloat16` element type from `System.BFloat16` (already in the BCL via `BitConverter.GetBytes(BFloat16)` etc.). No `Vector*<ushort>` interim. - The two existing R2R IDs (72, 73 for `Avx512Bf16` / `Avx512Bf16_VL`) are kept and re-pointed from `AVX10v1` to `AVX512_BF16` in `InstructionSetDesc.txt`. R2R image compatibility preserved. - JIT dispatch in `Compiler::lookupInstructionSet` returns `InstructionSet_AVX512_BF16` for the `Avx512Bf16` class-name (was `AVX10v1`); `VLVersionOfIsa` and `X64VersionOfIsa` cases added. - HARDWARE_INTRINSIC rows added under `AVX512_BF16` with `simdSize=-1` and `HW_Flag_SpecialCodeGen|HW_Flag_SpecialImport` matching the AVXVNNIINT precedent, so the special-codegen path will handle base-type dispatch from the BFloat16 args. - CPUID bit detection added in `cpufeatures.c` (subleaf 1 EAX bit 5) gated on AVX512v3 already being enabled, emitting a new `XArchIntrinsicConstants_Avx512Bf16` bit. - New `EXTERNAL_EnableAVX512_BF16` / `EnableAVX512_BF16` configs in `clrconfigvalues.h` and `jitconfigvalues.h`. - `codeman.cpp` gates `InstructionSet_AVX512_BF16` on the new CPUID bit and config. No fallback branches (Tanner rejected codeman-side fallbacks on dotnet#128365). - Generated files (`corinfoinstructionset.h`, `CorInfoInstructionSet.cs`, `ReadyToRunInstructionSetHelper.cs`) regenerated via `gen.bat`; fresh JIT-EE GUID `2d316351-72be-474a-9a53-d204621127e2`. - Tests in `CpuId.cs` and `SmokeTests/HardwareIntrinsics/Program.cs` uncommented for the new managed type (only those — Avx512Bitalg / Avx512Vpopcntdq / Avx512Fp16 left commented because their managed types still do not exist, per the dotnet#128365 lesson). DRAFT: the JIT-side special codegen for VDPBF16PS / VCVTNE2PS2BF16 / VCVTNEPS2BF16 in `hwintrinsiccodegenxarch.cpp` is not yet wired and is the next step. Filing as draft to confirm the ISA grouping and shape with @tannergooding via dotnet#129323 before that work. Related: dotnet#129323

dotnet-policy-service · 2026-06-12T13:34:38Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR enables plumbing for the AVX512-BF16 instruction set across the runtime stack (feature detection, JIT ISA exposure, and public intrinsics surface), and unblocks tests to validate support reporting.

Changes:

Expose System.Runtime.Intrinsics.X86.Avx512Bf16 (including VL and X64 nested types) in ref + CoreLib, and wire it into instruction-set tables.
Detect and surface AVX512-BF16 CPU capability via minipal → CoreCLR CPU flags → JIT instruction set flags / R2R mapping.
Enable/extend JIT + NativeAOT smoke tests to validate AVX512-BF16 support and CPUID bit correctness.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 15 comments.

Show a summary per file

File	Description
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx512Bf16.cs	Adds the AVX512-BF16 intrinsic type + APIs (supported implementation).
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx512Bf16.PlatformNotSupported.cs	Adds PlatformNotSupported stubs for AVX512-BF16 APIs.
src/libraries/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs	Adds public ref surface for `Avx512Bf16`, `VL`, and `X64`.
src/native/minipal/cpufeatures.h / cpufeatures.c	Adds AVX512-BF16 feature flag + CPUID-based detection.
src/coreclr/vm/codeman.cpp	Enables the ISA for JIT via config gating + detected CPU feature.
src/coreclr/jit/, src/coreclr/inc/, tools/Common/*	Adds ISA enum values, mappings, implications, intrinsic lists, and R2R plumbing for AVX512_BF16.
src/tests/nativeaot/SmokeTests/HardwareIntrinsics/Program.cs	Turns on AVX512-BF16 smoke checks.
src/tests/JIT/HardwareIntrinsics/X86/X86Base/CpuId.cs	Validates CPUID bit for AVX512-BF16 against `Avx512Bf16.IsSupported`.

+        /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
+        /// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value>
+        /// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks>
+        public static new bool IsSupported { get => IsSupported; }


+            /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
+            /// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value>
+            /// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks>
+            public static new bool IsSupported { get => IsSupported; }


+        ///   <para>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)</para>
+        ///   <para>  VDPBF16PS zmm1, zmm2, zmm3/m512</para>
+        /// </summary>
+        public static Vector512<float> MultiplyWideningAndAdd(Vector512<float> addend, Vector512<BFloat16> left, Vector512<BFloat16> right) => MultiplyWideningAndAdd(addend, left, right);


+        ///   <para>  VCVTNE2PS2BF16 zmm1, zmm2, zmm3/m512</para>
+        ///   <para>Round-to-nearest-even conversion of two packed FP32 vectors into a single packed BF16 vector.</para>
+        /// </summary>
+        public static Vector512<BFloat16> ConvertToBFloat16(Vector512<float> lower, Vector512<float> upper) => ConvertToBFloat16(lower, upper);


+        ///   <para>  VCVTNEPS2BF16 ymm1, zmm2/m512</para>
+        ///   <para>Round-to-nearest-even conversion of a packed FP32 vector into a packed BF16 vector half-width.</para>
+        /// </summary>
+        public static Vector256<BFloat16> ConvertToBFloat16(Vector512<float> value) => ConvertToBFloat16(value);


+            ///   <para>__m128bh _mm_cvtneps_pbh (__m128 a)</para>
+            ///   <para>  VCVTNEPS2BF16 xmm1, xmm2/m128</para>
+            /// </summary>
+            public static Vector128<BFloat16> ConvertToBFloat16(Vector128<float> value) => ConvertToBFloat16(value);


+            ///   <para>__m128bh _mm256_cvtneps_pbh (__m256 a)</para>
+            ///   <para>  VCVTNEPS2BF16 xmm1, ymm2/m256</para>
+            /// </summary>
+            public static Vector128<BFloat16> ConvertToBFloat16(Vector256<float> value) => ConvertToBFloat16(value);


+                case (InstructionSet.X64_AVX512_BF16, TargetArchitecture.X64):
+                case (InstructionSet.X64_AVX512_BF16_X64, TargetArchitecture.X64):
+                {
+                    var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
+                    if (type != null)
+                    {
+                        yield return type;
+                        if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
+                        {
+                            var nestedType = type.GetNestedType("X64"u8);
+                            if (nestedType != null)
+                            {
+                                yield return nestedType;
+                            }
+                        }
+                    }
+                }
+                {
+                    var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
+                    if (parentType != null)
+                    {
+                        yield return parentType;
+                        var nestedType = parentType.GetNestedType("VL"u8);
+                        if (nestedType != null)
+                        {
+                            yield return nestedType;
+                            if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
+                            {
+                                var nestedType64 = parentType.GetNestedType("VL_X64"u8);
+                                if (nestedType64 != null)
+                                {
+                                    yield return nestedType64;
+                                }
+                            }
+                        }
+                    }
+                }
+                break;


+                case (InstructionSet.X86_AVX512_BF16, TargetArchitecture.X86):
+                {
+                    var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
+                    if (type != null)
+                    {
+                        yield return type;
+                    }
+                }
+                {
+                    var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
+                    if (parentType != null)
+                    {
+                        yield return parentType;
+                        var nestedType = parentType.GetNestedType("VL"u8);
+                        if (nestedType != null)
+                        {
+                            yield return nestedType;
+                        }
+                    }
+                }
+                break;


 #define XArchIntrinsicConstants_WaitPkg (1 << 16)
 #define XArchIntrinsicConstants_X86Serialize (1 << 17)
 #define XArchIntrinsicConstants_AVX512Bmm (1 << 18)
+#define XArchIntrinsicConstants_Avx512Bf16 (1 << 19)


BFloat16 is declared in System.Numerics (System/Numerics/BFloat16.cs). The ref assembly referenced System.BFloat16 (12 sites, producing 12x CS0234) and the impl files used unqualified BFloat16 without a using directive (would CS0246 once compiled). Fix: - src/libraries/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs: s/System.BFloat16/System.Numerics.BFloat16/ (12 sites) - Avx512Bf16.cs and Avx512Bf16.PlatformNotSupported.cs: added 'using System.Numerics;'

The JIT asserts (hwintrinsic.cpp:1120) that HARDWARE_INTRINSIC entries within an ISA range are sorted alphabetically by method name. The AVX512_BF16 block had MultiplyWideningAndAdd before ConvertToBFloat16, which fails strcmp ordering and crashes crossgen2 during corelib R2R generation. Reorder so ConvertToBFloat16 is first (and update the FIRST_NI / LAST_NI markers accordingly). Caught by a clean dev-branch build that combined this PR with dotnet#128365.

Copilot

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 15 comments.

+        /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
+        /// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value>
+        /// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks>
+        public static new bool IsSupported { get => IsSupported; }


+        ///   <para>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)</para>
+        ///   <para>  VDPBF16PS zmm1, zmm2, zmm3/m512</para>
+        /// </summary>
+        public static Vector512<float> MultiplyWideningAndAdd(Vector512<float> addend, Vector512<BFloat16> left, Vector512<BFloat16> right) => MultiplyWideningAndAdd(addend, left, right);


+        ///   <para>  VCVTNE2PS2BF16 zmm1, zmm2, zmm3/m512</para>
+        ///   <para>Round-to-nearest-even conversion of two packed FP32 vectors into a single packed BF16 vector.</para>
+        /// </summary>
+        public static Vector512<BFloat16> ConvertToBFloat16(Vector512<float> lower, Vector512<float> upper) => ConvertToBFloat16(lower, upper);


+        ///   <para>  VCVTNEPS2BF16 ymm1, zmm2/m512</para>
+        ///   <para>Round-to-nearest-even conversion of a packed FP32 vector into a packed BF16 vector half-width.</para>
+        /// </summary>
+        public static Vector256<BFloat16> ConvertToBFloat16(Vector512<float> value) => ConvertToBFloat16(value);


+            /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
+            /// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value>
+            /// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks>
+            public static new bool IsSupported { get => IsSupported; }


+            ///   <para>__m128bh _mm256_cvtneps_pbh (__m256 a)</para>
+            ///   <para>  VCVTNEPS2BF16 xmm1, ymm2/m256</para>
+            /// </summary>
+            public static Vector128<BFloat16> ConvertToBFloat16(Vector256<float> value) => ConvertToBFloat16(value);


    <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512F$(NotSupportedOnMono).cs" />
    <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi$(NotSupportedOnMono).cs" />
    <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi2$(NotSupportedOnMono).cs" />
+    <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Bf16$(NotSupportedOnMono).cs" />


    <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\AvxVnniInt16.PlatformNotSupported.cs" />
    <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi.PlatformNotSupported.cs" />
    <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi2.PlatformNotSupported.cs" />
+    <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Bf16.PlatformNotSupported.cs" />


+                case (InstructionSet.X64_AVX512_BF16, TargetArchitecture.X64):
+                case (InstructionSet.X64_AVX512_BF16_X64, TargetArchitecture.X64):
+                {
+                    var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
+                    if (type != null)
+                    {
+                        yield return type;
+                        if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
+                        {
+                            var nestedType = type.GetNestedType("X64"u8);
+                            if (nestedType != null)
+                            {
+                                yield return nestedType;
+                            }
+                        }
+                    }
+                }
+                {
+                    var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
+                    if (parentType != null)
+                    {
+                        yield return parentType;
+                        var nestedType = parentType.GetNestedType("VL"u8);
+                        if (nestedType != null)
+                        {
+                            yield return nestedType;
+                            if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
+                            {
+                                var nestedType64 = parentType.GetNestedType("VL_X64"u8);
+                                if (nestedType64 != null)
+                                {
+                                    yield return nestedType64;
+                                }
+                            }
+                        }
+                    }
+                }
+                break;


 #define XArchIntrinsicConstants_WaitPkg (1 << 16)
 #define XArchIntrinsicConstants_X86Serialize (1 << 17)
 #define XArchIntrinsicConstants_AVX512Bmm (1 << 18)
+#define XArchIntrinsicConstants_Avx512Bf16 (1 << 19)


@tannergooding

@tannergooding's review on the API proposal (dotnet#129323): the strongly-typed BFloat16 element shape requires extensive JIT/VM/ Debugger/ABI work to make Vector128<BFloat16>.IsSupported report true, which is the same reason F16C/FP16 have not landed yet. The cheaper path he endorsed is to ship the initial signatures as Vector*<ushort> (raw bf16 bit pattern) and add Vector*<BFloat16> overloads later when the primitive-type plumbing is ready. This commit applies that pivot: - Avx512Bf16.cs / Avx512Bf16.PlatformNotSupported.cs: all BFloat16 parameter / return types switched to ushort. - System.Runtime.Intrinsics.cs (ref): same. Method names (ConvertToBFloat16) are unchanged — they refer to the operation, not the parameter type. Existing JIT plumbing (InstructionSetDesc, codeman gate, lookupInstructionSet dispatch, HARDWARE_INTRINSIC rows, CPUID detection, configs) is unaffected.

- MultiplyWideningAndAdd: drop SpecialCodeGen/SpecialImport, move INS_vdpbf16ps from FLOAT slot to USHORT slot (BaseTypeFromSecondArg resolves to USHORT). Standard table-driven 3-arg case dispatches to genHWIntrinsic_R_R_R_RM, matching AVXVNNI MultiplyWideningAndAdd. - ConvertToBFloat16: keep SpecialCodeGen, add handler in genAvxFamilyIntrinsic that picks INS_vcvtne2ps2bf16 (2-arg) vs INS_vcvtneps2bf16 (1-arg) by node operand count. - LSRA: add NI_AVX512_BF16_MultiplyWideningAndAdd alongside AVXVNNI for RMW operand-use pattern (op1 = accumulator/target). Unblocks downstream consumer (BF16 quant kernel on Zen5 Strix Halo) reporting NotImplementedException on all three V512 intrinsics. ISA grouping (option-a / AVX10v1-lie / AVX512v4) still pending maintainer direction on dotnet#129323 — codegen is robust to either path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The genHWIntrinsic per-ISA switch (hwintrinsiccodegenxarch.cpp:996) routes SpecialCodeGen-flagged intrinsics to their family handler. AVX512_BF16 was missing — non-table-driven BF16 intrinsics (ConvertToBFloat16) fell through to default:unreached(), which calls fatal(CORJIT_RECOVERABLEERROR) in Release builds and surfaces to the runtime as InvalidProgramException. MultiplyWideningAndAdd is table-driven (no SpecialCodeGen) so it took the generic code path and worked. ConvertToBFloat16 needs SpecialCodeGen to dispatch by arg count (1-arg → VCVTNEPS2BF16, 2-arg → VCVTNE2PS2BF16), so it requires this fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 10 comments.

+                case (InstructionSet.X64_AVX512_BF16, TargetArchitecture.X64):
+                case (InstructionSet.X64_AVX512_BF16_X64, TargetArchitecture.X64):
+                {
+                    var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
+                    if (type != null)
+                    {
+                        yield return type;
+                        if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
+                        {
+                            var nestedType = type.GetNestedType("X64"u8);
+                            if (nestedType != null)
+                            {
+                                yield return nestedType;
+                            }
+                        }
+                    }
+                }
+                {
+                    var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
+                    if (parentType != null)
+                    {
+                        yield return parentType;
+                        var nestedType = parentType.GetNestedType("VL"u8);
+                        if (nestedType != null)
+                        {
+                            yield return nestedType;
+                            if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
+                            {
+                                var nestedType64 = parentType.GetNestedType("VL_X64"u8);
+                                if (nestedType64 != null)
+                                {
+                                    yield return nestedType64;
+                                }
+                            }
+                        }
+                    }
+                }


+                        if (resultflags.HasInstructionSet(InstructionSet.X64_AVX512v3))
+                            resultflags.AddInstructionSet(InstructionSet.X64_AVX512_BF16);
+                        if (resultflags.HasInstructionSet(InstructionSet.X64_AVX512_BF16))
+                            resultflags.AddInstructionSet(InstructionSet.X64_AVX10v1);


+                        if (resultflags.HasInstructionSet(InstructionSet.X86_AVX512v3))
+                            resultflags.AddInstructionSet(InstructionSet.X86_AVX512_BF16);
+                        if (resultflags.HasInstructionSet(InstructionSet.X86_AVX512_BF16))
+                            resultflags.AddInstructionSet(InstructionSet.X86_AVX10v1);


+        /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
+        /// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value>
+        /// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks>
+        public static new bool IsSupported { get => IsSupported; }


+            internal X64() { }
+
+            /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
+            public static new bool IsSupported { get => IsSupported; }


+        ///   <para>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)</para>
+        ///   <para>  VDPBF16PS zmm1, zmm2, zmm3/m512</para>
+        /// </summary>
+        public static Vector512<float> MultiplyWideningAndAdd(Vector512<float> addend, Vector512<ushort> left, Vector512<ushort> right) => MultiplyWideningAndAdd(addend, left, right);


+        ///   <para>  VCVTNE2PS2BF16 zmm1, zmm2, zmm3/m512</para>
+        ///   <para>Round-to-nearest-even conversion of two packed FP32 vectors into a single packed BF16 vector (returned as <see cref="Vector512{T}"/> of <see cref="ushort"/>).</para>
+        /// </summary>
+        public static Vector512<ushort> ConvertToBFloat16(Vector512<float> lower, Vector512<float> upper) => ConvertToBFloat16(lower, upper);


+        ///   <para>  VCVTNEPS2BF16 ymm1, zmm2/m512</para>
+        ///   <para>Round-to-nearest-even conversion of a packed FP32 vector into a half-width packed BF16 vector.</para>
+        /// </summary>
+        public static Vector256<ushort> ConvertToBFloat16(Vector512<float> value) => ConvertToBFloat16(value);


+            public static new bool IsSupported { get => IsSupported; }
+
+            /// <summary>__m128 _mm_dpbf16_ps (__m128 src, __m128bh a, __m128bh b) — VDPBF16PS xmm</summary>
+            public static Vector128<float> MultiplyWideningAndAdd(Vector128<float> addend, Vector128<ushort> left, Vector128<ushort> right) => MultiplyWideningAndAdd(addend, left, right);
+
+            /// <summary>__m256 _mm256_dpbf16_ps (__m256 src, __m256bh a, __m256bh b) — VDPBF16PS ymm</summary>
+            public static Vector256<float> MultiplyWideningAndAdd(Vector256<float> addend, Vector256<ushort> left, Vector256<ushort> right) => MultiplyWideningAndAdd(addend, left, right);
+
+            /// <summary>__m128bh _mm_cvtne2ps_pbh (__m128 a, __m128 b) — VCVTNE2PS2BF16 xmm</summary>
+            public static Vector128<ushort> ConvertToBFloat16(Vector128<float> lower, Vector128<float> upper) => ConvertToBFloat16(lower, upper);
+
+            /// <summary>__m256bh _mm256_cvtne2ps_pbh (__m256 a, __m256 b) — VCVTNE2PS2BF16 ymm</summary>
+            public static Vector256<ushort> ConvertToBFloat16(Vector256<float> lower, Vector256<float> upper) => ConvertToBFloat16(lower, upper);
+
+            /// <summary>__m128bh _mm_cvtneps_pbh (__m128 a) — VCVTNEPS2BF16 xmm</summary>
+            public static Vector128<ushort> ConvertToBFloat16(Vector128<float> value) => ConvertToBFloat16(value);
+
+            /// <summary>__m128bh _mm256_cvtneps_pbh (__m256 a) — VCVTNEPS2BF16 xmm from ymm</summary>
+            public static Vector128<ushort> ConvertToBFloat16(Vector256<float> value) => ConvertToBFloat16(value);


+        case NI_AVX512_BF16_ConvertToBFloat16:
+        {
+            assert(baseType == TYP_FLOAT);
+            if (numArgs == 2)
+            {
+                genHWIntrinsic_R_R_RM(node, INS_vcvtne2ps2bf16, attr, instOptions);
+            }
+            else
+            {
+                assert(numArgs == 1);
+                genHWIntrinsic_R_RM(node, INS_vcvtneps2bf16, attr, targetReg, op1, instOptions);
+            }
+            break;
+        }


The codegen path for MultiplyWideningAndAdd calls emitIns_SIMD_R_R_R_R(targetReg, op1Reg, op2Reg, op3Reg) which then dispatches on Is3OpRmwInstruction: - true -> emit movaps + emitIns_R_R_R (proper 3-operand RMW encoding) - false -> fall through to a 4-operand blendv encoding emitting 7 bytes with the 4th operand garbage-filled into the IS4 byte INS_vdpbf16ps was not in the Is3OpRmwInstruction table, so it took the blendv path and emitted bytes like `62 F2 7E 48 52 C4 70` (7 bytes) for what should have been `62 F2 7E 48 52 C4` (6 bytes). MinOpts happened to land on operand layouts where the garbage byte didn't trip up the CPU; FullOpts (with register pressure pushing operands into zmm16-31 and varied register triples) produced sequences the CPU rejected as #UD. Tier-0/MinOpts probes (top-level statements) never exercised this path which is why bf16probe passed both before and after the dispatch-case fix — the bug is specifically the FullOpts MultiplyWideningAndAdd path. After this fix: CONTROL (4 acc, FullOpts): STRESS_OK FAITHFUL (24 acc, FullOpts): STRESS_OK MINOPTS (24 acc): STRESS_OK Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 12, 2026 13:33

jamesburton mentioned this pull request Jun 12, 2026

[API Proposal]: Add managed surface for AVX-512 BF16 hardware intrinsics #129323

Open

github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 12, 2026

dotnet-policy-service Bot added the community-contribution Indicates that the PR has been added by a community member label Jun 12, 2026

Copilot AI reviewed Jun 12, 2026

View reviewed changes

jamesburton added 2 commits June 12, 2026 15:07

Copilot AI review requested due to automatic review settings June 12, 2026 14:14

Copilot AI reviewed Jun 12, 2026

View reviewed changes

build-analysis Bot mentioned this pull request Jun 12, 2026

[wasm] WasmBuildTests Blazor apps fail to boot: 404 on _framework Blazor bootstrapper (blazor.webassembly.js / blazor.web.js) #129182

Open

jamesburton and others added 2 commits June 12, 2026 22:04

Copilot AI review requested due to automatic review settings June 12, 2026 21:42

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add managed surface for AVX-512 BF16 hardware intrinsics (draft)#129326

Add managed surface for AVX-512 BF16 hardware intrinsics (draft)#129326
jamesburton wants to merge 7 commits into
dotnet:mainfrom
jamesburton:feature/avx512bf16

jamesburton commented Jun 12, 2026

Uh oh!

dotnet-policy-service Bot commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jamesburton commented Jun 12, 2026

What this adds

Key design choices (mirrors #128365 patterns)

Still TODO (after the issue confirms direction)

Files touched

Uh oh!

dotnet-policy-service Bot commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants