Skip to content

Add managed surface for AVX-512 BF16 hardware intrinsics (draft)#129326

Draft
jamesburton wants to merge 7 commits into
dotnet:mainfrom
jamesburton:feature/avx512bf16
Draft

Add managed surface for AVX-512 BF16 hardware intrinsics (draft)#129326
jamesburton wants to merge 7 commits into
dotnet:mainfrom
jamesburton:feature/avx512bf16

Conversation

@jamesburton

Copy link
Copy Markdown

This is a DRAFT. Filed early to confirm ISA grouping and shape with @tannergooding via #129323 before completing the JIT-side codegen wiring. Please see the issue for the full design rationale, hardware table (including Strix Halo as an AMD reference device), and the A/B/C options I outlined for the grouping.

Closes #129323 (will be left open until confirmed).

What this adds

Wires up the public Avx512Bf16 / Avx512Bf16.VL / Avx512Bf16.X64 classes in System.Runtime.Intrinsics.X86, exposing VDPBF16PS, VCVTNE2PS2BF16, and VCVTNEPS2BF16 on the classic AVX-512-BF16 gate rather than only via AVX10v1. Direct sibling of the AvxVnni.V512 work landed in PR #128365.

Key design choices (mirrors #128365 patterns)

  • New ISA group AVX512_BF16 with implication AVX512_BF16 -> AVX512v3 and implication AVX10v1 -> AVX512_BF16. This deliberately does not fold BF16 into AVX512v3 because v3-without-BF16 is a real shipping configuration (Intel Tiger Lake / Rocket Lake / Ice Lake-U). Folding would regress those parts' v3 surface. Preserves Tanner's "broadest overall support" principle. AMD Zen 5 (Strix Halo) and Sapphire Rapids+ enable the classic gate; AVX10v1 hardware picks up BF16 via the AVX10v1 -> AVX512_BF16 implication.
  • BFloat16 element type from existing System.BFloat16 BCL primitive. No Vector*<ushort> interim.
  • Existing R2R IDs (72, 73) preserved — re-pointed from AVX10v1 to AVX512_BF16. R2R image compatibility preserved.
  • JIT dispatch in Compiler::lookupInstructionSet (returns InstructionSet_AVX512_BF16 for the Avx512Bf16 class-name). VLVersionOfIsa and X64VersionOfIsa cases added. Mirrors AVXVNNIINT / AVXVNNIINT_V512 pattern Tanner endorsed.
  • HARDWARE_INTRINSIC rows under AVX512_BF16 with simdSize=-1 and HW_Flag_SpecialCodeGen|HW_Flag_SpecialImport matching the AVXVNNIINT precedent — avoids the fixed-simdSize=64 trap Copilot kept flagging on Add AvxVnni.V512 hardware intrinsics #128365.
  • CPUID bit detection in cpufeatures.c (subleaf 1 EAX bit 5) gated on AVX512v3 being enabled, emitting a new XArchIntrinsicConstants_Avx512Bf16 constant.
  • New configs EXTERNAL_EnableAVX512_BF16 (retail) / EnableAVX512_BF16 (JIT). Existing config-comment tokens preserved.
  • codeman.cpp gate — plain shape, no fallback branches (Tanner explicitly rejected codeman-side fallbacks on the Add AvxVnni.V512 hardware intrinsics #128365 codeman.cpp:1430 thread).
  • JIT-EE GUID regenerated via gen.bat: 2d316351-72be-474a-9a53-d204621127e2.
  • Tests: CpuId.cs and SmokeTests/HardwareIntrinsics/Program.cs — only the Avx512Bf16 lines uncommented (Avx512Bitalg / Avx512Vpopcntdq / Avx512Fp16 left commented because their managed types still don't exist — direct application of the Add AvxVnni.V512 hardware intrinsics #128365 lesson where over-eager uncommenting caused 11 NativeAOT build failures).

Still TODO (after the issue confirms direction)

  • JIT-side special codegen for VDPBF16PS / VCVTNE2PS2BF16 / VCVTNEPS2BF16 in hwintrinsiccodegenxarch.cpp and lsraxarch.cpp (needed before the intrinsic actually executes — currently the HARDWARE_INTRINSIC rows use INS_invalid slots).
  • Hand-written tests in src/tests/JIT/HardwareIntrinsics/X86_Avx/Avx512Bf16/.

These are deliberately out of scope for the draft. Once @tannergooding confirms the ISA grouping in #129323 they can be wired in a follow-up commit on this branch.

Files touched

File Why
InstructionSetDesc.txt Fill in managed name, re-point parent from AVX10v1 to AVX512_BF16, add implications
hwintrinsic.cpp (IsaRangeArray) New entry to match shifted enum
hwintrinsiclistxarch.h HARDWARE_INTRINSIC rows with SpecialCodeGen flags
hwintrinsicxarch.cpp lookupInstructionSet dispatch; VLVersionOfIsa / X64VersionOfIsa cases
cpufeatures.h / cpufeatures.c New constant + CPUID detection
HardwareIntrinsicHelpers.cs AOT-path constant + InstructionSet mapping
codeman.cpp Gate the new InstructionSet on the constant + config
clrconfigvalues.h / jitconfigvalues.h New config flags
Avx512Bf16.cs / Avx512Bf16.PlatformNotSupported.cs Public API surface
System.Private.CoreLib.Shared.projitems Include new files
System.Runtime.Intrinsics.cs (ref) Public surface in the ref assembly
CpuId.cs, SmokeTests/Program.cs Tests
corinfoinstructionset.h, CorInfoInstructionSet.cs, ReadyToRunInstructionSetHelper.cs, jiteeversionguid.h Generated by gen.bat — do not hand-edit

🤖 Generated with Claude Code

Wires up the public Avx512Bf16 / Avx512Bf16.VL / Avx512Bf16.X64 classes
in System.Runtime.Intrinsics.X86, exposing VDPBF16PS,
VCVTNE2PS2BF16, and VCVTNEPS2BF16 on the classic AVX-512-BF16 gate
rather than only via AVX10v1. This is a direct sibling of the
AvxVnni.V512 work landed in PR dotnet#128365 — the issue and reasoning are
tracked at dotnet#129323 and the design follows the patterns
@tannergooding endorsed on the VNNI redesign.

Key choices, called out for review:

- New ISA group `AVX512_BF16` with `implication AVX512_BF16 -> AVX512v3`
  and `implication AVX10v1 -> AVX512_BF16`. This deliberately does NOT
  fold BF16 into AVX512v3 because v3-without-BF16 is a real shipping
  configuration on Intel Tiger Lake / Rocket Lake / Ice Lake-U; folding
  would regress those parts' v3 surface. The implication chain preserves
  Tanner's "broadest overall support" principle. AMD Zen 5 (Strix Halo)
  and Sapphire Rapids+ enable the classic gate directly; AVX10v1 hardware
  picks up BF16 via the AVX10v1 -> AVX512_BF16 implication.
- `BFloat16` element type from `System.BFloat16` (already in the BCL via
  `BitConverter.GetBytes(BFloat16)` etc.). No `Vector*<ushort>` interim.
- The two existing R2R IDs (72, 73 for `Avx512Bf16` / `Avx512Bf16_VL`)
  are kept and re-pointed from `AVX10v1` to `AVX512_BF16` in
  `InstructionSetDesc.txt`. R2R image compatibility preserved.
- JIT dispatch in `Compiler::lookupInstructionSet` returns
  `InstructionSet_AVX512_BF16` for the `Avx512Bf16` class-name (was
  `AVX10v1`); `VLVersionOfIsa` and `X64VersionOfIsa` cases added.
- HARDWARE_INTRINSIC rows added under `AVX512_BF16` with `simdSize=-1`
  and `HW_Flag_SpecialCodeGen|HW_Flag_SpecialImport` matching the
  AVXVNNIINT precedent, so the special-codegen path will handle base-type
  dispatch from the BFloat16 args.
- CPUID bit detection added in `cpufeatures.c` (subleaf 1 EAX bit 5)
  gated on AVX512v3 already being enabled, emitting a new
  `XArchIntrinsicConstants_Avx512Bf16` bit.
- New `EXTERNAL_EnableAVX512_BF16` / `EnableAVX512_BF16` configs
  in `clrconfigvalues.h` and `jitconfigvalues.h`.
- `codeman.cpp` gates `InstructionSet_AVX512_BF16` on the new CPUID bit
  and config. No fallback branches (Tanner rejected codeman-side
  fallbacks on dotnet#128365).
- Generated files (`corinfoinstructionset.h`,
  `CorInfoInstructionSet.cs`, `ReadyToRunInstructionSetHelper.cs`)
  regenerated via `gen.bat`; fresh JIT-EE GUID
  `2d316351-72be-474a-9a53-d204621127e2`.
- Tests in `CpuId.cs` and `SmokeTests/HardwareIntrinsics/Program.cs`
  uncommented for the new managed type (only those — Avx512Bitalg /
  Avx512Vpopcntdq / Avx512Fp16 left commented because their managed
  types still do not exist, per the dotnet#128365 lesson).

DRAFT: the JIT-side special codegen for VDPBF16PS / VCVTNE2PS2BF16 /
VCVTNEPS2BF16 in `hwintrinsiccodegenxarch.cpp` is not yet wired and is
the next step. Filing as draft to confirm the ISA grouping and shape
with @tannergooding via dotnet#129323 before that work.

Related: dotnet#129323
Copilot AI review requested due to automatic review settings June 12, 2026 13:33
@github-actions github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 12, 2026
@dotnet-policy-service dotnet-policy-service Bot added the community-contribution Indicates that the PR has been added by a community member label Jun 12, 2026
@dotnet-policy-service

Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR enables plumbing for the AVX512-BF16 instruction set across the runtime stack (feature detection, JIT ISA exposure, and public intrinsics surface), and unblocks tests to validate support reporting.

Changes:

  • Expose System.Runtime.Intrinsics.X86.Avx512Bf16 (including VL and X64 nested types) in ref + CoreLib, and wire it into instruction-set tables.
  • Detect and surface AVX512-BF16 CPU capability via minipal → CoreCLR CPU flags → JIT instruction set flags / R2R mapping.
  • Enable/extend JIT + NativeAOT smoke tests to validate AVX512-BF16 support and CPUID bit correctness.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx512Bf16.cs Adds the AVX512-BF16 intrinsic type + APIs (supported implementation).
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx512Bf16.PlatformNotSupported.cs Adds PlatformNotSupported stubs for AVX512-BF16 APIs.
src/libraries/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs Adds public ref surface for Avx512Bf16, VL, and X64.
src/native/minipal/cpufeatures.h / cpufeatures.c Adds AVX512-BF16 feature flag + CPUID-based detection.
src/coreclr/vm/codeman.cpp Enables the ISA for JIT via config gating + detected CPU feature.
src/coreclr/jit/, src/coreclr/inc/, tools/Common/* Adds ISA enum values, mappings, implications, intrinsic lists, and R2R plumbing for AVX512_BF16.
src/tests/nativeaot/SmokeTests/HardwareIntrinsics/Program.cs Turns on AVX512-BF16 smoke checks.
src/tests/JIT/HardwareIntrinsics/X86/X86Base/CpuId.cs Validates CPUID bit for AVX512-BF16 against Avx512Bf16.IsSupported.

/// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
/// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value>
/// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks>
public static new bool IsSupported { get => IsSupported; }
/// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
/// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value>
/// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks>
public static new bool IsSupported { get => IsSupported; }
/// <para>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)</para>
/// <para> VDPBF16PS zmm1, zmm2, zmm3/m512</para>
/// </summary>
public static Vector512<float> MultiplyWideningAndAdd(Vector512<float> addend, Vector512<BFloat16> left, Vector512<BFloat16> right) => MultiplyWideningAndAdd(addend, left, right);
/// <para> VCVTNE2PS2BF16 zmm1, zmm2, zmm3/m512</para>
/// <para>Round-to-nearest-even conversion of two packed FP32 vectors into a single packed BF16 vector.</para>
/// </summary>
public static Vector512<BFloat16> ConvertToBFloat16(Vector512<float> lower, Vector512<float> upper) => ConvertToBFloat16(lower, upper);
/// <para> VCVTNEPS2BF16 ymm1, zmm2/m512</para>
/// <para>Round-to-nearest-even conversion of a packed FP32 vector into a packed BF16 vector half-width.</para>
/// </summary>
public static Vector256<BFloat16> ConvertToBFloat16(Vector512<float> value) => ConvertToBFloat16(value);
/// <para>__m128bh _mm_cvtneps_pbh (__m128 a)</para>
/// <para> VCVTNEPS2BF16 xmm1, xmm2/m128</para>
/// </summary>
public static Vector128<BFloat16> ConvertToBFloat16(Vector128<float> value) => ConvertToBFloat16(value);
/// <para>__m128bh _mm256_cvtneps_pbh (__m256 a)</para>
/// <para> VCVTNEPS2BF16 xmm1, ymm2/m256</para>
/// </summary>
public static Vector128<BFloat16> ConvertToBFloat16(Vector256<float> value) => ConvertToBFloat16(value);
Comment on lines +2742 to +2779
case (InstructionSet.X64_AVX512_BF16, TargetArchitecture.X64):
case (InstructionSet.X64_AVX512_BF16_X64, TargetArchitecture.X64):
{
var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
if (type != null)
{
yield return type;
if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
{
var nestedType = type.GetNestedType("X64"u8);
if (nestedType != null)
{
yield return nestedType;
}
}
}
}
{
var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
if (parentType != null)
{
yield return parentType;
var nestedType = parentType.GetNestedType("VL"u8);
if (nestedType != null)
{
yield return nestedType;
if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
{
var nestedType64 = parentType.GetNestedType("VL_X64"u8);
if (nestedType64 != null)
{
yield return nestedType64;
}
}
}
}
}
break;
Comment on lines +3294 to +3314
case (InstructionSet.X86_AVX512_BF16, TargetArchitecture.X86):
{
var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
if (type != null)
{
yield return type;
}
}
{
var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
if (parentType != null)
{
yield return parentType;
var nestedType = parentType.GetNestedType("VL"u8);
if (nestedType != null)
{
yield return nestedType;
}
}
}
break;
#define XArchIntrinsicConstants_WaitPkg (1 << 16)
#define XArchIntrinsicConstants_X86Serialize (1 << 17)
#define XArchIntrinsicConstants_AVX512Bmm (1 << 18)
#define XArchIntrinsicConstants_Avx512Bf16 (1 << 19)
BFloat16 is declared in System.Numerics (System/Numerics/BFloat16.cs).
The ref assembly referenced System.BFloat16 (12 sites, producing 12x
CS0234) and the impl files used unqualified BFloat16 without a using
directive (would CS0246 once compiled). Fix:

- src/libraries/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs:
  s/System.BFloat16/System.Numerics.BFloat16/ (12 sites)
- Avx512Bf16.cs and Avx512Bf16.PlatformNotSupported.cs:
  added 'using System.Numerics;'
The JIT asserts (hwintrinsic.cpp:1120) that HARDWARE_INTRINSIC entries
within an ISA range are sorted alphabetically by method name. The
AVX512_BF16 block had MultiplyWideningAndAdd before ConvertToBFloat16,
which fails strcmp ordering and crashes crossgen2 during corelib R2R
generation. Reorder so ConvertToBFloat16 is first (and update the
FIRST_NI / LAST_NI markers accordingly). Caught by a clean dev-branch
build that combined this PR with dotnet#128365.
Copilot AI review requested due to automatic review settings June 12, 2026 14:14

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 15 comments.

/// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
/// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value>
/// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks>
public static new bool IsSupported { get => IsSupported; }
/// <para>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)</para>
/// <para> VDPBF16PS zmm1, zmm2, zmm3/m512</para>
/// </summary>
public static Vector512<float> MultiplyWideningAndAdd(Vector512<float> addend, Vector512<BFloat16> left, Vector512<BFloat16> right) => MultiplyWideningAndAdd(addend, left, right);
/// <para> VCVTNE2PS2BF16 zmm1, zmm2, zmm3/m512</para>
/// <para>Round-to-nearest-even conversion of two packed FP32 vectors into a single packed BF16 vector.</para>
/// </summary>
public static Vector512<BFloat16> ConvertToBFloat16(Vector512<float> lower, Vector512<float> upper) => ConvertToBFloat16(lower, upper);
/// <para> VCVTNEPS2BF16 ymm1, zmm2/m512</para>
/// <para>Round-to-nearest-even conversion of a packed FP32 vector into a packed BF16 vector half-width.</para>
/// </summary>
public static Vector256<BFloat16> ConvertToBFloat16(Vector512<float> value) => ConvertToBFloat16(value);
/// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
/// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value>
/// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks>
public static new bool IsSupported { get => IsSupported; }
/// <para>__m128bh _mm256_cvtneps_pbh (__m256 a)</para>
/// <para> VCVTNEPS2BF16 xmm1, ymm2/m256</para>
/// </summary>
public static Vector128<BFloat16> ConvertToBFloat16(Vector256<float> value) => ConvertToBFloat16(value);
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512F$(NotSupportedOnMono).cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi$(NotSupportedOnMono).cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi2$(NotSupportedOnMono).cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Bf16$(NotSupportedOnMono).cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\AvxVnniInt16.PlatformNotSupported.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi.PlatformNotSupported.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi2.PlatformNotSupported.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Bf16.PlatformNotSupported.cs" />
Comment on lines +2742 to +2779
case (InstructionSet.X64_AVX512_BF16, TargetArchitecture.X64):
case (InstructionSet.X64_AVX512_BF16_X64, TargetArchitecture.X64):
{
var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
if (type != null)
{
yield return type;
if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
{
var nestedType = type.GetNestedType("X64"u8);
if (nestedType != null)
{
yield return nestedType;
}
}
}
}
{
var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
if (parentType != null)
{
yield return parentType;
var nestedType = parentType.GetNestedType("VL"u8);
if (nestedType != null)
{
yield return nestedType;
if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
{
var nestedType64 = parentType.GetNestedType("VL_X64"u8);
if (nestedType64 != null)
{
yield return nestedType64;
}
}
}
}
}
break;
#define XArchIntrinsicConstants_WaitPkg (1 << 16)
#define XArchIntrinsicConstants_X86Serialize (1 << 17)
#define XArchIntrinsicConstants_AVX512Bmm (1 << 18)
#define XArchIntrinsicConstants_Avx512Bf16 (1 << 19)
@tannergooding's review on the API proposal (dotnet#129323):
the strongly-typed BFloat16 element shape requires extensive JIT/VM/
Debugger/ABI work to make Vector128<BFloat16>.IsSupported report true,
which is the same reason F16C/FP16 have not landed yet. The cheaper
path he endorsed is to ship the initial signatures as Vector*<ushort>
(raw bf16 bit pattern) and add Vector*<BFloat16> overloads later when
the primitive-type plumbing is ready.

This commit applies that pivot:
- Avx512Bf16.cs / Avx512Bf16.PlatformNotSupported.cs: all BFloat16
  parameter / return types switched to ushort.
- System.Runtime.Intrinsics.cs (ref): same.

Method names (ConvertToBFloat16) are unchanged — they refer to the
operation, not the parameter type. Existing JIT plumbing
(InstructionSetDesc, codeman gate, lookupInstructionSet dispatch,
HARDWARE_INTRINSIC rows, CPUID detection, configs) is unaffected.
jamesburton and others added 2 commits June 12, 2026 22:04
- MultiplyWideningAndAdd: drop SpecialCodeGen/SpecialImport, move
  INS_vdpbf16ps from FLOAT slot to USHORT slot (BaseTypeFromSecondArg
  resolves to USHORT). Standard table-driven 3-arg case dispatches to
  genHWIntrinsic_R_R_R_RM, matching AVXVNNI MultiplyWideningAndAdd.
- ConvertToBFloat16: keep SpecialCodeGen, add handler in
  genAvxFamilyIntrinsic that picks INS_vcvtne2ps2bf16 (2-arg) vs
  INS_vcvtneps2bf16 (1-arg) by node operand count.
- LSRA: add NI_AVX512_BF16_MultiplyWideningAndAdd alongside AVXVNNI for
  RMW operand-use pattern (op1 = accumulator/target).

Unblocks downstream consumer (BF16 quant kernel on Zen5 Strix Halo)
reporting NotImplementedException on all three V512 intrinsics. ISA
grouping (option-a / AVX10v1-lie / AVX512v4) still pending maintainer
direction on dotnet#129323 — codegen is robust to either path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The genHWIntrinsic per-ISA switch (hwintrinsiccodegenxarch.cpp:996)
routes SpecialCodeGen-flagged intrinsics to their family handler.
AVX512_BF16 was missing — non-table-driven BF16 intrinsics
(ConvertToBFloat16) fell through to default:unreached(), which calls
fatal(CORJIT_RECOVERABLEERROR) in Release builds and surfaces to the
runtime as InvalidProgramException.

MultiplyWideningAndAdd is table-driven (no SpecialCodeGen) so it took
the generic code path and worked. ConvertToBFloat16 needs SpecialCodeGen
to dispatch by arg count (1-arg → VCVTNEPS2BF16, 2-arg → VCVTNE2PS2BF16),
so it requires this fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 12, 2026 21:42

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 10 comments.

Comment on lines +2742 to +2778
case (InstructionSet.X64_AVX512_BF16, TargetArchitecture.X64):
case (InstructionSet.X64_AVX512_BF16_X64, TargetArchitecture.X64):
{
var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
if (type != null)
{
yield return type;
if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
{
var nestedType = type.GetNestedType("X64"u8);
if (nestedType != null)
{
yield return nestedType;
}
}
}
}
{
var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false);
if (parentType != null)
{
yield return parentType;
var nestedType = parentType.GetNestedType("VL"u8);
if (nestedType != null)
{
yield return nestedType;
if (instructionSet == InstructionSet.X64_AVX512_BF16_X64)
{
var nestedType64 = parentType.GetNestedType("VL_X64"u8);
if (nestedType64 != null)
{
yield return nestedType64;
}
}
}
}
}
Comment on lines +987 to +990
if (resultflags.HasInstructionSet(InstructionSet.X64_AVX512v3))
resultflags.AddInstructionSet(InstructionSet.X64_AVX512_BF16);
if (resultflags.HasInstructionSet(InstructionSet.X64_AVX512_BF16))
resultflags.AddInstructionSet(InstructionSet.X64_AVX10v1);
Comment on lines +1056 to +1059
if (resultflags.HasInstructionSet(InstructionSet.X86_AVX512v3))
resultflags.AddInstructionSet(InstructionSet.X86_AVX512_BF16);
if (resultflags.HasInstructionSet(InstructionSet.X86_AVX512_BF16))
resultflags.AddInstructionSet(InstructionSet.X86_AVX10v1);
/// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
/// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value>
/// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks>
public static new bool IsSupported { get => IsSupported; }
internal X64() { }

/// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary>
public static new bool IsSupported { get => IsSupported; }
/// <para>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)</para>
/// <para> VDPBF16PS zmm1, zmm2, zmm3/m512</para>
/// </summary>
public static Vector512<float> MultiplyWideningAndAdd(Vector512<float> addend, Vector512<ushort> left, Vector512<ushort> right) => MultiplyWideningAndAdd(addend, left, right);
/// <para> VCVTNE2PS2BF16 zmm1, zmm2, zmm3/m512</para>
/// <para>Round-to-nearest-even conversion of two packed FP32 vectors into a single packed BF16 vector (returned as <see cref="Vector512{T}"/> of <see cref="ushort"/>).</para>
/// </summary>
public static Vector512<ushort> ConvertToBFloat16(Vector512<float> lower, Vector512<float> upper) => ConvertToBFloat16(lower, upper);
/// <para> VCVTNEPS2BF16 ymm1, zmm2/m512</para>
/// <para>Round-to-nearest-even conversion of a packed FP32 vector into a half-width packed BF16 vector.</para>
/// </summary>
public static Vector256<ushort> ConvertToBFloat16(Vector512<float> value) => ConvertToBFloat16(value);
Comment on lines +62 to +80
public static new bool IsSupported { get => IsSupported; }

/// <summary>__m128 _mm_dpbf16_ps (__m128 src, __m128bh a, __m128bh b) — VDPBF16PS xmm</summary>
public static Vector128<float> MultiplyWideningAndAdd(Vector128<float> addend, Vector128<ushort> left, Vector128<ushort> right) => MultiplyWideningAndAdd(addend, left, right);

/// <summary>__m256 _mm256_dpbf16_ps (__m256 src, __m256bh a, __m256bh b) — VDPBF16PS ymm</summary>
public static Vector256<float> MultiplyWideningAndAdd(Vector256<float> addend, Vector256<ushort> left, Vector256<ushort> right) => MultiplyWideningAndAdd(addend, left, right);

/// <summary>__m128bh _mm_cvtne2ps_pbh (__m128 a, __m128 b) — VCVTNE2PS2BF16 xmm</summary>
public static Vector128<ushort> ConvertToBFloat16(Vector128<float> lower, Vector128<float> upper) => ConvertToBFloat16(lower, upper);

/// <summary>__m256bh _mm256_cvtne2ps_pbh (__m256 a, __m256 b) — VCVTNE2PS2BF16 ymm</summary>
public static Vector256<ushort> ConvertToBFloat16(Vector256<float> lower, Vector256<float> upper) => ConvertToBFloat16(lower, upper);

/// <summary>__m128bh _mm_cvtneps_pbh (__m128 a) — VCVTNEPS2BF16 xmm</summary>
public static Vector128<ushort> ConvertToBFloat16(Vector128<float> value) => ConvertToBFloat16(value);

/// <summary>__m128bh _mm256_cvtneps_pbh (__m256 a) — VCVTNEPS2BF16 xmm from ymm</summary>
public static Vector128<ushort> ConvertToBFloat16(Vector256<float> value) => ConvertToBFloat16(value);
Comment on lines +3654 to +3667
case NI_AVX512_BF16_ConvertToBFloat16:
{
assert(baseType == TYP_FLOAT);
if (numArgs == 2)
{
genHWIntrinsic_R_R_RM(node, INS_vcvtne2ps2bf16, attr, instOptions);
}
else
{
assert(numArgs == 1);
genHWIntrinsic_R_RM(node, INS_vcvtneps2bf16, attr, targetReg, op1, instOptions);
}
break;
}
The codegen path for MultiplyWideningAndAdd calls
emitIns_SIMD_R_R_R_R(targetReg, op1Reg, op2Reg, op3Reg) which then
dispatches on Is3OpRmwInstruction:

  - true  -> emit movaps + emitIns_R_R_R (proper 3-operand RMW encoding)
  - false -> fall through to a 4-operand blendv encoding emitting 7 bytes
             with the 4th operand garbage-filled into the IS4 byte

INS_vdpbf16ps was not in the Is3OpRmwInstruction table, so it took the
blendv path and emitted bytes like `62 F2 7E 48 52 C4 70` (7 bytes) for
what should have been `62 F2 7E 48 52 C4` (6 bytes). MinOpts happened to
land on operand layouts where the garbage byte didn't trip up the CPU;
FullOpts (with register pressure pushing operands into zmm16-31 and
varied register triples) produced sequences the CPU rejected as #UD.

Tier-0/MinOpts probes (top-level statements) never exercised this path
which is why bf16probe passed both before and after the dispatch-case
fix — the bug is specifically the FullOpts MultiplyWideningAndAdd path.

After this fix:
  CONTROL  (4 acc, FullOpts):  STRESS_OK
  FAITHFUL (24 acc, FullOpts): STRESS_OK
  MINOPTS  (24 acc):           STRESS_OK

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[API Proposal]: Add managed surface for AVX-512 BF16 hardware intrinsics

2 participants