Add managed surface for AVX-512 BF16 hardware intrinsics (draft)#129326
Draft
jamesburton wants to merge 7 commits into
Draft
Add managed surface for AVX-512 BF16 hardware intrinsics (draft)#129326jamesburton wants to merge 7 commits into
jamesburton wants to merge 7 commits into
Conversation
Wires up the public Avx512Bf16 / Avx512Bf16.VL / Avx512Bf16.X64 classes in System.Runtime.Intrinsics.X86, exposing VDPBF16PS, VCVTNE2PS2BF16, and VCVTNEPS2BF16 on the classic AVX-512-BF16 gate rather than only via AVX10v1. This is a direct sibling of the AvxVnni.V512 work landed in PR dotnet#128365 — the issue and reasoning are tracked at dotnet#129323 and the design follows the patterns @tannergooding endorsed on the VNNI redesign. Key choices, called out for review: - New ISA group `AVX512_BF16` with `implication AVX512_BF16 -> AVX512v3` and `implication AVX10v1 -> AVX512_BF16`. This deliberately does NOT fold BF16 into AVX512v3 because v3-without-BF16 is a real shipping configuration on Intel Tiger Lake / Rocket Lake / Ice Lake-U; folding would regress those parts' v3 surface. The implication chain preserves Tanner's "broadest overall support" principle. AMD Zen 5 (Strix Halo) and Sapphire Rapids+ enable the classic gate directly; AVX10v1 hardware picks up BF16 via the AVX10v1 -> AVX512_BF16 implication. - `BFloat16` element type from `System.BFloat16` (already in the BCL via `BitConverter.GetBytes(BFloat16)` etc.). No `Vector*<ushort>` interim. - The two existing R2R IDs (72, 73 for `Avx512Bf16` / `Avx512Bf16_VL`) are kept and re-pointed from `AVX10v1` to `AVX512_BF16` in `InstructionSetDesc.txt`. R2R image compatibility preserved. - JIT dispatch in `Compiler::lookupInstructionSet` returns `InstructionSet_AVX512_BF16` for the `Avx512Bf16` class-name (was `AVX10v1`); `VLVersionOfIsa` and `X64VersionOfIsa` cases added. - HARDWARE_INTRINSIC rows added under `AVX512_BF16` with `simdSize=-1` and `HW_Flag_SpecialCodeGen|HW_Flag_SpecialImport` matching the AVXVNNIINT precedent, so the special-codegen path will handle base-type dispatch from the BFloat16 args. - CPUID bit detection added in `cpufeatures.c` (subleaf 1 EAX bit 5) gated on AVX512v3 already being enabled, emitting a new `XArchIntrinsicConstants_Avx512Bf16` bit. - New `EXTERNAL_EnableAVX512_BF16` / `EnableAVX512_BF16` configs in `clrconfigvalues.h` and `jitconfigvalues.h`. - `codeman.cpp` gates `InstructionSet_AVX512_BF16` on the new CPUID bit and config. No fallback branches (Tanner rejected codeman-side fallbacks on dotnet#128365). - Generated files (`corinfoinstructionset.h`, `CorInfoInstructionSet.cs`, `ReadyToRunInstructionSetHelper.cs`) regenerated via `gen.bat`; fresh JIT-EE GUID `2d316351-72be-474a-9a53-d204621127e2`. - Tests in `CpuId.cs` and `SmokeTests/HardwareIntrinsics/Program.cs` uncommented for the new managed type (only those — Avx512Bitalg / Avx512Vpopcntdq / Avx512Fp16 left commented because their managed types still do not exist, per the dotnet#128365 lesson). DRAFT: the JIT-side special codegen for VDPBF16PS / VCVTNE2PS2BF16 / VCVTNEPS2BF16 in `hwintrinsiccodegenxarch.cpp` is not yet wired and is the next step. Filing as draft to confirm the ISA grouping and shape with @tannergooding via dotnet#129323 before that work. Related: dotnet#129323
Contributor
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR enables plumbing for the AVX512-BF16 instruction set across the runtime stack (feature detection, JIT ISA exposure, and public intrinsics surface), and unblocks tests to validate support reporting.
Changes:
- Expose
System.Runtime.Intrinsics.X86.Avx512Bf16(includingVLandX64nested types) in ref + CoreLib, and wire it into instruction-set tables. - Detect and surface AVX512-BF16 CPU capability via
minipal→ CoreCLR CPU flags → JIT instruction set flags / R2R mapping. - Enable/extend JIT + NativeAOT smoke tests to validate AVX512-BF16 support and CPUID bit correctness.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx512Bf16.cs | Adds the AVX512-BF16 intrinsic type + APIs (supported implementation). |
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx512Bf16.PlatformNotSupported.cs | Adds PlatformNotSupported stubs for AVX512-BF16 APIs. |
| src/libraries/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs | Adds public ref surface for Avx512Bf16, VL, and X64. |
| src/native/minipal/cpufeatures.h / cpufeatures.c | Adds AVX512-BF16 feature flag + CPUID-based detection. |
| src/coreclr/vm/codeman.cpp | Enables the ISA for JIT via config gating + detected CPU feature. |
| src/coreclr/jit/, src/coreclr/inc/, tools/Common/* | Adds ISA enum values, mappings, implications, intrinsic lists, and R2R plumbing for AVX512_BF16. |
| src/tests/nativeaot/SmokeTests/HardwareIntrinsics/Program.cs | Turns on AVX512-BF16 smoke checks. |
| src/tests/JIT/HardwareIntrinsics/X86/X86Base/CpuId.cs | Validates CPUID bit for AVX512-BF16 against Avx512Bf16.IsSupported. |
| /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary> | ||
| /// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value> | ||
| /// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks> | ||
| public static new bool IsSupported { get => IsSupported; } |
| /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary> | ||
| /// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value> | ||
| /// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks> | ||
| public static new bool IsSupported { get => IsSupported; } |
| /// <para>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)</para> | ||
| /// <para> VDPBF16PS zmm1, zmm2, zmm3/m512</para> | ||
| /// </summary> | ||
| public static Vector512<float> MultiplyWideningAndAdd(Vector512<float> addend, Vector512<BFloat16> left, Vector512<BFloat16> right) => MultiplyWideningAndAdd(addend, left, right); |
| /// <para> VCVTNE2PS2BF16 zmm1, zmm2, zmm3/m512</para> | ||
| /// <para>Round-to-nearest-even conversion of two packed FP32 vectors into a single packed BF16 vector.</para> | ||
| /// </summary> | ||
| public static Vector512<BFloat16> ConvertToBFloat16(Vector512<float> lower, Vector512<float> upper) => ConvertToBFloat16(lower, upper); |
| /// <para> VCVTNEPS2BF16 ymm1, zmm2/m512</para> | ||
| /// <para>Round-to-nearest-even conversion of a packed FP32 vector into a packed BF16 vector half-width.</para> | ||
| /// </summary> | ||
| public static Vector256<BFloat16> ConvertToBFloat16(Vector512<float> value) => ConvertToBFloat16(value); |
| /// <para>__m128bh _mm_cvtneps_pbh (__m128 a)</para> | ||
| /// <para> VCVTNEPS2BF16 xmm1, xmm2/m128</para> | ||
| /// </summary> | ||
| public static Vector128<BFloat16> ConvertToBFloat16(Vector128<float> value) => ConvertToBFloat16(value); |
| /// <para>__m128bh _mm256_cvtneps_pbh (__m256 a)</para> | ||
| /// <para> VCVTNEPS2BF16 xmm1, ymm2/m256</para> | ||
| /// </summary> | ||
| public static Vector128<BFloat16> ConvertToBFloat16(Vector256<float> value) => ConvertToBFloat16(value); |
Comment on lines
+2742
to
+2779
| case (InstructionSet.X64_AVX512_BF16, TargetArchitecture.X64): | ||
| case (InstructionSet.X64_AVX512_BF16_X64, TargetArchitecture.X64): | ||
| { | ||
| var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false); | ||
| if (type != null) | ||
| { | ||
| yield return type; | ||
| if (instructionSet == InstructionSet.X64_AVX512_BF16_X64) | ||
| { | ||
| var nestedType = type.GetNestedType("X64"u8); | ||
| if (nestedType != null) | ||
| { | ||
| yield return nestedType; | ||
| } | ||
| } | ||
| } | ||
| } | ||
| { | ||
| var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false); | ||
| if (parentType != null) | ||
| { | ||
| yield return parentType; | ||
| var nestedType = parentType.GetNestedType("VL"u8); | ||
| if (nestedType != null) | ||
| { | ||
| yield return nestedType; | ||
| if (instructionSet == InstructionSet.X64_AVX512_BF16_X64) | ||
| { | ||
| var nestedType64 = parentType.GetNestedType("VL_X64"u8); | ||
| if (nestedType64 != null) | ||
| { | ||
| yield return nestedType64; | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| break; |
Comment on lines
+3294
to
+3314
| case (InstructionSet.X86_AVX512_BF16, TargetArchitecture.X86): | ||
| { | ||
| var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false); | ||
| if (type != null) | ||
| { | ||
| yield return type; | ||
| } | ||
| } | ||
| { | ||
| var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false); | ||
| if (parentType != null) | ||
| { | ||
| yield return parentType; | ||
| var nestedType = parentType.GetNestedType("VL"u8); | ||
| if (nestedType != null) | ||
| { | ||
| yield return nestedType; | ||
| } | ||
| } | ||
| } | ||
| break; |
| #define XArchIntrinsicConstants_WaitPkg (1 << 16) | ||
| #define XArchIntrinsicConstants_X86Serialize (1 << 17) | ||
| #define XArchIntrinsicConstants_AVX512Bmm (1 << 18) | ||
| #define XArchIntrinsicConstants_Avx512Bf16 (1 << 19) |
BFloat16 is declared in System.Numerics (System/Numerics/BFloat16.cs). The ref assembly referenced System.BFloat16 (12 sites, producing 12x CS0234) and the impl files used unqualified BFloat16 without a using directive (would CS0246 once compiled). Fix: - src/libraries/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs: s/System.BFloat16/System.Numerics.BFloat16/ (12 sites) - Avx512Bf16.cs and Avx512Bf16.PlatformNotSupported.cs: added 'using System.Numerics;'
The JIT asserts (hwintrinsic.cpp:1120) that HARDWARE_INTRINSIC entries within an ISA range are sorted alphabetically by method name. The AVX512_BF16 block had MultiplyWideningAndAdd before ConvertToBFloat16, which fails strcmp ordering and crashes crossgen2 during corelib R2R generation. Reorder so ConvertToBFloat16 is first (and update the FIRST_NI / LAST_NI markers accordingly). Caught by a clean dev-branch build that combined this PR with dotnet#128365.
| /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary> | ||
| /// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value> | ||
| /// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks> | ||
| public static new bool IsSupported { get => IsSupported; } |
| /// <para>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)</para> | ||
| /// <para> VDPBF16PS zmm1, zmm2, zmm3/m512</para> | ||
| /// </summary> | ||
| public static Vector512<float> MultiplyWideningAndAdd(Vector512<float> addend, Vector512<BFloat16> left, Vector512<BFloat16> right) => MultiplyWideningAndAdd(addend, left, right); |
| /// <para> VCVTNE2PS2BF16 zmm1, zmm2, zmm3/m512</para> | ||
| /// <para>Round-to-nearest-even conversion of two packed FP32 vectors into a single packed BF16 vector.</para> | ||
| /// </summary> | ||
| public static Vector512<BFloat16> ConvertToBFloat16(Vector512<float> lower, Vector512<float> upper) => ConvertToBFloat16(lower, upper); |
| /// <para> VCVTNEPS2BF16 ymm1, zmm2/m512</para> | ||
| /// <para>Round-to-nearest-even conversion of a packed FP32 vector into a packed BF16 vector half-width.</para> | ||
| /// </summary> | ||
| public static Vector256<BFloat16> ConvertToBFloat16(Vector512<float> value) => ConvertToBFloat16(value); |
| /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary> | ||
| /// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value> | ||
| /// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks> | ||
| public static new bool IsSupported { get => IsSupported; } |
| /// <para>__m128bh _mm256_cvtneps_pbh (__m256 a)</para> | ||
| /// <para> VCVTNEPS2BF16 xmm1, ymm2/m256</para> | ||
| /// </summary> | ||
| public static Vector128<BFloat16> ConvertToBFloat16(Vector256<float> value) => ConvertToBFloat16(value); |
| <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512F$(NotSupportedOnMono).cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi$(NotSupportedOnMono).cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi2$(NotSupportedOnMono).cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Bf16$(NotSupportedOnMono).cs" /> |
| <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\AvxVnniInt16.PlatformNotSupported.cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi.PlatformNotSupported.cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Vbmi2.PlatformNotSupported.cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Runtime\Intrinsics\X86\Avx512Bf16.PlatformNotSupported.cs" /> |
Comment on lines
+2742
to
+2779
| case (InstructionSet.X64_AVX512_BF16, TargetArchitecture.X64): | ||
| case (InstructionSet.X64_AVX512_BF16_X64, TargetArchitecture.X64): | ||
| { | ||
| var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false); | ||
| if (type != null) | ||
| { | ||
| yield return type; | ||
| if (instructionSet == InstructionSet.X64_AVX512_BF16_X64) | ||
| { | ||
| var nestedType = type.GetNestedType("X64"u8); | ||
| if (nestedType != null) | ||
| { | ||
| yield return nestedType; | ||
| } | ||
| } | ||
| } | ||
| } | ||
| { | ||
| var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false); | ||
| if (parentType != null) | ||
| { | ||
| yield return parentType; | ||
| var nestedType = parentType.GetNestedType("VL"u8); | ||
| if (nestedType != null) | ||
| { | ||
| yield return nestedType; | ||
| if (instructionSet == InstructionSet.X64_AVX512_BF16_X64) | ||
| { | ||
| var nestedType64 = parentType.GetNestedType("VL_X64"u8); | ||
| if (nestedType64 != null) | ||
| { | ||
| yield return nestedType64; | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| break; |
| #define XArchIntrinsicConstants_WaitPkg (1 << 16) | ||
| #define XArchIntrinsicConstants_X86Serialize (1 << 17) | ||
| #define XArchIntrinsicConstants_AVX512Bmm (1 << 18) | ||
| #define XArchIntrinsicConstants_Avx512Bf16 (1 << 19) |
@tannergooding's review on the API proposal (dotnet#129323): the strongly-typed BFloat16 element shape requires extensive JIT/VM/ Debugger/ABI work to make Vector128<BFloat16>.IsSupported report true, which is the same reason F16C/FP16 have not landed yet. The cheaper path he endorsed is to ship the initial signatures as Vector*<ushort> (raw bf16 bit pattern) and add Vector*<BFloat16> overloads later when the primitive-type plumbing is ready. This commit applies that pivot: - Avx512Bf16.cs / Avx512Bf16.PlatformNotSupported.cs: all BFloat16 parameter / return types switched to ushort. - System.Runtime.Intrinsics.cs (ref): same. Method names (ConvertToBFloat16) are unchanged — they refer to the operation, not the parameter type. Existing JIT plumbing (InstructionSetDesc, codeman gate, lookupInstructionSet dispatch, HARDWARE_INTRINSIC rows, CPUID detection, configs) is unaffected.
- MultiplyWideningAndAdd: drop SpecialCodeGen/SpecialImport, move INS_vdpbf16ps from FLOAT slot to USHORT slot (BaseTypeFromSecondArg resolves to USHORT). Standard table-driven 3-arg case dispatches to genHWIntrinsic_R_R_R_RM, matching AVXVNNI MultiplyWideningAndAdd. - ConvertToBFloat16: keep SpecialCodeGen, add handler in genAvxFamilyIntrinsic that picks INS_vcvtne2ps2bf16 (2-arg) vs INS_vcvtneps2bf16 (1-arg) by node operand count. - LSRA: add NI_AVX512_BF16_MultiplyWideningAndAdd alongside AVXVNNI for RMW operand-use pattern (op1 = accumulator/target). Unblocks downstream consumer (BF16 quant kernel on Zen5 Strix Halo) reporting NotImplementedException on all three V512 intrinsics. ISA grouping (option-a / AVX10v1-lie / AVX512v4) still pending maintainer direction on dotnet#129323 — codegen is robust to either path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The genHWIntrinsic per-ISA switch (hwintrinsiccodegenxarch.cpp:996) routes SpecialCodeGen-flagged intrinsics to their family handler. AVX512_BF16 was missing — non-table-driven BF16 intrinsics (ConvertToBFloat16) fell through to default:unreached(), which calls fatal(CORJIT_RECOVERABLEERROR) in Release builds and surfaces to the runtime as InvalidProgramException. MultiplyWideningAndAdd is table-driven (no SpecialCodeGen) so it took the generic code path and worked. ConvertToBFloat16 needs SpecialCodeGen to dispatch by arg count (1-arg → VCVTNEPS2BF16, 2-arg → VCVTNE2PS2BF16), so it requires this fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment on lines
+2742
to
+2778
| case (InstructionSet.X64_AVX512_BF16, TargetArchitecture.X64): | ||
| case (InstructionSet.X64_AVX512_BF16_X64, TargetArchitecture.X64): | ||
| { | ||
| var type = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false); | ||
| if (type != null) | ||
| { | ||
| yield return type; | ||
| if (instructionSet == InstructionSet.X64_AVX512_BF16_X64) | ||
| { | ||
| var nestedType = type.GetNestedType("X64"u8); | ||
| if (nestedType != null) | ||
| { | ||
| yield return nestedType; | ||
| } | ||
| } | ||
| } | ||
| } | ||
| { | ||
| var parentType = context.SystemModule.GetType("System.Runtime.Intrinsics.X86"u8, "Avx512Bf16"u8, false); | ||
| if (parentType != null) | ||
| { | ||
| yield return parentType; | ||
| var nestedType = parentType.GetNestedType("VL"u8); | ||
| if (nestedType != null) | ||
| { | ||
| yield return nestedType; | ||
| if (instructionSet == InstructionSet.X64_AVX512_BF16_X64) | ||
| { | ||
| var nestedType64 = parentType.GetNestedType("VL_X64"u8); | ||
| if (nestedType64 != null) | ||
| { | ||
| yield return nestedType64; | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } |
Comment on lines
+987
to
+990
| if (resultflags.HasInstructionSet(InstructionSet.X64_AVX512v3)) | ||
| resultflags.AddInstructionSet(InstructionSet.X64_AVX512_BF16); | ||
| if (resultflags.HasInstructionSet(InstructionSet.X64_AVX512_BF16)) | ||
| resultflags.AddInstructionSet(InstructionSet.X64_AVX10v1); |
Comment on lines
+1056
to
+1059
| if (resultflags.HasInstructionSet(InstructionSet.X86_AVX512v3)) | ||
| resultflags.AddInstructionSet(InstructionSet.X86_AVX512_BF16); | ||
| if (resultflags.HasInstructionSet(InstructionSet.X86_AVX512_BF16)) | ||
| resultflags.AddInstructionSet(InstructionSet.X86_AVX10v1); |
| /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary> | ||
| /// <value><see langword="true" /> if the APIs are supported; otherwise, <see langword="false" />.</value> | ||
| /// <remarks>A value of <see langword="false" /> indicates that the APIs will throw <see cref="PlatformNotSupportedException" />.</remarks> | ||
| public static new bool IsSupported { get => IsSupported; } |
| internal X64() { } | ||
|
|
||
| /// <summary>Gets a value that indicates whether the APIs in this class are supported.</summary> | ||
| public static new bool IsSupported { get => IsSupported; } |
| /// <para>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)</para> | ||
| /// <para> VDPBF16PS zmm1, zmm2, zmm3/m512</para> | ||
| /// </summary> | ||
| public static Vector512<float> MultiplyWideningAndAdd(Vector512<float> addend, Vector512<ushort> left, Vector512<ushort> right) => MultiplyWideningAndAdd(addend, left, right); |
| /// <para> VCVTNE2PS2BF16 zmm1, zmm2, zmm3/m512</para> | ||
| /// <para>Round-to-nearest-even conversion of two packed FP32 vectors into a single packed BF16 vector (returned as <see cref="Vector512{T}"/> of <see cref="ushort"/>).</para> | ||
| /// </summary> | ||
| public static Vector512<ushort> ConvertToBFloat16(Vector512<float> lower, Vector512<float> upper) => ConvertToBFloat16(lower, upper); |
| /// <para> VCVTNEPS2BF16 ymm1, zmm2/m512</para> | ||
| /// <para>Round-to-nearest-even conversion of a packed FP32 vector into a half-width packed BF16 vector.</para> | ||
| /// </summary> | ||
| public static Vector256<ushort> ConvertToBFloat16(Vector512<float> value) => ConvertToBFloat16(value); |
Comment on lines
+62
to
+80
| public static new bool IsSupported { get => IsSupported; } | ||
|
|
||
| /// <summary>__m128 _mm_dpbf16_ps (__m128 src, __m128bh a, __m128bh b) — VDPBF16PS xmm</summary> | ||
| public static Vector128<float> MultiplyWideningAndAdd(Vector128<float> addend, Vector128<ushort> left, Vector128<ushort> right) => MultiplyWideningAndAdd(addend, left, right); | ||
|
|
||
| /// <summary>__m256 _mm256_dpbf16_ps (__m256 src, __m256bh a, __m256bh b) — VDPBF16PS ymm</summary> | ||
| public static Vector256<float> MultiplyWideningAndAdd(Vector256<float> addend, Vector256<ushort> left, Vector256<ushort> right) => MultiplyWideningAndAdd(addend, left, right); | ||
|
|
||
| /// <summary>__m128bh _mm_cvtne2ps_pbh (__m128 a, __m128 b) — VCVTNE2PS2BF16 xmm</summary> | ||
| public static Vector128<ushort> ConvertToBFloat16(Vector128<float> lower, Vector128<float> upper) => ConvertToBFloat16(lower, upper); | ||
|
|
||
| /// <summary>__m256bh _mm256_cvtne2ps_pbh (__m256 a, __m256 b) — VCVTNE2PS2BF16 ymm</summary> | ||
| public static Vector256<ushort> ConvertToBFloat16(Vector256<float> lower, Vector256<float> upper) => ConvertToBFloat16(lower, upper); | ||
|
|
||
| /// <summary>__m128bh _mm_cvtneps_pbh (__m128 a) — VCVTNEPS2BF16 xmm</summary> | ||
| public static Vector128<ushort> ConvertToBFloat16(Vector128<float> value) => ConvertToBFloat16(value); | ||
|
|
||
| /// <summary>__m128bh _mm256_cvtneps_pbh (__m256 a) — VCVTNEPS2BF16 xmm from ymm</summary> | ||
| public static Vector128<ushort> ConvertToBFloat16(Vector256<float> value) => ConvertToBFloat16(value); |
Comment on lines
+3654
to
+3667
| case NI_AVX512_BF16_ConvertToBFloat16: | ||
| { | ||
| assert(baseType == TYP_FLOAT); | ||
| if (numArgs == 2) | ||
| { | ||
| genHWIntrinsic_R_R_RM(node, INS_vcvtne2ps2bf16, attr, instOptions); | ||
| } | ||
| else | ||
| { | ||
| assert(numArgs == 1); | ||
| genHWIntrinsic_R_RM(node, INS_vcvtneps2bf16, attr, targetReg, op1, instOptions); | ||
| } | ||
| break; | ||
| } |
The codegen path for MultiplyWideningAndAdd calls
emitIns_SIMD_R_R_R_R(targetReg, op1Reg, op2Reg, op3Reg) which then
dispatches on Is3OpRmwInstruction:
- true -> emit movaps + emitIns_R_R_R (proper 3-operand RMW encoding)
- false -> fall through to a 4-operand blendv encoding emitting 7 bytes
with the 4th operand garbage-filled into the IS4 byte
INS_vdpbf16ps was not in the Is3OpRmwInstruction table, so it took the
blendv path and emitted bytes like `62 F2 7E 48 52 C4 70` (7 bytes) for
what should have been `62 F2 7E 48 52 C4` (6 bytes). MinOpts happened to
land on operand layouts where the garbage byte didn't trip up the CPU;
FullOpts (with register pressure pushing operands into zmm16-31 and
varied register triples) produced sequences the CPU rejected as #UD.
Tier-0/MinOpts probes (top-level statements) never exercised this path
which is why bf16probe passed both before and after the dispatch-case
fix — the bug is specifically the FullOpts MultiplyWideningAndAdd path.
After this fix:
CONTROL (4 acc, FullOpts): STRESS_OK
FAITHFUL (24 acc, FullOpts): STRESS_OK
MINOPTS (24 acc): STRESS_OK
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a DRAFT. Filed early to confirm ISA grouping and shape with @tannergooding via #129323 before completing the JIT-side codegen wiring. Please see the issue for the full design rationale, hardware table (including Strix Halo as an AMD reference device), and the A/B/C options I outlined for the grouping.
Closes #129323 (will be left open until confirmed).
What this adds
Wires up the public
Avx512Bf16/Avx512Bf16.VL/Avx512Bf16.X64classes inSystem.Runtime.Intrinsics.X86, exposingVDPBF16PS,VCVTNE2PS2BF16, andVCVTNEPS2BF16on the classicAVX-512-BF16gate rather than only viaAVX10v1. Direct sibling of theAvxVnni.V512work landed in PR #128365.Key design choices (mirrors #128365 patterns)
AVX512_BF16withimplication AVX512_BF16 -> AVX512v3andimplication AVX10v1 -> AVX512_BF16. This deliberately does not fold BF16 into AVX512v3 because v3-without-BF16 is a real shipping configuration (Intel Tiger Lake / Rocket Lake / Ice Lake-U). Folding would regress those parts' v3 surface. Preserves Tanner's "broadest overall support" principle. AMD Zen 5 (Strix Halo) and Sapphire Rapids+ enable the classic gate; AVX10v1 hardware picks up BF16 via the AVX10v1 -> AVX512_BF16 implication.BFloat16element type from existingSystem.BFloat16BCL primitive. NoVector*<ushort>interim.AVX10v1toAVX512_BF16. R2R image compatibility preserved.Compiler::lookupInstructionSet(returnsInstructionSet_AVX512_BF16for theAvx512Bf16class-name).VLVersionOfIsaandX64VersionOfIsacases added. Mirrors AVXVNNIINT / AVXVNNIINT_V512 pattern Tanner endorsed.HARDWARE_INTRINSICrows underAVX512_BF16withsimdSize=-1andHW_Flag_SpecialCodeGen|HW_Flag_SpecialImportmatching the AVXVNNIINT precedent — avoids the fixed-simdSize=64trap Copilot kept flagging on Add AvxVnni.V512 hardware intrinsics #128365.cpufeatures.c(subleaf 1 EAX bit 5) gated on AVX512v3 being enabled, emitting a newXArchIntrinsicConstants_Avx512Bf16constant.EXTERNAL_EnableAVX512_BF16(retail) /EnableAVX512_BF16(JIT). Existing config-comment tokens preserved.codeman.cppgate — plain shape, no fallback branches (Tanner explicitly rejected codeman-side fallbacks on the Add AvxVnni.V512 hardware intrinsics #128365codeman.cpp:1430thread).gen.bat:2d316351-72be-474a-9a53-d204621127e2.CpuId.csandSmokeTests/HardwareIntrinsics/Program.cs— only theAvx512Bf16lines uncommented (Avx512Bitalg / Avx512Vpopcntdq / Avx512Fp16 left commented because their managed types still don't exist — direct application of the Add AvxVnni.V512 hardware intrinsics #128365 lesson where over-eager uncommenting caused 11 NativeAOT build failures).Still TODO (after the issue confirms direction)
VDPBF16PS/VCVTNE2PS2BF16/VCVTNEPS2BF16inhwintrinsiccodegenxarch.cppandlsraxarch.cpp(needed before the intrinsic actually executes — currently the HARDWARE_INTRINSIC rows useINS_invalidslots).src/tests/JIT/HardwareIntrinsics/X86_Avx/Avx512Bf16/.These are deliberately out of scope for the draft. Once @tannergooding confirms the ISA grouping in #129323 they can be wired in a follow-up commit on this branch.
Files touched
InstructionSetDesc.txtAVX10v1toAVX512_BF16, add implicationshwintrinsic.cpp(IsaRangeArray)hwintrinsiclistxarch.hHARDWARE_INTRINSICrows withSpecialCodeGenflagshwintrinsicxarch.cpplookupInstructionSetdispatch;VLVersionOfIsa/X64VersionOfIsacasescpufeatures.h/cpufeatures.cHardwareIntrinsicHelpers.cscodeman.cppclrconfigvalues.h/jitconfigvalues.hAvx512Bf16.cs/Avx512Bf16.PlatformNotSupported.csSystem.Private.CoreLib.Shared.projitemsSystem.Runtime.Intrinsics.cs(ref)CpuId.cs,SmokeTests/Program.cscorinfoinstructionset.h,CorInfoInstructionSet.cs,ReadyToRunInstructionSetHelper.cs,jiteeversionguid.hgen.bat— do not hand-edit🤖 Generated with Claude Code