JIT: opportunistically lower value & ((1 << k) - 1) to BMI2 BZHI #129368

@AndyAyersMS

Note

This issue was drafted with Copilot CLI assistance.

The JIT already recognizes most BMI1/BMI2 and related scalar patterns opportunistically (andn, blsi, blsr, blsmsk, shlx/shrx/sarx, rorx, mulx, tzcnt, lzcnt, popcnt), but the BMI2 bzhi instruction is only emitted when user code explicitly calls Bmi2.ZeroHighBits. The natural C# expression value & ((1 << k) - 1), used wherever code needs to keep the low k bits of a value with a variable count, still lowers to a four-instruction sequence even on hardware that has BZHI.

Current codegen for value & ((1u << k) - 1):

mov      r,  1
shlx     r,  r,  k        ; (or shl r, cl without BMI2)
dec      r
and      value, r

Possible BZHI codegen (with a safety mask to preserve IL shift semantics when k >= operandSize):

and      k,  31           ; (or 63 for 64-bit; skippable if IntegralRange proves k in range)
bzhi     value, value, k

Semantic note

BZHI reads only bits[7:0] of its index operand. When that value is >= operandSize, BZHI returns the source unchanged. The IL shift, however, masks the count with (operandSize - 1), keeping only its low 5 bits (32-bit) or 6 bits (64-bit). So on a 32-bit shift, 1 << 32 behaves as 1 << 0 and (1 << 32) - 1 evaluates to 0. That means value & ((1 << k) - 1) == 0 for k = 32, while bzhi(value, 32) == value. To preserve semantics we must either mask k explicitly or prove from range analysis that k is in [0, operandSize).
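The mismatch at k = 32 can be checked with a small software model. This is a minimal sketch, not JIT code: bzhi32_model, low_bits_masked_shift and low_bits_bzhi_safe are hypothetical names, and the model emulates BZHI in plain C++ from the description above rather than calling the hardware intrinsic. The shift count is masked explicitly because C++ leaves oversized shifts undefined.

```cpp
#include <cassert>
#include <cstdint>

// Software model of 32-bit BZHI as described above (an assumption of this
// sketch, not the hardware intrinsic): only bits [7:0] of the index are read,
// and an index >= 32 returns the source unchanged.
constexpr uint32_t bzhi32_model(uint32_t src, uint32_t index) {
    return (index & 0xFFu) >= 32u
               ? src
               : (src & ((uint32_t{1} << (index & 0xFFu)) - 1u));
}

// value & ((1 << k) - 1) with the shift count masked to its low 5 bits.
constexpr uint32_t low_bits_masked_shift(uint32_t value, uint32_t k) {
    return value & ((uint32_t{1} << (k & 31u)) - 1u);
}

// BZHI preceded by the explicit safety mask: and k, 31 ; bzhi value, value, k.
constexpr uint32_t low_bits_bzhi_safe(uint32_t value, uint32_t k) {
    return bzhi32_model(value, k & 31u);
}

// Returns true if the masked-shift form and the safe BZHI form agree on the
// given value for every count that fits in 8 bits.
inline bool agree_for_all_counts(uint32_t value) {
    for (uint32_t k = 0; k < 256u; ++k) {
        if (low_bits_masked_shift(value, k) != low_bits_bzhi_safe(value, k)) {
            return false;
        }
    }
    return true;
}

// Spot-checks the k = 32 case from the semantic note.
inline void check_k32(uint32_t value) {
    assert(low_bits_masked_shift(value, 32u) == 0u);   // mask is (1 << 0) - 1
    assert(bzhi32_model(value, 32u) == value);         // unmasked BZHI differs
    assert(low_bits_bzhi_safe(value, 32u) == 0u);      // safety mask restores it
}
```

Calling check_k32(0xDEADBEEF) exercises the three cases in the note, and agree_for_all_counts confirms that, in this model, masking k is enough to make the two forms agree for any 8-bit count.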

Prototype

Branch: https://github.com/AndyAyersMS/runtime/tree/fix-bzhi-recognition

Adds Lowering::TryLowerAndOpToZeroHighBits in lowerxarch.cpp that:

  • Recognizes AND(value, SUB(LSH(1, k), 1)) and the morph-canonicalized ADD(..., -1) form (commutatively).
  • Inserts an explicit AND(k, operandSize - 1) unless the IR pattern AND(k_inner, const) with const < operandSize proves the count is bounded (covers the common case where LowerShift has already stripped morph's & 31/& 63 from the shift count).
  • Bails out if anything depends on flags from the original AND/SUB/LSH.
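The recognition steps above can be sketched on a toy expression tree. Everything here is hypothetical and for illustration only: Node, Op, Match and tryMatchZeroHighBits are invented for this sketch and are not the JIT's GenTree or the prototype's Lowering::TryLowerAndOpToZeroHighBits. The flags bail-out is not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy IR; the real JIT uses GenTree nodes.
enum class Op { Const, Var, And, Add, Sub, Lsh };

struct Node {
    Op op;
    int64_t value;    // meaningful for Const only
    const Node* lhs;  // operands for binary ops, else nullptr
    const Node* rhs;
};

inline Node makeConst(int64_t v) { return Node{Op::Const, v, nullptr, nullptr}; }
inline Node makeVar() { return Node{Op::Var, 0, nullptr, nullptr}; }
inline Node makeBinary(Op op, const Node* l, const Node* r) { return Node{op, 0, l, r}; }

inline bool isConst(const Node* n, int64_t v) {
    return n && n->op == Op::Const && n->value == v;
}

// Matches LSH(1, k) and returns k, or nullptr if the shape differs.
inline const Node* matchOneShiftedByK(const Node* n) {
    return (n && n->op == Op::Lsh && isConst(n->lhs, 1)) ? n->rhs : nullptr;
}

// Matches SUB(LSH(1, k), 1) or ADD(LSH(1, k), -1) in either operand order.
inline const Node* matchLowMask(const Node* n) {
    if (!n) return nullptr;
    if (n->op == Op::Sub && isConst(n->rhs, 1)) return matchOneShiftedByK(n->lhs);
    if (n->op == Op::Add) {
        if (isConst(n->rhs, -1)) return matchOneShiftedByK(n->lhs);
        if (isConst(n->lhs, -1)) return matchOneShiftedByK(n->rhs);
    }
    return nullptr;
}

inline bool isConstBelow(const Node* n, int64_t limit) {
    return n && n->op == Op::Const && n->value >= 0 && n->value < limit;
}

// True when the count is AND(k_inner, c) with 0 <= c < operandSize.
inline bool countIsBounded(const Node* k, int operandSize) {
    return k && k->op == Op::And &&
           (isConstBelow(k->lhs, operandSize) || isConstBelow(k->rhs, operandSize));
}

struct Match {
    const Node* value;  // operand whose high bits get zeroed
    const Node* count;  // the k operand
    bool needsMask;     // whether AND(k, operandSize - 1) must be inserted
};

// Recognizes AND(value, mask) or AND(mask, value).
inline bool tryMatchZeroHighBits(const Node* n, int operandSize, Match* out) {
    if (!n || n->op != Op::And) return false;
    const Node* value = n->lhs;
    const Node* k = matchLowMask(n->rhs);
    if (!k) {
        value = n->rhs;
        k = matchLowMask(n->lhs);
    }
    if (!k) return false;
    out->value = value;
    out->count = k;
    out->needsMask = !countIsBounded(k, operandSize);
    return true;
}

// Usage example: value & ((1 << k) - 1) with an unbounded count.
inline void exampleMatch() {
    Node one = makeConst(1), k = makeVar(), v = makeVar();
    Node lsh = makeBinary(Op::Lsh, &one, &k);
    Node sub = makeBinary(Op::Sub, &lsh, &one);
    Node root = makeBinary(Op::And, &v, &sub);
    Match m{};
    assert(tryMatchZeroHighBits(&root, 32, &m));
    assert(m.value == &v && m.count == &k && m.needsMask);
}
```

Returning needsMask separately from the match keeps the pattern check independent of how the count is proven in range, so a stronger proof (such as the IntegralRange idea in the follow-ups) could replace countIsBounded without touching the matcher.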

SPMI x64 results (805,338 contexts; collections: aspnet2, benchmarks.run, benchmarks.run_pgo, realworld.run, libraries.pmi, libraries.crossgen2)

Collection            Contexts   Δ bytes   Δ PerfScore in diffs
benchmarks.run               7       −46   +0.13%
realworld.run                4       −32   +0.05%
libraries.pmi               16      −104   −3.07%
libraries.crossgen2          9       −56   −0.10%
total                       36      −238   net win

All 36 diff contexts are size improvements, with no size regressions. Of the 36 affected contexts, 33 are PerfScore improvements (as large as −25%; −8.93% on hot methods) and 3 are small PerfScore regressions (≤ +1.36%). The regressions are all attributable to the JIT perf model preferring and mem, reg (RMW) over bzhi mem; mov mem (load + bzhi + store); actual µarch behavior is essentially identical.

Sample diffs

System.Collections.BitArray.ClearHighExtraBits (realworld):

-       mov      edx, 1
-       shlx     eax, edx, eax
-       dec      eax
-       and      dword ptr [rcx], eax
+       and      eax, 31
+       bzhi     eax, dword ptr [rcx], eax
+       mov      dword ptr [rcx], eax

RealParser.AssembleFloatingPointValue (libraries.pmi, 64-bit BZHI):

-       mov      ecx, 1
-       shlx     rcx, rcx, rax
-       dec      rcx
-       and      rcx, rsi
-       mov      rsi, rcx
+       and      eax, 63
+       bzhi     rsi, rsi, rax

BFloat16.RoundMidpointToEven<int> (libraries.pmi, −8.93% PerfScore on this method):

-       mov      r8d, 1
-       shlx     r8d, r8d, eax
-       dec      r8d
-       and      r8d, ecx
+       mov      r8d, eax
+       and      r8d, 31
+       bzhi     r8d, ecx, r8d

V8.Crypto.BigInteger.fromString (benchmarks.run):

-       mov      r9d, 1
-       shlx     r10d, r9d, r10d
-       dec      r10d
-       and      r10d, eax
+       and      r10d, 31
+       bzhi     r10d, eax, r10d

Real-world hits

BitArray.ClearHighExtraBits, BigInteger.{fromString, fromByteArray, toString, modPow, toByteArray}, BFloat16.RoundMidpointToEven, RealParser.{ConvertDecimalToFloatingPointBits, AssembleFloatingPointValue}, InflaterManaged.DecodeBlock, InputBuffer.GetBits, DeflaterHuffman.CompressBlock, RegularExpressions.Symbolic.BitVector.ClearRemainderBits, AsnWriter.CheckValidLastByte, Microsoft.CodeAnalysis.CSharp symbol-flag helpers.

Potential follow-ups

  1. Suppress the safety and k, 31/and k, 63 when IntegralRange can prove the count is in range (e.g., when LowerShift strips a morph-inserted mask that BZHI could have used).
  2. Decide whether bzhi mem; mov mem or and mem, reg (RMW) should be preferred for the few mem-RMW shapes. One option is to skip the transform when the value is a contained memory operand that would otherwise become an and mem RMW.
  3. Consider extending the recognition to two other formulations: ~((-1) << k) & value, and (value << (size - k)) >> (size - k) with a logical right shift.
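Before extending the recognition, it may help to check how the two other formulations behave under masked shift counts. The sketch below is a plain C++ model for 32-bit unsigned operands; shl, shr, mask_form, not_form and shift_pair_form are hypothetical helper names, and the explicit & 31 on each count stands in for the shift-count masking described in the semantic note. In this model the ~((-1) << k) form agrees with the mask form for every count, while the shift-pair form agrees only when k & 31 is nonzero, so it would likely need separate handling.

```cpp
#include <cassert>
#include <cstdint>

// Shifts with the count masked to 5 bits (assumption: models the masked
// shift-count behavior described in the semantic note).
constexpr uint32_t shl(uint32_t x, uint32_t n) { return x << (n & 31u); }
constexpr uint32_t shr(uint32_t x, uint32_t n) { return x >> (n & 31u); }  // logical

// value & ((1 << k) - 1)
constexpr uint32_t mask_form(uint32_t v, uint32_t k) {
    return v & (shl(1u, k) - 1u);
}

// ~((-1) << k) & value
constexpr uint32_t not_form(uint32_t v, uint32_t k) {
    return ~shl(0xFFFFFFFFu, k) & v;
}

// (value << (size - k)) >> (size - k), with size = 32
constexpr uint32_t shift_pair_form(uint32_t v, uint32_t k) {
    return shr(shl(v, 32u - k), 32u - k);
}

// True if not_form matches mask_form for every 8-bit count.
inline bool not_form_agrees(uint32_t v) {
    for (uint32_t k = 0; k < 256u; ++k) {
        if (not_form(v, k) != mask_form(v, k)) return false;
    }
    return true;
}

// True if shift_pair_form matches mask_form whenever (k & 31) != 0.
inline bool shift_pair_agrees_for_nonzero_counts(uint32_t v) {
    for (uint32_t k = 0; k < 256u; ++k) {
        if ((k & 31u) != 0u && shift_pair_form(v, k) != mask_form(v, k)) return false;
    }
    return true;
}

// Shows the k = 0 divergence of the shift-pair form.
inline void check_zero_count(uint32_t v) {
    assert(mask_form(v, 0u) == 0u);        // (1 << 0) - 1 == 0
    assert(shift_pair_form(v, 0u) == v);   // both shifts are by 32 & 31 == 0
}
```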

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
