This issue was drafted with Copilot CLI assistance.
The JIT already recognizes most BMI1/BMI2 scalar patterns opportunistically (andn, blsi, blsr, blsmsk, shlx/shrx/sarx, rorx, mulx, tzcnt, lzcnt, popcnt), but the BMI2 bzhi instruction is only emitted when user code explicitly calls Bmi2.ZeroHighBits. The natural C# expression value & ((1 << k) - 1) — used wherever code needs to keep the low k bits of a value with a variable count — still lowers to a 4-instruction sequence even on hardware that has BZHI.
Current codegen for value & ((1u << k) - 1):
    mov   r, 1
    shlx  r, r, k      ; (or shl r, cl without BMI2)
    dec   r
    and   value, r
Possible BZHI codegen (with a safety mask to preserve IL shift semantics when k >= operandSize):
    and   k, 31        ; (or 63 for 64-bit; skippable if IntegralRange proves k in range)
    bzhi  value, value, k
Semantic note
BZHI reads only bits [7:0] of its index operand. When that value is >= operandSize, BZHI returns the source unchanged. The IL shift, however, masks the count with (operandSize - 1), so on a 32-bit shift 1 << 32 is evaluated as 1 << 0 and (1 << 32) - 1 evaluates to 0. As a result, value & ((1 << k) - 1) == 0 for k = 32, while bzhi(value, 32) == value. To preserve semantics we need to either mask k explicitly or prove from range analysis that k is in [0, operandSize).
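The mismatch at k = 32 can be checked with a small software model. The sketch below is C++ and does not execute the real instruction or any JIT code: `bzhi32_model`, `low_bits_masked_shift`, and `bzhi32_with_safety_mask` are hypothetical helper names, and the model assumes a 32-bit operand size and the masked-count shift behavior described in this note.

```cpp
#include <cstdint>

// Software model of 32-bit BZHI (an assumption-based model, not the real
// instruction): only bits [7:0] of the index are read, and an index >= 32
// returns the source unchanged.
constexpr uint32_t bzhi32_model(uint32_t src, uint32_t index) {
    const uint32_t n = index & 0xFFu;
    if (n >= 32u) {
        return src;
    }
    return src & ((uint32_t{1} << n) - 1u);
}

// Model of value & ((1u << k) - 1) when the shift count is masked to 5 bits,
// as described for a 32-bit shift.
constexpr uint32_t low_bits_masked_shift(uint32_t value, uint32_t k) {
    return value & ((uint32_t{1} << (k & 31u)) - 1u);
}

// BZHI preceded by the explicit "and k, 31" safety mask.
constexpr uint32_t bzhi32_with_safety_mask(uint32_t value, uint32_t k) {
    return bzhi32_model(value, k & 31u);
}

// In range, all three forms agree.
static_assert(bzhi32_model(0xFFFFFFFFu, 5u) == 0x1Fu, "in-range bzhi");
static_assert(low_bits_masked_shift(0xFFFFFFFFu, 5u) == 0x1Fu, "in-range shift form");

// For k = 32 the unmasked BZHI model keeps the value, the shift form gives 0.
static_assert(bzhi32_model(0xFFFFFFFFu, 32u) == 0xFFFFFFFFu, "bzhi keeps source");
static_assert(low_bits_masked_shift(0xFFFFFFFFu, 32u) == 0u, "shift form gives 0");

// With the safety mask the two agree again.
static_assert(bzhi32_with_safety_mask(0xFFFFFFFFu, 32u) == 0u, "masked bzhi gives 0");
```

The checks are compile-time `static_assert`s, so the snippet verifies itself when compiled as C++14 or later.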
Prototype
Adds Lowering::TryLowerAndOpToZeroHighBits in lowerxarch.cpp that:
- Recognizes AND(value, SUB(LSH(1, k), 1)) and the morph-canonicalized ADD(..., -1) form (commutatively).
- Inserts an explicit AND(k, operandSize - 1) unless the IR pattern AND(k_inner, const) with const < operandSize proves the count is bounded (covers the common case where LowerShift has already stripped morph's & 31/& 63 from the shift count).
- Bails out if anything depends on flags from the original AND/SUB/LSH.
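The recognition step can be illustrated on a toy expression tree. This is only a sketch in C++: `Op`, `Node`, and the `match_*` helpers are hypothetical and are not the JIT's GenTree API. It models the pattern match alone, and leaves out the safety-mask insertion and the flags bail-out. Accepting -1 on either side of ADD is an assumption of the sketch.

```cpp
#include <cstdint>

// Toy IR for illustration only (hypothetical, not the JIT's node types).
enum class Op { Const, Local, Lsh, Sub, Add, And };

struct Node {
    Op          op;
    int64_t     value;  // constant payload for Const, local number for Local
    const Node* lhs;
    const Node* rhs;
};

constexpr bool is_const(const Node* n, int64_t v) {
    return n != nullptr && n->op == Op::Const && n->value == v;
}

// Matches LSH(1, k) and returns k, or nullptr when the shape differs.
constexpr const Node* match_one_shifted(const Node* n) {
    return (n != nullptr && n->op == Op::Lsh && is_const(n->lhs, 1)) ? n->rhs : nullptr;
}

// Matches SUB(LSH(1, k), 1) or ADD(LSH(1, k), -1) (either ADD operand order)
// and returns k, or nullptr when the shape differs.
constexpr const Node* match_low_mask(const Node* n) {
    if (n == nullptr) return nullptr;
    if (n->op == Op::Sub && is_const(n->rhs, 1)) return match_one_shifted(n->lhs);
    if (n->op == Op::Add) {
        if (is_const(n->rhs, -1)) return match_one_shifted(n->lhs);
        if (is_const(n->lhs, -1)) return match_one_shifted(n->rhs);
    }
    return nullptr;
}

struct ZeroHighBitsMatch {
    const Node* value;  // operand whose low bits are kept
    const Node* count;  // k, or nullptr when nothing matched
};

// Matches AND(value, mask) with the mask on either side of the AND.
constexpr ZeroHighBitsMatch match_zero_high_bits(const Node* n) {
    if (n == nullptr || n->op != Op::And) return {nullptr, nullptr};
    if (const Node* k = match_low_mask(n->rhs)) return {n->lhs, k};
    if (const Node* k = match_low_mask(n->lhs)) return {n->rhs, k};
    return {nullptr, nullptr};
}
```

A caller would test `match_zero_high_bits(node).count != nullptr` and, on success, use the returned `value` and `count` nodes as the BZHI source and index.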
SPMI x64 results
All 36 diff contexts are size improvements, with no size regressions. Of the 36, 33 show PerfScore improvements (as large as −25%, and −8.93% on hot methods) and 3 show small PerfScore regressions (≤ +1.36%). The regressions are all attributable to the JIT perf model preferring and mem, reg (RMW) over bzhi mem; mov mem (load + bzhi + store); actual µarch behavior is essentially identical.
Potential follow-ups
- Suppress the safety and k, 31/and k, 63 when IntegralRange can prove the count is in range (e.g., when LowerShift strips a morph-inserted mask that BZHI could have used).
- Decide whether bzhi mem; mov mem or and mem, reg (RMW) should be preferred for the few mem-RMW shapes; possibly skip the transform when the value is a contained memory operand that would otherwise be an and mem RMW.
- Consider extending the recognition to the ~((-1) << k) & value and value << (size - k) >> (size - k) formulations.
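Before extending the recognition, it may help to check where the alternative formulations actually agree with the mask form. The sketch below is a C++ model under stated assumptions: 32-bit unsigned values, logical right shift, and shift counts masked to 5 bits. The names `shl32`, `shr32`, `mask_form`, `not_form`, and `shift_pair_form` are hypothetical. Under these assumptions the `~((-1) << k) & value` form matches for every count tried, while the shift-pair form matches only for k in [1, 31] and differs at k = 0.

```cpp
#include <cstdint>

// Shifts with the count masked to 5 bits (assumed 32-bit shift semantics).
constexpr uint32_t shl32(uint32_t x, uint32_t c) { return x << (c & 31u); }
constexpr uint32_t shr32(uint32_t x, uint32_t c) { return x >> (c & 31u); }

// value & ((1 << k) - 1)
constexpr uint32_t mask_form(uint32_t v, uint32_t k) {
    return v & (shl32(1u, k) - 1u);
}

// ~((-1) << k) & value
constexpr uint32_t not_form(uint32_t v, uint32_t k) {
    return ~shl32(~0u, k) & v;
}

// value << (size - k) >> (size - k), unsigned, so the right shift is logical.
constexpr uint32_t shift_pair_form(uint32_t v, uint32_t k) {
    return shr32(shl32(v, 32u - k), 32u - k);
}

// True when not_form and mask_form agree for every k in [lo, hi).
constexpr bool not_form_agrees(uint32_t v, uint32_t lo, uint32_t hi) {
    for (uint32_t k = lo; k < hi; ++k) {
        if (not_form(v, k) != mask_form(v, k)) return false;
    }
    return true;
}

// True when shift_pair_form and mask_form agree for every k in [lo, hi).
constexpr bool shift_pair_agrees(uint32_t v, uint32_t lo, uint32_t hi) {
    for (uint32_t k = lo; k < hi; ++k) {
        if (shift_pair_form(v, k) != mask_form(v, k)) return false;
    }
    return true;
}

static_assert(not_form_agrees(0xDEADBEEFu, 0u, 64u), "NOT form agrees for all counts tried");
static_assert(shift_pair_agrees(0xDEADBEEFu, 1u, 32u), "shift pair agrees for k in [1, 31]");
static_assert(shift_pair_form(0xDEADBEEFu, 0u) != mask_form(0xDEADBEEFu, 0u), "k = 0 differs");
```

For a signed value the right shift would be arithmetic and would sign-extend instead of clearing the high bits, which this model does not cover.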
Prototype branch: https://github.com/AndyAyersMS/runtime/tree/fix-bzhi-recognition
SPMI x64 run: 805,338 contexts; collections: aspnet2, benchmarks.run, benchmarks.run_pgo, realworld.run, libraries.pmi, libraries.crossgen2
Sample diffs
- System.Collections.BitArray.ClearHighExtraBits (realworld)
- RealParser.AssembleFloatingPointValue (libraries.pmi, 64-bit BZHI)
- BFloat16.RoundMidpointToEven<int> (libraries.pmi, −8.93% PerfScore on this method)
- V8.Crypto.BigInteger.fromString (benchmarks.run)
Real-world hits
BitArray.ClearHighExtraBits, BigInteger.{fromString, fromByteArray, toString, modPow, toByteArray}, BFloat16.RoundMidpointToEven, RealParser.{ConvertDecimalToFloatingPointBits, AssembleFloatingPointValue}, InflaterManaged.DecodeBlock, InputBuffer.GetBits, DeflaterHuffman.CompressBlock, RegularExpressions.Symbolic.BitVector.ClearRemainderBits, AsnWriter.CheckValidLastByte, Microsoft.CodeAnalysis.CSharp symbol-flag helpers.