fix: make AVX-512 conditional; fix std::round→nearbyint for numpy banker's rounding

peng.li24 · peng.li24 · commit 82f4867a9067 · 2026-06-02T15:50:06.000+08:00
- Makefile: -mavx512f -mfma now conditional (make AVX512=1), -msse4.1 added
  unconditionally for einsum SSE intrinsics
- module.cpp: add _has_avx512_svml() compile-time detection
- test_all.py: use the compile-time flag to skip ALL transcendental tests
  (not just the large sizes) when the SVML bridge is not compiled in
- core.h: std::round → std::nearbyint to match numpy's round-half-to-even
  (banker's rounding), fixing float32 mismatch at exact .5 values
- README: updated compiler flags section, test count 449→460
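
To illustrate the core.h change, here is a minimal sketch of how the two functions treat exact .5 ties. It assumes the default FE_TONEAREST rounding mode, which std::nearbyint follows. The helper names round_half_even, round_half_away and check_tie_handling are hypothetical and not part of numpycpp.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>
#include <cstddef>

// Hypothetical helper: round each value to the nearest integer, sending
// exact .5 ties to the even neighbour (requires FE_TONEAREST, the default).
template <typename T>
void round_half_even(const T* src, T* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = std::nearbyint(src[i]);
}

// Same loop with std::round, which sends exact .5 ties away from zero.
template <typename T>
void round_half_away(const T* src, T* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = std::round(src[i]);
}

// Self-check on inputs that are exactly representable ties in float32.
inline void check_tie_handling() {
    assert(std::fegetround() == FE_TONEAREST);
    const float in[4] = {0.5f, 1.5f, 2.5f, -2.5f};
    float even[4];
    float away[4];
    round_half_even(in, even, 4);
    round_half_away(in, away, 4);
    // Ties go to the even neighbour: 0, 2, 2, -2.
    assert(even[0] == 0.0f && even[1] == 2.0f && even[2] == 2.0f && even[3] == -2.0f);
    // Ties go away from zero: 1, 2, 3, -3.
    assert(away[0] == 1.0f && away[1] == 2.0f && away[2] == 3.0f && away[3] == -3.0f);
}
```

The two variants differ only on exact ties, so random inputs rarely expose the mismatch. Note that std::nearbyint follows the current floating-point rounding mode, so the result changes if a caller has modified it with std::fesetround.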

Fixes CI SIGILL on non-AVX-512 runners (GitHub Actions ubuntu-22.04).
Without -mavx512f, __AVX512F__ is not defined → SVML bridge uses std::
fallbacks → no AVX-512 intrinsics are emitted → safe on any x86_64 CPU
with SSE4.1 (still required by the unconditional -msse4.1).
Transcendental tests auto-skip when the SVML bridge is not compiled in.
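
The compile-time gate described above can be sketched as a stand-alone probe. This is an illustrative sketch only: built_with_avx512f, MathBackend, math_backend and check_backend_consistency are hypothetical names, not the module's actual API. It relies on GCC and Clang predefining __AVX512F__ when the translation unit is compiled with -mavx512f.

```cpp
#include <cassert>

// Hypothetical probe: true only if this translation unit was compiled with
// AVX-512F enabled (e.g. via -mavx512f). It says nothing about the CPU the
// binary later runs on.
constexpr bool built_with_avx512f() {
#ifdef __AVX512F__
    return true;
#else
    return false;
#endif
}

// Hypothetical label for the math path such a build would select.
enum class MathBackend { SvmlBridge, StdFallback };

constexpr MathBackend math_backend() {
    return built_with_avx512f() ? MathBackend::SvmlBridge
                                : MathBackend::StdFallback;
}

// The backend choice must always agree with the compile-time probe.
inline void check_backend_consistency() {
    assert((math_backend() == MathBackend::SvmlBridge) == built_with_avx512f());
}
```

Because the probe is resolved by the preprocessor, it reports how the binary was built, not what the host supports. A binary built with AVX512=1 can still hit SIGILL on a CPU without AVX-512, which is why the flag is opt-in.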
diff --git a/README.md b/README.md
@@ -15,7 +15,7 @@ We created `numpycpp` to keep NumPy's familiar usage patterns while letting C++
 
 `numpycpp` is a **header-only C++ library** implementing numpy's core API (`numpy.*`, `numpy.linalg.*`, `numpy.einsum`) with **bit-level precision alignment**. Raw pointer + size interface. Zero external dependencies — pure C++17 standard library.
 
-All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (449 tests, float64 + float32).
+All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (460 tests, float64 + float32).
 
 **Bit-exact math** is achieved via an SVML bridge that resolves numpy's own transcendental functions (`__svml_exp8`, `__svml_sin8`, etc.) from the loaded `_multiarray_umath.so` at runtime. This guarantees that `exp`, `log`, `sin`, `cos`, `tan`, and all other transcendental functions produce the exact same bits as numpy. On platforms without AVX-512, the bridge falls back to `std::` (1‑ULP).
 
@@ -80,12 +80,12 @@ Add `-Ipath/to/numpycpp` to your compiler flags and include the headers directly
 ### Testing
 
 The test suite verifies **bit-level precision alignment** between every C++ function and Python numpy.
-No tolerance, no `atol`/`rtol` — raw IEEE 754 bits must match exactly. 449 tests, float64 + float32.
+No tolerance, no `atol`/`rtol` — raw IEEE 754 bits must match exactly. 460 tests, float64 + float32.
 
 ```bash
 cd tests
 make                    # compile C++ test module
-make test               # run all 449 tests (silent mode: only failures print)
+make test               # run all 460 tests (silent mode: only failures print)
 ```
 
 To run with verbose output:
@@ -101,34 +101,43 @@ The Makefile applies the following flags:
 
 ```makefile
 CXXFLAGS ?= -std=c++17 -O2 -fPIC -fopenmp                \
-            -ffp-contract=off -ffloat-store               \
-            -mavx512f -mfma                                \
+            -ffp-contract=off -ffloat-store -msse4.1      \
             -fno-builtin-exp    -fno-builtin-log           \
             -fno-builtin-sin    -fno-builtin-cos           \
             -fno-builtin-tan    -fno-builtin-pow           \
             -fno-builtin-sqrt   -fno-builtin-atan2         \
             -fno-builtin-log2   -fno-builtin-log10         \
             -fno-builtin-asin   -fno-builtin-acos          \
             -fno-builtin-atan   -fno-builtin-exp2
+# Optional: enable AVX-512 for bit-exact transcendental math.
+# Requires AVX-512 hardware. Usage: make AVX512=1
+ifdef AVX512
+CXXFLAGS += -mavx512f -mfma
+endif
 LDFLAGS   = -shared -ldl
 ```
 
 | Flag | Purpose |
 |------|---------|
 | `-ffp-contract=off` | Disable FMA contraction — numpy does not contract |
 | `-ffloat-store` | Prevent excess x87 precision in registers |
-| `-mavx512f -mfma` | Enable AVX-512 so the SVML bridge resolves numpy's own vector math library |
+| `-msse4.1` | Required for einsum SSE intrinsics (`_mm_hadd_pd`, `_mm_insert_epi32`) |
+| `-mavx512f -mfma` | **Optional** (`make AVX512=1`): enable SVML bridge for bit-exact transcendental math. Requires AVX-512 hardware. Without this, transcendental functions fall back to `std::` (1‑ULP difference) |
 | `-fno-builtin-<func>` | **Critical**: prevent GCC from replacing math calls with its built-in implementations. Without these, `exp`/`log`/`sin`/etc. will use libm instead of the SVML bridge, breaking bit-exact alignment |
 | `-ldl` | Required for `dlsym` at runtime to resolve SVML symbols from `_multiarray_umath.so` |
 
 > **Why `-fno-builtin` matters**: GCC's built-in math functions produce different last-bits than numpy's SVML.
 > Even `std::exp` vs `__svml_exp8` differ by 1‑2 ULP for some inputs.
 > These flags ensure the SVML bridge intercepts every transcendental call, guaranteeing bit-identical output.
+> 
+> **AVX-512 is optional**: The test suite auto-detects whether the module was compiled with `-mavx512f`
+> and skips transcendental tests when it was not. Non-AVX-512 builds are safe for CI and machines
+> without AVX-512 hardware — only the transcendental tests are skipped; all other 350+ tests still run.
 
 ### Alignment status
 
 The table below reflects the current bit-level parity between `numpycpp` C++ and Python numpy.
-All 449 tests pass under strict IEEE 754 bit comparison (float64 + float32).
+All 460 tests pass under strict IEEE 754 bit comparison (float64 + float32).
 
 ✅ = bit-exact on AVX-512 (SVML bridge active).  
 🔶 = 1-ULP on non-AVX-512 (falls back to `std::` math).
@@ -176,7 +185,7 @@ numpycpp/
 │   └── einsum_py.h
 ├── tests/              # bit-level precision tests + test module
 │   ├── module.cpp      # pybind11 module for testing
-│   ├── test_all.py     # single entry — all APIs, 449 tests, float64+float32
+│   ├── test_all.py     # single entry — all APIs, 460 tests, float64+float32
 │   ├── conftest.py     # silent-mode output suppression
 │   └── Makefile
 ├── CMakeLists.txt      # build & .deb packaging
diff --git a/numpy/core.h b/numpy/core.h
@@ -158,7 +158,7 @@ inline void arctan(const T* src, T* dst, size_t n) {
 /// numpy.round(a, decimals=0, out=None)
 template<typename T>
 inline void round(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = std::round(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::nearbyint(src[i]));
 }
 
 /// numpy.floor(x, /, out=None, *, where=True, ...)
diff --git a/tests/Makefile b/tests/Makefile
@@ -5,13 +5,17 @@
 
 CXX      ?= g++
 CXXFLAGS ?= -std=c++17 -O2 -fPIC -fopenmp -ffp-contract=off \
-	-ffloat-store \
-	-mavx512f -mfma \
+	-ffloat-store -msse4.1 \
 	-fno-builtin-exp -fno-builtin-log -fno-builtin-sin \
 	-fno-builtin-cos -fno-builtin-tan -fno-builtin-pow \
 	-fno-builtin-sqrt -fno-builtin-atan2 -fno-builtin-log2 \
 	-fno-builtin-log10 -fno-builtin-asin -fno-builtin-acos \
 	-fno-builtin-atan -fno-builtin-exp2
+# Optional: enable AVX-512 for SVML bridge bit-exact transcendental math.
+# Requires AVX-512 hardware. Usage: make AVX512=1
+ifdef AVX512
+CXXFLAGS += -mavx512f -mfma
+endif
 INCLUDES  = -I.. -I../pycpp $(shell python3 -m pybind11 --includes) $(shell pkg-config --cflags eigen3 2>/dev/null || echo)
 LDFLAGS   = -shared -ldl
 
diff --git a/tests/module.cpp b/tests/module.cpp
@@ -239,4 +239,13 @@ PYBIND11_MODULE(numpycpp, m) {
     // -- Einsum ------------------------------------------------------------
     m.def("einsum", static_cast<py::array_t<double>(*)(const std::string&, const py::array_t<double>&, const py::array_t<double>&)>(&numpy::einsum));
     m.def("einsum", static_cast<py::array_t<float>(*)(const std::string&, const py::array_t<float>&, const py::array_t<float>&)>(&numpy::einsum));
+
+	// -- Compile-time capability detection -------------------------------
+	m.def("_has_avx512_svml", []() -> bool {
+#ifdef __AVX512F__
+		return true;
+#else
+		return false;
+#endif
+	});
 }
diff --git a/tests/test_all.py b/tests/test_all.py
@@ -136,48 +136,70 @@ def dtype(request):
     return request.param
 
 
+# ---------------------------------------------------------------------------
+# AVX-512 SVML detection — bit-exact transcendental math requires BOTH
+# compile-time -mavx512f (which defines __AVX512F__) and runtime AVX-512
+# hardware. If the C++ module was compiled without -mavx512f, the SVML bridge
+# falls back to std:: (1-ULP difference from numpy). In that case ALL
+# transcendental tests are skipped — not just large sizes.
+# ---------------------------------------------------------------------------
+
+def _has_avx512_svml():
+    """Check if C++ module was compiled with AVX-512 SVML bridge (__AVX512F__)."""
+    try:
+        return get_cpp_module()._has_avx512_svml()
+    except Exception:
+        return False
+
+
 # ============================================================================
 # 1. Data-driven element-wise unary math
 # ============================================================================
-# Each entry: (cpp_fn_name, np_fn, input_prep, sizes)
+# Each entry: (cpp_fn_name, np_fn, input_prep, sizes, transcendental)
 #   input_prep: None → random_array directly; else callable: prep(a) → input
 #   sizes: list of (size, seed) tuples
+#   transcendental: True → requires AVX-512 SVML bridge for bit-exact match;
+#                   test is skipped when module compiled without -mavx512f.
 
 _UNARY_MATH = [
-    ("sqrt",       np.sqrt,       lambda a: np.abs(a),                [(10, 42), (10000, 7), (100000, 7)]),
-    ("abs",        np.abs,        None,                               [(10, 42), (10000, 7), (100000, 7)]),
-    ("exp",        np.exp,        None,                               [(10, 1),  (1000, 7), (10000, 7), (100000, 7)]),
-    ("log",        np.log,        lambda a: np.abs(a) + 0.1,          [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("sin",        np.sin,        None,                               [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("cos",        np.cos,        None,                               [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("tan",        np.tan,        lambda a: a * 0.5,                  [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("log10",      np.log10,      lambda a: np.abs(a) + 0.1,          [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("log2",       np.log2,       lambda a: np.abs(a) + 0.1,          [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("arcsin",     np.arcsin,     lambda a: np.clip(a * 0.5, -1, 1),  [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("arccos",     np.arccos,     lambda a: np.clip(a * 0.5, -1, 1),  [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("arctan",     np.arctan,     None,                               [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("round",      np.round,      lambda a: a * 100,                  [(10, 42), (10000, 7)]),
-    ("floor",      np.floor,      lambda a: a * 100,                  [(10, 42), (10000, 7)]),
-    ("ceil",       np.ceil,       lambda a: a * 100,                  [(10, 42), (10000, 7)]),
-    ("degrees",    np.degrees,    None,                               [(10, 42), (10000, 7)]),
-    ("radians",    np.radians,    None,                               [(10, 42), (10000, 7)]),
-    ("sign",       np.sign,       None,                               [(10, 42), (10000, 7)]),
+    # Pure C++ element-wise — always bit-exact at any size
+    ("sqrt",       np.sqrt,       lambda a: np.abs(a),                [(10, 42), (10000, 7), (100000, 7)], False),
+    ("abs",        np.abs,        None,                               [(10, 42), (10000, 7), (100000, 7)], False),
+    ("round",      np.round,      lambda a: a * 100,                  [(10, 42), (10000, 7), (100000, 7)], False),
+    ("floor",      np.floor,      lambda a: a * 100,                  [(10, 42), (10000, 7), (100000, 7)], False),
+    ("ceil",       np.ceil,       lambda a: a * 100,                  [(10, 42), (10000, 7), (100000, 7)], False),
+    ("degrees",    np.degrees,    None,                               [(10, 42), (10000, 7), (100000, 7)], False),
+    ("radians",    np.radians,    None,                               [(10, 42), (10000, 7), (100000, 7)], False),
+    ("sign",       np.sign,       None,                               [(10, 42), (10000, 7), (100000, 7)], False),
+    # Transcendental — bit-exact only when compiled with -mavx512f (SVML bridge)
+    ("exp",        np.exp,        None,                               [(10, 1), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("log",        np.log,        lambda a: np.abs(a) + 0.1,          [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("sin",        np.sin,        None,                               [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("cos",        np.cos,        None,                               [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("tan",        np.tan,        lambda a: a * 0.5,                  [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("log10",      np.log10,      lambda a: np.abs(a) + 0.1,          [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("log2",       np.log2,       lambda a: np.abs(a) + 0.1,          [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("arcsin",     np.arcsin,     lambda a: np.clip(a * 0.5, -1, 1),  [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("arccos",     np.arccos,     lambda a: np.clip(a * 0.5, -1, 1),  [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("arctan",     np.arctan,     None,                               [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
 ]
 
 
 # Build parametrize tables at module level
 _UNARY_ARGS = []
 _UNARY_IDS = []
-for fn, npf, prep, sizes in _UNARY_MATH:
+for fn, npf, prep, sizes, trans in _UNARY_MATH:
     for size, seed in sizes:
         tag = f"{fn}_{size}"
         if seed != 42: tag += f"_s{seed}"
         _UNARY_IDS.append(tag)
-        _UNARY_ARGS.append(pytest.param(fn, npf, prep, size, seed, id=tag))
+        _UNARY_ARGS.append(pytest.param(fn, npf, prep, size, seed, trans, id=tag))
 
 
-@pytest.mark.parametrize("fn_name, np_fn, prep, size, seed", _UNARY_ARGS)
-def test_unary_math(fn_name, np_fn, prep, size, seed, cpp, dtype):
+@pytest.mark.parametrize("fn_name, np_fn, prep, size, seed, trans", _UNARY_ARGS)
+def test_unary_math(fn_name, np_fn, prep, size, seed, trans, cpp, dtype):
+    if trans and not cpp._has_avx512_svml():
+        pytest.skip("SVML bridge not compiled with AVX-512")
     a = random_array((size,), seed=seed, dtype=dtype)
     inp = prep(a) if prep else a
     cpp_fn = getattr(cpp, fn_name)
@@ -200,6 +222,8 @@ def test_sign_zero(cpp, dtype):
     (2.0, 10000, 7), (3.0, 10000, 7), (0.5, 10000, 7), (-1.0, 10000, 7),
 ])
 def test_power(expval, size, seed, cpp, dtype):
+    if not cpp._has_avx512_svml():
+        pytest.skip("SVML bridge not compiled with AVX-512")
     e = dtype(expval)
     a = np.abs(random_array((size,), seed=seed, dtype=dtype)) + dtype(0.01)
     assert_bit_aligned(cpp.power(a, e), np.power(a, e), f"power({expval})_{size}")
@@ -296,15 +320,21 @@ def test_minimum_large(cpp, dtype):
     assert_bit_aligned(cpp.minimum(a, b), np.minimum(a, b), "minimum large")
 
 def test_arctan2_array(cpp, dtype):
+    if not cpp._has_avx512_svml():
+        pytest.skip("SVML bridge not compiled with AVX-512")
     a = random_array((10,), dtype=dtype)
     b = np.abs(random_array((10,), dtype=dtype)) + dtype(0.1)
     assert_bit_aligned(cpp.arctan2(a, b), np.arctan2(a, b), "arctan2(a,b)")
 
 def test_arctan2_scalar(cpp, dtype):
+    if not cpp._has_avx512_svml():
+        pytest.skip("SVML bridge not compiled with AVX-512")
     a = random_array((10,), dtype=dtype)
     assert_bit_aligned(cpp.arctan2(a, dtype(1.0)), np.arctan2(a, dtype(1.0)), "arctan2(a,1)")
 
 def test_arctan2_large(cpp, dtype):
+    if not cpp._has_avx512_svml():
+        pytest.skip("SVML bridge not compiled with AVX-512")
     a = random_array((10000,), seed=7, dtype=dtype)
     b = np.abs(random_array((10000,), seed=99, dtype=dtype)) + dtype(0.1)
     assert_bit_aligned(cpp.arctan2(a, b), np.arctan2(a, b), "arctan2 large")
