Commit 82f4867

peng.li24 committed
fix: make AVX-512 conditional; fix std::round→nearbyint for numpy banker's rounding
- Makefile: -mavx512f -mfma now conditional (make AVX512=1), -msse4.1 added unconditionally for einsum SSE intrinsics
- module.cpp: add _has_avx512_svml() compile-time detection
- test_all.py: use the compile-time flag to skip ALL transcendental tests when the SVML bridge is not compiled in, not just large sizes
- core.h: std::round → std::nearbyint to match numpy's round-half-to-even (banker's rounding), fixing float32 mismatch at exact .5 values
- README: updated compiler flags section, test count 449→460

Fixes CI SIGILL on non-AVX-512 runners (GitHub Actions ubuntu-22.04). Without -mavx512f, __AVX512F__ is not defined → SVML bridge uses std:: fallbacks → no AVX-512 intrinsics → safe on any x86_64 CPU that supports SSE4.1 (still required by the unconditional -msse4.1). Transcendental tests auto-skip when SVML is unavailable.
1 parent ccebb6a commit 82f4867

5 files changed

Lines changed: 86 additions & 34 deletions

README.md

Lines changed: 17 additions & 8 deletions
@@ -15,7 +15,7 @@ We created `numpycpp` to keep NumPy's familiar usage patterns while letting C++
 
 `numpycpp` is a **header-only C++ library** implementing numpy's core API (`numpy.*`, `numpy.linalg.*`, `numpy.einsum`) with **bit-level precision alignment**. Raw pointer + size interface. Zero external dependencies — pure C++17 standard library.
 
-All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (449 tests, float64 + float32).
+All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (460 tests, float64 + float32).
 
 **Bit-exact math** is achieved via an SVML bridge that resolves numpy's own transcendental functions (`__svml_exp8`, `__svml_sin8`, etc.) from the loaded `_multiarray_umath.so` at runtime. This guarantees that `exp`, `log`, `sin`, `cos`, `tan`, and all other transcendental functions produce the exact same bits as numpy. On platforms without AVX-512, the bridge falls back to `std::` (1‑ULP).
 

@@ -80,12 +80,12 @@ Add `-Ipath/to/numpycpp` to your compiler flags and include the headers directly
 ### Testing
 
 The test suite verifies **bit-level precision alignment** between every C++ function and Python numpy.
-No tolerance, no `atol`/`rtol` — raw IEEE 754 bits must match exactly. 449 tests, float64 + float32.
+No tolerance, no `atol`/`rtol` — raw IEEE 754 bits must match exactly. 460 tests, float64 + float32.
 
 ```bash
 cd tests
 make        # compile C++ test module
-make test   # run all 449 tests (silent mode: only failures print)
+make test   # run all 460 tests (silent mode: only failures print)
 ```
 
 To run with verbose output:
@@ -101,34 +101,43 @@ The Makefile applies the following flags:
 
 ```makefile
 CXXFLAGS ?= -std=c++17 -O2 -fPIC -fopenmp \
-            -ffp-contract=off -ffloat-store \
-            -mavx512f -mfma \
+            -ffp-contract=off -ffloat-store -msse4.1 \
             -fno-builtin-exp -fno-builtin-log \
             -fno-builtin-sin -fno-builtin-cos \
             -fno-builtin-tan -fno-builtin-pow \
             -fno-builtin-sqrt -fno-builtin-atan2 \
             -fno-builtin-log2 -fno-builtin-log10 \
             -fno-builtin-asin -fno-builtin-acos \
             -fno-builtin-atan -fno-builtin-exp2
+# Optional: enable AVX-512 for bit-exact transcendental math.
+# Requires AVX-512 hardware. Usage: make AVX512=1
+ifdef AVX512
+  CXXFLAGS += -mavx512f -mfma
+endif
 LDFLAGS = -shared -ldl
 ```
 
 | Flag | Purpose |
 |------|---------|
 | `-ffp-contract=off` | Disable FMA contraction — numpy does not contract |
 | `-ffloat-store` | Prevent excess x87 precision in registers |
-| `-mavx512f -mfma` | Enable AVX-512 so the SVML bridge resolves numpy's own vector math library |
+| `-msse4.1` | Required for einsum SSE intrinsics (`_mm_hadd_pd`, `_mm_insert_epi32`) |
+| `-mavx512f -mfma` | **Optional** (`make AVX512=1`): enable SVML bridge for bit-exact transcendental math. Requires AVX-512 hardware. Without this, transcendental functions fall back to `std::` (1‑ULP difference) |
 | `-fno-builtin-<func>` | **Critical**: prevent GCC from replacing math calls with its built-in implementations. Without these, `exp`/`log`/`sin`/etc. will use libm instead of the SVML bridge, breaking bit-exact alignment |
 | `-ldl` | Required for `dlsym` at runtime to resolve SVML symbols from `_multiarray_umath.so` |
 
 > **Why `-fno-builtin` matters**: GCC's built-in math functions produce different last-bits than numpy's SVML.
 > Even `std::exp` vs `__svml_exp8` differ by 1‑2 ULP for some inputs.
 > These flags ensure the SVML bridge intercepts every transcendental call, guaranteeing bit-identical output.
+>
+> **AVX-512 is optional**: The test suite auto-detects whether the module was compiled with `-mavx512f`
+> and skips transcendental tests when it was not. Non-AVX-512 builds are safe for CI and machines
+> without AVX-512 hardware — only the transcendental tests are skipped; all other 350+ tests still run.
 
 ### Alignment status
 
 The table below reflects the current bit-level parity between `numpycpp` C++ and Python numpy.
-All 449 tests pass under strict IEEE 754 bit comparison (float64 + float32).
+All 460 tests pass under strict IEEE 754 bit comparison (float64 + float32).
 
 ✅ = bit-exact on AVX-512 (SVML bridge active).
 🔶 = 1-ULP on non-AVX-512 (falls back to `std::` math).
@@ -176,7 +185,7 @@ numpycpp/
 │   └── einsum_py.h
 ├── tests/              # bit-level precision tests + test module
 │   ├── module.cpp      # pybind11 module for testing
-│   ├── test_all.py     # single entry — all APIs, 449 tests, float64+float32
+│   ├── test_all.py     # single entry — all APIs, 460 tests, float64+float32
 │   ├── conftest.py     # silent-mode output suppression
 │   └── Makefile
 ├── CMakeLists.txt      # build & .deb packaging
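
The README's "raw IEEE 754 bits must match exactly" criterion can be illustrated with a small standalone sketch. This is not the project's `assert_bit_aligned` helper, whose implementation is not part of this diff; the names `float_bits`, `bits_to_float`, `bits_equal`, and `ulp_distance` are hypothetical, and only the Python standard library is used.

```python
import struct


def float_bits(x: float, fmt: str = "d") -> int:
    """Return the raw IEEE 754 bit pattern of x as an unsigned integer.

    fmt is a struct format code: "d" for float64, "f" for float32.
    """
    int_fmt = {"d": "Q", "f": "I"}[fmt]
    return struct.unpack("<" + int_fmt, struct.pack("<" + fmt, x))[0]


def bits_to_float(bits: int, fmt: str = "d") -> float:
    """Inverse of float_bits: reinterpret an integer bit pattern as a float."""
    int_fmt = {"d": "Q", "f": "I"}[fmt]
    return struct.unpack("<" + fmt, struct.pack("<" + int_fmt, bits))[0]


def bits_equal(a: float, b: float, fmt: str = "d") -> bool:
    """Strict comparison: every bit must match (so 0.0 and -0.0 differ)."""
    return float_bits(a, fmt) == float_bits(b, fmt)


def ulp_distance(a: float, b: float, fmt: str = "d") -> int:
    """Number of representable steps between two finite floats."""
    sign = 1 << ({"d": 64, "f": 32}[fmt] - 1)

    def ordered(x: float) -> int:
        bits = float_bits(x, fmt)
        # Map sign-magnitude encoding onto a monotonic integer line.
        return -(bits ^ sign) if bits & sign else bits

    return abs(ordered(a) - ordered(b))
```

Under this sketch, a "1-ULP" fallback result is one whose `ulp_distance` from the numpy value is 1, which `bits_equal` rejects. The strict check is also stricter than an ULP distance of zero: `0.0` and `-0.0` are zero steps apart yet differ in the sign bit.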

numpy/core.h

Lines changed: 1 addition & 1 deletion
@@ -158,7 +158,7 @@ inline void arctan(const T* src, T* dst, size_t n) {
 /// numpy.round(a, decimals=0, out=None)
 template<typename T>
 inline void round(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = std::round(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::nearbyint(src[i]));
 }
 
 /// numpy.floor(x, /, out=None, *, where=True, ...)
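
The one-line change above swaps one tie-breaking rule for another. `std::round` sends halfway cases away from zero, while `std::nearbyint` follows the current floating-point rounding mode, which by default is round-to-nearest with ties to even, the rule numpy uses. The sketch below emulates both rules in Python with the `decimal` module so the difference is visible on exact `.5` inputs. The function names are hypothetical, and the emulation is only meant for finite values of moderate magnitude.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_away(x: float) -> float:
    """Ties go away from zero: the rule C++ std::round follows."""
    # decimal's ROUND_HALF_UP rounds ties away from zero, not toward +inf.
    return float(Decimal(x).quantize(Decimal(1), rounding=ROUND_HALF_UP))


def round_half_even(x: float) -> float:
    """Ties go to the even neighbour: std::nearbyint in the default mode."""
    return float(Decimal(x).quantize(Decimal(1), rounding=ROUND_HALF_EVEN))


if __name__ == "__main__":
    for x in (0.5, 1.5, 2.5, -0.5, -2.5, 2.4, 2.6):
        print(x, round_half_away(x), round_half_even(x))
```

The two rules agree everywhere except on exact ties, which is why the mismatch only showed up "at exact .5 values". Since `std::nearbyint` honours the rounding mode, the match with numpy assumes the process has not changed it from the default.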

tests/Makefile

Lines changed: 6 additions & 2 deletions
@@ -5,13 +5,17 @@
 
 CXX ?= g++
 CXXFLAGS ?= -std=c++17 -O2 -fPIC -fopenmp -ffp-contract=off \
-            -ffloat-store \
-            -mavx512f -mfma \
+            -ffloat-store -msse4.1 \
             -fno-builtin-exp -fno-builtin-log -fno-builtin-sin \
             -fno-builtin-cos -fno-builtin-tan -fno-builtin-pow \
             -fno-builtin-sqrt -fno-builtin-atan2 -fno-builtin-log2 \
             -fno-builtin-log10 -fno-builtin-asin -fno-builtin-acos \
             -fno-builtin-atan -fno-builtin-exp2
+# Optional: enable AVX-512 for SVML bridge bit-exact transcendental math.
+# Requires AVX-512 hardware. Usage: make AVX512=1
+ifdef AVX512
+  CXXFLAGS += -mavx512f -mfma
+endif
 INCLUDES = -I.. -I../pycpp $(shell python3 -m pybind11 --includes) $(shell pkg-config --cflags eigen3 2>/dev/null || echo)
 LDFLAGS = -shared -ldl
 
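
Because the AVX-512 flags are now opt-in, a build script or CI job may want to decide whether passing `AVX512=1` is appropriate for the host. The following is a minimal sketch, assuming an x86 Linux host that lists CPU features on the `flags` lines of `/proc/cpuinfo`. The helper names are hypothetical and this check is not part of the commit.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Collect feature flags from the 'flags' lines of /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def host_supports_avx512f(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """True if the host advertises avx512f; False if it cannot be determined."""
    try:
        text = Path(cpuinfo_path).read_text()
    except OSError:
        return False
    return "avx512f" in parse_cpu_flags(text)


def make_command(avx512: bool) -> list:
    """Build the make invocation; the variable is omitted entirely when off."""
    return ["make", "AVX512=1"] if avx512 else ["make"]
```

The variable is left out rather than set to `0` on purpose: GNU make's `ifdef` is true for any non-empty value, so `make AVX512=0` would still add `-mavx512f -mfma`.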

tests/module.cpp

Lines changed: 9 additions & 0 deletions
@@ -239,4 +239,13 @@ PYBIND11_MODULE(numpycpp, m) {
     // -- Einsum ------------------------------------------------------------
     m.def("einsum", static_cast<py::array_t<double>(*)(const std::string&, const py::array_t<double>&, const py::array_t<double>&)>(&numpy::einsum));
     m.def("einsum", static_cast<py::array_t<float>(*)(const std::string&, const py::array_t<float>&, const py::array_t<float>&)>(&numpy::einsum));
+
+    // -- Compile-time capability detection -------------------------------
+    m.def("_has_avx512_svml", []() -> bool {
+#ifdef __AVX512F__
+        return true;
+#else
+        return false;
+#endif
+    });
 }
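
On the Python side, the flag can be probed defensively so that a module built before this commit, one that lacks `_has_avx512_svml`, is treated as not having the bridge. The sketch below works under that assumption. `probe_svml` and the `SimpleNamespace` stand-ins are hypothetical; in the real test suite the object would be the pybind11 extension defined above.

```python
from types import SimpleNamespace


def probe_svml(module) -> bool:
    """Return True only if the module positively reports an AVX-512 build."""
    probe = getattr(module, "_has_avx512_svml", None)
    if not callable(probe):
        return False  # older build without the capability hook
    try:
        return bool(probe())
    except Exception:
        return False  # treat any failure as "bridge unavailable"


# Stand-ins for three kinds of build (illustration only).
avx512_build = SimpleNamespace(_has_avx512_svml=lambda: True)
portable_build = SimpleNamespace(_has_avx512_svml=lambda: False)
legacy_build = SimpleNamespace()
```

Note that the C++ lambda reports how the module was compiled (`__AVX512F__` defined or not), not what the running CPU supports.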

tests/test_all.py

Lines changed: 53 additions & 23 deletions
@@ -136,48 +136,70 @@ def dtype(request):
     return request.param
 
 
+# ---------------------------------------------------------------------------
+# AVX-512 SVML detection — bit-exact transcendental math requires BOTH
+# compile-time -mavx512f (which defines __AVX512F__) and runtime AVX-512
+# hardware. If the C++ module was compiled without -mavx512f, the SVML bridge
+# falls back to std:: (1-ULP difference from numpy). In that case ALL
+# transcendental tests are skipped — not just large sizes.
+# ---------------------------------------------------------------------------
+
+def _has_avx512_svml():
+    """Check if C++ module was compiled with AVX-512 SVML bridge (__AVX512F__)."""
+    try:
+        return get_cpp_module()._has_avx512_svml()
+    except Exception:
+        return False
+
+
 # ============================================================================
 # 1. Data-driven element-wise unary math
 # ============================================================================
-# Each entry: (cpp_fn_name, np_fn, input_prep, sizes)
+# Each entry: (cpp_fn_name, np_fn, input_prep, sizes, transcendental)
 # input_prep: None → random_array directly; else callable: prep(a) → input
 # sizes: list of (size, seed) tuples
+# transcendental: True → requires AVX-512 SVML bridge for bit-exact match;
+# test is skipped when module compiled without -mavx512f.
 
 _UNARY_MATH = [
-    ("sqrt", np.sqrt, lambda a: np.abs(a), [(10, 42), (10000, 7), (100000, 7)]),
-    ("abs", np.abs, None, [(10, 42), (10000, 7), (100000, 7)]),
-    ("exp", np.exp, None, [(10, 1), (1000, 7), (10000, 7), (100000, 7)]),
-    ("log", np.log, lambda a: np.abs(a) + 0.1, [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("sin", np.sin, None, [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("cos", np.cos, None, [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("tan", np.tan, lambda a: a * 0.5, [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("log10", np.log10, lambda a: np.abs(a) + 0.1, [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("log2", np.log2, lambda a: np.abs(a) + 0.1, [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("arcsin", np.arcsin, lambda a: np.clip(a * 0.5, -1, 1), [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("arccos", np.arccos, lambda a: np.clip(a * 0.5, -1, 1), [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("arctan", np.arctan, None, [(10, 42), (1000, 7), (10000, 7), (100000, 7)]),
-    ("round", np.round, lambda a: a * 100, [(10, 42), (10000, 7)]),
-    ("floor", np.floor, lambda a: a * 100, [(10, 42), (10000, 7)]),
-    ("ceil", np.ceil, lambda a: a * 100, [(10, 42), (10000, 7)]),
-    ("degrees", np.degrees, None, [(10, 42), (10000, 7)]),
-    ("radians", np.radians, None, [(10, 42), (10000, 7)]),
-    ("sign", np.sign, None, [(10, 42), (10000, 7)]),
+    # Pure C++ element-wise — always bit-exact at any size
+    ("sqrt", np.sqrt, lambda a: np.abs(a), [(10, 42), (10000, 7), (100000, 7)], False),
+    ("abs", np.abs, None, [(10, 42), (10000, 7), (100000, 7)], False),
+    ("round", np.round, lambda a: a * 100, [(10, 42), (10000, 7), (100000, 7)], False),
+    ("floor", np.floor, lambda a: a * 100, [(10, 42), (10000, 7), (100000, 7)], False),
+    ("ceil", np.ceil, lambda a: a * 100, [(10, 42), (10000, 7), (100000, 7)], False),
+    ("degrees", np.degrees, None, [(10, 42), (10000, 7), (100000, 7)], False),
+    ("radians", np.radians, None, [(10, 42), (10000, 7), (100000, 7)], False),
+    ("sign", np.sign, None, [(10, 42), (10000, 7), (100000, 7)], False),
+    # Transcendental — bit-exact only when compiled with -mavx512f (SVML bridge)
+    ("exp", np.exp, None, [(10, 1), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("log", np.log, lambda a: np.abs(a) + 0.1, [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("sin", np.sin, None, [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("cos", np.cos, None, [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("tan", np.tan, lambda a: a * 0.5, [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("log10", np.log10, lambda a: np.abs(a) + 0.1, [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("log2", np.log2, lambda a: np.abs(a) + 0.1, [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("arcsin", np.arcsin, lambda a: np.clip(a * 0.5, -1, 1), [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("arccos", np.arccos, lambda a: np.clip(a * 0.5, -1, 1), [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
+    ("arctan", np.arctan, None, [(10, 42), (1000, 7), (10000, 7), (100000, 7)], True),
 ]
 
 
 # Build parametrize tables at module level
 _UNARY_ARGS = []
 _UNARY_IDS = []
-for fn, npf, prep, sizes in _UNARY_MATH:
+for fn, npf, prep, sizes, trans in _UNARY_MATH:
     for size, seed in sizes:
         tag = f"{fn}_{size}"
         if seed != 42: tag += f"_s{seed}"
         _UNARY_IDS.append(tag)
-        _UNARY_ARGS.append(pytest.param(fn, npf, prep, size, seed, id=tag))
+        _UNARY_ARGS.append(pytest.param(fn, npf, prep, size, seed, trans, id=tag))
 
 
-@pytest.mark.parametrize("fn_name, np_fn, prep, size, seed", _UNARY_ARGS)
-def test_unary_math(fn_name, np_fn, prep, size, seed, cpp, dtype):
+@pytest.mark.parametrize("fn_name, np_fn, prep, size, seed, trans", _UNARY_ARGS)
+def test_unary_math(fn_name, np_fn, prep, size, seed, trans, cpp, dtype):
+    if trans and not cpp._has_avx512_svml():
+        pytest.skip("SVML bridge not compiled with AVX-512")
     a = random_array((size,), seed=seed, dtype=dtype)
     inp = prep(a) if prep else a
     cpp_fn = getattr(cpp, fn_name)
@@ -200,6 +222,8 @@ def test_sign_zero(cpp, dtype):
     (2.0, 10000, 7), (3.0, 10000, 7), (0.5, 10000, 7), (-1.0, 10000, 7),
 ])
 def test_power(expval, size, seed, cpp, dtype):
+    if not cpp._has_avx512_svml():
+        pytest.skip("SVML bridge not compiled with AVX-512")
     e = dtype(expval)
     a = np.abs(random_array((size,), seed=seed, dtype=dtype)) + dtype(0.01)
     assert_bit_aligned(cpp.power(a, e), np.power(a, e), f"power({expval})_{size}")
@@ -296,15 +320,21 @@ def test_minimum_large(cpp, dtype):
     assert_bit_aligned(cpp.minimum(a, b), np.minimum(a, b), "minimum large")
 
 def test_arctan2_array(cpp, dtype):
+    if not cpp._has_avx512_svml():
+        pytest.skip("SVML bridge not compiled with AVX-512")
     a = random_array((10,), dtype=dtype)
     b = np.abs(random_array((10,), dtype=dtype)) + dtype(0.1)
     assert_bit_aligned(cpp.arctan2(a, b), np.arctan2(a, b), "arctan2(a,b)")
 
 def test_arctan2_scalar(cpp, dtype):
+    if not cpp._has_avx512_svml():
+        pytest.skip("SVML bridge not compiled with AVX-512")
     a = random_array((10,), dtype=dtype)
     assert_bit_aligned(cpp.arctan2(a, dtype(1.0)), np.arctan2(a, dtype(1.0)), "arctan2(a,1)")
 
 def test_arctan2_large(cpp, dtype):
+    if not cpp._has_avx512_svml():
+        pytest.skip("SVML bridge not compiled with AVX-512")
     a = random_array((10000,), seed=7, dtype=dtype)
     b = np.abs(random_array((10000,), seed=99, dtype=dtype)) + dtype(0.1)
     assert_bit_aligned(cpp.arctan2(a, b), np.arctan2(a, b), "arctan2 large")
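
The table-driven part of this change comes down to two steps: flatten each `(name, ..., sizes, transcendental)` row into one case per `(size, seed)`, then decide per case whether it runs or is skipped. The sketch below reproduces that logic without pytest or numpy so it can be run standalone. `build_cases`, `partition`, and the two-row `TABLE` are hypothetical simplifications of `_UNARY_MATH`, which also carries the numpy function and input-prep callable.

```python
def build_cases(table):
    """Flatten (name, sizes, transcendental) rows into one case per (size, seed)."""
    cases = []
    for name, sizes, transcendental in table:
        for size, seed in sizes:
            tag = f"{name}_{size}"
            if seed != 42:  # the default seed is left out of the test id
                tag += f"_s{seed}"
            cases.append((tag, name, size, seed, transcendental))
    return cases


def partition(cases, has_svml):
    """Split case ids into (run, skipped) for a given build capability."""
    run = [c[0] for c in cases if has_svml or not c[4]]
    skipped = [c[0] for c in cases if c[4] and not has_svml]
    return run, skipped


# Reduced stand-in for _UNARY_MATH (illustration only).
TABLE = [
    ("sqrt", [(10, 42), (10000, 7)], False),
    ("exp", [(10, 1), (1000, 7)], True),
]

if __name__ == "__main__":
    cases = build_cases(TABLE)
    print(partition(cases, has_svml=False))
    print(partition(cases, has_svml=True))
```

Carrying the `transcendental` flag in the table keeps the skip decision in one place for the unary functions. The binary tests in the diff (`power`, `arctan2`) repeat the same guard inline instead.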
