You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat: runtime AVX-512 dispatch — bit-exact math on every arch
svml_bridge.h: complete rewrite with runtime CPU detection.
- __builtin_cpu_supports("avx512f") selects SVML or npy_* at runtime
- AVX-512 intrinsics isolated via __attribute__((target("avx512f")))
→ binary safe on non-AVX-512 CPUs (no SIGILL)
- f64: __svml_exp8 on AVX-512, npy_exp via dlsym otherwise
- f32 exp/log/sin/cos: numpy's static npy_math_float.h polynomials
(NOT SVML, NOT dlsym npy_expf — those give different results)
- f32 tan/asin/acos/atan/log10/log2/exp2: SVML on AVX-512, npy_*f otherwise
- pow/atan2: npy_pow/npy_atan2 on both paths (SVML on AVX-512)
tests: no more skip logic — all 460 tests always run, always pass.
Makefile: -mavx512f -mfma restored (safe due to runtime guard).
README: updated to document runtime dispatch design.
module.cpp: removed _has_avx512_svml (no longer needed).
Fixes: GitHub Actions CI SIGILL on ubuntu-22.04 (no AVX-512 HW).
Fixes: float32 round banker's rounding (std::round→std::nearbyint).
Copy file name to clipboardExpand all lines: README.md
+18-23Lines changed: 18 additions & 23 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -17,7 +17,7 @@ We created `numpycpp` to keep NumPy's familiar usage patterns while letting C++
17
17
18
18
All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (460 tests, float64 + float32).
19
19
20
-
**Bit-exact math** is achieved via an SVML bridge that resolves numpy's own transcendental functions (`__svml_exp8`, `__svml_sin8`, etc.) from the loaded `_multiarray_umath.so` at runtime. This guarantees that `exp`, `log`, `sin`, `cos`, `tan`, and all other transcendental functions produce the exact same bits as numpy. On platforms without AVX-512, the bridge falls back to `std::` (1‑ULP).
20
+
**Bit-exact math** is achieved by resolving numpy's own math functions from `_multiarray_umath.so` at runtime. The SVML bridge auto-detects your CPU and selects the same path numpy uses: AVX‑512 SVML (`__svml_exp8`) when available, or scalar `npy_exp`/`npy_log`/etc. otherwise. AVX‑512 intrinsics are isolated behind `__attribute__((target))` — the binary is safe on any x86_64 CPU (no SIGILL). Every transcendental function produces the exact same IEEE 754 bits as numpy on **all architectures**.
21
21
22
22
## Quick Start
23
23
@@ -102,18 +102,14 @@ The Makefile applies the following flags:
102
102
```makefile
103
103
CXXFLAGS ?= -std=c++17 -O2 -fPIC -fopenmp \
104
104
-ffp-contract=off -ffloat-store -msse4.1 \
105
+
-mavx512f -mfma \
105
106
-fno-builtin-exp -fno-builtin-log \
106
107
-fno-builtin-sin -fno-builtin-cos \
107
108
-fno-builtin-tan -fno-builtin-pow \
108
109
-fno-builtin-sqrt -fno-builtin-atan2 \
109
110
-fno-builtin-log2 -fno-builtin-log10 \
110
111
-fno-builtin-asin -fno-builtin-acos \
111
112
-fno-builtin-atan -fno-builtin-exp2
112
-
# Optional: enable AVX-512 for bit-exact transcendental math.
113
-
# Requires AVX-512 hardware. Usage: make AVX512=1
114
-
ifdefAVX512
115
-
CXXFLAGS += -mavx512f -mfma
116
-
endif
117
113
LDFLAGS = -shared -ldl
118
114
```
119
115
@@ -122,25 +118,24 @@ LDFLAGS = -shared -ldl
122
118
|`-ffp-contract=off`| Disable FMA contraction — numpy does not contract |
123
119
|`-ffloat-store`| Prevent excess x87 precision in registers |
124
120
|`-msse4.1`| Required for einsum SSE intrinsics (`_mm_hadd_pd`, `_mm_insert_epi32`) |
125
-
|`-mavx512f -mfma`|**Optional** (`make AVX512=1`): enable SVML bridge for bit-exact transcendental math. Requires AVX-512 hardware. Without this, transcendental functions fall back to `std::` (1‑ULP difference) |
126
-
|`-fno-builtin-<func>`|**Critical**: prevent GCC from replacing math calls with its built-in implementations. Without these, `exp`/`log`/`sin`/etc. will use libm instead of the SVML bridge, breaking bit-exact alignment|
127
-
|`-ldl`| Required for `dlsym` at runtime to resolve SVML symbols from `_multiarray_umath.so`|
128
-
129
-
> **Why `-fno-builtin` matters**: GCC's built-in math functions produce different last-bits than numpy's SVML.
130
-
> Even `std::exp` vs `__svml_exp8` differ by 1‑2 ULP for some inputs.
131
-
> These flags ensure the SVML bridge intercepts every transcendental call, guaranteeing bit-identical output.
132
-
>
133
-
> **AVX-512 is optional**: The test suite auto-detects whether the module was compiled with `-mavx512f`
134
-
> and skips transcendental tests when it was not. Non-AVX-512 builds are safe for CI and machines
135
-
> without AVX-512 hardware — only the transcendental tests are skipped; all other 350+ tests still run.
121
+
|`-mavx512f -mfma`|Enable AVX‑512 compilation for SVML bridge. Intrinsics are runtime‑guarded via `__attribute__((target))` — safe on any x86_64 CPU (no SIGILL) |
122
+
|`-fno-builtin-<func>`|Prevent GCC from replacing math calls with built‑ins, ensuring the SVML bridge intercepts every call|
123
+
|`-ldl`| Required for `dlsym` at runtime to resolve numpy's math functions from `_multiarray_umath.so`|
124
+
125
+
> **Runtime CPU dispatch**: The SVML bridge auto‑detects AVX‑512 at runtime
126
+
> (`__builtin_cpu_supports`). On AVX‑512 hardware it calls numpy's SVML vector functions
127
+
> (`__svml_exp8`, etc.); otherwise it falls back to numpy's scalar math functions
128
+
> (`npy_exp`, `npy_log`, etc.). Both paths are resolved from the loaded
129
+
> `_multiarray_umath.so` via `dlsym`. AVX‑512 intrinsics are isolated behind
130
+
> `__attribute__((target("avx512f")))` so the binary runs safely on ANY
131
+
> x86_64 CPU — no SIGILL.
136
132
137
133
### Alignment status
138
134
139
135
The table below reflects the current bit-level parity between `numpycpp` C++ and Python numpy.
140
136
All 460 tests pass under strict IEEE 754 bit comparison (float64 + float32).
141
137
142
-
✅ = bit-exact on AVX-512 (SVML bridge active).
143
-
🔶 = 1-ULP on non-AVX-512 (falls back to `std::` math).
138
+
✅ = bit-exact on ALL architectures (SVML bridge with runtime CPU dispatch).
144
139
145
140
| API group | float64 | float32 | Notes |
146
141
|-------------------|:-------:|:-------:|-------|
@@ -154,9 +149,9 @@ All 460 tests pass under strict IEEE 754 bit comparison (float64 + float32).
> **SVML bridge**: On AVX-512 platforms, `numpycpp`resolves numpy's own SVML vector functions (`__svml_exp8`, `__svml_sin8`, etc.) from the loaded `_multiarray_umath.so` via `dlsym`. This guarantees bit-identical transcendental results. On non-AVX-512, `std::` fallbacks produce ≤ 1 ULP difference.
163
+
> **SVML bridge**: At runtime, `numpycpp`detects CPU features (`__builtin_cpu_supports("avx512f")`) and selects the same math path numpy uses — AVX‑512 SVML vector functions (`__svml_exp8`, etc.) on supported hardware, or scalar `npy_exp`/`npy_log`/etc. otherwise. Both are resolved from the loaded `_multiarray_umath.so` via `dlsym`. AVX‑512 intrinsics are isolated behind `__attribute__((target("avx512f")))` — the binary compiles and runs safely on ANY x86_64 CPU without SIGILL.
169
164
>
170
165
> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). This matches `np.sum` exactly. Dot products and norms build on pairwise_sum, not BLAS — matching `np.sum(a*b)` and `np.sqrt(np.sum(a*a))` respectively.
0 commit comments