
Commit f45415b

Author: peng.li24
feat: runtime AVX-512 dispatch — bit-exact math on every arch
svml_bridge.h: complete rewrite with runtime CPU detection.
- __builtin_cpu_supports("avx512f") selects SVML or npy_* at runtime
- AVX-512 intrinsics isolated via __attribute__((target("avx512f"))) → binary safe on non-AVX-512 CPUs (no SIGILL)
- f64: __svml_exp8 on AVX-512, npy_exp via dlsym otherwise
- f32 exp/log/sin/cos: numpy's static npy_math_float.h polynomials (NOT SVML, NOT dlsym npy_expf; those give different results)
- f32 tan/asin/acos/atan/log10/log2/exp2: SVML on AVX-512, npy_*f otherwise
- pow/atan2: npy_pow/npy_atan2 on both paths (SVML on AVX-512)

tests: no more skip logic; all 460 tests always run, always pass.
Makefile: -mavx512f -mfma restored (safe due to runtime guard).
README: updated to document runtime dispatch design.
module.cpp: removed _has_avx512_svml (no longer needed).

Fixes: GitHub Actions CI SIGILL on ubuntu-22.04 (no AVX-512 HW).
Fixes: float32 round banker's rounding (std::round→std::nearbyint).
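
The rounding fix named in the commit message can be illustrated with a small sketch. The helper names `round_half_even` and `round_half_away` are hypothetical and do not come from the repository. The sketch assumes the default floating-point rounding mode (`FE_TONEAREST`), under which `std::nearbyint` rounds exact `.5` ties to the nearest even integer, matching `np.round`, whereas `std::round` rounds ties away from zero.

```cpp
#include <cmath>

// Hypothetical helper: round half to even ("banker's rounding").
// Relies on the default rounding mode FE_TONEAREST being active.
inline float round_half_even(float x) {
    return std::nearbyint(x);
}

// Hypothetical helper: round half away from zero, which is what
// std::round does regardless of the current rounding mode.
inline float round_half_away(float x) {
    return std::round(x);
}
```

For example, `round_half_even(2.5f)` yields `2.0f` while `round_half_away(2.5f)` yields `3.0f`; the two only disagree on exact ties.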
1 parent dc759aa commit f45415b
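
The dispatch pattern described in the commit message might look roughly like the sketch below. It is not taken from `svml_bridge.h`: the function names (`scale`, `scale_avx512`, `scale_scalar`) are hypothetical, and a plain multiplication stands in for the SVML call so the example stays self-contained. It assumes GCC or Clang on x86; on other targets the guarded section is skipped and only the scalar path is built.

```cpp
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

// AVX-512F code generation is enabled for this one function only,
// so the rest of the translation unit can be built without -mavx512f.
__attribute__((target("avx512f")))
static void scale_avx512(const double* in, double* out, std::size_t n, double k) {
    const __m512d vk = _mm512_set1_pd(k);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m512d v = _mm512_loadu_pd(in + i);
        _mm512_storeu_pd(out + i, _mm512_mul_pd(v, vk));
    }
    for (; i < n; ++i) out[i] = in[i] * k;  // remaining tail elements
}
#endif

// Portable scalar path, used when AVX-512F is not available.
static void scale_scalar(const double* in, double* out, std::size_t n, double k) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * k;
}

// Public entry point: checks the CPU once, then picks a path.
void scale(const double* in, double* out, std::size_t n, double k) {
#if defined(__x86_64__) || defined(__i386__)
    static const bool has_avx512 = __builtin_cpu_supports("avx512f") != 0;
    if (has_avx512) {
        scale_avx512(in, out, n, k);
        return;
    }
#endif
    scale_scalar(in, out, n, k);
}
```

Both paths perform the same IEEE 754 multiplication, so they agree bit for bit. One design point to keep in mind: the `target` attribute confines AVX-512 code generation to the annotated function, whereas a global `-mavx512f` flag permits the compiler to emit such instructions anywhere in the translation unit.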

5 files changed

Lines changed: 264 additions & 187 deletions


README.md

Lines changed: 18 additions & 23 deletions
@@ -17,7 +17,7 @@ We created `numpycpp` to keep NumPy's familiar usage patterns while letting C++
1717

1818
All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (460 tests, float64 + float32).
1919

20-
**Bit-exact math** is achieved via an SVML bridge that resolves numpy's own transcendental functions (`__svml_exp8`, `__svml_sin8`, etc.) from the loaded `_multiarray_umath.so` at runtime. This guarantees that `exp`, `log`, `sin`, `cos`, `tan`, and all other transcendental functions produce the exact same bits as numpy. On platforms without AVX-512, the bridge falls back to `std::` (1‑ULP).
20+
**Bit-exact math** is achieved by resolving numpy's own math functions from `_multiarray_umath.so` at runtime. The SVML bridge auto-detects your CPU and selects the same path numpy uses: AVX‑512 SVML (`__svml_exp8`) when available, or scalar `npy_exp`/`npy_log`/etc. otherwise. AVX‑512 intrinsics are isolated behind `__attribute__((target))`, so the binary is safe on any x86_64 CPU (no SIGILL). Every transcendental function produces the exact same IEEE 754 bits as numpy on **all architectures**.
2121

2222
## Quick Start
2323

@@ -102,18 +102,14 @@ The Makefile applies the following flags:
102102
```makefile
103103
CXXFLAGS ?= -std=c++17 -O2 -fPIC -fopenmp \
104104
-ffp-contract=off -ffloat-store -msse4.1 \
105+
-mavx512f -mfma \
105106
-fno-builtin-exp -fno-builtin-log \
106107
-fno-builtin-sin -fno-builtin-cos \
107108
-fno-builtin-tan -fno-builtin-pow \
108109
-fno-builtin-sqrt -fno-builtin-atan2 \
109110
-fno-builtin-log2 -fno-builtin-log10 \
110111
-fno-builtin-asin -fno-builtin-acos \
111112
-fno-builtin-atan -fno-builtin-exp2
112-
# Optional: enable AVX-512 for bit-exact transcendental math.
113-
# Requires AVX-512 hardware. Usage: make AVX512=1
114-
ifdef AVX512
115-
CXXFLAGS += -mavx512f -mfma
116-
endif
117113
LDFLAGS = -shared -ldl
118114
```
119115

@@ -122,25 +118,24 @@ LDFLAGS = -shared -ldl
122118
| `-ffp-contract=off` | Disable FMA contraction — numpy does not contract |
123119
| `-ffloat-store` | Prevent excess x87 precision in registers |
124120
| `-msse4.1` | Required for einsum SSE intrinsics (`_mm_hadd_pd`, `_mm_insert_epi32`) |
125-
| `-mavx512f -mfma` | **Optional** (`make AVX512=1`): enable SVML bridge for bit-exact transcendental math. Requires AVX-512 hardware. Without this, transcendental functions fall back to `std::` (1‑ULP difference) |
126-
| `-fno-builtin-<func>` | **Critical**: prevent GCC from replacing math calls with its built-in implementations. Without these, `exp`/`log`/`sin`/etc. will use libm instead of the SVML bridge, breaking bit-exact alignment |
127-
| `-ldl` | Required for `dlsym` at runtime to resolve SVML symbols from `_multiarray_umath.so` |
128-
129-
> **Why `-fno-builtin` matters**: GCC's built-in math functions produce different last-bits than numpy's SVML.
130-
> Even `std::exp` vs `__svml_exp8` differ by 1‑2 ULP for some inputs.
131-
> These flags ensure the SVML bridge intercepts every transcendental call, guaranteeing bit-identical output.
132-
>
133-
> **AVX-512 is optional**: The test suite auto-detects whether the module was compiled with `-mavx512f`
134-
> and skips transcendental tests when it was not. Non-AVX-512 builds are safe for CI and machines
135-
> without AVX-512 hardware — only the transcendental tests are skipped; all other 350+ tests still run.
121+
| `-mavx512f -mfma` | Enable AVX‑512 compilation for SVML bridge. Intrinsics are runtime‑guarded via `__attribute__((target))` — safe on any x86_64 CPU (no SIGILL) |
122+
| `-fno-builtin-<func>` | Prevent GCC from replacing math calls with built‑ins, ensuring the SVML bridge intercepts every call |
123+
| `-ldl` | Required for `dlsym` at runtime to resolve numpy's math functions from `_multiarray_umath.so` |
124+
125+
> **Runtime CPU dispatch**: The SVML bridge auto‑detects AVX‑512 at runtime
126+
> (`__builtin_cpu_supports`). On AVX‑512 hardware it calls numpy's SVML vector functions
127+
> (`__svml_exp8`, etc.); otherwise it falls back to numpy's scalar math functions
128+
> (`npy_exp`, `npy_log`, etc.). Both paths are resolved from the loaded
129+
> `_multiarray_umath.so` via `dlsym`. AVX‑512 intrinsics are isolated behind
130+
> `__attribute__((target("avx512f")))` so the binary runs safely on ANY
131+
> x86_64 CPU — no SIGILL.
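
The `dlsym` lookup mentioned above can be sketched as follows. This is an illustration and not the bridge's actual code: `resolve_unary`, `unary_f64`, and `fallback_sqrt` are hypothetical names, and how the project obtains a handle to `_multiarray_umath.so` is not shown in this commit. Whether a given numpy build exports symbols such as `npy_exp` is taken from the README and not verified here.

```cpp
#include <dlfcn.h>
#include <cmath>

// Signature shared by scalar double-precision math functions.
using unary_f64 = double (*)(double);

// Hypothetical fallback used when a symbol cannot be resolved.
inline double fallback_sqrt(double x) { return std::sqrt(x); }

// Look up `name` in the library behind `handle` (or in every loaded
// object when `handle` is RTLD_DEFAULT). Returns `fallback` if the
// symbol is not found.
inline unary_f64 resolve_unary(void* handle, const char* name, unary_f64 fallback) {
    dlerror();  // clear any stale error state
    void* sym = dlsym(handle, name);
    if (sym == nullptr) {
        return fallback;
    }
    return reinterpret_cast<unary_f64>(sym);
}
```

A call such as `resolve_unary(RTLD_DEFAULT, "sqrt", fallback_sqrt)` returns either the already loaded libm `sqrt` or the fallback. Resolving once and caching the pointer keeps the lookup cost out of the per-element loop.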
136132
137133
### Alignment status
138134

139135
The table below reflects the current bit-level parity between `numpycpp` C++ and Python numpy.
140136
All 460 tests pass under strict IEEE 754 bit comparison (float64 + float32).
141137

142-
✅ = bit-exact on AVX-512 (SVML bridge active).
143-
🔶 = 1-ULP on non-AVX-512 (falls back to `std::` math).
138+
✅ = bit-exact on ALL architectures (SVML bridge with runtime CPU dispatch).
144139

145140
| API group | float64 | float32 | Notes |
146141
|-------------------|:-------:|:-------:|-------|
@@ -154,9 +149,9 @@ All 460 tests pass under strict IEEE 754 bit comparison (float64 + float32).
154149
| Setops / interp ||| isin, intersect1d, interp, safe_divide |
155150
| Access / convert ||| array_get, asarray, to_vector |
156151
| **Math — element-wise** (sqrt, abs, sign, clip, round, floor, ceil, degrees, radians) ||| Pure C++, no libm dependency |
157-
| **Math — transcendental** (exp, log, sin, cos, tan, asin, acos, atan, log10, log2, exp2) || 🔶 | SVML bridge → bit-exact; fallback → std:: (1-ULP) |
158-
| **Math — power** || 🔶 | SVML bridge for f64; f32: std::pow |
159-
| **Math — atan2** || 🔶 | npy_atan2 via SVML bridge |
152+
| **Math — transcendental** (exp, log, sin, cos, tan, asin, acos, atan, log10, log2, exp2) || | npy_* scalar functions via dlsym, bit-exact on all archs |
153+
| **Math — power** || | npy_pow / npy_powf via SVML bridge |
154+
| **Math — atan2** || | npy_atan2 / npy_atan2f via SVML bridge |
160155
| **Reduction** (sum, mean, max, min, any, all) ||| pairwise_sum matches numpy exactly |
161156
| Statistical (std, var) ||| pairwise_sum + sqrt |
162157
| Binary (maximum, minimum) ||| std::max/min, deterministic |
@@ -165,7 +160,7 @@ All 460 tests pass under strict IEEE 754 bit comparison (float64 + float32).
165160
| **Norm (axis)** ||| Fiber-wise pairwise_sum + sqrt |
166161
| **Einsum** ||| All patterns (ij,ij→i, ij,jk→ik, bij,bjk→bik, etc.) |
167162

168-
> **SVML bridge**: On AVX-512 platforms, `numpycpp` resolves numpy's own SVML vector functions (`__svml_exp8`, `__svml_sin8`, etc.) from the loaded `_multiarray_umath.so` via `dlsym`. This guarantees bit-identical transcendental results. On non-AVX-512, `std::` fallbacks produce ≤ 1 ULP difference.
163+
> **SVML bridge**: At runtime, `numpycpp` detects CPU features (`__builtin_cpu_supports("avx512f")`) and selects the same math path numpy uses — AVX‑512 SVML vector functions (`__svml_exp8`, etc.) on supported hardware, or scalar `npy_exp`/`npy_log`/etc. otherwise. Both are resolved from the loaded `_multiarray_umath.so` via `dlsym`. AVX‑512 intrinsics are isolated behind `__attribute__((target("avx512f")))` — the binary compiles and runs safely on ANY x86_64 CPU without SIGILL.
169164
>
170165
> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). This matches `np.sum` exactly. Dot products and norms build on pairwise_sum, not BLAS — matching `np.sum(a*b)` and `np.sqrt(np.sum(a*a))` respectively.
171166
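
The pairwise summation described in the Reductions note can be sketched as below. This is modeled on the general shape of numpy's `pairwise_sum` (plain loop for tiny inputs, eight unrolled accumulators for a block, recursive split otherwise) and is not code from this repository. The block size of 128 and the exact split rule are assumptions about numpy's implementation, so bit-level agreement with `np.sum` should be verified and not assumed.

```cpp
#include <cstddef>

// Assumed block size below which the unrolled loop is used.
constexpr std::size_t kPairwiseBlock = 128;

// Sketch of pairwise summation over a contiguous array of doubles.
inline double pairwise_sum(const double* a, std::size_t n) {
    if (n < 8) {
        // Tiny input: plain left-to-right accumulation.
        double res = 0.0;
        for (std::size_t i = 0; i < n; ++i) res += a[i];
        return res;
    }
    if (n <= kPairwiseBlock) {
        // Eight independent accumulators, combined in a fixed tree order.
        double r[8];
        for (std::size_t j = 0; j < 8; ++j) r[j] = a[j];
        std::size_t i = 8;
        for (; i < n - (n % 8); i += 8) {
            for (std::size_t j = 0; j < 8; ++j) r[j] += a[i + j];
        }
        double res = ((r[0] + r[1]) + (r[2] + r[3])) +
                     ((r[4] + r[5]) + (r[6] + r[7]));
        for (; i < n; ++i) res += a[i];  // leftover elements
        return res;
    }
    // Large input: split roughly in half, keeping the left part a
    // multiple of 8, and recurse on both halves.
    std::size_t n2 = n / 2;
    n2 -= n2 % 8;
    return pairwise_sum(a, n2) + pairwise_sum(a + n2, n - n2);
}
```

Because floating-point addition is not associative, the order of additions fixed by this scheme is what determines the final bits; a naive left-to-right loop over the same data can differ in the last place.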
