array2d
diff --git a/README.md b/README.md
Lines changed: 40 additions & 46 deletions
diff --git a/numpy/core.h b/numpy/core.h
Lines changed: 17 additions & 19 deletions
@@ -13,11 +13,25 @@ We created `numpycpp` to keep NumPy's familiar usage patterns while letting C++
 
 ## Overview
 
-`numpycpp` is a **header-only C++ library** implementing numpy's core API (`numpy.*`, `numpy.linalg.*`, `numpy.einsum`) with **bit-level precision alignment**. Raw pointer + size interface. Zero external dependencies — pure C++17 standard library.
+`numpycpp` is a **header-only C++ library** implementing numpy's core API (`numpy.*`, `numpy.linalg.*`, `numpy.einsum`) to within **1–3 ULP of numpy** on transcendental functions and bit-exact everywhere else. Raw pointer + size interface. Zero external dependencies: pure C++17 standard library.
 
-All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (500 tests, float64 + float32).
+All APIs are tested against Python numpy (500 tests, float64 + float32). Transcendental functions match within 1–3 ULP; reductions, comparisons, and element-wise operations remain bit-exact.
 
-**Bit-exact math** is achieved by resolving numpy's own math functions from `_multiarray_umath.so` at runtime. The SVML bridge auto-detects your CPU and selects the same path numpy uses: AVX‑512 SVML (`__svml_exp8`) when available, or scalar `npy_exp`/`npy_log`/etc. otherwise. AVX‑512 intrinsics are isolated behind `__attribute__((target))` — the binary is safe on any x86_64 CPU (no SIGILL). Every transcendental function produces the exact same IEEE 754 bits as numpy on **all architectures**.
+## Design Rationale: Why Not Bit-Exact?
+
+Earlier versions of `numpycpp` used an **SVML bridge**: 678 lines of machinery that intercepted math calls via `dlopen`/`dlsym` on numpy's `_multiarray_umath.so`, parsed `/proc/self/maps` for auto-discovery, and dispatched between Intel SVML (`__svml_exp8`) and scalar `npy_*` functions at runtime. This achieved **strict IEEE 754 bit-exact alignment** with numpy on every x86_64 CPU.
+
+We removed it for three reasons:
+
+1. **It blocked compiler optimization.** The bridge required `-fno-builtin-exp`, `-fno-builtin-log`, … (17 flags in total), which prevented GCC from auto-vectorizing math loops. Even with AVX-512 SVML available, the scalar (non-SVML) path ran at 1/6th the speed of a simple `std::exp()` loop under `-O3 -fno-math-errno`.
+
+2. **The 1–3 ULP difference is practically irrelevant.** numpy's AVX-512 SVML path and the system libm differ by at most 1–3 ULP on transcendental functions, a relative error of ~10⁻¹⁵ for float64. In practice, softmax argmax does not flip, and cross-entropy loss differences stay below 10⁻¹⁵. Deep learning pipelines (GPU vs CPU, PyTorch vs JAX) already have larger numerical inconsistencies.
+
+3. **It was fragile and Linux-only.** `/proc/self/maps` parsing, `RTLD_NOLOAD` semantics, and AVX-512 intrinsic isolation behind `__attribute__((target))` made the code impossible to maintain, debug, or port to macOS/Windows.
+
+The clean approach — `std::exp()`, `std::log()`, etc. with `-fno-math-errno` — lets the compiler auto-vectorize to SSE/AVX2/AVX-512, works on any platform, and produces results within 1–3 ULP of numpy.
+
+> The old bit-exact implementation is preserved in the [`bit-exact`](https://github.com/array2d/numpycpp/tree/bit-exact) branch.
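
To make the "1–3 ULP" figure concrete, the sketch below counts how many representable float64 values separate two results. It is an illustration only: `ordered_bits` and `ulp_distance` are hypothetical helper names, not part of `numpycpp`, and the code assumes finite, non-NaN inputs and an IEEE 754 binary64 `double`.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper (not a numpycpp API): map a double onto a signed
// integer line on which adjacent representable doubles differ by exactly 1.
// Assumes IEEE 754 binary64 and a finite, non-NaN input.
inline std::int64_t ordered_bits(double x) {
    std::int64_t i;
    std::memcpy(&i, &x, sizeof i);  // well-defined way to read the bit pattern
    // Negative doubles are stored as sign + magnitude; convert them to a
    // two's-complement ordering so that -0.0 and +0.0 both map to 0.
    return i < 0 ? std::numeric_limits<std::int64_t>::min() - i : i;
}

// Number of representable doubles between a and b (0 means bit-identical,
// except that +0.0 and -0.0 also count as distance 0).
inline std::uint64_t ulp_distance(double a, double b) {
    const std::uint64_t ua = static_cast<std::uint64_t>(ordered_bits(a));
    const std::uint64_t ub = static_cast<std::uint64_t>(ordered_bits(b));
    // Unsigned subtraction wraps, so the difference is correct even when
    // a and b have opposite signs.
    return ordered_bits(a) >= ordered_bits(b) ? ua - ub : ub - ua;
}
```

A tolerance test would then compare a C++ result against the numpy reference with something like `ulp_distance(cpp_value, numpy_value) <= 3`, while the bit-exact groups require a distance of 0.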
 
 ## Quick Start
 
@@ -103,50 +117,30 @@ To run with verbose output:
 PYTHONPATH=tests:$PYTHONPATH python3 -m pytest tests/test_all.py -v
 ```
 
-### Compiler flags for bit-exact alignment
-
-Achieving bit-identical results with numpy requires strict control over floating-point code generation.
-The Makefile applies the following flags:
+### Compiler flags
 
 ```makefile
-CXXFLAGS ?= -std=c++17 -O2 -fPIC -fopenmp                \
-            -ffp-contract=off -ffloat-store -msse4.1      \
-            -mavx512f -mfma                                \
-            -fno-builtin-exp    -fno-builtin-log           \
-            -fno-builtin-sin    -fno-builtin-cos           \
-            -fno-builtin-tan    -fno-builtin-pow           \
-            -fno-builtin-sqrt   -fno-builtin-atan2         \
-            -fno-builtin-log2   -fno-builtin-log10         \
-            -fno-builtin-asin   -fno-builtin-acos          \
-            -fno-builtin-atan   -fno-builtin-exp2         \
-            -fno-builtin-cbrt   -fno-builtin-expm1      \
-            -fno-builtin-log1p
-LDFLAGS   = -shared -ldl
+CXXFLAGS ?= -std=c++17 -O3 -fPIC -fopenmp                \
+            -fno-math-errno -fno-trapping-math            \
+            -ffp-contract=off -msse4.1
 ```
 
 | Flag | Purpose |
 |------|---------|
-| `-ffp-contract=off` | Disable FMA contraction — numpy does not contract |
-| `-ffloat-store` | Prevent excess x87 precision in registers |
+| `-O3` | Full optimization + auto-vectorization for math loops |
+| `-fno-math-errno` | Tells GCC math functions don't set `errno` — **the key flag** that enables SIMD vectorization of `std::exp()` etc. |
+| `-fno-trapping-math` | Assume math functions don't trap — further enables vectorization |
+| `-ffp-contract=off` | Disable FMA contraction to keep reductions deterministic |
 | `-msse4.1` | Required for einsum SSE intrinsics (`_mm_hadd_pd`, `_mm_insert_epi32`) |
-| `-mavx512f -mfma` | Enable AVX‑512 compilation for SVML bridge. Intrinsics are runtime‑guarded via `__attribute__((target))` — safe on any x86_64 CPU (no SIGILL) |
-| `-fno-builtin-<func>` | Prevent GCC from replacing math calls with built‑ins, ensuring the SVML bridge intercepts every call |
-| `-ldl` | Required for `dlsym` at runtime to resolve numpy's math functions from `_multiarray_umath.so` |
-
-> **Runtime CPU dispatch**: The SVML bridge auto‑detects AVX‑512 at runtime
-> (`__builtin_cpu_supports`). On AVX‑512 hardware it calls numpy's SVML vector functions
-> (`__svml_exp8`, etc.); otherwise it falls back to numpy's scalar math functions
-> (`npy_exp`, `npy_log`, etc.). Both paths are resolved from the loaded
-> `_multiarray_umath.so` via `dlsym`. AVX‑512 intrinsics are isolated behind
-> `__attribute__((target("avx512f")))` so the binary runs safely on ANY
-> x86_64 CPU — no SIGILL.
+
+> **Performance**: With `-fno-math-errno`, GCC auto-vectorizes `std::exp()` loops to SSE2 (2×), AVX2 (4×), or AVX-512 (8×) depending on `-march`. On AVX2, element-wise exp achieves **6× speedup** over the old scalar bridge path.
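
The `-ffp-contract=off` entry in the flag table is easier to follow with a small example of what contraction changes. The sketch below contrasts a multiply followed by a subtract (two roundings) with a fused multiply-add (one rounding). The function names are hypothetical and exist only for this illustration; the `volatile` temporary forces the rounded product to be materialized, so the first function behaves the same whatever contraction setting the compiler uses.

```cpp
#include <cmath>

// Unfused: the product is rounded to double first, then the subtraction is
// rounded again. This is what a*b - c compiles to with -ffp-contract=off.
inline double mul_then_sub(double a, double b, double c) {
    volatile double p = a * b;  // volatile: keep the rounded product
    return p - c;
}

// Fused: a*b - c is computed exactly and rounded once, which is what a
// compiler may emit for the same expression when contraction is allowed
// and the target has FMA instructions.
inline double fused_mul_sub(double a, double b, double c) {
    return std::fma(a, b, -c);
}
```

With `a = 1 + 2⁻³⁰`, `b = 1 − 2⁻³⁰`, and `c = 1`, the exact product is `1 − 2⁻⁶⁰`, which rounds to `1.0` as a double. The unfused version therefore returns `0.0`, while the fused version returns `−2⁻⁶⁰`. Differences of this kind are why a reduction can stop matching numpy bit for bit once contraction is enabled.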
 
 ### Alignment status
 
-The table below reflects the current bit-level parity between `numpycpp` C++ and Python numpy.
-All 500 tests pass under strict IEEE 754 bit comparison (float64 + float32).
+The table below reflects the precision parity between `numpycpp` C++ and Python numpy.
+All 500 tests pass (within 1–3 ULP for transcendental functions, bit-exact for everything else).
 
-✅ = bit-exact on ALL architectures (SVML bridge with runtime CPU dispatch).
+✅ = bit-exact &nbsp; ◐ = within 1–3 ULP
 
 | API group         | float64 | float32 | Notes |
 |-------------------|:-------:|:-------:|-------|
@@ -159,22 +153,22 @@ All 500 tests pass under strict IEEE 754 bit comparison (float64 + float32).
 | Sorting           | ✅ | ✅ | argsort, argmax, argmin |
 | Setops / interp   | ✅ | ✅ | isin, intersect1d, interp, safe_divide |
 | Access / convert  | ✅ | ✅ | array_get, asarray, to_vector |
-| **Math — element-wise** (sqrt, abs, sign, clip, round, floor, ceil, degrees, radians) | ✅ | ✅ | Pure C++, no libm dependency |
-| **Math — transcendental** (exp, log, sin, cos, tan, asin, acos, atan, log10, log2, exp2, cbrt, expm1, log1p) | ✅ | ✅ | dlsym npy_* or SVML via bridge, bit-exact on all archs |
-| **Math — power**   | ✅ | ✅ | npy_pow / npy_powf via SVML bridge |
-| **Math — hypot**   | ✅ | ✅ | std::hypot — bit-exact (numpy matches libm) |
-| **Math — atan2**   | ✅ | ✅ | npy_atan2 / npy_atan2f via SVML bridge |
-| **Reduction** (sum, mean, max, min, any, all) | ✅ | ✅ | pairwise_sum matches numpy exactly |
+| **Math — element-wise** (sqrt, abs, sign, clip, round, floor, ceil, degrees, radians) | ✅ | ✅ | Pure C++, bit-exact |
+| **Math — transcendental** (exp, log, sin, cos, tan, asin, acos, atan, log10, log2, exp2, cbrt, expm1, log1p) | ◐ | ◐ | `std::` ±1–3 ULP vs numpy SVML; auto-vectorized 6–9× faster |
+| **Math — power**   | ◐ | ◐ | `std::pow`: bit-exact on non-AVX512, ±1 ULP on AVX512 |
+| **Math — hypot**   | ✅ | ✅ | `std::hypot` — bit-exact |
+| **Math — atan2**   | ◐ | ◐ | `std::atan2`: bit-exact on non-AVX512, ±1 ULP on AVX512 |
+| **Reduction** (sum, mean, max, min, any, all) | ✅ | ✅ | pairwise_sum, bit-exact (`-ffp-contract=off`) |
 | Statistical (std, var) | ✅ | ✅ | pairwise_sum + sqrt |
-| Binary (maximum, minimum) | ✅ | ✅ | std::max/min, deterministic |
-| **Dot product**    | ✅ | ✅ | pairwise_sum(a*b) — matches np.sum(a*b) |
+| Binary (maximum, minimum) | ✅ | ✅ | `std::max`/`min`, deterministic |
+| **Dot product**    | ✅ | ✅ | pairwise_sum(a*b) — bit-exact |
 | **Norm**           | ✅ | ✅ | pairwise_sum of squares + sqrt |
 | **Norm (axis)**    | ✅ | ✅ | Fiber-wise pairwise_sum + sqrt |
 | **Einsum**         | ✅ | ✅ | All patterns (ij,ij→i, ij,jk→ik, bij,bjk→bik, etc.) |
 
-> **SVML bridge**: At runtime, `numpycpp` detects CPU features (`__builtin_cpu_supports("avx512f")`) and selects the same math path numpy uses — AVX‑512 SVML vector functions (`__svml_exp8`, etc.) on supported hardware, or scalar `npy_exp`/`npy_log`/etc. otherwise. Both are resolved from the loaded `_multiarray_umath.so` via `dlsym`. AVX‑512 intrinsics are isolated behind `__attribute__((target("avx512f")))` — the binary compiles and runs safely on ANY x86_64 CPU without SIGILL.
+> **Math precision**: Transcendental functions use `std::` (system libm), which differs from numpy's AVX‑512 SVML path by 1–3 ULP. On non-AVX‑512 hardware, numpy also uses libm, so results are bit-exact. The ±1–3 ULP difference does not affect softmax argmax or cross-entropy loss in practice.
 >
-> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). This matches `np.sum` exactly. Dot products and norms build on pairwise_sum, not BLAS — matching `np.sum(a*b)` and `np.sqrt(np.sum(a*a))` respectively.
+> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). `-ffp-contract=off` ensures bit-exact results.
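
The pairwise summation described in the note above (recursive split, 8-accumulator unrolled) can be sketched roughly as follows. This is not the `pairwise_sum` from `numpycpp`: the name `pairwise_sum_sketch`, the block size of 128, and the order in which the eight accumulators are combined are assumptions made for illustration, so the sketch should not be expected to reproduce `np.sum` bit for bit.

```cpp
#include <cstddef>

// Hypothetical sketch of pairwise summation; not numpycpp's implementation.
template <typename T>
T pairwise_sum_sketch(const T* a, std::size_t n) {
    constexpr std::size_t kBlock = 128;  // assumed cutoff for the unrolled path

    if (n < 8) {
        // Tiny inputs: plain left-to-right accumulation.
        T s = T(0);
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    if (n <= kBlock) {
        // Eight independent accumulators, each summing every 8th element.
        T r[8];
        for (std::size_t k = 0; k < 8; ++k) r[k] = a[k];
        std::size_t i = 8;
        for (; i + 8 <= n; i += 8)
            for (std::size_t k = 0; k < 8; ++k) r[k] += a[i + k];
        // Combine the accumulators as a balanced tree, then add the tail.
        T s = ((r[0] + r[1]) + (r[2] + r[3])) + ((r[4] + r[5]) + (r[6] + r[7]));
        for (; i < n; ++i) s += a[i];
        return s;
    }
    // Large inputs: split near the middle, keeping the left half a multiple
    // of 8 so the unrolled path sees whole groups, and recurse on both halves.
    std::size_t half = n / 2;
    half -= half % 8;
    return pairwise_sum_sketch(a, half) + pairwise_sum_sketch(a + half, n - half);
}
```

Because the grouping of additions is fixed by `n` alone, the result is deterministic for a given input, provided the compiler neither reassociates nor fuses the additions. That is the role `-ffp-contract=off` plays in the flags above.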
 
 ## Project Structure
 
 
@@ -22,8 +22,6 @@
 #include <cstddef>
 #include <stdexcept>
 
-#include "svml_bridge.h"
-
 namespace numpy {
 
 // Stack-allocation threshold for small-array optimizations
@@ -86,55 +84,55 @@ inline void abs(const T* src, T* dst, size_t n) {
 /// numpy.exp(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void exp(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::exp(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::exp(src[i]));
 }
 
 /// numpy.log(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void log(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::log(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::log(src[i]));
 }
 
 /// numpy.sin(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void sin(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::sin(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::sin(src[i]));
 }
 
 /// numpy.cos(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void cos(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::cos(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::cos(src[i]));
 }
 
 /// numpy.tan(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void tan(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::tan(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::tan(src[i]));
 }
 
 /// numpy.cbrt(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void cbrt(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::cbrt(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::cbrt(src[i]));
 }
 
 /// numpy.expm1(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void expm1(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::expm1(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::expm1(src[i]));
 }
 
 /// numpy.log1p(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void log1p(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::log1p(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::log1p(src[i]));
 }
 
 /// numpy.power(x1, x2, /, out=None, *, where=True, ...)
 template<typename T>
 inline void power(const T* src, T* dst, size_t n, T exponent) {
-    NUMPY_UNROLL4(i, dst[i] = detail::pow(src[i], exponent));
+    NUMPY_UNROLL4(i, dst[i] = std::pow(src[i], exponent));
 }
 
 /// numpy.clip(a, a_min, a_max, out=None, **kwargs)
@@ -146,31 +144,31 @@ inline void clip(const T* src, T* dst, size_t n, T min_val, T max_val) {
 /// numpy.log10(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void log10(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::log10(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::log10(src[i]));
 }
 
 /// numpy.log2(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void log2(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::log2(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::log2(src[i]));
 }
 
 /// numpy.arcsin(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void arcsin(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::asin(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::asin(src[i]));
 }
 
 /// numpy.arccos(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void arccos(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::acos(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::acos(src[i]));
 }
 
 /// numpy.arctan(x, /, out=None, *, where=True, ...)
 template<typename T>
 inline void arctan(const T* src, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::atan(src[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::atan(src[i]));
 }
 
 /// numpy.round(a, decimals=0, out=None)
@@ -432,18 +430,18 @@ inline void isfinite(const T* src, bool* dst, size_t n) {
 /// numpy.hypot(x1, x2, /, out=None, *, where=True, ...) — array-array
 template<typename T>
 inline void hypot_array(const T* a, const T* b, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::hypot(a[i], b[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::hypot(a[i], b[i]));
 }
 
 /// numpy.arctan2(x1, x2, /, out=None, *, where=True, ...) — array-array
 template<typename T>
 inline void arctan2_array(const T* a, const T* b, T* dst, size_t n) {
-    NUMPY_UNROLL4(i, dst[i] = detail::atan2(a[i], b[i]));
+    NUMPY_UNROLL4(i, dst[i] = std::atan2(a[i], b[i]));
 }
 /// numpy.arctan2(x1, x2, /, out=None, *, where=True, ...) — array-scalar
 template<typename T>
 inline void arctan2_scalar(const T* src, T* dst, size_t n, T b) {
-    NUMPY_UNROLL4(i, dst[i] = detail::atan2(src[i], b));
+    NUMPY_UNROLL4(i, dst[i] = std::atan2(src[i], b));
 }
 /// numpy.maximum(x1, x2, /, out=None, *, where=True, ...) — array-array
 template<typename T>