docs: update README — empirically-tested minimum flags, 754 tests, detail/ API boundary

peng.li24 · peng.li24 · commit 26c525e89964 · 2026-06-06T16:47:43.000+08:00
- Compiler flags section rewritten based on empirical flag removal test:
  * REQUIRED: -ffp-contract=off  (36 einsum failures without it — implicit FMA)
  * REQUIRED: -mavx512f -mfma   (compile error — svml_bridge.h uses #ifdef __AVX512F__)
  * REQUIRED: -ldl               (link — dlsym for SVML/npy resolution)
  * OPTIONAL: -msse4.1           (all 754 tests pass without it)
  * OPTIONAL: all -fno-builtin-* (all 754 tests pass; numpycpp never calls exp() directly)
  * REMOVED:  -ffloat-store      (was in old README but never in CMakeLists.txt)
- Test count: 500 → 754 (added NaN passthrough, signed-zero, ±∞, domain, AVX boundary)
- Internal headers: svml_bridge.h/npy_math_float.h → numpy/detail/* with #error guard
- Project structure: add detail/ subdirectory, blas_bridge.h, avx512_loops.h
- Dot/Norm/Einsum: corrected 'pairwise_sum' → 'BLAS (cblas_sdot/sgemv/sgemm)'
diff --git a/README.md b/README.md
@@ -15,7 +15,7 @@ We created `numpycpp` to keep NumPy's familiar usage patterns while letting C++
 
 `numpycpp` is a **header-only C++ library** implementing numpy's core API (`numpy.*`, `numpy.linalg.*`, `numpy.einsum`) with **bit-level precision alignment**. Raw pointer + size interface. Zero external dependencies — pure C++17 standard library.
 
-All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (500 tests, float64 + float32).
+All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (754 tests, float64 + float32, including NaN passthrough, signed-zero, ±∞, and domain-error cases).
 
 **Bit-exact math** is achieved by resolving numpy's own math functions from `_multiarray_umath.so` at runtime. The SVML bridge auto-detects your CPU and selects the same path numpy uses: AVX‑512 SVML (`__svml_exp8`) when available, or scalar `npy_exp`/`npy_log`/etc. otherwise. AVX‑512 intrinsics are isolated behind `__attribute__((target))` — the binary is safe on any x86_64 CPU (no SIGILL). Every transcendental function produces the exact same IEEE 754 bits as numpy on **all architectures**.
 
@@ -35,8 +35,9 @@ All APIs are tested against Python numpy under strict bit-level comparison: ever
 #include "numpy/einsum.h"   // numpy.einsum
 ```
 
-> `numpy/svml_bridge.h` and `numpy/npy_math_float.h` are **internal** — they are
-> automatically pulled in by `core.h`. Do not include them directly.
+> `numpy/detail/` headers are **internal** — automatically pulled in by the
+> public headers. Do not include them directly; a compile-time `#error` fires
+> if you try.
 
 ```cpp
 std::vector<double> data = {1.0, 4.0, 9.0};
@@ -89,62 +90,86 @@ Add `-Ipath/to/numpycpp` to your compiler flags and include the headers directly
 ### Testing
 
 The test suite verifies **bit-level precision alignment** between every C++ function and Python numpy.
-No tolerance, no `atol`/`rtol` — raw IEEE 754 bits must match exactly. 500 tests, float64 + float32.
+No tolerance, no `atol`/`rtol` — raw IEEE 754 bits must match exactly.
+754 tests: float64 + float32, including NaN passthrough, signed-zero, ±∞, domain errors, and AVX-512 boundary sizes.
 
 ```bash
-cd tests
-make                    # compile C++ test module
-make test               # run all 500 tests (silent mode: only failures print)
+# build
+cmake -S tests -B tests/build
+cmake --build tests/build -j$(nproc)
+
+# run (silent on pass — failures print hex diff)
+cd tests && python3 -m pytest test_all.py -q --tb=short --no-header
 ```
 
-To run with verbose output:
+### Compiler flags for bit-exact alignment
 
-```bash
-PYTHONPATH=tests:$PYTHONPATH python3 -m pytest tests/test_all.py -v
+The minimum set of flags was determined empirically: each flag was removed in
+isolation, then the test module was rebuilt and the full 754-test suite re-run.
+Flags whose removal broke the build or caused at least one test failure are marked **required**.
+
+#### Minimum required flags
+
+```cmake
+target_compile_options(<target> PRIVATE
+    -ffp-contract=off   # REQUIRED — see below
+    -mavx512f -mfma     # REQUIRED — see below
+)
+target_link_libraries(<target> PRIVATE dl)   # REQUIRED — dlsym
 ```
 
-### Compiler flags for bit-exact alignment
+| Flag | Why required | Tested consequence of removal |
+|------|-------------|-------------------------------|
+| `-ffp-contract=off` | Prevents the compiler from silently fusing `a*b + c` into a single FMA instruction, which rounds once instead of twice and can change the last bit of the result. numpycpp's einsum accumulation loops must use the same multiply-then-add order as numpy's BLAS kernels. | 36 einsum tests fail with ±1 ULP differences. |
+| `-mavx512f -mfma` | The SVML bridge declares fast scalar wrappers (`exp_svml_f64`, etc.) inside `#ifdef __AVX512F__`. Without `-mavx512f` the compiler does not define `__AVX512F__`, so the preprocessor omits those declarations and the dispatcher fails to compile. AVX-512 intrinsics are runtime-guarded via `__builtin_cpu_supports`, so the binary is safe on non-AVX-512 CPUs. | Hard compile error: `'exp_svml_f64' was not declared in this scope`. |
+| `-ldl` | `dlsym` / `dlopen` are used at startup to locate numpy's `_multiarray_umath.so` and resolve `npy_exp`, `__svml_exp8`, etc. | Link error: `undefined reference to 'dlsym'`. |
+
+#### Recommended (defensive) flags
+
+Removing any one of these flags caused **no test failures** (all 754 tests
+still passed), but they are kept in `tests/CMakeLists.txt` as a safety net:
 
-Achieving bit-identical results with numpy requires strict control over floating-point code generation.
-The Makefile applies the following flags:
-
-```makefile
-CXXFLAGS ?= -std=c++17 -O2 -fPIC -fopenmp                \
-            -ffp-contract=off -ffloat-store -msse4.1      \
-            -mavx512f -mfma                                \
-            -fno-builtin-exp    -fno-builtin-log           \
-            -fno-builtin-sin    -fno-builtin-cos           \
-            -fno-builtin-tan    -fno-builtin-pow           \
-            -fno-builtin-sqrt   -fno-builtin-atan2         \
-            -fno-builtin-log2   -fno-builtin-log10         \
-            -fno-builtin-asin   -fno-builtin-acos          \
-            -fno-builtin-atan   -fno-builtin-exp2         \
-            -fno-builtin-cbrt   -fno-builtin-expm1      \
-            -fno-builtin-log1p
-LDFLAGS   = -shared -ldl
+```cmake
+target_compile_options(<target> PRIVATE
+    -msse4.1                 # baseline SSE4.1 (good practice; not currently needed)
+    -fno-builtin-exp         # \
+    -fno-builtin-log         #  |
+    -fno-builtin-sin         #  | prevent GCC from replacing direct math calls
+    -fno-builtin-cos         #  | with builtins — numpycpp never calls exp()/sin()
+    -fno-builtin-tan         #  | directly, so these have no measurable effect
+    -fno-builtin-pow         #  | today, but guard against accidental future regressions
+    -fno-builtin-sqrt        #  |
+    -fno-builtin-atan2       #  |
+    -fno-builtin-log2        #  |
+    -fno-builtin-log10       #  |
+    -fno-builtin-asin        #  |
+    -fno-builtin-acos        #  |
+    -fno-builtin-atan        #  |
+    -fno-builtin-exp2        #  |
+    -fno-builtin-cbrt        #  |
+    -fno-builtin-expm1       #  |
+    -fno-builtin-log1p       # /
+)
 ```
 
-| Flag | Purpose |
-|------|---------|
-| `-ffp-contract=off` | Disable FMA contraction — numpy does not contract |
-| `-ffloat-store` | Prevent excess x87 precision in registers |
-| `-msse4.1` | Required for einsum SSE intrinsics (`_mm_hadd_pd`, `_mm_insert_epi32`) |
-| `-mavx512f -mfma` | Enable AVX‑512 compilation for SVML bridge. Intrinsics are runtime‑guarded via `__attribute__((target))` — safe on any x86_64 CPU (no SIGILL) |
-| `-fno-builtin-<func>` | Prevent GCC from replacing math calls with built‑ins, ensuring the SVML bridge intercepts every call |
-| `-ldl` | Required for `dlsym` at runtime to resolve numpy's math functions from `_multiarray_umath.so` |
+> **Why `-fno-builtin-*` doesn't matter today**: numpycpp never calls `exp()`,
+> `sin()`, etc. from `<cmath>` directly.  Every transcendental is routed
+> through the SVML bridge's custom-named wrappers (`exp_npy_f64`,
+> `exp_svml_f64`, …) so GCC has no opportunity to substitute its own builtin.
+> The flags are retained for defensive clarity.
 
 > **Runtime CPU dispatch**: The SVML bridge auto‑detects AVX‑512 at runtime
-> (`__builtin_cpu_supports`). On AVX‑512 hardware it calls numpy's SVML vector functions
-> (`__svml_exp8`, etc.); otherwise it falls back to numpy's scalar math functions
-> (`npy_exp`, `npy_log`, etc.). Both paths are resolved from the loaded
-> `_multiarray_umath.so` via `dlsym`. AVX‑512 intrinsics are isolated behind
-> `__attribute__((target("avx512f")))` so the binary runs safely on ANY
-> x86_64 CPU — no SIGILL.
+> (`__builtin_cpu_supports("avx512f")`). On AVX‑512 hardware it calls numpy's
+> SVML vector functions (`__svml_exp8`, etc.); otherwise it falls back to
+> numpy's scalar math functions (`npy_exp`, `npy_log`, etc.). Both paths are
+> resolved from the loaded `_multiarray_umath.so` via `dlsym`. AVX‑512
+> intrinsics are isolated behind `__attribute__((target("avx512f")))`, so the
+> binary runs safely on **any** x86_64 CPU without SIGILL.
 
 ### Alignment status
 
 The table below reflects the current bit-level parity between `numpycpp` C++ and Python numpy.
-All 500 tests pass under strict IEEE 754 bit comparison (float64 + float32).
+All 754 tests pass under strict IEEE 754 bit comparison (float64 + float32).
 
 ✅ = bit-exact on ALL architectures (SVML bridge with runtime CPU dispatch).
 
@@ -167,35 +192,40 @@ All 500 tests pass under strict IEEE 754 bit comparison (float64 + float32).
 | **Reduction** (sum, mean, max, min, any, all) | ✅ | ✅ | pairwise_sum matches numpy exactly |
 | Statistical (std, var) | ✅ | ✅ | pairwise_sum + sqrt |
 | Binary (maximum, minimum) | ✅ | ✅ | std::max/min, deterministic |
-| **Dot product**    | ✅ | ✅ | pairwise_sum(a*b) — matches np.sum(a*b) |
-| **Norm**           | ✅ | ✅ | pairwise_sum of squares + sqrt |
-| **Norm (axis)**    | ✅ | ✅ | Fiber-wise pairwise_sum + sqrt |
+| **Dot product**    | ✅ | ✅ | BLAS (`cblas_sdot`/`cblas_ddot`) — matches `np.dot` |
+| **Norm**           | ✅ | ✅ | BLAS dot + sqrt — matches `np.linalg.norm` |
+| **Norm (axis)**    | ✅ | ✅ | BLAS dot per fiber + sqrt |
 | **Einsum**         | ✅ | ✅ | All patterns (ij,ij→i, ij,jk→ik, bij,bjk→bik, etc.) |
 
 > **SVML bridge**: At runtime, `numpycpp` detects CPU features (`__builtin_cpu_supports("avx512f")`) and selects the same math path numpy uses — AVX‑512 SVML vector functions (`__svml_exp8`, etc.) on supported hardware, or scalar `npy_exp`/`npy_log`/etc. otherwise. Both are resolved from the loaded `_multiarray_umath.so` via `dlsym`. AVX‑512 intrinsics are isolated behind `__attribute__((target("avx512f")))` — the binary compiles and runs safely on ANY x86_64 CPU without SIGILL.
 >
-> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). This matches `np.sum` exactly. Dot products and norms build on pairwise_sum, not BLAS — matching `np.sum(a*b)` and `np.sqrt(np.sum(a*a))` respectively.
+> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). This matches `np.sum` exactly.
+>
+> **Dot / Norm / Einsum**: Use BLAS (`cblas_sdot`/`cblas_ddot`, `cblas_sgemv`, `cblas_sgemm`), the same kernels numpy delegates to, so results are bit-identical.
 
 ## Project Structure
 
 ```
 numpycpp/
-├── numpy/              # native C++ headers
-│   ├── core.h          # [PUBLIC] numpy.* equivalents
-│   ├── linalg.h        # [PUBLIC] numpy.linalg.*
-│   ├── einsum.h        # [PUBLIC] numpy.einsum
-│   ├── svml_bridge.h   # [INTERNAL] do not include directly
-│   └── npy_math_float.h # [INTERNAL] do not include directly
-├── pycpp/              # pybind11 wrappers (optional)
+├── numpy/                    # native C++ headers
+│   ├── core.h                # [PUBLIC]   numpy.* equivalents
+│   ├── linalg.h              # [PUBLIC]   numpy.linalg.*
+│   ├── einsum.h              # [PUBLIC]   numpy.einsum
+│   └── detail/               # [INTERNAL] do not include directly — #error guard
+│       ├── svml_bridge.h     #   SVML / npy_* scalar math dispatch
+│       ├── npy_math_float.h  #   npy_* float32 wrappers
+│       ├── blas_bridge.h     #   BLAS (cblas) thin wrappers
+│       └── avx512_loops.h    #   AVX-512 vectorised exp/sin/cos loops
+├── pycpp/                    # pybind11 wrappers (optional)
 │   ├── core_py.h
 │   ├── linalg_py.h
 │   └── einsum_py.h
-├── tests/              # bit-level precision tests + test module
-│   ├── module.cpp      # pybind11 module for testing
-│   ├── test_all.py     # single entry — all APIs, 500 tests, float64+float32
-│   ├── conftest.py     # silent-mode output suppression
-│   └── Makefile
-├── CMakeLists.txt      # build & .deb packaging
+├── tests/                    # bit-level precision tests + test module
+│   ├── module.cpp            # pybind11 module for testing
+│   ├── test_all.py           # single entry — all APIs, 754 tests, float64+float32
+│   ├── conftest.py           # silent-mode output suppression
+│   └── CMakeLists.txt        # test-module build
+├── CMakeLists.txt            # build & .deb packaging
 └── README.md
 ```