Commit 26c525e

Author: peng.li24
docs: update README — empirically-tested minimum flags, 754 tests, detail/ API boundary
- Compiler flags section rewritten based on an empirical flag-removal test:
  * REQUIRED: -ffp-contract=off (36 einsum failures without it, caused by implicit FMA)
  * REQUIRED: -mavx512f -mfma (compile error without them: svml_bridge.h uses #ifdef __AVX512F__)
  * REQUIRED: -ldl (link error without it: dlsym is used for SVML/npy resolution)
  * OPTIONAL: -msse4.1 (all 754 tests pass without it)
  * OPTIONAL: all -fno-builtin-* (all 754 tests pass; numpycpp never calls exp() directly)
  * REMOVED: -ffloat-store (was in the old README but never in CMakeLists.txt)
- Test count: 500 → 754 (added NaN passthrough, signed-zero, ±∞, domain, AVX boundary)
- Internal headers: svml_bridge.h/npy_math_float.h → numpy/detail/* with #error guard
- Project structure: add detail/ subdirectory, blas_bridge.h, avx512_loops.h
- Dot/Norm/Einsum: corrected 'pairwise_sum' → 'BLAS (cblas_sdot/sgemv/sgemm)'
1 parent 635ef2a commit 26c525e

1 file changed

Lines changed: 90 additions & 60 deletions

File tree

README.md
Original file line number | Diff line number | Diff line change
@@ -15,7 +15,7 @@ We created `numpycpp` to keep NumPy's familiar usage patterns while letting C++
1515

1616
`numpycpp` is a **header-only C++ library** implementing numpy's core API (`numpy.*`, `numpy.linalg.*`, `numpy.einsum`) with **bit-level precision alignment**. Raw pointer + size interface. Zero external dependencies — pure C++17 standard library.
1717

18-
All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (500 tests, float64 + float32).
18+
All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (754 tests, float64 + float32, including NaN passthrough, signed-zero, ±∞, and domain-error cases).
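
To make "every IEEE 754 float bit must match" concrete, here is a minimal sketch of a bit-level comparison. The helpers `bits_of` and `same_bits` are hypothetical names chosen for illustration; they are not taken from numpycpp or its test suite.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not a numpycpp function): copy the raw bits of a double.
inline std::uint64_t bits_of(double x) {
    std::uint64_t u;
    std::memcpy(&u, &x, sizeof u);  // well-defined way to reinterpret the bytes
    return u;
}

// Two doubles are "bit-exact" only if every one of the 64 bits agrees.
inline bool same_bits(double a, double b) { return bits_of(a) == bits_of(b); }

// Shows why bit comparison is stricter than operator==.
inline void bit_compare_demo() {
    assert(0.0 == -0.0);            // numerically equal ...
    assert(!same_bits(0.0, -0.0));  // ... but the sign bit differs
    assert(same_bits(1.5, 1.5));
}
```

A comparison like this distinguishes signed zeros and NaN payloads, which `==` and tolerance-based checks do not.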
1919

2020
**Bit-exact math** is achieved by resolving numpy's own math functions from `_multiarray_umath.so` at runtime. The SVML bridge auto-detects your CPU and selects the same path numpy uses: AVX‑512 SVML (`__svml_exp8`) when available, or scalar `npy_exp`/`npy_log`/etc. otherwise. AVX‑512 intrinsics are isolated behind `__attribute__((target))` — the binary is safe on any x86_64 CPU (no SIGILL). Every transcendental function produces the exact same IEEE 754 bits as numpy on **all architectures**.
2121

@@ -35,8 +35,9 @@ All APIs are tested against Python numpy under strict bit-level comparison: ever
3535
#include "numpy/einsum.h" // numpy.einsum
3636
```
3737

38-
> `numpy/svml_bridge.h` and `numpy/npy_math_float.h` are **internal** — they are
39-
> automatically pulled in by `core.h`. Do not include them directly.
38+
> `numpy/detail/` headers are **internal** — automatically pulled in by the
39+
> public headers. Do not include them directly; a compile-time `#error` fires
40+
> if you try.
4041
4142
```cpp
4243
std::vector<double> data = {1.0, 4.0, 9.0};
@@ -89,62 +90,86 @@ Add `-Ipath/to/numpycpp` to your compiler flags and include the headers directly
8990
### Testing
9091

9192
The test suite verifies **bit-level precision alignment** between every C++ function and Python numpy.
92-
No tolerance, no `atol`/`rtol` — raw IEEE 754 bits must match exactly. 500 tests, float64 + float32.
93+
No tolerance, no `atol`/`rtol` — raw IEEE 754 bits must match exactly.
94+
754 tests: float64 + float32, including NaN passthrough, signed-zero, ±∞, domain errors, and AVX-512 boundary sizes.
9395

9496
```bash
95-
cd tests
96-
make # compile C++ test module
97-
make test # run all 500 tests (silent mode: only failures print)
97+
# build
98+
cmake -S tests -B tests/build
99+
cmake --build tests/build -j$(nproc)
100+
101+
# run (silent on pass — failures print hex diff)
102+
cd tests && python3 -m pytest test_all.py -q --tb=short --no-header
98103
```
99104

100-
To run with verbose output:
105+
### Compiler flags for bit-exact alignment
101106

102-
```bash
103-
PYTHONPATH=tests:$PYTHONPATH python3 -m pytest tests/test_all.py -v
107+
The minimum set of flags was determined empirically: each flag was removed in
108+
isolation and the full 754-test suite was re-run. Only flags whose removal
109+
caused at least one test failure are marked **required**.
110+
111+
#### Minimum required flags
112+
113+
```cmake
114+
target_compile_options(<target> PRIVATE
115+
-ffp-contract=off # REQUIRED — see below
116+
-mavx512f -mfma # REQUIRED — see below
117+
)
118+
target_link_libraries(<target> PRIVATE dl) # REQUIRED — dlsym
104119
```
105120

106-
### Compiler flags for bit-exact alignment
121+
| Flag | Why required | Tested consequence of removal |
122+
|------|-------------|-------------------------------|
123+
| `-ffp-contract=off` | Prevents the compiler from silently fusing `a*b + c` into a single FMA instruction. numpycpp's einsum accumulation loops must use the same multiply-then-add order as numpy's BLAS kernels. | 36 einsum tests fail with ±1 ULP differences. |
124+
| `-mavx512f -mfma` | The SVML bridge declares fast scalar wrappers (`exp_svml_f64`, etc.) inside `#ifdef __AVX512F__`. Without these flags the preprocessor omits those declarations and the dispatcher fails to compile. AVX-512 intrinsics are runtime-guarded via `__builtin_cpu_supports`, so the binary is safe on non-AVX-512 CPUs. | Hard compile error: `'exp_svml_f64' was not declared in this scope`. |
125+
| `-ldl` | `dlsym` / `dlopen` are used at startup to locate numpy's `_multiarray_umath.so` and resolve `npy_exp`, `__svml_exp8`, etc. | Link error: `undefined reference to 'dlsym'`. |
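
The `-ffp-contract=off` row can be illustrated with a small sketch of how a fused multiply-add differs from a separately rounded multiply followed by an add. The function names below are hypothetical and the inputs are chosen by hand; this is not numpycpp's einsum code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: multiply, round the product to double, then add.
inline double mul_then_add(double a, double b, double c) {
    volatile double p = a * b;  // volatile forces the rounded product to be stored
    return p + c;
}

// Hypothetical helper: a*b + c with a single rounding at the end.
inline double fused(double a, double b, double c) { return std::fma(a, b, c); }

inline void fma_demo() {
    const double a = 1.0 + std::ldexp(1.0, -27);     // 1 + 2^-27
    const double c = -(1.0 + std::ldexp(1.0, -26));  // -(1 + 2^-26)
    // Exact a*a is 1 + 2^-26 + 2^-54; the 2^-54 term is lost when the
    // product is rounded to double, so the separate add yields exactly 0.
    assert(mul_then_add(a, a, c) == 0.0);
    // The fused form keeps the exact product and returns 2^-54.
    assert(fused(a, a, c) == std::ldexp(1.0, -54));
}
```

When contraction is enabled and the target has FMA, the compiler may turn a plain `a * b + c` into the fused form, which is how results can drift from a multiply-then-add reference.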
126+
127+
#### Recommended (defensive) flags
128+
129+
These flags produced **no test failures** when removed individually (all 754
130+
tests still passed), but are kept in `tests/CMakeLists.txt` as a safety net:
107131

108-
Achieving bit-identical results with numpy requires strict control over floating-point code generation.
109-
The Makefile applies the following flags:
110-
111-
```makefile
112-
CXXFLAGS ?= -std=c++17 -O2 -fPIC -fopenmp \
113-
-ffp-contract=off -ffloat-store -msse4.1 \
114-
-mavx512f -mfma \
115-
-fno-builtin-exp -fno-builtin-log \
116-
-fno-builtin-sin -fno-builtin-cos \
117-
-fno-builtin-tan -fno-builtin-pow \
118-
-fno-builtin-sqrt -fno-builtin-atan2 \
119-
-fno-builtin-log2 -fno-builtin-log10 \
120-
-fno-builtin-asin -fno-builtin-acos \
121-
-fno-builtin-atan -fno-builtin-exp2 \
122-
-fno-builtin-cbrt -fno-builtin-expm1 \
123-
-fno-builtin-log1p
124-
LDFLAGS = -shared -ldl
132+
```cmake
133+
target_compile_options(<target> PRIVATE
134+
-msse4.1 # baseline SSE4.1 (good practice; not currently needed)
135+
-fno-builtin-exp # \
136+
-fno-builtin-log # |
137+
-fno-builtin-sin # | prevent GCC from replacing direct math calls
138+
-fno-builtin-cos # | with builtins — numpycpp never calls exp()/sin()
139+
-fno-builtin-tan # | directly, so these have no measurable effect
140+
-fno-builtin-pow # | today, but guard against accidental future regressions
141+
-fno-builtin-sqrt # |
142+
-fno-builtin-atan2 # |
143+
-fno-builtin-log2 # |
144+
-fno-builtin-log10 # |
145+
-fno-builtin-asin # |
146+
-fno-builtin-acos # |
147+
-fno-builtin-atan # |
148+
-fno-builtin-exp2 # |
149+
-fno-builtin-cbrt # |
150+
-fno-builtin-expm1 # |
151+
-fno-builtin-log1p # /
152+
)
125153
```
126154

127-
| Flag | Purpose |
128-
|------|---------|
129-
| `-ffp-contract=off` | Disable FMA contraction — numpy does not contract |
130-
| `-ffloat-store` | Prevent excess x87 precision in registers |
131-
| `-msse4.1` | Required for einsum SSE intrinsics (`_mm_hadd_pd`, `_mm_insert_epi32`) |
132-
| `-mavx512f -mfma` | Enable AVX‑512 compilation for SVML bridge. Intrinsics are runtime‑guarded via `__attribute__((target))` — safe on any x86_64 CPU (no SIGILL) |
133-
| `-fno-builtin-<func>` | Prevent GCC from replacing math calls with built‑ins, ensuring the SVML bridge intercepts every call |
134-
| `-ldl` | Required for `dlsym` at runtime to resolve numpy's math functions from `_multiarray_umath.so` |
155+
> **Why `-fno-builtin-*` doesn't matter today**: numpycpp never calls `exp()`,
156+
> `sin()`, etc. from `<cmath>` directly. Every transcendental is routed
157+
> through the SVML bridge's custom-named wrappers (`exp_npy_f64`,
158+
> `exp_svml_f64`, …) so GCC has no opportunity to substitute its own builtin.
159+
> The flags are retained for defensive clarity.
135160
136161
> **Runtime CPU dispatch**: The SVML bridge auto‑detects AVX‑512 at runtime
137-
> (`__builtin_cpu_supports`). On AVX‑512 hardware it calls numpy's SVML vector functions
138-
> (`__svml_exp8`, etc.); otherwise it falls back to numpy's scalar math functions
139-
> (`npy_exp`, `npy_log`, etc.). Both paths are resolved from the loaded
140-
> `_multiarray_umath.so` via `dlsym`. AVX‑512 intrinsics are isolated behind
141-
> `__attribute__((target("avx512f")))` so the binary runs safely on ANY
142-
> x86_64 CPU — no SIGILL.
162+
> (`__builtin_cpu_supports("avx512f")`). On AVX‑512 hardware it calls numpy's
163+
> SVML vector functions (`__svml_exp8`, etc.); otherwise it falls back to
164+
> numpy's scalar math functions (`npy_exp`, `npy_log`, etc.). Both paths are
165+
> resolved from the loaded `_multiarray_umath.so` via `dlsym`. AVX‑512
166+
> intrinsics are isolated behind `__attribute__((target("avx512f")))`, so the
167+
> binary compiles and runs safely on **any** x86_64 CPU without SIGILL.
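
The dispatch pattern described above can be sketched as follows. This is a minimal illustration assuming GCC or Clang on x86-64; the `scale*` functions and the `SKETCH_HAVE_X86_DISPATCH` macro are hypothetical and stand in for the real SVML bridge, whose internals are not shown in this diff.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAVE_X86_DISPATCH 1
#endif

// Portable fallback path.
static void scale_scalar(const double* in, double* out, std::size_t n, double k) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * k;
}

#ifdef SKETCH_HAVE_X86_DISPATCH
// Only this function is compiled with AVX-512 enabled; the rest of the
// translation unit keeps the baseline instruction set.
__attribute__((target("avx512f")))
static void scale_avx512(const double* in, double* out, std::size_t n, double k) {
    const __m512d vk = _mm512_set1_pd(k);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm512_storeu_pd(out + i, _mm512_mul_pd(_mm512_loadu_pd(in + i), vk));
    for (; i < n; ++i) out[i] = in[i] * k;  // scalar tail
}
#endif

// Picks the AVX-512 path only when the running CPU reports support for it.
inline void scale(const double* in, double* out, std::size_t n, double k) {
#ifdef SKETCH_HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx512f")) {
        scale_avx512(in, out, n, k);
        return;
    }
#endif
    scale_scalar(in, out, n, k);
}
```

Because the AVX-512 function is reached only after the runtime check, a CPU without AVX-512 never executes those instructions.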
143168
144169
### Alignment status
145170

146171
The table below reflects the current bit-level parity between `numpycpp` C++ and Python numpy.
147-
All 500 tests pass under strict IEEE 754 bit comparison (float64 + float32).
172+
All 754 tests pass under strict IEEE 754 bit comparison (float64 + float32).
148173

149174
✅ = bit-exact on ALL architectures (SVML bridge with runtime CPU dispatch).
150175

@@ -167,35 +192,40 @@ All 500 tests pass under strict IEEE 754 bit comparison (float64 + float32).
167192
| **Reduction** (sum, mean, max, min, any, all) ||| pairwise_sum matches numpy exactly |
168193
| Statistical (std, var) ||| pairwise_sum + sqrt |
169194
| Binary (maximum, minimum) ||| std::max/min, deterministic |
170-
| **Dot product** ||| pairwise_sum(a*b) — matches np.sum(a*b) |
171-
| **Norm** ||| pairwise_sum of squares + sqrt |
172-
| **Norm (axis)** ||| Fiber-wise pairwise_sum + sqrt |
195+
| **Dot product** ||| BLAS (`cblas_sdot`/`cblas_ddot`) — matches `np.dot` |
196+
| **Norm** ||| BLAS dot + sqrt — matches `np.linalg.norm` |
197+
| **Norm (axis)** ||| BLAS dot per fiber + sqrt |
173198
| **Einsum** ||| All patterns (ij,ij→i, ij,jk→ik, bij,bjk→bik, etc.) |
174199

175200
> **SVML bridge**: At runtime, `numpycpp` detects CPU features (`__builtin_cpu_supports("avx512f")`) and selects the same math path numpy uses — AVX‑512 SVML vector functions (`__svml_exp8`, etc.) on supported hardware, or scalar `npy_exp`/`npy_log`/etc. otherwise. Both are resolved from the loaded `_multiarray_umath.so` via `dlsym`. AVX‑512 intrinsics are isolated behind `__attribute__((target("avx512f")))` — the binary compiles and runs safely on ANY x86_64 CPU without SIGILL.
176201
>
177-
> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). This matches `np.sum` exactly. Dot products and norms build on pairwise_sum, not BLAS — matching `np.sum(a*b)` and `np.sqrt(np.sum(a*a))` respectively.
202+
> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). This matches `np.sum` exactly.
203+
>
204+
> **Dot / Norm / Einsum**: Use BLAS (`cblas_sdot`, `cblas_sgemv`, `cblas_sgemm`) — the same kernels numpy delegates to — so results are bit-identical.
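
The "recursive split, 8-accumulator unrolled" description of the reductions can be sketched as below. This is an illustration written from that description, not numpycpp's or numpy's source; the block size `kBlock = 128` and the exact handling of short inputs are assumptions, so it is not claimed to match `np.sum` bit for bit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed block size below which the unrolled loop is used directly.
constexpr std::size_t kBlock = 128;

// Hypothetical pairwise summation sketch.
inline double pairwise_sum(const double* a, std::size_t n) {
    if (n < 8) {  // short input: plain left-to-right loop
        double res = 0.0;
        for (std::size_t i = 0; i < n; ++i) res += a[i];
        return res;
    }
    if (n <= kBlock) {  // medium input: 8 independent accumulators
        double r[8];
        for (std::size_t j = 0; j < 8; ++j) r[j] = a[j];
        std::size_t i = 8;
        for (; i + 8 <= n; i += 8)
            for (std::size_t j = 0; j < 8; ++j) r[j] += a[i + j];
        // Combine the accumulators in a fixed tree order.
        double res = ((r[0] + r[1]) + (r[2] + r[3])) +
                     ((r[4] + r[5]) + (r[6] + r[7]));
        for (; i < n; ++i) res += a[i];  // leftover elements
        return res;
    }
    // Large input: split roughly in half, keeping the left part a multiple of 8.
    std::size_t half = n / 2;
    half -= half % 8;
    return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
}

inline void pairwise_demo() {
    std::vector<double> ones(1000, 1.0);
    assert(pairwise_sum(ones.data(), ones.size()) == 1000.0);
}
```

The fixed combination order is what matters for bit-exactness: floating-point addition is not associative, so a different grouping can change the last bits of the result.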
178205
179206
## Project Structure
180207

181208
```
182209
numpycpp/
183-
├── numpy/ # native C++ headers
184-
│ ├── core.h # [PUBLIC] numpy.* equivalents
185-
│ ├── linalg.h # [PUBLIC] numpy.linalg.*
186-
│ ├── einsum.h # [PUBLIC] numpy.einsum
187-
│ ├── svml_bridge.h # [INTERNAL] do not include directly
188-
│ └── npy_math_float.h # [INTERNAL] do not include directly
189-
├── pycpp/ # pybind11 wrappers (optional)
210+
├── numpy/ # native C++ headers
211+
│ ├── core.h # [PUBLIC] numpy.* equivalents
212+
│ ├── linalg.h # [PUBLIC] numpy.linalg.*
213+
│ ├── einsum.h # [PUBLIC] numpy.einsum
214+
│ └── detail/ # [INTERNAL] do not include directly — #error guard
215+
│ ├── svml_bridge.h # SVML / npy_* scalar math dispatch
216+
│ ├── npy_math_float.h # npy_* float32 wrappers
217+
│ ├── blas_bridge.h # BLAS (cblas) thin wrappers
218+
│ └── avx512_loops.h # AVX-512 vectorised exp/sin/cos loops
219+
├── pycpp/ # pybind11 wrappers (optional)
190220
│ ├── core_py.h
191221
│ ├── linalg_py.h
192222
│ └── einsum_py.h
193-
├── tests/ # bit-level precision tests + test module
194-
│ ├── module.cpp # pybind11 module for testing
195-
│ ├── test_all.py # single entry — all APIs, 500 tests, float64+float32
196-
│ ├── conftest.py # silent-mode output suppression
197-
│ └── Makefile
198-
├── CMakeLists.txt # build & .deb packaging
223+
├── tests/ # bit-level precision tests + test module
224+
│ ├── module.cpp # pybind11 module for testing
225+
│ ├── test_all.py # single entry — all APIs, 754 tests, float64+float32
226+
│ ├── conftest.py # silent-mode output suppression
227+
│ └── CMakeLists.txt # test-module build
228+
├── CMakeLists.txt # build & .deb packaging
199229
└── README.md
200230
```
201231
