docs: add per-function ULP error tables + honest test reporting

README:
- Detailed compiler flag guide: required/optional/harmful flags with
  explanations for each, so other Claude instances can correctly compile
- Per-function ULP precision tables for float64/float32, referencing
  tests/ulp_precision.csv as the canonical data source

tests:
- Add ULP distance computation (ulp_f64, ulp_f32) to test_all.py
- check_bit_aligned now reports actual ULP values on mismatch instead
  of hiding behind rtol/atol tolerance
- assert_bit_aligned prints "OK fn: ±N ULP (X/Y elements differ)" for
  every transcendental test, giving honest, transparent precision reporting
- New tests/make_csv.py: regenerates ulp_precision.csv from C++ module
- tests/ulp_precision.csv: 48 rows, all functions × f64/f32,
  measured over 100K random samples on AVX-512 hardware
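
The ULP distance mentioned above can be computed by reinterpreting each float as an integer and subtracting. The following is a minimal standard-library sketch, not the code from this commit: the names `ulp_f64` and `ulp_f32` come from the commit message, while their signatures and the helpers `_ordered_bits` and `format_report` are assumptions made for illustration. NaN inputs are not handled.

```python
import struct

def _ordered_bits(x: float, float_fmt: str, uint_fmt: str, width: int) -> int:
    """Map a float to an integer whose ordering matches the float ordering."""
    (bits,) = struct.unpack(uint_fmt, struct.pack(float_fmt, x))
    sign = 1 << (width - 1)
    # IEEE 754 is sign-magnitude, so negative values count downward from zero.
    return -(bits & (sign - 1)) if bits & sign else bits

def ulp_f64(a: float, b: float) -> int:
    """Number of representable float64 steps between a and b."""
    return abs(_ordered_bits(a, "<d", "<Q", 64) - _ordered_bits(b, "<d", "<Q", 64))

def ulp_f32(a: float, b: float) -> int:
    """Same as ulp_f64, after rounding both inputs to float32."""
    return abs(_ordered_bits(a, "<f", "<I", 32) - _ordered_bits(b, "<f", "<I", 32))

def format_report(name, got, want):
    """Hypothetical helper producing the report line quoted in the commit message."""
    dists = [ulp_f64(g, w) for g, w in zip(got, want)]
    differ = sum(d != 0 for d in dists)
    return f"OK {name}: ±{max(dists, default=0)} ULP ({differ}/{len(dists)} elements differ)"

print(format_report("exp", [1.0, 2.0], [1.0, 2.0 + 2.0**-51]))
```

Mapping the sign-magnitude bit pattern to a monotonic integer makes `0.0` and `-0.0` compare as 0 ULP apart and lets the distance be taken across zero.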
The flags below are the **minimum correct set** to compile `numpycpp` with full auto-vectorization and deterministic reductions. Do NOT use `-ffast-math` — it enables `-ffinite-math-only` which breaks `isnan`/`isinf`/`isfinite`, and enables FMA contraction which breaks bit-exact reductions.
+
+**Required flags** (GCC / Clang):

 ```makefile
 CXXFLAGS ?= -std=c++17 -O3 -fPIC -fopenmp \
             -fno-math-errno -fno-trapping-math \
             -ffp-contract=off -msse4.1
 ```

+**MSVC** (Visual Studio 2019+):
+
+```
+/std:c++17 /O2 /openmp /fp:strict /arch:AVX2
+```
+
+| Flag | Category | Purpose | Required? |
+|------|----------|---------|-----------|
+|`-std=c++17`| Language | C++17 standard (structured bindings, `if constexpr`, fold expressions) |**Yes**|
+|`-O3`| Optimization | Full optimization including auto-vectorization of math loops |**Yes**|
+|`-fPIC`| ABI | Position-independent code (needed for shared libraries / pybind11 modules) | Yes for `.so`|
+|`-fopenmp`| Parallelism | OpenMP `#pragma omp parallel for` used in reductions | Yes if using reductions |
+|`-fno-math-errno`| Vectorization |**The key flag.** Without it, GCC assumes `std::exp()`/`std::log()` etc. may set `errno`, which prevents SIMD vectorization. This flag alone enables SSE2/AVX2/AVX-512 auto-vectorization of math functions. |**Yes**|
+|`-fno-trapping-math`| Vectorization | Assume math ops don't trap (no SIGFPE). Further enables vectorization of edge cases. |**Yes**|
+|`-ffp-contract=off`| Determinism | Disable FMA contraction (a*b+c → fma(a,b,c)). Required for bit-exact reductions and pairwise_sum. Without this, `sum()` results diverge from numpy. |**Yes**|
+|`-msse4.1`| Intrinsics | Required for einsum SSE intrinsics: `_mm_hadd_pd`, `_mm_insert_epi32`| Yes for einsum |
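
The effect that `-ffp-contract=off` guards against can be reproduced without a compiler. The sketch below is illustrative only and is not part of `numpycpp`: it emulates a fused multiply-add with exact rational arithmetic (one rounding at the end) and compares it with the ordinary two-rounding `a*b + c`. The helper name `fused_multiply_add` and the chosen inputs are assumptions made for the demonstration.

```python
from fractions import Fraction

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate fma(a, b, c): compute a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # converting a Fraction rounds to the nearest double

# a*b is exactly 1 + 2**-26 + 2**-54; the last term is lost when the
# product is rounded to a double before the addition.
a = b = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

unfused = a * b + c                  # two roundings: product, then sum
fused = fused_multiply_add(a, b, c)  # one rounding

print(unfused, fused)
```

With these inputs the two-step result is exactly zero while the fused result keeps the `2**-54` term, so a compiler that contracts some multiply-adds but not others changes the bits of a reduction.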
+
+**Optional flags for performance:**
+
 | Flag | Purpose |
 |------|---------|
-|`-O3`| Full optimization + auto-vectorization for math loops |
-|`-fno-math-errno`| Tells GCC math functions don't set `errno`: **the key flag** that enables SIMD vectorization of `std::exp()` etc. |
-|`-fno-trapping-math`| Assume math functions don't trap, which further enables vectorization |
-|`-ffp-contract=off`| Disable FMA contraction to keep reductions deterministic |
-|`-msse4.1`| Required for einsum SSE intrinsics (`_mm_hadd_pd`, `_mm_insert_epi32`) |
-
-> **Performance**: With `-fno-math-errno`, GCC auto-vectorizes `std::exp()` loops to SSE2 (2×), AVX2 (4×), or AVX-512 (8×) depending on `-march`. On AVX2, element-wise exp achieves **6× speedup** over the old scalar bridge path.
-
-### Alignment status
-
-The table below reflects the precision parity between `numpycpp` C++ and Python numpy.
-All 500 tests pass (≤1 ULP tolerance for transcendental functions, bit-exact for everything else).
-> **Math precision**: Transcendental functions use `std::` (system libm), which differs from numpy's AVX-512 SVML path by 1–3 ULP. On non-AVX-512 hardware, numpy also uses libm, so results are bit-exact. The ±1–3 ULP difference does not affect softmax argmax or cross-entropy loss in practice.
+|`-march=native`| Auto-detect CPU features (AVX2, AVX-512) for wider SIMD. Without it, GCC defaults to SSE2 (2-wide float64). |
+|`-mavx2`| Explicit AVX2 (4-wide float64). Use if cross-compiling for AVX2 targets. |
+|`-mavx512f`| AVX-512 (8-wide float64). Only if your deployment hardware supports it. |
+|`-march=x86-64-v3`| x86-64 microarchitecture level v3: AVX2 + FMA + BMI. Good portable baseline for modern CPUs (~2013+). |
+
+**⚠️ What NOT to use:**
+
+| Flag | Why it's harmful |
+|------|-----------------|
+|`-ffast-math`| Enables `-ffinite-math-only` (breaks `isnan`/`isinf`/`isfinite`) and `-funsafe-math-optimizations` (allows reassociation and FMA contraction, breaks reduction determinism). Replace it with the targeted flags above. |
+|`-fno-builtin-exp`, `-fno-builtin-log`, … (14 flags) | These were needed by the old SVML bridge. Now they **block** auto-vectorization. Never use them. |
+|`-ffloat-store`| Forces spills to memory after every FP op, killing performance. Was needed by the old bridge. Never use. |
+
+**Performance impact** (GCC 11+ with `-march=native`):
The tables below show the **maximum ULP (Unit in the Last Place) difference** between `numpycpp` C++ output and Python numpy, measured over 100,000 random samples per function.
cd tests && make csv # requires compiled numpycpp.so
+```
+
+0 ULP = bit-exact (identical IEEE 754 bits). Non-zero = maximum observed ULP distance (system libm vs numpy SVML/libm).
+
+> **Reading ULP values**: For float64, 1 ULP ≈ 2.22×10⁻¹⁶ × |value|. For float32, 1 ULP ≈ 1.19×10⁻⁷ × |value|. A 4-ULP difference at 1.0 means the two values differ by ~8.88×10⁻¹⁶ in float64, or ~4.77×10⁻⁷ in float32.
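
The arithmetic in the note above can be checked with a few lines of standard-library Python. This is a sketch under stated assumptions: the helper names `ulp_size` and `ulps_to_abs` are hypothetical, and only nonzero normal values are handled.

```python
import math

EPS_F64 = 2.0**-52  # spacing of float64 values just above 1.0 (about 2.22e-16)
EPS_F32 = 2.0**-23  # spacing of float32 values just above 1.0 (about 1.19e-7)

def ulp_size(value: float, eps: float) -> float:
    """Spacing between adjacent floats in the power-of-two interval holding value."""
    # frexp returns an exponent with 2**(exponent-1) <= abs(value) < 2**exponent.
    _, exponent = math.frexp(abs(value))
    return math.ldexp(eps, exponent - 1)

def ulps_to_abs(n_ulp: int, value: float, eps: float) -> float:
    """Absolute difference that corresponds to n_ulp steps near value."""
    return n_ulp * ulp_size(value, eps)

print(ulps_to_abs(4, 1.0, EPS_F64))  # 4 ULP at 1.0 in float64
print(ulps_to_abs(4, 1.0, EPS_F32))  # 4 ULP at 1.0 in float32
```

The spacing is constant between consecutive powers of two, so `× |value|` is an approximation: 1.0 and 1.5 share the same ULP size, and it doubles at 2.0.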
+
+#### float64 (1 ULP = 2.22×10⁻¹⁶)
+
+| Function | Max ULP | Category |
+|----------|:-------:|----------|
+|`exp`| 1 | transcendental |
+|`log`| 2 | transcendental |
+|`sin`| 3 | transcendental |
+|`cos`| 3 | transcendental |
+|`tan`| 3 | transcendental |
+|`cbrt`| 4 | transcendental |
+|`expm1`| 2 | transcendental |
+|`log1p`| 2 | transcendental |
+|`log10`| 3 | transcendental |
+|`log2`| 2 | transcendental |
+|`asin` (arcsin) | 2 | transcendental |
+|`acos` (arccos) | 2 | transcendental |
+|`atan` (arctan) | 2 | transcendental |
+|`pow`| 0 | binary |
+|`atan2`| 0 | binary |
+|`hypot`| 0 | binary |
+|`sqrt`| 0 | element-wise |
+|`abs`| 0 | element-wise |
+|`sign`| 0 | element-wise |
+|`round`| 0 | element-wise |
+|`floor`| 0 | element-wise |
+|`ceil`| 0 | element-wise |
+|`degrees`| 0 | element-wise |
+|`radians`| 0 | element-wise |
+
+#### float32 (1 ULP = 1.19×10⁻⁷)
+
+| Function | Max ULP | Category |
+|----------|:-------:|----------|
+|`exp`| 2 | transcendental |
+|`log`| 3 | transcendental |
+|`sin`| 1 | transcendental |
+|`cos`| 1 | transcendental |
+|`tan`| 3 | transcendental |
+|`cbrt`| 2 | transcendental |
+|`expm1`| 2 | transcendental |
+|`log1p`| 2 | transcendental |
+|`log10`| 4 | transcendental |
+|`log2`| 2 | transcendental |
+|`asin` (arcsin) | 3 | transcendental |
+|`acos` (arccos) | 2 | transcendental |
+|`atan` (arctan) | 2 | transcendental |
+|`pow`| 0 | binary |
+|`atan2`| 0 | binary |
+|`hypot`| 0 | binary |
+|`sqrt`| 0 | element-wise |
+|`abs`| 0 | element-wise |
+|`sign`| 0 | element-wise |
+|`round`| 0 | element-wise |
+|`floor`| 0 | element-wise |
+|`ceil`| 0 | element-wise |
+|`degrees`| 0 | element-wise |
+|`radians`| 0 | element-wise |
+
+**Non-math APIs: all bit-exact (0 ULP)** for both float64 and float32:
> **Why not bit-exact for transcendentals?** numpy dispatches to Intel SVML (`__svml_exp8`, etc.) on AVX-512 hardware via its `_multiarray_umath.so`. `numpycpp` uses `std::` (system libm). These are two different implementations of the same mathematical functions — both IEEE 754 compliant, but differing by 1–4 ULP. On non-AVX-512 hardware, numpy also uses libm, so results are bit-exact.
 >
-> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). `-ffp-contract=off` ensures bit-exact results.
+> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). `-ffp-contract=off` ensures bit-exact results regardless of hardware.
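
The pairwise scheme named in the note (recursive split, eight unrolled accumulators) can be sketched in pure Python as follows. This is an illustration of the technique, not the `numpycpp` or numpy source: the function name, the block size of 128, and the handling of short inputs are assumptions, and the sketch makes no claim to reproduce `numpy.sum` bit for bit.

```python
import math

BLOCK = 128  # assumed cutoff below which the unrolled loop is used

def pairwise_sum(a, lo=0, hi=None):
    """Sum a[lo:hi] by recursive halving with an 8-accumulator base case."""
    if hi is None:
        hi = len(a)
    n = hi - lo
    if n < 8:
        # Too short to unroll: plain left-to-right loop.
        total = 0.0
        for i in range(lo, hi):
            total += a[i]
        return total
    if n <= BLOCK:
        # Eight independent partial sums, combined in a fixed tree order.
        r = [a[lo + k] for k in range(8)]
        i = 8
        while i + 8 <= n:
            for k in range(8):
                r[k] += a[lo + i + k]
            i += 8
        total = ((r[0] + r[1]) + (r[2] + r[3])) + ((r[4] + r[5]) + (r[6] + r[7]))
        for j in range(lo + i, hi):  # leftover elements (fewer than 8)
            total += a[j]
        return total
    # Split near the middle, keeping the left half a multiple of 8.
    half = n // 2
    half -= half % 8
    return pairwise_sum(a, lo, lo + half) + pairwise_sum(a, lo + half, hi)

data = [0.1] * 100_000
print(pairwise_sum(data), math.fsum(data))
```

Because the order of additions is fixed by the input length alone, the result is reproducible as long as the compiler neither reorders nor fuses the operations, which is why the note ties bit-exactness to `-ffp-contract=off`.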