Skip to content

Commit b4dd568

Browse files
author
peng.li24
committed
docs: add per-function ULP error tables + honest test reporting
README: - Detailed compiler flag guide: required/optional/harmful flags with explanations for each, so other Claude instances can correctly compile - Per-function ULP precision tables for float64/float32, referencing tests/ulp_precision.csv as the canonical data source tests: - Add ULP distance computation (ulp_f64, ulp_f32) to test_all.py - check_bit_aligned now reports actual ULP values on mismatch instead of hiding behind rtol/atol tolerance - assert_bit_aligned prints "OK fn: ±N ULP (X/Y elements differ)" for every transcendental test — honest, transparent precision reporting - New tests/make_csv.py: regenerates ulp_precision.csv from C++ module - tests/ulp_precision.csv: 48 rows, all functions × f64/f32, measured over 100K random samples on AVX-512 hardware
1 parent 2de2912 commit b4dd568

5 files changed

Lines changed: 473 additions & 75 deletions

File tree

README.md

Lines changed: 137 additions & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -117,58 +117,153 @@ To run with verbose output:
117117
PYTHONPATH=tests:$PYTHONPATH python3 -m pytest tests/test_all.py -v
118118
```
119119

120-
### Compiler flags
120+
### Compiler Flag Guide
121+
122+
The flags below are the **minimum correct set** to compile `numpycpp` with full auto-vectorization and deterministic reductions. Do NOT use `-ffast-math` — it enables `-ffinite-math-only` which breaks `isnan`/`isinf`/`isfinite`, and enables FMA contraction which breaks bit-exact reductions.
123+
124+
**Required flags** (GCC / Clang):
121125

122126
```makefile
123127
CXXFLAGS ?= -std=c++17 -O3 -fPIC -fopenmp \
124128
-fno-math-errno -fno-trapping-math \
125129
-ffp-contract=off -msse4.1
126130
```
127131

132+
**MSVC** (Visual Studio 2019+):
133+
134+
```
135+
/std:c++17 /O2 /openmp /fp:strict /arch:AVX2
136+
```
137+
138+
| Flag | Category | Purpose | Required? |
139+
|------|----------|---------|-----------|
140+
| `-std=c++17` | Language | C++17 standard (structured bindings, `if constexpr`, fold expressions) | **Yes** |
141+
| `-O3` | Optimization | Full optimization including auto-vectorization of math loops | **Yes** |
142+
| `-fPIC` | ABI | Position-independent code (needed for shared libraries / pybind11 modules) | Yes for `.so` |
143+
| `-fopenmp` | Parallelism | OpenMP `#pragma omp parallel for` used in reductions | Yes if using reductions |
144+
| `-fno-math-errno` | Vectorization | **The key flag.** Without it, GCC assumes `std::exp()`/`std::log()` etc. may set `errno`, which prevents SIMD vectorization. This flag alone enables SSE2/AVX2/AVX-512 auto-vectorization of math functions. | **Yes** |
145+
| `-fno-trapping-math` | Vectorization | Assume math ops don't trap (no SIGFPE). Further enables vectorization of edge cases. | **Yes** |
146+
| `-ffp-contract=off` | Determinism | Disable FMA contraction (a*b+c → fma(a,b,c)). Required for bit-exact reductions and pairwise_sum. Without this, `sum()` results diverge from numpy. | **Yes** |
147+
| `-msse4.1` | Intrinsics | Required for einsum SSE intrinsics: `_mm_hadd_pd`, `_mm_insert_epi32` | Yes for einsum |
148+
149+
**Optional flags for performance:**
150+
128151
| Flag | Purpose |
129152
|------|---------|
130-
| `-O3` | Full optimization + auto-vectorization for math loops |
131-
| `-fno-math-errno` | Tells GCC math functions don't set `errno`**the key flag** that enables SIMD vectorization of `std::exp()` etc. |
132-
| `-fno-trapping-math` | Assume math functions don't trap — further enables vectorization |
133-
| `-ffp-contract=off` | Disable FMA contraction to keep reductions deterministic |
134-
| `-msse4.1` | Required for einsum SSE intrinsics (`_mm_hadd_pd`, `_mm_insert_epi32`) |
135-
136-
> **Performance**: With `-fno-math-errno`, GCC auto-vectorizes `std::exp()` loops to SSE2 (2×), AVX2 (4×), or AVX-512 (8×) depending on `-march`. On AVX2, element-wise exp achieves **6× speedup** over the old scalar bridge path.
137-
138-
### Alignment status
139-
140-
The table below reflects the precision parity between `numpycpp` C++ and Python numpy.
141-
All 500 tests pass (≤1 ULP tolerance for transcendental functions, bit-exact for everything else).
142-
143-
✅ = bit-exact   ◐ = ≤1–3 ULP
144-
145-
| API group | float64 | float32 | Notes |
146-
|-------------------|:-------:|:-------:|-------|
147-
| Creation ||| zeros_like, ones_like, full_like, zeros, ones |
148-
| Astype ||| astype int/bool, truncate float32 |
149-
| Comparison ||| greater, less, equal, not_equal, etc. |
150-
| Logical ||| bool-only (and/or/not/xor) |
151-
| Special values ||| isnan, isinf, isfinite |
152-
| Manipulation ||| diff, stack, concatenate, transpose, slice, roll, flip, repeat, tile, where |
153-
| Sorting ||| argsort, argmax, argmin |
154-
| Setops / interp ||| isin, intersect1d, interp, safe_divide |
155-
| Access / convert ||| array_get, asarray, to_vector |
156-
| **Math — element-wise** (sqrt, abs, sign, clip, round, floor, ceil, degrees, radians) ||| Pure C++, bit-exact |
157-
| **Math — transcendental** (exp, log, sin, cos, tan, asin, acos, atan, log10, log2, exp2, cbrt, expm1, log1p) ||| `std::` ±1–3 ULP vs numpy SVML; auto-vectorized 6–9× faster |
158-
| **Math — power** ||| `std::pow` — bit-exact on non-AVX512, ±1 ULP on AVX512 |
159-
| **Math — hypot** ||| `std::hypot` — bit-exact |
160-
| **Math — atan2** ||| `std::atan2` — bit-exact on non-AVX512, ±1 ULP on AVX512 |
161-
| **Reduction** (sum, mean, max, min, any, all) ||| pairwise_sum, bit-exact (`-ffp-contract=off`) |
162-
| Statistical (std, var) ||| pairwise_sum + sqrt |
163-
| Binary (maximum, minimum) ||| `std::max`/`min`, deterministic |
164-
| **Dot product** ||| pairwise_sum(a*b) — bit-exact |
165-
| **Norm** ||| pairwise_sum of squares + sqrt |
166-
| **Norm (axis)** ||| Fiber-wise pairwise_sum + sqrt |
167-
| **Einsum** ||| All patterns (ij,ij→i, ij,jk→ik, bij,bjk→bik, etc.) |
168-
169-
> **Math precision**: Transcendental functions use `std::` (system libm), which differs from numpy's AVX‑512 SVML path by 1–3 ULP. On non-AVX‑512 hardware, numpy also uses libm, so results are bit-exact. The ±1–3 ULP difference does not affect softmax argmax or cross-entropy loss in practice.
153+
| `-march=native` | Auto-detect CPU features (AVX2, AVX-512) for wider SIMD. Without it, GCC defaults to SSE2 (2-wide float64). |
154+
| `-mavx2` | Explicit AVX2 (4-wide float64). Use if cross-compiling for AVX2 targets. |
155+
| `-mavx512f` | AVX-512 (8-wide float64). Only if your deployment hardware supports it. |
156+
| `-march=x86-64-v3` | x86-64 microarchitecture level v3: AVX2 + FMA + BMI. Good portable baseline for modern CPUs (~2013+). |
157+
158+
**⚠️ What NOT to use:**
159+
160+
| Flag | Why it's harmful |
161+
|------|-----------------|
162+
| `-ffast-math` | Enables `-ffinite-math-only` (breaks `isnan`/`isinf`/`isfinite`) and `-funsafe-math-optimizations` (allows FMA contraction, breaks reduction determinism). Replaces with the targeted flags above. |
163+
| `-fno-builtin-exp`, `-fno-builtin-log`, … (14 flags) | These were needed by the old SVML bridge. Now they **block** auto-vectorization. Never use them. |
164+
| `-ffloat-store` | Forces spills to memory after every FP op, killing performance. Was needed by the old bridge. Never use. |
165+
166+
**Performance impact** (GCC 11+ with `-march=native`):
167+
168+
| `-march` level | SIMD width (f64) | `exp()` throughput | vs scalar |
169+
|:---|---:|---:|---:|
170+
| SSE2 (default) || ~||
171+
| AVX2 (`x86-64-v3`) || ~4–6× | 2–3× |
172+
| AVX-512 (`x86-64-v4`) || ~8–9× | 4–5× |
173+
174+
### Precision: ULP Error vs numpy
175+
176+
The tables below show the **maximum ULP (Unit in the Last Place) difference** between `numpycpp` C++ output and Python numpy, measured over 100,000 random samples per function.
177+
178+
**Data source**: [`tests/ulp_precision.csv`](tests/ulp_precision.csv) — the canonical, machine-readable ULP measurement data. Regenerate with:
179+
180+
```bash
181+
cd tests && make csv # requires compiled numpycpp.so
182+
```
183+
184+
0 ULP = bit-exact (identical IEEE 754 bits). Non-zero = maximum observed ULP distance (system libm vs numpy SVML/libm).
185+
186+
> **Reading ULP values**: For float64, 1 ULP ≈ 2.22×10⁻¹⁶ × |value|. For float32, 1 ULP ≈ 1.19×10⁻⁷ × |value|. A 4-ULP difference at 1.0 means the two values differ by ~8.88×10⁻¹⁶ in float64, or ~4.77×10⁻⁷ in float32.
187+
188+
#### float64 (1 ULP = 2.22×10⁻¹⁶)
189+
190+
| Function | Max ULP | Category |
191+
|----------|:-------:|----------|
192+
| `exp` | 1 | transcendental |
193+
| `log` | 2 | transcendental |
194+
| `sin` | 3 | transcendental |
195+
| `cos` | 3 | transcendental |
196+
| `tan` | 3 | transcendental |
197+
| `cbrt` | 4 | transcendental |
198+
| `expm1` | 2 | transcendental |
199+
| `log1p` | 2 | transcendental |
200+
| `log10` | 3 | transcendental |
201+
| `log2` | 2 | transcendental |
202+
| `asin` (arcsin) | 2 | transcendental |
203+
| `acos` (arccos) | 2 | transcendental |
204+
| `atan` (arctan) | 2 | transcendental |
205+
| `pow` | 0 | binary |
206+
| `atan2` | 0 | binary |
207+
| `hypot` | 0 | binary |
208+
| `sqrt` | 0 | element-wise |
209+
| `abs` | 0 | element-wise |
210+
| `sign` | 0 | element-wise |
211+
| `round` | 0 | element-wise |
212+
| `floor` | 0 | element-wise |
213+
| `ceil` | 0 | element-wise |
214+
| `degrees` | 0 | element-wise |
215+
| `radians` | 0 | element-wise |
216+
217+
#### float32 (1 ULP = 1.19×10⁻⁷)
218+
219+
| Function | Max ULP | Category |
220+
|----------|:-------:|----------|
221+
| `exp` | 2 | transcendental |
222+
| `log` | 3 | transcendental |
223+
| `sin` | 1 | transcendental |
224+
| `cos` | 1 | transcendental |
225+
| `tan` | 3 | transcendental |
226+
| `cbrt` | 2 | transcendental |
227+
| `expm1` | 2 | transcendental |
228+
| `log1p` | 2 | transcendental |
229+
| `log10` | 4 | transcendental |
230+
| `log2` | 2 | transcendental |
231+
| `asin` (arcsin) | 3 | transcendental |
232+
| `acos` (arccos) | 2 | transcendental |
233+
| `atan` (arctan) | 2 | transcendental |
234+
| `pow` | 0 | binary |
235+
| `atan2` | 0 | binary |
236+
| `hypot` | 0 | binary |
237+
| `sqrt` | 0 | element-wise |
238+
| `abs` | 0 | element-wise |
239+
| `sign` | 0 | element-wise |
240+
| `round` | 0 | element-wise |
241+
| `floor` | 0 | element-wise |
242+
| `ceil` | 0 | element-wise |
243+
| `degrees` | 0 | element-wise |
244+
| `radians` | 0 | element-wise |
245+
246+
**Non-math APIs — all bit-exact (0 ULP)** for both float64 and float32:
247+
248+
| API group | Functions |
249+
|-----------|-----------|
250+
| Creation | `zeros`, `ones`, `full`, `zeros_like`, `ones_like`, `full_like` |
251+
| Astype | `astype` (int/bool/float32/float64/int64), `truncate_to_float32` |
252+
| Comparison | `greater`, `less`, `greater_equal`, `less_equal`, `equal`, `not_equal` |
253+
| Logical | `logical_and`, `logical_or`, `logical_not`, `logical_xor` |
254+
| Special values | `isnan`, `isinf`, `isfinite` |
255+
| Manipulation | `diff`, `stack`, `concatenate`, `vstack`, `hstack`, `where`, `transpose`, `flatten`, `slice`, `take_cols`, `slice_assign`, `roll`, `flip`, `repeat`, `tile`, `squeeze` |
256+
| Sorting | `argsort`, `argmax`, `argmin` |
257+
| Setops / interp | `isin`, `intersect1d`, `interp`, `flatnonzero`, `unwrap`, `cumsum`, `safe_divide` |
258+
| Access / convert | `array_get`, `asarray`, `to_vector` |
259+
| Binary | `maximum`, `minimum`, `clip` |
260+
| Reductions | `sum`, `mean` (axis=0/1/-1), `max`, `min`, `any`, `all`, `std`, `var` — pairwise_sum, deterministic with `-ffp-contract=off` |
261+
| Linalg | `norm` (1d, 2d, axis), `dot` — pairwise_sum of squares or products |
262+
| Einsum | All patterns: `ij,ij→i`, `ij,jk→ik`, `bij,bjk→bik`, `aij,aij→ai`, `ijk,mkl→mjl`, `nij,nmj→nmi`, `aij,jka→aik`, implicit `ij,ij`, `ij,jk` |
263+
264+
> **Why not bit-exact for transcendentals?** numpy dispatches to Intel SVML (`__svml_exp8`, etc.) on AVX-512 hardware via its `_multiarray_umath.so`. `numpycpp` uses `std::` (system libm). These are two different implementations of the same mathematical functions — both IEEE 754 compliant, but differing by 1–4 ULP. On non-AVX-512 hardware, numpy also uses libm, so results are bit-exact.
170265
>
171-
> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). `-ffp-contract=off` ensures bit-exact results.
266+
> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). `-ffp-contract=off` ensures bit-exact results regardless of hardware.
172267
173268
## Project Structure
174269

tests/Makefile

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ LDFLAGS = -shared
1111
MODULE = numpycpp.so
1212
SRC = module.cpp
1313

14-
.PHONY: all clean test
14+
.PHONY: all clean test csv
1515

1616
all: $(MODULE)
1717

@@ -21,5 +21,8 @@ $(MODULE): $(SRC)
2121
test: $(MODULE)
2222
@cd .. && PYTHONPATH=tests:$$PYTHONPATH python3 -m pytest tests/test_all.py -q --tb=short --no-header
2323

24+
csv: $(MODULE)
25+
@cd .. && PYTHONPATH=tests:$$PYTHONPATH python3 tests/make_csv.py
26+
2427
clean:
2528
rm -f $(MODULE)

tests/make_csv.py

Lines changed: 161 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,161 @@
1+
#!/usr/bin/env python3
2+
"""Generate tests/ulp_precision.csv — ULP differences: numpycpp vs numpy.
3+
4+
Usage:
5+
make csv # from tests/ directory
6+
python3 tests/make_csv.py # from repo root
7+
"""
8+
9+
import os, sys, struct, csv
10+
import numpy as np
11+
import importlib
12+
13+
# Ensure the tests directory is on sys.path so we can import the C++ module
14+
_here = os.path.dirname(os.path.abspath(__file__))
15+
if _here not in sys.path:
16+
sys.path.insert(0, _here)
17+
cpp = importlib.import_module("numpycpp")
18+
19+
20+
def ulp_f64(a: float, b: float) -> int:
21+
"""Signed ULP distance between two float64 values."""
22+
if a == b:
23+
return 0
24+
if np.isnan(a) or np.isnan(b):
25+
return 2**63 # sentinel
26+
pa = struct.unpack("q", struct.pack("d", float(a)))[0]
27+
pb = struct.unpack("q", struct.pack("d", float(b)))[0]
28+
if pa < 0: pa = (-pa) ^ 0x7FFFFFFFFFFFFFFF
29+
if pb < 0: pb = (-pb) ^ 0x7FFFFFFFFFFFFFFF
30+
return abs(pa - pb)
31+
32+
33+
def ulp_f32(a: float, b: float) -> int:
34+
"""Signed ULP distance between two float32 values."""
35+
fa, fb = np.float32(a), np.float32(b)
36+
if fa == fb:
37+
return 0
38+
if np.isnan(fa) or np.isnan(fb):
39+
return 2**31 # sentinel
40+
pa = struct.unpack("i", struct.pack("f", float(fa)))[0]
41+
pb = struct.unpack("i", struct.pack("f", float(fb)))[0]
42+
if pa < 0: pa = (-pa) ^ 0x7FFFFFFF
43+
if pb < 0: pb = (-pb) ^ 0x7FFFFFFF
44+
return abs(pa - pb)
45+
46+
47+
def measure_unary(cpp_fn, np_fn, prep, dt, ulf, rng, n=100_000):
48+
a = rng.randn(n).astype(dt)
49+
a = prep(a)
50+
cr = np.asarray(getattr(cpp, cpp_fn)(a))
51+
pr = np_fn(a)
52+
max_u, n_diff = 0, 0
53+
for i in range(cr.size):
54+
if cr.flat[i] != pr.flat[i]:
55+
u = ulf(cr.flat[i], pr.flat[i])
56+
if u > max_u:
57+
max_u = u
58+
n_diff += 1
59+
return max_u, n_diff
60+
61+
62+
def main():
63+
rng = np.random.RandomState(42)
64+
N = 100_000
65+
ULP_F64 = f"{2**-52:.2e}"
66+
ULP_F32 = f"{2**-23:.2e}"
67+
68+
header = [
69+
"function", "dtype", "max_ulp", "n_diff", "total",
70+
"category", "ulp_value_f64", "ulp_value_f32",
71+
]
72+
rows = []
73+
74+
# --- Transcendental unary ---
75+
TRANS = [
76+
("exp", np.exp, lambda a: a),
77+
("log", np.log, lambda a: np.abs(a) + 0.1),
78+
("sin", np.sin, lambda a: a),
79+
("cos", np.cos, lambda a: a),
80+
("tan", np.tan, lambda a: a * 0.5),
81+
("cbrt", np.cbrt, lambda a: a),
82+
("expm1", np.expm1, lambda a: a * 2.0),
83+
("log1p", np.log1p, lambda a: np.abs(a) + 0.1),
84+
("log10", np.log10, lambda a: np.abs(a) + 0.1),
85+
("log2", np.log2, lambda a: np.abs(a) + 0.1),
86+
("arcsin", np.arcsin, lambda a: np.clip(a * 0.5, -1, 1)),
87+
("arccos", np.arccos, lambda a: np.clip(a * 0.5, -1, 1)),
88+
("arctan", np.arctan, lambda a: a),
89+
]
90+
91+
for cfn, nfn, prep in TRANS:
92+
for dt, name, ulf in [
93+
(np.float64, "float64", ulp_f64),
94+
(np.float32, "float32", ulp_f32),
95+
]:
96+
mu, nd = measure_unary(cfn, nfn, prep, dt, ulf, rng, N)
97+
rows.append([cfn, name, mu, nd, N, "transcendental", ULP_F64, ULP_F32])
98+
99+
# --- Element-wise (should be bit-exact) ---
100+
ELEM = [
101+
("sqrt", np.sqrt, lambda a: np.abs(a)),
102+
("abs", np.abs, lambda a: a),
103+
("sign", np.sign, lambda a: a),
104+
("round", np.round, lambda a: a * 100),
105+
("floor", np.floor, lambda a: a * 100),
106+
("ceil", np.ceil, lambda a: a * 100),
107+
("degrees", np.degrees, lambda a: a),
108+
("radians", np.radians, lambda a: a),
109+
]
110+
111+
for cfn, nfn, prep in ELEM:
112+
for dt, name, ulf in [
113+
(np.float64, "float64", ulp_f64),
114+
(np.float32, "float32", ulp_f32),
115+
]:
116+
mu, nd = measure_unary(cfn, nfn, prep, dt, ulf, rng, N)
117+
rows.append([cfn, name, mu, nd, N, "element-wise", ULP_F64, ULP_F32])
118+
119+
# --- Binary ---
120+
BIN = [
121+
("power", np.power, "scalar exponent 2.0"),
122+
("arctan2", np.arctan2, "scalar 1.0 denominator"),
123+
("hypot", np.hypot, "two arrays"),
124+
]
125+
126+
for cfn, nfn, _desc in BIN:
127+
for dt, name, ulf in [
128+
(np.float64, "float64", ulp_f64),
129+
(np.float32, "float32", ulp_f32),
130+
]:
131+
a = rng.randn(N).astype(dt)
132+
if cfn == "hypot":
133+
b = np.abs(rng.randn(N).astype(dt)) + dt(0.1)
134+
elif cfn == "power":
135+
b = dt(2.0)
136+
a = np.abs(a) + dt(0.01) # keep positive for fractional exponent
137+
else:
138+
b = dt(1.0)
139+
140+
cr = np.asarray(getattr(cpp, cfn)(a, b))
141+
pr = nfn(a, b)
142+
max_u, n_diff = 0, 0
143+
for i in range(cr.size):
144+
if cr.flat[i] != pr.flat[i]:
145+
u = ulf(cr.flat[i], pr.flat[i])
146+
if u > max_u:
147+
max_u = u
148+
n_diff += 1
149+
rows.append([cfn, name, max_u, n_diff, N, "binary", ULP_F64, ULP_F32])
150+
151+
csv_path = os.path.join(_here, "ulp_precision.csv")
152+
with open(csv_path, "w", newline="") as f:
153+
w = csv.writer(f)
154+
w.writerow(header)
155+
w.writerows(rows)
156+
157+
print(f"Wrote {len(rows)} rows to {csv_path}")
158+
159+
160+
if __name__ == "__main__":
161+
main()

0 commit comments

Comments
 (0)