array2d
diff --git a/‎README.md‎
Lines changed: 137 additions & 42 deletions b/‎README.md‎
Lines changed: 137 additions & 42 deletions
diff --git a/‎tests/Makefile‎
Lines changed: 4 additions & 1 deletion b/‎tests/Makefile‎
Lines changed: 4 additions & 1 deletion
diff --git a/‎tests/make_csv.py‎
Lines changed: 161 additions & 0 deletions b/‎tests/make_csv.py‎
Lines changed: 161 additions & 0 deletions
@@ -117,58 +117,153 @@ To run with verbose output:
 PYTHONPATH=tests:$PYTHONPATH python3 -m pytest tests/test_all.py -v
 ```
 
-### Compiler flags
+### Compiler Flag Guide
+
+The flags below are the **minimum correct set** to compile `numpycpp` with full auto-vectorization and deterministic reductions. Do NOT use `-ffast-math` — it enables `-ffinite-math-only` which breaks `isnan`/`isinf`/`isfinite`, and enables FMA contraction which breaks bit-exact reductions.
+
+**Required flags** (GCC / Clang):
 
 ```makefile
 CXXFLAGS ?= -std=c++17 -O3 -fPIC -fopenmp                \
             -fno-math-errno -fno-trapping-math            \
             -ffp-contract=off -msse4.1
 ```
 
+**MSVC** (Visual Studio 2019+):
+
+```
+/std:c++17 /O2 /openmp /fp:strict /arch:AVX2
+```
+
+| Flag | Category | Purpose | Required? |
+|------|----------|---------|-----------|
+| `-std=c++17` | Language | C++17 standard (structured bindings, `if constexpr`, fold expressions) | **Yes** |
+| `-O3` | Optimization | Full optimization including auto-vectorization of math loops | **Yes** |
+| `-fPIC` | ABI | Position-independent code (needed for shared libraries / pybind11 modules) | Yes for `.so` |
+| `-fopenmp` | Parallelism | OpenMP `#pragma omp parallel for` used in reductions | Yes if using reductions |
+| `-fno-math-errno` | Vectorization | **The key flag.** Without it, GCC assumes `std::exp()`/`std::log()` etc. may set `errno`, which prevents SIMD vectorization. This flag alone enables SSE2/AVX2/AVX-512 auto-vectorization of math functions. | **Yes** |
+| `-fno-trapping-math` | Vectorization | Assume math ops don't trap (no SIGFPE). Further enables vectorization of edge cases. | **Yes** |
+| `-ffp-contract=off` | Determinism | Disable FMA contraction (a*b+c → fma(a,b,c)). Required for bit-exact reductions and pairwise_sum. Without this, `sum()` results diverge from numpy. | **Yes** |
+| `-msse4.1` | Intrinsics | Required for einsum SSE intrinsics: `_mm_hadd_pd`, `_mm_insert_epi32` | Yes for einsum |
+
+**Optional flags for performance:**
+
 | Flag | Purpose |
 |------|---------|
-| `-O3` | Full optimization + auto-vectorization for math loops |
-| `-fno-math-errno` | Tells GCC math functions don't set `errno` — **the key flag** that enables SIMD vectorization of `std::exp()` etc. |
-| `-fno-trapping-math` | Assume math functions don't trap — further enables vectorization |
-| `-ffp-contract=off` | Disable FMA contraction to keep reductions deterministic |
-| `-msse4.1` | Required for einsum SSE intrinsics (`_mm_hadd_pd`, `_mm_insert_epi32`) |
-
-> **Performance**: With `-fno-math-errno`, GCC auto-vectorizes `std::exp()` loops to SSE2 (2×), AVX2 (4×), or AVX-512 (8×) depending on `-march`. On AVX2, element-wise exp achieves **6× speedup** over the old scalar bridge path.
-
-### Alignment status
-
-The table below reflects the precision parity between `numpycpp` C++ and Python numpy.
-All 500 tests pass (≤1 ULP tolerance for transcendental functions, bit-exact for everything else).
-
-✅ = bit-exact &nbsp; ◐ = ≤1–3 ULP
-
-| API group         | float64 | float32 | Notes |
-|-------------------|:-------:|:-------:|-------|
-| Creation          | ✅ | ✅ | zeros_like, ones_like, full_like, zeros, ones |
-| Astype            | ✅ | ✅ | astype int/bool, truncate float32 |
-| Comparison        | ✅ | ✅ | greater, less, equal, not_equal, etc. |
-| Logical           | ✅ | ✅ | bool-only (and/or/not/xor) |
-| Special values    | ✅ | ✅ | isnan, isinf, isfinite |
-| Manipulation      | ✅ | ✅ | diff, stack, concatenate, transpose, slice, roll, flip, repeat, tile, where |
-| Sorting           | ✅ | ✅ | argsort, argmax, argmin |
-| Setops / interp   | ✅ | ✅ | isin, intersect1d, interp, safe_divide |
-| Access / convert  | ✅ | ✅ | array_get, asarray, to_vector |
-| **Math — element-wise** (sqrt, abs, sign, clip, round, floor, ceil, degrees, radians) | ✅ | ✅ | Pure C++, bit-exact |
-| **Math — transcendental** (exp, log, sin, cos, tan, asin, acos, atan, log10, log2, exp2, cbrt, expm1, log1p) | ◐ | ◐ | `std::` ±1–3 ULP vs numpy SVML; auto-vectorized 6–9× faster |
-| **Math — power**   | ✅ | ✅ | `std::pow` — bit-exact on non-AVX512, ±1 ULP on AVX512 |
-| **Math — hypot**   | ✅ | ✅ | `std::hypot` — bit-exact |
-| **Math — atan2**   | ✅ | ✅ | `std::atan2` — bit-exact on non-AVX512, ±1 ULP on AVX512 |
-| **Reduction** (sum, mean, max, min, any, all) | ✅ | ✅ | pairwise_sum, bit-exact (`-ffp-contract=off`) |
-| Statistical (std, var) | ✅ | ✅ | pairwise_sum + sqrt |
-| Binary (maximum, minimum) | ✅ | ✅ | `std::max`/`min`, deterministic |
-| **Dot product**    | ✅ | ✅ | pairwise_sum(a*b) — bit-exact |
-| **Norm**           | ✅ | ✅ | pairwise_sum of squares + sqrt |
-| **Norm (axis)**    | ✅ | ✅ | Fiber-wise pairwise_sum + sqrt |
-| **Einsum**         | ✅ | ✅ | All patterns (ij,ij→i, ij,jk→ik, bij,bjk→bik, etc.) |
-
-> **Math precision**: Transcendental functions use `std::` (system libm), which differs from numpy's AVX‑512 SVML path by 1–3 ULP. On non-AVX‑512 hardware, numpy also uses libm, so results are bit-exact. The ±1–3 ULP difference does not affect softmax argmax or cross-entropy loss in practice.
+| `-march=native` | Auto-detect CPU features (AVX2, AVX-512) for wider SIMD. Without it, GCC defaults to SSE2 (2-wide float64). |
+| `-mavx2` | Explicit AVX2 (4-wide float64). Use if cross-compiling for AVX2 targets. |
+| `-mavx512f` | AVX-512 (8-wide float64). Only if your deployment hardware supports it. |
+| `-march=x86-64-v3` | x86-64 microarchitecture level v3: AVX2 + FMA + BMI. Good portable baseline for modern CPUs (~2013+). |
+
+**⚠️ What NOT to use:**
+
+| Flag | Why it's harmful |
+|------|-----------------|
+| `-ffast-math` | Enables `-ffinite-math-only` (breaks `isnan`/`isinf`/`isfinite`) and `-funsafe-math-optimizations` (allows FMA contraction, breaks reduction determinism). Replaces with the targeted flags above. |
+| `-fno-builtin-exp`, `-fno-builtin-log`, … (14 flags) | These were needed by the old SVML bridge. Now they **block** auto-vectorization. Never use them. |
+| `-ffloat-store` | Forces spills to memory after every FP op, killing performance. Was needed by the old bridge. Never use. |
+
+**Performance impact** (GCC 11+ with `-march=native`):
+
+| `-march` level | SIMD width (f64) | `exp()` throughput | vs scalar |
+|:---|---:|---:|---:|
+| SSE2 (default) | 2× | ~2× | 1× |
+| AVX2 (`x86-64-v3`) | 4× | ~4–6× | 2–3× |
+| AVX-512 (`x86-64-v4`) | 8× | ~8–9× | 4–5× |
+
+### Precision: ULP Error vs numpy
+
+The tables below show the **maximum ULP (Unit in the Last Place) difference** between `numpycpp` C++ output and Python numpy, measured over 100,000 random samples per function.
+
+**Data source**: [`tests/ulp_precision.csv`](tests/ulp_precision.csv) — the canonical, machine-readable ULP measurement data. Regenerate with:
+
+```bash
+cd tests && make csv  # requires compiled numpycpp.so
+```
+
+0 ULP = bit-exact (identical IEEE 754 bits). Non-zero = maximum observed ULP distance (system libm vs numpy SVML/libm).
+
+> **Reading ULP values**: For float64, 1 ULP ≈ 2.22×10⁻¹⁶ × |value|. For float32, 1 ULP ≈ 1.19×10⁻⁷ × |value|. A 4-ULP difference at 1.0 means the two values differ by ~8.88×10⁻¹⁶ in float64, or ~4.77×10⁻⁷ in float32.
+
+#### float64 (1 ULP = 2.22×10⁻¹⁶)
+
+| Function | Max ULP | Category |
+|----------|:-------:|----------|
+| `exp` | 1 | transcendental |
+| `log` | 2 | transcendental |
+| `sin` | 3 | transcendental |
+| `cos` | 3 | transcendental |
+| `tan` | 3 | transcendental |
+| `cbrt` | 4 | transcendental |
+| `expm1` | 2 | transcendental |
+| `log1p` | 2 | transcendental |
+| `log10` | 3 | transcendental |
+| `log2` | 2 | transcendental |
+| `asin` (arcsin) | 2 | transcendental |
+| `acos` (arccos) | 2 | transcendental |
+| `atan` (arctan) | 2 | transcendental |
+| `pow` | 0 | binary |
+| `atan2` | 0 | binary |
+| `hypot` | 0 | binary |
+| `sqrt` | 0 | element-wise |
+| `abs` | 0 | element-wise |
+| `sign` | 0 | element-wise |
+| `round` | 0 | element-wise |
+| `floor` | 0 | element-wise |
+| `ceil` | 0 | element-wise |
+| `degrees` | 0 | element-wise |
+| `radians` | 0 | element-wise |
+
+#### float32 (1 ULP = 1.19×10⁻⁷)
+
+| Function | Max ULP | Category |
+|----------|:-------:|----------|
+| `exp` | 2 | transcendental |
+| `log` | 3 | transcendental |
+| `sin` | 1 | transcendental |
+| `cos` | 1 | transcendental |
+| `tan` | 3 | transcendental |
+| `cbrt` | 2 | transcendental |
+| `expm1` | 2 | transcendental |
+| `log1p` | 2 | transcendental |
+| `log10` | 4 | transcendental |
+| `log2` | 2 | transcendental |
+| `asin` (arcsin) | 3 | transcendental |
+| `acos` (arccos) | 2 | transcendental |
+| `atan` (arctan) | 2 | transcendental |
+| `pow` | 0 | binary |
+| `atan2` | 0 | binary |
+| `hypot` | 0 | binary |
+| `sqrt` | 0 | element-wise |
+| `abs` | 0 | element-wise |
+| `sign` | 0 | element-wise |
+| `round` | 0 | element-wise |
+| `floor` | 0 | element-wise |
+| `ceil` | 0 | element-wise |
+| `degrees` | 0 | element-wise |
+| `radians` | 0 | element-wise |
+
+**Non-math APIs — all bit-exact (0 ULP)** for both float64 and float32:
+
+| API group | Functions |
+|-----------|-----------|
+| Creation | `zeros`, `ones`, `full`, `zeros_like`, `ones_like`, `full_like` |
+| Astype | `astype` (int/bool/float32/float64/int64), `truncate_to_float32` |
+| Comparison | `greater`, `less`, `greater_equal`, `less_equal`, `equal`, `not_equal` |
+| Logical | `logical_and`, `logical_or`, `logical_not`, `logical_xor` |
+| Special values | `isnan`, `isinf`, `isfinite` |
+| Manipulation | `diff`, `stack`, `concatenate`, `vstack`, `hstack`, `where`, `transpose`, `flatten`, `slice`, `take_cols`, `slice_assign`, `roll`, `flip`, `repeat`, `tile`, `squeeze` |
+| Sorting | `argsort`, `argmax`, `argmin` |
+| Setops / interp | `isin`, `intersect1d`, `interp`, `flatnonzero`, `unwrap`, `cumsum`, `safe_divide` |
+| Access / convert | `array_get`, `asarray`, `to_vector` |
+| Binary | `maximum`, `minimum`, `clip` |
+| Reductions | `sum`, `mean` (axis=0/1/-1), `max`, `min`, `any`, `all`, `std`, `var` — pairwise_sum, deterministic with `-ffp-contract=off` |
+| Linalg | `norm` (1d, 2d, axis), `dot` — pairwise_sum of squares or products |
+| Einsum | All patterns: `ij,ij→i`, `ij,jk→ik`, `bij,bjk→bik`, `aij,aij→ai`, `ijk,mkl→mjl`, `nij,nmj→nmi`, `aij,jka→aik`, implicit `ij,ij`, `ij,jk` |
+
+> **Why not bit-exact for transcendentals?** numpy dispatches to Intel SVML (`__svml_exp8`, etc.) on AVX-512 hardware via its `_multiarray_umath.so`. `numpycpp` uses `std::` (system libm). These are two different implementations of the same mathematical functions — both IEEE 754 compliant, but differing by 1–4 ULP. On non-AVX-512 hardware, numpy also uses libm, so results are bit-exact.
 >
-> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). `-ffp-contract=off` ensures bit-exact results.
+> **Reductions**: All reductions use numpy's pairwise summation algorithm (recursive split, 8-accumulator unrolled). `-ffp-contract=off` ensures bit-exact results regardless of hardware.
 
 ## Project Structure
 
 
@@ -11,7 +11,7 @@ LDFLAGS   = -shared
 MODULE    = numpycpp.so
 SRC       = module.cpp
 
-.PHONY: all clean test
+.PHONY: all clean test csv
 
 all: $(MODULE)
 
@@ -21,5 +21,8 @@ $(MODULE): $(SRC)
 test: $(MODULE)
 	@cd .. && PYTHONPATH=tests:$$PYTHONPATH python3 -m pytest tests/test_all.py -q --tb=short --no-header
 
+csv: $(MODULE)
+	@cd .. && PYTHONPATH=tests:$$PYTHONPATH python3 tests/make_csv.py
+
 clean:
 	rm -f $(MODULE)
@@ -0,0 +1,161 @@
+#!/usr/bin/env python3
+"""Generate tests/ulp_precision.csv — ULP differences: numpycpp vs numpy.
+
+Usage:
+    make csv          # from tests/ directory
+    python3 tests/make_csv.py   # from repo root
+"""
+
+import os, sys, struct, csv
+import numpy as np
+import importlib
+
+# Ensure the tests directory is on sys.path so we can import the C++ module
+_here = os.path.dirname(os.path.abspath(__file__))
+if _here not in sys.path:
+    sys.path.insert(0, _here)
+cpp = importlib.import_module("numpycpp")
+
+
+def ulp_f64(a: float, b: float) -> int:
+    """Signed ULP distance between two float64 values."""
+    if a == b:
+        return 0
+    if np.isnan(a) or np.isnan(b):
+        return 2**63  # sentinel
+    pa = struct.unpack("q", struct.pack("d", float(a)))[0]
+    pb = struct.unpack("q", struct.pack("d", float(b)))[0]
+    if pa < 0: pa = (-pa) ^ 0x7FFFFFFFFFFFFFFF
+    if pb < 0: pb = (-pb) ^ 0x7FFFFFFFFFFFFFFF
+    return abs(pa - pb)
+
+
+def ulp_f32(a: float, b: float) -> int:
+    """Signed ULP distance between two float32 values."""
+    fa, fb = np.float32(a), np.float32(b)
+    if fa == fb:
+        return 0
+    if np.isnan(fa) or np.isnan(fb):
+        return 2**31  # sentinel
+    pa = struct.unpack("i", struct.pack("f", float(fa)))[0]
+    pb = struct.unpack("i", struct.pack("f", float(fb)))[0]
+    if pa < 0: pa = (-pa) ^ 0x7FFFFFFF
+    if pb < 0: pb = (-pb) ^ 0x7FFFFFFF
+    return abs(pa - pb)
+
+
+def measure_unary(cpp_fn, np_fn, prep, dt, ulf, rng, n=100_000):
+    a = rng.randn(n).astype(dt)
+    a = prep(a)
+    cr = np.asarray(getattr(cpp, cpp_fn)(a))
+    pr = np_fn(a)
+    max_u, n_diff = 0, 0
+    for i in range(cr.size):
+        if cr.flat[i] != pr.flat[i]:
+            u = ulf(cr.flat[i], pr.flat[i])
+            if u > max_u:
+                max_u = u
+            n_diff += 1
+    return max_u, n_diff
+
+
+def main():
+    rng = np.random.RandomState(42)
+    N = 100_000
+    ULP_F64 = f"{2**-52:.2e}"
+    ULP_F32 = f"{2**-23:.2e}"
+
+    header = [
+        "function", "dtype", "max_ulp", "n_diff", "total",
+        "category", "ulp_value_f64", "ulp_value_f32",
+    ]
+    rows = []
+
+    # --- Transcendental unary ---
+    TRANS = [
+        ("exp",    np.exp,    lambda a: a),
+        ("log",    np.log,    lambda a: np.abs(a) + 0.1),
+        ("sin",    np.sin,    lambda a: a),
+        ("cos",    np.cos,    lambda a: a),
+        ("tan",    np.tan,    lambda a: a * 0.5),
+        ("cbrt",   np.cbrt,   lambda a: a),
+        ("expm1",  np.expm1,  lambda a: a * 2.0),
+        ("log1p",  np.log1p,  lambda a: np.abs(a) + 0.1),
+        ("log10",  np.log10,  lambda a: np.abs(a) + 0.1),
+        ("log2",   np.log2,   lambda a: np.abs(a) + 0.1),
+        ("arcsin", np.arcsin, lambda a: np.clip(a * 0.5, -1, 1)),
+        ("arccos", np.arccos, lambda a: np.clip(a * 0.5, -1, 1)),
+        ("arctan", np.arctan, lambda a: a),
+    ]
+
+    for cfn, nfn, prep in TRANS:
+        for dt, name, ulf in [
+            (np.float64, "float64", ulp_f64),
+            (np.float32, "float32", ulp_f32),
+        ]:
+            mu, nd = measure_unary(cfn, nfn, prep, dt, ulf, rng, N)
+            rows.append([cfn, name, mu, nd, N, "transcendental", ULP_F64, ULP_F32])
+
+    # --- Element-wise (should be bit-exact) ---
+    ELEM = [
+        ("sqrt",    np.sqrt,    lambda a: np.abs(a)),
+        ("abs",     np.abs,     lambda a: a),
+        ("sign",    np.sign,    lambda a: a),
+        ("round",   np.round,   lambda a: a * 100),
+        ("floor",   np.floor,   lambda a: a * 100),
+        ("ceil",    np.ceil,    lambda a: a * 100),
+        ("degrees", np.degrees, lambda a: a),
+        ("radians", np.radians, lambda a: a),
+    ]
+
+    for cfn, nfn, prep in ELEM:
+        for dt, name, ulf in [
+            (np.float64, "float64", ulp_f64),
+            (np.float32, "float32", ulp_f32),
+        ]:
+            mu, nd = measure_unary(cfn, nfn, prep, dt, ulf, rng, N)
+            rows.append([cfn, name, mu, nd, N, "element-wise", ULP_F64, ULP_F32])
+
+    # --- Binary ---
+    BIN = [
+        ("power",   np.power,   "scalar exponent 2.0"),
+        ("arctan2", np.arctan2,  "scalar 1.0 denominator"),
+        ("hypot",   np.hypot,    "two arrays"),
+    ]
+
+    for cfn, nfn, _desc in BIN:
+        for dt, name, ulf in [
+            (np.float64, "float64", ulp_f64),
+            (np.float32, "float32", ulp_f32),
+        ]:
+            a = rng.randn(N).astype(dt)
+            if cfn == "hypot":
+                b = np.abs(rng.randn(N).astype(dt)) + dt(0.1)
+            elif cfn == "power":
+                b = dt(2.0)
+                a = np.abs(a) + dt(0.01)  # keep positive for fractional exponent
+            else:
+                b = dt(1.0)
+
+            cr = np.asarray(getattr(cpp, cfn)(a, b))
+            pr = nfn(a, b)
+            max_u, n_diff = 0, 0
+            for i in range(cr.size):
+                if cr.flat[i] != pr.flat[i]:
+                    u = ulf(cr.flat[i], pr.flat[i])
+                    if u > max_u:
+                        max_u = u
+                    n_diff += 1
+            rows.append([cfn, name, max_u, n_diff, N, "binary", ULP_F64, ULP_F32])
+
+    csv_path = os.path.join(_here, "ulp_precision.csv")
+    with open(csv_path, "w", newline="") as f:
+        w = csv.writer(f)
+        w.writerow(header)
+        w.writerows(rows)
+
+    print(f"Wrote {len(rows)} rows to {csv_path}")
+
+
+if __name__ == "__main__":
+    main()