numpycpp

Background

NumPy is fast, but its performance ceiling is set by Python.

We created numpycpp to keep NumPy's familiar usage patterns while letting C++ push past that ceiling and accelerate your code further.

Overview

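numpycpp is a header-only C++ library that implements numpy's core API (numpy.*, numpy.linalg.*, numpy.einsum) with bit-level precision alignment. Functions take raw pointers plus sizes. The std backend has no external dependencies beyond the C++17 standard library; the default bitexact backend additionally needs libdl and loads numpy's shared library at runtime.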

All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (981 tests, float64 + float32, including NaN passthrough, signed-zero, ±∞, domain-error cases, and advanced indexing).

Bit-exact math is achieved by resolving numpy's own math functions from _multiarray_umath.so at runtime. The SVML bridge auto-detects your CPU and selects the same path numpy uses: AVX‑512 SVML (__svml_exp8) when available, or scalar npy_exp/npy_log/etc. otherwise. AVX‑512 intrinsics are isolated behind __attribute__((target)) — the binary is safe on any x86_64 CPU (no SIGILL). Every transcendental function produces the exact same IEEE 754 bits as numpy on all architectures.

Quick Start

Dependencies

C++17 compiler (GCC >= 9, Clang >= 7, MSVC >= 2019)

Usage

Public headers — include the umbrella or individual modules:

#include <numpycpp/numpy.h>          // ← single entry point (recommended)

// or include only what you need:
#include <numpycpp/init.h>           // zeros_like, ones_like, full, arange, …
#include <numpycpp/elementwise.h>    // sqrt, exp, sin, astype, …
#include <numpycpp/reduce.h>         // sum, mean, std, var, cumsum, …
#include <numpycpp/manipulation.h>   // transpose, take, slice, putmask, …
#include <numpycpp/io.h>             // isin, interp, unwrap, …
#include <numpycpp/linalg.h>         // dot, norm, matmul, einsum

numpycpp/detail/ headers are internal — automatically pulled in by the public headers. Do not include them directly.

pybind11 users — include <numpycpp/numpy_py.h> instead to get the full set of pybind11 wrapper functions (numpy::sum(py::array_t<T>) etc.).

std::vector<double> data = {1.0, 4.0, 9.0};
std::vector<double> result(data.size());

numpy::sqrt(data.data(), result.data(), data.size());
// result → {1.0, 2.0, 3.0}

double s = numpy::sum(data.data(), data.size());
// s → 14.0

Install

Ubuntu (DEB)

Download the latest .deb release or build from source:

mkdir build && cd build
cmake ..
make deb
sudo dpkg -i numpycpp-dev-*.deb

Headers are installed to /usr/include/numpycpp/ along with a CMake config that supports both backends.

CMake — bitexact backend (default)

find_package(numpycpp REQUIRED)
target_link_libraries(myapp PRIVATE numpycpp::numpycpp)
# cmake propagates -ldl automatically — no extra flags needed

CMake — std backend

set(NUMPYCPP_STD_ONLY ON)           # set BEFORE find_package
find_package(numpycpp REQUIRED)
target_link_libraries(myapp PRIVATE numpycpp::numpycpp)
# cmake propagates -DNUMPYCPP_STD_ONLY automatically — no extra flags needed

pybind11_add_module users

With certain CMake / pybind11 version combinations, pybind11_add_module may lose IMPORTED targets during generation. If you hit this, use the variables-based fallback:

set(NUMPYCPP_STD_ONLY OFF)          # or ON for std backend
find_package(numpycpp REQUIRED)
pybind11_add_module(mymodule module.cpp)
target_include_directories(mymodule PRIVATE ${numpycpp_INCLUDE_DIRS})
target_compile_features(mymodule PRIVATE cxx_std_17)
# bitexact: add manually → target_link_libraries(mymodule PRIVATE dl)
# std:      add manually → target_compile_definitions(mymodule PRIVATE NUMPYCPP_STD_ONLY)

Manual (header-only)

Add -Ipath/to/numpycpp to your compiler flags and include the headers directly. No build step, no copy required.

Bitexact backend: add -ldl at link time (no other flags needed at -O2; see compiler flags table below)
Std backend: add -DNUMPYCPP_STD_ONLY (no -ldl needed)

Testing

The test suite verifies bit-level precision alignment between every C++ function and Python numpy. No tolerance, no atol/rtol — raw IEEE 754 bits must match exactly. 981 tests: float64 + float32, including NaN passthrough, signed-zero, ±∞, domain errors, advanced indexing, and AVX-512 boundary sizes.

# build
cmake -S tests -B tests/build
cmake --build tests/build -j$(nproc)

# run (silent on pass — failures print hex diff)
cd tests && python3 -m pytest test_all.py -q --tb=short --no-header

Two backends — choose at cmake time

numpycpp ships two interchangeable math backends selected via a single cmake flag. All public APIs (numpy::exp, numpy::dot, numpy::einsum, …) are identical; only the internal implementation and precision guarantee differ.

The library is header-only — both backends live in the same installed headers. The backend is a consumer compile-time choice, not an install-time choice. One DEB installs everything; NUMPYCPP_STD_ONLY selects the backend.

cmake -DNUMPYCPP_STD_ONLY=OFF ..   # default — bit-exact backend
cmake -DNUMPYCPP_STD_ONLY=ON  ..   # std / performance-first backend

Property	`NUMPYCPP_STD_ONLY=OFF` (bitexact)	`NUMPYCPP_STD_ONLY=ON` (std)
Transcendental math	`dlsym` → `npy_exp` / `__svml_exp8`	`std::exp`, `std::log`, …
Dot / matmul	OpenBLAS ILP64 via `dlsym`	Pure C++ loops (auto-vectorised)
Precision vs numpy	IEEE 754 bit-identical	0–2 ULP (not bit-exact)
External deps	libdl + numpy `.so` loaded	None — pure C++17
DEB package	same `numpycpp-dev-<ver>-Linux.deb`	same `numpycpp-dev-<ver>-Linux.deb`
cmake propagation	`target_link_libraries(… dl)`	`target_compile_definitions(… NUMPYCPP_STD_ONLY)`

Compiler flags — bitexact backend (`NUMPYCPP_STD_ONLY=OFF`)

The minimum set was determined empirically: each flag was removed in isolation and the full 981-test suite was re-run. Only flags whose removal caused at least one test failure are marked required.

target_compile_options(<target> PRIVATE
    -ffp-contract=off          # REQUIRED — see below
    -mavx512f -mfma            # REQUIRED — see below
    -mprefer-vector-width=256  # REQUIRED — see below
)
target_link_libraries(<target> PRIVATE dl)   # REQUIRED — dlsym

Flag	Status	Why required	Consequence of removal
`-ffp-contract=off`	required	Prevents silent FMA fusion of `a*b+c`. einsum loops must match numpy's BLAS multiply-then-add order.	36 einsum tests fail with ±1 ULP.
`-mavx512f -mfma`	required	SVML bridge declares `exp_svml_f64` etc. inside `#ifdef __AVX512F__`. AVX-512 intrinsics are runtime-guarded — binary safe on non-AVX-512 CPUs.	Hard compile error: `'exp_svml_f64' was not declared`.
`-mprefer-vector-width=256`	required	Prevents GCC from emitting ZMM instructions globally. Some cloud VMs expose `avx512f` in CPUID but trap ZMM via hypervisor XSAVE. The SVML bridge is safe (runtime guard), but unguarded auto-vectorized ZMM causes SIGILL.	SIGILL at startup on some cloud VMs (GitHub Actions Azure runners).
`-ldl`	required	`dlsym`/`dlopen` locate numpy's `_multiarray_umath.so` at runtime.	Link error: `undefined reference to 'dlsym'`.
`-fno-builtin-exp` …	recommended	Prevents GCC from substituting npy_* call sites with builtins. numpycpp never calls `exp()` from `<cmath>` directly, so no current effect — kept as defensive guard.	No test failure when removed today.
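
The 1 ULP effect of FMA fusion that `-ffp-contract=off` guards against can be reproduced in isolation. The sketch below is not taken from numpycpp: the function names are hypothetical, and the operands in the note are chosen so that the rounded product loses a low-order bit that a fused multiply-add keeps.

```cpp
#include <cmath>

// Multiply, round the product to double, then add (two roundings).
// The volatile store keeps the compiler from contracting the two steps
// into a single FMA, whatever -ffp-contract is set to.
inline double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// One fused operation: a*b + c with a single rounding at the end.
inline double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

With a = b = 1 + 2^-27 and c = -1, the exact product is 1 + 2^-26 + 2^-54. Rounding it to double drops the 2^-54 term, so mul_then_add returns 2^-26, while fused returns 2^-26 + 2^-54. A compiler that silently contracts `a*b+c` can therefore change the last bit of an accumulation.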

Compiler flags — std backend (`NUMPYCPP_STD_ONLY=ON`)

target_compile_definitions(<target> PRIVATE NUMPYCPP_STD_ONLY)
target_compile_options(<target> PRIVATE
    -O3
    -march=native              # auto-vectorise with all available SIMD
)
# No -ldl needed — no dlsym in std backend

Flag	Status	Why
`NUMPYCPP_STD_ONLY`	required	Selects `std_math_backend.h` + `std_linalg_backend.h` instead of SVML/BLAS bridges. Set via `cmake -DNUMPYCPP_STD_ONLY=ON` or `set(NUMPYCPP_STD_ONLY ON)` before `find_package`.
`-O3 -march=native`	recommended	Enables full auto-vectorisation of the C++ loops (exp/dot/gemm). Without optimisation, std backend is slow.
`-ffp-contract=off`	not needed	FMA contraction is welcome in std mode — improves precision and performance of gemm/dot.
`-mavx512f -mprefer-vector-width=256`	not needed	SVML bridge not compiled in; no ZMM-trap risk. `-march=native` selects appropriate SIMD automatically.
`-ldl`	not needed	No dlopen/dlsym in std backend.

Runtime CPU dispatch (bitexact only): The SVML bridge auto‑detects AVX‑512 at runtime (__builtin_cpu_supports("avx512f")). On AVX‑512 hardware it calls numpy's SVML vector functions (__svml_exp8, etc.); otherwise it falls back to numpy's scalar npy_exp/npy_log/etc. Both paths are resolved from the loaded _multiarray_umath.so via dlsym. AVX‑512 intrinsics are isolated behind __attribute__((target("avx512f"))) — safe on any x86_64 CPU.
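
The dispatch pattern described above can be sketched as follows. This is a minimal illustration, not the contents of svml_bridge.h: all names are hypothetical, and both paths call std::exp as a stand-in for the symbols the real bridge resolves via dlsym. It assumes GCC or Clang on x86; elsewhere it falls back to the scalar path at compile time.

```cpp
#include <cmath>
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_X86_DISPATCH 1
#else
#define SKETCH_X86_DISPATCH 0
#endif

using exp_loop_fn = void (*)(const double*, double*, std::size_t);

// Scalar path: safe on every CPU.
static void exp_loop_scalar(const double* in, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::exp(in[i]);
}

#if SKETCH_X86_DISPATCH
// Only this function is compiled with AVX-512 enabled, so AVX-512
// instructions cannot leak into code that runs on older CPUs.
__attribute__((target("avx512f")))
static void exp_loop_avx512(const double* in, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::exp(in[i]);  // stand-in body
}
#endif

// Pick the implementation once, based on what the running CPU reports.
static exp_loop_fn select_exp_loop() {
#if SKETCH_X86_DISPATCH
    if (__builtin_cpu_supports("avx512f")) return exp_loop_avx512;
#endif
    return exp_loop_scalar;
}
```

A caller would store the result of select_exp_loop() once and invoke it through the function pointer, so the CPU check is not repeated per call.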

Alignment status

Two backends, same API — choose with cmake -DNUMPYCPP_STD_ONLY=ON/OFF.

Legend	Meaning
✅	IEEE 754 bit-identical to numpy (float64 + float32)
〜	Correct result, 0–2 ULP from numpy (not bit-exact)

Category	Functions	`bitexact` (`STD_ONLY=OFF`)	`std` (`STD_ONLY=ON`)
Creation	`zeros_like` `ones_like` `full_like` `empty_like` `zeros` `ones` `full`	✅	✅
Type conversion	`astype` (int/float/bool/int64) `truncate_to_float32`	✅	✅
Comparison	`greater` `less` `equal` `not_equal` `greater_equal` `less_equal`	✅	✅
Logical	`logical_and` `logical_or` `logical_not` `logical_xor`	✅	✅
Special values	`isnan` `isinf` `isfinite`	✅	✅
Manipulation	`diff` `stack` `vstack` `hstack` `concatenate` `transpose` `flatten` `squeeze` `roll` `flip` `repeat` `tile` `where`	✅	✅
Advanced indexing	`take` `compress` `slice` (N-D + step) `put` `putmask` `slice_assign`	✅	✅
Sorting	`argsort` `argmax` `argmin`	✅	✅
Set / interp	`isin` `intersect1d` `interp` `unwrap` `flatnonzero` `safe_divide`	✅	✅
Reduction	`sum` `mean` `max` `min` `any` `all` `std` `var` `cumsum` `mean` (axis)	✅	✅
Math — pure C++	`sqrt` `abs` `sign` `clip` `round` `floor` `ceil` `degrees` `radians` `maximum` `minimum`	✅	✅
Math — transcendental	`exp` `log` `sin` `cos` `tan` `arcsin` `arccos` `arctan` `log10` `log2` `exp2` `cbrt` `expm1` `log1p`	✅	〜 0–1 ULP
Math — power / atan2	`power` `arctan2`	✅	〜 0–1 ULP
Math — hypot	`hypot`	✅	✅
Dot product	`numpy.dot` (1-D)	✅	〜 0–1 ULP
Norm	`numpy.linalg.norm` (scalar + axis)	✅	〜 0–1 ULP
Matmul	`numpy.matmul` (2-D, 1-D×2-D, 2-D×1-D, batched 3-D)	✅	〜 0–2 ULP
Einsum	`ij,ij→i` `ij,jk→ik` `bij,bjk→bik` and all 2-operand patterns	✅	〜 0–2 ULP
Matrix inverse	`numpy.linalg.inv` (N×N)	✅	〜 0–2 ULP

bitexact backend: transcendentals resolved via dlsym from numpy's _multiarray_umath.so — same npy_exp/npy_log kernels numpy uses, with AVX‑512 SVML vector path (__svml_exp8 etc.) when available. Dot/matmul/einsum use OpenBLAS ILP64 (cblas_sgemm64_) — the same BLAS numpy delegates to. Results are IEEE 754 bit-identical on all architectures.

std backend: transcendentals use std::exp/std::sin/… from <cmath> (glibc, typically 0–1 ULP). Dot/matmul/einsum use plain C++ loops (compiler auto-vectorises with -O3 -march=native). No external dependencies.

Reductions (both backends): pairwise summation algorithm (recursive split, 8-accumulator unrolled), which matches np.sum exactly.

hypot (both backends): std::hypot; numpy delegates to the same libm call.
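
The pairwise summation scheme mentioned above (recursive split with an 8-accumulator unrolled base case) can be sketched like this. It is an illustration modelled on the general shape of NumPy's pairwise sum, not numpycpp's actual reduce.h code; the function name and the block size of 128 are assumptions.

```cpp
#include <cstddef>

// Assumed block size below which the unrolled loop is used directly.
constexpr std::size_t kPairwiseBlock = 128;

inline double pairwise_sum(const double* a, std::size_t n) {
    if (n < 8) {
        // Too short to unroll: plain sequential sum.
        double res = 0.0;
        for (std::size_t i = 0; i < n; ++i) res += a[i];
        return res;
    }
    if (n <= kPairwiseBlock) {
        // Eight independent accumulators, combined pairwise at the end.
        double r[8];
        for (std::size_t k = 0; k < 8; ++k) r[k] = a[k];
        std::size_t i = 8;
        for (; i + 8 <= n; i += 8) {
            for (std::size_t k = 0; k < 8; ++k) r[k] += a[i + k];
        }
        double res = ((r[0] + r[1]) + (r[2] + r[3])) +
                     ((r[4] + r[5]) + (r[6] + r[7]));
        // Leftover elements when n is not a multiple of 8.
        for (; i < n; ++i) res += a[i];
        return res;
    }
    // Recursive split; keep the left half a multiple of the unroll factor.
    std::size_t half = n / 2;
    half -= half % 8;
    return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
}
```

Because the order of additions is fixed by n alone, any implementation that follows the same splitting and accumulation order produces the same rounding at every step, which is the property a bit-exact reduction relies on.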

Project Structure

numpycpp/
├── numpycpp/                   # header-only library (all public + internal headers)
│   ├── numpy.h                 # [PUBLIC]   umbrella — includes all core modules below
│   ├── numpy_py.h              # [PUBLIC]   umbrella — includes all pybind11 wrappers below
│   ├── init.h                  # [PUBLIC]   zeros_like, ones_like, full, arange, linspace, …
│   ├── init_py.h               # [PUBLIC]   pybind11 wrappers for init.h
│   ├── elementwise.h           # [PUBLIC]   sqrt/exp/sin/…, comparison, logical, astype
│   ├── elementwise_py.h        # [PUBLIC]   pybind11 wrappers for elementwise.h
│   ├── reduce.h                # [PUBLIC]   sum/mean/std/var/cumsum, axis reductions
│   ├── reduce_py.h             # [PUBLIC]   pybind11 wrappers for reduce.h
│   ├── manipulation.h          # [PUBLIC]   transpose/take/slice/put/putmask/argsort/…
│   ├── manipulation_py.h       # [PUBLIC]   pybind11 wrappers for manipulation.h
│   ├── io.h                    # [PUBLIC]   isin, interp, unwrap, safe_divide, …
│   ├── io_py.h                 # [PUBLIC]   pybind11 wrappers for io.h
│   ├── linalg.h                # [PUBLIC]   dot, norm, matmul, einsum
│   ├── linalg_py.h             # [PUBLIC]   pybind11 wrappers for linalg.h
│   └── detail/                 # [INTERNAL] do not include directly
│       ├── macros.h            #   NUMPY_UNROLL4, NUMPY_SMALL_STACK
│       ├── svml_bridge.h       #   bitexact: SVML / npy_* scalar math (dlsym)
│       ├── std_math_backend.h  #   std: pure <cmath> std::exp/log/sin/… (no deps)
│       ├── blas_bridge.h       #   bitexact: OpenBLAS ILP64 cblas wrappers (dlsym)
│       ├── std_linalg_backend.h #   std: pure C++ loop dot/gemm (no deps)
│       ├── avx512_loops.h      #   bitexact: AVX-512 vectorised exp/sin/cos loops
│       └── npy_math_float.h    #   bitexact: npy_* float32 wrappers
├── bench/                      # performance benchmarks
│   ├── CMakeLists.txt
│   ├── bench_core.cpp          # C++ benchmark driver
│   ├── bench.py                # pybind11-based benchmark runner
│   └── bench_numpy.py          # pure-numpy baseline
├── tests/                      # bit-level precision tests + test module
│   ├── module.cpp              # pybind11 module for testing
│   ├── test_all.py             # single entry — all APIs, 981 tests, float64+float32
│   ├── conftest.py             # silent-mode output suppression
│   ├── make_csv.py             # ULP precision CSV generator
│   ├── diagnose_numpy.py       # numpy internal diagnostic tool
│   ├── ulp_precision.csv       # per-function ULP comparison data
│   └── CMakeLists.txt          # test-module build
├── example/                    # minimal usage examples
│   ├── CMakeLists.txt
│   └── main.cpp
├── cmake/
│   └── preinst                 # DEB pre-install script (clean old headers)
├── issue/                      # issue tracking & root-cause analysis
│   └── 001-mean_pairwise_sum_vs_sequential.md
├── CMakeLists.txt              # build & .deb packaging
└── README.md

License

MIT
