NumPy is fast — but its ceiling is locked by Python.
We created numpycpp to keep NumPy's familiar usage patterns while letting C++ break through Python's performance ceiling and accelerate your code further.
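numpycpp is a header-only C++ library implementing numpy's core API (numpy.*, numpy.linalg.*, numpy.einsum) with bit-level precision alignment. Functions take raw pointers plus sizes. The headers are pure C++17 standard library code; the default bitexact backend additionally needs libdl and a loaded numpy .so at runtime, while the std backend has no external dependencies.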
All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (981 tests, float64 + float32, including NaN passthrough, signed-zero, ±∞, domain-error cases, and advanced indexing).
Bit-exact math is achieved by resolving numpy's own math functions from _multiarray_umath.so at runtime. The SVML bridge auto-detects your CPU and selects the same path numpy uses: AVX‑512 SVML (__svml_exp8) when available, or scalar npy_exp/npy_log/etc. otherwise. AVX‑512 intrinsics are isolated behind __attribute__((target)) — the binary is safe on any x86_64 CPU (no SIGILL). Every transcendental function produces the exact same IEEE 754 bits as numpy on all architectures.
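The runtime lookup described above can be pictured with a minimal sketch. It assumes a POSIX system and uses `libm.so.6` and the symbol `exp` purely as stand-ins for numpy's shared object and kernels; the helper `resolve_unary` and its fallback are hypothetical names, not numpycpp's actual bridge code.

```cpp
#include <cassert>
#include <cmath>
#if defined(__unix__) || defined(__APPLE__)
#include <dlfcn.h>
#define SKETCH_HAS_DLFCN 1
#endif

using unary_fn = double (*)(double);

// Fallback used when the shared object or the symbol cannot be found.
static double fallback_exp(double x) { return std::exp(x); }

// Hypothetical helper: look up `symbol` inside `library` at runtime and
// return it as a double(double) function pointer.
unary_fn resolve_unary(const char* library, const char* symbol) {
#ifdef SKETCH_HAS_DLFCN
    // The handle is deliberately never closed, so the returned pointer
    // stays valid for the lifetime of the process.
    if (void* handle = dlopen(library, RTLD_NOW | RTLD_LOCAL)) {
        if (void* sym = dlsym(handle, symbol)) {
            return reinterpret_cast<unary_fn>(sym);
        }
    }
#endif
    (void)library;
    (void)symbol;
    return fallback_exp;
}
```

A caller would resolve the pointer once and reuse it, for example `unary_fn f = resolve_unary("libm.so.6", "exp");`. On older glibc versions this needs `-ldl` at link time.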
- C++17 compiler (GCC >= 9, Clang >= 7, MSVC >= 2019)
Public headers — include the umbrella or individual modules:
#include <numpycpp/numpy.h> // ← single entry point (recommended)
// or include only what you need:
#include <numpycpp/init.h> // zeros_like, ones_like, full, arange, …
#include <numpycpp/elementwise.h> // sqrt, exp, sin, astype, …
#include <numpycpp/reduce.h> // sum, mean, std, var, cumsum, …
#include <numpycpp/manipulation.h> // transpose, take, slice, putmask, …
#include <numpycpp/io.h> // isin, interp, unwrap, …
#include <numpycpp/linalg.h> // dot, norm, matmul, einsum
Headers under `numpycpp/detail/` are internal and are pulled in automatically by the public headers. Do not include them directly.

pybind11 users: include `<numpycpp/numpy_py.h>` instead to get the full set of pybind11 wrapper functions (`numpy::sum(py::array_t<T>)` etc.).
std::vector<double> data = {1.0, 4.0, 9.0};
std::vector<double> result(data.size());
numpy::sqrt(data.data(), result.data(), data.size());
// result → {1.0, 2.0, 3.0}
double s = numpy::sum(data.data(), data.size());
// s → 14.0

Ubuntu (DEB)
Download the latest .deb release or build from source:
mkdir build && cd build
cmake ..
make deb
sudo dpkg -i numpycpp-dev-*.deb

Headers are installed to /usr/include/numpycpp/ along with a CMake config that supports both backends.
CMake — bitexact backend (default)
find_package(numpycpp REQUIRED)
target_link_libraries(myapp PRIVATE numpycpp::numpycpp)
# cmake propagates -ldl automatically — no extra flags needed

CMake — std backend
set(NUMPYCPP_STD_ONLY ON) # set BEFORE find_package
find_package(numpycpp REQUIRED)
target_link_libraries(myapp PRIVATE numpycpp::numpycpp)
# cmake propagates -DNUMPYCPP_STD_ONLY automatically — no extra flags needed

pybind11_add_module users
With certain CMake / pybind11 version combinations, pybind11_add_module may lose IMPORTED targets
during generation. If you hit this, use the variables-based fallback:
set(NUMPYCPP_STD_ONLY OFF) # or ON for std backend
find_package(numpycpp REQUIRED)
pybind11_add_module(mymodule module.cpp)
target_include_directories(mymodule PRIVATE ${numpycpp_INCLUDE_DIRS})
target_compile_features(mymodule PRIVATE cxx_std_17)
# bitexact: add manually → target_link_libraries(mymodule PRIVATE dl)
# std: add manually → target_compile_definitions(mymodule PRIVATE NUMPYCPP_STD_ONLY)

Manual (header-only)
Add -Ipath/to/numpycpp to your compiler flags and include the headers directly. No build step, no copy required.
- Bitexact backend: add `-ldl` at link time (no other flags needed at `-O2`; see compiler flags table below)
- Std backend: add `-DNUMPYCPP_STD_ONLY` (no `-ldl` needed)
The test suite verifies bit-level precision alignment between every C++ function and Python numpy.
No tolerance, no atol/rtol — raw IEEE 754 bits must match exactly.
981 tests: float64 + float32, including NaN passthrough, signed-zero, ±∞, domain errors, advanced indexing, and AVX-512 boundary sizes.
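To make "raw IEEE 754 bits must match" concrete, here is a minimal sketch of a bit-level comparison for `double`. The helper names `bits_of` and `bit_equal` are illustrative assumptions and are not taken from the test suite.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes 64-bit IEEE 754 doubles");

// Reinterpret a double as its raw 64-bit pattern (memcpy avoids
// undefined behaviour from pointer type punning).
std::uint64_t bits_of(double x) {
    std::uint64_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

// True only when every bit matches. Unlike operator==, this tells
// +0.0 from -0.0 and treats a NaN as equal to an identical NaN.
bool bit_equal(double a, double b) { return bits_of(a) == bits_of(b); }
```

This is why a tolerance-free comparison catches cases that `==` hides: `0.0 == -0.0` is true, yet the two values differ in their sign bit.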
# build
cmake -S tests -B tests/build
cmake --build tests/build -j$(nproc)
# run (silent on pass — failures print hex diff)
cd tests && python3 -m pytest test_all.py -q --tb=short --no-header

numpycpp ships two interchangeable math backends selected via a single cmake flag.
All public APIs (numpy::exp, numpy::dot, numpy::einsum, …) are identical;
only the internal implementation and precision guarantee differ.
The library is header-only — both backends live in the same installed headers.
The backend is a consumer compile-time choice, not an install-time choice.
One DEB installs everything; NUMPYCPP_STD_ONLY selects the backend.
cmake -DNUMPYCPP_STD_ONLY=OFF .. # default — bit-exact backend
cmake -DNUMPYCPP_STD_ONLY=ON .. # std / performance-first backend

| Property | `NUMPYCPP_STD_ONLY=OFF` (bitexact) | `NUMPYCPP_STD_ONLY=ON` (std) |
|---|---|---|
| Transcendental math | `dlsym` → `npy_exp` / `__svml_exp8` | `std::exp`, `std::log`, … |
| Dot / matmul | OpenBLAS ILP64 via `dlsym` | Pure C++ loops (auto-vectorised) |
| Precision vs numpy | IEEE 754 bit-identical | 0–2 ULP (not bit-exact) |
| External deps | `libdl` + numpy `.so` loaded | None (pure C++17) |
| DEB package | same `numpycpp-dev-<ver>-Linux.deb` | same `numpycpp-dev-<ver>-Linux.deb` |
| cmake propagation | `target_link_libraries(… dl)` | `target_compile_definitions(… NUMPYCPP_STD_ONLY)` |
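The "0–2 ULP" figure for the std backend counts how many representable doubles lie between two results. A minimal sketch of such a measurement follows; the function names are hypothetical, and the sketch assumes finite, non-NaN inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Map a double to an integer that increases monotonically with its value.
// IEEE 754 doubles are sign-magnitude, so negative values are mirrored.
static std::int64_t ordered_bits(double x) {
    std::int64_t i;
    std::memcpy(&i, &x, sizeof i);
    return i < 0 ? std::numeric_limits<std::int64_t>::min() - i : i;
}

// Number of representable doubles between a and b (0 means bit-identical,
// except that +0.0 and -0.0 also count as distance 0).
std::uint64_t ulp_distance(double a, double b) {
    const std::int64_t ia = ordered_bits(a);
    const std::int64_t ib = ordered_bits(b);
    // Subtract in unsigned arithmetic so extreme inputs cannot overflow.
    return ia >= ib
        ? static_cast<std::uint64_t>(ia) - static_cast<std::uint64_t>(ib)
        : static_cast<std::uint64_t>(ib) - static_cast<std::uint64_t>(ia);
}
```

Under this measure, a bit-exact result has distance 0 and the std backend's results would be expected to stay within 2.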
The minimum set was determined empirically: each flag was removed in isolation and the full 981-test suite was re-run. Only flags whose removal caused at least one test failure are marked required.
target_compile_options(<target> PRIVATE
-ffp-contract=off # REQUIRED — see below
-mavx512f -mfma # REQUIRED — see below
-mprefer-vector-width=256 # REQUIRED — see below
)
target_link_libraries(<target> PRIVATE dl) # REQUIRED — dlsym

| Flag | Status | Why required | Consequence of removal |
|---|---|---|---|
| `-ffp-contract=off` | required | Prevents silent FMA fusion of `a*b+c`. einsum loops must match numpy's BLAS multiply-then-add order. | 36 einsum tests fail with ±1 ULP. |
| `-mavx512f -mfma` | required | SVML bridge declares `exp_svml_f64` etc. inside `#ifdef __AVX512F__`. AVX-512 intrinsics are runtime-guarded, so the binary is safe on non-AVX-512 CPUs. | Hard compile error: `'exp_svml_f64' was not declared`. |
| `-mprefer-vector-width=256` | required | Prevents GCC from emitting ZMM instructions globally. Some cloud VMs expose avx512f in CPUID but trap ZMM via hypervisor XSAVE. The SVML bridge is safe (runtime guard), but unguarded auto-vectorized ZMM causes SIGILL. | SIGILL at startup on some cloud VMs (GitHub Actions Azure runners). |
| `-ldl` | required | `dlsym`/`dlopen` locate numpy's `_multiarray_umath.so` at runtime. | Link error: `undefined reference to 'dlsym'`. |
| `-fno-builtin-exp …` | recommended | Prevents GCC from substituting `npy_*` call sites with builtins. numpycpp never calls `exp()` from `<cmath>` directly, so it has no current effect; kept as a defensive guard. | No test failure when removed today. |
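The reason FMA contraction changes results can be shown in isolation. The sketch below is not numpycpp code: `mul_then_add` and `fused` are hypothetical helpers that contrast two roundings with one, using inputs chosen so the difference is visible.

```cpp
#include <cassert>
#include <cmath>

// Multiply, round, then add: two roundings. The volatile temporary forces
// the product to be rounded to double before the addition, whatever the
// compiler's -ffp-contract setting is.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// Fused multiply-add: the exact a*b + c is rounded once.
double fused(double a, double b, double c) { return std::fma(a, b, c); }
```

With `a = b = 1 + 2^-27` and `c = -1`, the exact product is `1 + 2^-26 + 2^-54`. The separate multiply drops the `2^-54` term when rounding, so `mul_then_add` returns `2^-26`, while `fused` keeps it. Any loop that the compiler silently contracts can therefore diverge from a BLAS that multiplies and adds separately.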
target_compile_definitions(<target> PRIVATE NUMPYCPP_STD_ONLY)
target_compile_options(<target> PRIVATE
-O3
-march=native # auto-vectorise with all available SIMD
)
# No -ldl needed — no dlsym in std backend

| Flag | Status | Why |
|---|---|---|
| `NUMPYCPP_STD_ONLY` | required | Selects `std_math_backend.h` + `std_linalg_backend.h` instead of SVML/BLAS bridges. Set via `cmake -DNUMPYCPP_STD_ONLY=ON` or `set(NUMPYCPP_STD_ONLY ON)` before `find_package`. |
| `-O3 -march=native` | recommended | Enables full auto-vectorisation of the C++ loops (exp/dot/gemm). Without optimisation, std backend is slow. |
| `-ffp-contract=off` | not needed | FMA contraction is welcome in std mode: it improves precision and performance of gemm/dot. |
| `-mavx512f -mprefer-vector-width=256` | not needed | SVML bridge not compiled in; no ZMM-trap risk. `-march=native` selects appropriate SIMD automatically. |
| `-ldl` | not needed | No `dlopen`/`dlsym` in std backend. |
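For a sense of what "pure C++ loops" means for matmul, here is a minimal sketch in the raw pointer plus size style. It is an illustration under assumed conventions (row-major storage, the name `matmul_naive`), not the contents of `std_linalg_backend.h`.

```cpp
#include <cassert>
#include <cstddef>

// C (m x n) = A (m x k) * B (k x n), all matrices row-major.
void matmul_naive(const double* a, const double* b, double* c,
                  std::size_t m, std::size_t k, std::size_t n) {
    for (std::size_t i = 0; i < m * n; ++i) c[i] = 0.0;
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const double aip = a[i * k + p];
            // The innermost loop walks B and C contiguously, which is the
            // access pattern compilers auto-vectorise most readily.
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aip * b[p * n + j];
            }
        }
    }
}
```

A loop like this accumulates in a different order than a blocked BLAS kernel, which is one plausible source of the small ULP differences listed for the std backend.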
Runtime CPU dispatch (bitexact only): The SVML bridge auto‑detects AVX‑512 at runtime (`__builtin_cpu_supports("avx512f")`). On AVX‑512 hardware it calls numpy's SVML vector functions (`__svml_exp8`, etc.); otherwise it falls back to numpy's scalar `npy_exp`/`npy_log`/etc. Both paths are resolved from the loaded `_multiarray_umath.so` via `dlsym`. AVX‑512 intrinsics are isolated behind `__attribute__((target("avx512f")))`, so the binary is safe on any x86_64 CPU.
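The dispatch pattern described above can be sketched as follows. This is a simplified illustration assuming GCC or Clang on x86_64; the function names are hypothetical, and the "wide" path simply reuses `std::exp` as a placeholder where real code would use AVX-512 intrinsics or SVML calls.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

using loop_fn = void (*)(const double*, double*, std::size_t);

// Baseline path: safe on every CPU.
static void exp_loop_scalar(const double* in, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::exp(in[i]);
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define SKETCH_X86_DISPATCH 1
// Only this function is compiled with AVX-512 enabled, so the rest of the
// translation unit stays executable on CPUs without it.
__attribute__((target("avx512f")))
static void exp_loop_avx512(const double* in, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::exp(in[i]);  // placeholder body
}
#endif

// Pick the implementation once, based on what the running CPU reports.
loop_fn select_exp_loop() {
#ifdef SKETCH_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return exp_loop_avx512;
#endif
    return exp_loop_scalar;
}
```

Callers would typically cache the result of `select_exp_loop()` in a static variable so the CPU check runs only once.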
Two backends, same API — choose with cmake -DNUMPYCPP_STD_ONLY=ON/OFF.
| Legend | Meaning |
|---|---|
| ✅ | IEEE 754 bit-identical to numpy (float64 + float32) |
| 〜 | Correct result, 0–2 ULP from numpy (not bit-exact) |
| Category | Functions | bitexact (`STD_ONLY=OFF`) | std (`STD_ONLY=ON`) |
|---|---|---|---|
| Creation | zeros_like ones_like full_like empty_like zeros ones full | ✅ | ✅ |
| Type conversion | astype (int/float/bool/int64) truncate_to_float32 | ✅ | ✅ |
| Comparison | greater less equal not_equal greater_equal less_equal | ✅ | ✅ |
| Logical | logical_and logical_or logical_not logical_xor | ✅ | ✅ |
| Special values | isnan isinf isfinite | ✅ | ✅ |
| Manipulation | diff stack vstack hstack concatenate transpose flatten squeeze roll flip repeat tile where | ✅ | ✅ |
| Advanced indexing | take compress slice (N-D + step) put putmask slice_assign | ✅ | ✅ |
| Sorting | argsort argmax argmin | ✅ | ✅ |
| Set / interp | isin intersect1d interp unwrap flatnonzero safe_divide | ✅ | ✅ |
| Reduction | sum mean max min any all std var cumsum mean (axis) | ✅ | ✅ |
| Math — pure C++ | sqrt abs sign clip round floor ceil degrees radians maximum minimum | ✅ | ✅ |
| Math — transcendental | exp log sin cos tan arcsin arccos arctan log10 log2 exp2 cbrt expm1 log1p | ✅ | 〜 0–1 ULP |
| Math — power / atan2 | power arctan2 | ✅ | 〜 0–1 ULP |
| Math — hypot | hypot | ✅ | ✅ |
| Dot product | numpy.dot (1-D) | ✅ | 〜 0–1 ULP |
| Norm | numpy.linalg.norm (scalar + axis) | ✅ | 〜 0–1 ULP |
| Matmul | numpy.matmul (2-D, 1-D×2-D, 2-D×1-D, batched 3-D) | ✅ | 〜 0–2 ULP |
| Einsum | `ij,ij→i` `ij,jk→ik` `bij,bjk→bik` and all 2-operand patterns | ✅ | 〜 0–2 ULP |
| Matrix inverse | numpy.linalg.inv (N×N) | ✅ | 〜 0–2 ULP |
bitexact backend: transcendentals are resolved via `dlsym` from numpy's `_multiarray_umath.so`, the same `npy_exp`/`npy_log` kernels numpy uses, with the AVX‑512 SVML vector path (`__svml_exp8` etc.) when available. Dot/matmul/einsum use OpenBLAS ILP64 (`cblas_sgemm64_`), the same BLAS numpy delegates to. Results are IEEE 754 bit-identical on all architectures.

std backend: transcendentals use `std::exp`/`std::sin`/… from `<cmath>` (glibc, typically 0–1 ULP). Dot/matmul/einsum use plain C++ loops (compiler auto-vectorises with `-O3 -march=native`). No external dependencies.

Reductions (both backends): pairwise summation algorithm (recursive split, 8-accumulator unrolled), which matches `np.sum` exactly.

hypot (both backends): `std::hypot`; numpy delegates to the same libm call.
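The pairwise summation mentioned under reductions can be sketched as below. This is an illustration of the general technique (recursive split with an 8-accumulator unrolled base case); the block size of 128 and the name `pairwise_sum` are assumptions, and the code is not copied from numpycpp's `reduce.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Blocked pairwise summation over a contiguous array of doubles.
double pairwise_sum(const double* a, std::size_t n) {
    if (n < 8) {
        // Tiny inputs: plain left-to-right accumulation.
        double res = 0.0;
        for (std::size_t i = 0; i < n; ++i) res += a[i];
        return res;
    }
    if (n <= 128) {
        // Base case: eight independent accumulators, combined as a tree.
        double r[8];
        for (std::size_t j = 0; j < 8; ++j) r[j] = a[j];
        std::size_t i = 8;
        for (; i + 8 <= n; i += 8) {
            for (std::size_t j = 0; j < 8; ++j) r[j] += a[i + j];
        }
        double res = ((r[0] + r[1]) + (r[2] + r[3])) +
                     ((r[4] + r[5]) + (r[6] + r[7]));
        for (; i < n; ++i) res += a[i];  // leftover tail elements
        return res;
    }
    // Recursive case: split near the middle, keeping the left half a
    // multiple of 8 so the unrolled base case stays aligned.
    std::size_t half = n / 2;
    half -= half % 8;
    return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
}
```

Because floating-point addition is not associative, reproducing a reference sum bit for bit requires following the same grouping of additions, not just the same mathematical formula.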
numpycpp/
├── numpycpp/ # header-only library (all public + internal headers)
│ ├── numpy.h # [PUBLIC] umbrella — includes all core modules below
│ ├── numpy_py.h # [PUBLIC] umbrella — includes all pybind11 wrappers below
│ ├── init.h # [PUBLIC] zeros_like, ones_like, full, arange, linspace, …
│ ├── init_py.h # [PUBLIC] pybind11 wrappers for init.h
│ ├── elementwise.h # [PUBLIC] sqrt/exp/sin/…, comparison, logical, astype
│ ├── elementwise_py.h # [PUBLIC] pybind11 wrappers for elementwise.h
│ ├── reduce.h # [PUBLIC] sum/mean/std/var/cumsum, axis reductions
│ ├── reduce_py.h # [PUBLIC] pybind11 wrappers for reduce.h
│ ├── manipulation.h # [PUBLIC] transpose/take/slice/put/putmask/argsort/…
│ ├── manipulation_py.h # [PUBLIC] pybind11 wrappers for manipulation.h
│ ├── io.h # [PUBLIC] isin, interp, unwrap, safe_divide, …
│ ├── io_py.h # [PUBLIC] pybind11 wrappers for io.h
│ ├── linalg.h # [PUBLIC] dot, norm, matmul, einsum
│ ├── linalg_py.h # [PUBLIC] pybind11 wrappers for linalg.h
│ └── detail/ # [INTERNAL] do not include directly
│ ├── macros.h # NUMPY_UNROLL4, NUMPY_SMALL_STACK
│ ├── svml_bridge.h # bitexact: SVML / npy_* scalar math (dlsym)
│ ├── std_math_backend.h # std: pure <cmath> std::exp/log/sin/… (no deps)
│ ├── blas_bridge.h # bitexact: OpenBLAS ILP64 cblas wrappers (dlsym)
│ ├── std_linalg_backend.h # std: pure C++ loop dot/gemm (no deps)
│ ├── avx512_loops.h # bitexact: AVX-512 vectorised exp/sin/cos loops
│ └── npy_math_float.h # bitexact: npy_* float32 wrappers
├── bench/ # performance benchmarks
│ ├── CMakeLists.txt
│ ├── bench_core.cpp # C++ benchmark driver
│ ├── bench.py # pybind11-based benchmark runner
│ └── bench_numpy.py # pure-numpy baseline
├── tests/ # bit-level precision tests + test module
│ ├── module.cpp # pybind11 module for testing
│ ├── test_all.py # single entry — all APIs, 981 tests, float64+float32
│ ├── conftest.py # silent-mode output suppression
│ ├── make_csv.py # ULP precision CSV generator
│ ├── diagnose_numpy.py # numpy internal diagnostic tool
│ ├── ulp_precision.csv # per-function ULP comparison data
│ └── CMakeLists.txt # test-module build
├── example/ # minimal usage examples
│ ├── CMakeLists.txt
│ └── main.cpp
├── cmake/
│ └── preinst # DEB pre-install script (clean old headers)
├── issue/ # issue tracking & root-cause analysis
│ └── 001-mean_pairwise_sum_vs_sequential.md
├── CMakeLists.txt # build & .deb packaging
└── README.md