numpycpp

CI License: MIT C++17 CMake Tests PRs Welcome

Background

NumPy is fast — but its ceiling is locked by Python.

We created numpycpp to keep NumPy's familiar usage patterns while letting C++ break through Python's performance ceiling and accelerate your code further.

Overview

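numpycpp is a header-only C++ library implementing numpy's core API (numpy.*, numpy.linalg.*, numpy.einsum) with bit-level precision alignment. Functions take raw pointers plus element counts. At compile time the headers need nothing beyond the C++17 standard library; the default bitexact backend additionally loads numpy's _multiarray_umath.so at runtime through libdl, while the std backend has no external dependencies at all.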

All APIs are tested against Python numpy under strict bit-level comparison: every IEEE 754 float bit must match exactly (981 tests, float64 + float32, including NaN passthrough, signed-zero, ±∞, domain-error cases, and advanced indexing).

Bit-exact math is achieved by resolving numpy's own math functions from _multiarray_umath.so at runtime. The SVML bridge auto-detects your CPU and selects the same path numpy uses: AVX‑512 SVML (__svml_exp8) when available, or scalar npy_exp/npy_log/etc. otherwise. AVX‑512 intrinsics are isolated behind __attribute__((target)) — the binary is safe on any x86_64 CPU (no SIGILL). Every transcendental function produces the exact same IEEE 754 bits as numpy on all architectures.
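
The runtime lookup described above follows the usual dlsym pattern. The sketch below is an illustration only, not numpycpp's bridge: it looks up libm's exp among the libraries already loaded into the process instead of opening numpy's _multiarray_umath.so, because that library's location depends on the Python installation. The names resolve_unary, exp_fallback, and resolved_exp are hypothetical. On older glibc versions the program must be linked with -ldl.

```cpp
#include <cassert>
#include <cmath>
#include <dlfcn.h>   // dlsym, RTLD_DEFAULT (POSIX; RTLD_DEFAULT needs _GNU_SOURCE, which g++ defines)

// Signature shared by scalar double-precision kernels such as exp or log.
using unary_f64 = double (*)(double);

// Hypothetical helper: search the libraries already loaded into the process
// for `symbol`; return `fallback` when the lookup fails.
inline unary_f64 resolve_unary(const char* symbol, unary_f64 fallback) {
    assert(symbol != nullptr && fallback != nullptr);
    void* address = dlsym(RTLD_DEFAULT, symbol);
    if (address == nullptr) {
        return fallback;
    }
    // POSIX requires that a dlsym result can be converted to a function pointer.
    return reinterpret_cast<unary_f64>(address);
}

// Fallback that simply forwards to <cmath>.
inline double exp_fallback(double x) { return std::exp(x); }

// Resolve once on first use, then reuse the pointer for every later call.
inline double resolved_exp(double x) {
    static const unary_f64 kernel = resolve_unary("exp", &exp_fallback);
    return kernel(x);
}
```

Resolving the pointer once and caching it keeps the per-call cost to a single indirect call, which is presumably why a bridge of this kind would resolve symbols at startup rather than per element.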

Quick Start

Dependencies

  • C++17 compiler (GCC >= 9, Clang >= 7, MSVC >= 2019)

Usage

Public headers — include the umbrella or individual modules:

#include <numpycpp/numpy.h>          // ← single entry point (recommended)

// or include only what you need:
#include <numpycpp/init.h>           // zeros_like, ones_like, full, arange, …
#include <numpycpp/elementwise.h>    // sqrt, exp, sin, astype, …
#include <numpycpp/reduce.h>         // sum, mean, std, var, cumsum, …
#include <numpycpp/manipulation.h>   // transpose, take, slice, putmask, …
#include <numpycpp/io.h>             // isin, interp, unwrap, …
#include <numpycpp/linalg.h>         // dot, norm, matmul, einsum

numpycpp/detail/ headers are internal — automatically pulled in by the public headers. Do not include them directly.

pybind11 users — include <numpycpp/numpy_py.h> instead to get the full set of pybind11 wrapper functions (numpy::sum(py::array_t<T>) etc.).

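#include <numpycpp/numpy.h>
#include <vector>

int main() {
    std::vector<double> data = {1.0, 4.0, 9.0};
    std::vector<double> result(data.size());

    // Element-wise square root: input pointer, output pointer, element count.
    numpy::sqrt(data.data(), result.data(), data.size());
    // result → {1.0, 2.0, 3.0}

    // Reductions use the same raw pointer + size interface.
    double s = numpy::sum(data.data(), data.size());
    // s → 14.0
    return s == 14.0 ? 0 : 1;
}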

Install

Ubuntu (DEB)

Download the latest .deb release or build from source:

mkdir build && cd build
cmake ..
make deb
sudo dpkg -i numpycpp-dev-*.deb

Headers are installed to /usr/include/numpycpp/ along with a CMake config that supports both backends.

CMake — bitexact backend (default)

find_package(numpycpp REQUIRED)
target_link_libraries(myapp PRIVATE numpycpp::numpycpp)
# cmake propagates -ldl automatically — no extra flags needed

CMake — std backend

set(NUMPYCPP_STD_ONLY ON)           # set BEFORE find_package
find_package(numpycpp REQUIRED)
target_link_libraries(myapp PRIVATE numpycpp::numpycpp)
# cmake propagates -DNUMPYCPP_STD_ONLY automatically — no extra flags needed

pybind11_add_module users

With certain CMake / pybind11 version combinations, pybind11_add_module may lose IMPORTED targets during generation. If you hit this, use the variables-based fallback:

set(NUMPYCPP_STD_ONLY OFF)          # or ON for std backend
find_package(numpycpp REQUIRED)
pybind11_add_module(mymodule module.cpp)
target_include_directories(mymodule PRIVATE ${numpycpp_INCLUDE_DIRS})
target_compile_features(mymodule PRIVATE cxx_std_17)
# bitexact: add manually → target_link_libraries(mymodule PRIVATE dl)
# std:      add manually → target_compile_definitions(mymodule PRIVATE NUMPYCPP_STD_ONLY)

Manual (header-only)

Add -Ipath/to/numpycpp to your compiler flags and include the headers directly. No build step, no copy required.

  • Bitexact backend: add -ldl at link time (no other flags needed at -O2; see compiler flags table below)
  • Std backend: add -DNUMPYCPP_STD_ONLY (no -ldl needed)

Testing

The test suite verifies bit-level precision alignment between every C++ function and Python numpy. No tolerance, no atol/rtol — raw IEEE 754 bits must match exactly. 981 tests: float64 + float32, including NaN passthrough, signed-zero, ±∞, domain errors, advanced indexing, and AVX-512 boundary sizes.

# build
cmake -S tests -B tests/build
cmake --build tests/build -j$(nproc)

# run (silent on pass — failures print hex diff)
cd tests && python3 -m pytest test_all.py -q --tb=short --no-header
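
To make "raw IEEE 754 bits must match" concrete, here is a minimal sketch of a bit-level comparison. It is not taken from the test suite (which is driven from Python); bits_of and bit_identical are hypothetical names. It shows why a bit comparison is stricter than ==: it tells +0.0 from -0.0 and accepts a NaN that matches bit for bit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy the 64-bit pattern of a double into an integer without aliasing issues.
inline std::uint64_t bits_of(double x) {
    std::uint64_t bits = 0;
    static_assert(sizeof bits == sizeof x, "double must be 64 bits wide");
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}

// True only when every element has exactly the same bit pattern in both arrays.
inline bool bit_identical(const double* a, const double* b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        if (bits_of(a[i]) != bits_of(b[i])) {
            return false;
        }
    }
    return true;
}
```

With this helper, 0.0 and -0.0 compare equal under == but are reported as different, while a NaN compared with itself is reported as identical even though NaN != NaN.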

Two backends — choose at cmake time

numpycpp ships two interchangeable math backends selected via a single cmake flag. All public APIs (numpy::exp, numpy::dot, numpy::einsum, …) are identical; only the internal implementation and precision guarantee differ.

The library is header-only — both backends live in the same installed headers. The backend is a consumer compile-time choice, not an install-time choice. One DEB installs everything; NUMPYCPP_STD_ONLY selects the backend.

cmake -DNUMPYCPP_STD_ONLY=OFF ..   # default — bit-exact backend
cmake -DNUMPYCPP_STD_ONLY=ON  ..   # std / performance-first backend

| Property | NUMPYCPP_STD_ONLY=OFF (bitexact) | NUMPYCPP_STD_ONLY=ON (std) |
|---|---|---|
| Transcendental math | dlsym → npy_exp / __svml_exp8 | std::exp, std::log, … |
| Dot / matmul | OpenBLAS ILP64 via dlsym | Pure C++ loops (auto-vectorised) |
| Precision vs numpy | IEEE 754 bit-identical | 0–2 ULP (not bit-exact) |
| External deps | libdl + numpy .so loaded | None (pure C++17) |
| DEB package | same numpycpp-dev-<ver>-Linux.deb | same numpycpp-dev-<ver>-Linux.deb |
| cmake propagation | target_link_libraries(… dl) | target_compile_definitions(… NUMPYCPP_STD_ONLY) |
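
The "0–2 ULP" figure for the std backend counts how many representable doubles lie between two results. The following sketch shows one common way to measure that distance; it is an assumption about how such a number can be computed, not the project's make_csv.py logic, and ordered_key and ulp_distance are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Map a non-NaN double onto an unsigned integer line where adjacent
// representable values differ by exactly 1 and +0.0 / -0.0 share one key.
inline std::uint64_t ordered_key(double x) {
    assert(!std::isnan(x));
    std::uint64_t u = 0;
    std::memcpy(&u, &x, sizeof u);
    const std::uint64_t sign = std::uint64_t{1} << 63;
    // Negative values count downwards from the midpoint, positive values upwards.
    return (u & sign) ? sign - (u & ~sign) : sign + u;
}

// Number of representable doubles separating a and b (0 means equal values).
inline std::uint64_t ulp_distance(double a, double b) {
    const std::uint64_t ka = ordered_key(a);
    const std::uint64_t kb = ordered_key(b);
    return ka > kb ? ka - kb : kb - ka;
}
```

Note that this metric treats +0.0 and -0.0 as distance 0, so it is weaker than the bit-identical guarantee of the bitexact backend.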

Compiler flags — bitexact backend (NUMPYCPP_STD_ONLY=OFF)

The minimum set was determined empirically: each flag was removed in isolation and the full 981-test suite was re-run. Only flags whose removal caused at least one test failure are marked required.

target_compile_options(<target> PRIVATE
    -ffp-contract=off          # REQUIRED — see below
    -mavx512f -mfma            # REQUIRED — see below
    -mprefer-vector-width=256  # REQUIRED — see below
)
target_link_libraries(<target> PRIVATE dl)   # REQUIRED — dlsym

| Flag | Status | Why required | Consequence of removal |
|---|---|---|---|
| -ffp-contract=off | required | Prevents silent FMA fusion of a*b+c. einsum loops must match numpy's BLAS multiply-then-add order. | 36 einsum tests fail with ±1 ULP. |
| -mavx512f -mfma | required | SVML bridge declares exp_svml_f64 etc. inside #ifdef __AVX512F__. AVX-512 intrinsics are runtime-guarded, so the binary stays safe on non-AVX-512 CPUs. | Hard compile error: 'exp_svml_f64' was not declared. |
| -mprefer-vector-width=256 | required | Prevents GCC from emitting ZMM instructions globally. Some cloud VMs expose avx512f in CPUID but trap ZMM via hypervisor XSAVE. The SVML bridge is safe (runtime guard), but unguarded auto-vectorized ZMM causes SIGILL. | SIGILL at startup on some cloud VMs (GitHub Actions Azure runners). |
| -ldl | required | dlsym/dlopen locate numpy's _multiarray_umath.so at runtime. | Link error: undefined reference to 'dlsym'. |
| -fno-builtin-exp | recommended | Prevents GCC from substituting npy_* call sites with builtins. numpycpp never calls exp() from <cmath> directly, so it has no current effect and is kept as a defensive guard. | No test failure when removed today. |
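
Why FMA fusion can move a result by one ULP: a fused multiply-add rounds once, whereas multiply-then-add rounds twice. The sketch below demonstrates the effect with hand-picked inputs; the function names are hypothetical, and the volatile temporary is only there to force a separately rounded product regardless of the compiler's contraction setting.

```cpp
#include <cassert>
#include <cmath>

// Two roundings: the product is rounded to double, then the sum is rounded.
// The volatile store keeps the compiler from fusing the two operations.
inline double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// One rounding: a*b + c is computed exactly and rounded once.
inline double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Usage: a*a = 1 + 2^-26 + 2^-54 exactly, which does not fit in 53 bits.
inline void show_contraction_effect() {
    const double a = 1.0 + std::ldexp(1.0, -27);
    const double c = -(1.0 + std::ldexp(1.0, -26));
    // The rounded product drops the 2^-54 term, so the sum cancels to zero.
    assert(mul_then_add(a, a, c) == 0.0);
    // The fused form keeps the 2^-54 term until the single final rounding.
    assert(fused_mul_add(a, a, c) == std::ldexp(1.0, -54));
}
```

Inside a long accumulation loop such as an einsum contraction, differences of this kind are what surface as last-bit mismatches against a reference that multiplies and adds separately.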

Compiler flags — std backend (NUMPYCPP_STD_ONLY=ON)

target_compile_definitions(<target> PRIVATE NUMPYCPP_STD_ONLY)
target_compile_options(<target> PRIVATE
    -O3
    -march=native              # auto-vectorise with all available SIMD
)
# No -ldl needed — no dlsym in std backend

| Flag | Status | Why |
|---|---|---|
| NUMPYCPP_STD_ONLY | required | Selects std_math_backend.h + std_linalg_backend.h instead of SVML/BLAS bridges. Set via cmake -DNUMPYCPP_STD_ONLY=ON or set(NUMPYCPP_STD_ONLY ON) before find_package. |
| -O3 -march=native | recommended | Enables full auto-vectorisation of the C++ loops (exp/dot/gemm). Without optimisation, std backend is slow. |
| -ffp-contract=off | not needed | FMA contraction is welcome in std mode: it improves precision and performance of gemm/dot. |
| -mavx512f -mprefer-vector-width=256 | not needed | SVML bridge not compiled in; no ZMM-trap risk. -march=native selects appropriate SIMD automatically. |
| -ldl | not needed | No dlopen/dlsym in std backend. |

Runtime CPU dispatch (bitexact only): The SVML bridge auto‑detects AVX‑512 at runtime (__builtin_cpu_supports("avx512f")). On AVX‑512 hardware it calls numpy's SVML vector functions (__svml_exp8, etc.); otherwise it falls back to numpy's scalar npy_exp/npy_log/etc. Both paths are resolved from the loaded _multiarray_umath.so via dlsym. AVX‑512 intrinsics are isolated behind __attribute__((target("avx512f"))) — safe on any x86_64 CPU.
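
The dispatch pattern in the paragraph above can be sketched as follows. This is a simplified illustration under stated assumptions, not the contents of svml_bridge.h: both kernels are ordinary std::exp loops standing in for the SVML and npy_exp paths, and every name here (exp_loop_fn, exp_loop_wide, exp_loop_scalar, cpu_has_avx512f, exp_dispatch) is hypothetical. Only __builtin_cpu_supports("avx512f") is taken from the text.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

using exp_loop_fn = void (*)(const double*, double*, std::size_t);

// Stand-in for the wide-vector path (a real one would call a vector kernel).
inline void exp_loop_wide(const double* in, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::exp(in[i]);
}

// Stand-in for the scalar fallback path.
inline void exp_loop_scalar(const double* in, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::exp(in[i]);
}

// Query the CPU at runtime; compilers or targets without the builtin report false.
inline bool cpu_has_avx512f() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;
#endif
}

// Pick the kernel once, on first use, and keep the function pointer.
inline void exp_dispatch(const double* in, double* out, std::size_t n) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    static const exp_loop_fn kernel =
        cpu_has_avx512f() ? &exp_loop_wide : &exp_loop_scalar;
    kernel(in, out, n);
}
```

Because the choice is made through a function pointer at runtime, the translation unit itself does not need to be compiled for the wider instruction set; only the selected kernel would carry a target attribute in a real implementation.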

Alignment status

Two backends, same API — choose with cmake -DNUMPYCPP_STD_ONLY=ON/OFF.

| Legend | Meaning |
|---|---|
| ✓ | IEEE 754 bit-identical to numpy (float64 + float32) |
| 〜 | Correct result, 0–2 ULP from numpy (not bit-exact) |

| Category | Functions | bitexact (STD_ONLY=OFF) | std (STD_ONLY=ON) |
|---|---|---|---|
| Creation | zeros_like ones_like full_like empty_like zeros ones full | ✓ | ✓ |
| Type conversion | astype (int/float/bool/int64) truncate_to_float32 | ✓ | ✓ |
| Comparison | greater less equal not_equal greater_equal less_equal | ✓ | ✓ |
| Logical | logical_and logical_or logical_not logical_xor | ✓ | ✓ |
| Special values | isnan isinf isfinite | ✓ | ✓ |
| Manipulation | diff stack vstack hstack concatenate transpose flatten squeeze roll flip repeat tile where | ✓ | ✓ |
| Advanced indexing | take compress slice (N-D + step) put putmask slice_assign | ✓ | ✓ |
| Sorting | argsort argmax argmin | ✓ | ✓ |
| Set / interp | isin intersect1d interp unwrap flatnonzero safe_divide | ✓ | ✓ |
| Reduction | sum mean max min any all std var cumsum mean (axis) | ✓ | ✓ |
| Math: pure C++ | sqrt abs sign clip round floor ceil degrees radians maximum minimum | ✓ | ✓ |
| Math: transcendental | exp log sin cos tan arcsin arccos arctan log10 log2 exp2 cbrt expm1 log1p | ✓ | 〜 0–1 ULP |
| Math: power / atan2 | power arctan2 | ✓ | 〜 0–1 ULP |
| Math: hypot | hypot | ✓ | ✓ |
| Dot product | numpy.dot (1-D) | ✓ | 〜 0–1 ULP |
| Norm | numpy.linalg.norm (scalar + axis) | ✓ | 〜 0–1 ULP |
| Matmul | numpy.matmul (2-D, 1-D×2-D, 2-D×1-D, batched 3-D) | ✓ | 〜 0–2 ULP |
| Einsum | ij,ij→i ij,jk→ik bij,bjk→bik and all 2-operand patterns | ✓ | 〜 0–2 ULP |
| Matrix inverse | numpy.linalg.inv (N×N) | ✓ | 〜 0–2 ULP |

bitexact backend: transcendentals resolved via dlsym from numpy's _multiarray_umath.so — same npy_exp/npy_log kernels numpy uses, with AVX‑512 SVML vector path (__svml_exp8 etc.) when available. Dot/matmul/einsum use OpenBLAS ILP64 (cblas_sgemm64_) — the same BLAS numpy delegates to. Results are IEEE 754 bit-identical on all architectures.

std backend: transcendentals use std::exp/std::sin/… from <cmath> (glibc, typically 0–1 ULP). Dot/matmul/einsum use plain C++ loops (compiler auto-vectorises with -O3 -march=native). No external dependencies.
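
As an illustration of what "plain C++ loops" means for the ij,jk→ik contraction, here is a minimal row-major sketch. It is not the code in std_linalg_backend.h; the function name matmul_rowmajor and the i-j-k loop order are assumptions, chosen so the innermost loop walks contiguous memory and is easy for a compiler to auto-vectorise.

```cpp
#include <cassert>
#include <cstddef>

// c[i,k] = sum over j of a[i,j] * b[j,k], all matrices stored row-major.
// a is rows x inner, b is inner x cols, c is rows x cols.
inline void matmul_rowmajor(const double* a, const double* b, double* c,
                            std::size_t rows, std::size_t inner, std::size_t cols) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    for (std::size_t i = 0; i < rows; ++i) {
        // Clear the output row before accumulating into it.
        for (std::size_t k = 0; k < cols; ++k) {
            c[i * cols + k] = 0.0;
        }
        for (std::size_t j = 0; j < inner; ++j) {
            const double a_ij = a[i * inner + j];
            // Contiguous accesses to b's row j and c's row i.
            for (std::size_t k = 0; k < cols; ++k) {
                c[i * cols + k] += a_ij * b[j * cols + k];
            }
        }
    }
}
```

The accumulation order of such a loop generally differs from the blocked order a BLAS uses, which is consistent with the std backend being documented as within a few ULP rather than bit-exact.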

Reductions (both backends): pairwise summation (recursive split, 8-accumulator unrolled), which matches np.sum exactly.

hypot (both backends): std::hypot; numpy delegates to the same libm call.
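
The pairwise summation named above can be sketched like this. The block size of 128 and the exact combination order are assumptions modeled on the general technique; this sketch is not claimed to reproduce numpycpp's or numpy's bits, and pairwise_sum is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Pairwise summation: small blocks are summed with 8 independent accumulators,
// larger inputs are split in two and the partial sums are added.
inline double pairwise_sum(const double* a, std::size_t n) {
    assert(n == 0 || a != nullptr);
    if (n < 8) {
        // Too short to unroll: plain sequential sum.
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    if (n <= 128) {
        // Eight running sums over strided elements.
        double r[8];
        for (std::size_t k = 0; k < 8; ++k) r[k] = a[k];
        std::size_t i = 8;
        for (; i + 8 <= n; i += 8) {
            for (std::size_t k = 0; k < 8; ++k) r[k] += a[i + k];
        }
        // Combine the accumulators as a balanced tree.
        double s = ((r[0] + r[1]) + (r[2] + r[3])) + ((r[4] + r[5]) + (r[6] + r[7]));
        // Add the leftover tail that did not fill a group of 8.
        for (; i < n; ++i) s += a[i];
        return s;
    }
    // Recursive split; keep the left half a multiple of 8 elements.
    std::size_t half = n / 2;
    half -= half % 8;
    return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
}
```

Compared with a single running sum, the rounding error of this scheme grows roughly with log n instead of n, and matching a reference bit for bit requires reproducing its exact split and accumulator order.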

Project Structure

numpycpp/
├── numpycpp/                   # header-only library (all public + internal headers)
│   ├── numpy.h                 # [PUBLIC]   umbrella — includes all core modules below
│   ├── numpy_py.h              # [PUBLIC]   umbrella — includes all pybind11 wrappers below
│   ├── init.h                  # [PUBLIC]   zeros_like, ones_like, full, arange, linspace, …
│   ├── init_py.h               # [PUBLIC]   pybind11 wrappers for init.h
│   ├── elementwise.h           # [PUBLIC]   sqrt/exp/sin/…, comparison, logical, astype
│   ├── elementwise_py.h        # [PUBLIC]   pybind11 wrappers for elementwise.h
│   ├── reduce.h                # [PUBLIC]   sum/mean/std/var/cumsum, axis reductions
│   ├── reduce_py.h             # [PUBLIC]   pybind11 wrappers for reduce.h
│   ├── manipulation.h          # [PUBLIC]   transpose/take/slice/put/putmask/argsort/…
│   ├── manipulation_py.h       # [PUBLIC]   pybind11 wrappers for manipulation.h
│   ├── io.h                    # [PUBLIC]   isin, interp, unwrap, safe_divide, …
│   ├── io_py.h                 # [PUBLIC]   pybind11 wrappers for io.h
│   ├── linalg.h                # [PUBLIC]   dot, norm, matmul, einsum
│   ├── linalg_py.h             # [PUBLIC]   pybind11 wrappers for linalg.h
│   └── detail/                 # [INTERNAL] do not include directly
│       ├── macros.h            #   NUMPY_UNROLL4, NUMPY_SMALL_STACK
│       ├── svml_bridge.h       #   bitexact: SVML / npy_* scalar math (dlsym)
│       ├── std_math_backend.h  #   std: pure <cmath> std::exp/log/sin/… (no deps)
│       ├── blas_bridge.h       #   bitexact: OpenBLAS ILP64 cblas wrappers (dlsym)
│       ├── std_linalg_backend.h#   std: pure C++ loop dot/gemm (no deps)
│       ├── avx512_loops.h      #   bitexact: AVX-512 vectorised exp/sin/cos loops
│       └── npy_math_float.h    #   bitexact: npy_* float32 wrappers
├── bench/                      # performance benchmarks
│   ├── CMakeLists.txt
│   ├── bench_core.cpp          # C++ benchmark driver
│   ├── bench.py                # pybind11-based benchmark runner
│   └── bench_numpy.py          # pure-numpy baseline
├── tests/                      # bit-level precision tests + test module
│   ├── module.cpp              # pybind11 module for testing
│   ├── test_all.py             # single entry — all APIs, 981 tests, float64+float32
│   ├── conftest.py             # silent-mode output suppression
│   ├── make_csv.py             # ULP precision CSV generator
│   ├── diagnose_numpy.py       # numpy internal diagnostic tool
│   ├── ulp_precision.csv       # per-function ULP comparison data
│   └── CMakeLists.txt          # test-module build
├── example/                    # minimal usage examples
│   ├── CMakeLists.txt
│   └── main.cpp
├── cmake/
│   └── preinst                 # DEB pre-install script (clean old headers)
├── issue/                      # issue tracking & root-cause analysis
│   └── 001-mean_pairwise_sum_vs_sequential.md
├── CMakeLists.txt              # build & .deb packaging
└── README.md

License

MIT