Description
Martin reported 2-5x slowdowns running benchmarks in the apptainer container compared to intelpython. We investigated and found the primary cause: Intel MKL's FFT implementation is 7-9x faster than the pocketfft backend used by numpy in pip-installed packages.
Benchmark Results
| Operation | Container (OpenBLAS) | intelpython (MKL) | MKL Speedup |
|---|---|---|---|
| FFT | | | |
| fft 10M | 0.78s | 0.11s | 7.5x |
| rfft 10M | 0.40s | 0.05s | 9.0x |
| fft2 4096² | 1.33s | 0.16s | 8.5x |
| BLAS/LAPACK | | | |
| matmul 2000² | 0.022s | 0.023s | 1.0x |
| SVD 1500² | 1.65s | 0.84s | 2.0x |
| QR 2000² | 0.83s | 0.43s | 1.9x |
| lstsq 2000×1000 | 0.58s | 0.26s | 2.2x |
| eigh 1500² | 0.43s | 0.26s | 1.6x |
| solve 1500² | 0.06s | 0.19s | 0.3x (OpenBLAS wins) |
| Other | | | |
| scipy.fft 10M | 0.44s | 0.45s | 1.0x |
| std 10M | 0.07s | 0.03s | 2.1x |
| sort 1M | 0.03s | 0.05s | 0.7x (OpenBLAS wins) |
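The FFT rows above could be reproduced with a small timing harness along the following lines. This is a minimal sketch, not the script used for the table: the helper names (`best_time`, `run_fft_benchmarks`), the input dtypes (complex128 for `fft`, float64 for `rfft` and `fft2`), the repeat count, and the best-of-N timing strategy are all assumptions.

```python
import time

import numpy as np


def best_time(func, *args, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def run_fft_benchmarks(n=10_000_000, side=4096, repeats=3, seed=0):
    """Time numpy.fft on 1-D and 2-D inputs (sizes and dtypes are assumptions)."""
    rng = np.random.default_rng(seed)
    real_1d = rng.standard_normal(n)
    complex_1d = real_1d + 1j * rng.standard_normal(n)
    real_2d = rng.standard_normal((side, side))
    return {
        "fft": best_time(np.fft.fft, complex_1d, repeats=repeats),
        "rfft": best_time(np.fft.rfft, real_1d, repeats=repeats),
        "fft2": best_time(np.fft.fft2, real_2d, repeats=repeats),
    }
```

Running `run_fft_benchmarks()` with the defaults in both the container and intelpython, then dividing the container time by the intelpython time for each key, gives figures comparable to the "MKL Speedup" column.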
Summary
- FFT dominates: 7-9x difference via numpy.fft. This is the likely source of Martin's observed slowdowns.
- Decompositions (SVD, QR, eigh): MKL is 1.6-2.0x faster.
- Basic operations (matmul, solve, sort): matmul is equivalent, while solve (0.06s vs 0.19s) and sort (0.03s vs 0.05s) are faster in the container.
- scipy.fft: identical performance (both environments use pocketfft regardless of the BLAS backend).
pyfftw is not a drop-in fix
We tested pyfftw as a potential solution. The drop-in interface (pyfftw.interfaces.numpy_fft) is actually slower than numpy.fft. Only explicit use of pyfftw's native API with FFTW_MEASURE planning achieves competitive speeds (0.17s vs MKL's 0.11s), and adopting it requires code refactoring.
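To illustrate the kind of refactoring the native API implies, here is a minimal sketch of a planned 1-D transform. The helper name `make_planned_fft` is hypothetical, and the fallback to numpy.fft when pyfftw is not installed is added only so the snippet runs anywhere. The point is that the plan is built once for fixed-size, aligned buffers, and calling code must copy its data into those buffers instead of passing arbitrary arrays.

```python
import numpy as np

try:
    import pyfftw
except ImportError:  # assumption: fall back to numpy so the sketch still runs
    pyfftw = None


def make_planned_fft(n, effort="FFTW_MEASURE"):
    """Return (input_buffer, run) for a length-n complex forward FFT.

    Fill `input_buffer` with data, then call `run()` to get the transform.
    """
    if pyfftw is None:
        buf = np.empty(n, dtype="complex128")
        return buf, lambda: np.fft.fft(buf)

    # Aligned buffers let FFTW use SIMD code paths.
    buf = pyfftw.empty_aligned(n, dtype="complex128")
    out = pyfftw.empty_aligned(n, dtype="complex128")
    # FFTW_MEASURE times several candidate algorithms while planning and
    # overwrites the buffers in the process, so fill `buf` only afterwards.
    plan = pyfftw.FFTW(buf, out, direction="FFTW_FORWARD", flags=(effort,))
    return buf, plan  # calling plan() executes the FFT and returns `out`


# Usage: plan once, then reuse the plan for every array of the same size.
buf, run = make_planned_fft(4096)
rng = np.random.default_rng(0)
signal = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)
buf[:] = signal
spectrum = run()
```

The planning cost is paid once per array shape, so this pattern only helps when the same transform size is reused many times. With pyfftw, `run()` returns the same output buffer on every call, so copy the result if it must outlive the next transform.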
Options
None of these are great:
- Build an Intel MKL-based container — the official Intel Python image uses conda, which conflicts with our pip-based stack. Would require significant rework.
- Get sp_validation running on Candide outside the container — use intelpython directly for compute-heavy work. Loses the reproducibility and tooling benefits of the container.
- Take the performance hit — accept 2-5x slower FFT-heavy workloads in the container.