Container performance vs intelpython: FFT is the bottleneck #175

Martin reported 2-5x slowdowns when running benchmarks in the Apptainer container compared to intelpython. We investigated and found the primary cause: Intel MKL's FFT implementation is 7-9x faster than pocketfft, the backend that pip-installed numpy uses for `numpy.fft`.

### Benchmark Results

| Operation | Container (OpenBLAS) | intelpython (MKL) | MKL Speedup (container time / MKL time) |
|-----------|---------------------|-------------------|-------------|
| **FFT** | | | |
| fft 10M | 0.78s | 0.11s | **7.5x** |
| rfft 10M | 0.40s | 0.05s | **9.0x** |
| fft2 4096² | 1.33s | 0.16s | **8.5x** |
| **BLAS/LAPACK** | | | |
| matmul 2000² | 0.022s | 0.023s | 1.0x |
| SVD 1500² | 1.65s | 0.84s | **2.0x** |
| QR 2000² | 0.83s | 0.43s | **1.9x** |
| lstsq 2000×1000 | 0.58s | 0.26s | **2.2x** |
| eigh 1500² | 0.43s | 0.26s | **1.6x** |
| solve 1500² | 0.06s | 0.19s | 0.3x (OpenBLAS wins) |
| **Other** | | | |
| scipy.fft 10M | 0.44s | 0.45s | 1.0x |
| std 10M | 0.07s | 0.03s | **2.1x** |
| sort 1M | 0.03s | 0.05s | 0.7x (container wins) |
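
The issue does not say how these timings were collected. As a minimal sketch for re-running a few of the rows in both environments, the harness below takes the best of several wall-clock runs. The function names `best_of` and `run_benchmarks`, the repeat count, and the choice of rows are assumptions for illustration, not the script that produced the table.

```python
import time

import numpy as np


def best_of(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - start)
    return min(times)


def run_benchmarks(n=10_000_000, repeats=5, seed=0):
    """Time a few of the 1-D operations from the table on n random doubles."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    return {
        "fft": best_of(np.fft.fft, x, repeats=repeats),
        "rfft": best_of(np.fft.rfft, x, repeats=repeats),
        "std": best_of(np.std, x, repeats=repeats),
    }


if __name__ == "__main__":
    for name, seconds in run_benchmarks().items():
        print(f"{name:>5}: {seconds:.3f}s")
```

Running the same file once inside the container and once under intelpython gives directly comparable numbers, provided both runs use the same machine and thread settings.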

### Summary

- **FFT dominates**: 7-9x difference via `numpy.fft`. This is the likely source of Martin's observed slowdowns.
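- **Decompositions and least squares** (SVD, QR, lstsq, eigh): MKL is 1.6-2.2x faster
- **Basic operations**: matmul is equivalent; solve and sort favor the container (roughly 3x and 1.4x faster there); std is 2.1x faster with MKL
- **scipy.fft**: identical performance in both environments (both use pocketfft, regardless of the BLAS backend)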
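
When comparing environments it helps to confirm which BLAS/LAPACK library numpy was actually built against. The sketch below captures the output of `np.show_config()` and applies a crude keyword match. The helper names and the keyword list are assumptions, and the match can be fooled by unrelated paths in the build info, so treat the result as a hint only.

```python
import contextlib
import io

import numpy as np


def numpy_build_info():
    """Return the text that np.show_config() prints to stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()


def guess_blas_backend(info):
    """Crude heuristic: return the first known backend name found in `info`."""
    text = info.lower()
    for name in ("mkl", "openblas", "accelerate", "blis"):
        if name in text:
            return name
    return "unknown"


if __name__ == "__main__":
    print("BLAS backend (best guess):", guess_blas_backend(numpy_build_info()))
```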

### pyfftw is not a drop-in fix

We tested pyfftw as a potential solution. The drop-in interface (`pyfftw.interfaces.numpy_fft`) is actually *slower* than numpy.fft. Only explicit use of pyfftw's native API with `FFTW_MEASURE` planning achieves competitive speeds (0.17s vs MKL's 0.11s), and adopting it requires refactoring the calling code.
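
To illustrate the kind of refactoring involved, here is a minimal sketch of the native-API pattern: allocate aligned buffers, plan once with `FFTW_MEASURE`, then reuse the plan for every transform of the same length. The wrapper name `make_fft_plan` and the fallback to `numpy.fft.fft` when pyfftw is not installed are assumptions for illustration; this is not the code used for the 0.17s measurement.

```python
import numpy as np

try:
    import pyfftw
except ImportError:  # pyfftw is optional in this sketch
    pyfftw = None


def make_fft_plan(n, threads=1):
    """Return a callable computing the forward complex FFT of a length-n array.

    Uses a pre-planned pyfftw.FFTW object when pyfftw is available,
    otherwise falls back to numpy.fft.fft.
    """
    if pyfftw is None:
        return np.fft.fft

    # Planning with FFTW_MEASURE may overwrite the buffers, so they are
    # allocated and planned before any real data is written into them.
    a = pyfftw.empty_aligned(n, dtype="complex128")
    b = pyfftw.empty_aligned(n, dtype="complex128")
    plan = pyfftw.FFTW(
        a, b, direction="FFTW_FORWARD", flags=("FFTW_MEASURE",), threads=threads
    )

    def run(x):
        a[:] = x              # copy the input into the planned buffer
        return plan().copy()  # copy out, since `b` is reused on every call

    return run


if __name__ == "__main__":
    fft = make_fft_plan(1 << 16)
    x = np.random.default_rng(0).standard_normal(1 << 16)
    print(np.allclose(fft(x), np.fft.fft(x)))
```

The planning cost is paid once per array length, so this only helps code paths that transform many arrays of the same shape, which is why call sites have to be restructured around a reusable plan.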

### Options

None of these are great:

1. **Build an Intel MKL-based container** — the official Intel Python image uses conda, which conflicts with our pip-based stack. Would require significant rework.
2. **Get sp_validation running on Candide outside the container** — use intelpython directly for compute-heavy work. Loses the reproducibility and tooling benefits of the container.
3. **Take the performance hit** — accept 2-5x slower FFT-heavy workloads in the container.
