Skip to content

[BUG] matmul produces incorrect results for F-contiguous / non-row-major inputs #89

@SwayamInSync

Description

@SwayamInSync

np.matmul on QuadPrecision arrays returns incorrect results whenever either input is F-contiguous (or otherwise not row-major-contiguous in its last two axes).

Reproducer

click to expand
import numpy as np
from numpy_quaddtype import QuadPrecision, QuadPrecDType

dt = QuadPrecDType(backend='sleef')
def Q(v): return QuadPrecision(str(float(v)), backend='sleef')

A = np.asfortranarray(np.array([[Q(1), Q(2), Q(3)],
                                [Q(4), Q(5), Q(6)]], dtype=dt))
B = np.asfortranarray(np.array([[Q(1), Q(2)],
                                [Q(3), Q(4)],
                                [Q(5), Q(6)]], dtype=dt))

# A is F-contig: strides=(16, 32); B is F-contig: strides=(16, 48)
print("quad F@F:", [[float(v) for v in r] for r in (A @ B)])
# -> [[23.0, 27.0], [35.0, 32.0]]   <- WRONG

# Same call against native dtypes works fine — proving the bug is on our side:
A32 = np.asfortranarray(np.array([[1., 2., 3.], [4., 5., 6.]], dtype=np.float32))
B32 = np.asfortranarray(np.array([[1., 2.], [3., 4.], [5., 6.]], dtype=np.float32))
print("float32 F@F:", (A32 @ B32).tolist())
# -> [[22.0, 28.0], [49.0, 64.0]]   <- correct

# Workaround: force C-contig before matmul
print("quad (C@C):",
      [[float(v) for v in r] for r in (np.ascontiguousarray(A) @ np.ascontiguousarray(B))])
# -> [[22.0, 28.0], [49.0, 64.0]]   <- correct

Root cause

quad_matmul_strided_loop_aligned in src/csrc/umath/matmul.cpp unconditionally:

size_t lda = A_row_stride / sizeof(Sleef_quad);
// ...
qblas_gemm('R', 'N', 'N', m, p, n, &alpha, A_ptr, lda, B_ptr, ldb,
           &beta, C_ptr, ldc_numpy);

This silently assumes:

  • the input is row-major contiguous in its core dimensions and the right layout argument to QBLAS is always 'R'.

Possible fixes

  • Inspect each operand's last-two-axis strides: dispatch directly to qblas_gemm('R', ...) or ('C', ...) when both operands share a contiguous layout, otherwise memcpy the offending operand(s) into a row-major temp buffer (same trick the unaligned path already uses) and call qblas_gemm('R', ...).
  • Performance (needs QBLAS-side change too). For mixed-layout inputs (R @ C and C @ R) the copy above can be skipped by passing transa/transb='T' instead, which would require qblas_gemm to gain transpose support. See [FEAT] Support transpose flags transa, transb != 'N' in GEMM SwayamInSync/QBLAS#13

Related tests

On the #88 two tests already exercise this bug and are marked xfail:

  • tests/test_dot.py::TestMatmulBatched::test_batched_fortran_order
  • tests/test_dot.py::TestMatmulBatched::test_batched_transposed_view

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions