
Add optional core lib features to wheel build #3004

Draft

ksivaman wants to merge 1 commit into NVIDIA:main from ksivaman:expand_wheel_builds

Conversation

@ksivaman
Member

Description

Update wheel builds to include all features that can be enabled via a source build.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Enable NVTE_WITH_CUSOLVERMP, NVTE_WITH_CUBLASMP, NVTE_ENABLE_NVSHMEM, and NVTE_UB_WITH_MPI in the core lib wheel (a source-build equivalent is sketched below).
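
For reference, a hypothetical source-build invocation enabling the same features. The flag names come from this PR; reading them as build-time environment variables matches how the wheel script exports them, but the exact install command and MPI_HOME path are illustrative, not taken from the diff:

# Illustrative source build with the same optional features enabled.
# MPI_HOME here assumes the /opt/mpi layout the Dockerfiles set up.
NVTE_WITH_CUSOLVERMP=1 \
NVTE_WITH_CUBLASMP=1 \
NVTE_ENABLE_NVSHMEM=1 \
NVTE_UB_WITH_MPI=1 \
MPI_HOME=/opt/mpi \
pip install .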

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: ksivamani <ksivamani@nvidia.com>
@ksivaman ksivaman requested review from cyanguwa, denera and mk-61 May 17, 2026 01:37
@ksivaman ksivaman marked this pull request as draft May 17, 2026 01:38
@greptile-apps
Contributor

greptile-apps Bot commented May 17, 2026

Greptile Summary

This PR updates the wheel build pipeline to include optional core library features — NVTE_WITH_CUSOLVERMP, NVTE_WITH_CUBLASMP, NVTE_ENABLE_NVSHMEM, and NVTE_UB_WITH_MPI — that were previously only available in source builds.

  • Dockerfiles (x86 + aarch64): OpenMPI is installed from the system package manager, symlinked under /opt/mpi, and exposed via PATH, LD_LIBRARY_PATH, and the new MPI_HOME env variable.
  • build_wheels.sh: The three NVIDIA Python packages (nvidia-cublasmp, nvidia-cusolvermp, nvidia-nvshmem) are pip-installed, their HOME paths are derived from site-packages, unversioned .so stubs are created for linker compatibility, and the four feature flags are exported before the build begins (see the sketch after this list).
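
A minimal sketch of that sequence, assuming the site-packages/nvidia/<package-name>/cu<ver>/ layout described in the review comments below. Package names and loop structure are illustrative, not the script's exact code:

# Install the optional-dependency packages (names assume the
# cu${CUDA_MAJOR}-suffixed PyPI wheels).
pip install "nvidia-cublasmp-cu${CUDA_MAJOR}" "nvidia-cusolvermp-cu${CUDA_MAJOR}" "nvidia-nvshmem-cu${CUDA_MAJOR}"

# Derive each package's HOME from site-packages.
SITE_PACKAGES=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
export CUBLASMP_HOME="${SITE_PACKAGES}/nvidia/cublasmp/cu${CUDA_MAJOR}"
export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cusolvermp/cu${CUDA_MAJOR}"
export NVSHMEM_HOME="${SITE_PACKAGES}/nvidia/nvshmem/cu${CUDA_MAJOR}"

# Create unversioned .so stubs so -l<name> resolves at link time.
for home in "$CUBLASMP_HOME" "$CUSOLVERMP_HOME" "$NVSHMEM_HOME"; do
    lib_dir="${home}/lib"
    [ -d "$lib_dir" ] || continue        # note: a wrong HOME path is skipped silently
    for so in "$lib_dir"/*.so.*; do
        [ -e "$so" ] || continue
        ln -sf "$(basename "$so")" "${so%%.so.*}.so"
    done
done

# Enable the optional features for the wheel build.
export NVTE_WITH_CUSOLVERMP=1 NVTE_WITH_CUBLASMP=1 NVTE_ENABLE_NVSHMEM=1 NVTE_UB_WITH_MPI=1

The silent `continue` on a missing lib directory is exactly what makes the CUSOLVERMP_HOME path bug described below easy to miss.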

Confidence Score: 3/5

The build script has a likely path bug in CUSOLVERMP_HOME that would silently produce a wheel missing cuSolverMP support.

CUSOLVERMP_HOME is set to a path omitting the cusolvermp package-name segment, so the .so symlink loop silently skips its lib directory and the linker won't find cuSolverMP at build time even though NVTE_WITH_CUSOLVERMP=1 is exported.

build_tools/wheel_utils/build_wheels.sh — specifically the CUSOLVERMP_HOME path on line 34.

Important Files Changed

Filename | Overview
build_tools/wheel_utils/build_wheels.sh | Adds pip install of nvidia-cublasmp, nvidia-cusolvermp, nvidia-nvshmem; derives HOME paths from site-packages; creates unversioned .so symlinks; exports NVTE_WITH_* feature flags — but CUSOLVERMP_HOME is missing the cusolvermp path segment, likely breaking cuSolverMP linkage.
build_tools/wheel_utils/Dockerfile.x86 | Installs openmpi/openmpi-devel, creates /opt/mpi symlinks, updates PATH/LD_LIBRARY_PATH and sets MPI_HOME; missing ldconfig call after writing the ld.so.conf.d entry.
build_tools/wheel_utils/Dockerfile.aarch | Mirrors x86 Dockerfile changes for aarch64: OpenMPI install, /opt/mpi symlinks with aarch64-specific include path, and updated environment — same missing ldconfig call.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Docker image build] --> B[Install CUDA toolkit + cuDNN]
    B --> C[Install OpenMPI via dnf]
    C --> D[Symlink /opt/mpi + update ld.so.conf.d]
    D --> E[Set PATH / LD_LIBRARY_PATH / MPI_HOME]
    E --> F[Run build_wheels.sh]
    F --> G[pip install cmake / pybind11 / ninja / wheel]
    G --> H[pip install nvidia-cublasmp / cusolvermp / nvshmem]
    H --> I[Derive CUBLASMP_HOME / CUSOLVERMP_HOME / NVSHMEM_HOME from site-packages]
    I --> J[Create unversioned .so symlinks in each lib dir]
    J --> K[Export feature flags]
    K --> L{Build targets}
    L --> M[Metapackage wheel]
    L --> N[Common core wheel]
    L --> O[PyTorch sdist]
    L --> P[JAX sdist]

Reviews (1): Last reviewed commit: "Add optional core features to wheel buil..."


SITE_PACKAGES=$(/opt/python/cp310-cp310/bin/python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
export CUBLASMP_HOME="${SITE_PACKAGES}/nvidia/cublasmp/cu${CUDA_MAJOR}"
export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR}"

P1 Likely incorrect CUSOLVERMP_HOME path

The path ${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR} is missing the package-name segment. Every other NVIDIA Python package follows the layout site-packages/nvidia/<package-name>/cu<ver>/ — for example, nvidia-cublasmp-cu12 installs under nvidia/cublasmp/cu12/, so nvidia-cusolvermp-cu12 should install under nvidia/cusolvermp/cu12/. With the current path the .so symlink loop silently skips cuSolverMP's lib/ directory ([ -d "$lib_dir" ] || continue), no unversioned .so stubs are created, and the linker will not find cuSolverMP at build time even though NVTE_WITH_CUSOLVERMP=1 is exported.
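
A minimal fix under that assumption, mirroring the cublasmp export directly above it (the layout is inferred from the convention the comment describes, so the installed tree should be verified first):

# Assumed fix: add the missing package-name segment, matching the
# site-packages/nvidia/<package-name>/cu<ver>/ layout described above.
export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cusolvermp/cu${CUDA_MAJOR}"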

Comment on lines +39 to +43
RUN mkdir -p /opt/mpi && \
ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
ln -s /usr/include/openmpi-x86_64 /opt/mpi/include && \
echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-x86_64.conf

P2 After writing to /etc/ld.so.conf.d/openmpi-x86_64.conf, ldconfig should be called in the same RUN layer to update the dynamic linker cache. Without it, tools that depend on the ldconfig cache (rather than LD_LIBRARY_PATH) will not find the OpenMPI libraries at build time inside the container.

Suggested change
-RUN mkdir -p /opt/mpi && \
-    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
-    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
-    ln -s /usr/include/openmpi-x86_64 /opt/mpi/include && \
-    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-x86_64.conf
+RUN mkdir -p /opt/mpi && \
+    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
+    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
+    ln -s /usr/include/openmpi-x86_64 /opt/mpi/include && \
+    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-x86_64.conf && \
+    ldconfig
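
A quick sanity check, illustrative and not part of the diff: after the layer builds, the OpenMPI libraries should appear in the dynamic linker cache inside the image.

# Inside the built container: list the linker cache and look for OpenMPI.
ldconfig -p | grep libmpi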

Comment on lines +39 to +43
RUN mkdir -p /opt/mpi && \
ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
ln -s /usr/include/openmpi-aarch64 /opt/mpi/include && \
echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-aarch64.conf

P2 Same as Dockerfile.x86: ldconfig should be called after appending to the /etc/ld.so.conf.d/ file so the dynamic linker cache is updated within the same Docker layer.

Suggested change
-RUN mkdir -p /opt/mpi && \
-    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
-    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
-    ln -s /usr/include/openmpi-aarch64 /opt/mpi/include && \
-    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-aarch64.conf
+RUN mkdir -p /opt/mpi && \
+    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
+    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
+    ln -s /usr/include/openmpi-aarch64 /opt/mpi/include && \
+    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-aarch64.conf && \
+    ldconfig
