
Add optional core lib features to wheel build #3004

Draft

ksivaman wants to merge 1 commit into NVIDIA:main from ksivaman:expand_wheel_builds

Conversation

@ksivaman
Member

Description

Update wheel builds to include all features that can be enabled via a source build.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Enable NVTE_WITH_CUSOLVERMP, NVTE_WITH_CUBLASMP, NVTE_ENABLE_NVSHMEM, and NVTE_UB_WITH_MPI in the core lib wheel (a source-build equivalent is sketched below).
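
For reference, a hypothetical source-build invocation enabling the same features. The flag names come from this PR; reading them as build-time environment variables matches how the wheel script exports them, but the exact install command and MPI_HOME path are illustrative, not taken from the diff:

# Illustrative source build with the same optional features enabled.
# MPI_HOME here assumes the /opt/mpi layout the Dockerfiles set up.
NVTE_WITH_CUSOLVERMP=1 \
NVTE_WITH_CUBLASMP=1 \
NVTE_ENABLE_NVSHMEM=1 \
NVTE_UB_WITH_MPI=1 \
MPI_HOME=/opt/mpi \
pip install .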

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: ksivamani <ksivamani@nvidia.com>
@ksivaman ksivaman requested review from cyanguwa, denera and mk-61 May 17, 2026 01:37
@ksivaman ksivaman marked this pull request as draft May 17, 2026 01:38
@greptile-apps
Contributor

greptile-apps Bot commented May 17, 2026

Greptile Summary

This PR updates the wheel build pipeline to include optional core library features — NVTE_WITH_CUSOLVERMP, NVTE_WITH_CUBLASMP, NVTE_ENABLE_NVSHMEM, and NVTE_UB_WITH_MPI — that were previously only available in source builds.

  • Dockerfiles (x86 + aarch64): OpenMPI is installed from the system package manager, symlinked under /opt/mpi, and exposed via PATH, LD_LIBRARY_PATH, and the new MPI_HOME env variable.
  • build_wheels.sh: The three NVIDIA Python packages (nvidia-cublasmp, nvidia-cusolvermp, nvidia-nvshmem) are pip-installed, their HOME paths are derived from site-packages, unversioned .so stubs are created for linker compatibility, and the four feature flags are exported before the build begins (see the sketch after this list).
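
A minimal sketch of that sequence, assuming the site-packages/nvidia/<package-name>/cu<ver>/ layout described in the review comments below. Package names and loop structure are illustrative, not the script's exact code:

# Install the optional-dependency packages (names assume the
# cu${CUDA_MAJOR}-suffixed PyPI wheels).
pip install "nvidia-cublasmp-cu${CUDA_MAJOR}" "nvidia-cusolvermp-cu${CUDA_MAJOR}" "nvidia-nvshmem-cu${CUDA_MAJOR}"

# Derive each package's HOME from site-packages.
SITE_PACKAGES=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
export CUBLASMP_HOME="${SITE_PACKAGES}/nvidia/cublasmp/cu${CUDA_MAJOR}"
export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cusolvermp/cu${CUDA_MAJOR}"
export NVSHMEM_HOME="${SITE_PACKAGES}/nvidia/nvshmem/cu${CUDA_MAJOR}"

# Create unversioned .so stubs so -l<name> resolves at link time.
for home in "$CUBLASMP_HOME" "$CUSOLVERMP_HOME" "$NVSHMEM_HOME"; do
    lib_dir="${home}/lib"
    [ -d "$lib_dir" ] || continue        # note: a wrong HOME path is skipped silently
    for so in "$lib_dir"/*.so.*; do
        [ -e "$so" ] || continue
        ln -sf "$(basename "$so")" "${so%%.so.*}.so"
    done
done

# Enable the optional features for the wheel build.
export NVTE_WITH_CUSOLVERMP=1 NVTE_WITH_CUBLASMP=1 NVTE_ENABLE_NVSHMEM=1 NVTE_UB_WITH_MPI=1

The silent `continue` on a missing lib directory is exactly what makes the CUSOLVERMP_HOME path bug described below easy to miss.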

Confidence Score: 3/5

The build script has a likely path bug in CUSOLVERMP_HOME that would silently produce a wheel missing cuSolverMP support.

CUSOLVERMP_HOME is set to a path omitting the cusolvermp package-name segment, so the .so symlink loop silently skips its lib directory and the linker won't find cuSolverMP at build time even though NVTE_WITH_CUSOLVERMP=1 is exported.

build_tools/wheel_utils/build_wheels.sh — specifically the CUSOLVERMP_HOME path on line 34.

Important Files Changed

Filename | Overview
build_tools/wheel_utils/build_wheels.sh | Adds pip install of nvidia-cublasmp, nvidia-cusolvermp, nvidia-nvshmem; derives HOME paths from site-packages; creates unversioned .so symlinks; exports NVTE_WITH_* feature flags — but CUSOLVERMP_HOME is missing the cusolvermp path segment, likely breaking cuSolverMP linkage.
build_tools/wheel_utils/Dockerfile.x86 | Installs openmpi/openmpi-devel, creates /opt/mpi symlinks, updates PATH/LD_LIBRARY_PATH and sets MPI_HOME; missing ldconfig call after writing the ld.so.conf.d entry.
build_tools/wheel_utils/Dockerfile.aarch | Mirrors x86 Dockerfile changes for aarch64: OpenMPI install, /opt/mpi symlinks with aarch64-specific include path, and updated environment — same missing ldconfig call.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Docker image build] --> B[Install CUDA toolkit + cuDNN]
    B --> C[Install OpenMPI via dnf]
    C --> D[Symlink /opt/mpi + update ld.so.conf.d]
    D --> E[Set PATH / LD_LIBRARY_PATH / MPI_HOME]
    E --> F[Run build_wheels.sh]
    F --> G[pip install cmake / pybind11 / ninja / wheel]
    G --> H[pip install nvidia-cublasmp / cusolvermp / nvshmem]
    H --> I[Derive CUBLASMP_HOME / CUSOLVERMP_HOME / NVSHMEM_HOME from site-packages]
    I --> J[Create unversioned .so symlinks in each lib dir]
    J --> K[Export feature flags]
    K --> L{Build targets}
    L --> M[Metapackage wheel]
    L --> N[Common core wheel]
    L --> O[PyTorch sdist]
    L --> P[JAX sdist]

Reviews (1): Last reviewed commit: "Add optional core features to wheel buil..."


SITE_PACKAGES=$(/opt/python/cp310-cp310/bin/python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
export CUBLASMP_HOME="${SITE_PACKAGES}/nvidia/cublasmp/cu${CUDA_MAJOR}"
export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR}"

P1 Likely incorrect CUSOLVERMP_HOME path

The path ${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR} is missing the package-name segment. Every other NVIDIA Python package follows the layout site-packages/nvidia/<package-name>/cu<ver>/ — for example, nvidia-cublasmp-cu12 installs under nvidia/cublasmp/cu12/, so nvidia-cusolvermp-cu12 should install under nvidia/cusolvermp/cu12/. With the current path the .so symlink loop silently skips cuSolverMP's lib/ directory ([ -d "$lib_dir" ] || continue), no unversioned .so stubs are created, and the linker will not find cuSolverMP at build time even though NVTE_WITH_CUSOLVERMP=1 is exported.
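
A minimal fix under that assumption, mirroring the cublasmp export directly above it (the layout is inferred from the convention the comment describes, so the installed tree should be verified first):

# Assumed fix: add the missing package-name segment, matching the
# site-packages/nvidia/<package-name>/cu<ver>/ layout described above.
export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cusolvermp/cu${CUDA_MAJOR}"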

Comment on lines +39 to +43
RUN mkdir -p /opt/mpi && \
ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
ln -s /usr/include/openmpi-x86_64 /opt/mpi/include && \
echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-x86_64.conf

P2 After writing to /etc/ld.so.conf.d/openmpi-x86_64.conf, ldconfig should be called in the same RUN layer to update the dynamic linker cache. Without it, tools that depend on the ldconfig cache (rather than LD_LIBRARY_PATH) will not find the OpenMPI libraries at build time inside the container.

Suggested change
-RUN mkdir -p /opt/mpi && \
-    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
-    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
-    ln -s /usr/include/openmpi-x86_64 /opt/mpi/include && \
-    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-x86_64.conf
+RUN mkdir -p /opt/mpi && \
+    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
+    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
+    ln -s /usr/include/openmpi-x86_64 /opt/mpi/include && \
+    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-x86_64.conf && \
+    ldconfig
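
A quick sanity check, illustrative and not part of the diff: after the layer builds, the OpenMPI libraries should appear in the dynamic linker cache inside the image.

# Inside the built container: list the linker cache and look for OpenMPI.
ldconfig -p | grep libmpi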

Comment on lines +39 to +43
RUN mkdir -p /opt/mpi && \
ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
ln -s /usr/include/openmpi-aarch64 /opt/mpi/include && \
echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-aarch64.conf

P2 Same as Dockerfile.x86: ldconfig should be called after appending to the /etc/ld.so.conf.d/ file so the dynamic linker cache is updated within the same Docker layer.

Suggested change
-RUN mkdir -p /opt/mpi && \
-    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
-    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
-    ln -s /usr/include/openmpi-aarch64 /opt/mpi/include && \
-    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-aarch64.conf
+RUN mkdir -p /opt/mpi && \
+    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
+    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
+    ln -s /usr/include/openmpi-aarch64 /opt/mpi/include && \
+    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-aarch64.conf && \
+    ldconfig
