Add optional core lib features to wheel build #3004
Signed-off-by: ksivamani <ksivamani@nvidia.com>
Greptile Summary
This PR updates the wheel build pipeline to include optional core library features.
Confidence Score: 3/5
The build script has a likely path bug in CUSOLVERMP_HOME that would silently produce a wheel missing cuSolverMP support. CUSOLVERMP_HOME is set to a path that omits the cusolvermp package-name segment, so the .so symlink loop silently skips its lib directory and the linker won't find cuSolverMP at build time even though NVTE_WITH_CUSOLVERMP=1 is exported. See build_tools/wheel_utils/build_wheels.sh, specifically the CUSOLVERMP_HOME path on line 34.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Docker image build] --> B[Install CUDA toolkit + cuDNN]
B --> C[Install OpenMPI via dnf]
C --> D[Symlink /opt/mpi + update ld.so.conf.d]
D --> E[Set PATH / LD_LIBRARY_PATH / MPI_HOME]
E --> F[Run build_wheels.sh]
F --> G[pip install cmake / pybind11 / ninja / wheel]
G --> H[pip install nvidia-cublasmp / cusolvermp / nvshmem]
H --> I[Derive CUBLASMP_HOME / CUSOLVERMP_HOME / NVSHMEM_HOME from site-packages]
I --> J[Create unversioned .so symlinks in each lib dir]
J --> K[Export feature flags]
K --> L{Build targets}
L --> M[Metapackage wheel]
L --> N[Common core wheel]
L --> O[PyTorch sdist]
L --> P[JAX sdist]
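The "Create unversioned .so symlinks in each lib dir" step in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the loop from build_wheels.sh: `link_unversioned_libs` is a hypothetical helper name, and it assumes each package home contains a `lib/` directory holding versioned files such as `libfoo.so.1`.

```shell
# Hypothetical helper (not taken from build_wheels.sh): for every package home
# passed as an argument, create libNAME.so symlinks next to libNAME.so.N files.
link_unversioned_libs() {
    local home lib_dir lib base
    for home in "$@"; do
        lib_dir="${home}/lib"
        # Same guard the review quotes: a missing lib dir is skipped silently.
        [ -d "$lib_dir" ] || continue
        for lib in "$lib_dir"/lib*.so.*; do
            [ -e "$lib" ] || continue            # glob matched nothing
            base="$(basename "$lib")"
            base="${base%%.so.*}.so"             # libfoo.so.1.2 -> libfoo.so
            # Relative link target keeps the symlink valid if the tree is moved.
            [ -e "${lib_dir}/${base}" ] || ln -s "$(basename "$lib")" "${lib_dir}/${base}"
        done
    done
}
```

Called as `link_unversioned_libs "$CUBLASMP_HOME" "$CUSOLVERMP_HOME" "$NVSHMEM_HOME"`, a wrong home path produces no error here, which is why a bad path can go unnoticed until link time.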
SITE_PACKAGES=$(/opt/python/cp310-cp310/bin/python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
export CUBLASMP_HOME="${SITE_PACKAGES}/nvidia/cublasmp/cu${CUDA_MAJOR}"
export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR}"
Likely incorrect CUSOLVERMP_HOME path
The path ${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR} is missing the package-name segment. Every other NVIDIA Python package follows the layout site-packages/nvidia/<package-name>/cu<ver>/ — for example, nvidia-cublasmp-cu12 installs under nvidia/cublasmp/cu12/, so nvidia-cusolvermp-cu12 should install under nvidia/cusolvermp/cu12/. With the current path the .so symlink loop silently skips cuSolverMP's lib/ directory ([ -d "$lib_dir" ] || continue), no unversioned .so stubs are created, and the linker will not find cuSolverMP at build time even though NVTE_WITH_CUSOLVERMP=1 is exported.
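One way to keep a wrong home path from being skipped silently is to check each directory before the symlink loop runs. The sketch below is only an illustration under stated assumptions: `require_lib_dir` is a hypothetical helper that does not exist in the PR, and it assumes each `*_HOME` is expected to contain a `lib/` subdirectory.

```shell
# Hypothetical helper (not part of the PR): fail loudly when a *_HOME variable
# points at a directory without lib/, instead of letting the loop skip it.
require_lib_dir() {
    local name="$1" home="$2"
    if [ ! -d "${home}/lib" ]; then
        echo "error: ${name}=${home} has no lib/ directory" >&2
        return 1
    fi
}

# Intended use in a build script (commented out so the sketch has no side effects):
#   require_lib_dir CUBLASMP_HOME   "$CUBLASMP_HOME"   || exit 1
#   require_lib_dir CUSOLVERMP_HOME "$CUSOLVERMP_HOME" || exit 1
```

With a check like this, the path question raised above would surface as an immediate build failure rather than a wheel that lacks cuSolverMP support.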
RUN mkdir -p /opt/mpi && \
    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
    ln -s /usr/include/openmpi-x86_64 /opt/mpi/include && \
    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-x86_64.conf
After writing to /etc/ld.so.conf.d/openmpi-x86_64.conf, ldconfig should be called in the same RUN layer to update the dynamic linker cache. Without it, tools that depend on the ldconfig cache (rather than LD_LIBRARY_PATH) will not find the OpenMPI libraries at build time inside the container.
RUN mkdir -p /opt/mpi && \
    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
    ln -s /usr/include/openmpi-x86_64 /opt/mpi/include && \
    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-x86_64.conf && \
    ldconfig
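To confirm that the cache refresh took effect, one option is to search the output of `ldconfig -p` for the MPI library. The following is a hedged sketch: `lib_in_cache_listing` is a hypothetical helper, and the library name `libmpi.so` is an assumption about what the OpenMPI package installs.

```shell
# Hypothetical helper: reads `ldconfig -p` style output on stdin and succeeds
# if any entry contains the given library name (fixed-string match).
lib_in_cache_listing() {
    grep -qF -- "$1"
}

# Intended use inside the container (commented out; assumes glibc ldconfig):
#   ldconfig -p | lib_in_cache_listing "libmpi.so" \
#       || echo "libmpi.so missing from the ld.so cache" >&2
```

Reading the listing from stdin keeps the helper easy to exercise outside the image, since the cache contents can be supplied as plain text.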
RUN mkdir -p /opt/mpi && \
    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
    ln -s /usr/include/openmpi-aarch64 /opt/mpi/include && \
    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-aarch64.conf
Same as Dockerfile.x86: ldconfig should be called after appending to the /etc/ld.so.conf.d/ file so the dynamic linker cache is updated within the same Docker layer.
RUN mkdir -p /opt/mpi && \
    ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
    ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
    ln -s /usr/include/openmpi-aarch64 /opt/mpi/include && \
    echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-aarch64.conf && \
    ldconfig
Description
Update wheel builds to include all features that can be enabled via a source build.
Type of change
Changes
Enable NVTE_WITH_CUSOLVERMP, NVTE_WITH_CUBLASMP, NVTE_ENABLE_NVSHMEM, and NVTE_UB_WITH_MPI in the core lib wheel.
Checklist:
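For illustration, the four flags named above might be exported and echoed into the build log as shown below. This is a sketch, not the script from the PR: only NVTE_WITH_CUSOLVERMP=1 is quoted in the review, so the value 1 for the other three flags is an assumption.

```shell
# NVTE_WITH_CUSOLVERMP=1 is quoted in the review comment; setting the other
# three flags to 1 is an assumption made for this sketch.
export NVTE_WITH_CUSOLVERMP=1
export NVTE_WITH_CUBLASMP=1
export NVTE_ENABLE_NVSHMEM=1
export NVTE_UB_WITH_MPI=1

# Print each flag so the build log records which optional features were requested.
for flag in NVTE_WITH_CUSOLVERMP NVTE_WITH_CUBLASMP NVTE_ENABLE_NVSHMEM NVTE_UB_WITH_MPI; do
    eval "value=\${$flag:-unset}"
    echo "${flag}=${value}"
done
```

Logging the values makes it easier to tell, from a wheel build log alone, whether a missing feature was never requested or was requested but failed to link.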