
Use std::optional instead of direct default argument for LaunchParams #5980

Open
rdspring1 wants to merge 1 commit into main from direct_smoke

Conversation

@rdspring1
Collaborator

  • LaunchParams is constructed when nvfuser_direct is loaded because it is a default argument in the KernelExecutor bindings.
  • It uses at::cuda::getCurrentDeviceProperties(), which can cause CUDA error: CUDA driver version is insufficient for CUDA runtime version when importing nvfuser_direct.
  • The smoke_test checks for this, so this PR uses std::optional instead of a direct default argument for LaunchParams.

* LaunchParams is constructed when nvfuser_direct is loaded.
* It uses at::cuda::getCurrentDeviceProperties(), which initializes the
CUDA context.
@rdspring1 rdspring1 added the Direct Bindings Python extension with direct mapping to NvFuser CPP objects. label Feb 18, 2026
@rdspring1
Collaborator Author

!test

@github-actions

github-actions bot commented Feb 18, 2026

Description

  • Changed LaunchParams parameter from direct default to std::optional to avoid CUDA initialization at import time

  • Updated compile() method signature to use std::optional with py::none() default

  • Updated run() method signature to use std::optional with py::none() default

  • Added value_or(LaunchParams()) calls to handle optional parameters while maintaining backward compatibility

Changes walkthrough

Relevant files
Bug fix
runtime.cpp
Make LaunchParams optional to prevent CUDA initialization at import

python/python_direct/runtime.cpp

  • Changed compile() method parameter from const LaunchParams& to
    std::optional<LaunchParams>
  • Changed run() method parameter from const LaunchParams& to
    std::optional<LaunchParams>
  • Updated default arguments from LaunchParams() to py::none() for both
    methods
  • Added value_or(LaunchParams()) calls to handle optional parameters
  • Added explanatory comments about avoiding default LaunchParams
    construction at import time
  • +13/-6   

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    API Consistency

    The changes modify the KernelExecutor API by making launch_constraints optional with std::optional. While this solves the CUDA initialization issue, it changes the method signatures. Ensure that all existing Python code that calls these methods will continue to work correctly with the new optional parameters, and that the default behavior remains identical.

     std::optional<LaunchParams> launch_constraints,
     const CompileParams& compile_params,
     SchedulerType scheduler_type) {
    // launch_constraints is optional to avoid creating default
    // LaunchParams when importing shared library.
    self.compile(
        fusion,
        from_pyiterable(args),
        launch_constraints.value_or(LaunchParams()),
        compile_params,
        scheduler_type);
    Performance Impact

    The change from direct LaunchParams reference to std::optional adds a layer of indirection. While this is necessary to avoid CUDA initialization during import, verify that the performance impact is minimal and that the value_or() call doesn't introduce significant overhead in the hot path of kernel execution.

              launch_constraints.value_or(LaunchParams()),
              compile_params,
              scheduler_type);
        },
        R"(
            Compile a fusion into a CUDA kernel.
    
            Parameters
            ----------
            fusion : Fusion
                The fusion to compile.
            args : KernelArgumentHolder, optional
                The kernel arguments. If empty, will be populated during run.
            launch_constraints : LaunchParams, optional
                Constraints for kernel launch parameters.
            compile_params : CompileParams, optional
                Parameters for kernel compilation.
            scheduler_type : SchedulerType, optional
                The type of scheduler to use (default: None).
    
            Returns
            -------
            None
          )",
        py::arg("fusion"),
        py::arg("args") = py::list(),
        py::arg("launch_constraints") = py::none(),
        py::arg("compile_params") = CompileParams(),
        py::arg("scheduler_type") = SchedulerType::None)
    .def(
        "run",
        [](KernelExecutor& self,
           const py::iterable& args,
           std::optional<LaunchParams> launch_constraints,
           const CompileParams& compile_params) {
          // launch_constraints is optional to avoid creating default
          // LaunchParams when importing shared library.
          KernelArgumentHolder outputs = self.run(
              from_pyiterable(args),
              {},
              launch_constraints.value_or(LaunchParams()),

    Test failures

    • (Medium, 1) NVFuser validation mismatch in TmaPersistentTest on dlcluster_h100

      Test Name H100 Source
      TmaPersistentTestP.TmaInnerPersistentRmsNorm/__bfloat_2048_5120 Link

    @greptile-apps
    Contributor

    greptile-apps bot commented Feb 18, 2026

    Greptile Summary

    This PR fixes a CUDA driver initialization error that occurs when importing nvfuser_direct on systems where the CUDA driver version is insufficient for the runtime. The root cause is that LaunchParams() default constructor calls assertValid(), which in turn calls at::cuda::getCurrentDeviceProperties() — this requires a working CUDA driver. Previously, LaunchParams() was used as a default argument value in the pybind11 binding definitions for KernelExecutor.compile() and KernelExecutor.run(), meaning a LaunchParams object was constructed at module import time.

    The fix replaces the default LaunchParams() argument with std::optional<LaunchParams> defaulting to py::none(), and uses .value_or(LaunchParams()) inside the lambda bodies. This defers the CUDA context initialization until the methods are actually called, rather than at import time.

    • Changed launch_constraints parameter type from const LaunchParams& to std::optional<LaunchParams> in both compile and run bindings
    • Default argument changed from LaunchParams() to py::none() to avoid constructing a LaunchParams at module load time
    • CompileParams was not changed because its default constructor has no CUDA dependencies

    Confidence Score: 5/5

    • This PR is safe to merge — it is a minimal, well-scoped fix that defers CUDA context initialization from import time to method call time with no behavioral change for callers.
    • The change is small (one file), clearly motivated by a real bug (CUDA driver error on import), and the implementation is correct. The std::optional + value_or pattern is idiomatic and pybind11 handles std::optional with py::none() correctly. No functional behavior changes for users who pass a LaunchParams object explicitly, and the default case produces identical results. The existing test_import_correct() smoke test validates this fix.
    • No files require special attention

    Important Files Changed

    Filename Overview
    python/python_direct/runtime.cpp Changes launch_constraints parameter from const LaunchParams& with LaunchParams() default to std::optional<LaunchParams> with py::none() default in both compile and run bindings. This defers LaunchParams() construction (which triggers at::cuda::getCurrentDeviceProperties()) from module import time to method call time. Clean and correct fix.

    Sequence Diagram

    sequenceDiagram
        participant User as Python User
        participant Mod as nvfuser_direct module
        participant Bind as pybind11 bindings
        participant KE as KernelExecutor
        participant LP as LaunchParams
        participant CUDA as CUDA Runtime
    
        Note over User,CUDA: BEFORE (broken)
        User->>Mod: import nvfuser_direct
        Mod->>Bind: Register KernelExecutor bindings
        Bind->>LP: LaunchParams() [default arg]
        LP->>LP: assertValid()
        LP->>CUDA: getCurrentDeviceProperties()
        CUDA-->>LP: ERROR: driver version insufficient
    
        Note over User,CUDA: AFTER (fixed)
        User->>Mod: import nvfuser_direct
        Mod->>Bind: Register KernelExecutor bindings
        Note right of Bind: Default = py::none() (no LaunchParams created)
        User->>KE: executor.compile(fusion, args)
        KE->>LP: LaunchParams() via value_or()
        LP->>LP: assertValid()
        LP->>CUDA: getCurrentDeviceProperties()
        CUDA-->>LP: OK (CUDA is available at runtime)
    

    Last reviewed commit: 3e40e6c

    Contributor

    @greptile-apps greptile-apps bot left a comment


    1 file reviewed, no comments


    Collaborator

    @wujingyue wujingyue left a comment


    I believe I'm missing the context. Why is this the right fix?

    CUDA error: CUDA driver version is insufficient for CUDA runtime version

    That sounds like a dealbreaker. I don't think any GPU code (including nvfuser) can run with an insufficient driver version.

    @rdspring1
    Collaborator Author

The smoke_test runs on the builder tag, which probably doesn't have very recent GPU drivers.
    If I recall correctly, this is done on purpose. import nvfuser or import nvfuser_direct should not initialize cuda context or directly link to libcuda.so . You could potentially link to libcudart.so or use a runtime load with dlopen.
    The idea is that import nvfuser or nvfuser_direct should do a similar thing like import torch that will not crash when GPU is not available or driver being too old.

    From ^^^ @xwang233

    The smoke test purposefully uses an incorrect driver. You're supposed to be able to import nvfuser_direct without crashing.

    Collaborator

    @wujingyue wujingyue left a comment


    LaunchParams is constructed when nvfuser_direct is loaded because it is a default argument for KernelExecutor.

    Could you say more about this? Is KernelExecutor::compile called accidentally?

    @rdspring1
    Collaborator Author

    Could you say more about this? Is KernelExecutor::compile called accidentally?

    LaunchParams is a default argument for pybind11 binding for KernelExecutor::compile. A LaunchParams object is created when loading shared library for the default argument. This eventually calls LaunchParams::assertValid(), causing the issue.

          .def(
              "compile",
              [](KernelExecutor& self,
                 Fusion* fusion,
                 const py::iterable& args,
                 const LaunchParams& launch_constraints,
                 const CompileParams& compile_params,
                 SchedulerType scheduler_type) {
                self.compile(
                    fusion,
                    from_pyiterable(args),
                    launch_constraints,
                    compile_params,
                    scheduler_type);
              },
              R"(
                  Compile a fusion into a CUDA kernel
    
                  Parameters
                  ----------
                  fusion : Fusion
                      The fusion to compile.
                  args : KernelArgumentHolder, optional
                      The kernel arguments. If empty, will be populated during run.
                  launch_constraints : LaunchParams, optional
                      Constraints for kernel launch parameters.
                  compile_params : CompileParams, optional
                      Parameters for kernel compilation.
                  scheduler_type : SchedulerType, optional
                      The type of scheduler to use (default: None).
    
                  Returns
                  -------
                  None
                )",
              py::arg("fusion"),
              py::arg("args") = py::list(),
              py::arg("launch_constraints") = LaunchParams(),  <<<< Default argument
              py::arg("compile_params") = CompileParams(),
              py::arg("scheduler_type") = SchedulerType::None)

    Gemini Summary:

    In your pybind11 code, the line py::arg("launch_constraints") = LaunchParams() is not just a type declaration; it is a constructor call.
    
    The Root Cause: Eager Default Arguments
    With pybind11, py::arg defaults are evaluated at the time the function is registered (unlike plain C++ default arguments, which are evaluated at each call site). Because this registration happens inside the PYBIND11_MODULE block (which runs during the Python import), the following chain occurs:
    
    Python executes import _C_DIRECT.
    
    pybind11 starts registering the compile method.
    
    To register the default value for launch_constraints, it must instantiate a LaunchParams object.
    
    The LaunchParams constructor (Frame #1) is called.
    
    The constructor calls assertValid() (Frame #0) to ensure the parameters are within hardware limits.
    
    assertValid() tries to query the GPU via LibTorch/CUDA.
    

    Stack trace: when loading from ._C_DIRECT import * shared library

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/opt/pyenv/lib/python3.12/site-packages/nvfuser_direct/__init__.py", line 23, in <module>
        from ._C_DIRECT import *  # noqa: F401,F403
        ^^^^^^^^^^^^^^^^^^^^^^^^
    ImportError: CUDA error: CUDA driver version is insufficient for CUDA runtime version
    Search for `cudaErrorInsufficientDriver' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
    Exception raised from dsa_get_device_count at /pytorch/c10/cuda/CUDADeviceAssertionHost.cpp:60 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x70fab6b71efd in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10.so)
    frame #1: <unknown function> + 0xc0d4 (0x70fab6e9a0d4 in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
    frame #2: c10::cuda::CUDAKernelLaunchRegistry::CUDAKernelLaunchRegistry() + 0x9a (0x70fab6ed2bea in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
    frame #3: c10::cuda::CUDAKernelLaunchRegistry::get_singleton_ref() + 0x4a (0x70fab6ed2eba in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
    frame #4: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x55 (0x70fab6ed3b85 in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
    frame #5: c10::cuda::current_device() + 0x33 (0x70fab6ed4bc3 in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
    frame #6: at::cuda::getCurrentDeviceProperties() + 0x9 (0x70f983291939 in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
    frame #7: nvfuser::LaunchParams::assertValid() + 0x45 (0x70f8d7482dc5 in /opt/pyenv/lib/python3.12/site-packages/nvfuser_direct/../nvfuser_common/lib/libnvfuser_codegen.so)
    frame #8: <unknown function> + 0x13d81c (0x70f8db30681c in /opt/pyenv/lib/python3.12/site-packages/nvfuser_direct/_C_DIRECT.cpython-312-x86_64-linux-gnu.so)
    frame #9: <unknown function> + 0x62dbc (0x70f8db22bdbc in /opt/pyenv/lib/python3.12/site-packages/nvfuser_direct/_C_DIRECT.cpython-312-x86_64-linux-gnu.so)
    frame #10: <unknown function> + 0x581c6 (0x70f8db2211c6 in /opt/pyenv/lib/python3.12/site-packages/nvfuser_direct/_C_DIRECT.cpython-312-x86_64-linux-gnu.so)
    <omitting python frames>

    @rdspring1 rdspring1 requested a review from wujingyue February 19, 2026 05:44
