
Use std::optional instead of direct default argument for LaunchParams #5980

Open
rdspring1 wants to merge 1 commit into main from direct_smoke

Conversation

@rdspring1
Collaborator

  • LaunchParams is constructed when nvfuser_direct is loaded because it is a default argument in the KernelExecutor bindings.
  • It uses at::cuda::getCurrentDeviceProperties(), which can cause CUDA error: CUDA driver version is insufficient for CUDA runtime version when importing nvfuser_direct.
  • The smoke_test checks for this, so this PR uses std::optional instead of a direct default argument for LaunchParams.

* LaunchParams is constructed when nvfuser_direct is loaded.
* It uses at::cuda::getCurrentDeviceProperties(), which initializes the
CUDA context.
@rdspring1 rdspring1 added the Direct Bindings Python extension with direct mapping to NvFuser CPP objects. label Feb 18, 2026
@rdspring1
Collaborator Author

!test

@github-actions

github-actions bot commented Feb 18, 2026

Description

  • Changed LaunchParams parameter from direct default to std::optional to avoid CUDA initialization at import time

  • Updated compile() method signature to use std::optional with py::none() default

  • Updated run() method signature to use std::optional with py::none() default

  • Added value_or(LaunchParams()) calls to handle optional parameters while maintaining backward compatibility

Changes walkthrough

Relevant files
Bug fix
runtime.cpp
Make LaunchParams optional to prevent CUDA initialization at import

python/python_direct/runtime.cpp

  • Changed compile() method parameter from const LaunchParams& to
    std::optional<LaunchParams>
  • Changed run() method parameter from const LaunchParams& to
    std::optional<LaunchParams>
  • Updated default arguments from LaunchParams() to py::none() for both
    methods
  • Added value_or(LaunchParams()) calls to handle optional parameters
  • Added explanatory comments about avoiding default LaunchParams
    construction at import time
  • +13/-6   

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    API Consistency

    The changes modify the KernelExecutor API by making launch_constraints optional with std::optional. While this solves the CUDA initialization issue, it changes the method signatures. Ensure that all existing Python code that calls these methods will continue to work correctly with the new optional parameters, and that the default behavior remains identical.

     std::optional<LaunchParams> launch_constraints,
     const CompileParams& compile_params,
     SchedulerType scheduler_type) {
    // launch_constraints is optional to avoid creating default
    // LaunchParams when importing shared library.
    self.compile(
        fusion,
        from_pyiterable(args),
        launch_constraints.value_or(LaunchParams()),
        compile_params,
        scheduler_type);
    Performance Impact

    The change from direct LaunchParams reference to std::optional adds a layer of indirection. While this is necessary to avoid CUDA initialization during import, verify that the performance impact is minimal and that the value_or() call doesn't introduce significant overhead in the hot path of kernel execution.

              launch_constraints.value_or(LaunchParams()),
              compile_params,
              scheduler_type);
        },
        R"(
            Compile a fusion into a CUDA kernel.
    
            Parameters
            ----------
            fusion : Fusion
                The fusion to compile.
            args : KernelArgumentHolder, optional
                The kernel arguments. If empty, will be populated during run.
            launch_constraints : LaunchParams, optional
                Constraints for kernel launch parameters.
            compile_params : CompileParams, optional
                Parameters for kernel compilation.
            scheduler_type : SchedulerType, optional
                The type of scheduler to use (default: None).
    
            Returns
            -------
            None
          )",
        py::arg("fusion"),
        py::arg("args") = py::list(),
        py::arg("launch_constraints") = py::none(),
        py::arg("compile_params") = CompileParams(),
        py::arg("scheduler_type") = SchedulerType::None)
    .def(
        "run",
        [](KernelExecutor& self,
           const py::iterable& args,
           std::optional<LaunchParams> launch_constraints,
           const CompileParams& compile_params) {
          // launch_constraints is optional to avoid creating default
          // LaunchParams when importing shared library.
          KernelArgumentHolder outputs = self.run(
              from_pyiterable(args),
              {},
              launch_constraints.value_or(LaunchParams()),

    Test failures

    • (Medium, 1) NVFuser validation mismatch in TmaPersistentTest on dlcluster_h100

      Test Name H100 Source
      TmaPersistentTestP.TmaInnerPersistentRmsNorm/__bfloat_2048_5120 Link

    @greptile-apps
    Contributor

    greptile-apps bot commented Feb 18, 2026

    Greptile Summary

    This PR fixes a CUDA driver initialization error that occurs when importing nvfuser_direct on systems where the CUDA driver version is insufficient for the runtime. The root cause is that LaunchParams() default constructor calls assertValid(), which in turn calls at::cuda::getCurrentDeviceProperties() — this requires a working CUDA driver. Previously, LaunchParams() was used as a default argument value in the pybind11 binding definitions for KernelExecutor.compile() and KernelExecutor.run(), meaning a LaunchParams object was constructed at module import time.

    The fix replaces the default LaunchParams() argument with std::optional<LaunchParams> defaulting to py::none(), and uses .value_or(LaunchParams()) inside the lambda bodies. This defers the CUDA context initialization until the methods are actually called, rather than at import time.

    • Changed launch_constraints parameter type from const LaunchParams& to std::optional<LaunchParams> in both compile and run bindings
    • Default argument changed from LaunchParams() to py::none() to avoid constructing a LaunchParams at module load time
    • CompileParams was not changed because its default constructor has no CUDA dependencies

    Confidence Score: 5/5

    • This PR is safe to merge — it is a minimal, well-scoped fix that defers CUDA context initialization from import time to method call time with no behavioral change for callers.
    • The change is small (one file), clearly motivated by a real bug (CUDA driver error on import), and the implementation is correct. The std::optional + value_or pattern is idiomatic and pybind11 handles std::optional with py::none() correctly. No functional behavior changes for users who pass a LaunchParams object explicitly, and the default case produces identical results. The existing test_import_correct() smoke test validates this fix.
    • No files require special attention

    Important Files Changed

    Filename Overview
    python/python_direct/runtime.cpp Changes launch_constraints parameter from const LaunchParams& with LaunchParams() default to std::optional<LaunchParams> with py::none() default in both compile and run bindings. This defers LaunchParams() construction (which triggers at::cuda::getCurrentDeviceProperties()) from module import time to method call time. Clean and correct fix.

    Sequence Diagram

    sequenceDiagram
        participant User as Python User
        participant Mod as nvfuser_direct module
        participant Bind as pybind11 bindings
        participant KE as KernelExecutor
        participant LP as LaunchParams
        participant CUDA as CUDA Runtime
    
        Note over User,CUDA: BEFORE (broken)
        User->>Mod: import nvfuser_direct
        Mod->>Bind: Register KernelExecutor bindings
        Bind->>LP: LaunchParams() [default arg]
        LP->>LP: assertValid()
        LP->>CUDA: getCurrentDeviceProperties()
        CUDA-->>LP: ERROR: driver version insufficient
    
        Note over User,CUDA: AFTER (fixed)
        User->>Mod: import nvfuser_direct
        Mod->>Bind: Register KernelExecutor bindings
        Note right of Bind: Default = py::none() (no LaunchParams created)
        User->>KE: executor.compile(fusion, args)
        KE->>LP: LaunchParams() via value_or()
        LP->>LP: assertValid()
        LP->>CUDA: getCurrentDeviceProperties()
        CUDA-->>LP: OK (CUDA is available at runtime)
    

    Last reviewed commit: 3e40e6c

    Contributor

    @greptile-apps greptile-apps bot left a comment


    1 file reviewed, no comments


    Collaborator

    @wujingyue wujingyue left a comment


    I believe I'm missing the context. Why is this the right fix?

    CUDA error: CUDA driver version is insufficient for CUDA runtime version

    That sounds like a dealbreaker. I don't think any GPU code (including nvfuser) can run with an insufficient driver version.

    @rdspring1
    Collaborator Author

The smoke_test runs on the builder tag, which probably doesn't have very recent GPU drivers.
    If I recall correctly, this is done on purpose. import nvfuser or import nvfuser_direct should not initialize cuda context or directly link to libcuda.so . You could potentially link to libcudart.so or use a runtime load with dlopen.
    The idea is that import nvfuser or nvfuser_direct should do a similar thing like import torch that will not crash when GPU is not available or driver being too old.

    From ^^^ @xwang233

    The smoke test purposefully uses an incorrect driver. You're supposed to be able to import nvfuser_direct without crashing.

    Collaborator

    @wujingyue wujingyue left a comment


    LaunchParams is constructed when nvfuser_direct is loaded because it is a default argument for KernelExecutor.

    Could you say more about this? Is KernelExecutor::compile called accidentally?

    @rdspring1
    Collaborator Author

    Could you say more about this? Is KernelExecutor::compile called accidentally?

    LaunchParams is a default argument for pybind11 binding for KernelExecutor::compile. A LaunchParams object is created when loading shared library for the default argument. This eventually calls LaunchParams::assertValid(), causing the issue.

          .def(
              "compile",
              [](KernelExecutor& self,
                 Fusion* fusion,
                 const py::iterable& args,
                 const LaunchParams& launch_constraints,
                 const CompileParams& compile_params,
                 SchedulerType scheduler_type) {
                self.compile(
                    fusion,
                    from_pyiterable(args),
                    launch_constraints,
                    compile_params,
                    scheduler_type);
              },
              R"(
                  Compile a fusion into a CUDA kernel
    
                  Parameters
                  ----------
                  fusion : Fusion
                      The fusion to compile.
                  args : KernelArgumentHolder, optional
                      The kernel arguments. If empty, will be populated during run.
                  launch_constraints : LaunchParams, optional
                      Constraints for kernel launch parameters.
                  compile_params : CompileParams, optional
                      Parameters for kernel compilation.
                  scheduler_type : SchedulerType, optional
                      The type of scheduler to use (default: None).
    
                  Returns
                  -------
                  None
                )",
              py::arg("fusion"),
              py::arg("args") = py::list(),
              py::arg("launch_constraints") = LaunchParams(),  <<<< Default argument
              py::arg("compile_params") = CompileParams(),
              py::arg("scheduler_type") = SchedulerType::None)

    Gemini Summary:

    In your pybind11 code, the line py::arg("launch_constraints") = LaunchParams() is not just a type declaration; it is a constructor call.
    
    The Root Cause: Eager Default Arguments
    With pybind11, py::arg defaults are evaluated at the time the function is registered (unlike plain C++ default arguments, which are evaluated at each call site). Because this registration happens inside the PYBIND11_MODULE block (which runs during the Python import), the following chain occurs:
    
    Python executes import _C_DIRECT.
    
    pybind11 starts registering the compile method.
    
    To register the default value for launch_constraints, it must instantiate a LaunchParams object.
    
    The LaunchParams constructor (Frame #1) is called.
    
    The constructor calls assertValid() (Frame #0) to ensure the parameters are within hardware limits.
    
    assertValid() tries to query the GPU via LibTorch/CUDA.
    

    Stack trace: when loading from ._C_DIRECT import * shared library

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/opt/pyenv/lib/python3.12/site-packages/nvfuser_direct/__init__.py", line 23, in <module>
        from ._C_DIRECT import *  # noqa: F401,F403
        ^^^^^^^^^^^^^^^^^^^^^^^^
    ImportError: CUDA error: CUDA driver version is insufficient for CUDA runtime version
    Search for `cudaErrorInsufficientDriver' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
    Exception raised from dsa_get_device_count at /pytorch/c10/cuda/CUDADeviceAssertionHost.cpp:60 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x70fab6b71efd in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10.so)
    frame #1: <unknown function> + 0xc0d4 (0x70fab6e9a0d4 in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
    frame #2: c10::cuda::CUDAKernelLaunchRegistry::CUDAKernelLaunchRegistry() + 0x9a (0x70fab6ed2bea in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
    frame #3: c10::cuda::CUDAKernelLaunchRegistry::get_singleton_ref() + 0x4a (0x70fab6ed2eba in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
    frame #4: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x55 (0x70fab6ed3b85 in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
    frame #5: c10::cuda::current_device() + 0x33 (0x70fab6ed4bc3 in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
    frame #6: at::cuda::getCurrentDeviceProperties() + 0x9 (0x70f983291939 in /opt/pyenv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
    frame #7: nvfuser::LaunchParams::assertValid() + 0x45 (0x70f8d7482dc5 in /opt/pyenv/lib/python3.12/site-packages/nvfuser_direct/../nvfuser_common/lib/libnvfuser_codegen.so)
    frame #8: <unknown function> + 0x13d81c (0x70f8db30681c in /opt/pyenv/lib/python3.12/site-packages/nvfuser_direct/_C_DIRECT.cpython-312-x86_64-linux-gnu.so)
    frame #9: <unknown function> + 0x62dbc (0x70f8db22bdbc in /opt/pyenv/lib/python3.12/site-packages/nvfuser_direct/_C_DIRECT.cpython-312-x86_64-linux-gnu.so)
    frame #10: <unknown function> + 0x581c6 (0x70f8db2211c6 in /opt/pyenv/lib/python3.12/site-packages/nvfuser_direct/_C_DIRECT.cpython-312-x86_64-linux-gnu.so)
    <omitting python frames>

    @rdspring1 rdspring1 requested a review from wujingyue February 19, 2026 05:44
