
Avoid CUDA context initialization during op compatibility checks at import #8078

Open
Achyuthan-S wants to merge 1 commit into deepspeedai:master from Achyuthan-S:fix/import-fork-safety

@Achyuthan-S Achyuthan-S commented Jun 19, 2026


Summary

Importing DeepSpeed initializes a CUDA context in the parent process, which permanently breaks fork()-based multiprocessing. This PR makes import deepspeed fork-safe.

Fixes #7918.

Root cause

deepspeed/git_version_info.py runs builder.is_compatible() for every op at import time. Eight CUDA op builders call torch.cuda.get_device_properties(0).major inside is_compatible(). That call triggers torch.cuda._lazy_init() and creates a CUDA context in the parent. Any subsequent fork() whose child touches CUDA then fails with:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with
multiprocessing, you must use the 'spawn' start method

(torch.cuda.is_available() is NVML-backed and fork-safe on modern PyTorch — get_device_properties is the call that poisons the process.)

Fix

Add CUDAOpBuilder.cuda_capability_major(), a fork-safe capability probe that returns the device's compute-capability major only when it is safe to read:

  • skips the probe when no CUDA context exists yet (not torch.cuda.is_initialized()), so a plain import deepspeed never creates one;
  • skips it inside a forked child whose inherited context is invalid (torch.cuda._is_in_bad_fork()), mirroring the existing guard added in "Avoid CUDA reinit error in CI tests" (#7977);
  • returns None in those cases.
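The decision tree in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the function takes the CUDA module as a parameter (a hypothetical signature chosen so the sketch runs without a GPU), and fake_cuda is a hypothetical stand-in for torch.cuda. The three calls it relies on, is_initialized(), _is_in_bad_fork() and get_device_properties(), are the torch.cuda calls named in this description.

```python
from types import SimpleNamespace


def cuda_capability_major(cuda):
    """Return the compute-capability major, or None when probing is unsafe.

    `cuda` is expected to look like torch.cuda; passing it in is an
    assumption of this sketch so it can be exercised with a stand-in.
    """
    # No context yet: probing would create one, so skip.
    if not cuda.is_initialized():
        return None
    # Forked child with an inherited, unusable context: skip.
    if cuda._is_in_bad_fork():
        return None
    # A context already exists, so reading the properties is safe.
    return cuda.get_device_properties(0).major


def fake_cuda(initialized, bad_fork, major=8):
    """Build a stand-in for torch.cuda (hypothetical helper for this sketch)."""
    return SimpleNamespace(
        is_initialized=lambda: initialized,
        _is_in_bad_fork=lambda: bad_fork,
        get_device_properties=lambda index: SimpleNamespace(major=major),
    )


print(cuda_capability_major(fake_cuda(initialized=False, bad_fork=False)))  # None
print(cuda_capability_major(fake_cuda(initialized=True, bad_fork=True)))    # None
print(cuda_capability_major(fake_cuda(initialized=True, bad_fork=False)))   # 8
```

Injecting the module keeps the sketch testable on a machine without CUDA, in the same spirit as the mocked torch.cuda tests described under Tests.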

All eight builders now route through this helper and skip the compute-capability check when the device cannot be probed safely, deferring it to build/load time (where a context already exists). The is_rocm_pytorch() / is_available() guards stay in the callers.

Behavior note

When the capability cannot be probed safely (e.g. at import before CUDA is initialized), the compute-capability gate in is_compatible() is skipped rather than failing. The real check still runs at build/load time once a context exists, so this only relaxes a redundant import-time check in exchange for fork safety.
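On the caller side, the relaxed gate might look roughly like the sketch below. The function name, its parameters, and the threshold of 8 are hypothetical and chosen only for illustration; the real builders each apply their own conditions inside is_compatible().

```python
def capability_gate(capability_major, min_major=8):
    """Hypothetical stand-in for the compute-capability part of is_compatible().

    capability_major is None when the device could not be probed safely.
    """
    if capability_major is None:
        # Unknown at import time: do not fail here, the check is
        # deferred to build/load time when a context exists.
        return True
    # Known capability: apply the usual threshold.
    return capability_major >= min_major


print(capability_gate(None))  # True  (gate skipped)
print(capability_gate(7))     # False (too old for this hypothetical threshold)
print(capability_gate(8))     # True
```

The design trade-off is that an incompatible device is reported later (at build/load) rather than at import, in exchange for never creating a context during import.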

Tests

  • Three unit tests for the helper's decision tree (not-initialized → skip, initialized → probe, bad-fork → skip), following the existing test_bad_fork_jit_* pattern in tests/unit/ops/test_op_builder.py (mocked torch.cuda, no GPU required).
  • test_import_deepspeed_does_not_initialize_cuda: a subprocess regression test asserting import deepspeed leaves CUDA uninitialized.

To validate on GPU: pytest --forked tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py (the repro from #7918) — should pass on a CUDA runner instead of hitting the re-initialize error.

Docs

Updated CONTRIBUTING.md and docs/contributing.md to clarify that --forked is safe now that import deepspeed no longer initializes CUDA — resolving the contradiction called out in the issue.

cc @tjruwase @loadams @tohtana

…ai#7918)

import deepspeed eagerly calls is_compatible() for all ops; eight builders
probed get_device_properties(0), which lazy-inits CUDA and breaks fork()-based
multiprocessing. Gate the probe on is_initialized() via a shared
CUDAOpBuilder.cuda_capability_major() helper, and clarify that pytest --forked
is safe now that import no longer initializes a CUDA context.

Fixes deepspeedai#7918

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings June 19, 2026 11:05
@Achyuthan-S Achyuthan-S mentioned this pull request Jun 19, 2026

Copilot AI left a comment


Pull request overview

This PR addresses a fork-safety issue where import deepspeed could initialize a CUDA context (via import-time op compatibility checks), breaking fork()-based multiprocessing. It introduces a fork-safe CUDA capability probe and updates CUDA op builders to avoid context creation during import.

Changes:

  • Add CUDAOpBuilder.cuda_capability_major() to safely query compute capability only when CUDA is already initialized and not in a bad-fork state.
  • Update affected CUDA op builders’ is_compatible() logic to use the helper and skip capability gating when probing would be unsafe.
  • Add unit/regression tests and update contributing docs to reflect that --forked is now safe with DeepSpeed imports.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/unit/ops/test_op_builder.py | Adds unit tests for the new helper and a subprocess regression test to ensure import deepspeed doesn’t initialize CUDA. |
| op_builder/builder.py | Introduces CUDAOpBuilder.cuda_capability_major() with guards to avoid CUDA context initialization. |
| op_builder/transformer_inference.py | Switches capability checks to the fork-safe helper and gates comparisons on None. |
| op_builder/spatial_inference.py | Switches Ampere gating to the fork-safe helper and guards on None. |
| op_builder/ragged_utils.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/ragged_ops.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/inference_cutlass_builder.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/inference_core_ops.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/fp_quantizer.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/evoformer_attn.py | Switches capability checks to the fork-safe helper and guards on None. |
| docs/contributing.md | Updates contributing guidance to clarify that --forked is safe now that imports don’t initialize CUDA. |
| CONTRIBUTING.md | Mirrors the contributing guidance update from docs/contributing.md. |


Comment on lines +270 to +277
check = (
    "import torch, deepspeed; "
    "assert not torch.cuda.is_initialized(), "  #ignore-cuda
    "'import deepspeed initialized a CUDA context (issue #7918)'")
result = subprocess.run([sys.executable, "-c", check], capture_output=True, text=True)
if "ModuleNotFoundError" in result.stderr:
    pytest.skip("deepspeed/torch not importable in a subprocess in this environment")
assert result.returncode == 0, result.stderr

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02b1c335cd


Comment on lines +271 to +272
"import torch, deepspeed; "
"assert not torch.cuda.is_initialized(), " #ignore-cuda


P1: Verify fork safety, not just CUDA context state

This regression check can pass while the fork failure still exists: import deepspeed still runs op compatibility checks that call torch.cuda.is_available(), and PyTorch only documents that call as non-poisoning when PYTORCH_NVML_BASED_CUDA_CHECK=1 is set (https://docs.pytorch.org/docs/stable/generated/torch.cuda.is_available.html). Since is_available() can poison fork without making torch.cuda.is_initialized() true, CUDA-enabled environments can still fail in a forked child even though this assertion succeeds; the test should actually fork after import and touch CUDA, or the import path must avoid/use the NVML-safe availability check.
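The stronger check suggested here, forking after import and touching CUDA in the child, might look roughly like the sketch below. It is an illustration only: CHILD_SCRIPT, classify, run_fork_probe, and the exit code 77 used to signal "skip" are all hypothetical names and conventions, the script assumes a POSIX platform with os.fork and Python 3.9+ for os.waitstatus_to_exitcode, and matching on the error text assumes the message quoted in the PR description.

```python
import subprocess
import sys

# Script run in a fresh interpreter: import, fork, then touch CUDA in the child.
CHILD_SCRIPT = r'''
import os, sys
import torch, deepspeed

pid = os.fork()
if pid == 0:
    try:
        torch.cuda.get_device_properties(0)
        os._exit(0)
    except RuntimeError as exc:
        # 1 = the fork failure from issue #7918; 77 = anything else (e.g. no GPU).
        os._exit(1 if "Cannot re-initialize CUDA" in str(exc) else 77)
    except BaseException:
        os._exit(77)
_, status = os.waitpid(pid, 0)
sys.exit(os.waitstatus_to_exitcode(status))
'''


def classify(returncode, stderr):
    """Map the subprocess outcome to 'pass', 'fail', or 'skip'."""
    # Missing torch/deepspeed, or no usable GPU in the child: nothing to verify.
    if "ModuleNotFoundError" in stderr or returncode == 77:
        return "skip"
    return "pass" if returncode == 0 else "fail"


def run_fork_probe():
    """Run the child script in a clean interpreter and classify the result."""
    result = subprocess.run([sys.executable, "-c", CHILD_SCRIPT],
                            capture_output=True, text=True)
    return classify(result.returncode, result.stderr)
```

In a pytest test, a "skip" result would map to pytest.skip(...) and "fail" to an assertion failure. Unlike the is_initialized() assertion, this exercises the actual failure mode, so it would also catch poisoning that does not flip the initialized flag.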

