Avoid CUDA context initialization during op compatibility checks at import #8078
Achyuthan-S wants to merge 1 commit into
…ai#7918)
import deepspeed eagerly calls is_compatible() for all ops; eight builders probed get_device_properties(0), which lazy-inits CUDA and breaks fork()-based multiprocessing. Gate the probe on is_initialized() via a shared CUDAOpBuilder.cuda_capability_major() helper, and clarify that pytest --forked is safe now that import no longer initializes a CUDA context.
Fixes deepspeedai#7918
Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
This PR addresses a fork-safety issue where import deepspeed could initialize a CUDA context (via import-time op compatibility checks), breaking fork()-based multiprocessing. It introduces a fork-safe CUDA capability probe and updates CUDA op builders to avoid context creation during import.
Changes:
- Add `CUDAOpBuilder.cuda_capability_major()` to safely query compute capability only when CUDA is already initialized and not in a bad-fork state.
- Update affected CUDA op builders' `is_compatible()` logic to use the helper and skip capability gating when probing would be unsafe.
- Add unit/regression tests and update contributing docs to reflect that `--forked` is now safe with DeepSpeed imports.
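The guard described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from `op_builder/builder.py`: the `cuda` parameter (an object shaped like `torch.cuda`), the `min_major` threshold, and the `fake_cuda` stand-in are hypothetical and exist only so the logic runs without a GPU.

```python
from types import SimpleNamespace
from typing import Any, Optional


def cuda_capability_major(cuda: Any) -> Optional[int]:
    """Return device 0's compute-capability major, or None if probing is unsafe.

    `cuda` is expected to behave like `torch.cuda`. Passing it in explicitly is
    an assumption of this sketch; the real helper is a CUDAOpBuilder method.
    """
    if not cuda.is_initialized():
        return None  # no context yet: probing would create one
    if cuda._is_in_bad_fork():
        return None  # forked child of a CUDA-initialized parent
    return cuda.get_device_properties(0).major


def is_compatible(cuda: Any, min_major: int = 8) -> bool:
    """Caller-side gating: skip the capability check when it cannot be probed."""
    major = cuda_capability_major(cuda)
    if major is None:
        return True  # defer the real check to build/load time
    return major >= min_major


def fake_cuda(initialized: bool, bad_fork: bool = False, major: int = 8) -> SimpleNamespace:
    """Hypothetical stand-in for torch.cuda that records probe calls."""
    calls = []

    def get_device_properties(index: int) -> SimpleNamespace:
        calls.append(index)
        return SimpleNamespace(major=major)

    return SimpleNamespace(
        is_initialized=lambda: initialized,
        _is_in_bad_fork=lambda: bad_fork,
        get_device_properties=get_device_properties,
        calls=calls,
    )
```

With a stand-in such as `fake_cuda(initialized=False)`, the helper returns `None` without ever calling `get_device_properties`, which is the property the import path relies on.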
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/ops/test_op_builder.py | Adds unit tests for the new helper and a subprocess regression test to ensure import deepspeed doesn’t initialize CUDA. |
| op_builder/builder.py | Introduces CUDAOpBuilder.cuda_capability_major() with guards to avoid CUDA context initialization. |
| op_builder/transformer_inference.py | Switches capability checks to the fork-safe helper and gates comparisons on None. |
| op_builder/spatial_inference.py | Switches Ampere gating to the fork-safe helper and guards on None. |
| op_builder/ragged_utils.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/ragged_ops.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/inference_cutlass_builder.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/inference_core_ops.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/fp_quantizer.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/evoformer_attn.py | Switches capability checks to the fork-safe helper and guards on None. |
| docs/contributing.md | Updates contributing guidance to clarify that --forked is safe now that imports don’t initialize CUDA. |
| CONTRIBUTING.md | Mirrors the contributing guidance update from docs/contributing.md. |

    check = (
        "import torch, deepspeed; "
        "assert not torch.cuda.is_initialized(), "  #ignore-cuda
        "'import deepspeed initialized a CUDA context (issue #7918)'")
    result = subprocess.run([sys.executable, "-c", check], capture_output=True, text=True)
    if "ModuleNotFoundError" in result.stderr:
        pytest.skip("deepspeed/torch not importable in a subprocess in this environment")
    assert result.returncode == 0, result.stderr

💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 02b1c335cd

    "import torch, deepspeed; "
    "assert not torch.cuda.is_initialized(), "  #ignore-cuda

Verify fork safety, not just CUDA context state
This regression check can pass while the fork failure still exists. import deepspeed still runs op compatibility checks that call torch.cuda.is_available(), and PyTorch documents that call as non-poisoning only when PYTORCH_NVML_BASED_CUDA_CHECK=1 is set (https://docs.pytorch.org/docs/stable/generated/torch.cuda.is_available.html). Because is_available() can poison fork without making torch.cuda.is_initialized() true, CUDA-enabled environments can still fail in a forked child even though this assertion succeeds. The test should fork after import and touch CUDA, or the import path should either avoid the availability check or use the NVML-safe variant.
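One way such a check might look is sketched below, assuming a POSIX platform (it relies on `os.fork`) and using the private `torch.cuda._is_in_bad_fork()` flag as the signal. The names `FORK_CHECK` and `import_is_fork_safe` are hypothetical, and the child only inspects the bad-fork flag rather than launching real CUDA work, so a GPU-backed variant would still be worth running on a CUDA runner.

```python
import subprocess
import sys
from typing import Optional

# Executed in a fresh interpreter so the calling process's CUDA state is irrelevant.
# torch.cuda._is_in_bad_fork() is a private PyTorch helper and may change.
FORK_CHECK = """
import os, sys
import torch, deepspeed

pid = os.fork()
if pid == 0:
    # Child: exit 1 if PyTorch reports this process as a fork of a poisoned parent.
    os._exit(1 if torch.cuda._is_in_bad_fork() else 0)
_, status = os.waitpid(pid, 0)
ok = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
sys.exit(0 if ok else 1)
"""


def import_is_fork_safe(script: str = FORK_CHECK) -> Optional[bool]:
    """Run the script in a subprocess.

    Returns True if it exits 0, False on any other exit code, and None when a
    required module (torch or deepspeed) cannot be imported.
    """
    result = subprocess.run([sys.executable, "-c", script], capture_output=True, text=True)
    if "ModuleNotFoundError" in result.stderr:
        return None
    return result.returncode == 0
```

In a pytest test, a `None` result would map naturally onto `pytest.skip(...)`, mirroring the skip in the existing regression test.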
Summary
Importing DeepSpeed initialized a CUDA context in the parent process, which permanently breaks `fork()`-based multiprocessing. This PR makes `import deepspeed` fork-safe.
Fixes #7918.
Root cause
`deepspeed/git_version_info.py` runs `builder.is_compatible()` for every op at import time. Eight CUDA op builders call `torch.cuda.get_device_properties(0).major` inside `is_compatible()`. That call triggers `torch.cuda._lazy_init()` and creates a CUDA context in the parent. Any subsequent `fork()` whose child touches CUDA then fails with:

    RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

(`torch.cuda.is_available()` is NVML-backed and fork-safe on modern PyTorch; `get_device_properties` is the call that poisons the process.)
Fix
Add `CUDAOpBuilder.cuda_capability_major()`, a fork-safe capability probe that returns the device's compute-capability major only when it is safe to read:
- it does not probe when no CUDA context exists yet (`not torch.cuda.is_initialized()`), so a plain `import deepspeed` never creates one;
- it does not probe in a bad-fork state (`torch.cuda._is_in_bad_fork()`), mirroring the existing guard added in "Avoid CUDA reinit error in CI tests" (#7977);
- it returns `None` in those cases.
All eight builders now route through this helper and skip the compute-capability check when the device cannot be probed safely, deferring it to build/load time (where a context already exists). The `is_rocm_pytorch()` / `is_available()` guards stay in the callers.
Behavior note
When the capability cannot be probed safely (e.g. at import before CUDA is initialized), the compute-capability gate in `is_compatible()` is skipped rather than failing. The real check still runs at build/load time once a context exists, so this only relaxes a redundant import-time check in exchange for fork safety.
Tests
- Unit tests for the helper, following the `test_bad_fork_jit_*` pattern in `tests/unit/ops/test_op_builder.py` (mocked `torch.cuda`, no GPU required).
- `test_import_deepspeed_does_not_initialize_cuda`: a subprocess regression test asserting `import deepspeed` leaves CUDA uninitialized.
To validate on GPU: `pytest --forked tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py` (the repro from #7918) should pass on a CUDA runner instead of hitting the re-initialize error.
Docs
Updated `CONTRIBUTING.md` and `docs/contributing.md` to clarify that `--forked` is safe now that `import deepspeed` no longer initializes CUDA, resolving the contradiction called out in the issue.
cc @tjruwase @loadams @tohtana