Skip to content

[TIRX] Bind parallel loops to GPU threads before VerifyMemory#19363

Open
zhils wants to merge 4 commits intoapache:mainfrom
zhils:my-fix-branch
Open

[TIRX] Bind parallel loops to GPU threads before VerifyMemory#19363
zhils wants to merge 4 commits intoapache:mainfrom
zhils:my-fix-branch

Conversation

@zhils
Copy link
Copy Markdown

@zhils zhils commented Apr 6, 2026

VerifyMemory on GPU targets treats direct accesses outside thread environments as illegal. In the ScatterValue CUDA lowering path, topi.scatter_elements emits ForKind::kParallel loops without explicit thread bindings, which triggers false host-memory access failures (e.g. "Did you forget to bind?") during TIR verification.

This change adds a new tirx pass (BindParallelLoopsToThreads) and inserts it before VerifyMemory in the s_tir pipelines (including adreno). The pass rewrites parallel loops into blockIdx.x/threadIdx.x thread-extent regions, substitutes loop vars with global thread indices, and adds bounds checks for non-divisible extents. This preserves correctness while ensuring GPU kernels pass memory verification for this path.

`VerifyMemory` on GPU targets treats direct accesses outside thread environments as illegal.
In the ScatterValue CUDA lowering path, `topi.scatter_elements` emits `ForKind::kParallel`
loops without explicit thread bindings, which triggers false host-memory access failures
(e.g. "Did you forget to bind?") during TIR verification.

This change adds a new `tirx` pass (`BindParallelLoopsToThreads`) and inserts it before
`VerifyMemory` in the `s_tir` pipelines (including adreno). The pass rewrites parallel loops
into `blockIdx.x/threadIdx.x` thread-extent regions, substitutes loop vars with global thread
indices, and adds bounds checks for non-divisible extents. This preserves correctness while
ensuring GPU kernels pass memory verification for this path.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the BindParallelLoopsToThreads pass, which converts ForKind::kParallel loops into GPU block and thread bindings, and integrates this pass into the S-TIR pipelines. Additionally, it provides a configuration option to allow unsupported host compilers for NVCC on Windows and adds a functional test for scatter operations on CUDA. Review feedback identifies a critical issue regarding the handling of nested parallel loops which could lead to invalid GPU register bindings, an inconsistency in GPU device type definitions between files, and a minor code redundancy in the loop variable substitution logic.

zhils added 3 commits April 6, 2026 16:32
Fix three correctness/configuration issues in the GPU parallel-loop binding path used before VerifyMemory.
First, preserve non-zero loop mins by mapping parallel indices as min + global_idx instead of global_idx.
Second, avoid rewriting parallel loops when already inside a thread environment to prevent invalid nested bindings.
Third, register cuda.nvcc_allow_unsupported_compiler as a valid PassContext key so the NVCC workaround can be enabled via config without raising Invalid config option.

Made-with: Cursor
- Add kDLWebGPU to IsGPUDevice in verify_memory.cc
- Remove redundant Var wrapper in loop_partition.cc
- Fix nested parallel loop handling in bind_parallel_loops_to_threads.cc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant