[TIRX] Bind parallel loops to GPU threads before VerifyMemory#19363
Open
zhils wants to merge 4 commits into apache:main from
Conversation
`VerifyMemory` on GPU targets treats direct accesses outside thread environments as illegal. In the ScatterValue CUDA lowering path, `topi.scatter_elements` emits `ForKind::kParallel` loops without explicit thread bindings, which triggers false host-memory access failures (e.g. "Did you forget to bind?") during TIR verification. This change adds a new `tirx` pass (`BindParallelLoopsToThreads`) and inserts it before `VerifyMemory` in the `s_tir` pipelines (including adreno). The pass rewrites parallel loops into `blockIdx.x/threadIdx.x` thread-extent regions, substitutes loop vars with global thread indices, and adds bounds checks for non-divisible extents. This preserves correctness while ensuring GPU kernels pass memory verification for this path.
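The loop rewrite described above can be modeled with a small pure-Python simulation (hypothetical names; the actual pass operates on TIR in C++). It shows the essential index math: a flat parallel loop is split across `blockIdx.x`/`threadIdx.x`, the loop variable is replaced by the global thread index, and a bounds check guards the case where the extent does not divide evenly into the launch size.

```python
def bind_parallel_loop(extent, num_blocks, threads_per_block, body):
    """Illustrative model of rewriting `for i in parallel(extent)` into a
    blockIdx.x/threadIdx.x thread-extent region. Not the TIR pass itself."""
    for block_idx in range(num_blocks):                  # blockIdx.x
        for thread_idx in range(threads_per_block):      # threadIdx.x
            global_idx = block_idx * threads_per_block + thread_idx
            if global_idx < extent:  # bounds check for non-divisible extents
                body(global_idx)

# Usage: an extent of 10 with 4 threads per block needs 3 blocks;
# the guard skips the 2 out-of-range threads in the last block.
visited = []
bind_parallel_loop(10, 3, 4, visited.append)
```

The bounds check is what keeps the rewrite correct for arbitrary extents rather than only multiples of the block size.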
Code Review
This pull request introduces the BindParallelLoopsToThreads pass, which converts ForKind::kParallel loops into GPU block and thread bindings, and integrates this pass into the S-TIR pipelines. Additionally, it provides a configuration option to allow unsupported host compilers for NVCC on Windows and adds a functional test for scatter operations on CUDA. Review feedback identifies a critical issue regarding the handling of nested parallel loops which could lead to invalid GPU register bindings, an inconsistency in GPU device type definitions between files, and a minor code redundancy in the loop variable substitution logic.
Fix three correctness/configuration issues in the GPU parallel-loop binding path used before `VerifyMemory`:

- Preserve non-zero loop mins by mapping parallel indices as `min + global_idx` instead of `global_idx`.
- Avoid rewriting parallel loops when already inside a thread environment, to prevent invalid nested bindings.
- Register `cuda.nvcc_allow_unsupported_compiler` as a valid `PassContext` key so the NVCC workaround can be enabled via config without raising "Invalid config option".

Made-with: Cursor
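The first fix, preserving a non-zero loop min, amounts to the following index mapping (a pure-Python sketch with hypothetical names, not the pass's actual substitution code):

```python
def loop_index(loop_min, global_idx):
    """Map a flat GPU thread index back to the original loop's iteration
    variable. The loop may start at a non-zero min, so the correct
    mapping is min + global_idx; using global_idx alone would shift
    every access when min != 0."""
    return loop_min + global_idx

# A parallel loop over [2, 7) has min=2 and extent=5; its 5 threads
# must touch indices 2..6, not 0..4.
indices = [loop_index(2, g) for g in range(5)]
```

For the third fix, once the key is registered it can presumably be set like any other option, e.g. `tvm.transform.PassContext(config={"cuda.nvcc_allow_unsupported_compiler": True})` (assuming the standard TVM Python API).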
- Add `kDLWebGPU` to `IsGPUDevice` in `verify_memory.cc`
- Remove redundant `Var` wrapper in `loop_partition.cc`
- Fix nested parallel loop handling in `bind_parallel_loops_to_threads.cc`