cudax/stf: migrate stream/interfaces/ from cuda_safe_call to cuda_try by andralex · Pull Request #9268 · NVIDIA/cccl

andralex · 2026-06-05T01:35:14Z

Summary

Migrates the cudax/include/cuda/experimental/__stf/stream/interfaces/ data interfaces (hashtable, slice, slice reduction ops) from cuda_safe_call to cuda_try. Part of the ongoing STF cuda_safe_call -> cuda_try rollout; the large stream files (event_types.cuh, stream_ctx.cuh, stream_task.cuh) are handled in separate PRs.

Changes (3 files, 15 sites)

Templated cuda_try<F> for single-function calls: cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize, cudaFreeHost, cudaFreeAsync, cudaPointerGetAttributes (out-param -> returned cudaPointerAttributes).
Kept runtime-status cuda_try(...) for overload sets (cuda_runtime.h templated wrappers): cudaHostAlloc, cudaMallocAsync.
hashtable_linearprobing.cuh leak guard: in stream_data_allocate, after the device cudaMallocAsync succeeds, the buffer is freed via SCOPE(fail) if the following cudaMemsetAsync throws — closing the leak the new throw path would otherwise introduce. Adds scope_guard.cuh.

Validation

Built locally (cpp20): cudax.test.stf.hashtable.test, cudax.test.stf.reductions.slice2d_reduction (exercises the 2D cudaMemcpy2DAsync path), cudax.test.stf.reductions.reduce_sum — all compile and link.

Test plan

CI green on the cudax matrix
No success-path behavior change; new behavior is throw-vs-abort plus the alloc leak-guard

Covers the stream-backend data interfaces (hashtable, slice, slice reduction ops). Uses the templated cuda_try<F> form for single-function calls (cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize, cudaFreeHost, cudaFreeAsync, cudaPointerGetAttributes). cudaHostAlloc and cudaMallocAsync stay in the runtime-status form: both are overload sets (cuda_runtime.h templated wrappers), so cuda_try<F> cannot name them. hashtable stream_data_allocate: after the device cudaMallocAsync succeeds, the buffer is freed via SCOPE(fail) if the subsequent cudaMemsetAsync throws, so the new throw path does not leak the allocation. Adds scope_guard.cuh. Verified locally by building cudax.test.stf.hashtable.test, cudax.test.stf.reductions.slice2d_reduction (2D cudaMemcpy2DAsync path), and cudax.test.stf.reductions.reduce_sum.

andralex · 2026-06-05T01:35:17Z

placeholder

copy-pr-bot · 2026-06-05T01:35:17Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

andralex · 2026-06-05T01:38:44Z

/ok to test b6dd048

andralex · 2026-06-05T01:39:24Z

/ok to test 33ffac5

copy-pr-bot · 2026-06-05T01:39:27Z

/ok to test 33ffac5

@andralex, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

coderabbitai · 2026-06-05T01:40:35Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 715ae7ec-917d-44ca-8216-8c928aaef8bc

📥 Commits

Reviewing files that changed from the base of the PR and between 3ae7ba1 and 8dce860.

📒 Files selected for processing (2)

cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh
cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh

🚧 Files skipped from review as they are similar to previous changes (1)

cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR migrates CUDA runtime error handling in the data interfaces under cudax/include/cuda/experimental/__stf/stream/interfaces/ from abort-on-error cuda_safe_call(...) to exception-based cuda_try(...). Three interface files are changed (hashtable_linearprobing.cuh, slice.cuh, slice_reduction_ops.cuh). Large stream files (event_types.cuh, stream_ctx.cuh, stream_task.cuh) are intentionally excluded for separate treatment.

Files modified: 3 | Sites changed: 15 | Lines changed: +33/-21

Changes

cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh
- Replaced CUDA runtime wrappers for copy/alloc/free/sync with cuda_try variants where applicable.
- Added scope_guard.cuh and introduced a SCOPE(fail) leak-guard: after a successful device cudaMallocAsync, a failing cudaMemsetAsync triggers cleanup that frees the allocated buffer (the cleanup call itself uses the legacy safe wrapper in this diff).
- Host-path allocation/deallocation now use cuda_try forms (including cuda_try before host allocation and cuda_try on free); device free uses cuda_try.
cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh
- Converted async memcpy calls across branches (cudaMemcpyAsync, cudaMemcpy2DAsync, and contiguous higher-dimension fallback) from cuda_safe_call(...) to the appropriate cuda_try forms, using templated cuda_try where a single-function template parameter is available.
- Rewrote memory-type detection to use cuda_try that returns cudaPointerAttributes instead of populating an out-parameter.
cudax/include/cuda/experimental/__stf/stream/interfaces/slice_reduction_ops.cuh
- Switched host-side stream synchronization in op and init_op from cuda_safe_call(cudaStreamSynchronize(...)) to cuda_try(...).

Notable implementation details / decisions

Introduced a templated cuda_try for single-function CUDA calls (used for cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize, cudaFreeHost, cudaFreeAsync, cudaPointerGetAttributes).
Retained the runtime-status cuda_try(...) form for overload sets provided by cuda_runtime.h (e.g., cudaHostAlloc, cudaMallocAsync, and in cases where overload sets cannot be referenced as a single template parameter).
Adjusted ordering in host deallocation to defer reporting certain stream-sync errors until after host free to preserve cleanup semantics.
Added a scope-based leak-guard (SCOPE(fail)) to free partially-allocated device buffers when initialization fails.

API / compatibility

No exported/public API signature changes.
Success-path behavior unchanged; error-path behavior now throws exceptions instead of aborting. The added leak-guard prevents device allocation leaks on initialization failures.

Validation / tests

Local C++20 build and link succeeded for: cudax.test.stf.hashtable.test, cudax.test.stf.reductions.slice2d_reduction (exercises cudaMemcpy2DAsync), and cudax.test.stf.reductions.reduce_sum.
CI expected to run on the cudax matrix.

Comments / commits

Comments captured in the PR: one author "placeholder" comment and eight author "/ok to test" comments (automated test triggers); no reviewer discussion recorded.
Commit messages include refinements to cuda_try usage (disambiguation of overloads via explicit template arguments where applicable), a follow-up to defer certain cudaStreamSynchronize errors until after cudaFreeHost in hashtable deallocation, and simplifications to slice::get_memory_type.

Walkthrough

important: Three STF stream-interface headers replace cuda_safe_call(...) with typed cuda_try<...>(...) across async copies, allocation, deallocation, pointer-attribute queries, and stream synchronizations; hashtable device allocation adds a SCOPE(fail) cleanup for partial failures.

Changes

STF Stream Interface Error Handling Refactor

Layer / File(s)	Summary
Hash table allocation, initialization, and error-safe deallocation `cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh`	Added `scope_guard.cuh`; async table copy switched to typed `cuda_try<cudaMemcpyAsync<...>>`; host allocation path uses `cuda_try<cudaStreamSynchronize>` + `cuda_try<cudaHostAlloc<...>>` then initialization; device path uses `cuda_try(cudaMallocAsync(...))`, `SCOPE(fail)` guard freeing via `cudaFreeAsync`, and `cuda_try<cudaMemsetAsync>(...)` for init; deallocation uses `cuda_try`-wrapped frees and adjusts sync/free ordering.
Slice async copies and pointer-attribute query `cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh`	`data_copy` branches (0D/1D/2D/contiguous) use typed `cuda_try` wrappers for `cudaMemcpyAsync`/`cudaMemcpy2DAsync`/fallback; `get_memory_type` uses `cuda_try<cudaPointerGetAttributes>(...)` to obtain attributes.
Reduction op stream synchronization `cudax/include/cuda/experimental/__stf/stream/interfaces/slice_reduction_ops.cuh`	Host-side `cudaStreamSynchronize` calls in `op()` and `init_op()` replaced with `cuda_try<cudaStreamSynchronize>(s)`.

Possibly related PRs

NVIDIA/cccl#9265: Related STF-wide migration of CUDA runtime/graph API calls from cuda_safe_call(...) to cuda_try<...>(...).

Suggested reviewers

caugonnet
srinivasyadav18

Suggested labels

stf

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…set) cudaMemcpyAsync is an overload set on CTK 13.2+ (cuda_runtime.h adds an alternate-spelling wrapper), so cuda_try<cudaMemcpyAsync> is ill-formed. Clang rejects it; GCC accepts it leniently, which is why local GCC builds missed it. Revert the cudaMemcpyAsync calls in the stream slice and hashtable interfaces to the runtime-status cuda_try(cudaMemcpyAsync(...)) form. The other (non-overloaded) calls keep the templated form.

andralex · 2026-06-05T01:51:16Z

/ok to test 3ae7ba1

andralex · 2026-06-05T22:29:27Z

/ok to test 3ae7ba1

andralex · 2026-06-08T21:41:52Z

/ok to test 4e868a7

andralex · 2026-06-08T22:54:03Z

/ok to test a6cfd45

andralex · 2026-06-09T00:32:44Z

/ok to test 0231898

Disambiguate overload-set alloc/copy calls with explicit template arguments where applicable, defer cudaStreamSynchronize errors until after cudaFreeHost in hashtable host deallocate, and simplify slice::get_memory_type.

andralex · 2026-06-09T01:13:54Z

/ok to test 8dce860

github-actions · 2026-06-09T02:17:14Z

😬 CI Workflow Results

🟥 Finished in 1h 01m: Pass: 41%/55 | Total: 11h 10m | Max: 58m 43s | Hits: 20%/39040

See results here.

andralex · 2026-06-09T13:03:21Z

/ok to test 1c49aa4

andralex requested a review from a team as a code owner June 5, 2026 01:35

andralex requested a review from srinivasyadav18 June 5, 2026 01:35

github-project-automation Bot added this to CCCL Jun 5, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 5, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 5, 2026

andralex enabled auto-merge (squash) June 5, 2026 01:37

This comment has been minimized.

Sign in to view

caugonnet added the stf Sequential Task Flow programming model label Jun 5, 2026

Merge branch 'main' into andralex/stf-cuda-try-stream-misc

4e868a7

Merge branch 'main' into andralex/stf-cuda-try-stream-misc

a6cfd45

This comment has been minimized.

Sign in to view

Merge branch 'main' into andralex/stf-cuda-try-stream-misc

0231898

cudax/stf: refine cuda_try usage in stream interface headers

8dce860

Disambiguate overload-set alloc/copy calls with explicit template arguments where applicable, defer cudaStreamSynchronize errors until after cudaFreeHost in hashtable host deallocate, and simplify slice::get_memory_type.

Merge branch 'main' into andralex/stf-cuda-try-stream-misc

1c49aa4

Conversation

andralex commented Jun 5, 2026

Summary

Changes (3 files, 15 sites)

Validation

Test plan

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

copy-pr-bot Bot commented Jun 5, 2026

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

copy-pr-bot Bot commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Notable implementation details / decisions

API / compatibility

Validation / tests

Comments / commits

Walkthrough

Changes

Possibly related PRs

Suggested reviewers

Suggested labels

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

This comment has been minimized.

andralex commented Jun 5, 2026

Uh oh!

andralex commented Jun 8, 2026

Uh oh!

andralex commented Jun 8, 2026

Uh oh!

This comment has been minimized.

andralex commented Jun 9, 2026

Uh oh!

andralex commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026

😬 CI Workflow Results

🟥 Finished in 1h 01m: Pass: 41%/55 | Total: 11h 10m | Max: 58m 43s | Hits: 20%/39040

Uh oh!

andralex commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading