Skip to content

cudax/stf: migrate stream/interfaces/ from cuda_safe_call to cuda_try#9268

Open
andralex wants to merge 7 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-stream-misc
Open

cudax/stf: migrate stream/interfaces/ from cuda_safe_call to cuda_try#9268
andralex wants to merge 7 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-stream-misc

Conversation

@andralex

@andralex andralex commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Migrates the cudax/include/cuda/experimental/__stf/stream/interfaces/ data interfaces (hashtable, slice, slice reduction ops) from cuda_safe_call to cuda_try. Part of the ongoing STF cuda_safe_call -> cuda_try rollout; the large stream files (event_types.cuh, stream_ctx.cuh, stream_task.cuh) are handled in separate PRs.

Changes (3 files, 15 sites)

  • Templated cuda_try<F> for single-function calls: cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize, cudaFreeHost, cudaFreeAsync, cudaPointerGetAttributes (out-param -> returned cudaPointerAttributes).
  • Kept runtime-status cuda_try(...) for overload sets (cuda_runtime.h templated wrappers): cudaHostAlloc, cudaMallocAsync.
  • hashtable_linearprobing.cuh leak guard: in stream_data_allocate, after the device cudaMallocAsync succeeds, the buffer is freed via SCOPE(fail) if the following cudaMemsetAsync throws — closing the leak the new throw path would otherwise introduce. Adds scope_guard.cuh.

Validation

Built locally (cpp20): cudax.test.stf.hashtable.test, cudax.test.stf.reductions.slice2d_reduction (exercises the 2D cudaMemcpy2DAsync path), cudax.test.stf.reductions.reduce_sum — all compile and link.

Test plan

  • CI green on the cudax matrix
  • No success-path behavior change; new behavior is throw-vs-abort plus the alloc leak-guard

Covers the stream-backend data interfaces (hashtable, slice, slice reduction
ops). Uses the templated cuda_try<F> form for single-function calls
(cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize,
cudaFreeHost, cudaFreeAsync, cudaPointerGetAttributes).

cudaHostAlloc and cudaMallocAsync stay in the runtime-status form: both are
overload sets (cuda_runtime.h templated wrappers), so cuda_try<F> cannot name
them.

hashtable stream_data_allocate: after the device cudaMallocAsync succeeds, the
buffer is freed via SCOPE(fail) if the subsequent cudaMemsetAsync throws, so the
new throw path does not leak the allocation. Adds scope_guard.cuh.

Verified locally by building cudax.test.stf.hashtable.test,
cudax.test.stf.reductions.slice2d_reduction (2D cudaMemcpy2DAsync path), and
cudax.test.stf.reductions.reduce_sum.
@andralex andralex requested a review from a team as a code owner June 5, 2026 01:35
@andralex andralex requested a review from srinivasyadav18 June 5, 2026 01:35
@andralex

andralex commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

placeholder

@copy-pr-bot

copy-pr-bot Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 5, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 5, 2026
@andralex andralex enabled auto-merge (squash) June 5, 2026 01:37
@andralex

andralex commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test b6dd048

@andralex

andralex commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 33ffac5

@copy-pr-bot

copy-pr-bot Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

/ok to test 33ffac5

@andralex, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 715ae7ec-917d-44ca-8216-8c928aaef8bc

📥 Commits

Reviewing files that changed from the base of the PR and between 3ae7ba1 and 8dce860.

📒 Files selected for processing (2)
  • cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh
  • cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR migrates CUDA runtime error handling in the data interfaces under cudax/include/cuda/experimental/__stf/stream/interfaces/ from abort-on-error cuda_safe_call(...) to exception-based cuda_try(...). Three interface files are changed (hashtable_linearprobing.cuh, slice.cuh, slice_reduction_ops.cuh). Large stream files (event_types.cuh, stream_ctx.cuh, stream_task.cuh) are intentionally excluded for separate treatment.

Files modified: 3 | Sites changed: 15 | Lines changed: +33/-21

Changes

  • cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh

    • Replaced CUDA runtime wrappers for copy/alloc/free/sync with cuda_try variants where applicable.
    • Added scope_guard.cuh and introduced a SCOPE(fail) leak-guard: after a successful device cudaMallocAsync, a failing cudaMemsetAsync triggers cleanup that frees the allocated buffer (the cleanup call itself uses the legacy safe wrapper in this diff).
    • Host-path allocation/deallocation now use cuda_try forms (including cuda_try before host allocation and cuda_try on free); device free uses cuda_try.
  • cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh

    • Converted async memcpy calls across branches (cudaMemcpyAsync, cudaMemcpy2DAsync, and contiguous higher-dimension fallback) from cuda_safe_call(...) to the appropriate cuda_try forms, using templated cuda_try where a single-function template parameter is available.
    • Rewrote memory-type detection to use cuda_try that returns cudaPointerAttributes instead of populating an out-parameter.
  • cudax/include/cuda/experimental/__stf/stream/interfaces/slice_reduction_ops.cuh

    • Switched host-side stream synchronization in op and init_op from cuda_safe_call(cudaStreamSynchronize(...)) to cuda_try(...).

Notable implementation details / decisions

  • Introduced a templated cuda_try for single-function CUDA calls (used for cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize, cudaFreeHost, cudaFreeAsync, cudaPointerGetAttributes).
  • Retained the runtime-status cuda_try(...) form for overload sets provided by cuda_runtime.h (e.g., cudaHostAlloc, cudaMallocAsync, and in cases where overload sets cannot be referenced as a single template parameter).
  • Adjusted ordering in host deallocation to defer reporting certain stream-sync errors until after host free to preserve cleanup semantics.
  • Added a scope-based leak-guard (SCOPE(fail)) to free partially-allocated device buffers when initialization fails.

API / compatibility

  • No exported/public API signature changes.
  • Success-path behavior unchanged; error-path behavior now throws exceptions instead of aborting. The added leak-guard prevents device allocation leaks on initialization failures.

Validation / tests

  • Local C++20 build and link succeeded for: cudax.test.stf.hashtable.test, cudax.test.stf.reductions.slice2d_reduction (exercises cudaMemcpy2DAsync), and cudax.test.stf.reductions.reduce_sum.
  • CI expected to run on the cudax matrix.

Comments / commits

  • Comments captured in the PR: one author "placeholder" comment and eight author "/ok to test" comments (automated test triggers); no reviewer discussion recorded.
  • Commit messages include refinements to cuda_try usage (disambiguation of overloads via explicit template arguments where applicable), a follow-up to defer certain cudaStreamSynchronize errors until after cudaFreeHost in hashtable deallocation, and simplifications to slice::get_memory_type.

Walkthrough

important: Three STF stream-interface headers replace cuda_safe_call(...) with typed cuda_try<...>(...) across async copies, allocation, deallocation, pointer-attribute queries, and stream synchronizations; hashtable device allocation adds a SCOPE(fail) cleanup for partial failures.

Changes

STF Stream Interface Error Handling Refactor

Layer / File(s) Summary
Hash table allocation, initialization, and error-safe deallocation
cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh
Added scope_guard.cuh; async table copy switched to typed cuda_try<cudaMemcpyAsync<...>>; host allocation path uses cuda_try<cudaStreamSynchronize> + cuda_try<cudaHostAlloc<...>> then initialization; device path uses cuda_try(cudaMallocAsync(...)), SCOPE(fail) guard freeing via cudaFreeAsync, and cuda_try<cudaMemsetAsync>(...) for init; deallocation uses cuda_try-wrapped frees and adjusts sync/free ordering.
Slice async copies and pointer-attribute query
cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh
data_copy branches (0D/1D/2D/contiguous) use typed cuda_try wrappers for cudaMemcpyAsync/cudaMemcpy2DAsync/fallback; get_memory_type uses cuda_try<cudaPointerGetAttributes>(...) to obtain attributes.
Reduction op stream synchronization
cudax/include/cuda/experimental/__stf/stream/interfaces/slice_reduction_ops.cuh
Host-side cudaStreamSynchronize calls in op() and init_op() replaced with cuda_try<cudaStreamSynchronize>(s).

Possibly related PRs

  • NVIDIA/cccl#9265: Related STF-wide migration of CUDA runtime/graph API calls from cuda_safe_call(...) to cuda_try<...>(...).

Suggested reviewers

  • caugonnet
  • srinivasyadav18

Suggested labels

stf


Comment @coderabbitai help to get the list of available commands and usage tips.

…set)

cudaMemcpyAsync is an overload set on CTK 13.2+ (cuda_runtime.h adds an
alternate-spelling wrapper), so cuda_try<cudaMemcpyAsync> is ill-formed.
Clang rejects it; GCC accepts it leniently, which is why local GCC builds
missed it. Revert the cudaMemcpyAsync calls in the stream slice and
hashtable interfaces to the runtime-status cuda_try(cudaMemcpyAsync(...))
form. The other (non-overloaded) calls keep the templated form.
@andralex

andralex commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 3ae7ba1

@github-actions

This comment has been minimized.

@caugonnet caugonnet added the stf Sequential Task Flow programming model label Jun 5, 2026
@andralex

andralex commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 3ae7ba1

@andralex

andralex commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 4e868a7

@andralex

andralex commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test a6cfd45

@github-actions

This comment has been minimized.

@andralex

andralex commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 0231898

Disambiguate overload-set alloc/copy calls with explicit template
arguments where applicable, defer cudaStreamSynchronize errors until
after cudaFreeHost in hashtable host deallocate, and simplify
slice::get_memory_type.
@andralex

andralex commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 8dce860

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

😬 CI Workflow Results

🟥 Finished in 1h 01m: Pass: 41%/55 | Total: 11h 10m | Max: 58m 43s | Hits: 20%/39040

See results here.

@andralex

andralex commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 1c49aa4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stf Sequential Task Flow programming model

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants