Skip to content

[WIP] Improve reducer API for memory allocation & compute#9290

Draft
acosmicflamingo wants to merge 4 commits into
NVIDIA:mainfrom
acosmicflamingo:improve-mem-compute-api
Draft

[WIP] Improve reducer API for memory allocation & compute#9290
acosmicflamingo wants to merge 4 commits into
NVIDIA:mainfrom
acosmicflamingo:improve-mem-compute-api

Conversation

@acosmicflamingo

Copy link
Copy Markdown
Contributor

Description

Resolves #5638

Add 'get_temp_storage_bytes' and 'compute' functions to _Reduce class to improve API usability (as suggested by @kkraus14 and @shwina).

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

* Start migrating Reduce.__call__ dunder methods to more descriptive
  API (as suggested by @kkraus14 and @shwina)

* Add 'get_temp_storage_bytes' and 'compute' method stubs

* Display deprecation warnings when users use utilize Reduce.__call__
  dunder method (and add respective tests)
@copy-pr-bot

copy-pr-bot Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 8, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 8, 2026
@acosmicflamingo acosmicflamingo force-pushed the improve-mem-compute-api branch from 3d56a30 to b17dadc Compare June 8, 2026 14:44
@acosmicflamingo

acosmicflamingo commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

@shwina I imagine that we want to avoid something like this, correct?

diff a/python/cuda_cccl/cuda/compute/algorithms/_reduce.py 
     b/python/cuda_cccl/cuda/compute/algorithms/_reduce.py
@@ -35,6 +35,8 @@ from ..typing import (
 
 class _Reduce:
     __slots__ = [
+        "d_in",
+        "d_out",
         "d_in_cccl",
         "d_out_cccl",
         "h_init_cccl",
@@ -52,6 +54,8 @@ class _Reduce:
         h_init: np.ndarray | GpuStruct,
         determinism: Determinism,
     ):
+        self.d_in = d_in
+        self.d_out = d_out
         self.d_in_cccl = cccl.to_cccl_input_iter(d_in)
         self.d_out_cccl = cccl.to_cccl_output_iter(d_out)
         self.h_init_cccl = cccl.to_cccl_value(h_init)
@@ -152,8 +156,8 @@ class _Reduce:
         h_init: np.ndarray | GpuStruct,
         stream=None,
     ):
-        set_cccl_iterator_state(self.d_in_cccl, d_in)
-        set_cccl_iterator_state(self.d_out_cccl, d_out)
+        set_cccl_iterator_state(self.d_in_cccl, self.d_in)
+        set_cccl_iterator_state(self.d_out_cccl, self.d_out)
 
         # Update op state for stateful ops
         op_adapter = make_op_adapter(op)

On the surface, it would pave the way for implementing the API we seek. However, make_reduce_into utilizes the cache_with_registered_key_functions decorator, and caching can pose an issue if we're now saving self.d_in and self.d_out attributes in our reducers. It explains why in my test suite, I could have two identical tests (or three, four, etc) and only the first would pass an assert statement. In the subsequent tests when I'm creating a new reducer, it fetches the reducer created from the first test, and mutates those self.d_in and self.d_out attributes and doesn't actually touch the d_in and d_out arrays that were passed to make_reduce_into in the current test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

cuda.cccl.parallel: Add named APIs to do temporary memory allocation and compute respectively,

1 participant