
run to run scan warpspeed impl sm100+ #9263

Open
srinivasyadav18 wants to merge 5 commits into NVIDIA:main from srinivasyadav18:run_to_run_opt_ws_sm100

Conversation

@srinivasyadav18

@srinivasyadav18 srinivasyadav18 commented Jun 4, 2026

Contributor

Description

closes #7556

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@srinivasyadav18 srinivasyadav18 requested a review from a team as a code owner June 4, 2026 19:09
@srinivasyadav18 srinivasyadav18 requested a review from pauleonix June 4, 2026 19:09
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 4, 2026

Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 22ca4ed4-be8b-4f0c-bab4-028e5a8e487d

📥 Commits

Reviewing files that changed from the base of the PR and between 5b9637b and 14e5c19.

📒 Files selected for processing (4)
  • cub/cub/detail/warpspeed/look_ahead.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh
  • cub/cub/device/dispatch/tuning/tuning_scan.cuh
🚧 Files skipped from review as they are similar to previous changes (4)
  • cub/cub/device/dispatch/kernels/kernel_scan.cuh
  • cub/cub/device/dispatch/tuning/tuning_scan.cuh
  • cub/cub/detail/warpspeed/look_ahead.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR implements run-to-run (deterministic) support for the warpspeed scan optimization on SM100+ targets by adding a stable reduction-order path to the warpspeed lookahead logic and plumbing that choice through the scan dispatch and tuning logic. The change threads a compile-time StableReductionOrder boolean through the warpspeed closure, kernel body, and dispatch so the warpspeed path can use a deterministic lookahead variant when required.
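The flag plumbing described above can be illustrated with a minimal host-only sketch. All function names here (`lookahead_legacy`, `lookahead_stable`, `closure_lookahead`, `kernel_body`, `kernel_entry`) are hypothetical stand-ins rather than the PR's actual signatures; only the `StableReductionOrder` template parameter and the use of `if constexpr` come from the summary.

```cpp
#include <string>

// Hypothetical stand-ins for the two lookahead variants. They return labels
// only, so the path selected at compile time is observable.
inline std::string lookahead_legacy() { return "legacy"; }
inline std::string lookahead_stable() { return "stable"; }

// Innermost layer (the closure in the PR): selects the variant at compile time.
// The branch that is not taken is discarded, so it adds no runtime cost.
template <bool StableReductionOrder = false>
std::string closure_lookahead()
{
  if constexpr (StableReductionOrder)
  {
    return lookahead_stable();
  }
  else
  {
    return lookahead_legacy();
  }
}

// Middle layer (the kernel body in the PR): forwards the flag unchanged.
template <bool StableReductionOrder = false>
std::string kernel_body()
{
  return closure_lookahead<StableReductionOrder>();
}

// Outer layer (the kernel entry point in the PR): defaults to the legacy path.
template <bool StableReductionOrder = false>
std::string kernel_entry()
{
  return kernel_body<StableReductionOrder>();
}
```

In this sketch `kernel_entry<>()` reaches the legacy variant and `kernel_entry<true>()` reaches the stable one. Defaulting the parameter to `false` at every layer keeps existing instantiations unchanged.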

Confirmed changes (key points)

  • Deterministic warpspeed lookahead:

    • Added warpIncrementalLookaheadStable<numTileStatesPerThread, AccumT, ScanOpT>(...) in cub/cub/detail/warpspeed/look_ahead.cuh.
    • New includes (e.g., <cuda/std/__algorithm/min.h>) are present. The lookahead logic uses warp-level intrinsics and a stable reduction ordering: it anchors progress to boundaries at multiples of 32, computes contiguous runs of valid aggregates, and performs warp reductions only for fixed expected counts, so the reduction order stays deterministic.
    • Local temporary-storage declarations adjusted to use explicit TempStorage types in reduction helpers.
    • The stable function updates previous-state variables via reference parameters when advancing.
  • Warpspeed scan integration:

    • warpspeed_scan_closure template in cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh now accepts template parameter bool StableReductionOrder = false and uses if constexpr to select warpIncrementalLookaheadStable when StableReductionOrder is true, otherwise the legacy warpIncrementalLookahead.
    • device_scan_warpspeed_body is updated to accept and forward StableReductionOrder to the closure.
  • Kernel dispatch:

    • DeviceScanKernel (cub/cub/device/dispatch/kernels/kernel_scan.cuh) now includes the StableReductionOrder template parameter (default false) and forwards it to device_scan_warpspeed_body at the warpspeed dispatch site.
  • Policy / tuning:

    • Tuning logic in cub/cub/device/dispatch/tuning/tuning_scan.cuh allows selecting warpspeed even when a stable reduction order is requested, but only when compute capability >= 10.0 (SM100+). Previously warpspeed was excluded when stable reduction order was required.
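Why a fixed reduction grouping matters can be shown with a small host-only analogy. This is not the PR's warp-level algorithm: `reduce_eager`, `reduce_stable`, and the `arrivals` vector (modeling how many tile aggregates happen to be ready at each poll) are assumptions made for illustration. Floating-point addition is not associative, so a reduction whose grouping follows timing can change between runs, while one that only reduces fixed-size groups cannot.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sum v[first, last) strictly left to right. Requires first < last.
inline float sum_range(const std::vector<float>& v, std::size_t first, std::size_t last)
{
  float partial = v[first];
  for (std::size_t i = first + 1; i < last; ++i)
  {
    partial += v[i];
  }
  return partial;
}

// Timing-dependent grouping: every batch is reduced as soon as it arrives,
// so the grouping of additions follows the arrival pattern.
inline float reduce_eager(const std::vector<float>& v, const std::vector<std::size_t>& arrivals)
{
  float total = 0.0f;
  std::size_t consumed = 0;
  for (std::size_t batch : arrivals)
  {
    const std::size_t end = std::min(consumed + batch, v.size());
    if (end > consumed)
    {
      total += sum_range(v, consumed, end);
      consumed = end;
    }
  }
  return total;
}

// Fixed grouping: values are reduced only in complete groups of `group`
// elements (group must be > 0). The final group may be shorter, but its size
// depends only on v.size(), never on the arrival pattern.
inline float reduce_stable(const std::vector<float>& v,
                           const std::vector<std::size_t>& arrivals,
                           std::size_t group)
{
  float total = 0.0f;
  std::size_t consumed = 0;
  std::size_t available = 0;
  for (std::size_t batch : arrivals)
  {
    available = std::min(available + batch, v.size());
    while (consumed < available && (available - consumed >= group || available == v.size()))
    {
      const std::size_t end = std::min(consumed + group, available);
      total += sum_range(v, consumed, end);
      consumed = end;
    }
  }
  return total;
}
```

With the input `{1e8f, 1.0f, -1e8f, 1.0f}`, `reduce_eager` can give different sums for the arrival patterns `{4}` and `{1, 3}`, whereas `reduce_stable` with a group size of 2 gives the same sum for both. The trade-off in this model is that the stable variant waits for a full group before making progress.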

Repository inspection notes

  • The files referenced in the raw summary are present and show the expected edits (lookahead stable function, StableReductionOrder template plumbing, and tuning change). Snippets confirming function names, template parameters, and policy gating were observed in the repository outputs.
  • The PR description references closing issue #7556 and the commit history includes warp-ballot related changes, but the PR does not include added tests or documentation updates.

Tests / Docs / Checklist (remaining work)

The PR does not include tests, benchmarks, or documentation updates required by issue #7556. Before merging, add:

  • Overloads / APIs (or environment-based overloads) exposing run_to_run options for DeviceScan::InclusiveScan and DeviceScan::ExclusiveScan as specified by the issue.
  • Tests covering deterministic run-to-run DeviceScan behavior (inclusive + exclusive).
  • A benchmark exercising and measuring run-to-run DeviceScan (warpspeed stable path).
  • Documentation updates describing the StableReductionOrder option and any required environment flags or API changes.
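As a rough idea of what a run-to-run test could check, the sketch below compares repeated scan outputs bit for bit. The harness name `is_run_to_run_deterministic` and the placeholder `host_inclusive_scan` are hypothetical; a real test would launch the device scan instead of the sequential host scan used here. A byte-wise comparison is used because `==` on floats treats `0.0f` and `-0.0f` as equal and never matches NaN.

```cpp
#include <cstddef>
#include <cstring>
#include <numeric>
#include <vector>

// Runs `scan` `repetitions` times on the same input and reports whether every
// result is bitwise identical to the first one.
template <typename ScanFn>
bool is_run_to_run_deterministic(ScanFn&& scan, const std::vector<float>& input, int repetitions)
{
  const std::vector<float> reference = scan(input);
  for (int r = 1; r < repetitions; ++r)
  {
    const std::vector<float> result = scan(input);
    if (result.size() != reference.size())
    {
      return false;
    }
    // Compare raw bytes; skip the call for empty outputs to avoid passing
    // null pointers to memcmp.
    if (!reference.empty()
        && std::memcmp(result.data(), reference.data(), reference.size() * sizeof(float)) != 0)
    {
      return false;
    }
  }
  return true;
}

// Placeholder standing in for a device launch: a sequential inclusive scan.
inline std::vector<float> host_inclusive_scan(const std::vector<float>& in)
{
  std::vector<float> out(in.size());
  std::partial_sum(in.begin(), in.end(), out.begin());
  return out;
}
```

A sequential host scan passes this check trivially. The harness becomes meaningful once `scan` wraps a parallel implementation whose internal reduction order may vary between launches.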

Related issue

Closes #7556: Productize run-to-run DeviceScan


Walkthrough

Adds a deterministic warpIncrementalLookaheadStable, threads a compile-time StableReductionOrder flag through the warpspeed closure and dispatch, and relaxes policy gating to permit warpspeed on sm_100+ when stable reduction order is requested.

Changes

Stable Warpspeed Scan Implementation

| Layer / File(s) | Summary |
|---|---|
| Stable lookahead function: cub/cub/detail/warpspeed/look_ahead.cuh | New warpIncrementalLookaheadStable deterministically anchors reduction progress to 32-tile boundaries, enforces fixed reduction order via expected tile count, updates idxTilePrev and aggrExclusiveCtaPrev by reference, returns the exclusive aggregate, and a local warp-reduce temp-storage declaration was adjusted. |
| Warpspeed kernel stable routing: cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh, cub/cub/device/dispatch/kernels/kernel_scan.cuh | warpspeed_scan_closure and device_scan_warpspeed_body gain StableReductionOrder template parameter; lookahead helper conditionally calls warpIncrementalLookaheadStable (stable path) or warpIncrementalLookahead (non-stable), and updates previous-state variables in the path-specific location. |
| Dispatch threading and policy gating: cub/cub/device/dispatch/tuning/tuning_scan.cuh | Policy selector now allows warpspeed when stable reduction is requested on compute capability >= sm_100; DeviceScanKernel forwards StableReductionOrder to warpspeed dispatch. |
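The policy gating change can be sketched as a pair of `constexpr` selectors. The types and names here (`compute_capability`, `scan_path`, `select_before`, `select_after`) are hypothetical and greatly simplified compared with the real tuning code; the sketch assumes, as stated in the review, that a warpspeed policy exists only for compute capability 10.0 and above.

```cpp
// Hypothetical compute capability pair, compared lexicographically.
struct compute_capability
{
  int major_ver;
  int minor_ver;
};

constexpr bool operator>=(compute_capability a, compute_capability b)
{
  return a.major_ver > b.major_ver || (a.major_ver == b.major_ver && a.minor_ver >= b.minor_ver);
}

// Hypothetical labels for the selected scan implementation.
enum class scan_path
{
  fallback,
  warpspeed
};

// Before the change: a stable reduction order request always excluded warpspeed.
constexpr scan_path select_before(compute_capability cc, bool require_stable_reduction_order)
{
  const bool warpspeed_available = cc >= compute_capability{10, 0};
  return (warpspeed_available && !require_stable_reduction_order) ? scan_path::warpspeed : scan_path::fallback;
}

// After the change: a stable request is accepted on cc >= 10.0. In this
// simplified model that gate coincides with warpspeed availability itself.
constexpr scan_path select_after(compute_capability cc, bool require_stable_reduction_order)
{
  const bool warpspeed_available = cc >= compute_capability{10, 0};
  const bool stable_ok           = !require_stable_reduction_order || cc >= compute_capability{10, 0};
  return (warpspeed_available && stable_ok) ? scan_path::warpspeed : scan_path::fallback;
}
```

Under these assumptions, a stable request on compute capability 10.0 moves from `scan_path::fallback` to `scan_path::warpspeed`, while every selection below 10.0 is unchanged.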

Assessment against linked issues

| Objective | Addressed | Explanation |
|---|---|---|
| Enable DeviceScan stable reduction path with warpspeed for run-to-run determinism [#7556] | | PR implements internal stable lookahead and plumbing but does not expose public DeviceScan API overloads or runtime/env flags for run-to-run selection. |

Suggested reviewers

  • miscco
  • fbusato
  • elstehle

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Contributor


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cub/cub/device/dispatch/tuning/tuning_scan.cuh (1)

1038-1047: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

suggestion: Update the inline rationale for the require_stable_reduction_order / cc >= {10, 0} gate. warpIncrementalLookaheadStable is available for __cccl_ptx_isa >= 860 (sm_90+), but the scan policy selector only produces a scan_warpspeed_policy when cc >= {10, 0}; otherwise get_warpspeed_policy returns {}. Stable warpspeed on sm_90+ is therefore blocked by warpspeed policy/tuning availability, not by stable lookahead codegen availability.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dfcdb20c-106f-4ae5-a688-9e19e5475411

📥 Commits

Reviewing files that changed from the base of the PR and between 316f9cc and cbd13bb.

📒 Files selected for processing (4)
  • cub/cub/detail/warpspeed/look_ahead.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh
  • cub/cub/device/dispatch/tuning/tuning_scan.cuh

@srinivasyadav18

Contributor Author

/ok to test cbd13bb

@github-actions

This comment has been minimized.

@srinivasyadav18

Contributor Author

pre-commit.ci autofix

@srinivasyadav18 srinivasyadav18 changed the title from "run to run warpspeed impl sm100+" to "run to run scan warpspeed impl sm100+" Jun 5, 2026
Comment thread cub/cub/detail/warpspeed/look_ahead.cuh Outdated
Comment thread cub/cub/detail/warpspeed/look_ahead.cuh Outdated
Comment thread cub/cub/detail/warpspeed/look_ahead.cuh Outdated
Comment thread cub/cub/detail/warpspeed/look_ahead.cuh Outdated
Comment thread cub/cub/detail/warpspeed/look_ahead.cuh Outdated
Comment thread cub/cub/detail/warpspeed/look_ahead.cuh Outdated
Comment thread cub/cub/detail/warpspeed/look_ahead.cuh
@srinivasyadav18

Contributor Author

/ok to test 5b9637b

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

🥳 CI Workflow Results

🟩 Finished in 2h 21m: Pass: 100%/284 | Total: 11d 16h | Max: 2h 20m | Hits: 19%/969497

See results here.

@srinivasyadav18 srinivasyadav18 force-pushed the run_to_run_opt_ws_sm100 branch from 5b9637b to 14e5c19 Compare June 9, 2026 12:51
@srinivasyadav18

Contributor Author

/ok to test 14e5c19


Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Productize run-to-run DeviceScan

5 participants