fix: CUDA bitpacked sliced output allocation #8622

Merged
0ax1 merged 1 commit into develop from ad/fix-cuda-bitpacked-slice-offset
Jun 29, 2026

0ax1 commented Jun 29, 2026

Decode sliced bit-packed arrays in padded coordinates by sizing and launching for `offset + len`. This keeps the returned `offset..offset+len` device slice in bounds and ensures the final touched 1024-value chunk is decoded.
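
As a rough illustration of the sizing described above, the sketch below computes the padded allocation length and the number of chunks to decode for a slice. `padded_decode_size` and `CHUNK` are hypothetical names chosen for this example, not the vortex-cuda API; the only assumption taken from the description is that decoding proceeds in 1024-value chunks.

```rust
/// Number of values per decode chunk (assumed from the PR description).
const CHUNK: usize = 1024;

/// Hypothetical helper, not part of vortex-cuda: sizes a decode in padded
/// coordinates, i.e. counted from the start of the first touched chunk.
/// Returns `(alloc_len, num_chunks)`: elements to allocate and chunks to decode.
fn padded_decode_size(offset: usize, len: usize) -> (usize, usize) {
    // The returned slice covers padded positions offset..offset + len,
    // so the buffer must reach the end of the chunk containing its last value.
    let padded_len = (offset + len).next_multiple_of(CHUNK);
    (padded_len, padded_len / CHUNK)
}
```

For `offset = 1`, `len = 1024` this gives 2048 elements and 2 chunks, so `1..1025` stays in bounds and the second chunk, which holds the last value of the slice, is decoded.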

0ax1 requested a review from a team June 29, 2026 13:45

0ax1 commented Jun 29, 2026

report:

# GPU bit-packed decode overruns its output buffer for sliced arrays

**Component:** `vortex-cuda` — `decode_bitpacked` (`vortex-cuda/src/kernel/encodings/bitpacked.rs`)
**Affects:** `develop` · **Severity:** panic during GPU scan

## Problem

GPU-decoding a **sliced** bit-packed array (non-zero `offset`) overruns the output device buffer:

```rust
// allocation ignores `offset`
let output_slice = ctx.device_alloc::<A>(len.next_multiple_of(1024))?;
...
// slice uses `offset`
output_buf.slice_typed::<A>(offset..(offset + len))
```

A sliced bit-packed array carries a non-zero `offset` (FastLanes records the slice start within the
first 1024-block). The allocation is `len.next_multiple_of(1024)` but the slice end is `offset + len`,
so when `offset > len.next_multiple_of(1024) - len` the slice runs past the allocation and panics at
`device_buffer.rs:320`: `Slice range end {offset+len} exceeds allocation size {…}`.

Likely fix: allocate `(offset + len).next_multiple_of(1024)` (and confirm the kernel writes into
the same offset-based window).
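
The overrun condition can be checked with plain integer arithmetic. This is a minimal sketch, assuming the pre-fix allocation is exactly `len.next_multiple_of(1024)` elements as in the snippet quoted in the report; `old_alloc_len` and `overrun` are hypothetical helper names, not functions from the codebase.

```rust
/// Number of values per FastLanes block.
const CHUNK: usize = 1024;

/// Elements allocated before the fix: `offset` is ignored.
fn old_alloc_len(len: usize) -> usize {
    len.next_multiple_of(CHUNK)
}

/// Number of elements by which the slice `offset..offset + len` runs past
/// the pre-fix allocation; 0 means the slice is in bounds.
fn overrun(offset: usize, len: usize) -> usize {
    (offset + len).saturating_sub(old_alloc_len(len))
}
```

With `len = 1000` the allocation has 24 elements of padding, so `overrun(24, 1000)` is 0 while `overrun(25, 1000)` is 1, which matches the condition `offset > len.next_multiple_of(1024) - len`.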

## Reproduction

Add to the `tests` module of `bitpacked.rs`, then run on a GPU box:

```sh
cargo nextest run -p vortex-cuda --all-features -E 'test(test_sliced_offset_overruns_output)'
```

```rust
#[crate::test]
fn test_sliced_offset_overruns_output() -> VortexResult<()> {
    use crate::executor::CudaArrayExt;

    let mut ctx = vortex_array::array_session().create_execution_ctx();
    let mut cuda_ctx = CudaSession::create_execution_ctx(&crate::cuda_session())
        .vortex_expect("failed to create execution context");

    // 2048 values (two 1024-blocks); all < 64 so they fit in 6 bits (no patches).
    let array = PrimitiveArray::new((0u32..64).cycle().take(2048).collect::<Buffer<_>>(), NonNullable);
    let bp = BitPacked::encode(&array.into_array(), 6, &mut ctx)?;

    // Slice to a 1024-long window at offset 1 -> offset = 1, len = 1024.
    // Decoder allocates 1024 but slices 1..1025.
    let sliced = bp.into_array().slice(1..1025)?;

    let gpu_result = block_on(async {
        sliced.clone().execute_cuda(&mut cuda_ctx).await
            .vortex_expect("GPU decompression failed")
            .into_host().await.map(|a| a.into_array())
    })?;

    assert_arrays_eq!(sliced, gpu_result, &mut ctx);
    Ok(())
}
```

Output:

```text
panicked at vortex-cuda/src/device_buffer.rs:320:9:
Slice range end 4100 exceeds allocation size 4096
```

(4100 = (1 + 1024) × 4 bytes; the 4-byte overrun is one `u32` element, matching the slice offset of 1.)
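
The byte figures in the panic message follow from the element counts and the element size. The sketch below reproduces them under stated assumptions: `check_slice_bytes` is a hypothetical function modelled on the message text, not the actual check in `device_buffer.rs`, and it assumes both bounds are simply the element counts multiplied by `size_of::<T>()`.

```rust
use std::mem::size_of;

/// Hypothetical bounds check, modelled on the panic message in the report.
/// `alloc_elems` is the allocation size in elements of `T`; the slice
/// requested is `offset..offset + len`, also in elements.
fn check_slice_bytes<T>(alloc_elems: usize, offset: usize, len: usize) -> Result<(), String> {
    // Convert both bounds to bytes, as the message reports them.
    let end = (offset + len) * size_of::<T>();
    let size = alloc_elems * size_of::<T>();
    if end > size {
        Err(format!("Slice range end {end} exceeds allocation size {size}"))
    } else {
        Ok(())
    }
}
```

Calling it for `u32` with the pre-fix allocation of 1024 elements and the slice `1..1025` returns the same text as the panic, while an allocation of `(1 + 1024).next_multiple_of(1024) = 2048` elements passes.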

## Impact

Dictionary columns slice their bit-packed codes child, producing these non-zero-offset arrays, so
a GPU scan decoding such a column can hit this panic. Found scanning the ClickBench `Title` column on a
GH200 (`offset = 382`, `len = 106496`).

0ax1 added the `changelog/fix` (A bug fix) label Jun 29, 2026
0ax1 requested a review from robert3005 June 29, 2026 13:46
Decode sliced bit-packed arrays in padded coordinates by sizing and launching for offset + len. This keeps the returned offset..offset+len device slice in bounds and ensures the final touched 1024-value chunk is decoded.

Signed-off-by: "Alexander Droste" <alexander.droste@protonmail.com>
0ax1 force-pushed the ad/fix-cuda-bitpacked-slice-offset branch from 6927825 to f8a35cc June 29, 2026 13:47

0ax1 commented Jun 29, 2026

Thanks for the heads-up on this one, @gargiulofrancesco!

codspeed-hq Bot commented Jun 29, 2026

Merging this PR will improve performance by 16.39%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 2 improved benchmarks
❌ 1 regressed benchmark
✅ 1592 untouched benchmarks
⏩ 4 skipped benchmarks [1]

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | `slice_empty_vortex` | 339.4 ns | 397.8 ns | -14.66% |
| Simulation | `chunked_bool_canonical_into[(1000, 10)]` | 26.3 µs | 15.9 µs | +65.8% |
| Simulation | `encode_varbin[(1000, 32)]` | 163.7 µs | 146.9 µs | +11.45% |

Comparing ad/fix-cuda-bitpacked-slice-offset (f8a35cc) with develop (a9f77d1)

Footnotes

1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

0ax1 enabled auto-merge (squash) June 29, 2026 13:51
0ax1 merged commit 5ea73bd into develop Jun 29, 2026
73 checks passed
0ax1 deleted the ad/fix-cuda-bitpacked-slice-offset branch June 29, 2026 13:58