fix: CUDA bitpacked sliced output allocation by 0ax1 · Pull Request #8622 · vortex-data/vortex

0ax1 · 2026-06-29T13:45:36Z

Decode sliced bit-packed arrays in padded coordinates by sizing and launching for offset + len. This keeps the returned offset..offset+len device slice in bounds and ensures the final touched 1024-value chunk is decoded.
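To make "padded coordinates" concrete, here is a minimal sketch of the chunk arithmetic, assuming the 1024-value chunks mentioned above. The names `CHUNK`, `chunks_for_len`, and `chunks_for_padded` are hypothetical and chosen for illustration; they are not identifiers from `vortex-cuda`.

```rust
/// Values per decoded chunk, as stated in the description above.
const CHUNK: usize = 1024;

/// Chunks covered when only `len` is considered (the offset is ignored).
fn chunks_for_len(len: usize) -> usize {
    len.div_ceil(CHUNK)
}

/// Chunks touched by the window `offset..offset + len` in padded coordinates.
fn chunks_for_padded(offset: usize, len: usize) -> usize {
    (offset + len).div_ceil(CHUNK)
}

fn main() {
    // A 1024-value window that starts at offset 1 spills into a second chunk.
    let (offset, len) = (1, 1024);
    println!("len only: {}", chunks_for_len(len));
    println!("padded:   {}", chunks_for_padded(offset, len));
}
```

For `offset = 1, len = 1024` the length-only count is 1 chunk while the padded count is 2, which is why sizing and launching for `offset + len` is needed to decode the final touched chunk.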

0ax1 · 2026-06-29T13:45:51Z

report:

# GPU bit-packed decode overruns its output buffer for sliced arrays

**Component:** `vortex-cuda` — `decode_bitpacked` (`vortex-cuda/src/kernel/encodings/bitpacked.rs`)
**Affects:** `develop` · **Severity:** panic during GPU scan

## Problem

GPU-decoding a **sliced** bit-packed array (non-zero `offset`) overruns the output device buffer:

```rust
// allocation ignores `offset`
let output_slice = ctx.device_alloc::<A>(len.next_multiple_of(1024))?;
...
// slice uses `offset`
output_buf.slice_typed::<A>(offset..(offset + len))
```

A sliced bit-packed array carries a non-zero `offset` (FastLanes records the slice start within the
first 1024-block). The allocation is `len.next_multiple_of(1024)` but the slice end is `offset + len`,
so when `offset > len.next_multiple_of(1024) - len` the slice runs past the allocation and panics at
`device_buffer.rs:320`: `Slice range end {offset+len} exceeds allocation size {…}`.

**Likely fix:** allocate `(offset + len).next_multiple_of(1024)` (and confirm the kernel writes into
the same offset-based window).
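The overrun condition and the proposed allocation can be checked in isolation. The following is a minimal sketch, not the `vortex-cuda` code: `alloc_ignoring_offset`, `alloc_with_offset`, and `overruns` are hypothetical helpers, sizes are counted in elements, and `offset` is assumed to stay below 1024 because it is a position within the first block.

```rust
const CHUNK: usize = 1024;

/// Element count allocated when `offset` is ignored.
fn alloc_ignoring_offset(len: usize) -> usize {
    len.next_multiple_of(CHUNK)
}

/// Element count that covers the whole padded window `0..offset + len`.
fn alloc_with_offset(offset: usize, len: usize) -> usize {
    (offset + len).next_multiple_of(CHUNK)
}

/// True when the slice `offset..offset + len` ends past `alloc` elements.
fn overruns(offset: usize, len: usize, alloc: usize) -> bool {
    offset + len > alloc
}

fn main() {
    for len in [1, 1000, 1024, 1500, 2048] {
        for offset in 0..CHUNK {
            let old = alloc_ignoring_offset(len);
            // Same condition as in the report, rearranged to avoid the subtraction.
            assert_eq!(overruns(offset, len, old), offset > old - len);
            // The offset-aware allocation always contains the slice.
            assert!(!overruns(offset, len, alloc_with_offset(offset, len)));
        }
    }
    println!("offset-aware allocation is always in bounds");
}
```

The padding slack `len.next_multiple_of(1024) - len` is what hides the bug for many inputs: a slice only overruns once its offset exceeds that slack.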

## Reproduction

Add to the `tests` module of `bitpacked.rs`, then run on a GPU box:

```shell
cargo nextest run -p vortex-cuda --all-features -E 'test(test_sliced_offset_overruns_output)'
```

```rust
#[crate::test]
fn test_sliced_offset_overruns_output() -> VortexResult<()> {
    use crate::executor::CudaArrayExt;

    let mut ctx = vortex_array::array_session().create_execution_ctx();
    let mut cuda_ctx = CudaSession::create_execution_ctx(&crate::cuda_session())
        .vortex_expect("failed to create execution context");

    // 2048 values (two 1024-blocks); all < 64 so they fit in 6 bits (no patches).
    let array = PrimitiveArray::new((0u32..64).cycle().take(2048).collect::<Buffer<_>>(), NonNullable);
    let bp = BitPacked::encode(&array.into_array(), 6, &mut ctx)?;

    // Slice to a 1024-long window at offset 1 -> offset = 1, len = 1024.
    // Decoder allocates 1024 but slices 1..1025.
    let sliced = bp.into_array().slice(1..1025)?;

    let gpu_result = block_on(async {
        sliced.clone().execute_cuda(&mut cuda_ctx).await
            .vortex_expect("GPU decompression failed")
            .into_host().await.map(|a| a.into_array())
    })?;

    assert_arrays_eq!(sliced, gpu_result, &mut ctx);
    Ok(())
}
```

Output:

```
panicked at vortex-cuda/src/device_buffer.rs:320:9:
Slice range end 4100 exceeds allocation size 4096
```

(4100 = (1 + 1024) × 4 bytes; the 4-byte overrun is one `u32`, which is exactly the slice offset of 1.)

## Impact

Dictionary columns slice their bit-packed codes child, producing these non-zero-offset arrays, so
any GPU scan decoding such a column hits this. Found scanning the ClickBench `Title` column on a
GH200 (`offset = 382`, `len = 106496`).
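The ClickBench case can be worked through with the same arithmetic. This is a small sketch in element counts; `window` is a hypothetical helper, and the element width of the codes is left out because the report does not state it.

```rust
const CHUNK: usize = 1024;

/// Returns (old allocation, slice end, offset-aware allocation) in elements.
fn window(offset: usize, len: usize) -> (usize, usize, usize) {
    let old_alloc = len.next_multiple_of(CHUNK);
    let slice_end = offset + len;
    let fixed_alloc = slice_end.next_multiple_of(CHUNK);
    (old_alloc, slice_end, fixed_alloc)
}

fn main() {
    // Values reported for the ClickBench Title column.
    let (old_alloc, slice_end, fixed_alloc) = window(382, 106_496);
    println!("old allocation:         {old_alloc}");
    println!("slice end:              {slice_end}");
    println!("overrun:                {}", slice_end - old_alloc);
    println!("offset-aware allocation: {fixed_alloc}");
}
```

Because 106496 is exactly 104 × 1024, the length-only allocation has no padding slack, so the slice overruns by the full offset of 382 elements. Rounding `offset + len` up instead gives 107520 elements (105 × 1024), which contains the slice.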

Decode sliced bit-packed arrays in padded coordinates by sizing and launching for offset + len. This keeps the returned offset..offset+len device slice in bounds and ensures the final touched 1024-value chunk is decoded.

Signed-off-by: "Alexander Droste" <alexander.droste@protonmail.com>

0ax1 · 2026-06-29T13:47:41Z

Thanks for the heads-up on this one, @gargiulofrancesco!

codspeed-hq · 2026-06-29T13:48:00Z

Merging this PR will improve performance by 16.39%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 2 improved benchmarks
❌ 1 regressed benchmark
✅ 1592 untouched benchmarks
⏩ 4 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | `slice_empty_vortex` | 339.4 ns | 397.8 ns | -14.66% |
| ⚡ | Simulation | `chunked_bool_canonical_into[(1000, 10)]` | 26.3 µs | 15.9 µs | +65.8% |
| ⚡ | Simulation | `encode_varbin[(1000, 32)]` | 163.7 µs | 146.9 µs | +11.45% |

Tip

Investigate this regression by commenting `@codspeedbot fix this regression` on this PR, or directly use the CodSpeed MCP with your agent.

Comparing `ad/fix-cuda-bitpacked-slice-offset` (f8a35cc) with `develop` (a9f77d1)

¹ 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

0ax1 requested a review from a team June 29, 2026 13:45

0ax1 added the changelog/fix A bug fix label Jun 29, 2026

0ax1 requested a review from robert3005 June 29, 2026 13:46

0ax1 force-pushed the ad/fix-cuda-bitpacked-slice-offset branch from 6927825 to f8a35cc Compare June 29, 2026 13:47

robert3005 approved these changes Jun 29, 2026

0ax1 enabled auto-merge (squash) June 29, 2026 13:51

0ax1 merged commit 5ea73bd into develop Jun 29, 2026
73 checks passed

0ax1 deleted the ad/fix-cuda-bitpacked-slice-offset branch June 29, 2026 13:58
