Fix TSan data race with FFI tests#7244

Draft
myrrc wants to merge 1 commit into develop from myrrc/ffi-sink-race
Conversation


myrrc commented Apr 1, 2026

There are TSan data race reports like this one:

https://github.com/vortex-data/vortex/actions/runs/23804465237/job/69373850775?pr=7018

Significant parts of the stack trace:

```
Read of size 8 at 0xffff888084a0 by thread T40:

  #8 vx_array_sink_open_file::{closure#0}::{closure#0} vortex-ffi/src/sink.rs:75
  #9 <vortex_io::runtime::handle::Handle>::spawn::<vx_array_sink_open_file::{closure#0}::{closure#0}, ...> vortex-io/src/runtime/handle.rs:73
  #31 vx_array_sink_close::{closure#0} vortex-ffi/src/sink.rs:110
  #33 vx_array_sink_close vortex-ffi/src/sink.rs:106
  #34 test_sink_basic_workflow vortex-ffi/src/sink.rs:166

Previous write of size 8 at 0xffff888084a0 by thread T42:

  #28 <vortex_io::runtime::current::CurrentThreadRuntime>::block_on::<vx_array_sink_close::{closure#0}::{closure#0}, ...> vortex-io/src/runtime/current.rs:102
  #29 vx_array_sink_close::{closure#0} vortex-ffi/src/sink.rs:110
  #31 vx_array_sink_close vortex-ffi/src/sink.rs:106
  #32 test_sink_multiple_arrays

Location is heap block of size 112 at 0xffff88808490 allocated by thread T42:

  #13 <vortex_file::writer::VortexWriteOptions>::write::<&mut async_fs::File, ...>::{closure#0} vortex-file/src/writer.rs:137
  #14 vx_array_sink_open_file::{closure#0}::{closure#0} vortex-ffi/src/sink.rs:75
  #15 <vortex_io::runtime::handle::Handle>::spawn::<vx_array_sink_open_file::{closure#0}::{closure#0}, ...>::{closure#0} vortex-io/src/runtime/handle.rs:73
  #37 vx_array_sink_close::{closure#0} vortex-ffi/src/sink.rs:110
  #40 test_sink_multiple_arrays vortex-ffi/src/sink.rs:208
```

The race occurs in the sink FFI tests but can be narrowed down to the following reproduction:

```rust
fn test(path: String) {
    let session = VortexSession::default().with_handle(RUNTIME.handle());
    let dtype = DType::Primitive(vortex::dtype::PType::I32, false.into());
    let (mut sink, rx) = mpsc::channel(32);
    let array_stream = ArrayStreamAdapter::new(dtype.clone(), rx.into_stream());

    let array = PrimitiveArray::new(buffer![3i32], Validity::NonNullable);
    RUNTIME.block_on(async move {
        sink.send(Ok(array.into())).await.unwrap();
        sink.close_channel();

        let mut file = File::create(path).await.unwrap();
        session
            .write_options()
            .write(&mut file, array_stream)
            .await
            .unwrap();
        file.sync_all()
            .await
            .map_err(|e| vortex_err!("sync error: {e}"))
            .unwrap();
    });
}

// Two tests calling the same helper: cargo test may schedule them
// on different OS threads, which is what exposes the race.
#[test]
fn test_f1() {
    test("1".to_string());
}

#[test]
fn test_f2() {
    test("2".to_string());
}
```

The underlying issue is that RUNTIME in the FFI layer is a smol::Executor. Because there is no background thread pool and we use CurrentThreadRuntime, futures run on the host program's threads, in this case cargo test's worker threads. If test_f1 and test_f2 are scheduled on different threads, TSan fires in vortex-io's spawn() (case 1, the oneshot completion channel) and in vortex-file's write_internal (case 2, kanal::bounded_async), neither of which synchronizes state reads across consumers on different threads.

The plan here is to replace the oneshot channel with the futures-rs oneshot channel and benchmark that change.

myrrc added the changelog/fix (A bug fix) label on Apr 1, 2026

codspeed-hq bot commented Apr 1, 2026

Merging this PR will degrade performance by 15.51%

❌ 2 regressed benchmarks
✅ 1104 untouched benchmarks
⏩ 1522 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| Simulation | bitwise_not_vortex_buffer_mut[128] | 317.8 ns | 376.1 ns | -15.51% |
| Simulation | bitwise_not_vortex_buffer_mut[1024] | 477.2 ns | 535.6 ns | -10.89% |

Comparing myrrc/ffi-sink-race (044dfff) with develop (3ea259e)²

Open in CodSpeed

Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on develop (8060ae0) during the generation of this report, so 3ea259e was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.


myrrc commented Apr 1, 2026

The second traceback points to a (possibly) related issue in kanal while writing a file.
I suspect the root cause is our use of smol::Executor without a background thread pool in the FFI layer.
