This extension adds a new built-in function to perform barrier synchronization across the work-group even if some of the work-items are not "alive" anymore due to having returned from the kernel.
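To make the difference concrete, here is a minimal C model of the release condition. It is only a sketch of the intended semantics under simplifying assumptions (a sequential model with one flag pair per work-item), not an implementation. The names `work_item_state`, `classic_barrier_releases`, and `alive_barrier_releases` are hypothetical and are not part of OpenCL or of this extension.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model type, not an OpenCL API: the state of one work-item. */
typedef struct {
    bool returned; /* the work-item has already returned from the kernel */
    bool arrived;  /* the work-item is waiting at the barrier */
} work_item_state;

/* Existing rule: every work-item in the work-group must reach the barrier,
 * so a work-item that returned early keeps the barrier from releasing. */
bool classic_barrier_releases(const work_item_state *wi, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (!wi[i].arrived) {
            return false;
        }
    }
    return true;
}

/* "Alive-only" rule: work-items that have returned are not waited for;
 * only the work-items still running must reach the barrier. */
bool alive_barrier_releases(const work_item_state *wi, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (!wi[i].returned && !wi[i].arrived) {
            return false;
        }
    }
    return true;
}
```

With a work-group of four where two work-items have returned and the other two are waiting, the first predicate stays false while the second becomes true.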
Ping @kpet @bashbaug @Kerilk @karolherbst.
I wonder if this is necessary and if the OpenCL spec could be relaxed instead here. The
@karolherbst: Interesting. I didn't know SPIR-V changed the barrier semantics in v1.7, and I don't see the wording spelled out explicitly in the spec. This is a pretty drastic change: it basically makes v1.7 backwards incompatible with v1.6 for targets that do not implement the "active/alive only" semantics. There could be devices we don't know of where it is (significantly) more expensive to implement. Also, vectorizing work-groups of kernels with such barriers on CPU/SIMD targets induces overhead, especially on non-predicated vector ISAs. The affected cases should be analyzable at compile time, though.
I doubt it's problematic for anything that isn't a CPU, because the threading model there is entirely different and is closer to masked/predicated SIMD instructions. But it's probably best to discuss this at the WG meeting and ask everybody to check whether they see any problems with it from a hardware perspective. It would be a bit problematic for CPU implementations, so for those it might make sense to keep it explicit.
Couple of thoughts and corrections:
If this is correct, we should file an issue to add the right validation rule for Workgroup scope barriers in the OpenCL SPIR-V environment spec as well.
I assume it's a problem on older ones?
@bashbaug thanks for the clarifications. I suggest we start with a new built-in and consider converting it to a main spec requirement in the future, once there are no more relevant devices where requiring the semantics is a problem. If these were the default barrier semantics, CPU/SIMD vectorization would need a bit of control flow analysis to detect the cases where predication is not needed. I don't think that's much to worry about in most cases. I see this being used only when generating code from inputs whose source language might have these semantics (HIP/CUDA). Even in those cases it makes sense to CF-analyze the kernel first to find out whether it really needs the semantics.