Speed up node boot times by parallelizing buffer acquisition #19025
jtuglu1 wants to merge 2 commits into apache:master
Conversation
Brokers currently allocate buffers serially on boot. For large numbers of buffers (100+), this can mean waiting several minutes to acquire the needed memory. This change parallelizes the acquisition of the buffers.
Force-pushed from 1fed5c0 to 408e2ff.
I will need to adjust my approach, since the test failures are due to deadlock/starvation from using the common pool for the parallel `IntStream` allocations.
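For context, a minimal sketch of the hazard (not code from this PR; `BUFFER_COUNT` and `BUFFER_SIZE_BYTES` are hypothetical): a parallel stream with no explicit executor schedules its work on `ForkJoinPool.commonPool()`, which is shared JVM-wide.

```java
import java.nio.ByteBuffer;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.IntStream;

class CommonPoolHazard
{
  static final int BUFFER_COUNT = 100;          // hypothetical
  static final int BUFFER_SIZE_BYTES = 1 << 20; // hypothetical

  static Queue<ByteBuffer> allocateOnCommonPool()
  {
    Queue<ByteBuffer> buffers = new ConcurrentLinkedQueue<>();
    // Without an explicit pool, .parallel() runs on ForkJoinPool.commonPool().
    // Each allocateDirect() call can block for a long time zeroing memory,
    // tying up common-pool workers that other parallel tasks depend on.
    IntStream.range(0, BUFFER_COUNT)
             .parallel()
             .forEach(i -> buffers.add(ByteBuffer.allocateDirect(BUFFER_SIZE_BYTES)));
    return buffers;
  }
}
```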
Just a quick clarification - isn't this applicable to all servers, including Historicals and Peons? Or did you notice the bottleneck primarily on the Brokers?
Yeah, this will speed up all servers that pre-allocate some number of buffers.
Review thread on processing/src/main/java/org/apache/druid/collections/DefaultBlockingPool.java (outdated, resolved).
@jtuglu1, to address the CI failures, would it make sense to not fully parallelize the allocation and instead use batches of size equal to the number of cores? Claude seems to think that this would reduce the contention in the JVM direct memory allocator.
I think this can still, in the worst case, run into issues, since you're not guaranteeing a completion deadline on the buffer-alloc tasks. This means you can still occupy the common thread pool's threads, which might cause other deadlock issues. To resolve this, I've created a temporary FJP (ForkJoinPool) to perform the allocs. Normally, doing this sort of thing would be prohibitive; however, FJP threads are created lazily, and there are at most 2 production usages of this pool per node, so we're spinning up at most 2 dedicated, short-lived allocation pools per node, only once (on boot), which I think is reasonable. LMK if you disagree. An alternative would be to make a static, shared FJP in the class.
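A minimal sketch of that shape (hypothetical names; the PR's actual `DefaultBlockingPool` change may differ): run the allocations on a short-lived, dedicated FJP and tear it down once the buffers are acquired, so the common pool is never touched.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ParallelBufferAllocator
{
  // Allocates `count` direct buffers of `sizeBytes` each on a temporary
  // ForkJoinPool sized to the machine, then shuts the pool down.
  static List<ByteBuffer> allocate(int count, int sizeBytes) throws Exception
  {
    ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
    try {
      // Submitting the parallel stream from inside the dedicated pool makes
      // its tasks run there instead of on ForkJoinPool.commonPool().
      return pool.submit(
          () -> IntStream.range(0, count)
                         .parallel()
                         .mapToObj(i -> ByteBuffer.allocateDirect(sizeBytes))
                         .collect(Collectors.toList())
      ).get();
    }
    finally {
      pool.shutdown(); // only needed once, at boot
    }
  }
}
```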
@gianm @clintropolis any thoughts here? |
Force-pushed from ade2804 to 05e7af5.
Description
Brokers/Historicals/Peons currently allocate buffers serially on boot. For large quantities of large buffers (100+ buffers @ ~2GB each), this can mean waiting several minutes (in our case, upwards of 6 minutes for Brokers and 5+ seconds for Peons) just to acquire the needed memory, which isn't great. This is because it effectively performs 100 sequential malloc/mmap calls, each needing 2GB of zeroed memory. This change parallelizes the acquisition of the buffers, proportional to the number of cores available on the machine.
The `IntStream` threadpool is temporary and released once finished (this happens before the broker comes online and is serving queries anyway). This is very helpful for both deployments and auto-scaling, as it means newly-added nodes can more quickly begin providing value to the cluster. This can, for example, help with task launch times.

For applications that run with `-XX:+AlwaysPreTouch`, serial boot is even slower, as the JVM pre-allocates/touches the necessary memory pages.

Once the compiler version is ≥ 21, we can consider the `MemorySegment` API: allocate a single "wide" buffer and slice into that. That should be one large malloc/mmap call, which should return much faster.
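A hedged sketch of that future direction (JDK 21+ `java.lang.foreign`; not part of this PR, and names are hypothetical): one large native allocation, sliced into per-buffer views.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class WideBufferSketch
{
  // One malloc/mmap-style call instead of `count` separate ones; each slice
  // must still fit in a ByteBuffer (<= 2GiB).
  static List<ByteBuffer> allocateSliced(Arena arena, int count, long sizeBytes)
  {
    MemorySegment wide = arena.allocate(count * sizeBytes);
    List<ByteBuffer> slices = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      slices.add(wide.asSlice(i * sizeBytes, sizeBytes).asByteBuffer());
    }
    return slices;
  }
}
```

Using `Arena.ofShared()` would let the slices be handed to multiple threads; the memory lives until the arena is closed.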
Benchmarks
Allocate 10 and 100 merge buffers @ 2GB each, using optimal JVM memory flags, on JDK 21 (compiled on JDK 17):
Before
After
Overall, this results in a measured ~10x reduction in launch time in the worst case, up to ~100x in the best case. Brokers now boot ~7 minutes faster and Peons ~5 seconds faster on our workloads.
Benchmark File
Release note
Speed up broker boot times by parallelizing merge buffer initialization