Conversation
It looks like this and #18996 are aiming at similar goals but are taking different approaches. A big one is that #18996 only works with MSQ compaction and this one only works with non-MSQ compaction tasks. I am wondering if they can coexist. re: this piece,
#18996 deals with the replace issue by using the "upgrade" system that was introduced for concurrent replace (system from #14407, #15039, #15684). The segments that are not being compacted are carried through without modification ("upgraded"). It deals with the MultipleIntervalSegmentSpec issue by using a new feature in …
Thanks for pointing this out @gianm, I see that #18996 happens to fix compaction on the MSQ side, and that's pretty neat! I do not have much experience with MSQ, given that we are still using Druid v27 (yeah, it's old... but we are upgrading soon). On our production servers, we used this PR by writing a script that selects segments and issues minor compaction specs. There are further plans to incorporate segment selection into automatic compaction. I would like to ask: what is the direction for handling specific segments? I see that there are some discussions about …
Fixes #9712
Motivation
Submitting a compaction task with SpecificSegmentsSpec (a list of segment IDs) would cause Druid to lock, read, and rewrite all segments in the umbrella interval, defeating the purpose of targeting specific segments. This results in very long compaction tasks, since every segment in the interval is considered for compaction. With the changes introduced here, we can select multiple small segments to compact instead of processing all segments in the interval, reducing compaction time from ~3h to ~5min in our case.
Description
This PR adds support for minor compaction: the ability to compact a specific subset of segments within a time chunk rather than all segments in the interval. Previously, specifying a SpecificSegmentsSpec still caused the task to lock, read, and rewrite every segment in the umbrella interval.
The core problem spans multiple layers of the compaction pipeline:
- CompactionTask#findSegmentsToLock and all sub-task findSegmentsToLock() methods retrieve every segment in the umbrella interval via RetrieveUsedSegmentsAction, meaning the task acquires locks far broader than necessary.
- NativeCompactionRunner#createIoConfig always passes null for segmentIds to DruidInputSource, so the input source reads the full interval regardless of the input spec.
- retrieveRelevantTimelineHolders() uses SegmentTimeline.lookup(), which requires ONLY_COMPLETE partitions; a filtered subset of segments appears incomplete and will be silently excluded.
- CompactionTask.SegmentProvider#checkSegments with TIME_CHUNK lock granularity delegates to SpecificSegmentsSpec.validateSegments(), which requires an exact match between the spec's segments and all segments in the interval. This guarantees a failure when we give any proper subset of the interval's segments.

Changes and Explanations
dropExisting conflict guard
A constructor-level validation in CompactionTask now rejects the combination of SpecificSegmentsSpec with dropExisting = true, since dropExisting semantics replace all segments in the interval — directly contradicting minor compaction intent.
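For illustration, here is a minimal sketch of such a guard, assuming CompactionIOConfig exposes getInputSpec() and isDropExisting() getters (the exact exception type and message in the PR may differ):

```java
// Sketch only: reject SpecificSegmentsSpec combined with dropExisting=true at construction time.
if (ioConfig != null
    && ioConfig.getInputSpec() instanceof SpecificSegmentsSpec
    && ioConfig.isDropExisting()) {
  throw new IAE(
      "Cannot combine a 'segments' input spec with dropExisting=true: "
      + "dropExisting replaces every segment in the interval, which defeats minor compaction."
  );
}
```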
Segment filtering in lock acquisition
CompactionTask.findSegmentsToLock() now filters the result of RetrieveUsedSegmentsAction to only the segment IDs present in SpecificSegmentsSpec. The same filtering is applied in IndexTask, ParallelIndexSupervisorTask, and SinglePhaseSubTask via CTX_KEY_SPECIFIC_SEGMENTS_TO_COMPACT, propagated from NativeCompactionRunner#createContextForSubtask(). This follows the existing pattern of passing compaction metadata through CTX_KEY_APPENDERATOR_TRACKING_TASK_ID.
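As a rough illustration (not the exact diff), the filtering amounts to intersecting the used-segment lookup with the IDs from the spec; the RetrieveUsedSegmentsAction constructor shape and the specificSegmentsSpec.getSegments() accessor are assumptions here:

```java
// Sketch: restrict the segments considered for locking to the IDs listed in the spec.
final Collection<DataSegment> usedSegments = taskActionClient.submit(
    new RetrieveUsedSegmentsAction(getDataSource(), intervals)
);
final Set<String> requestedIds = new HashSet<>(specificSegmentsSpec.getSegments());
final List<DataSegment> segmentsToLock =
    usedSegments.stream()
                .filter(segment -> requestedIds.contains(segment.getId().toString()))
                .collect(Collectors.toList());
```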
CompactionTask.SegmentProvider caching for TIME_CHUNK granularity in checkSegments()

The intuition behind this approach is:
- SegmentProvider#findSegments is called first, followed by SegmentProvider#checkSegments.
- When findSegments is called, we do not yet know which lock granularity is being used.
- We therefore cache the result of findSegments as allSegmentsInInterval, then later use this field when we encounter a TIME_CHUNK lock granularity (see the sketch below).

Honestly, I am not too satisfied with how I approached this problem, owing to the fact that developers now need to keep a temporal relationship between findSegments and checkSegments. Would love to hear about any alternatives to this problem!
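A condensed sketch of that temporal coupling; field and method shapes are simplified for illustration, and filterToRequestedSegments() is a hypothetical helper, not code from the PR:

```java
// Inside CompactionTask.SegmentProvider (simplified sketch).
@Nullable
private List<DataSegment> allSegmentsInInterval;

List<DataSegment> findSegments(TaskActionClient actionClient) throws IOException
{
  final Collection<DataSegment> used = actionClient.submit(
      new RetrieveUsedSegmentsAction(dataSource, Collections.singletonList(umbrellaInterval))
  );
  // Lock granularity is not known yet, so remember the full interval for checkSegments().
  this.allSegmentsInInterval = ImmutableList.copyOf(used);
  return filterToRequestedSegments(allSegmentsInInterval); // hypothetical helper
}

void checkSegments(LockGranularity lockGranularity, List<DataSegment> latestSegments)
{
  if (lockGranularity == LockGranularity.TIME_CHUNK && allSegmentsInInterval != null) {
    // Validate the requested subset against the cached interval contents instead of
    // requiring the spec to list every segment in the interval.
  }
}
```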
Segment-ID-based input for DruidInputSource
NativeCompactionRunner#createIoConfig now detects SpecificSegmentsSpec and resolves the segment ID strings into WindowedSegmentId objects, passing them to DruidInputSource instead of the interval. DruidInputSource already supports this code path, but it was never wired up from the compaction side.
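Roughly, the resolution step could look like the sketch below; the variable names are illustrative, and the WindowedSegmentId(segmentId, intervals) constructor shape is assumed from its use by DruidInputSource:

```java
// Sketch: turn segment ID strings from SpecificSegmentsSpec into WindowedSegmentId objects.
final List<WindowedSegmentId> windowedSegmentIds = new ArrayList<>();
for (String idString : specificSegmentsSpec.getSegments()) {
  final SegmentId segmentId = SegmentId.tryParse(dataSource, idString);
  if (segmentId == null) {
    throw new IAE("Cannot parse segment id[%s]", idString);
  }
  windowedSegmentIds.add(
      new WindowedSegmentId(idString, Collections.singletonList(segmentId.getInterval()))
  );
}
```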
Timeline lookup with incomplete partitions

retrieveRelevantTimelineHolders() now calls lookupWithIncompletePartitions() (i.e. Partitions.INCOMPLETE_OK) when the input spec is SpecificSegmentsSpec. Without this, a filtered segment set that doesn't cover all partitions in the interval produces an empty timeline result and the compaction silently does nothing.
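A sketch of the conditional lookup, assuming timeline, inputSpec, and umbrellaInterval are in scope inside retrieveRelevantTimelineHolders():

```java
// Sketch: tolerate incomplete partition sets when only specific segments are selected.
final List<TimelineObjectHolder<String, DataSegment>> holders =
    inputSpec instanceof SpecificSegmentsSpec
    ? timeline.lookupWithIncompletePartitions(umbrellaInterval)   // Partitions.INCOMPLETE_OK
    : timeline.lookup(umbrellaInterval);                          // Partitions.ONLY_COMPLETE
```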
Compaction using MSQ engine
MSQ compaction is fundamentally incompatible with the minor compaction introduced by this change: it forces dropExisting = true, uses REPLACE ingestion mode (which acquires TIME_CHUNK locks covering the full interval), and queries via MultipleIntervalSegmentSpec. A validation check is added in MSQCompactionRunner.validateCompactionTask() to reject SpecificSegmentsSpec with an explicit error message rather than failing in an opaque way downstream.
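As a sketch, the rejection might look like the following; the CompactionConfigValidationResult.failure factory signature is an assumption:

```java
// Sketch: fail MSQ validation early when a SpecificSegmentsSpec is supplied.
if (compactionTask.getIoConfig().getInputSpec() instanceof SpecificSegmentsSpec) {
  return CompactionConfigValidationResult.failure(
      "MSQ: compaction of specific segments (SpecificSegmentsSpec) is not supported"
  );
}
```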
For compaction using MSQ, please see #18996.
Release note
Compaction tasks using SpecificSegmentsSpec (a list of segment IDs) now correctly compact only the specified segments instead of all segments in the umbrella interval. This feature is not supported by the MSQ compaction engine.
Key changed/added classes in this PR
- CompactionTask
- NativeCompactionRunner
- IndexTask
- ParallelIndexSupervisorTask
- SinglePhaseSubTask
- MSQCompactionRunner
- CompactionTaskTest / TaskLockHelperTest

This PR has: