
Minor compaction#19016

Open
GWphua wants to merge 3 commits into apache:master from GWphua:minor-compaction

Conversation

@GWphua
Contributor

@GWphua GWphua commented Feb 12, 2026

Fixes #9712

Motivation

Submitting a compaction task with SpecificSegmentsSpec (segment IDs) would cause Druid to lock, read, and rewrite all segments in the umbrella interval, defeating the purpose of targeting specific segments.

This results in very long compaction tasks, since every segment in the interval is considered for compaction. With this change, we can select a handful of small segments to compact instead of processing the whole interval, reducing compaction time from ~3h to ~5min in our case.

Description

This PR adds support for minor compaction: the ability to compact a specific subset of segments within a time chunk rather than all segments in the interval.

The core problem spans multiple layers of the compaction pipeline:

  1. Locking: CompactionTask#findSegmentsToLock and all sub-task findSegmentsToLock() methods retrieve every segment in the umbrella interval via RetrieveUsedSegmentsAction, meaning the task acquires locks far broader than necessary.
  2. Input resolution: NativeCompactionRunner#createIoConfig always passes null for segmentIds to DruidInputSource, so the input source reads the full interval regardless of the input spec.
  3. Timeline lookup: retrieveRelevantTimelineHolders() uses SegmentTimeline.lookup(), which requires ONLY_COMPLETE partitions; a filtered subset of segments looks like an incomplete partition set and is silently excluded.
  4. Validation: CompactionTask.SegmentProvider#checkSegments with TIME_CHUNK lock granularity delegates to SpecificSegmentsSpec.validateSegments(), which requires an exact match between the spec's segments and all segments in the interval. This guarantees failure for any proper subset of the interval's segments.

Changes and Explanations

dropExisting conflict guard

A constructor-level validation in CompactionTask now rejects the combination of SpecificSegmentsSpec with dropExisting = true, since dropExisting semantics replace all segments in the interval, which directly contradicts the intent of minor compaction.
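The guard can be sketched as follows. The types here (CompactionInputSpec, SpecificSegmentsSpec, IntervalSpec) are simplified hypothetical stand-ins for Druid's real spec classes, not the actual implementation:

```java
import java.util.List;

public class DropExistingGuard {
    // Hypothetical stand-ins for Druid's input-spec hierarchy.
    interface CompactionInputSpec {}
    record SpecificSegmentsSpec(List<String> segmentIds) implements CompactionInputSpec {}
    record IntervalSpec(String interval) implements CompactionInputSpec {}

    /** Reject the contradictory combination at construction time. */
    static void validate(CompactionInputSpec spec, boolean dropExisting) {
        if (spec instanceof SpecificSegmentsSpec && dropExisting) {
            throw new IllegalArgumentException(
                "dropExisting=true replaces all segments in the interval and "
                + "cannot be combined with a specific-segments spec");
        }
    }

    public static void main(String[] args) {
        // Interval-based spec with dropExisting is still allowed.
        validate(new IntervalSpec("2024-01-01/2024-01-02"), true);

        boolean rejected = false;
        try {
            validate(new SpecificSegmentsSpec(List.of("seg1")), true);
        } catch (IllegalArgumentException e) {
            rejected = true;
        }
        System.out.println(rejected); // true: the conflict is rejected up front
    }
}
```

Failing in the constructor surfaces the conflict immediately in the task submission response, rather than after the task has already acquired locks.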


Segment filtering in lock acquisition

CompactionTask.findSegmentsToLock() now filters the result of RetrieveUsedSegmentsAction to only the segment IDs present in SpecificSegmentsSpec. The same filtering is applied in IndexTask, ParallelIndexSupervisorTask, and SinglePhaseSubTask via CTX_KEY_SPECIFIC_SEGMENTS_TO_COMPACT propagated from NativeCompactionRunner#createContextForSubtask().

This follows the existing pattern of passing compaction metadata through CTX_KEY_APPENDERATOR_TRACKING_TASK_ID.
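The filtering step can be sketched like this, using plain collections and a hypothetical Segment record in place of Druid's DataSegment:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class LockSegmentFilter {
    // Hypothetical stand-in for a retrieved used segment (Druid's DataSegment).
    record Segment(String id, String interval) {}

    /**
     * Keep only the segments named in the spec, mirroring the filtering that
     * findSegmentsToLock() now applies to the RetrieveUsedSegmentsAction result.
     */
    static List<Segment> filterToSpec(List<Segment> usedSegments, Set<String> specSegmentIds) {
        return usedSegments.stream()
            .filter(s -> specSegmentIds.contains(s.id()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Segment> used = List.of(
            new Segment("ds_2024-01-01_v1_0", "2024-01-01/P1D"),
            new Segment("ds_2024-01-01_v1_1", "2024-01-01/P1D"),
            new Segment("ds_2024-01-01_v1_2", "2024-01-01/P1D"));
        // Only two of the three segments in the umbrella interval are targeted,
        // so only those two are locked.
        List<Segment> toLock = filterToSpec(
            used, Set.of("ds_2024-01-01_v1_0", "ds_2024-01-01_v1_2"));
        System.out.println(toLock.size()); // 2
    }
}
```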


CompactionTask.SegmentProvider caches for TIME_CHUNK granularity in checkSegments()

The intuition behind this approach is:

  1. SegmentProvider#findSegments is called first, followed by SegmentProvider#checkSegments.
  2. When findSegments is called, we do not yet know which lock granularity will be used.
  3. TIME_CHUNK lock granularity requires all segments in the interval, while SEGMENT lock granularity requires only the input segments.
  4. findSegments therefore saves all segments in the interval as allSegmentsInInterval, and checkSegments uses this field later when it encounters TIME_CHUNK lock granularity.

Honestly, I am not too satisfied with how I approached this problem, owing to the fact that developers now need to keep a temporal relationship between findSegments and checkSegments. Would love to hear about any alternatives to this problem!
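The caching approach, including its temporal coupling, can be sketched with simplified stand-in types (not the real SegmentProvider):

```java
import java.util.List;

public class SegmentProviderSketch {
    enum LockGranularity { TIME_CHUNK, SEGMENT }
    record Segment(String id) {}

    private List<Segment> allSegmentsInInterval; // cached by findSegments()

    /** Cache the unfiltered interval view before filtering down to the spec. */
    List<Segment> findSegments(List<Segment> usedInInterval, List<String> specIds) {
        this.allSegmentsInInterval = usedInInterval;
        return usedInInterval.stream().filter(s -> specIds.contains(s.id())).toList();
    }

    /** Must run after findSegments(); TIME_CHUNK validates against the cached view. */
    void checkSegments(LockGranularity granularity, List<Segment> latest) {
        if (allSegmentsInInterval == null) {
            throw new IllegalStateException("findSegments() must be called before checkSegments()");
        }
        List<Segment> expected =
            granularity == LockGranularity.TIME_CHUNK ? allSegmentsInInterval : latest;
        if (!expected.containsAll(latest)) {
            throw new IllegalStateException("segments changed since the task was issued");
        }
    }

    public static void main(String[] args) {
        SegmentProviderSketch provider = new SegmentProviderSketch();
        List<Segment> all = List.of(new Segment("a"), new Segment("b"), new Segment("c"));
        List<Segment> found = provider.findSegments(all, List.of("a", "c"));
        provider.checkSegments(LockGranularity.TIME_CHUNK, found); // passes
        System.out.println(found.size()); // 2
    }
}
```

The IllegalStateException makes the ordering requirement explicit, but it is still the implicit temporal coupling the description above flags as a drawback.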


Segment-ID-based input for DruidInputSource

NativeCompactionRunner#createIoConfig now detects SpecificSegmentsSpec and resolves the segment ID strings into WindowedSegmentId objects, passing them to DruidInputSource instead of the interval.

DruidInputSource already supports this code path, but it was never wired up from the compaction side.
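The resolution step can be sketched as follows; Segment and WindowedId are hypothetical stand-ins for Druid's DataSegment and WindowedSegmentId:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SegmentIdResolver {
    // Hypothetical stand-ins for Druid's DataSegment and WindowedSegmentId.
    record Segment(String id, String interval) {}
    record WindowedId(String segmentId, List<String> intervals) {}

    /**
     * Resolve the spec's segment ID strings against the used segments, so the
     * input source can be handed segments instead of the umbrella interval.
     */
    static List<WindowedId> resolve(List<String> specIds, Map<String, Segment> usedById) {
        List<WindowedId> resolved = new ArrayList<>();
        for (String id : specIds) {
            Segment segment = usedById.get(id);
            if (segment == null) {
                throw new IllegalArgumentException("Unknown segment ID: " + id);
            }
            resolved.add(new WindowedId(id, List.of(segment.interval())));
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, Segment> used = Map.of(
            "seg_0", new Segment("seg_0", "2024-01-01/P1D"),
            "seg_1", new Segment("seg_1", "2024-01-01/P1D"));
        System.out.println(resolve(List.of("seg_1"), used).size()); // 1
    }
}
```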


Timeline lookup with incomplete partitions

retrieveRelevantTimelineHolders() now calls lookupWithIncompletePartitions() (i.e. Partitions.INCOMPLETE_OK) when the input spec is SpecificSegmentsSpec.

Without this, a filtered segment set that doesn't cover all partitions in the interval produces an empty timeline result and the compaction silently does nothing.
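Why the subset disappears can be illustrated with a minimal completeness check, modeled loosely on how a timeline decides whether a time chunk's partition set is complete (this is a simplification, not Druid's actual timeline code):

```java
import java.util.Set;

public class TimelineLookupSketch {
    /**
     * A time chunk is "complete" when every core partition number is present.
     * An ONLY_COMPLETE lookup drops chunks that fail this test, which is why a
     * filtered subset of a multi-partition chunk silently vanishes.
     */
    static boolean isComplete(int numCorePartitions, Set<Integer> presentPartitionNums) {
        for (int i = 0; i < numCorePartitions; i++) {
            if (!presentPartitionNums.contains(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The chunk has 3 core partitions, but the spec selected only 0 and 2,
        // so an ONLY_COMPLETE lookup would skip the whole chunk.
        System.out.println(isComplete(3, Set.of(0, 2))); // false
        // With INCOMPLETE_OK semantics, the chunk is still considered.
        System.out.println(isComplete(2, Set.of(0, 1))); // true
    }
}
```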


Compaction using MSQ engine

MSQ compaction is fundamentally incompatible with minor compaction introduced by this change: it forces dropExisting = true, uses REPLACE ingestion mode (which acquires TIME_CHUNK locks covering the full interval), and queries via MultipleIntervalSegmentSpec. A validation check is added in MSQCompactionRunner.validateCompactionTask() to reject SpecificSegmentsSpec with an explicit error message rather than failing in an opaque way downstream.

For compaction using MSQ, please see #18996.
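The fail-fast check can be sketched like this; ValidationResult and the spec types are hypothetical stand-ins for the runner's real validation types:

```java
import java.util.List;

public class MsqSpecValidation {
    // Hypothetical stand-ins for the runner's validation result and input spec.
    record ValidationResult(boolean valid, String reason) {}
    interface CompactionInputSpec {}
    record SpecificSegmentsSpec(List<String> segmentIds) implements CompactionInputSpec {}

    /** Fail fast with a clear message instead of breaking opaquely downstream. */
    static ValidationResult validate(CompactionInputSpec inputSpec) {
        if (inputSpec instanceof SpecificSegmentsSpec) {
            return new ValidationResult(false,
                "MSQ compaction does not support specific-segment input specs; "
                + "use the native engine or an interval-based spec");
        }
        return new ValidationResult(true, null);
    }

    public static void main(String[] args) {
        System.out.println(validate(new SpecificSegmentsSpec(List.of("seg1"))).valid()); // false
    }
}
```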

Release note

Compaction tasks using SpecificSegmentsSpec (segment ID list) now correctly compact only the specified segments instead of all segments in the umbrella interval. This feature is not supported by the MSQ compaction engine.


Key changed/added classes in this PR
  • CompactionTask
  • NativeCompactionRunner
  • IndexTask
  • ParallelIndexSupervisorTask
  • SinglePhaseSubTask
  • MSQCompactionRunner
  • CompactionTaskTest / TaskLockHelperTest

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Feb 12, 2026
    inputInterval = interval;
  }

  if (inputInterval != null && !compactionIOConfig.isAllowNonAlignedInterval()) {

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation. Invoking CompactionIOConfig.isAllowNonAlignedInterval should be avoided because it has been deprecated.
@gianm
Contributor

gianm commented Feb 12, 2026

It looks like this and #18996 are aiming at similar goals but are taking different approaches. A big one is that #18996 only works with MSQ compaction and this one only works with non-MSQ compaction tasks. I am wondering if they can coexist.

re: this piece,

MSQ compaction is fundamentally incompatible with minor compaction: it forces dropExisting = true, uses REPLACE ingestion mode (which acquires TIME_CHUNK locks covering the full interval), and queries via MultipleIntervalSegmentSpec.

#18996 deals with the replace issue by using the "upgrade" system that was introduced for concurrent replace (system from #14407, #15039, #15684). The segments that are not being compacted are carried through without modification ("upgraded"). It deals with the MultipleIntervalSegmentSpec issue by using a new feature in TableInputSpec to be able to reference specific segments (#18922).

@gianm gianm mentioned this pull request Feb 12, 2026
@GWphua
Contributor Author

GWphua commented Feb 13, 2026

Thanks for pointing this out @gianm, I see that #18996 happens to fix compaction on the MSQ side, and that's pretty neat! I do not have much experience with MSQ, given that we are still using Druid v27 (yeah, it's old... but we are upgrading soon).

In our production servers, we use this PR together with a script that selects segments and issues minor compaction specs. We also plan to integrate segment selection with automatic compaction.

May I ask what the direction is for handling specific segments? I have seen some discussion that SpecificSegmentsSpec feels somewhat unused... If the new feature in TableInputSpec is applicable to my use case, I would be happy to collaborate and make changes on my side 😄

Successfully merging this pull request may close these issues: Enable auto minor compaction.

3 participants

Comments