Add optional targetDocsPerChunk cap to VarByteChunkForwardIndexWriterV6#18772
Open
xiangfu0 wants to merge 2 commits into
Open
Add optional targetDocsPerChunk cap to VarByteChunkForwardIndexWriterV6#18772xiangfu0 wants to merge 2 commits into
xiangfu0 wants to merge 2 commits into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18772 +/- ##
=========================================
Coverage 64.76% 64.77%
Complexity 1322 1322
=========================================
Files 3393 3393
Lines 211079 211087 +8
Branches 33159 33161 +2
=========================================
+ Hits 136707 136732 +25
+ Misses 63341 63322 -19
- Partials 11031 11033 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Contributor
There was a problem hiding this comment.
Pull request overview
Adds an optional document-count cap for chunk flushing in VarByteChunkForwardIndexWriterV6, allowing callers to control chunk granularity by both byte budget (chunkSize) and a new targetDocsPerChunk parameter, while keeping the on-disk format/version unchanged.
Changes:
- Introduces a protected
shouldStartNewChunk(int sizeRequired)hook (and a helper for current-chunk doc count) in V4 so subclasses can add extra chunk-flush predicates without duplicatingputBytes(). - Extends V6 with a 4-arg constructor and a docs-per-chunk cap that triggers chunk flushes once the current chunk reaches the configured doc count.
- Adds a V6 unit test intended to validate docs-per-chunk behavior and round-trip correctness.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV4.java | Extracts the “start new chunk” predicate into an overridable hook and exposes current-chunk doc count for subclasses. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV6.java | Adds optional targetDocsPerChunk support via shouldStartNewChunk override and a new constructor overload. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/VarByteChunkV6Test.java | Adds a test for docs-per-chunk capping and value round-tripping. |
V6 currently flushes a chunk only when the next entry would overflow the chunkSize-byte buffer. This adds an optional targetDocsPerChunk parameter so a chunk can also be bounded by document count, letting callers control chunk granularity independently of the byte budget. The cap defaults to -1 (DISABLE_DOCS_PER_CHUNK), preserving the existing purely byte-driven behavior; the on-disk format and version (6) are unchanged. The buffer-overflow flush predicate is extracted into a protected shouldStartNewChunk() hook in V4 (mirroring the existing writeChunkHeader hook) so V6 adds the docs cap without duplicating putBytes(). A unit test verifies each capped chunk holds exactly targetDocsPerChunk docs and round-trips. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…counts - Fail fast in the constructor when targetDocsPerChunk is neither -1 (disabled) nor positive, instead of silently treating 0/other negatives as disabled. - Strengthen the test to verify per-chunk document counts (not just the chunk count) by reading each chunk's first docId from the metadata and asserting it equals chunk * targetDocsPerChunk; use 1000 docs with a cap of 30 so the last chunk holds a remainder. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
99a96e1 to
32584c5
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
VarByteChunkForwardIndexWriterV6currently closes a chunk only when the nextentry would overflow the
chunkSize-byte buffer. This PR adds an optionaltargetDocsPerChunkparameter so a chunk can additionally be bounded by documentcount, letting callers control chunk granularity independently of the byte budget.
VarByteChunkForwardIndexWriterV6(file, compressionType, chunkSize, targetDocsPerChunk). The existing 3-arg constructor delegates withtargetDocsPerChunk = -1(DISABLE_DOCS_PER_CHUNK).targetDocsPerChunk > 0, a chunk is flushed once it holds that many docs,even if the byte buffer isn't full; otherwise behavior is unchanged.
shouldStartNewChunk(int)hook inVarByteChunkForwardIndexWriterV4(mirroringthe existing
writeChunkHeaderhook), so V6 adds the cap without duplicatingputBytes().Motivation
For raw string/bytes columns, the ZSTD compression ratio depends heavily on how
many repeated values fall within a single chunk (the dedup window). Being able to
bound a chunk by document count — not only by bytes — gives finer control over the
size/granularity tradeoff for repetitive columns.
Constructor examples
Configuring V6 with a chunk size (table config)
Select this writer and the chunk size per column via
fieldConfigList:{ "fieldConfigList": [ { "name": "myColumn", "encodingType": "RAW", "indexes": { "forward": { "compressionCodec": "ZSTANDARD", "rawIndexWriterVersion": 6, "targetMaxChunkSize": "1MB", "targetDocsPerChunk": -1 } } } ] }rawIndexWriterVersion: 6selects this writer.targetMaxChunkSizeis the chunk byte budget. The effective chunk size ismax(min(maxLength × targetDocsPerChunk, targetMaxChunkSize), 4KB), sotargetDocsPerChunk: -1makes the chunk size driven purely bytargetMaxChunkSize(otherwise a narrow column is capped at
maxLength × targetDocsPerChunk, e.g.~32 KB for a 32-byte column at the default
1000).Note: the table-config
targetDocsPerChunkderives the chunk byte size; the newconstructor parameter
targetDocsPerChunk(a hard per-chunk doc cap) is awriter-level API and is not wired into table config in this PR.
Backward compatibility
-1reproduces the exact current behavior.6) are unchanged; the target chunk sizeremains self-describing in the file header, so existing and new indexes stay
mutually readable.
Testing
VarByteChunkV6Test#testTargetDocsPerChunkCapsChunkasserts each capped chunkholds exactly
targetDocsPerChunkdocs and that values round-trip.-1default path).🤖 Generated with Claude Code