feat: introduce SegmentRangeReader interface and PartialSegmentFileMapperV10 #19282
Open
clintropolis wants to merge 1 commit into apache:master
Description
This PR adds the building blocks for partial segment download when using the vsf segment cache. It introduces a new `SegmentRangeReader` interface, which will allow deep storage extensions to provide byte-range reads from segment files in deep storage.

To consume this interface, a new `PartialSegmentFileMapperV10` class has been added. On creation, it fetches the 'header' portion of a v10 segment (one that is not externally compressed, e.g. .zip) and stores it on disk, so that it has the metadata and positions of all of the internal files of the segment which make up the columns. A bitmap (one bit per internal file of the segment) is appended to this file; it is mmapped read-write and updated with a single-byte read-modify-write under a lock whenever an internal file is fetched.

Fetched internal files are stored in separate local 'container' files, which correspond to the containers of the v10 format, so that we can re-use the positions of all of the internal files within the containers. The container files themselves are created as 'sparse' files at the original container size; downloaded file bytes are written at their original offsets via `RandomAccessFile`, and the read-only mmap sees the writes through the shared page cache.

Follow-ups to this PR will begin wiring this up to actually be used in the segment cache, and ultimately allow query engines to specify which segment parts they need, so that only the minimum amount of data required for query processing is fetched.
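The container-file and bitmap bookkeeping described above can be illustrated with a minimal sketch. This is not the code in this PR: the class name `SparseContainerSketch`, its method names, and the use of a separate bitmap file (the PR appends the bitmap to the header file) are all assumptions made for illustration. It also relies on the read-only mapping observing writes made through `RandomAccessFile`, which Java leaves system-dependent; it holds on common platforms such as Linux, where both go through the shared page cache.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; names and layout are not taken from the PR.
final class SparseContainerSketch
{
  private final Path containerPath;
  private final MappedByteBuffer readView;    // read-only mmap of the container file
  private final MappedByteBuffer fetchedBits; // read-write mmap, one bit per internal file
  private final Object lock = new Object();

  SparseContainerSketch(Path containerPath, Path bitmapPath, long containerSize, int numFiles)
      throws IOException
  {
    this.containerPath = containerPath;
    try (RandomAccessFile raf = new RandomAccessFile(containerPath.toFile(), "rw")) {
      // Reserve the full container size up front; filesystems that support
      // holes keep the file sparse until bytes are actually written.
      raf.setLength(containerSize);
      readView = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, containerSize);
    }
    try (FileChannel ch = FileChannel.open(
        bitmapPath,
        StandardOpenOption.CREATE,
        StandardOpenOption.READ,
        StandardOpenOption.WRITE
    )) {
      fetchedBits = ch.map(FileChannel.MapMode.READ_WRITE, 0, (numFiles + 7) / 8);
    }
  }

  // Write downloaded bytes at their original offset, then flip the file's bit.
  void writeFetched(int fileIndex, long offset, byte[] bytes) throws IOException
  {
    try (RandomAccessFile raf = new RandomAccessFile(containerPath.toFile(), "rw")) {
      raf.seek(offset);
      raf.write(bytes);
    }
    synchronized (lock) {
      // Single-byte read-modify-write under a lock.
      int byteIndex = fileIndex >>> 3;
      byte current = fetchedBits.get(byteIndex);
      fetchedBits.put(byteIndex, (byte) (current | (1 << (fileIndex & 7))));
    }
  }

  boolean isFetched(int fileIndex)
  {
    synchronized (lock) {
      return (fetchedBits.get(fileIndex >>> 3) & (1 << (fileIndex & 7))) != 0;
    }
  }

  // Read through the read-only mapping; offset must fit in an int for this sketch.
  byte[] read(long offset, int length)
  {
    byte[] out = new byte[length];
    ByteBuffer view = readView.duplicate();
    view.position((int) offset);
    view.get(out);
    return out;
  }
}
```

A caller would check `isFetched(i)` before reading an internal file and call `writeFetched(...)` once its bytes have been downloaded. Keeping the original offsets means the positions recorded in the segment metadata can be used unchanged against the local container file.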
Initially at least, I am thinking of projections as the level of 'granularity' at which segment chunks are accounted for in the segment cache (so the 'size' in the cache will be the size of the whole projection; it will just be lazily filled in as downloaded). I will also be doing a follow-up to better organize the projections into containers in `SegmentFileBuilderV10`, instead of just filling whole containers at a time, so that we have an easy way to map eviction to deleting these container files.

changes:
* adds new `SegmentRangeReader` extension point interface for byte-range reads from segment files in deep storage
* adds `PartialSegmentFileMapperV10`, a `SegmentFileMapper` implementation that downloads internal files on demand from deep storage via `SegmentRangeReader`; not wired to anything yet other than tests
* extracted `SegmentFileMetadataReader`, a shared utility for parsing V10 header + metadata from any `InputStream`, from `SegmentFileMapperV10.create()` so it can be shared with `PartialSegmentFileMapperV10`
* adds `openRangeReader()` method to `LoadSpec` with a default implementation that returns null
* `SegmentFileMetadata` now interns string keys in files and column descriptor maps using `SmooshedFileMapper.STRING_INTERNER`
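
To make the extension point in the list above more concrete, here is a rough sketch of how a byte-range reader and a null-returning default on a load spec might fit together. The interface shapes are assumptions: `RangeReader`, `LoadSpecLike`, `readRange(offset, length)`, and `fetch(...)` are hypothetical names, and the actual `SegmentRangeReader` and `LoadSpec.openRangeReader()` signatures in this PR may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; these are not the interfaces added by the PR.
final class RangeReaderSketch
{
  // Assumed shape of a byte-range reader over a single segment file.
  interface RangeReader
  {
    InputStream readRange(long offset, long length) throws IOException;
  }

  // Stand-in for a load spec: the default returns null, meaning
  // "this deep storage implementation does not support range reads".
  interface LoadSpecLike
  {
    default RangeReader openRangeReader()
    {
      return null;
    }
  }

  // Toy deep storage backed by a byte array, used only to exercise the sketch.
  static final class InMemoryLoadSpec implements LoadSpecLike
  {
    private final byte[] segment;

    InMemoryLoadSpec(byte[] segment)
    {
      this.segment = segment;
    }

    @Override
    public RangeReader openRangeReader()
    {
      return (offset, length) -> new ByteArrayInputStream(segment, (int) offset, (int) length);
    }
  }

  // Returns the requested bytes, or null when range reads are unsupported
  // and the caller has to fall back to downloading the whole segment.
  static byte[] fetch(LoadSpecLike spec, long offset, int length) throws IOException
  {
    RangeReader reader = spec.openRangeReader();
    if (reader == null) {
      return null;
    }
    try (InputStream in = reader.readRange(offset, length)) {
      return in.readNBytes(length);
    }
  }
}
```

Returning null from the default keeps existing load specs source-compatible; only deep storage extensions that can serve range requests need to override the method.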