Add deletion vector loader#14294
Conversation
Greptile SummaryThis PR adds a deletion vector loader for Delta Lake tables that reads serialized deletion vectors (roaring bitmaps) from disk and converts them to the standard format for GPU processing. The implementation handles two Delta serialization formats (portable and native), performing CRC validation and format conversion as needed. Key changes:
Notes:
Confidence Score: 4/5
Important Files Changed
Last reviewed commit: fdcb9d2 |
Signed-off-by: Jihoon Son <ghoonson@gmail.com>
…k/sql/delta/deletionvectors/RapidsStoredBitmap.scala Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
| dvDescriptor: DeletionVectorDescriptor, | ||
| tableDataPath: Path | ||
| ) { | ||
| require(dvDescriptor.isOnDisk, "Only on-disk deletion vectors are supported") |
There was a problem hiding this comment.
Is this check necessary? I think embeded deletetion vector should be processed similarly.
There was a problem hiding this comment.
My understanding is that the inline deletion vectors are used only for the CRC reader or when the bitmap is empty, but are never used to store non-empty bitmaps. I don't think we should support something that is never used. The empty bitmap is being handled in the load() function of this class. As for the CRC reader, see where DeletionVectorDescriptor.inlineInLog is called.
There was a problem hiding this comment.
I thought it's used for small bitmap that doesn't need to land in disk. Anyway, we could do it in a follow up if necessary.
There was a problem hiding this comment.
I double-checked the Delta code base as @gerashegalov mentioned this as well, but I don't see any code using the inline deletion vector besides the CRC reader and the empty bitmap. I also did a simple experiment that I made a small deletion vector with one row deleted for the TPC-DS web_sales table at sf=100, but Delta did make a deletion vector file. So, I'm quite convinced now that the inline deletion vector is never used for our use case. That being said, I could be still wrong and we will fix this if this turns out to be an issue.
|
build |
|
build |
|
build |
|
build |
|
build |
|
@gerashegalov @liurenjie1024 thanks for the review! |
Fixes NVIDIA#14263 ### Description This PR adds a deletion vector loader for Delta scan that loads serialized deletion vectors (roaring bitmaps) into host memory. As Delta serialization formats are different from the standard serialization format described in NVIDIA#14263, the loader performs the format conversion and returns a `HostMemoryBuffer` containing the bitmap serialized in the standard format. Note that many pieces of the code are adopted from Delta classes such as `DeletionVectorStore` and `StoredBitmap`. The loader has not been wired up yet. It will be done in some follow-up PR. The loader has been tested within my [branch](https://github.com/jihoonson/spark-rapids/tree/native-dv). This PR depends on NVIDIA/spark-rapids-jni#4285. ### Checklists - [ ] This PR has added documentation for new or modified features or behaviors. - [ ] This PR has added new tests or modified existing tests to cover new code paths. (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.) - [ ] Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description. --------- Signed-off-by: Jihoon Son <ghoonson@gmail.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Fixes #14263
Description
This PR adds a deletion vector loader for Delta scan that loads serialized deletion vectors (roaring bitmaps) into host memory. As Delta serialization formats are different from the standard serialization format described in #14263, the loader performs the format conversion and returns a
HostMemoryBuffercontaining the bitmap serialized in the standard format.Note that many pieces of the code are adopted from Delta classes such as
DeletionVectorStoreandStoredBitmap.The loader has not been wired up yet. It will be done in some follow-up PR. The loader has been tested within my branch.
This PR depends on NVIDIA/spark-rapids-jni#4285.
Checklists
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)