Skip to content

Add deletion vector loader#14294

Merged
jihoonson merged 15 commits intoNVIDIA:mainfrom
jihoonson:dv-loader
Feb 27, 2026
Merged

Add deletion vector loader#14294
jihoonson merged 15 commits intoNVIDIA:mainfrom
jihoonson:dv-loader

Conversation

@jihoonson
Copy link
Copy Markdown
Collaborator

@jihoonson jihoonson commented Feb 18, 2026

Fixes #14263

Description

This PR adds a deletion vector loader for Delta scan that loads serialized deletion vectors (roaring bitmaps) into host memory. As Delta serialization formats are different from the standard serialization format described in #14263, the loader performs the format conversion and returns a HostMemoryBuffer containing the bitmap serialized in the standard format.

Note that many pieces of the code are adopted from Delta classes such as DeletionVectorStore and StoredBitmap.

The loader has not been wired up yet. It will be done in some follow-up PR. The loader has been tested within my branch.

This PR depends on NVIDIA/spark-rapids-jni#4285.

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

@jihoonson jihoonson marked this pull request as draft February 18, 2026 23:20
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Feb 18, 2026

Greptile Summary

This PR adds a deletion vector loader for Delta Lake tables that reads serialized deletion vectors (roaring bitmaps) from disk and converts them to the standard format for GPU processing. The implementation handles two Delta serialization formats (portable and native), performing CRC validation and format conversion as needed.

Key changes:

  • Added RapidsDeletionVectorStore trait and RapidsHadoopDVStore implementation for loading deletion vectors from HDFS
  • Created DeltaSerializedBitmapLoader with format-specific loaders (DeltaPortableFormatLoader, DeltaNativeFormatLoader) that strip Delta's magic numbers and convert to standard roaring bitmap format
  • Added RapidsDeletionVectorStoredBitmap wrapper for deletion vector descriptors with support for empty and on-disk bitmaps
  • Exposed loadDeletionVector public API in RapidsDeletionVectorUtils with filter type validation

Notes:

  • Code properly uses closeOnExcept and withResource patterns for resource management throughout
  • Native format conversion requires deserialization and re-serialization (acknowledged as sub-optimal but acceptable for rare legacy format)
  • Previous review concerns about resource leaks, pattern matching, and CRC validation have been addressed
  • Depends on spark-rapids-jni PR Add support for regexp_extract on the GPU #4285 for Hash.hostCrc32 method
  • Not yet wired into the execution path (planned for follow-up PR)

Confidence Score: 4/5

  • Safe to merge with low risk - well-structured code with proper error handling, though not yet active in execution path
  • Code demonstrates good engineering practices with proper resource management, CRC validation, and meaningful error messages. Previous review issues have been addressed. Score is 4 rather than 5 because: (1) the loader is not yet wired up so runtime behavior is untested in this PR, (2) depends on external JNI PR, and (3) no tests included in this PR (though tested in separate branch)
  • No files require special attention - all three files show consistent quality and proper patterns

Important Files Changed

Filename Overview
delta-lake/common/src/main/delta-33x-40x/scala/org/apache/spark/sql/delta/deletionvectors/RapidsDeletionVectorStore.scala New deletion vector loader with format conversion (portable/native to standard), includes CRC validation and proper resource management
delta-lake/common/src/main/delta-33x-40x/scala/org/apache/spark/sql/delta/deletionvectors/RapidsStoredBitmap.scala Simple wrapper for deletion vector descriptors with proper resource handling for empty and on-disk bitmaps
delta-lake/common/src/main/delta-33x-40x/scala/com/nvidia/spark/rapids/delta/common/GpuDeltaParquetFileFormatBase.scala Added public API to load deletion vectors from descriptors with filter type validation

Last reviewed commit: fdcb9d2

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 7 comments

Edit Code Review Agent Settings | Greptile

@jihoonson jihoonson marked this pull request as ready for review February 20, 2026 19:08
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

…k/sql/delta/deletionvectors/RapidsStoredBitmap.scala

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@jihoonson jihoonson self-assigned this Feb 24, 2026
@jihoonson jihoonson requested a review from a team February 24, 2026 22:36
dvDescriptor: DeletionVectorDescriptor,
tableDataPath: Path
) {
require(dvDescriptor.isOnDisk, "Only on-disk deletion vectors are supported")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check necessary? I think embeded deletetion vector should be processed similarly.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the inline deletion vectors are used only for the CRC reader or when the bitmap is empty, but are never used to store non-empty bitmaps. I don't think we should support something that is never used. The empty bitmap is being handled in the load() function of this class. As for the CRC reader, see where DeletionVectorDescriptor.inlineInLog is called.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it's used for small bitmap that doesn't need to land in disk. Anyway, we could do it in a follow up if necessary.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double-checked the Delta code base as @gerashegalov mentioned this as well, but I don't see any code using the inline deletion vector besides the CRC reader and the empty bitmap. I also did a simple experiment that I made a small deletion vector with one row deleted for the TPC-DS web_sales table at sf=100, but Delta did make a deletion vector file. So, I'm quite convinced now that the inline deletion vector is never used for our use case. That being said, I could be still wrong and we will fix this if this turns out to be an issue.

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@sameerz sameerz added the task Work required that improves the product but is not user facing label Feb 25, 2026
@jihoonson
Copy link
Copy Markdown
Collaborator Author

build

@jihoonson
Copy link
Copy Markdown
Collaborator Author

build

@jihoonson
Copy link
Copy Markdown
Collaborator Author

build

@jihoonson
Copy link
Copy Markdown
Collaborator Author

build

gerashegalov
gerashegalov previously approved these changes Feb 27, 2026
Copy link
Copy Markdown
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nits

@jihoonson
Copy link
Copy Markdown
Collaborator Author

build

@jihoonson jihoonson merged commit 8b6e24c into NVIDIA:main Feb 27, 2026
45 checks passed
@jihoonson
Copy link
Copy Markdown
Collaborator Author

@gerashegalov @liurenjie1024 thanks for the review!

thirtiseven pushed a commit to thirtiseven/spark-rapids that referenced this pull request Mar 12, 2026
Fixes NVIDIA#14263 

### Description

This PR adds a deletion vector loader for Delta scan that loads
serialized deletion vectors (roaring bitmaps) into host memory. As Delta
serialization formats are different from the standard serialization
format described in NVIDIA#14263, the loader performs the format conversion
and returns a `HostMemoryBuffer` containing the bitmap serialized in the
standard format.

Note that many pieces of the code are adopted from Delta classes such as
`DeletionVectorStore` and `StoredBitmap`.

The loader has not been wired up yet. It will be done in some follow-up
PR. The loader has been tested within my
[branch](https://github.com/jihoonson/spark-rapids/tree/native-dv).

This PR depends on NVIDIA/spark-rapids-jni#4285.

### Checklists

- [ ] This PR has added documentation for new or modified features or
behaviors.
- [ ] This PR has added new tests or modified existing tests to cover
new code paths.
(Please explain in the PR description how the new code paths are tested,
such as names of the new/existing tests that cover them.)
- [ ] Performance testing has been performed and its results are added
in the PR description. Or, an issue has been filed with a link in the PR
description.

---------

Signed-off-by: Jihoon Son <ghoonson@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

task Work required that improves the product but is not user facing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handling the discrepancy between the Delta serialization format and the standard serialization format

5 participants