[fs] Optimize cloud file listing with per-page filtering and early termination in PinotFS #17847
Open
anshul98ks123 wants to merge 1 commit into apache:master from
Codecov Report

❌ Patch coverage is

Coverage diff (master vs. #17847):

| Metric | master | #17847 | +/- |
|---|---|---|---|
| Coverage | 63.29% | 63.24% | -0.05% |
| Complexity | 1466 | 1478 | +12 |
| Files | 3189 | 3189 | |
| Lines | 192038 | 192092 | +54 |
| Branches | 29420 | 29434 | +14 |
| Hits | 121546 | 121493 | -53 |
| Misses | 60977 | 61078 | +101 |
| Partials | 9515 | 9521 | +6 |
Issue(s)
The existing PinotFS.listFilesWithMetadata(URI, boolean) eagerly fetches all objects from cloud storage into memory before any filtering is applied. For buckets with millions of objects, this causes OOM risk, high API costs, and unnecessary latency, even when the caller only needs a handful of matching files.

Description
This pull request adds a new paginated, filtered listFilesWithMetadata overload to the PinotFS SPI and provides optimized implementations for S3, GCS, and ADLS Gen2 that apply per-page filtering and early termination, stopping cloud API calls as soon as enough matching files are found.

New SPI method:
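A minimal sketch of the shape such an overload and its default fallback could take. This is not the actual PinotFS declaration: ListingFs is a hypothetical stand-in interface that returns plain path strings rather than Pinot's FileMetadata objects, and the parameter names pathFilter and maxResults are assumptions drawn from the description.

```java
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for the SPI; names and return types are illustrative only.
interface ListingFs {
  // Existing 2-arg style: returns every path under the URI.
  List<String> listFiles(URI uri, boolean recursive) throws IOException;

  // Assumed shape of the 4-arg overload. The default fetches everything through
  // the 2-arg method and then filters in memory, so older implementations keep working.
  default List<String> listFiles(URI uri, boolean recursive, Predicate<String> pathFilter, int maxResults)
      throws IOException {
    List<String> matches = new ArrayList<>();
    for (String path : listFiles(uri, recursive)) {
      if (matches.size() >= maxResults) {
        break; // cap reached; the full listing has already been paid for at this point
      }
      if (pathFilter.test(path)) {
        matches.add(path);
      }
    }
    return matches;
  }
}
```

With a default like this, an implementation that overrides only the 2-arg method still honors the filter and the cap, but it pays for the full listing first; only an override of the 4-arg method can stop early.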
The default implementation falls back to the existing 2-arg method (fetch-all → filter in memory), so existing PinotFS implementations remain backward-compatible without changes.

The Problem
listFilesWithMetadata(uri, true) lists ALL files recursively.

When the bucket contains millions of objects but the caller only needs 10 matching files (e.g., preview), the full listing is wasteful:
- S3: ListObjectsV2 must exhaust all pages before returning
- GCS: Page<Blob> pages are fetched eagerly
- ADLS Gen2: PagedIterable<PathItem> is fully consumed

Solution
Each cloud implementation now overrides the new 4-arg method to:
- Filter each page with the Predicate<String> as it's received
- Skip directories: s3Object.key().endsWith("/"), blob.getName().endsWith("/"), item.isDirectory()
- Stop once maxResults is reached, abandoning further API calls

Implementation details per cloud provider:

| Provider | Pagination mechanism | Early termination |
|---|---|---|
| S3 | continuationToken + isTruncated loop | breaks out of for (page objects) + outer while (pages) |
| GCS | Page<Blob>.getNextPage() loop | breaks out of for (blobs) + stops calling getNextPage() |
| ADLS Gen2 | PagedIterable<PathItem> lazy iterator | break from for-each abandons the iterator, stopping further Azure API calls |

Code example (S3):
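As an illustrative sketch of the per-page filter and early-termination loop (not the PR's S3 code), the following uses plain Java with hypothetical KeyPage, PageFetcher, and PagedLister types in place of the AWS SDK. PageFetcher.fetch stands in for one ListObjectsV2 call, and a null nextToken plays the role of isTruncated being false.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical page of listing results; nextToken == null means there are no further pages.
final class KeyPage {
  final List<String> keys;
  final String nextToken;

  KeyPage(List<String> keys, String nextToken) {
    this.keys = keys;
    this.nextToken = nextToken;
  }
}

// Hypothetical stand-in for one paginated list call against the object store.
interface PageFetcher {
  KeyPage fetch(String continuationToken);
}

final class PagedLister {
  // Filters each page as it arrives and returns as soon as maxResults matches are collected.
  static List<String> list(PageFetcher fetcher, Predicate<String> pathFilter, int maxResults) {
    List<String> matches = new ArrayList<>();
    if (maxResults <= 0) {
      return matches; // nothing requested, so no API call is made
    }
    String token = null;
    do {
      KeyPage page = fetcher.fetch(token);
      for (String key : page.keys) {
        if (key.endsWith("/")) {
          continue; // skip directory markers
        }
        if (!pathFilter.test(key)) {
          continue; // per-page filtering
        }
        matches.add(key);
        if (matches.size() >= maxResults) {
          return matches; // early termination: remaining pages are never fetched
        }
      }
      token = page.nextToken;
    } while (token != null);
    return matches;
  }
}
```

A fetcher that counts its calls makes the saving visible: with three pages available and maxResults = 2 satisfied on the second page, the third page is never requested.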
Testing
- S3PinotFSPaginatedListTest (14 tests): single-page/multi-page listing, early termination (across pages, within a page, maxResults=1), continuationToken passing, predicate filtering, directory skipping, metadata attributes, S3a scheme support, prefix sent to S3
- GcsPinotFSPaginatedListTest (9 tests): single-page/multi-page, early termination (verifies page2.getValues() is never called), filtering, directory skipping (including prefix directory markers), null update time handling, metadata attributes
- ADLSGen2PinotFSPaginatedListTest (10 tests): all-match, early termination, maxResults=1, filtering, directory skipping, empty listings, metadata attributes, IOException wrapping DataLakeStorageException, combined filter with early termination