[fs] Optimize cloud file listing with per-page filtering and early termination in PinotFS #17847

Open
anshul98ks123 wants to merge 1 commit into apache:master from anshul98ks123:preview-pagination-list-metadata


@anshul98ks123 anshul98ks123 commented Mar 10, 2026

Issue(s)

The existing PinotFS.listFilesWithMetadata(URI, boolean) eagerly fetches all objects from cloud storage into memory before any filtering is applied. For buckets with millions of objects, this causes OOM risk, high API costs, and unnecessary latency — even when the caller only needs a handful of matching files.


Description

This pull request adds a new paginated, filtered listFilesWithMetadata overload to the PinotFS SPI and provides optimized implementations for S3, GCS, and ADLS Gen2 that apply per-page filtering and early termination — stopping cloud API calls as soon as enough matching files are found.

New SPI method:

default List<FileMetadata> listFilesWithMetadata(
    URI fileUri, boolean recursive,
    Predicate<String> pathFilter, int maxResults)
    throws IOException

The default implementation falls back to the existing 2-arg method (fetch-all → filter in memory), so existing PinotFS implementations remain backward-compatible without changes.
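The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual PinotFS code: it uses plain path strings instead of FileMetadata, the class and method names (DefaultFallbackSketch, filterAndLimit) are hypothetical, and returning an empty list for a non-positive maxResults is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class DefaultFallbackSketch {

  /**
   * Hypothetical helper: takes an already fully materialized listing
   * (what the 2-arg method would return) and filters/caps it in memory.
   */
  static List<String> filterAndLimit(List<String> allPaths, Predicate<String> pathFilter, int maxResults) {
    List<String> result = new ArrayList<>();
    if (maxResults <= 0) {
      return result; // assumption: a non-positive limit yields no results
    }
    for (String path : allPaths) {
      if (pathFilter.test(path)) {
        result.add(path);
        if (result.size() >= maxResults) {
          break; // cap reached; the full listing cost has already been paid
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> all = List.of("s3://b/a.csv", "s3://b/b.json", "s3://b/c.csv", "s3://b/d.csv");
    System.out.println(filterAndLimit(all, p -> p.endsWith(".csv"), 2));
  }
}
```

Note that the cap only trims the result here. The listing itself is still O(all files in prefix), which is the cost the cloud-specific overrides are meant to avoid.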


The Problem

| Step | Operation | Cost |
|------|-----------|------|
| 1 | listFilesWithMetadata(uri, true): list ALL files recursively | O(all files in prefix) |
| 2 | Caller applies glob/exclude filter in memory | O(n) |
| 3 | Caller takes first N results | |

When the bucket contains millions of objects but the caller only needs 10 matching files (e.g., preview), the full listing is wasteful:

  • S3: Paginated ListObjectsV2 must exhaust all pages before returning
  • GCS: All Page<Blob> pages are fetched eagerly
  • ADLS: PagedIterable<PathItem> is fully consumed

Solution

Each cloud implementation now overrides the new 4-arg method to:

  1. Apply the filter per page — test each object against the Predicate<String> as it's received
  2. Skip directories explicitly: s3Object.key().endsWith("/"), blob.getName().endsWith("/"), item.isDirectory()
  3. Terminate early — break out of the pagination loop once maxResults is reached, abandoning further API calls

Implementation details per cloud provider:

| Provider | Pagination Mechanism | Early Termination |
|----------|----------------------|-------------------|
| S3 | continuationToken + isTruncated loop | Breaks inner for (page objects) + outer while (pages) |
| GCS | Page<Blob>.getNextPage() loop | Breaks inner for (blobs) + stops calling getNextPage() |
| ADLS Gen2 | PagedIterable<PathItem> lazy iterator | break from for-each abandons the iterator, stopping further Azure API calls |
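To illustrate why breaking out of a for-each over a lazy iterator stops further fetches (the ADLS Gen2 row above), here is a small self-contained sketch. It does not use the Azure SDK: LazyPages is a hypothetical stand-in that loads one in-memory "page" only when the previous one is exhausted, and pagesFetched exists only to make the effect observable.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class LazyPagedIterableSketch {

  /** Hypothetical stand-in for a lazily paged iterable; not the Azure PagedIterable. */
  static final class LazyPages implements Iterable<String> {
    private final List<List<String>> pages;
    int pagesFetched = 0; // counts simulated "API calls"

    LazyPages(List<List<String>> pages) {
      this.pages = pages;
    }

    @Override
    public Iterator<String> iterator() {
      return new Iterator<String>() {
        private int nextPage = 0;
        private Iterator<String> current = List.<String>of().iterator();

        @Override
        public boolean hasNext() {
          // Load the next page only when the current one is exhausted.
          while (!current.hasNext() && nextPage < pages.size()) {
            current = pages.get(nextPage++).iterator();
            pagesFetched++;
          }
          return current.hasNext();
        }

        @Override
        public String next() {
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          return current.next();
        }
      };
    }
  }

  /** Collects at most maxResults items, abandoning the iterator once the cap is hit. */
  static List<String> firstN(LazyPages source, int maxResults) {
    List<String> result = new ArrayList<>();
    if (maxResults <= 0) {
      return result; // assumption: a non-positive limit yields no results
    }
    for (String item : source) {
      result.add(item);
      if (result.size() >= maxResults) {
        break; // no further hasNext() calls, so no further pages are loaded
      }
    }
    return result;
  }
}
```

With three pages of sizes 2, 2, and 1 and maxResults of 3, only the first two pages are loaded in this sketch, because the loop exits before the iterator is asked for a fourth element.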

Code example (S3):

@Override
public List<FileMetadata> listFilesWithMetadata(URI fileUri, boolean recursive,
    Predicate<String> pathFilter, int maxResults) throws IOException {
  List<FileMetadata> result = new ArrayList<>();
  String continuationToken = null;
  boolean isDone = false;
  while (!isDone && result.size() < maxResults) {
    // Build and execute ListObjectsV2Request with continuationToken
    ListObjectsV2Response response = ...;
    for (S3Object s3Object : response.contents()) {
      String key = s3Object.key();
      if (key.endsWith(DELIMITER)) continue; // skip directory markers
      String filePath = scheme + "://" + host + "/" + key; // scheme and host come from fileUri
      if (pathFilter.test(filePath)) {
        result.add(buildFileMetadata(s3Object, filePath));
        if (result.size() >= maxResults) break; // early termination
      }
    }
    isDone = !response.isTruncated();
    continuationToken = response.nextContinuationToken();
  }
  return result;
}
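The same loop shape can be exercised without any AWS dependency. The following is a minimal sketch under stated assumptions: FakePagedStore is a hypothetical in-memory stand-in for a paginated listing API, a page index plays the role of the continuation token, and pagesFetched is there only so the early termination can be checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class PagedListingSketch {

  /** Hypothetical in-memory stand-in for a paginated cloud listing API. */
  static final class FakePagedStore {
    private final List<List<String>> pages;
    int pagesFetched = 0; // counts simulated list calls

    FakePagedStore(List<List<String>> pages) {
      this.pages = pages;
    }

    /** Returns one page; a real client would pass a continuation token instead of an index. */
    List<String> fetchPage(int index) {
      pagesFetched++;
      return pages.get(index);
    }

    /** Plays the role of the "is truncated" flag on a listing response. */
    boolean hasMore(int index) {
      return index + 1 < pages.size();
    }
  }

  static List<String> listFiltered(FakePagedStore store, Predicate<String> pathFilter, int maxResults) {
    List<String> result = new ArrayList<>();
    int pageIndex = 0;
    boolean done = store.pages.isEmpty();
    while (!done && result.size() < maxResults) {
      for (String key : store.fetchPage(pageIndex)) {
        if (key.endsWith("/")) {
          continue; // skip directory markers
        }
        if (pathFilter.test(key)) {
          result.add(key);
          if (result.size() >= maxResults) {
            break; // stop scanning the current page
          }
        }
      }
      done = !store.hasMore(pageIndex);
      pageIndex++;
    }
    return result;
  }
}
```

The inner break only leaves the current page. It is the result.size() < maxResults check in the while condition that prevents the next page from being requested.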

Testing

  • S3PinotFSPaginatedListTest — 14 tests: single-page/multi-page listing, early termination (across pages, within a page, maxResults=1), continuationToken passing, predicate filtering, directory skipping, metadata attributes, S3a scheme support, prefix sent to S3
  • GcsPinotFSPaginatedListTest — 9 tests: single-page/multi-page, early termination (verifies page2.getValues() is never called), filtering, directory skipping (including prefix directory markers), null update time handling, metadata attributes
  • ADLSGen2PinotFSPaginatedListTest — 10 tests: all-match, early termination, maxResults=1, filtering, directory skipping, empty listings, metadata attributes, IOException wrapping DataLakeStorageException, combined filter with early termination


codecov-commenter commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 68.51852% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.24%. Comparing base (5e4d27e) to head (640b044).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| .../java/org/apache/pinot/spi/filesystem/PinotFS.java | 0.00% | 9 Missing ⚠️ |
| .../org/apache/pinot/plugin/filesystem/S3PinotFS.java | 82.22% | 4 Missing and 4 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17847      +/-   ##
============================================
- Coverage     63.29%   63.24%   -0.05%     
- Complexity     1466     1478      +12     
============================================
  Files          3189     3189              
  Lines        192038   192092      +54     
  Branches      29420    29434      +14     
============================================
- Hits         121546   121493      -53     
- Misses        60977    61078     +101     
- Partials       9515     9521       +6     
| Flag | Coverage Δ |
|------|------------|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.22% <68.51%> (-0.04%) ⬇️ |
| java-21 | 63.21% <68.51%> (-0.04%) ⬇️ |
| temurin | 63.24% <68.51%> (-0.05%) ⬇️ |
| unittests | 63.24% <68.51%> (-0.05%) ⬇️ |
| unittests1 | 55.56% <0.00%> (-0.04%) ⬇️ |
| unittests2 | 34.26% <68.51%> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.

@anshul98ks123 anshul98ks123 force-pushed the preview-pagination-list-metadata branch from 6cb3c2b to 728cc84 Compare March 12, 2026 10:17
@anshul98ks123 anshul98ks123 force-pushed the preview-pagination-list-metadata branch from 728cc84 to 640b044 Compare March 12, 2026 16:24