Skip to content

feat: vendor-pluggable S3 credentials for native scans#4309

Open
mbutrovich wants to merge 26 commits into
apache:mainfrom
mbutrovich:credential_provider
Open

feat: vendor-pluggable S3 credentials for native scans#4309
mbutrovich wants to merge 26 commits into
apache:mainfrom
mbutrovich:credential_provider

Conversation

@mbutrovich
Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich commented May 13, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

Comet's native scan paths (object_store for raw Parquet, opendal via iceberg-rust for Iceberg) bypass Spark's Hadoop S3A credential infrastructure entirely. Vendors with custom credential mechanisms (per-path STS, REST-vended creds, etc.) cannot reach Comet through any existing SPI: AWSCredentialsProvider.getCredentials() is parameterless, Hadoop S3A custom signers never return credentials outside the signing pipeline (the secret is an HMAC key, not present in the signed output), and Spark's CloudCredentialsProvider yields a single JWT per service name with no path argument. This PR adds a narrow, S3-specific SPI that vendors implement once, plus JNI plumbing to call it from native code.

Activation is config-driven and modeled on parquet.crypto.factory.class (PME KMS, #2447). The user names a single vendor class in a Spark/Hadoop config. There is no META-INF/services discovery and no Comet-specific on/off knob. Vendors that need to dispatch across multiple credential backends do so inside the named class, the same way Iceberg's DecryptionPropertiesFactory already does for Parquet keys.

Design: what Comet owns vs. what vendors own

The SPI is intentionally narrow. Comet owns the JNI shape and a one-shot init hook; vendors own everything that varies across backends.

What Comet owns

  1. A construction point. ensureInitialized(fqcn, dispatchKey, catalogProperties) lazily reflects the named class once per (fqcn, dispatchKey) and calls initialize(map) exactly once.
  2. A keying decision. dispatchKey is the Spark V2 catalog name on the Iceberg path and the bucket on the Parquet path, so two catalogs sharing one provider FQCN get isolated instances.
  3. A per-call invocation. getCredentialsForPath(bucket, path, mode) hands the vendor a per-request context. Comet does not cache the result on the JVM side.

What vendors own

  • Caching policy. Iceberg vendors get CachedSupplier inside VendedCredentialsProvider for free; vendors with custom STS write whatever cache fits.
  • Refresh trigger. Vendor decides when to refresh (proactive timer, on-demand at expiry, on 403 retry).
  • Distribution of driver-only state. Read it from initialize's catalogProperties (Comet has already serialized it through the native plan op), call back to a vendor service from the executor, or run a Spark broadcast inside the class.
  • Multi-backend dispatch. Vendor that handles both REST-vended and static creds inside one class does the if/else internally.
  • Expiry semantics. Vendors return expirationEpochMillis per credential. Comet forwards it to object_store and opendal, which already have their own credential caches and use the expiry to schedule the next refresh. The field is metadata pass-through, not a Comet-owned cache.

What is in this PR

  • Java SPI under org.apache.comet.cloud.s3 (in common): CometS3CredentialProvider (with default initialize(Map)), CometS3Credentials, CometS3AccessMode, and a CometS3CredentialDispatcher keyed by (FQCN, dispatchKey) with ensureInitialized(...) and a hot-path getCredentialsForPath(...).
  • Rust CometS3CredentialBridge implementing object_store::CredentialProvider and reqsign_core::ProvideCredential, plus a JNI handle in native/jni-bridge. Bridge invokes ensureInitialized once at construction and caches the four constant String arguments as JNI global refs to avoid per-call new_string allocations on the hot path.
  • Activation keys: fs.s3a.comet.credential.provider.class (with per-bucket override) for the Parquet object_store path, and s3.comet.credential.provider.class on the Spark catalog property for the Iceberg opendal path. Iceberg path uses the Spark V2 catalog name as dispatchKey; Parquet path uses the bucket.
  • The full unfiltered FileIO property bag crosses JNI as catalog_properties. The storage-prefix filter (s3./gcs./adls./client.) is applied native-side in iceberg_scan.rs::load_file_io immediately before FileIOBuilder.with_prop, so the bridge sees credentials.uri, OAuth tokens, and any vendor-custom keys without a parallel field or a driver-side broadcast.
  • IcebergScanExec derives a redacting Debug so plan dumps and tracing do not leak the property bag.
  • iceberg-rust pin bumped to 83b4595 for reqsign-core 3.0 and the CustomAwsCredentialLoader API. testcontainers bumped to 1.21.4 + docker-java 3.7.1 (modern daemons require Docker API >= 1.40).
  • Reference IcebergRESTVendedS3Provider (test scope, Spark 4.x build only) wrapping Iceberg's VendedCredentialsProvider. Placed under spark/src/test/spark-4.x/java so it is excluded from the Spark 3.4 build (where iceberg-spark-runtime-3.4_* does not expose VendedCredentialsProvider on the test classpath) and the Spark 3.5 build (where Comet pins Iceberg 1.8.1, predating the in-properties short-circuit added in AWS: Don't fetch credential from endpoint if properties contain a valid credential iceberg#12504, 1.9.0). Demonstrates that caching/refresh stay vendor-side. A future PR can promote this to a runtime artifact behind an optional iceberg-aws dependency.
  • New user-guide page covers operator setup (enable, verify, troubleshoot), vendor SPI contract (returns-or-throws, in-class dispatch, initialize(Map) lifecycle, multi-catalog isolation via dispatchKey), and explicit out-of-scope notes (Comet does not own a TTL cache, broadcast cache, or per-plan invariants).

How are these changes tested?

  • JUnit CometS3CredentialDispatcherTest: round-trip, ensureInitialized idempotence, distinct dispatchKey isolation, missing-class / wrong-interface / no-arg-ctor / empty-FQCN failure modes, get-without-init guard.
  • JUnit IcebergRESTVendedS3ProviderTest (Spark 4.x build only, exercised against Iceberg 1.10.0): wrapper instantiates, returns vended creds via the in-properties short-circuit path, errors before init.
  • End-to-end CometS3CredentialBridgeSuite (Minio): Parquet on S3, Iceberg on S3, REST + SPI integration that asserts a non-storage-prefix sentinel key from the vended-creds bag reaches initialize(Map) (so the SPI receives arbitrary REST keys end-to-end), and multi-catalog isolation across two catalogs sharing one FQCN.
  • CometS3CredentialBridgeSuite is added to dev/ci/check-suites.py ignore list (manual, like the other Docker-dependent S3 suites).

@mbutrovich mbutrovich changed the title Credential provider feat: Spark custom credential providers for native scans May 13, 2026
@mbutrovich mbutrovich force-pushed the credential_provider branch from b549155 to 0cd8a36 Compare May 14, 2026 14:33
@mbutrovich mbutrovich moved this from Todo to In progress in Comet Development May 14, 2026
Comment thread common/src/main/java/org/apache/comet/cloud/CometCloudCredentialDispatcher.java Outdated
@mbutrovich mbutrovich changed the title feat: Spark custom credential providers for native scans feat: vendor-pluggable S3 credentials for Comet native scans May 14, 2026
@mbutrovich mbutrovich changed the title feat: vendor-pluggable S3 credentials for Comet native scans feat: vendor-pluggable S3 credentials for native scans May 14, 2026
@mbutrovich mbutrovich marked this pull request as ready for review May 14, 2026 16:18
… activation (fs.s3a.comet.credential.provider.class for Parquet, s3.comet.credential.provider.class for Iceberg), so the bridge is opt-in per Spark config rather than implicit on classpath presence.
@karuppayya
Copy link
Copy Markdown
Contributor

karuppayya commented May 18, 2026

edit:
deleted this comment since i posted in incorrect PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

3 participants