feat: vendor-pluggable S3 credentials for native scans#4309
Open
mbutrovich wants to merge 26 commits into
Open
feat: vendor-pluggable S3 credentials for native scans#4309mbutrovich wants to merge 26 commits into
mbutrovich wants to merge 26 commits into
Conversation
# Conflicts: # docs/source/contributor-guide/index.md
b549155 to
0cd8a36
Compare
andygrove
reviewed
May 14, 2026
… activation (fs.s3a.comet.credential.provider.class for Parquet, s3.comet.credential.provider.class for Iceberg), so the bridge is opt-in per Spark config rather than implicit on classpath presence.
Contributor
|
edit: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #.
Rationale for this change
Comet's native scan paths (
object_storefor raw Parquet,opendalviaiceberg-rustfor Iceberg) bypass Spark's Hadoop S3A credential infrastructure entirely. Vendors with custom credential mechanisms (per-path STS, REST-vended creds, etc.) cannot reach Comet through any existing SPI:AWSCredentialsProvider.getCredentials()is parameterless, Hadoop S3A custom signers never return credentials outside the signing pipeline (the secret is an HMAC key, not present in the signed output), and Spark'sCloudCredentialsProvideryields a single JWT per service name with no path argument. This PR adds a narrow, S3-specific SPI that vendors implement once, plus JNI plumbing to call it from native code.Activation is config-driven and modeled on
parquet.crypto.factory.class(PME KMS, #2447). The user names a single vendor class in a Spark/Hadoop config. There is noMETA-INF/servicesdiscovery and no Comet-specific on/off knob. Vendors that need to dispatch across multiple credential backends do so inside the named class, the same way Iceberg'sDecryptionPropertiesFactoryalready does for Parquet keys.Design: what Comet owns vs. what vendors own
The SPI is intentionally narrow. Comet owns the JNI shape and a one-shot init hook; vendors own everything that varies across backends.
What Comet owns
ensureInitialized(fqcn, dispatchKey, catalogProperties)lazily reflects the named class once per(fqcn, dispatchKey)and callsinitialize(map)exactly once.dispatchKeyis the Spark V2 catalog name on the Iceberg path and the bucket on the Parquet path, so two catalogs sharing one provider FQCN get isolated instances.getCredentialsForPath(bucket, path, mode)hands the vendor a per-request context. Comet does not cache the result on the JVM side.What vendors own
CachedSupplierinsideVendedCredentialsProviderfor free; vendors with custom STS write whatever cache fits.403retry).initialize'scatalogProperties(Comet has already serialized it through the native plan op), call back to a vendor service from the executor, or run a Spark broadcast inside the class.expirationEpochMillisper credential. Comet forwards it toobject_storeand opendal, which already have their own credential caches and use the expiry to schedule the next refresh. The field is metadata pass-through, not a Comet-owned cache.What is in this PR
org.apache.comet.cloud.s3(incommon):CometS3CredentialProvider(withdefault initialize(Map)),CometS3Credentials,CometS3AccessMode, and aCometS3CredentialDispatcherkeyed by(FQCN, dispatchKey)withensureInitialized(...)and a hot-pathgetCredentialsForPath(...).CometS3CredentialBridgeimplementingobject_store::CredentialProviderandreqsign_core::ProvideCredential, plus a JNI handle innative/jni-bridge. Bridge invokesensureInitializedonce at construction and caches the four constant String arguments as JNI global refs to avoid per-callnew_stringallocations on the hot path.fs.s3a.comet.credential.provider.class(with per-bucket override) for the Parquetobject_storepath, ands3.comet.credential.provider.classon the Spark catalog property for the Icebergopendalpath. Iceberg path uses the Spark V2 catalog name asdispatchKey; Parquet path uses the bucket.catalog_properties. The storage-prefix filter (s3./gcs./adls./client.) is applied native-side iniceberg_scan.rs::load_file_ioimmediately beforeFileIOBuilder.with_prop, so the bridge seescredentials.uri, OAuth tokens, and any vendor-custom keys without a parallel field or a driver-side broadcast.IcebergScanExecderives a redactingDebugso plan dumps and tracing do not leak the property bag.83b4595for reqsign-core 3.0 and theCustomAwsCredentialLoaderAPI. testcontainers bumped to 1.21.4 + docker-java 3.7.1 (modern daemons require Docker API >= 1.40).IcebergRESTVendedS3Provider(test scope, Spark 4.x build only) wrapping Iceberg'sVendedCredentialsProvider. Placed underspark/src/test/spark-4.x/javaso it is excluded from the Spark 3.4 build (whereiceberg-spark-runtime-3.4_*does not exposeVendedCredentialsProvideron the test classpath) and the Spark 3.5 build (where Comet pins Iceberg 1.8.1, predating the in-properties short-circuit added in AWS: Don't fetch credential from endpoint if properties contain a valid credential iceberg#12504, 1.9.0). Demonstrates that caching/refresh stay vendor-side. A future PR can promote this to a runtime artifact behind an optionaliceberg-awsdependency.initialize(Map)lifecycle, multi-catalog isolation viadispatchKey), and explicit out-of-scope notes (Comet does not own a TTL cache, broadcast cache, or per-plan invariants).How are these changes tested?
CometS3CredentialDispatcherTest: round-trip,ensureInitializedidempotence, distinctdispatchKeyisolation, missing-class / wrong-interface / no-arg-ctor / empty-FQCN failure modes, get-without-init guard.IcebergRESTVendedS3ProviderTest(Spark 4.x build only, exercised against Iceberg 1.10.0): wrapper instantiates, returns vended creds via the in-properties short-circuit path, errors before init.CometS3CredentialBridgeSuite(Minio): Parquet on S3, Iceberg on S3, REST + SPI integration that asserts a non-storage-prefix sentinel key from the vended-creds bag reachesinitialize(Map)(so the SPI receives arbitrary REST keys end-to-end), and multi-catalog isolation across two catalogs sharing one FQCN.CometS3CredentialBridgeSuiteis added todev/ci/check-suites.pyignore list (manual, like the other Docker-dependent S3 suites).