FIx #28768: surface external secret read failures instead of misrouting to create#28767
FIx #28768: surface external secret read failures instead of misrouting to create#28767IceS2 wants to merge 3 commits into
Conversation
…ting to create ExternalSecretsManager.existSecret caught every exception from the external secret backend and reported the secret as absent. A read that failed for any reason other than not-found (e.g. missing decrypt permission on the secret's KMS key) was therefore routed to the create path, which then fails with "secret already exists" and masks the real cause. existSecret now distinguishes a genuine not-found, via a per-provider classifier, from other read failures. Non-not-found failures are re-thrown with the secret id, the provider, and the underlying cause, so the operator sees the actual problem (permissions, KMS, throttling) rather than a misleading already-exists error. The store path now labels the exact secret being written, so nested encrypt errors point at the leaf field rather than the recursion prefix. The encrypt wrapper preserves the cause and is null-safe. Adds per-provider not-found classifier tests (AWS, SSM, GCP, Azure) and read/store failure tests on the AWS path.
✅ PR checks passedThe linked issue has a description and all required Shipping project fields set. Thanks! |
There was a problem hiding this comment.
Pull request overview
This PR improves correctness and operator visibility for external secrets managers by ensuring “secret existence” checks only return false for genuine not-found cases, while surfacing other read failures (e.g., missing decrypt/KMS permissions) instead of misrouting to non-idempotent create paths.
Changes:
- Refactors
ExternalSecretsManager.existSecretto distinguish not-found from other read failures via per-providerisNotFoundException(...), and rethrow read failures with actionable context. - Improves encryption error propagation by preserving exception causes and making error messages null-safe via
SecretsManager.exceptionMessage(...). - Adds/extends unit tests across AWS Secrets Manager, AWS SSM, GCP, and Azure KV for not-found classification and upsert routing behavior.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java | Core behavior change: not-found classification + fail-fast on non-not-found read failures; improved wrapping for store failures. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/SecretsManager.java | Preserve encryption exception causes; introduce null-safe exceptionMessage(...) helper used in encryption errors. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/AWSSecretsManager.java | Implements AWS “not found” classifier using ResourceNotFoundException. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/AWSSSMSecretsManager.java | Implements SSM “not found” classifier using ParameterNotFoundException. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/GCPSecretsManager.java | Implements GCP “not found” classifier using gax NotFoundException. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/AzureKVSecretsManager.java | Implements Azure Key Vault “not found” classifier using ResourceNotFoundException. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/KubernetesSecretsManager.java | Adds Kubernetes “not found” classifier (404 ApiException) to satisfy new abstract API. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/InMemorySecretsManager.java | Adds in-memory “not found” classifier aligned with its getSecret-throws-on-missing behavior. |
| openmetadata-service/src/test/java/org/openmetadata/service/secrets/AWSSecretsManagerTest.java | Adds tests for not-found handling, fail-fast on non-not-found read failures, and leaf secret id surfaced in store failure. |
| openmetadata-service/src/test/java/org/openmetadata/service/secrets/AWSSSMSecretsManagerTest.java | Adds classifier test for SSM not-found vs other errors. |
| openmetadata-service/src/test/java/org/openmetadata/service/secrets/GCPSecretsManagerTest.java | Adds classifier test for gax NotFoundException vs other errors. |
| openmetadata-service/src/test/java/org/openmetadata/service/secrets/AzureKVSecretsManagerTest.java | New test validating Azure not-found classification behavior. |
| private SecretsManagerException storeFailure(String secretName, RuntimeException cause) { | ||
| return new SecretsManagerException( | ||
| String.format( | ||
| "Failed to store secret [%s] in %s: %s", | ||
| secretName, getSecretsManagerProvider().value(), exceptionMessage(cause)), | ||
| cause); |
| public boolean existSecret(String secretName) { | ||
| boolean exists = false; | ||
| try { | ||
| boolean exists = getSecret(secretName) != null; | ||
| exists = getSecret(secretName) != null; | ||
| sleep(); | ||
| return exists; | ||
| } catch (Exception e) { | ||
| return false; | ||
| } catch (RuntimeException e) { | ||
| if (!isNotFoundException(e)) { | ||
| throw readFailure(secretName, e); | ||
| } | ||
| } | ||
| return exists; | ||
| } |
There was a problem hiding this comment.
💡 Edge Case: isNotFoundException only inspects top-level exception
Each provider's isNotFoundException uses a plain instanceof check on the top-level exception passed from existSecret's catch (RuntimeException e) block. I verified that all current providers throw their not-found type unwrapped from getSecret (AWS ResourceNotFoundException, SSM ParameterNotFoundException, Azure ResourceNotFoundException, GCP gax NotFoundException), so the classification is correct today.
However this is fragile: if any provider (or a future SDK upgrade) starts wrapping the absent-secret error inside another exception (e.g. GCP already wraps IOException from client creation into SecretsManagerUpdateException), a genuinely missing secret would be classified as a read failure and existSecret would throw via readFailure instead of returning false and routing to create — re-introducing a failure on the normal first-create path. Consider walking the cause chain when classifying, so a wrapped not-found is still recognized.
Recognize a not-found exception even when wrapped by another exception.:
// In existSecret, classify against the whole cause chain:
} catch (RuntimeException e) {
if (!isNotFoundInChain(e)) {
throw readFailure(secretName, e);
}
}
...
private boolean isNotFoundInChain(Throwable t) {
for (Throwable c = t; c != null; c = c.getCause()) {
if (c instanceof Exception ex && isNotFoundException(ex)) {
return true;
}
}
return false;
}
- Apply fix
Check the box to apply the fix or reply for a change | Was this helpful? React with 👍 / 👎
🟡 Playwright Results — all passed (16 flaky)✅ 4266 passed · ❌ 0 failed · 🟡 16 flaky · ⏭️ 88 skipped
🟡 16 flaky test(s) (passed on retry)
How to debug locally# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace |
…eep; drop per-field existence read for SSM/Azure Replace the fixed Thread.sleep(100) throttle in ExternalSecretsManager with an injectable, Guava-backed SecretsManagerRateLimiter (token bucket). It gates before each backend call, only blocks when the configured rate is exceeded, and is interrupt-safe -- no more UnhandledServerException on interrupt or a fixed 200ms penalty on single-field connections. The per-provider magic 100 is gone; InMemory uses a no-op limiter. Make storeOrUpdateSecret overridable. SSM and Azure now skip the GetSecret existence probe and use a read-free idempotent write: - Azure: setSecret is already an upsert (updateSecret merely delegated to it), so the existence read was pure waste. - SSM: attempt a tagged create and fall back to PutParameter(overwrite=true) on ParameterAlreadyExistsException. This halves the per-field call count for those providers and removes the read/decrypt-permission requirement on the write path. AWS Secrets Manager, GCP and Kubernetes keep the existence-check default, preserving the read-failure surfacing on their write paths. Fix AWSSSMSecretsManager sending Tags together with Overwrite=true: AWS SSM rejects PutParameter when both are set (400), so every update of an existing SSM-managed secret failed. Tags are now applied only on create. Tests: - SecretsManagerRateLimiterTest and ExternalSecretsManagerUnitTest (recording fake backend; no mocks of our own classes). - Reworked SSM mock tests and a new Azure injected-client test asserting no existence read happens on the write path. - New Docker-gated LocalStack Testcontainers ITs for AWS Secrets Manager and SSM (org.testcontainers:localstack), which caught the Overwrite+Tags bug that the HashMap-based mocks could not. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| static final double DEFAULT_PERMITS_PER_SECOND = 10.0; | ||
|
|
||
| private final SecretsManagerRateLimiter rateLimiter; | ||
|
|
||
| protected ExternalSecretsManager( | ||
| SecretsManagerProvider secretsManagerProvider, SecretsConfig secretsConfig) { | ||
| this( | ||
| secretsManagerProvider, | ||
| secretsConfig, | ||
| SecretsManagerRateLimiter.perSecond(DEFAULT_PERMITS_PER_SECOND)); | ||
| } | ||
|
|
||
| @VisibleForTesting | ||
| protected ExternalSecretsManager( | ||
| SecretsManagerProvider secretsManagerProvider, |
There was a problem hiding this comment.
💡 Performance: Global secret write rate hardcoded at 10/s caps concurrent throughput
ExternalSecretsManager now paces every external secret call through a single SecretsManagerRateLimiter held on the provider singleton (DEFAULT_PERMITS_PER_SECOND = 10.0). Because the secrets manager is a process-wide singleton, this token bucket is shared across all threads, so the aggregate write rate for the whole process is capped at 10 calls/sec regardless of concurrency.
The previous implementation used a per-call Thread.sleep(100) that each thread incurred independently, so N concurrent encrypt operations could progress at ~N*10/sec. The new shared limiter is more correct for protecting a per-account provider quota, but it is a behavioral change: bulk/parallel operations (e.g. mass service creation or ingestion startup that encrypts many connections concurrently) will now serialize at 10/sec globally. A connection with K password fields on a read-probing provider (AWS Secrets Manager, GCP, K8s) consumes 2 permits per field (existence read + write), so e.g. 30 fields ≈ 6s of throttling per connection, and concurrent saves queue behind one another.
The rate is hardcoded with no configuration hook. Consider exposing permitsPerSecond via SecretsManagerConfiguration (or a system property) so operators on higher-quota accounts can raise it, and documenting that the limit is now process-global rather than per-thread.
Was this helpful? React with 👍 / 👎
Code Review 👍 Approved with suggestions 0 resolved / 3 findingsRefactors external secret lookups to distinguish missing secrets from access failures, surfacing the root cause rather than misrouting to create. Consider broadening 💡 Edge Case: isNotFoundException only inspects top-level exception📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java:76-87 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/GCPSecretsManager.java:123-131 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/AWSSecretsManager.java:94-97 Each provider's However this is fragile: if any provider (or a future SDK upgrade) starts wrapping the absent-secret error inside another exception (e.g. GCP already wraps Recognize a not-found exception even when wrapped by another exception.💡 Quality: InMemory isNotFoundException classifies any SecretsManagerException
💡 Performance: Global secret write rate hardcoded at 10/s caps concurrent throughput📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java:30-44 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java:85-93 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java:103-114 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java:132-135 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/SecretsManagerRateLimiter.java:36-44
The previous implementation used a per-call The rate is hardcoded with no configuration hook. Consider exposing 🤖 Prompt for agentsOptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|



Problem
When an external secrets manager (AWS Secrets Manager, SSM, GCP, Azure KV) is configured, updating a connection secret can fail with a misleading error:
The secret genuinely exists, yet the server attempts to create it. The real cause (e.g. the role can call
secretsmanager:*but lackskms:Decrypton the secret's KMS key) is never surfaced.Root cause
ExternalSecretsManager.existSecretdecides between create and update by reading the secret, but caught every exception and returnedfalse:So a read that fails for any reason other than not-found (missing decrypt permission, KMS key policy, throttling) is treated as "absent" and routed to
createSecret, which then fails with already exists — masking the actual problem.createSecretis not idempotent, so once existence is wrongly reported the 400 is guaranteed.Change
existSecretdistinguishes a genuine not-found from other read failures. Each provider implementsisNotFoundException(...)for its own absent-secret type (AWSResourceNotFoundException, SSMParameterNotFoundException, GCP gaxNotFoundException, AzureResourceNotFoundException, K8sApiException 404). A genuine not-found still routes to create; any other read failure is re-thrown with the secret id, the provider, and the underlying cause..../authtype/apikey) rather than the recursion prefix (.../elasticsearch), so the failing target is unambiguous.This converts an uncertain read into a fail-fast with the real reason, instead of a misleading already-exists error. The AWS SDK already retries transient throttling/5xx, so anything reaching
existSecretis a persistent failure that should be surfaced.Operator-facing result
A missing-permission case now reads:
Tests
isNotFoundExceptionclassifier tests: AWS, SSM, GCP, Azure (newAzureKVSecretsManagerTest).existSecretfalse; non-not-found read failure -> rethrow, nocreateSecret; store failure names the leaf secret id, not the recursion prefix.Fixes: #28768