Skip to content

aws: msk_iam: pass TLS context to credential provider chain#11912

Open
jamesdangercarpenter wants to merge 1 commit into
fluent:masterfrom
jamesdangercarpenter:fix/msk-iam-irsa-tls-context
Open

aws: msk_iam: pass TLS context to credential provider chain#11912
jamesdangercarpenter wants to merge 1 commit into
fluent:masterfrom
jamesdangercarpenter:fix/msk-iam-irsa-tls-context

Conversation

@jamesdangercarpenter

@jamesdangercarpenter jamesdangercarpenter commented Jun 4, 2026

Copy link
Copy Markdown

Summary

Native aws_msk_iam authentication is currently unusable with IRSA (IAM Roles for Service Accounts) on EKS.

The OAUTHBEARER token-refresh callback builds its AWS credential provider chain with a NULL TLS context:

flb_standard_chain_provider_create(config->flb_config, NULL, ...)

The provider chain reaches AWS endpoints over HTTPS — in particular the STS AssumeRoleWithWebIdentity call that IRSA depends on. With no TLS context that call cannot connect, and the chain silently falls back to the EC2 instance metadata service, picking up the node instance role instead of the pod's IRSA role. The MSK token is then signed with the wrong principal and the broker rejects it:

SASL authentication error: Access denied

Fix

Create a dedicated TLS instance for each credential provider — verifying peers against the system trust store, exactly as the other AWS output plugins (out_cloudwatch_logs, out_s3, out_kinesis_*) do — and tear it down together with the provider. A separate instance is used at each provider-creation site, since TLS instances cannot be shared between providers (see the note in flb_aws_credentials.h).

This is a focused fix for the credential-fetch TLS bug only; it does not change the configuration surface or token-signing logic.

Testing

  • Compiles cleanly.
  • Smoke-tested on EKS with IRSA against a live MSK cluster — confirm the STS AssumeRoleWithWebIdentity path is used (pod role, not node role) and authentication succeeds.

Fixes #11255

Summary by CodeRabbit

  • Bug Fixes
    • Improved AWS MSK IAM OAuth token flow by adding robust TLS context handling during credential/provider initialization and token refresh. TLS contexts are now created and reliably cleaned up on success and error paths, reducing resource leaks and improving connection reliability during authentication.

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e24adf7a-7599-42c0-b0ec-974d47bd3098

📥 Commits

Reviewing files that changed from the base of the PR and between 5f5a1d6 and eadd8a7.

📒 Files selected for processing (1)
  • src/aws/flb_aws_msk_iam.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/aws/flb_aws_msk_iam.c

📝 Walkthrough

Walkthrough

This PR adds creation and proper cleanup of per-request TLS contexts (cred_tls) and passes them into AWS credentials provider creation for both MSK IAM token payload generation and OAuth token refresh paths.

Changes

MSK IAM TLS Integration

Layer / File(s) Summary
TLS-enabled AWS credentials in MSK IAM payload builder
src/aws/flb_aws_msk_iam.c
Adds #include <flb_tls.h>, declares cred_tls, creates it via flb_tls_create(...), passes it to flb_standard_chain_provider_create(...) instead of NULL, and ensures cred_tls is destroyed on both success and error paths in build_msk_iam_payload.
TLS-enabled AWS credentials in OAuth token refresh
src/aws/flb_aws_msk_iam.c
Declares cred_tls, creates it via flb_tls_create(...), passes it to flb_standard_chain_provider_create(...) for credential initialization and retrieval, and destroys cred_tls during oauthbearer_token_refresh_cb cleanup.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop with a tiny TLS tote,
A context tucked for each secure note,
Two code paths carry it snug and tight,
Then clean it up when the call sleeps at night. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and concisely describes the main change: passing a TLS context to the credential provider chain in the AWS MSK IAM module.
Linked Issues check ✅ Passed The pull request successfully addresses the root cause identified in issue #11255 by creating dedicated TLS contexts for credential provider chains, enabling successful STS AssumeRoleWithWebIdentity calls required for IRSA support.
Out of Scope Changes check ✅ Passed All changes are scoped to the credential-fetch TLS bug in the MSK IAM provider chain; no configuration, token-signing logic, or unrelated modifications are present.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
src/aws/flb_aws_msk_iam.c

src/aws/flb_aws_msk_iam.c:20:10: fatal error: 'fluent-bit/flb_info.h' file not found
20 | #include <fluent-bit/flb_info.h>
| ^~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/eadd8a7c4642b513c8c35fb770a0562689495dd1-684f4d04730c2bb3/tmp/clang_command_.tmp.2d74eb.txt
++Contents of '/tmp/coderabbit-infer/eadd8a7c4642b513c8c35fb770a0562689495dd1-684f4d04730c2bb3/tmp/clang_command_.tmp.2d74eb.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-linux-gnu" "-emit-obj" "-mr

... [truncated 736 characters] ...

fer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/lib/clang/18/include"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fskip-odr-check-in-gmf"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/684f4d04730c2bb3/file.o" "-x" "c"
"src/aws/flb_aws_msk_iam.c" "-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jamesdangercarpenter jamesdangercarpenter marked this pull request as ready for review June 9, 2026 06:05
@jamesdangercarpenter jamesdangercarpenter requested a review from a team as a code owner June 9, 2026 06:05

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f5a1d661b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/aws/flb_aws_msk_iam.c

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/aws/flb_aws_msk_iam.c (1)

248-267: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Resource leak: cred_tls not destroyed on early return paths.

The newly-introduced cred_tls (line 229) is not cleaned up when these three error paths return directly. All should use goto error to ensure proper cleanup.

🐛 Proposed fix
     if (temp_provider->provider_vtable->init(temp_provider) != 0) {
         flb_error("[aws_msk_iam] build_msk_iam_payload: failed to initialize AWS credentials provider");
         flb_aws_provider_destroy(temp_provider);
-        return NULL;
+        temp_provider = NULL;  /* Prevent double-free in error handler */
+        goto error;
     }
 
     /* Get credentials */
     creds = temp_provider->provider_vtable->get_credentials(temp_provider);
     if (!creds) {
         flb_error("[aws_msk_iam] build_msk_iam_payload: failed to get credentials");
-        flb_aws_provider_destroy(temp_provider);
-        return NULL;
+        goto error;
     }
 
     if (!creds->access_key_id || !creds->secret_access_key) {
         flb_error("[aws_msk_iam] build_msk_iam_payload: incomplete credentials");
-        flb_aws_credentials_destroy(creds);
-        flb_aws_provider_destroy(temp_provider);
-        return NULL;
+        goto error;
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aws/flb_aws_msk_iam.c` around lines 248 - 267, In build_msk_iam_payload,
the early returns after temp_provider->provider_vtable->init,
temp_provider->provider_vtable->get_credentials, and the incomplete-credentials
check leak cred_tls; replace those direct returns with a jump to the existing
error cleanup block (use goto error) so that cred_tls is destroyed along with
flb_aws_credentials_destroy(creds) and flb_aws_provider_destroy(temp_provider);
ensure the error path explicitly frees cred_tls and leaves other cleanup calls
intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/aws/flb_aws_msk_iam.c`:
- Around line 248-267: In build_msk_iam_payload, the early returns after
temp_provider->provider_vtable->init,
temp_provider->provider_vtable->get_credentials, and the incomplete-credentials
check leak cred_tls; replace those direct returns with a jump to the existing
error cleanup block (use goto error) so that cred_tls is destroyed along with
flb_aws_credentials_destroy(creds) and flb_aws_provider_destroy(temp_provider);
ensure the error path explicitly frees cred_tls and leaves other cleanup calls
intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 04e5ea10-5dad-4661-8b17-fc010b2b56f7

📥 Commits

Reviewing files that changed from the base of the PR and between f617ed1 and 5f5a1d6.

📒 Files selected for processing (1)
  • src/aws/flb_aws_msk_iam.c

The MSK IAM OAUTHBEARER callback created its AWS credential provider chain
with a NULL TLS context:

    flb_standard_chain_provider_create(config->flb_config, NULL, ...)

The provider chain reaches AWS endpoints over HTTPS - in particular the STS
AssumeRoleWithWebIdentity call used by IRSA (IAM Roles for Service Accounts)
on EKS. With no TLS context that call cannot connect, and the chain silently
falls back to the EC2 instance metadata service, picking up the node role
instead of the pod's IRSA role. The MSK token is then signed with the wrong
principal and the broker rejects it with "SASL authentication error: Access
denied", making IRSA unusable with native aws_msk_iam.

Create a dedicated TLS instance (verifying peers against the system trust
store, as the other AWS plugins do) for each credential provider and tear it
down together with the provider. A separate instance is used at each
provider-creation site since TLS instances cannot be shared between
providers.

Fixes fluent#11255

Signed-off-by: James Carpenter <james@james-carpenter.net>
@jamesdangercarpenter jamesdangercarpenter force-pushed the fix/msk-iam-irsa-tls-context branch from 5f5a1d6 to eadd8a7 Compare June 9, 2026 06:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

1 participant