fix: reduce migration noise and fix regressions in Go helm step by stevenolen · Pull Request #267 · posit-dev/ptd

stevenolen · 2026-04-22T12:42:56Z

What

Fixes several regressions introduced during the Python-to-Go helm step migration (PR #262) that caused unnecessary resource churn or incorrect infrastructure:

1. EC2NodeClass subnet/security group selectors were always empty

The Go code read subnet_ids / security_group_ids from ptd.yaml, but no workload config has these fields. This produced empty subnetSelectorTerms/securityGroupSelectorTerms, which Karpenter's webhook rejected with a validation error.

Fix: call DescribeCluster to fetch the live EKS VPC config (subnet IDs, SG IDs, cluster security group ID). Also calls EC2 DescribeSecurityGroups to find FSX and EFS NFS SGs by name prefix, matching Python.

2. NodePool disruption fields defaulted to empty string

ConsolidationPolicy and ConsolidateAfter are plain string fields that default to "" when not specified in YAML. Karpenter rejects empty values.

Fix: apply the same defaults Python used — "WhenEmptyOrUnderutilized" and "5m".

3. session_taints: true was ignored

The Go code only included taints from the explicit taints: list, missing the synthesized workload-type: session NoSchedule taint that Python generated from session_taints: true.

Fix: synthesize the session taint when session_taints: true is set, matching Python.

4. Service account names lost their .posit.team suffix

Python's Roles enum appended .posit.team to all SA names (e.g., loki.posit.team, alloy.posit.team, aws-load-balancer-controller.posit.team). The Go migration dropped these suffixes, breaking IRSA trust policies.

Fix: restore .posit.team suffix on all six SA names (LBC, FSX controller, FSX node, Loki, Mimir, Alloy).

5. FSX role name was wrong

Python generated aws-fsx-openzfs-csi-driver.{name}.posit.team; Go was generating fsx-openzfs.{name}.posit.team, causing IAM role lookup failures.

Fix: use the correct prefix.

6. posit.team/managed-by label was missing from all Kubernetes resources

Python added {"posit.team/managed-by": "ptd.pulumi_resources.aws_workload_helm"} to every resource via required_tags. The Go migration dropped this label entirely.

Fix: add awsHelmChartCR wrapper (AWS-specific version of helmChartCR) that includes the label on HelmChart resources; add the label directly to all other resource types (Namespaces, ConfigMaps, Secrets, Ingresses, NodePools, EC2NodeClasses, PriorityClasses, Deployments).

7. public_load_balancer defaulted to false instead of true

Python's AWSWorkload dataclass defaults public_load_balancer=True, meaning internet-facing ALBs by default. Go's bool zero value is false, flipping the scheme to internal for workloads that don't explicitly set this field.

Fix: change to *bool; nil (unset) is treated as true to match Python.

8. Minor label and value details dropped

app: "traefik" label on Traefik Ingress resources (both single and per-site)
runAsUser: null in Alloy securityContext (Python always included this key)
ALB tag string ordering (posit.team/true-name, posit.team/environment, Name) matches Python

Changes

lib/aws/ec2.go (new): GetNFSSecurityGroupID — finds a security group by name prefix in a VPC
lib/aws/eks.go: GetClusterVPCConfig — fetches subnet IDs, SG IDs, and VPC ID from DescribeCluster
lib/steps/helm_aws.go: VPC config pre-fetch; disruption defaults; session taint synthesis; SA name fixes; FSX role name fix; awsHelmChartCR wrapper; label additions; public_load_balancer nil check; Traefik ingress app label; Alloy runAsUser: nil; tag ordering
lib/types/workload.go: PublicLoadBalancer changed from bool to *bool
lib/go.mod / cmd/go.mod: add github.com/aws/aws-sdk-go-v2/service/ec2 dependency

Test

Dry-run against four staging workloads (two AWS, one Azure) after all fixes:

AWS workloads: remaining diff is label additions (intentional), provider version (pre-existing from fix: correct Azure Alloy ConfigMap alias and helm chart YAML indentation #262), Alloy ConfigMap recreation (Pulumi URN change from Python custom type, pre-existing from fix: correct Azure Alloy ConfigMap alias and helm chart YAML indentation #262), and other cosmetic rendering differences from fix: correct Azure Alloy ConfigMap alias and helm chart YAML indentation #262 — no Karpenter validation errors, no SA/role name mismatches, no scheme changes
Azure workload: 0 diff (29 unchanged) — Azure step unaffected

Previously the helm step used subnet_ids and security_group_ids from ptd.yaml (which no workload has), producing empty selector terms that failed Karpenter validation. Now the step calls DescribeCluster to fetch the live VPC config and also looks up FSX/EFS NFS security groups via EC2 DescribeSecurityGroups, matching the Python code's behavior. Also fixes two related bugs in the NodePool spec: - consolidationPolicy/consolidateAfter defaulted to "" when not in YAML; now defaults to "WhenEmptyOrUnderutilized"/"5m" matching Python - session_taints: true was ignored; now synthesizes the workload-type session taint as Python did

claude · 2026-04-22T12:43:15Z

Claude finished @stevenolen's task —— View job

Review Complete

Read review guidelines
Read PR diff and changed files
Create pending review with inline comments
Submit review

Submitted a non-blocking review with 4 findings:

Critical: Overprovisioning tolerations in awsHelmKarpenterOverprovisioning still use nodePool.Taints instead of the synthesized taint list — session taint pods will be unschedulable.
Important: DescribeSecurityGroups in ec2.go is unpaginated; a server-side group-name filter would be safer.
Suggestion: Rename GetNFSSecurityGroupID → GetSecurityGroupIDByNamePrefix for clarity.
Suggestion: Three DescribeCluster callers in eks.go could share a helper (tech debt, not blocking).

claude

Good bugfix PR — the three issues identified (empty subnet/SG selectors, empty disruption defaults, missing session taints) are real and the approach is sound. A few items to address:

Critical: Overprovisioning tolerations miss session taints

awsHelmKarpenterOverprovisioning (line ~1552) builds tolerations from nodePool.Taints, not the synthesized allTaints that includes the session taint. When session_taints: true is set, overprovisioning pods won't tolerate the workload-type: session NoSchedule taint, making them unschedulable on those Karpenter nodes.

Consider pre-computing the full taint list (including the synthesized session taint) before passing it to both awsHelmKarpenter and awsHelmKarpenterOverprovisioning, or duplicating the synthesis logic.

Important: `DescribeSecurityGroups` is unpaginated (`lib/aws/ec2.go:20-27`)

DescribeSecurityGroups paginates (default max 1000 results). This VPC likely has fewer SGs today, but it's fragile. Adding a server-side group-name filter eliminates both the pagination risk and the linear scan:

Filters: []ec2types.Filter{
    {
        Name:   aws.String("vpc-id"),
        Values: []string{vpcID},
    },
    {
        Name:   aws.String("group-name"),
        Values: []string{namePrefix + "*"},
    },
},

Suggestion: Function naming (`lib/aws/ec2.go:14`)

GetNFSSecurityGroupID is domain-specific ("NFS") but the implementation is a generic "find SG by name prefix" lookup. Consider GetSecurityGroupIDByNamePrefix — the caller conveys intent via the namePrefix argument.

Suggestion: DescribeCluster tech debt (`lib/aws/eks.go:111`)

GetClusterVPCConfig is the third function calling DescribeCluster (alongside GetClusterEndpoint and GetClusterInfo). Not blocking for this bugfix, but worth noting — these could share a single helper that returns the full response.

- Fix FSX role name: aws-fsx-openzfs-csi-driver.{name}.posit.team - Fix SA names: add .posit.team suffix to LBC, FSX controller/node, Loki, Mimir, and Alloy service accounts (matching Python Roles enum) - Add posit.team/managed-by label to all Kubernetes resources via awsHelmChartCR wrapper and direct metadata additions on non-HelmChart resources (namespaces, ConfigMaps, Secrets, NodePools, EC2NodeClasses, PriorityClasses, Deployments)

…Python Python AWSWorkload dataclass defaults public_load_balancer=True. Go's bool zero value (false) caused the ALB scheme to flip from internet-facing to internal for workloads that don't explicitly set this field. Change to *bool so nil (unset) is treated as true, matching Python behavior.

- Add app=traefik label to Traefik Ingress (was present in Python output) - Match Python tag ordering: posit.team/true-name, posit.team/environment, Name - Set runAsUser: nil in Alloy securityContext (Python always included this key)

- alloyClusterName now uses "default_{fqn}-control-plane" matching Python's eks_cluster_name() — prevents breaking existing Grafana queries - Re-enable server-side apply on the k8s provider (was set in Python) - Bump pulumi-kubernetes SDK from v4.21.1 to v4.25.0 to match the pulumi_kubernetes Python package version and avoid provider downgrade

… migrated steps)

…er stable

stevenolen · 2026-04-22T15:51:38Z

@claude review

claude · 2026-04-22T15:51:53Z

Claude finished @stevenolen's task —— View job

Review Complete

Read PR diff and changed files
Check status of previous review findings
Look for new issues
Submit review

All 8 fixes look correct and well-structured. The awsHelmChartCR wrapper, VPC config pre-fetch, disruption defaults, session taint synthesis, SA name restoration, FSX role name fix, *bool for PublicLoadBalancer, and label/tag fixes all match the described Python behavior.

One outstanding issue from my previous review that is still present:

Overprovisioning tolerations still use nodePool.Taints instead of synthesized taint list (lib/steps/helm_aws.go:1603-1611)

In awsHelmKarpenterOverprovisioning, tolerations are built from nodePool.Taints:

tolerations := make(corev1.TolerationArray, 0, len(nodePool.Taints))
for _, taint := range nodePool.Taints {

But in awsHelmKarpenter (line 1444-1458), the session taint is synthesized into allTaints when session_taints: true. The overprovisioning function never sees this synthesized taint, so overprovisioning pods won't tolerate workload-type: session NoSchedule and will be unschedulable on those Karpenter nodes.

Fix would be to either: (a) replicate the session taint synthesis in awsHelmKarpenterOverprovisioning, or (b) pre-compute the full taint list per node pool and pass it through KarpenterConfig.

Everything else LGTM. The non-blocking suggestions from the first review (function naming, server-side SG filter, DescribeCluster dedup) are still valid as future tech debt items but not blocking.

…g NodePool logic

Python's eks_cluster_name() produces "default_{fqn}-control-plane" where fqn = compoundName + "-" + release (e.g. ganso01-staging-20250328). The Go migration was missing the release segment, causing Alloy to ship metrics under the wrong cluster label and breaking Grafana dashboards.

claude Bot reviewed Apr 22, 2026

View reviewed changes

stevenolen added 3 commits April 22, 2026 09:02

stevenolen changed the title ~~fix: fetch Karpenter subnet/SG IDs from EKS cluster VPC config~~ fix: reduce migration noise and fix regressions in Go helm step Apr 22, 2026

stevenolen added 3 commits April 22, 2026 09:44

fix: remove EnableServerSideApply from k8s provider (not set in other…

f103837

… migrated steps)

revert: restore pulumi-kubernetes SDK to v4.21.1 to keep Azure provid…

51b33c0

…er stable

stevenolen marked this pull request as ready for review April 22, 2026 15:40

stevenolen requested a review from a team as a code owner April 22, 2026 15:40

fix: use deterministic workload port instead of hardcoded 1080 for proxy

99dfe02

timtalbot previously approved these changes Apr 22, 2026

View reviewed changes

fix: synthesize session taint in overprovisioning tolerations matchin…

3ca8f49

…g NodePool logic

stevenolen dismissed timtalbot’s stale review via 3ca8f49 April 22, 2026 15:54

timtalbot previously approved these changes Apr 22, 2026

View reviewed changes

stevenolen added this pull request to the merge queue Apr 22, 2026

stevenolen removed this pull request from the merge queue due to a manual request Apr 22, 2026

stevenolen dismissed timtalbot’s stale review via 82c3c60 April 22, 2026 16:42

timtalbot approved these changes Apr 22, 2026

View reviewed changes

stevenolen added this pull request to the merge queue Apr 22, 2026

Merged via the queue into main with commit 108cf40 Apr 22, 2026
8 checks passed

stevenolen deleted the fix-karpenter-selectors branch April 22, 2026 17:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: reduce migration noise and fix regressions in Go helm step#267

fix: reduce migration noise and fix regressions in Go helm step#267
stevenolen merged 10 commits into
mainfrom
fix-karpenter-selectors

stevenolen commented Apr 22, 2026 •

edited

Loading

Uh oh!

claude Bot commented Apr 22, 2026 •

edited

Loading

Uh oh!

claude Bot left a comment

Uh oh!

stevenolen commented Apr 22, 2026

Uh oh!

claude Bot commented Apr 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

stevenolen commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Changes

Test

Uh oh!

claude Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review Complete

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Critical: Overprovisioning tolerations miss session taints

Important: DescribeSecurityGroups is unpaginated (lib/aws/ec2.go:20-27)

Suggestion: Function naming (lib/aws/ec2.go:14)

Suggestion: DescribeCluster tech debt (lib/aws/eks.go:111)

Uh oh!

stevenolen commented Apr 22, 2026

Uh oh!

claude Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review Complete

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stevenolen commented Apr 22, 2026 •

edited

Loading

claude Bot commented Apr 22, 2026 •

edited

Loading

Important: `DescribeSecurityGroups` is unpaginated (`lib/aws/ec2.go:20-27`)

Suggestion: Function naming (`lib/aws/ec2.go:14`)

Suggestion: DescribeCluster tech debt (`lib/aws/eks.go:111`)

claude Bot commented Apr 22, 2026 •

edited

Loading