Skip to content

fix: reduce migration noise and fix regressions in Go helm step#267

Merged
stevenolen merged 10 commits into
mainfrom
fix-karpenter-selectors
Apr 22, 2026
Merged

fix: reduce migration noise and fix regressions in Go helm step#267
stevenolen merged 10 commits into
mainfrom
fix-karpenter-selectors

Conversation

@stevenolen
Copy link
Copy Markdown
Contributor

@stevenolen stevenolen commented Apr 22, 2026

What

Fixes several regressions introduced during the Python-to-Go helm step migration (PR #262) that caused unnecessary resource churn or incorrect infrastructure:

1. EC2NodeClass subnet/security group selectors were always empty

The Go code read subnet_ids / security_group_ids from ptd.yaml, but no workload config has these fields. This produced empty subnetSelectorTerms/securityGroupSelectorTerms, which Karpenter's webhook rejected with a validation error.

Fix: call DescribeCluster to fetch the live EKS VPC config (subnet IDs, SG IDs, cluster security group ID). Also calls EC2 DescribeSecurityGroups to find FSX and EFS NFS SGs by name prefix, matching Python.

2. NodePool disruption fields defaulted to empty string

ConsolidationPolicy and ConsolidateAfter are plain string fields that default to "" when not specified in YAML. Karpenter rejects empty values.

Fix: apply the same defaults Python used — "WhenEmptyOrUnderutilized" and "5m".

3. session_taints: true was ignored

The Go code only included taints from the explicit taints: list, missing the synthesized workload-type: session NoSchedule taint that Python generated from session_taints: true.

Fix: synthesize the session taint when session_taints: true is set, matching Python.

4. Service account names lost their .posit.team suffix

Python's Roles enum appended .posit.team to all SA names (e.g., loki.posit.team, alloy.posit.team, aws-load-balancer-controller.posit.team). The Go migration dropped these suffixes, breaking IRSA trust policies.

Fix: restore .posit.team suffix on all six SA names (LBC, FSX controller, FSX node, Loki, Mimir, Alloy).

5. FSX role name was wrong

Python generated aws-fsx-openzfs-csi-driver.{name}.posit.team; Go was generating fsx-openzfs.{name}.posit.team, causing IAM role lookup failures.

Fix: use the correct prefix.

6. posit.team/managed-by label was missing from all Kubernetes resources

Python added {"posit.team/managed-by": "ptd.pulumi_resources.aws_workload_helm"} to every resource via required_tags. The Go migration dropped this label entirely.

Fix: add awsHelmChartCR wrapper (AWS-specific version of helmChartCR) that includes the label on HelmChart resources; add the label directly to all other resource types (Namespaces, ConfigMaps, Secrets, Ingresses, NodePools, EC2NodeClasses, PriorityClasses, Deployments).

7. public_load_balancer defaulted to false instead of true

Python's AWSWorkload dataclass defaults public_load_balancer=True, meaning internet-facing ALBs by default. Go's bool zero value is false, flipping the scheme to internal for workloads that don't explicitly set this field.

Fix: change to *bool; nil (unset) is treated as true to match Python.

8. Minor label and value details dropped

  • app: "traefik" label on Traefik Ingress resources (both single and per-site)
  • runAsUser: null in Alloy securityContext (Python always included this key)
  • ALB tag string ordering (posit.team/true-name, posit.team/environment, Name) matches Python

Changes

  • lib/aws/ec2.go (new): GetNFSSecurityGroupID — finds a security group by name prefix in a VPC
  • lib/aws/eks.go: GetClusterVPCConfig — fetches subnet IDs, SG IDs, and VPC ID from DescribeCluster
  • lib/steps/helm_aws.go: VPC config pre-fetch; disruption defaults; session taint synthesis; SA name fixes; FSX role name fix; awsHelmChartCR wrapper; label additions; public_load_balancer nil check; Traefik ingress app label; Alloy runAsUser: nil; tag ordering
  • lib/types/workload.go: PublicLoadBalancer changed from bool to *bool
  • lib/go.mod / cmd/go.mod: add github.com/aws/aws-sdk-go-v2/service/ec2 dependency

Test

Dry-run against four staging workloads (two AWS, one Azure) after all fixes:

Previously the helm step used subnet_ids and security_group_ids from
ptd.yaml (which no workload has), producing empty selector terms that
failed Karpenter validation. Now the step calls DescribeCluster to
fetch the live VPC config and also looks up FSX/EFS NFS security groups
via EC2 DescribeSecurityGroups, matching the Python code's behavior.

Also fixes two related bugs in the NodePool spec:
- consolidationPolicy/consolidateAfter defaulted to "" when not in YAML;
  now defaults to "WhenEmptyOrUnderutilized"/"5m" matching Python
- session_taints: true was ignored; now synthesizes the workload-type
  session taint as Python did
@claude
Copy link
Copy Markdown

claude Bot commented Apr 22, 2026

Claude finished @stevenolen's task —— View job


Review Complete

  • Read review guidelines
  • Read PR diff and changed files
  • Create pending review with inline comments
  • Submit review

Submitted a non-blocking review with 4 findings:

  1. Critical: Overprovisioning tolerations in awsHelmKarpenterOverprovisioning still use nodePool.Taints instead of the synthesized taint list — session taint pods will be unschedulable.
  2. Important: DescribeSecurityGroups in ec2.go is unpaginated; a server-side group-name filter would be safer.
  3. Suggestion: Rename GetNFSSecurityGroupIDGetSecurityGroupIDByNamePrefix for clarity.
  4. Suggestion: Three DescribeCluster callers in eks.go could share a helper (tech debt, not blocking).

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good bugfix PR — the three issues identified (empty subnet/SG selectors, empty disruption defaults, missing session taints) are real and the approach is sound. A few items to address:


Critical: Overprovisioning tolerations miss session taints

awsHelmKarpenterOverprovisioning (line ~1552) builds tolerations from nodePool.Taints, not the synthesized allTaints that includes the session taint. When session_taints: true is set, overprovisioning pods won't tolerate the workload-type: session NoSchedule taint, making them unschedulable on those Karpenter nodes.

Consider pre-computing the full taint list (including the synthesized session taint) before passing it to both awsHelmKarpenter and awsHelmKarpenterOverprovisioning, or duplicating the synthesis logic.


Important: DescribeSecurityGroups is unpaginated (lib/aws/ec2.go:20-27)

DescribeSecurityGroups paginates (default max 1000 results). This VPC likely has fewer SGs today, but it's fragile. Adding a server-side group-name filter eliminates both the pagination risk and the linear scan:

Filters: []ec2types.Filter{
    {
        Name:   aws.String("vpc-id"),
        Values: []string{vpcID},
    },
    {
        Name:   aws.String("group-name"),
        Values: []string{namePrefix + "*"},
    },
},

Suggestion: Function naming (lib/aws/ec2.go:14)

GetNFSSecurityGroupID is domain-specific ("NFS") but the implementation is a generic "find SG by name prefix" lookup. Consider GetSecurityGroupIDByNamePrefix — the caller conveys intent via the namePrefix argument.


Suggestion: DescribeCluster tech debt (lib/aws/eks.go:111)

GetClusterVPCConfig is the third function calling DescribeCluster (alongside GetClusterEndpoint and GetClusterInfo). Not blocking for this bugfix, but worth noting — these could share a single helper that returns the full response.

- Fix FSX role name: aws-fsx-openzfs-csi-driver.{name}.posit.team
- Fix SA names: add .posit.team suffix to LBC, FSX controller/node,
  Loki, Mimir, and Alloy service accounts (matching Python Roles enum)
- Add posit.team/managed-by label to all Kubernetes resources via
  awsHelmChartCR wrapper and direct metadata additions on non-HelmChart
  resources (namespaces, ConfigMaps, Secrets, NodePools, EC2NodeClasses,
  PriorityClasses, Deployments)
…Python

Python AWSWorkload dataclass defaults public_load_balancer=True. Go's
bool zero value (false) caused the ALB scheme to flip from internet-facing
to internal for workloads that don't explicitly set this field.

Change to *bool so nil (unset) is treated as true, matching Python behavior.
- Add app=traefik label to Traefik Ingress (was present in Python output)
- Match Python tag ordering: posit.team/true-name, posit.team/environment, Name
- Set runAsUser: nil in Alloy securityContext (Python always included this key)
@stevenolen stevenolen changed the title fix: fetch Karpenter subnet/SG IDs from EKS cluster VPC config fix: reduce migration noise and fix regressions in Go helm step Apr 22, 2026
- alloyClusterName now uses "default_{fqn}-control-plane" matching
  Python's eks_cluster_name() — prevents breaking existing Grafana queries
- Re-enable server-side apply on the k8s provider (was set in Python)
- Bump pulumi-kubernetes SDK from v4.21.1 to v4.25.0 to match the
  pulumi_kubernetes Python package version and avoid provider downgrade
@stevenolen stevenolen marked this pull request as ready for review April 22, 2026 15:40
@stevenolen stevenolen requested a review from a team as a code owner April 22, 2026 15:40
timtalbot
timtalbot previously approved these changes Apr 22, 2026
@stevenolen
Copy link
Copy Markdown
Contributor Author

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 22, 2026

Claude finished @stevenolen's task —— View job


Review Complete

  • Read PR diff and changed files
  • Check status of previous review findings
  • Look for new issues
  • Submit review

All 8 fixes look correct and well-structured. The awsHelmChartCR wrapper, VPC config pre-fetch, disruption defaults, session taint synthesis, SA name restoration, FSX role name fix, *bool for PublicLoadBalancer, and label/tag fixes all match the described Python behavior.

One outstanding issue from my previous review that is still present:

Overprovisioning tolerations still use nodePool.Taints instead of synthesized taint list (lib/steps/helm_aws.go:1603-1611)

In awsHelmKarpenterOverprovisioning, tolerations are built from nodePool.Taints:

tolerations := make(corev1.TolerationArray, 0, len(nodePool.Taints))
for _, taint := range nodePool.Taints {

But in awsHelmKarpenter (line 1444-1458), the session taint is synthesized into allTaints when session_taints: true. The overprovisioning function never sees this synthesized taint, so overprovisioning pods won't tolerate workload-type: session NoSchedule and will be unschedulable on those Karpenter nodes.

Fix would be to either: (a) replicate the session taint synthesis in awsHelmKarpenterOverprovisioning, or (b) pre-compute the full taint list per node pool and pass it through KarpenterConfig.

Everything else LGTM. The non-blocking suggestions from the first review (function naming, server-side SG filter, DescribeCluster dedup) are still valid as future tech debt items but not blocking.


timtalbot
timtalbot previously approved these changes Apr 22, 2026
@stevenolen stevenolen added this pull request to the merge queue Apr 22, 2026
@stevenolen stevenolen removed this pull request from the merge queue due to a manual request Apr 22, 2026
Python's eks_cluster_name() produces "default_{fqn}-control-plane" where
fqn = compoundName + "-" + release (e.g. ganso01-staging-20250328).
The Go migration was missing the release segment, causing Alloy to ship
metrics under the wrong cluster label and breaking Grafana dashboards.
@stevenolen stevenolen added this pull request to the merge queue Apr 22, 2026
Merged via the queue into main with commit 108cf40 Apr 22, 2026
8 checks passed
@stevenolen stevenolen deleted the fix-karpenter-selectors branch April 22, 2026 17:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants