fix: reduce migration noise and fix regressions in Go helm step#267
Conversation
Previously the helm step used subnet_ids and security_group_ids from ptd.yaml (which no workload has), producing empty selector terms that failed Karpenter validation. Now the step calls DescribeCluster to fetch the live VPC config and also looks up FSX/EFS NFS security groups via EC2 DescribeSecurityGroups, matching the Python code's behavior. Also fixes two related bugs in the NodePool spec: - consolidationPolicy/consolidateAfter defaulted to "" when not in YAML; now defaults to "WhenEmptyOrUnderutilized"/"5m" matching Python - session_taints: true was ignored; now synthesizes the workload-type session taint as Python did
|
Claude finished @stevenolen's task —— View job Review Complete
Submitted a non-blocking review with 4 findings:
|
There was a problem hiding this comment.
Good bugfix PR — the three issues identified (empty subnet/SG selectors, empty disruption defaults, missing session taints) are real and the approach is sound. A few items to address:
Critical: Overprovisioning tolerations miss session taints
awsHelmKarpenterOverprovisioning (line ~1552) builds tolerations from nodePool.Taints, not the synthesized allTaints that includes the session taint. When session_taints: true is set, overprovisioning pods won't tolerate the workload-type: session NoSchedule taint, making them unschedulable on those Karpenter nodes.
Consider pre-computing the full taint list (including the synthesized session taint) before passing it to both awsHelmKarpenter and awsHelmKarpenterOverprovisioning, or duplicating the synthesis logic.
Important: DescribeSecurityGroups is unpaginated (lib/aws/ec2.go:20-27)
DescribeSecurityGroups paginates (default max 1000 results). This VPC likely has fewer SGs today, but it's fragile. Adding a server-side group-name filter eliminates both the pagination risk and the linear scan:
Filters: []ec2types.Filter{
{
Name: aws.String("vpc-id"),
Values: []string{vpcID},
},
{
Name: aws.String("group-name"),
Values: []string{namePrefix + "*"},
},
},Suggestion: Function naming (lib/aws/ec2.go:14)
GetNFSSecurityGroupID is domain-specific ("NFS") but the implementation is a generic "find SG by name prefix" lookup. Consider GetSecurityGroupIDByNamePrefix — the caller conveys intent via the namePrefix argument.
Suggestion: DescribeCluster tech debt (lib/aws/eks.go:111)
GetClusterVPCConfig is the third function calling DescribeCluster (alongside GetClusterEndpoint and GetClusterInfo). Not blocking for this bugfix, but worth noting — these could share a single helper that returns the full response.
- Fix FSX role name: aws-fsx-openzfs-csi-driver.{name}.posit.team
- Fix SA names: add .posit.team suffix to LBC, FSX controller/node,
Loki, Mimir, and Alloy service accounts (matching Python Roles enum)
- Add posit.team/managed-by label to all Kubernetes resources via
awsHelmChartCR wrapper and direct metadata additions on non-HelmChart
resources (namespaces, ConfigMaps, Secrets, NodePools, EC2NodeClasses,
PriorityClasses, Deployments)
…Python Python AWSWorkload dataclass defaults public_load_balancer=True. Go's bool zero value (false) caused the ALB scheme to flip from internet-facing to internal for workloads that don't explicitly set this field. Change to *bool so nil (unset) is treated as true, matching Python behavior.
- Add app=traefik label to Traefik Ingress (was present in Python output) - Match Python tag ordering: posit.team/true-name, posit.team/environment, Name - Set runAsUser: nil in Alloy securityContext (Python always included this key)
- alloyClusterName now uses "default_{fqn}-control-plane" matching
Python's eks_cluster_name() — prevents breaking existing Grafana queries
- Re-enable server-side apply on the k8s provider (was set in Python)
- Bump pulumi-kubernetes SDK from v4.21.1 to v4.25.0 to match the
pulumi_kubernetes Python package version and avoid provider downgrade
|
@claude review |
|
Claude finished @stevenolen's task —— View job Review Complete
All 8 fixes look correct and well-structured. The One outstanding issue from my previous review that is still present: Overprovisioning tolerations still use In tolerations := make(corev1.TolerationArray, 0, len(nodePool.Taints))
for _, taint := range nodePool.Taints {But in Fix would be to either: (a) replicate the session taint synthesis in Everything else LGTM. The non-blocking suggestions from the first review (function naming, server-side SG filter, DescribeCluster dedup) are still valid as future tech debt items but not blocking. |
Python's eks_cluster_name() produces "default_{fqn}-control-plane" where
fqn = compoundName + "-" + release (e.g. ganso01-staging-20250328).
The Go migration was missing the release segment, causing Alloy to ship
metrics under the wrong cluster label and breaking Grafana dashboards.
What
Fixes several regressions introduced during the Python-to-Go helm step migration (PR #262) that caused unnecessary resource churn or incorrect infrastructure:
1. EC2NodeClass subnet/security group selectors were always empty
The Go code read
subnet_ids/security_group_idsfromptd.yaml, but no workload config has these fields. This produced emptysubnetSelectorTerms/securityGroupSelectorTerms, which Karpenter's webhook rejected with a validation error.Fix: call
DescribeClusterto fetch the live EKS VPC config (subnet IDs, SG IDs, cluster security group ID). Also calls EC2DescribeSecurityGroupsto find FSX and EFS NFS SGs by name prefix, matching Python.2. NodePool disruption fields defaulted to empty string
ConsolidationPolicyandConsolidateAfterare plainstringfields that default to""when not specified in YAML. Karpenter rejects empty values.Fix: apply the same defaults Python used —
"WhenEmptyOrUnderutilized"and"5m".3.
session_taints: truewas ignoredThe Go code only included taints from the explicit
taints:list, missing the synthesizedworkload-type: session NoScheduletaint that Python generated fromsession_taints: true.Fix: synthesize the session taint when
session_taints: trueis set, matching Python.4. Service account names lost their
.posit.teamsuffixPython's
Rolesenum appended.posit.teamto all SA names (e.g.,loki.posit.team,alloy.posit.team,aws-load-balancer-controller.posit.team). The Go migration dropped these suffixes, breaking IRSA trust policies.Fix: restore
.posit.teamsuffix on all six SA names (LBC, FSX controller, FSX node, Loki, Mimir, Alloy).5. FSX role name was wrong
Python generated
aws-fsx-openzfs-csi-driver.{name}.posit.team; Go was generatingfsx-openzfs.{name}.posit.team, causing IAM role lookup failures.Fix: use the correct prefix.
6.
posit.team/managed-bylabel was missing from all Kubernetes resourcesPython added
{"posit.team/managed-by": "ptd.pulumi_resources.aws_workload_helm"}to every resource viarequired_tags. The Go migration dropped this label entirely.Fix: add
awsHelmChartCRwrapper (AWS-specific version ofhelmChartCR) that includes the label on HelmChart resources; add the label directly to all other resource types (Namespaces, ConfigMaps, Secrets, Ingresses, NodePools, EC2NodeClasses, PriorityClasses, Deployments).7.
public_load_balancerdefaulted to false instead of truePython's
AWSWorkloaddataclass defaultspublic_load_balancer=True, meaning internet-facing ALBs by default. Go'sboolzero value isfalse, flipping the scheme tointernalfor workloads that don't explicitly set this field.Fix: change to
*bool; nil (unset) is treated astrueto match Python.8. Minor label and value details dropped
app: "traefik"label on Traefik Ingress resources (both single and per-site)runAsUser: nullin AlloysecurityContext(Python always included this key)posit.team/true-name,posit.team/environment,Name) matches PythonChanges
lib/aws/ec2.go(new):GetNFSSecurityGroupID— finds a security group by name prefix in a VPClib/aws/eks.go:GetClusterVPCConfig— fetches subnet IDs, SG IDs, and VPC ID fromDescribeClusterlib/steps/helm_aws.go: VPC config pre-fetch; disruption defaults; session taint synthesis; SA name fixes; FSX role name fix;awsHelmChartCRwrapper; label additions;public_load_balancernil check; Traefik ingressapplabel; AlloyrunAsUser: nil; tag orderinglib/types/workload.go:PublicLoadBalancerchanged fromboolto*boollib/go.mod/cmd/go.mod: addgithub.com/aws/aws-sdk-go-v2/service/ec2dependencyTest
Dry-run against four staging workloads (two AWS, one Azure) after all fixes: