feat(ecs_service): native ECS deployment strategies (rolling/blue_green/linear/canary)#5
Conversation
…en/linear/canary) Replace the CodeDeploy/external-controller blue/green wiring with the ECS deployment controller's built-in strategies: - deployment_type now accepts rolling | blue_green | linear | canary; the deployment controller is always ECS (CODE_DEPLOY mapping removed) - new deployment_strategy_config seeds bake time + canary/linear tuning at create time; deployment_configuration is in ignore_changes because the Flightcontrol deploy manager passes the authoritative config (including pause lifecycle hooks) on every UpdateService call - native strategies wire load_balancer.advanced_configuration: alternate target group (tg-2), production listener rule (first ALB rule or the NLB listener), optional test_listener_rule_arn, and a new ECS infrastructure role (AmazonECSInfrastructureRolePolicyForLoadBalancers) - target groups tg-1/tg-2 are now production/alternate and gate on any native traffic-shift strategy, not just blue_green; outputs renamed accordingly (production_/alternate_target_group_*) and ecs_infrastructure_role_arn added - provider floor bumped to aws >= 6.21 (linear/canary configuration) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Always provision the production/alternate target-group pair, the ECS infrastructure role, and the service's load_balancer advanced_configuration whenever a load balancer is attached — not just for native traffic-shift deployment_types. Rolling deployments serve from the production target group only, so any service can now switch between rolling / blue_green / linear / canary on a single UpdateService call with zero Terraform changes; deployment_type is purely the create-time seed. Also fixes the module test suite: mock the iam_policy_document / partition / region / vpc data sources and computed ARNs so all 11 runs execute (basic_service was failing on main and skipping the rest). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| production_listener_rule = ( | ||
| local.enable_nlb_listener | ||
| ? aws_lb_listener.nlb[0].arn | ||
| : aws_lb_listener_rule.alb["0"].arn | ||
| ) |
There was a problem hiding this comment.
Only the first ALB listener rule participates in traffic shifting
production_listener_rule is hardcoded to alb["0"].arn. The AWS ECS advanced_configuration API only accepts a single production listener rule ARN, so during blue_green/linear/canary deployments ECS rewrites only this one rule. Any additional listener rules (alb["1"], alb["2"], …) continue forwarding to tg_1 for the entire deployment — traffic on those rules never shifts to the new revision. Consider adding a validation block that requires length(var.load_balancer_attachment.listener_rules) <= 1 when using a native traffic-shift strategy, or document this constraint prominently so callers with multi-rule services aren't caught off guard.
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ecs_service.tf
Line: 102-106
Comment:
**Only the first ALB listener rule participates in traffic shifting**
`production_listener_rule` is hardcoded to `alb["0"].arn`. The AWS ECS `advanced_configuration` API only accepts a single production listener rule ARN, so during blue_green/linear/canary deployments ECS rewrites only this one rule. Any additional listener rules (`alb["1"]`, `alb["2"]`, …) continue forwarding to tg_1 for the entire deployment — traffic on those rules never shifts to the new revision. Consider adding a validation block that requires `length(var.load_balancer_attachment.listener_rules) <= 1` when using a native traffic-shift strategy, or document this constraint prominently so callers with multi-rule services aren't caught off guard.
How can I resolve this? If you propose a fix, please make it concise.| dynamic "load_balancer" { | ||
| for_each = local.enable_load_balancer && var.deployment_type == "blue_green" ? [1] : [] | ||
| for_each = local.enable_load_balancer ? [1] : [] | ||
| content { | ||
| target_group_arn = aws_lb_target_group.tg_1[0].arn | ||
| container_name = local.lb_container_name | ||
| container_port = local.lb_container_port | ||
|
|
||
| advanced_configuration { | ||
| alternate_target_group_arn = aws_lb_target_group.tg_2[0].arn | ||
| production_listener_rule = ( | ||
| local.enable_nlb_listener | ||
| ? aws_lb_listener.nlb[0].arn | ||
| : aws_lb_listener_rule.alb["0"].arn | ||
| ) | ||
| test_listener_rule = var.test_listener_rule_arn | ||
| role_arn = aws_iam_role.ecs_infrastructure[0].arn | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
advanced_configuration wired unconditionally for rolling deployments — not validated against real AWS API
The whole design bet is that AWS silently accepts advanced_configuration (alternate TG, production listener rule, infrastructure role) when deployment_configuration is absent (the rolling default). All tests run with command = plan against mocked providers, so the AWS API's acceptance of this combination has not been exercised. If AWS rejects advanced_configuration on a CreateService call that carries no deployment_configuration, every rolling service with a load balancer (the module default) would fail to provision. A real tofu apply against AWS — even in a scratch account — should be added to the test suite to confirm this assumption before merging.
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ecs_service.tf
Line: 93-111
Comment:
**`advanced_configuration` wired unconditionally for rolling deployments — not validated against real AWS API**
The whole design bet is that AWS silently accepts `advanced_configuration` (alternate TG, production listener rule, infrastructure role) when `deployment_configuration` is absent (the rolling default). All tests run with `command = plan` against mocked providers, so the AWS API's acceptance of this combination has not been exercised. If AWS rejects `advanced_configuration` on a `CreateService` call that carries no `deployment_configuration`, every rolling service with a load balancer (the module default) would fail to provision. A real `tofu apply` against AWS — even in a scratch account — should be added to the test suite to confirm this assumption before merging.
How can I resolve this? If you propose a fix, please make it concise.| output "target_group_arn" { | ||
| description = "The ARN of the target group (null if load balancer disabled or blue/green deployment)." | ||
| value = local.enable_load_balancer && var.deployment_type == "rolling" ? aws_lb_target_group.this[0].arn : null | ||
| description = "The ARN of the production target group the service serves from (alias of production_target_group_arn; null if load balancer disabled)." | ||
| value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].arn : null | ||
| } | ||
|
|
||
| output "target_group_arn_suffix" { | ||
| description = "The ARN suffix of the target group for CloudWatch metrics." | ||
| value = local.enable_load_balancer && var.deployment_type == "rolling" ? aws_lb_target_group.this[0].arn_suffix : null | ||
| description = "The ARN suffix of the production target group for CloudWatch metrics." | ||
| value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].arn_suffix : null | ||
| } |
There was a problem hiding this comment.
target_group_name dropped without a backward-compatible alias
target_group_arn was deliberately kept as an alias for production_target_group_arn, but target_group_name — the natural companion — was removed entirely. Callers that reference module.xxx.target_group_name (e.g., in CloudWatch dashboards or external scripts) will get a Terraform error on upgrade. This is not listed as a named breaking change in the PR description. Adding output "target_group_name" { value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].name : null } would match the alias pattern used for the ARN.
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/outputs.tf
Line: 96-104
Comment:
**`target_group_name` dropped without a backward-compatible alias**
`target_group_arn` was deliberately kept as an alias for `production_target_group_arn`, but `target_group_name` — the natural companion — was removed entirely. Callers that reference `module.xxx.target_group_name` (e.g., in CloudWatch dashboards or external scripts) will get a Terraform error on upgrade. This is not listed as a named breaking change in the PR description. Adding `output "target_group_name" { value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].name : null }` would match the alias pattern used for the ARN.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
- Reject >1 ALB listener rule for native traffic-shift strategies: advanced_configuration only rewrites a single production listener rule, so extra rules would keep serving the old revision for the whole deployment. Documented the constraint on listener_rules since the strategy is a per-deployment decision the deploy manager can change without Terraform. - Restore the target_group_name output as an alias of production_target_group_name, matching the target_group_arn alias. - Make TestEcsServiceWithAlb assert the deployed rolling service carries advanced_configuration (alternate TG, production listener rule, infrastructure role), validating that the real AWS API accepts it on CreateService without a deployment_configuration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ravion Module Publish PlanDry run only. No Ravion API mutations were made.
Diffsrvn-ecs-cluster 0.1.3 -> 0.2.0--- remote
+++ compiled
- **Public and private Network Load Balancers** for TCP/UDP and static IP use cases
- **CloudWatch Container Insights** dashboard metrics for production visibility
- Terraform source: [flightcontrolhq/modules/compute/ecs_cluster](https://github.com/flightcontrolhq/modules/tree/rvn-ecs-cluster@0.1.3/compute/ecs_cluster)
+ Terraform source: [flightcontrolhq/modules/compute/ecs_cluster](https://github.com/flightcontrolhq/modules/tree/rvn-ecs-cluster@0.2.0/compute/ecs_cluster)
## Use cases
@@
base_path: compute/ecs_cluster
branch: main
execution_environment_id: << module.input.execution_environment_id >>
- ref: rvn-ecs-cluster@0.1.3
+ ref: rvn-ecs-cluster@0.2.0
repo: https://github.com/flightcontrolhq/modules
stack_id: <<stack.id>>
terraform_variables:rvn-ecs-web 0.5.1 -> 0.6.0--- remote
+++ compiled
queue_overflow: oldest
queue_size: 1
infrastructure:
+ ecs_alternate_target_group_arn: <<stack.output.alternate_target_group_arn>>
ecs_cluster_arn: <<stack.output.service_cluster>>
+ ecs_infrastructure_role_arn: <<stack.output.ecs_infrastructure_role_arn>>
+ ecs_production_listener_rule_arn: <<stack.output.production_listener_rule_arn>>
ecs_service_arns:
- <<stack.output.service_arn>>
+ ecs_target_group_arn: <<stack.output.target_group_arn>>
+ ecs_test_listener_rule_arn: <<stack.output.test_listener_rule_arn>>
inputs:
- description: Pass the image tag or digest to deploy. For Nixpacks or Dockerfile builds, this is resolved in the Ravion-created ECR repository. For Prebuilt image from registry mode, this is resolved in the repository configured on the module. Do not pass a full image URI.
id: image_ref
@@
type: string
post_deploy: '<< module.input.post_deploy_enabled && len(module.input.post_deploy_command) > 0 ? {"container_overrides": [{"name": stack.output.container_name, "command": module.input.post_deploy_command, "environment": module.input.post_deploy_environment_variables, "cpu": module.input.post_deploy_cpu || nil, "memory": module.input.post_deploy_memory || nil}], "cpu": string(module.input.post_deploy_cpu || (module.input.capacity_provider == "ec2" ? int(float(module.input.task_cpu) * 1024) : int(float(module.input.fargate_size.vcpu) * 1024))), "memory": string(module.input.post_deploy_memory || (module.input.capacity_provider == "ec2" ? int(float(module.input.task_memory) * 1024) : int(float(module.input.fargate_size.memory_gb) * 1024))), "ephemeral_storage": (module.input.post_deploy_ephemeral_storage_size_gib ? {size_in_gib: module.input.post_deploy_ephemeral_storage_size_gib} : nil), "task_role_arn": stack.output.task_role_arn, "execution_role_arn": stack.output.execution_role_arn, "capacity_provider_strategy": ((module.input.capacity_provider == "fargate" || module.input.additional_fargate_capacity_enabled ? [{capacity_provider: module.input.fargate_capacity_provider_name, weight: 1, base: 0}] : []) | concat(module.input.capacity_provider == "fargate_spot" || module.input.additional_fargate_spot_capacity_enabled ? [{capacity_provider: module.input.fargate_spot_capacity_provider_name, weight: 1, base: 0}] : []) | concat(module.input.capacity_provider == "ec2" || module.input.additional_ec2_capacity_enabled ? [{capacity_provider: module.input.ec2_capacity_provider_name, weight: 1, base: 0}] : [])), "network_configuration": {"awsvpc_configuration": {"subnets": (module.input.private_subnet_placement_enabled ? module.input.private_subnet_ids : module.input.public_subnet_ids), "security_groups": ([stack.output.security_group_id] | concat(module.input.security_group_ids != nil ? module.input.security_group_ids : [])), "assign_public_ip": (module.input.private_subnet_placement_enabled ? "DISABLED" : "ENABLED")}}, "enable_execute_command": module.input.execute_command_enabled, "timeout": module.input.post_deploy_timeout} : nil >>'
pre_deploy: '<< module.input.pre_deploy_enabled && len(module.input.pre_deploy_command) > 0 ? {"container_overrides": [{"name": stack.output.container_name, "command": module.input.pre_deploy_command, "environment": module.input.pre_deploy_environment_variables, "cpu": module.input.pre_deploy_cpu || nil, "memory": module.input.pre_deploy_memory || nil}], "cpu": string(module.input.pre_deploy_cpu || (module.input.capacity_provider == "ec2" ? int(float(module.input.task_cpu) * 1024) : int(float(module.input.fargate_size.vcpu) * 1024))), "memory": string(module.input.pre_deploy_memory || (module.input.capacity_provider == "ec2" ? int(float(module.input.task_memory) * 1024) : int(float(module.input.fargate_size.memory_gb) * 1024))), "ephemeral_storage": (module.input.pre_deploy_ephemeral_storage_size_gib ? {size_in_gib: module.input.pre_deploy_ephemeral_storage_size_gib} : nil), "task_role_arn": stack.output.task_role_arn, "execution_role_arn": stack.output.execution_role_arn, "capacity_provider_strategy": ((module.input.capacity_provider == "fargate" || module.input.additional_fargate_capacity_enabled ? [{capacity_provider: module.input.fargate_capacity_provider_name, weight: 1, base: 0}] : []) | concat(module.input.capacity_provider == "fargate_spot" || module.input.additional_fargate_spot_capacity_enabled ? [{capacity_provider: module.input.fargate_spot_capacity_provider_name, weight: 1, base: 0}] : []) | concat(module.input.capacity_provider == "ec2" || module.input.additional_ec2_capacity_enabled ? [{capacity_provider: module.input.ec2_capacity_provider_name, weight: 1, base: 0}] : [])), "network_configuration": {"awsvpc_configuration": {"subnets": (module.input.private_subnet_placement_enabled ? module.input.private_subnet_ids : module.input.public_subnet_ids), "security_groups": ([stack.output.security_group_id] | concat(module.input.security_group_ids != nil ? module.input.security_group_ids : [])), "assign_public_ip": (module.input.private_subnet_placement_enabled ? "DISABLED" : "ENABLED")}}, "enable_execute_command": module.input.execute_command_enabled, "timeout": module.input.pre_deploy_timeout} : nil >>'
+ strategy: |
+ <<
+ module.input.deployment_strategy == "rolling" ? nil :
+ module.input.deployment_strategy == "blue_green" ? {
+ "type": "blue_green",
+ "bake_time_in_minutes": module.input.deployment_bake_time_in_minutes,
+ "pause_stages": module.input.deployment_pause_stages
+ } :
+ module.input.deployment_strategy == "linear" ? {
+ "type": "linear",
+ "bake_time_in_minutes": module.input.deployment_bake_time_in_minutes,
+ "pause_stages": module.input.deployment_pause_stages,
+ "linear": {
+ "step_percentage": module.input.linear_step_percentage,
+ "step_bake_time_in_minutes": module.input.linear_step_bake_time_in_minutes
+ }
+ } : {
+ "type": "canary",
+ "bake_time_in_minutes": module.input.deployment_bake_time_in_minutes,
+ "pause_stages": module.input.deployment_pause_stages,
+ "canary": {
+ "canary_percent": module.input.canary_percent,
+ "canary_bake_time_in_minutes": module.input.canary_bake_time_in_minutes
+ }
+ }
+ >>
task_definition:
container_definitions: "<< (module.input.firelens_enabled ? [{\"command\": (len(module.input.prebuilt_image_start_command) > 0 ? module.input.prebuilt_image_start_command : nil), \"cpu\": (module.input.capacity_provider == \"ec2\" ? int(float(module.input.task_cpu) * 1024) : int(float(module.input.fargate_size.vcpu) * 1024)), \"depends_on\": [{\"container_name\": \"log_router\", \"condition\": \"START\"}], \"environment\": ([{name: \"PORT\", value: string(module.input.container_port)}] | concat(module.input.environment_variables != nil ? module.input.environment_variables : [])), \"essential\": true, \"image\": (module.input.build_type == \"prebuilt_image\" ? (deploy.input.image_ref contains \"sha256:\" ? module.input.image_repository + \"@\" + deploy.input.image_ref : module.input.image_repository + \":\" + deploy.input.image_ref) : (deploy.input.image_ref contains \"sha256:\" ? stack.output.ecr_repository_url + \"@\" + deploy.input.image_ref : stack.output.ecr_repository_url + \":\" + deploy.input.image_ref)), \"linux_parameters\": {\"init_process_enabled\": true}, \"log_configuration\": {\"log_driver\": \"awsfirelens\"}, \"memory\": (module.input.capacity_provider == \"ec2\" ? int(float(module.input.task_memory) * 1024) : int(float(module.input.fargate_size.memory_gb) * 1024)), \"name\": (stack.output.container_name), \"port_mappings\": [{\"app_protocol\": \"http\", \"container_port\": (module.input.container_port), \"protocol\": \"tcp\"}], \"readonly_root_filesystem\": false, \"repository_credentials\": (module.input.image_registry_credentials_secret_arn ? {credentials_parameter: module.input.image_registry_credentials_secret_arn} : nil), \"secrets\": (module.input.secrets), \"stop_timeout\": 30}, {\"name\": \"log_router\", \"image\": module.input.firelens_image, \"cpu\": 0, \"memory_reservation\": 51, \"essential\": true, \"environment\": ([{name: \"FIRELENS_CONFIG_CONTENT\", value: \"[SERVICE]\\n Flush 1\\n Grace 30\\n\\n\" + (module.input.firelens_config ? module.input.firelens_config + \"\\n\" : \"\") + (module.input.firelens_cloudwatch_output_enabled ? \"\\n[OUTPUT]\\n Name cloudwatch\\n Match *\\n region \" + stack.output.region + \"\\n log_group_name \" + stack.output.log_group_name + \"\\n auto_create_group true\\n log_stream_prefix \" + stack.output.log_stream_prefix + \"/\\n retry_limit 2\\n log_key log\\n log_format json/emf\\n\" : \"\")}] | concat(module.input.firelens_environment_variables != nil ? module.input.firelens_environment_variables : [])), \"secrets\": module.input.firelens_secrets != nil ? module.input.firelens_secrets : [], \"command\": [\"/bin/sh\", \"-c\", \"printf '%s' \\\"$FIRELENS_CONFIG_CONTENT\\\" > /flightcontrol-firelens.conf && exec /entrypoint.sh\"], \"user\": \"0\", \"log_configuration\": {\"log_driver\": \"awslogs\", \"options\": {\"awslogs-group\": stack.output.log_group_name, \"awslogs-region\": stack.output.region, \"awslogs-stream-prefix\": stack.output.log_stream_prefix + \"/firelens\"}}, \"firelens_configuration\": {\"type\": \"fluentbit\", \"options\": {\"config-file-type\": \"file\", \"config-file-value\": \"/flightcontrol-firelens.conf\", \"enable-ecs-log-metadata\": string(module.input.firelens_ecs_metadata_enabled)}}}] : [{\"command\": (len(module.input.prebuilt_image_start_command) > 0 ? module.input.prebuilt_image_start_command : nil), \"cpu\": (module.input.capacity_provider == \"ec2\" ? int(float(module.input.task_cpu) * 1024) : int(float(module.input.fargate_size.vcpu) * 1024)), \"environment\": ([{name: \"PORT\", value: string(module.input.container_port)}] | concat(module.input.environment_variables != nil ? module.input.environment_variables : [])), \"essential\": true, \"image\": (module.input.build_type == \"prebuilt_image\" ? (deploy.input.image_ref contains \"sha256:\" ? module.input.image_repository + \"@\" + deploy.input.image_ref : module.input.image_repository + \":\" + deploy.input.image_ref) : (deploy.input.image_ref contains \"sha256:\" ? stack.output.ecr_repository_url + \"@\" + deploy.input.image_ref : stack.output.ecr_repository_url + \":\" + deploy.input.image_ref)), \"linux_parameters\": {\"init_process_enabled\": true}, \"log_configuration\": {\"log_driver\": \"awslogs\", \"options\": {\"awslogs-group\": stack.output.log_group_name, \"awslogs-region\": stack.output.region, \"awslogs-stream-prefix\": stack.output.log_stream_prefix}}, \"memory\": (module.input.capacity_provider == \"ec2\" ? int(float(module.input.task_memory) * 1024) : int(float(module.input.fargate_size.memory_gb) * 1024)), \"name\": (stack.output.container_name), \"port_mappings\": [{\"app_protocol\": \"http\", \"container_port\": (module.input.container_port), \"protocol\": \"tcp\"}], \"readonly_root_filesystem\": false, \"repository_credentials\": (module.input.image_registry_credentials_secret_arn ? {credentials_parameter: module.input.image_registry_credentials_secret_arn} : nil), \"secrets\": (module.input.secrets), \"stop_timeout\": 30}]) | concat(module.input.sidecars ? module.input.sidecars : []) >>"
cpu: '<< string(module.input.capacity_provider == "ec2" ? int(float(module.input.task_cpu) * 1024) : int(float(module.input.fargate_size.vcpu) * 1024)) >>'
@@
show_when:
build_type: nixpacks
type: string_array
+ - id: section_deployment
+ label: Deployment
+ type: section
+ - default: rolling
+ description: Choose how traffic moves from the current deployment to the new deployment.
+ id: deployment_strategy
+ label: Deployment strategy
+ required: true
+ type: string
+ values:
+ - description: Replace tasks in place while the service
... diff truncated ... |
AWS rejects explicit from_port/to_port when ip_protocol is "-1", which broke updates to existing rules (the sg-module refactor changed the ECS instance ingress rule from -1/-1 to 0/0). Null the ports in the shared security-groups module whenever the protocol is "-1" or "all", and use -1 in the ecs_cluster caller for clarity. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
# Conflicts: # compute/ecs_cluster/README.md # compute/ecs_cluster/locals.tf # compute/ecs_cluster/rvn-ecs-cluster-definition.yml # compute/ecs_service/README.md # compute/ecs_service/rvn-ecs-web-definition.yml
777e8bd to
bd9ac38
Compare
…ner rule Make the green (test) ALB listener rule's distinguishing condition configurable between an HTTP header and a query string via test_traffic_condition_type. Default to the query string __x-rvn-test__=1 so test traffic reaches the green target group without a custom header. ALB AND-combines conditions and ECS native blue/green wires exactly one test rule, so the selector is one type per service, not simultaneous header-OR-query matching.
|
@greptile |
| var.load_balancer_attachment.listener_rules[0].priority, | ||
| local.green_default_production_priority, | ||
| ) : null | ||
| green_test_priority = local.green_alb_listener_rule_enabled ? local.green_production_priority - 1 : null |
There was a problem hiding this comment.
Test-rule priority becomes 0 when production priority is 1
green_test_priority = local.green_production_priority - 1 produces 0 when a caller explicitly sets priority = 1. AWS ALB rejects listener rule priorities outside the range 1–50000, so the apply would fail with an API error (ValidationError: The number must be between 1 and 50000). There is no variable validation that prevents listener_rules[0].priority = 1, making this a real failure path for services that want the highest-priority rule. A guard is needed — either validate that listener_rules[0].priority >= 2 when green_alb_listener_rule_enabled is true, or clamp green_test_priority to a minimum of 1.
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/locals.tf
Line: 72
Comment:
**Test-rule priority becomes 0 when production priority is 1**
`green_test_priority = local.green_production_priority - 1` produces `0` when a caller explicitly sets `priority = 1`. AWS ALB rejects listener rule priorities outside the range 1–50000, so the apply would fail with an API error (`ValidationError: The number must be between 1 and 50000`). There is no variable validation that prevents `listener_rules[0].priority = 1`, making this a real failure path for services that want the highest-priority rule. A guard is needed — either validate that `listener_rules[0].priority >= 2` when `green_alb_listener_rule_enabled` is true, or clamp `green_test_priority` to a minimum of 1.
How can I resolve this? If you propose a fix, please make it concise.# Conflicts: # compute/ecs_service/rvn-ecs-web-definition.yml
|
@greptile |
| from = aws_lb_target_group.this[0] | ||
| to = aws_lb_target_group.tg_1[0] | ||
| } |
There was a problem hiding this comment.
Rolling-service upgrade will fail:
moved + name change + ignore_changes = [action] deadlock
For any existing rolling service with a load balancer, upgrading to this module version triggers a tofu apply failure:
- The
movedblock transfers the state address ofaws_lb_target_group.this[0](name suffix-tg, max-prefix 28 chars) toaws_lb_target_group.tg_1[0](suffix-tg-1, max-prefix 24 chars). Every service gets a differentname, so Terraform plans a replacement. - With
create_before_destroy = true, the new TG is created first (new ARN). However,ignore_changes = [action]onaws_lb_listener_rule.albsuppresses the ARN reference update — the listener rule keeps pointing to the old TG ARN. - Terraform then attempts to destroy the old TG. AWS rejects this: "Target group is currently in use by a listener rule", and the
applyfails.
The apply exits mid-way: the new TG exists but is unreachable, and the old TG is stuck — referenced by the listener rule but orphaned from Terraform state. Manual cleanup is required. This affects every load-balanced rolling service; the PR description mentions "target-group recreation on upgrade" but not that the apply itself fails.
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/moved.tf
Line: 6-8
Comment:
**Rolling-service upgrade will fail: `moved` + name change + `ignore_changes = [action]` deadlock**
For any existing rolling service with a load balancer, upgrading to this module version triggers a `tofu apply` failure:
1. The `moved` block transfers the state address of `aws_lb_target_group.this[0]` (name suffix `-tg`, max-prefix 28 chars) to `aws_lb_target_group.tg_1[0]` (suffix `-tg-1`, max-prefix 24 chars). Every service gets a different `name`, so Terraform plans a replacement.
2. With `create_before_destroy = true`, the new TG is created first (new ARN). However, `ignore_changes = [action]` on `aws_lb_listener_rule.alb` suppresses the ARN reference update — the listener rule keeps pointing to the old TG ARN.
3. Terraform then attempts to destroy the old TG. AWS rejects this: "Target group is currently in use by a listener rule", and the `apply` fails.
The apply exits mid-way: the new TG exists but is unreachable, and the old TG is stuck — referenced by the listener rule but orphaned from Terraform state. Manual cleanup is required. This affects every load-balanced rolling service; the PR description mentions "target-group recreation on upgrade" but not that the apply itself fails.
How can I resolve this? If you propose a fix, please make it concise.Re-adopting a pre-existing target group via the moved block changed the name from -tg to -tg-1, forcing replacement. With ignore_changes=[action] on the listener rule, the rule never repointed to the new TG and the old TG's destroy failed (in use by listener rule), breaking apply mid-way. Ignore changes to name so migrated TGs keep their existing name/ARN in place; fresh services still get the -tg-1 name.
# Conflicts: # compute/ecs_cluster/ec2.tf # compute/ecs_cluster/rvn-ecs-cluster-definition.yml
|
@greptile |
| base_path: compute/ecs_cluster | ||
| terraform_variables: | ||
| ...overrides: << module.input.advanced_terraform_variables >> | ||
| capacity_provider_default: << module.input.capacity_provider_default >> |
There was a problem hiding this comment.
i don't see input added for this, but instead of that just derive from the enabled provider flags? defaulting to something automatically?
| values: | ||
| - label: Rolling | ||
| value: rolling | ||
| - label: Blue/green | ||
| value: blue_green | ||
| - label: Linear | ||
| value: linear | ||
| - label: Canary | ||
| value: canary | ||
| - id: deployment_bake_time_in_minutes |
| - id: test_header_name | ||
| label: Test traffic header name | ||
| type: string | ||
| description: HTTP header that routes a request to the green (test) target group during a deployment. Send this header with the value below to reach the new revision before production traffic shifts. Requests without it keep hitting production. | ||
| collapsible: true | ||
| required: true | ||
| default: X-Ravion-Test | ||
| - id: test_header_value | ||
| label: Test traffic header value | ||
| type: string | ||
| description: Value the test traffic header must carry to be routed to the green (test) target group. | ||
| collapsible: true | ||
| required: true | ||
| default: "1" |
There was a problem hiding this comment.
and can we add query string also?
There was a problem hiding this comment.
actually query string is added as default, I am going to remove this input altogether.
There was a problem hiding this comment.
update readme with instructions on how to access the standby environment
| ## Standby validation traffic | ||
|
|
||
| For Blue/green, Linear, and Canary deployments, Ravion creates a test listener rule on the same Application Load Balancer listener as the production rule. During the test traffic stage, requests that match the service's normal Domain host rules and Path rules and include the standby selector route to the standby, or green, task set on the alternate target group. | ||
|
|
||
| By default, the standby selector is the query parameter `__x-rvn-test__=1`. For example, if production traffic uses `https://app.example.com/health`, validate the standby service with: | ||
|
|
||
| ```text | ||
| https://app.example.com/health?__x-rvn-test__=1 | ||
| ``` | ||
|
|
||
| The alternate target group only has registered targets while ECS is running a traffic-shift deployment. Outside that window, the standby route may have no healthy targets. | ||
|
|
||
| Use Advanced Terraform variables to override the standby selector. Values in Advanced Terraform variables override the generated Terraform variables for the service. |
There was a problem hiding this comment.
need to add that it is sticky with cookie, so once first request goes to standy you are stuck on that uless you clear a cookie
lets document the cookie
and how can you tell if a response is from prod or standby?
Summary
Replaces the dual-service / CodeDeploy-era blue/green design with the native ECS deployment controller strategies:
rolling,blue_green,linear, andcanary— and makes the strategy a per-deployment decision that requires no Terraform changes to switch.Native strategies (b4ee71a)
deployment_controller.type = "ECS"always — the CodeDeploy/EXTERNAL mapping is removeddeployment_typevalueslinear/canaryalongsiderolling/blue_greendeployment_strategy_configvariable (bake time, canary %, linear step %) that seedsdeployment_configurationat create timetest_listener_rule_arnvariable for blue/green test-traffic validation (TEST_TRAFFIC_SHIFTstages)AmazonECSInfrastructureRolePolicyForLoadBalancers— ECS assumes it to rewrite listener rules and (de)register targets during traffic shiftsload_balancer.advanced_configurationwiring (production/alternate target group, production + optional test listener rule, role)deployment_configurationadded toignore_changes: the Flightcontrol deploy manager passes the authoritative configuration (strategy, bake times, pause lifecycle hooks) on everyUpdateServicecall, so Terraform must not fight itlinear_configuration/canary_configuration)Strategy is per-deployment (3c72a84)
Whenever a load balancer is attached, the module now always provisions:
tg-1) + alternate (tg-2) target-group pair (the rolling-onlyaws_lb_target_group.thisis removed)advanced_configurationRolling deployments serve from the production target group only and never touch the alternate, so any service can switch between all four strategies on a single
UpdateServicecall.deployment_typeis purely the create-time seed.Trade-off: every load-balanced rolling service carries one idle target group and one unused IAM role.
Outputs
production_target_group_arn/alternate_target_group_arn(+ names),ecs_infrastructure_role_arntarget_group_arnretained as an alias of the production target groupTests
Fixed the module test suite so it actually runs:
basic_servicewas failing onmain(mock provider emits invalid JSON policies / ARNs), which skipped every other run. Addedmock_data/mock_resourcedefaults and fixed two stale assertions.tofu validate✅ —tofu test✅ 11/11 passedBreaking changes
deployment_controlleris alwaysECS; services created with the old CodeDeploy blue/green mapping are not migrated (pre-GA, clean removal)aws_lb_target_group.thisremoved — existing load-balanced services will see target-group recreation on upgrade🤖 Generated with Claude Code
Greptile Summary
This PR replaces the CodeDeploy-era blue/green dual-service design with native ECS deployment controller strategies (
rolling,blue_green,linear,canary), and restructures the ECS cluster's capacity-provider default strategy to comply with the AWS restriction on mixing Fargate and EC2 families.ecs_servicemodule: Always usesdeployment_controller.type = \"ECS\"; always creates a production + alternate target-group pair and an ECS infrastructure IAM role when a load balancer is attached, so any service can switch between all four strategies on a singleUpdateServicecall with no Terraform changes; seedsdeployment_configurationat create time and places it inignore_changesso the Flightcontrol deploy manager owns it per-deployment.ecs_clustermodule: Newcapacity_provider_defaultvariable commits the cluster's default strategy to a single provider family (EC2, Fargate, or Fargate Spot), preventing the AWS API rejection that previously occurred when both Fargate and EC2 providers were added to one strategy.advanced_configurationon a rolling service.Confidence Score: 4/5
Safe to merge after fixing two logic bugs: the test_listener_rule_arn variable is silently overridden by the module-created test rule when both are provided, and the YAML definition wires header-name/value UI inputs that can never take effect because test_traffic_condition_type is not wired.
The core redesign is coherent and well-tested. Two defects stand out: (1) in locals.tf, local.green_alb_listener_rule_enabled takes precedence over var.test_listener_rule_arn, opposite to what the variable description promises, so callers supplying an externally-managed test rule ARN while leaving green_alb_listener_rule_enabled = true (the default) will have their ARN silently replaced by a newly-created rule; (2) in rvn-ecs-web-definition.yml, test_traffic_condition_type is never passed to Terraform, so the Test traffic header name and Test traffic header value fields in the UI are dead — the test rule always matches the x-rvn-test query string regardless of what the operator enters.
compute/ecs_service/locals.tf (test_listener_rule_arn precedence) and compute/ecs_service/rvn-ecs-web-definition.yml (missing test_traffic_condition_type wiring)
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD LB{Load balancer attached?} LB -- No --> NoTG[No target groups / No infra role] LB -- Yes --> TG1[aws_lb_target_group.tg_1 production] LB -- Yes --> TG2[aws_lb_target_group.tg_2 alternate] LB -- Yes --> ROLE[aws_iam_role.ecs_infrastructure] TG1 & TG2 & ROLE --> ADV[load_balancer.advanced_configuration] ADV --> SVC[aws_ecs_service deployment_controller=ECS] DTYPE{deployment_type at create time} DTYPE -- rolling --> DCFG_SKIP[No deployment_configuration block] DTYPE -- blue_green/linear/canary --> DCFG[deployment_configuration strategy+bake+canary/linear] DCFG_SKIP & DCFG --> SVC SVC --> IGN[ignore_changes: desired_count task_definition load_balancer deployment_configuration] TESTQ{green_alb_listener_rule_enabled AND test_listener_rule_arn==null?} TESTQ -- Yes --> TESTRULE[aws_lb_listener_rule.test] TESTQ -- No --> EXTARN[var.test_listener_rule_arn or null] TESTRULE & EXTARN --> ADV%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD LB{Load balancer attached?} LB -- No --> NoTG[No target groups / No infra role] LB -- Yes --> TG1[aws_lb_target_group.tg_1 production] LB -- Yes --> TG2[aws_lb_target_group.tg_2 alternate] LB -- Yes --> ROLE[aws_iam_role.ecs_infrastructure] TG1 & TG2 & ROLE --> ADV[load_balancer.advanced_configuration] ADV --> SVC[aws_ecs_service deployment_controller=ECS] DTYPE{deployment_type at create time} DTYPE -- rolling --> DCFG_SKIP[No deployment_configuration block] DTYPE -- blue_green/linear/canary --> DCFG[deployment_configuration strategy+bake+canary/linear] DCFG_SKIP & DCFG --> SVC SVC --> IGN[ignore_changes: desired_count task_definition load_balancer deployment_configuration] TESTQ{green_alb_listener_rule_enabled AND test_listener_rule_arn==null?} TESTQ -- Yes --> TESTRULE[aws_lb_listener_rule.test] TESTQ -- No --> EXTARN[var.test_listener_rule_arn or null] TESTRULE & EXTARN --> ADVPrompt To Fix All With AI
Reviews (4): Last reviewed commit: "Merge branch 'main' into native-ecs-depl..." | Re-trigger Greptile