Skip to content

feat(ecs_service): native ECS deployment strategies (rolling/blue_green/linear/canary)#5

Merged
mabadir merged 31 commits into
mainfrom
native-ecs-deployment-strategies
Jun 19, 2026
Merged

feat(ecs_service): native ECS deployment strategies (rolling/blue_green/linear/canary)#5
mabadir merged 31 commits into
mainfrom
native-ecs-deployment-strategies

Conversation

@mabadir

@mabadir mabadir commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Replaces the dual-service / CodeDeploy-era blue/green design with the native ECS deployment controller strategies: rolling, blue_green, linear, and canary — and makes the strategy a per-deployment decision that requires no Terraform changes to switch.

Native strategies (b4ee71a)

  • deployment_controller.type = "ECS" always — the CodeDeploy/EXTERNAL mapping is removed
  • New deployment_type values linear / canary alongside rolling / blue_green
  • New deployment_strategy_config variable (bake time, canary %, linear step %) that seeds deployment_configuration at create time
  • New test_listener_rule_arn variable for blue/green test-traffic validation (TEST_TRAFFIC_SHIFT stages)
  • ECS infrastructure IAM role with AmazonECSInfrastructureRolePolicyForLoadBalancers — ECS assumes it to rewrite listener rules and (de)register targets during traffic shifts
  • load_balancer.advanced_configuration wiring (production/alternate target group, production + optional test listener rule, role)
  • deployment_configuration added to ignore_changes: the Flightcontrol deploy manager passes the authoritative configuration (strategy, bake times, pause lifecycle hooks) on every UpdateService call, so Terraform must not fight it
  • Requires AWS provider >= 6.21 (linear_configuration / canary_configuration)

Strategy is per-deployment (3c72a84)

Whenever a load balancer is attached, the module now always provisions:

  • the production (tg-1) + alternate (tg-2) target-group pair (the rolling-only aws_lb_target_group.this is removed)
  • the ECS infrastructure role
  • the service's advanced_configuration

Rolling deployments serve from the production target group only and never touch the alternate, so any service can switch between all four strategies on a single UpdateService call. deployment_type is purely the create-time seed.

Trade-off: every load-balanced rolling service carries one idle target group and one unused IAM role.

Outputs

  • production_target_group_arn / alternate_target_group_arn (+ names), ecs_infrastructure_role_arn
  • target_group_arn retained as an alias of the production target group
  • Removed: blue/green dual-service outputs

Tests

Fixed the module test suite so it actually runs: basic_service was failing on main (mock provider emits invalid JSON policies / ARNs), which skipped every other run. Added mock_data / mock_resource defaults and fixed two stale assertions.

tofu validate ✅ — tofu test ✅ 11/11 passed

Breaking changes

  • deployment_controller is always ECS; services created with the old CodeDeploy blue/green mapping are not migrated (pre-GA, clean removal)
  • aws_lb_target_group.this removed — existing load-balanced services will see target-group recreation on upgrade
  • Blue/green dual-service variables/outputs removed

🤖 Generated with Claude Code

Greptile Summary

This PR replaces the CodeDeploy-era blue/green dual-service design with native ECS deployment controller strategies (rolling, blue_green, linear, canary), and restructures the ECS cluster's capacity-provider default strategy to comply with the AWS restriction on mixing Fargate and EC2 families.

  • ecs_service module: Always uses deployment_controller.type = \"ECS\"; always creates a production + alternate target-group pair and an ECS infrastructure IAM role when a load balancer is attached, so any service can switch between all four strategies on a single UpdateService call with no Terraform changes; seeds deployment_configuration at create time and places it in ignore_changes so the Flightcontrol deploy manager owns it per-deployment.
  • ecs_cluster module: New capacity_provider_default variable commits the cluster's default strategy to a single provider family (EC2, Fargate, or Fargate Spot), preventing the AWS API rejection that previously occurred when both Fargate and EC2 providers were added to one strategy.
  • Tests: Unit tests fixed (mock provider now emits valid JSON/ARNs), substantially expanded with new cases for each strategy, multi-rule rejection, stickiness handling, and header vs. query-string test selectors; a new Go integration test validates AWS API acceptance of advanced_configuration on a rolling service.

Confidence Score: 4/5

Safe to merge after fixing two logic bugs: the test_listener_rule_arn variable is silently overridden by the module-created test rule when both are provided, and the YAML definition wires header-name/value UI inputs that can never take effect because test_traffic_condition_type is not wired.

The core redesign is coherent and well-tested. Two defects stand out: (1) in locals.tf, local.green_alb_listener_rule_enabled takes precedence over var.test_listener_rule_arn, opposite to what the variable description promises, so callers supplying an externally-managed test rule ARN while leaving green_alb_listener_rule_enabled = true (the default) will have their ARN silently replaced by a newly-created rule; (2) in rvn-ecs-web-definition.yml, test_traffic_condition_type is never passed to Terraform, so the Test traffic header name and Test traffic header value fields in the UI are dead — the test rule always matches the x-rvn-test query string regardless of what the operator enters.

compute/ecs_service/locals.tf (test_listener_rule_arn precedence) and compute/ecs_service/rvn-ecs-web-definition.yml (missing test_traffic_condition_type wiring)

Important Files Changed

Filename Overview
compute/ecs_service/ecs_service.tf Core ECS service rewrite: replaces CodeDeploy/rolling dual-path with always-ECS controller; adds deployment_configuration seeding, advanced_configuration wiring, preconditions for listener rule count — well-structured with key edge cases guarded
compute/ecs_service/locals.tf New locals for native strategy, test-rule priority, stickiness detection; test_listener_rule_arn precedence logic is inverted relative to the documented behavior
compute/ecs_service/listener_rules.tf Adds the test (green) listener rule with mirrored production conditions plus the test-traffic selector; handles group stickiness correctly
compute/ecs_service/iam_infrastructure.tf New ECS infrastructure role with AmazonECSInfrastructureRolePolicyForLoadBalancers; scoped trust policy to ecs.amazonaws.com; created for any LB-attached service
compute/ecs_service/target_groups.tf Removes rolling-only aws_lb_target_group.this; tg_1/tg_2 are always created when LB is attached; ignore_changes=[name] guards the moved-block upgrade path
compute/ecs_service/variables.tf Adds deployment_strategy_config, test_listener_rule_arn, green_alb_listener_rule_enabled, test traffic selector variables; documented precedence of test_listener_rule_arn contradicts the implementation in locals.tf
compute/ecs_service/rvn-ecs-web-definition.yml Adds full deployment strategy UI section; wires test_header_name/test_header_value but omits test_traffic_condition_type, leaving the module at its query-string default regardless of the UI header inputs
compute/ecs_service/moved.tf State migration from aws_lb_target_group.this[0] to tg_1[0]; combined with ignore_changes=[name] on tg_1 to avoid forced replacement and listener-rule deadlock on upgrade
compute/ecs_service/outputs.tf Renames blue/green outputs to production/alternate; retains target_group_arn and target_group_name as aliases; adds ecs_infrastructure_role_arn, production_listener_rule_arn, test_listener_rule_arn
compute/ecs_service/tests/basic.tftest.hcl Substantial expansion — adds mock_resource defaults for ARN validation, new test cases for canary/linear/blue_green, header vs query-string selector, stickiness, and multi-rule rejection
compute/ecs_cluster/locals.tf Refactors capacity_provider_strategy to commit to a single family (EC2, fargate, or fargate_spot) to satisfy AWS restriction; controlled by new capacity_provider_default variable
test/ecs_service_test.go Adds integration assertions for advanced_configuration on a real rolling service (alternate TG, production and test listener rule ARNs), directly validating the AWS API acceptance assumption

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    LB{Load balancer attached?}
    LB -- No --> NoTG[No target groups / No infra role]
    LB -- Yes --> TG1[aws_lb_target_group.tg_1 production]
    LB -- Yes --> TG2[aws_lb_target_group.tg_2 alternate]
    LB -- Yes --> ROLE[aws_iam_role.ecs_infrastructure]
    TG1 & TG2 & ROLE --> ADV[load_balancer.advanced_configuration]
    ADV --> SVC[aws_ecs_service deployment_controller=ECS]
    DTYPE{deployment_type at create time}
    DTYPE -- rolling --> DCFG_SKIP[No deployment_configuration block]
    DTYPE -- blue_green/linear/canary --> DCFG[deployment_configuration strategy+bake+canary/linear]
    DCFG_SKIP & DCFG --> SVC
    SVC --> IGN[ignore_changes: desired_count task_definition load_balancer deployment_configuration]
    TESTQ{green_alb_listener_rule_enabled AND test_listener_rule_arn==null?}
    TESTQ -- Yes --> TESTRULE[aws_lb_listener_rule.test]
    TESTQ -- No --> EXTARN[var.test_listener_rule_arn or null]
    TESTRULE & EXTARN --> ADV
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    LB{Load balancer attached?}
    LB -- No --> NoTG[No target groups / No infra role]
    LB -- Yes --> TG1[aws_lb_target_group.tg_1 production]
    LB -- Yes --> TG2[aws_lb_target_group.tg_2 alternate]
    LB -- Yes --> ROLE[aws_iam_role.ecs_infrastructure]
    TG1 & TG2 & ROLE --> ADV[load_balancer.advanced_configuration]
    ADV --> SVC[aws_ecs_service deployment_controller=ECS]
    DTYPE{deployment_type at create time}
    DTYPE -- rolling --> DCFG_SKIP[No deployment_configuration block]
    DTYPE -- blue_green/linear/canary --> DCFG[deployment_configuration strategy+bake+canary/linear]
    DCFG_SKIP & DCFG --> SVC
    SVC --> IGN[ignore_changes: desired_count task_definition load_balancer deployment_configuration]
    TESTQ{green_alb_listener_rule_enabled AND test_listener_rule_arn==null?}
    TESTQ -- Yes --> TESTRULE[aws_lb_listener_rule.test]
    TESTQ -- No --> EXTARN[var.test_listener_rule_arn or null]
    TESTRULE & EXTARN --> ADV
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
compute/ecs_service/locals.tf:50-55
**`test_listener_rule_arn` silently ignored when `green_alb_listener_rule_enabled = true`**

The variable description (and the README) state that `test_listener_rule_arn` "takes precedence over `green_alb_listener_rule_enabled`", but the code is the other way around. When a caller provides an externally-managed test rule ARN while leaving the default `green_alb_listener_rule_enabled = true`, `local.green_alb_listener_rule_enabled` evaluates to `true`, the module creates a redundant second test listener rule, and the caller-supplied ARN is silently discarded — the wrong rule ends up wired into `advanced_configuration.test_listener_rule`.

```suggestion
  green_alb_listener_rule_enabled = (
    local.enable_load_balancer
    && !local.enable_nlb_listener
    && var.green_alb_listener_rule_enabled
    && length(var.load_balancer_attachment.listener_rules) > 0
    && var.test_listener_rule_arn == null
  )
```

### Issue 2 of 2
compute/ecs_service/rvn-ecs-web-definition.yml:344-349
**UI "Test traffic header name/value" inputs are silently ignored — query-string routing is used instead**

The YAML wires `test_header_name` and `test_header_value` as Terraform variables, but `test_traffic_condition_type` is never set. The Terraform module defaults to `"query-string"`, so the test listener rule always uses `__x-rvn-test__=1` as the selector regardless of what the operator enters in the header fields. The header inputs in the UI are dead configuration. Either add `test_traffic_condition_type: "header"` alongside the header variable mappings, or remove the header name/value UI inputs and document that test traffic is identified by the `__x-rvn-test__` query string.

Reviews (4): Last reviewed commit: "Merge branch 'main' into native-ecs-depl..." | Re-trigger Greptile

mabadir and others added 2 commits June 4, 2026 00:25
…en/linear/canary)

Replace the CodeDeploy/external-controller blue/green wiring with the
ECS deployment controller's built-in strategies:

- deployment_type now accepts rolling | blue_green | linear | canary;
  the deployment controller is always ECS (CODE_DEPLOY mapping removed)
- new deployment_strategy_config seeds bake time + canary/linear tuning
  at create time; deployment_configuration is in ignore_changes because
  the Flightcontrol deploy manager passes the authoritative config
  (including pause lifecycle hooks) on every UpdateService call
- native strategies wire load_balancer.advanced_configuration:
  alternate target group (tg-2), production listener rule (first ALB
  rule or the NLB listener), optional test_listener_rule_arn, and a new
  ECS infrastructure role (AmazonECSInfrastructureRolePolicyForLoadBalancers)
- target groups tg-1/tg-2 are now production/alternate and gate on any
  native traffic-shift strategy, not just blue_green; outputs renamed
  accordingly (production_/alternate_target_group_*) and
  ecs_infrastructure_role_arn added
- provider floor bumped to aws >= 6.21 (linear/canary configuration)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Always provision the production/alternate target-group pair, the ECS
infrastructure role, and the service's load_balancer
advanced_configuration whenever a load balancer is attached — not just
for native traffic-shift deployment_types. Rolling deployments serve
from the production target group only, so any service can now switch
between rolling / blue_green / linear / canary on a single
UpdateService call with zero Terraform changes; deployment_type is
purely the create-time seed.

Also fixes the module test suite: mock the iam_policy_document /
partition / region / vpc data sources and computed ARNs so all 11 runs
execute (basic_service was failing on main and skipping the rest).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +102 to +106
production_listener_rule = (
local.enable_nlb_listener
? aws_lb_listener.nlb[0].arn
: aws_lb_listener_rule.alb["0"].arn
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Only the first ALB listener rule participates in traffic shifting

production_listener_rule is hardcoded to alb["0"].arn. The AWS ECS advanced_configuration API only accepts a single production listener rule ARN, so during blue_green/linear/canary deployments ECS rewrites only this one rule. Any additional listener rules (alb["1"], alb["2"], …) continue forwarding to tg_1 for the entire deployment — traffic on those rules never shifts to the new revision. Consider adding a validation block that requires length(var.load_balancer_attachment.listener_rules) <= 1 when using a native traffic-shift strategy, or document this constraint prominently so callers with multi-rule services aren't caught off guard.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ecs_service.tf
Line: 102-106

Comment:
**Only the first ALB listener rule participates in traffic shifting**

`production_listener_rule` is hardcoded to `alb["0"].arn`. The AWS ECS `advanced_configuration` API only accepts a single production listener rule ARN, so during blue_green/linear/canary deployments ECS rewrites only this one rule. Any additional listener rules (`alb["1"]`, `alb["2"]`, …) continue forwarding to tg_1 for the entire deployment — traffic on those rules never shifts to the new revision. Consider adding a validation block that requires `length(var.load_balancer_attachment.listener_rules) <= 1` when using a native traffic-shift strategy, or document this constraint prominently so callers with multi-rule services aren't caught off guard.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 93 to 111
dynamic "load_balancer" {
for_each = local.enable_load_balancer && var.deployment_type == "blue_green" ? [1] : []
for_each = local.enable_load_balancer ? [1] : []
content {
target_group_arn = aws_lb_target_group.tg_1[0].arn
container_name = local.lb_container_name
container_port = local.lb_container_port

advanced_configuration {
alternate_target_group_arn = aws_lb_target_group.tg_2[0].arn
production_listener_rule = (
local.enable_nlb_listener
? aws_lb_listener.nlb[0].arn
: aws_lb_listener_rule.alb["0"].arn
)
test_listener_rule = var.test_listener_rule_arn
role_arn = aws_iam_role.ecs_infrastructure[0].arn
}
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 advanced_configuration wired unconditionally for rolling deployments — not validated against real AWS API

The whole design bet is that AWS silently accepts advanced_configuration (alternate TG, production listener rule, infrastructure role) when deployment_configuration is absent (the rolling default). All tests run with command = plan against mocked providers, so the AWS API's acceptance of this combination has not been exercised. If AWS rejects advanced_configuration on a CreateService call that carries no deployment_configuration, every rolling service with a load balancer (the module default) would fail to provision. A real tofu apply against AWS — even in a scratch account — should be added to the test suite to confirm this assumption before merging.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ecs_service.tf
Line: 93-111

Comment:
**`advanced_configuration` wired unconditionally for rolling deployments — not validated against real AWS API**

The whole design bet is that AWS silently accepts `advanced_configuration` (alternate TG, production listener rule, infrastructure role) when `deployment_configuration` is absent (the rolling default). All tests run with `command = plan` against mocked providers, so the AWS API's acceptance of this combination has not been exercised. If AWS rejects `advanced_configuration` on a `CreateService` call that carries no `deployment_configuration`, every rolling service with a load balancer (the module default) would fail to provision. A real `tofu apply` against AWS — even in a scratch account — should be added to the test suite to confirm this assumption before merging.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 96 to 104
output "target_group_arn" {
description = "The ARN of the target group (null if load balancer disabled or blue/green deployment)."
value = local.enable_load_balancer && var.deployment_type == "rolling" ? aws_lb_target_group.this[0].arn : null
description = "The ARN of the production target group the service serves from (alias of production_target_group_arn; null if load balancer disabled)."
value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].arn : null
}

output "target_group_arn_suffix" {
description = "The ARN suffix of the target group for CloudWatch metrics."
value = local.enable_load_balancer && var.deployment_type == "rolling" ? aws_lb_target_group.this[0].arn_suffix : null
description = "The ARN suffix of the production target group for CloudWatch metrics."
value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].arn_suffix : null
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 target_group_name dropped without a backward-compatible alias

target_group_arn was deliberately kept as an alias for production_target_group_arn, but target_group_name — the natural companion — was removed entirely. Callers that reference module.xxx.target_group_name (e.g., in CloudWatch dashboards or external scripts) will get a Terraform error on upgrade. This is not listed as a named breaking change in the PR description. Adding output "target_group_name" { value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].name : null } would match the alias pattern used for the ARN.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/outputs.tf
Line: 96-104

Comment:
**`target_group_name` dropped without a backward-compatible alias**

`target_group_arn` was deliberately kept as an alias for `production_target_group_arn`, but `target_group_name` — the natural companion — was removed entirely. Callers that reference `module.xxx.target_group_name` (e.g., in CloudWatch dashboards or external scripts) will get a Terraform error on upgrade. This is not listed as a named breaking change in the PR description. Adding `output "target_group_name" { value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].name : null }` would match the alias pattern used for the ARN.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

- Reject >1 ALB listener rule for native traffic-shift strategies:
  advanced_configuration only rewrites a single production listener
  rule, so extra rules would keep serving the old revision for the
  whole deployment. Documented the constraint on listener_rules since
  the strategy is a per-deployment decision the deploy manager can
  change without Terraform.
- Restore the target_group_name output as an alias of
  production_target_group_name, matching the target_group_arn alias.
- Make TestEcsServiceWithAlb assert the deployed rolling service
  carries advanced_configuration (alternate TG, production listener
  rule, infrastructure role), validating that the real AWS API accepts
  it on CreateService without a deployment_configuration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mabadir mabadir added the run-terratest Run Terratest integration tests (real AWS apply) on this PR label Jun 5, 2026
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown

Ravion Module Publish Plan

Dry run only. No Ravion API mutations were made.

Module Current Version New Version Description
rvn-ecs-cluster 0.1.3 0.2.0 Automatically derive the default capacity provider so the cluster default strategy commits to a single provider family
rvn-ecs-web 0.5.1 0.6.0 Add native ECS blue/green, linear, and canary deployments with per-deploy strategy controls, manual approval gates, standby validation traffic, production/alternate target groups, ALB group stickiness guidance, ECS infrastructure role, and load-balancer advanced configuration outputs.

Diffs

rvn-ecs-cluster 0.1.3 -> 0.2.0

--- remote
+++ compiled
   - **Public and private Network Load Balancers** for TCP/UDP and static IP use cases
   - **CloudWatch Container Insights** dashboard metrics for production visibility
 
-  Terraform source: [flightcontrolhq/modules/compute/ecs_cluster](https://github.com/flightcontrolhq/modules/tree/rvn-ecs-cluster@0.1.3/compute/ecs_cluster)
+  Terraform source: [flightcontrolhq/modules/compute/ecs_cluster](https://github.com/flightcontrolhq/modules/tree/rvn-ecs-cluster@0.2.0/compute/ecs_cluster)
 
   ## Use cases
 
@@
         base_path: compute/ecs_cluster
         branch: main
         execution_environment_id: << module.input.execution_environment_id >>
-        ref: rvn-ecs-cluster@0.1.3
+        ref: rvn-ecs-cluster@0.2.0
         repo: https://github.com/flightcontrolhq/modules
         stack_id: <<stack.id>>
         terraform_variables:

rvn-ecs-web 0.5.1 -> 0.6.0

--- remote
+++ compiled
     queue_overflow: oldest
     queue_size: 1
   infrastructure:
+    ecs_alternate_target_group_arn: <<stack.output.alternate_target_group_arn>>
     ecs_cluster_arn: <<stack.output.service_cluster>>
+    ecs_infrastructure_role_arn: <<stack.output.ecs_infrastructure_role_arn>>
+    ecs_production_listener_rule_arn: <<stack.output.production_listener_rule_arn>>
     ecs_service_arns:
       - <<stack.output.service_arn>>
+    ecs_target_group_arn: <<stack.output.target_group_arn>>
+    ecs_test_listener_rule_arn: <<stack.output.test_listener_rule_arn>>
   inputs:
     - description: Pass the image tag or digest to deploy. For Nixpacks or Dockerfile builds, this is resolved in the Ravion-created ECR repository. For Prebuilt image from registry mode, this is resolved in the repository configured on the module. Do not pass a full image URI.
       id: image_ref
@@
       type: string
   post_deploy: '<< module.input.post_deploy_enabled && len(module.input.post_deploy_command) > 0 ? {"container_overrides": [{"name": stack.output.container_name, "command": module.input.post_deploy_command, "environment": module.input.post_deploy_environment_variables, "cpu": module.input.post_deploy_cpu || nil, "memory": module.input.post_deploy_memory || nil}], "cpu": string(module.input.post_deploy_cpu || (module.input.capacity_provider == "ec2" ? int(float(module.input.task_cpu) * 1024) : int(float(module.input.fargate_size.vcpu) * 1024))), "memory": string(module.input.post_deploy_memory || (module.input.capacity_provider == "ec2" ? int(float(module.input.task_memory) * 1024) : int(float(module.input.fargate_size.memory_gb) * 1024))), "ephemeral_storage": (module.input.post_deploy_ephemeral_storage_size_gib ? {size_in_gib: module.input.post_deploy_ephemeral_storage_size_gib} : nil), "task_role_arn": stack.output.task_role_arn, "execution_role_arn": stack.output.execution_role_arn, "capacity_provider_strategy": ((module.input.capacity_provider == "fargate" || module.input.additional_fargate_capacity_enabled ? [{capacity_provider: module.input.fargate_capacity_provider_name, weight: 1, base: 0}] : []) | concat(module.input.capacity_provider == "fargate_spot" || module.input.additional_fargate_spot_capacity_enabled ? [{capacity_provider: module.input.fargate_spot_capacity_provider_name, weight: 1, base: 0}] : []) | concat(module.input.capacity_provider == "ec2" || module.input.additional_ec2_capacity_enabled ? [{capacity_provider: module.input.ec2_capacity_provider_name, weight: 1, base: 0}] : [])), "network_configuration": {"awsvpc_configuration": {"subnets": (module.input.private_subnet_placement_enabled ? module.input.private_subnet_ids : module.input.public_subnet_ids), "security_groups": ([stack.output.security_group_id] | concat(module.input.security_group_ids != nil ? module.input.security_group_ids : [])), "assign_public_ip": (module.input.private_subnet_placement_enabled ? "DISABLED" : "ENABLED")}}, "enable_execute_command": module.input.execute_command_enabled, "timeout": module.input.post_deploy_timeout} : nil >>'
   pre_deploy: '<< module.input.pre_deploy_enabled && len(module.input.pre_deploy_command) > 0 ? {"container_overrides": [{"name": stack.output.container_name, "command": module.input.pre_deploy_command, "environment": module.input.pre_deploy_environment_variables, "cpu": module.input.pre_deploy_cpu || nil, "memory": module.input.pre_deploy_memory || nil}], "cpu": string(module.input.pre_deploy_cpu || (module.input.capacity_provider == "ec2" ? int(float(module.input.task_cpu) * 1024) : int(float(module.input.fargate_size.vcpu) * 1024))), "memory": string(module.input.pre_deploy_memory || (module.input.capacity_provider == "ec2" ? int(float(module.input.task_memory) * 1024) : int(float(module.input.fargate_size.memory_gb) * 1024))), "ephemeral_storage": (module.input.pre_deploy_ephemeral_storage_size_gib ? {size_in_gib: module.input.pre_deploy_ephemeral_storage_size_gib} : nil), "task_role_arn": stack.output.task_role_arn, "execution_role_arn": stack.output.execution_role_arn, "capacity_provider_strategy": ((module.input.capacity_provider == "fargate" || module.input.additional_fargate_capacity_enabled ? [{capacity_provider: module.input.fargate_capacity_provider_name, weight: 1, base: 0}] : []) | concat(module.input.capacity_provider == "fargate_spot" || module.input.additional_fargate_spot_capacity_enabled ? [{capacity_provider: module.input.fargate_spot_capacity_provider_name, weight: 1, base: 0}] : []) | concat(module.input.capacity_provider == "ec2" || module.input.additional_ec2_capacity_enabled ? [{capacity_provider: module.input.ec2_capacity_provider_name, weight: 1, base: 0}] : [])), "network_configuration": {"awsvpc_configuration": {"subnets": (module.input.private_subnet_placement_enabled ? module.input.private_subnet_ids : module.input.public_subnet_ids), "security_groups": ([stack.output.security_group_id] | concat(module.input.security_group_ids != nil ? module.input.security_group_ids : [])), "assign_public_ip": (module.input.private_subnet_placement_enabled ? "DISABLED" : "ENABLED")}}, "enable_execute_command": module.input.execute_command_enabled, "timeout": module.input.pre_deploy_timeout} : nil >>'
+  strategy: |
+    <<
+    module.input.deployment_strategy == "rolling" ? nil :
+    module.input.deployment_strategy == "blue_green" ? {
+      "type": "blue_green",
+      "bake_time_in_minutes": module.input.deployment_bake_time_in_minutes,
+      "pause_stages": module.input.deployment_pause_stages
+    } :
+    module.input.deployment_strategy == "linear" ? {
+      "type": "linear",
+      "bake_time_in_minutes": module.input.deployment_bake_time_in_minutes,
+      "pause_stages": module.input.deployment_pause_stages,
+      "linear": {
+        "step_percentage": module.input.linear_step_percentage,
+        "step_bake_time_in_minutes": module.input.linear_step_bake_time_in_minutes
+      }
+    } : {
+      "type": "canary",
+      "bake_time_in_minutes": module.input.deployment_bake_time_in_minutes,
+      "pause_stages": module.input.deployment_pause_stages,
+      "canary": {
+        "canary_percent": module.input.canary_percent,
+        "canary_bake_time_in_minutes": module.input.canary_bake_time_in_minutes
+      }
+    }
+    >>
   task_definition:
     container_definitions: "<< (module.input.firelens_enabled ? [{\"command\": (len(module.input.prebuilt_image_start_command) > 0 ? module.input.prebuilt_image_start_command : nil), \"cpu\": (module.input.capacity_provider == \"ec2\" ? int(float(module.input.task_cpu) * 1024) : int(float(module.input.fargate_size.vcpu) * 1024)), \"depends_on\": [{\"container_name\": \"log_router\", \"condition\": \"START\"}], \"environment\": ([{name: \"PORT\", value: string(module.input.container_port)}] | concat(module.input.environment_variables != nil ? module.input.environment_variables : [])), \"essential\": true, \"image\": (module.input.build_type == \"prebuilt_image\" ? (deploy.input.image_ref contains \"sha256:\" ? module.input.image_repository + \"@\" + deploy.input.image_ref : module.input.image_repository + \":\" + deploy.input.image_ref) : (deploy.input.image_ref contains \"sha256:\" ? stack.output.ecr_repository_url + \"@\" + deploy.input.image_ref : stack.output.ecr_repository_url + \":\" + deploy.input.image_ref)), \"linux_parameters\": {\"init_process_enabled\": true}, \"log_configuration\": {\"log_driver\": \"awsfirelens\"}, \"memory\": (module.input.capacity_provider == \"ec2\" ? int(float(module.input.task_memory) * 1024) : int(float(module.input.fargate_size.memory_gb) * 1024)), \"name\": (stack.output.container_name), \"port_mappings\": [{\"app_protocol\": \"http\", \"container_port\": (module.input.container_port), \"protocol\": \"tcp\"}], \"readonly_root_filesystem\": false, \"repository_credentials\": (module.input.image_registry_credentials_secret_arn ? {credentials_parameter: module.input.image_registry_credentials_secret_arn} : nil), \"secrets\": (module.input.secrets), \"stop_timeout\": 30}, {\"name\": \"log_router\", \"image\": module.input.firelens_image, \"cpu\": 0, \"memory_reservation\": 51, \"essential\": true, \"environment\": ([{name: \"FIRELENS_CONFIG_CONTENT\", value: \"[SERVICE]\\n    Flush 1\\n    Grace 30\\n\\n\" + (module.input.firelens_config ? module.input.firelens_config + \"\\n\" : \"\") + (module.input.firelens_cloudwatch_output_enabled ? \"\\n[OUTPUT]\\n    Name cloudwatch\\n    Match *\\n    region \" + stack.output.region + \"\\n    log_group_name \" + stack.output.log_group_name + \"\\n    auto_create_group true\\n    log_stream_prefix \" + stack.output.log_stream_prefix + \"/\\n    retry_limit 2\\n    log_key log\\n    log_format json/emf\\n\" : \"\")}] | concat(module.input.firelens_environment_variables != nil ? module.input.firelens_environment_variables : [])), \"secrets\": module.input.firelens_secrets != nil ? module.input.firelens_secrets : [], \"command\": [\"/bin/sh\", \"-c\", \"printf '%s' \\\"$FIRELENS_CONFIG_CONTENT\\\" > /flightcontrol-firelens.conf && exec /entrypoint.sh\"], \"user\": \"0\", \"log_configuration\": {\"log_driver\": \"awslogs\", \"options\": {\"awslogs-group\": stack.output.log_group_name, \"awslogs-region\": stack.output.region, \"awslogs-stream-prefix\": stack.output.log_stream_prefix + \"/firelens\"}}, \"firelens_configuration\": {\"type\": \"fluentbit\", \"options\": {\"config-file-type\": \"file\", \"config-file-value\": \"/flightcontrol-firelens.conf\", \"enable-ecs-log-metadata\": string(module.input.firelens_ecs_metadata_enabled)}}}] : [{\"command\": (len(module.input.prebuilt_image_start_command) > 0 ? module.input.prebuilt_image_start_command : nil), \"cpu\": (module.input.capacity_provider == \"ec2\" ? int(float(module.input.task_cpu) * 1024) : int(float(module.input.fargate_size.vcpu) * 1024)), \"environment\": ([{name: \"PORT\", value: string(module.input.container_port)}] | concat(module.input.environment_variables != nil ? module.input.environment_variables : [])), \"essential\": true, \"image\": (module.input.build_type == \"prebuilt_image\" ? (deploy.input.image_ref contains \"sha256:\" ? module.input.image_repository + \"@\" + deploy.input.image_ref : module.input.image_repository + \":\" + deploy.input.image_ref) : (deploy.input.image_ref contains \"sha256:\" ? stack.output.ecr_repository_url + \"@\" + deploy.input.image_ref : stack.output.ecr_repository_url + \":\" + deploy.input.image_ref)), \"linux_parameters\": {\"init_process_enabled\": true}, \"log_configuration\": {\"log_driver\": \"awslogs\", \"options\": {\"awslogs-group\": stack.output.log_group_name, \"awslogs-region\": stack.output.region, \"awslogs-stream-prefix\": stack.output.log_stream_prefix}}, \"memory\": (module.input.capacity_provider == \"ec2\" ? int(float(module.input.task_memory) * 1024) : int(float(module.input.fargate_size.memory_gb) * 1024)), \"name\": (stack.output.container_name), \"port_mappings\": [{\"app_protocol\": \"http\", \"container_port\": (module.input.container_port), \"protocol\": \"tcp\"}], \"readonly_root_filesystem\": false, \"repository_credentials\": (module.input.image_registry_credentials_secret_arn ? {credentials_parameter: module.input.image_registry_credentials_secret_arn} : nil), \"secrets\": (module.input.secrets), \"stop_timeout\": 30}]) | concat(module.input.sidecars ? module.input.sidecars : []) >>"
     cpu: '<< string(module.input.capacity_provider == "ec2" ? int(float(module.input.task_cpu) * 1024) : int(float(module.input.fargate_size.vcpu) * 1024)) >>'
@@
     show_when:
       build_type: nixpacks
     type: string_array
+  - id: section_deployment
+    label: Deployment
+    type: section
+  - default: rolling
+    description: Choose how traffic moves from the current deployment to the new deployment.
+    id: deployment_strategy
+    label: Deployment strategy
+    required: true
+    type: string
+    values:
+      - description: Replace tasks in place while the service 
... diff truncated ...

mabadir and others added 8 commits June 10, 2026 10:39
AWS rejects explicit from_port/to_port when ip_protocol is "-1", which
broke updates to existing rules (the sg-module refactor changed the ECS
instance ingress rule from -1/-1 to 0/0). Null the ports in the shared
security-groups module whenever the protocol is "-1" or "all", and use
-1 in the ecs_cluster caller for clarity.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
# Conflicts:
#	compute/ecs_cluster/README.md
#	compute/ecs_cluster/locals.tf
#	compute/ecs_cluster/rvn-ecs-cluster-definition.yml
#	compute/ecs_service/README.md
#	compute/ecs_service/rvn-ecs-web-definition.yml
@mabadir mabadir force-pushed the native-ecs-deployment-strategies branch from 777e8bd to bd9ac38 Compare June 11, 2026 20:29
mabadir added 7 commits June 11, 2026 19:37
…ner rule

Make the green (test) ALB listener rule's distinguishing condition
configurable between an HTTP header and a query string via
test_traffic_condition_type. Default to the query string
__x-rvn-test__=1 so test traffic reaches the green target group
without a custom header. ALB AND-combines conditions and ECS native
blue/green wires exactly one test rule, so the selector is one type
per service, not simultaneous header-OR-query matching.

mabadir commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

@greptile

var.load_balancer_attachment.listener_rules[0].priority,
local.green_default_production_priority,
) : null
green_test_priority = local.green_alb_listener_rule_enabled ? local.green_production_priority - 1 : null

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Test-rule priority becomes 0 when production priority is 1

green_test_priority = local.green_production_priority - 1 produces 0 when a caller explicitly sets priority = 1. AWS ALB rejects listener rule priorities outside the range 1–50000, so the apply would fail with an API error (ValidationError: The number must be between 1 and 50000). There is no variable validation that prevents listener_rules[0].priority = 1, making this a real failure path for services that want the highest-priority rule. A guard is needed — either validate that listener_rules[0].priority >= 2 when green_alb_listener_rule_enabled is true, or clamp green_test_priority to a minimum of 1.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/locals.tf
Line: 72

Comment:
**Test-rule priority becomes 0 when production priority is 1**

`green_test_priority = local.green_production_priority - 1` produces `0` when a caller explicitly sets `priority = 1`. AWS ALB rejects listener rule priorities outside the range 1–50000, so the apply would fail with an API error (`ValidationError: The number must be between 1 and 50000`). There is no variable validation that prevents `listener_rules[0].priority = 1`, making this a real failure path for services that want the highest-priority rule. A guard is needed — either validate that `listener_rules[0].priority >= 2` when `green_alb_listener_rule_enabled` is true, or clamp `green_test_priority` to a minimum of 1.

How can I resolve this? If you propose a fix, please make it concise.

mabadir added 2 commits June 17, 2026 10:38
# Conflicts:
#	compute/ecs_service/rvn-ecs-web-definition.yml
@mabadir

mabadir commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile

@mabadir mabadir requested a review from flybayer June 17, 2026 23:03
Comment on lines +6 to +8
from = aws_lb_target_group.this[0]
to = aws_lb_target_group.tg_1[0]
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Rolling-service upgrade will fail: moved + name change + ignore_changes = [action] deadlock

For any existing rolling service with a load balancer, upgrading to this module version triggers a tofu apply failure:

  1. The moved block transfers the state address of aws_lb_target_group.this[0] (name suffix -tg, max-prefix 28 chars) to aws_lb_target_group.tg_1[0] (suffix -tg-1, max-prefix 24 chars). Every service gets a different name, so Terraform plans a replacement.
  2. With create_before_destroy = true, the new TG is created first (new ARN). However, ignore_changes = [action] on aws_lb_listener_rule.alb suppresses the ARN reference update — the listener rule keeps pointing to the old TG ARN.
  3. Terraform then attempts to destroy the old TG. AWS rejects this: "Target group is currently in use by a listener rule", and the apply fails.

The apply exits mid-way: the new TG exists but is unreachable, and the old TG is stuck — referenced by the listener rule but orphaned from Terraform state. Manual cleanup is required. This affects every load-balanced rolling service; the PR description mentions "target-group recreation on upgrade" but not that the apply itself fails.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/moved.tf
Line: 6-8

Comment:
**Rolling-service upgrade will fail: `moved` + name change + `ignore_changes = [action]` deadlock**

For any existing rolling service with a load balancer, upgrading to this module version triggers a `tofu apply` failure:

1. The `moved` block transfers the state address of `aws_lb_target_group.this[0]` (name suffix `-tg`, max-prefix 28 chars) to `aws_lb_target_group.tg_1[0]` (suffix `-tg-1`, max-prefix 24 chars). Every service gets a different `name`, so Terraform plans a replacement.
2. With `create_before_destroy = true`, the new TG is created first (new ARN). However, `ignore_changes = [action]` on `aws_lb_listener_rule.alb` suppresses the ARN reference update — the listener rule keeps pointing to the old TG ARN.
3. Terraform then attempts to destroy the old TG. AWS rejects this: "Target group is currently in use by a listener rule", and the `apply` fails.

The apply exits mid-way: the new TG exists but is unreachable, and the old TG is stuck — referenced by the listener rule but orphaned from Terraform state. Manual cleanup is required. This affects every load-balanced rolling service; the PR description mentions "target-group recreation on upgrade" but not that the apply itself fails.

How can I resolve this? If you propose a fix, please make it concise.

Re-adopting a pre-existing target group via the moved block changed the
name from -tg to -tg-1, forcing replacement. With ignore_changes=[action]
on the listener rule, the rule never repointed to the new TG and the old
TG's destroy failed (in use by listener rule), breaking apply mid-way.

Ignore changes to name so migrated TGs keep their existing name/ARN in
place; fresh services still get the -tg-1 name.
@mabadir mabadir removed the run-terratest Run Terratest integration tests (real AWS apply) on this PR label Jun 18, 2026
# Conflicts:
#	compute/ecs_cluster/ec2.tf
#	compute/ecs_cluster/rvn-ecs-cluster-definition.yml
@mabadir

mabadir commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@greptile

base_path: compute/ecs_cluster
terraform_variables:
...overrides: << module.input.advanced_terraform_variables >>
capacity_provider_default: << module.input.capacity_provider_default >>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't see input added for this, but instead of that just derive from the enabled provider flags? defaulting to something automatically?

Comment on lines +53 to +62
values:
- label: Rolling
value: rolling
- label: Blue/green
value: blue_green
- label: Linear
value: linear
- label: Canary
value: canary
- id: deployment_bake_time_in_minutes

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add descriptions

Comment on lines +170 to +183
- id: test_header_name
label: Test traffic header name
type: string
description: HTTP header that routes a request to the green (test) target group during a deployment. Send this header with the value below to reach the new revision before production traffic shifts. Requests without it keep hitting production.
collapsible: true
required: true
default: X-Ravion-Test
- id: test_header_value
label: Test traffic header value
type: string
description: Value the test traffic header must carry to be routed to the green (test) target group.
collapsible: true
required: true
default: "1"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and can we add query string also?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually query string is added as default, I am going to remove this input altogether.

@mabadir mabadir requested a review from flybayer June 18, 2026 23:58
Comment thread compute/ecs_cluster/rvn-ecs-cluster-definition.yml

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update readme with instructions on how to access the standby environment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be in the web def

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I got this

Comment on lines +694 to +706
## Standby validation traffic

For Blue/green, Linear, and Canary deployments, Ravion creates a test listener rule on the same Application Load Balancer listener as the production rule. During the test traffic stage, requests that match the service's normal Domain host rules and Path rules and include the standby selector route to the standby, or green, task set on the alternate target group.

By default, the standby selector is the query parameter `__x-rvn-test__=1`. For example, if production traffic uses `https://app.example.com/health`, validate the standby service with:

```text
https://app.example.com/health?__x-rvn-test__=1
```

The alternate target group only has registered targets while ECS is running a traffic-shift deployment. Outside that window, the standby route may have no healthy targets.

Use Advanced Terraform variables to override the standby selector. Values in Advanced Terraform variables override the generated Terraform variables for the service.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add that it is sticky with cookie, so once first request goes to standy you are stuck on that uless you clear a cookie

lets document the cookie

and how can you tell if a response is from prod or standby?

@mabadir mabadir requested a review from flybayer June 19, 2026 15:56
@mabadir mabadir merged commit 90b7da8 into main Jun 19, 2026
6 checks passed
@mabadir mabadir deleted the native-ecs-deployment-strategies branch June 20, 2026 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants