20 commits
874fc62
feat(cdk): integ-tests Phase 1 — core lifecycle E2E (#317)
ayushtr-aws Jun 15, 2026
b59a09a
test(cdk): wire Phase 1 gate scenarios to the provisioned sandbox (#317)
ayushtr-aws Jun 15, 2026
826da1f
fix(agent): exclude integ-runner output from the agent build context …
ayushtr-aws Jun 15, 2026
02b8fd2
test(cdk): make Phase 1 integ teardown + diagnostics robust (#317)
ayushtr-aws Jun 15, 2026
59953e1
test(cdk): assert task-record fields + approval metadata, not just st…
ayushtr-aws Jun 15, 2026
df73ef8
fix(cdk): satisfy onboarding gate + guardrail in Phase 1 scenarios (#…
ayushtr-aws Jun 15, 2026
e605fee
test(cdk): disable two-phase update workflow for the lifecycle test (…
ayushtr-aws Jun 15, 2026
d754b7d
test(cdk): per-run unique stack name + tolerate ENI teardown failure …
ayushtr-aws Jun 15, 2026
056d048
fix(cdk): drop nested Match from polled assertions — they never match…
ayushtr-aws Jun 15, 2026
f535449
fix(cdk): seed GitHub token before any preflight to avoid cached-empt…
ayushtr-aws Jun 15, 2026
309afdf
docs(cdk): clarify Phase 1 integ test is environment-agnostic (#317)
ayushtr-aws Jun 15, 2026
3f82e39
fix(security): force js-yaml >=4.2.0 to clear GHSA-h67p-54hq-rp68 (#317)
ayushtr-aws Jun 15, 2026
11edc69
Merge branch 'aws-samples:main' into feat/317-core-lifecycle-e2e
ayushtr-aws Jun 15, 2026
a91b08c
Merge branch 'main' into feat/317-core-lifecycle-e2e
isadeks Jun 17, 2026
34e7343
test(cdk): address PR review — fresh gate tokens, PENDING filter, env…
ayushtr-aws Jun 18, 2026
a5a188f
feat(ci): add integ-sweeper to reclaim stranded int-* stacks + alarm …
ayushtr-aws Jun 18, 2026
7032e80
Merge branch 'main' into feat/317-core-lifecycle-e2e
isadeks Jun 18, 2026
a66a8ab
Merge branch 'main' into feat/317-core-lifecycle-e2e
isadeks Jun 18, 2026
51bdde6
test(cdk): address re-review — sweeper hardening, gate-skip-on-unset,…
ayushtr-aws Jun 18, 2026
5fb4632
Merge branch 'main' into feat/317-core-lifecycle-e2e
isadeks Jun 19, 2026
8 changes: 8 additions & 0 deletions .dockerignore
@@ -9,6 +9,14 @@ cdk/cdk.out/
cdk/lib/
cdk/node_modules/

# integ-runner output dirs. The agent artifact's build context is the repo
# root, and integ-runner writes its synth/snapshot output UNDER that root
# (cdk/test/integ/cdk-integ.out.<test>.ts[.snapshot]/). Without these excludes,
# staging the root copies its own output dir into itself recursively until the
# path overflows (ENAMETOOLONG). Mirrors .gitignore lines 70-71.
cdk/test/integ/cdk-integ.out.*/
cdk/test/integ/*.snapshot/

# CLI and docs build artifacts
cli/lib/
cli/node_modules/
204 changes: 204 additions & 0 deletions .github/workflows/integ-sweeper.yml
@@ -0,0 +1,204 @@
name: integ-sweeper
# Reclaims stranded ephemeral integ stacks (issue #317 / PR #348 follow-up).
#
# The Phase-1 lifecycle integ test (integ.yml + cdk/test/integ/integ.task-lifecycle.ts)
# deploys a per-run `int-<commit-sha>` stack running the AgentCore Runtime in VPC
# mode. That runtime injects AWS-service-managed `agentic_ai` ENIs into the private
# subnets, which AWS releases only ASYNCHRONOUSLY (observed: 1+ hours after the
# runtime is deleted). So the in-run `cdk destroy` reliably fails the subnet/SG/VPC
# deletes (DependencyViolation) and the integ run tolerates that failure
# (destroy.expectError) rather than blocking on a wait it can't win. The per-run
# UNIQUE stack name means a stranded stack never blocks a later run — but nothing
# in the run reclaims it either.
#
# THIS workflow is that reclaimer: on a schedule (after the ENIs have had time to
# detach), it deletes every `int-*` stack, and FAILS LOUDLY + opens a tracking
# issue for any `int-*` stack older than the alarm threshold that still won't
# delete — so a genuine leak (cost in the shared account) surfaces instead of
# accumulating silently.
on:
  workflow_dispatch: {}
  schedule:
    # Every 2 hours. Frequent enough that a normal stranded stack (ENIs release in
    # ~1-2h) is reclaimed within a cycle or two, well before the 6h alarm age.
    - cron: "0 */2 * * *"

concurrency:
  group: integ-sweeper
  cancel-in-progress: false

permissions:
  contents: none

jobs:
  sweep:
    name: Reclaim stranded int-* stacks
    runs-on: ubuntu-latest
    # The integ deploy role (secrets.AWS_ROLE_TO_ASSUME) is scoped to the `integ`
    # environment — same as integ.yml. The environment's protection rules must
    # permit this scheduled run to assume the role (no manual approval is possible
    # on a cron trigger).
    environment: integ
    timeout-minutes: 30
    permissions:
      id-token: write # OIDC role assumption
      contents: read
      issues: write # open a tracking issue on a genuine leak
    env:
      # Stacks older than this (hours) that STILL fail to delete are treated as a
      # genuine leak → fail the job + file an issue. Comfortably past the observed
      # ENI-release window so normal teardown lag never false-alarms.
      ALARM_AGE_HOURS: "6"
      AWS_REGION: ${{ vars.AWS_REGION || 'us-east-1' }}
      AWS_DEFAULT_REGION: ${{ vars.AWS_REGION || 'us-east-1' }}
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@e7f100cf4c008499ea8adda475de1042d6975c7b # v6.2.0
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
          aws-region: ${{ vars.AWS_REGION || 'us-east-1' }}

      - name: Sweep int-* stacks
        id: sweep
        run: |
          set -uo pipefail

          # Only the integ test's own per-run stacks are eligible. The test names
          # them `int-<commit-hash>` where the hash is the 8-char short SHA
          # (integ.task-lifecycle.ts: COMMIT_HASH.slice(0,8)). We therefore sweep
          # ONLY names matching `int-<8 lowercase hex>` — NOT a bare `int-*` glob.
          # `int-` is a short prefix; an unguarded glob in a shared account could
          # delete an unrelated stack that merely starts with those 4 chars. The
          # `int-local` fallback name (local dev runs) is intentionally NOT swept:
          # CI never produces it, so a match would be someone's local stack.
          STACK_RE='^int-[0-9a-f]{8}$'

          # All non-deleted int-* stacks (active, DELETE_FAILED, or rollback states);
          # the JMESPath prefilter narrows the API page, the regex below is the
          # authoritative guard.
          mapfile -t candidates < <(
            aws cloudformation list-stacks \
              --stack-status-filter CREATE_COMPLETE CREATE_FAILED ROLLBACK_COMPLETE ROLLBACK_FAILED \
                UPDATE_COMPLETE UPDATE_ROLLBACK_COMPLETE UPDATE_ROLLBACK_FAILED DELETE_FAILED \
              --query 'StackSummaries[?starts_with(StackName, `int-`)].StackName' \
              --output text 2>/dev/null | tr '\t' '\n' | sort -u
          )

          stacks=()
          for c in "${candidates[@]}"; do
            [ -n "$c" ] || continue
            if [[ "$c" =~ $STACK_RE ]]; then
              stacks+=("$c")
            else
              echo "Skipping '$c' — does not match ${STACK_RE} (not a sweepable integ stack)."
            fi
          done

          if [ "${#stacks[@]}" -eq 0 ]; then
            echo "No int-* stacks present. Nothing to sweep."
            exit 0
          fi

          echo "Found ${#stacks[@]} int-* stack(s): ${stacks[*]}"
          now_epoch="$(date -u +%s)"
          alarm_secs=$(( ALARM_AGE_HOURS * 3600 ))
          leaked=""

          for stack in "${stacks[@]}"; do
            [ -n "$stack" ] || continue
            echo "::group::$stack"

            # Best-effort delete (idempotent; no-op if already deleting/gone).
            aws cloudformation delete-stack --stack-name "$stack" || true
            # Give CloudFormation a moment, then read the resulting status.
            sleep 15
            status="$(aws cloudformation describe-stacks --stack-name "$stack" \
              --query 'Stacks[0].StackStatus' --output text 2>&1 || true)"

            if echo "$status" | grep -qiE 'does not exist|ValidationError'; then
              echo "✅ $stack deleted (or gone)."
              echo "::endgroup::"
              continue
            fi

            # Still present — how old is it? Alarm only if past the threshold.
            created="$(aws cloudformation describe-stacks --stack-name "$stack" \
              --query 'Stacks[0].CreationTime' --output text 2>/dev/null || true)"
            created_epoch="$(date -u -d "$created" +%s 2>/dev/null || echo 0)"
            age_secs=$(( now_epoch - created_epoch ))
            age_hours=$(( age_secs / 3600 ))

            if [ "$created_epoch" -gt 0 ] && [ "$age_secs" -ge "$alarm_secs" ]; then
              echo "❌ $stack still present (status: $status), age ${age_hours}h ≥ ${ALARM_AGE_HOURS}h — LEAK."
              leaked="${leaked}\n- \`${stack}\` — status \`${status}\`, age ~${age_hours}h"
            else
              echo "⏳ $stack still present (status: $status), age ~${age_hours}h — within ${ALARM_AGE_HOURS}h window; ENIs likely not yet released. Will retry next cycle."
            fi
            echo "::endgroup::"
          done

          if [ -n "$leaked" ]; then
            {
              echo "leaked<<EOF"
              echo -e "$leaked"
              echo "EOF"
            } >> "$GITHUB_OUTPUT"
          fi

      - name: Open issue on genuine leak
        if: steps.sweep.outputs.leaked != ''
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # Pass via env (not inline ${{ }} interpolation) so the value never
          # expands into the shell script body — avoids template injection
          # (zizmor template-injection). Stack names are AWS-controlled, but env
          # is the correct, lint-clean pattern regardless.
          LEAKED: ${{ steps.sweep.outputs.leaked }}
          # Stable label used both to tag the tracking issue and to find an
          # existing open one — this is the dedup key, so it must not change.
          LEAK_LABEL: integ-leak
        run: |
          set -euo pipefail
          body_file="$(mktemp)"
          {
            echo "The integ-sweeper found stranded \`int-*\` CloudFormation stacks older than ${ALARM_AGE_HOURS}h that still fail to delete — likely a real leak in the shared integ account (each carries a VPC + NAT gateway + interface endpoints + the AgentCore runtime, billing hourly)."
            echo ""
            echo "These are normally reclaimed automatically once the AgentCore \`agentic_ai\` ENIs detach (~1-2h). Past ${ALARM_AGE_HOURS}h, investigate: the ENIs may be genuinely stuck (needs manual ENI/VPC cleanup) or the deploy role lacks teardown permissions."
            echo ""
            echo "### Stranded stacks (as of this run)"
            echo -e "${LEAKED}"
            echo ""
            echo "| Field | Value |"
            echo "| --- | --- |"
            echo "| Workflow run | [integ-sweeper #${GITHUB_RUN_NUMBER}](${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}) |"
            echo "| Region | \`${AWS_REGION}\` |"
            echo ""
            echo "Close this issue once the stacks are deleted and the sweeper run is green."
          } > "${body_file}"

          # Dedup: a stuck stack re-alarms every 2h cycle. Without this guard each
          # cycle files a fresh duplicate. Find an existing OPEN issue carrying the
          # stable leak label and comment on it instead of opening another; only
          # open a new issue when none exists. `--state open --label` scopes the
          # list to open issues with the label; `--json number --jq '.[0].number // empty'`
          # yields the first match (or empty). Ensure the label exists first
          # (idempotent; ignore "already exists").
          gh label create "${LEAK_LABEL}" \
            --description "Stranded integ stacks flagged by integ-sweeper" \
            --color B60205 2>/dev/null || true

          existing="$(gh issue list --state open --label "${LEAK_LABEL}" \
            --json number --jq '.[0].number // empty' 2>/dev/null || true)"

          if [ -n "${existing}" ]; then
            echo "Existing open leak issue #${existing} — commenting instead of opening a duplicate."
            gh issue comment "${existing}" --body-file "${body_file}"
          else
            gh issue create \
              --title "Stranded integ stacks not reclaimed (>${ALARM_AGE_HOURS}h)" \
              --label "${LEAK_LABEL}" \
              --body-file "${body_file}"
          fi

      - name: Fail job on genuine leak
        if: steps.sweep.outputs.leaked != ''
        run: exit 1
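
The sweep step's two guards (the strict stack-name match and the age threshold that separates normal teardown lag from a leak) can be exercised locally without AWS credentials. The following is a minimal sketch, not part of the PR: the helper names `is_sweepable` and `classify_age` are hypothetical, and the epoch values are made up for illustration. Only `STACK_RE` and the comparison logic are taken from the workflow above.

```shell
#!/usr/bin/env bash
# Hypothetical local harness for the sweeper's guards (names are illustrative).
set -uo pipefail

# Same pattern as the workflow: int-<8 lowercase hex>, anchored at both ends.
STACK_RE='^int-[0-9a-f]{8}$'

# Succeeds only for names the sweeper would be allowed to delete.
is_sweepable() {
  [[ "$1" =~ $STACK_RE ]]
}

# Prints LEAK when a still-present stack is at least alarm_hours old, else WAIT.
# A created_epoch of 0 (unparseable creation time) never alarms, as in the workflow.
# Args: created_epoch now_epoch alarm_hours
classify_age() {
  local created_epoch="$1" now_epoch="$2" alarm_hours="$3"
  local age_secs=$(( now_epoch - created_epoch ))
  if [ "$created_epoch" -gt 0 ] && [ "$age_secs" -ge $(( alarm_hours * 3600 )) ]; then
    echo LEAK
  else
    echo WAIT
  fi
}

# Name guard: only the first candidate has the CI-generated shape.
for name in int-0a1b2c3d int-local integration-prod int-0a1b2c3; do
  if is_sweepable "$name"; then
    echo "sweep $name"
  else
    echo "skip  $name"
  fi
done

# Age guard with a 6h threshold: a 7h-old stack alarms, a 2h-old one does not.
classify_age 1000 $(( 1000 + 7 * 3600 )) 6   # prints LEAK
classify_age 1000 $(( 1000 + 2 * 3600 )) 6   # prints WAIT
```

Anchoring the pattern with `^` and `$` is what keeps names such as `integration-prod` or `int-local` out of the delete loop, even though both pass the `starts_with` prefilter.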