feat(cdk): integ-tests Phase 1 — core lifecycle E2E (#317)#348
feat(cdk): integ-tests Phase 1 — core lifecycle E2E (#317)#348ayushtr-aws wants to merge 20 commits into
Conversation
5168c27 to
aa69eca
Compare
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #348 +/- ##
=======================================
Coverage ? 88.21%
=======================================
Files ? 201
Lines ? 48945
Branches ? 4454
=======================================
Hits ? 43179
Misses ? 5766
Partials ? 0 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Add integ.task-lifecycle.ts: deploys the full AgentStack (orchestrator + AgentCore runtime + agent container) as backgroundagent-integ-lifecycle and drives a real agent through the four Cedar HITL terminal paths — COMPLETED (repo-less default/agent-v1), FAILED (coding/new-task-v1 against a nonexistent repo), and AWAITING_APPROVAL -> approve/deny (config.env write tripping the write_env_files soft-deny gate). Cost is bounded via low max_turns + max_budget_usd; teardown is forced on success and failure. Gate scenarios 3 & 4 read their sandbox repo + pre-seeded PAT secret from two top-of-file constants (placeholder until provisioned); the PAT is copied into the stack-created GitHubTokenSecret via a get/putSecretValue assertion, matching the documented operator flow (QUICK_START §4) with no agent.ts change. integ.yml: raise the integ job timeout 45->90m for the heavier full-stack deploy, extend the teardown safety-net to sweep both integ stacks, and add an always() sandbox bgagent/* branch-cleanup step gated on INTEG_SANDBOX_REPO / INTEG_PAT_SECRET_ID repo vars (no-op until set). Update the ROADMAP "Deployed runtime E2E verification" row (+ Starlight mirror).
…s-samples#317) Point SANDBOX_REPO at ayushtr-aws/abca-integ-sandbox; PRESEEDED_PAT_SECRET stays bgagent/integ/github-pat (now created in the E2E account 465528542731). CI repo vars INTEG_SANDBOX_REPO / INTEG_PAT_SECRET_ID set to match so the always() branch-cleanup step activates.
…ws-samples#317) The agent AgentRuntimeArtifact is built with the repo root as its asset/build context (agent.ts), and integ-runner writes its synth + snapshot output UNDER that root (cdk/test/integ/cdk-integ.out.<test>.ts[.snapshot]/). Staging the root then copied its own output dir into itself recursively until the path overflowed (ENAMETOOLONG), failing synth before any deploy. The .dockerignore already excluded cdk/cdk.out/ for the same recursive-include reason but missed the integ-runner dirs. Add them (mirrors .gitignore 70-71). Verified with `integ-runner --dry-run`: synth + asset staging now succeed (SUCCESS, 72s) where they previously errored.
…les#317) Two test-harness fixes from the first live run (no agent.ts changes): 1. integ-runner --verbose (cdk/mise.toml): the default one-line-per-test output hides which assertion failed and its actual-vs-expected payload. The lifecycle test polls DynamoDB for terminal task status; without verbose, a task stuck at SUBMITTED instead of COMPLETED is undiagnosable. Verbose surfaces the assertion diffs. 2. Retry-with-backoff teardown (integ.yml): the AgentCore Runtime runs in VPC mode and injects service-managed "agentic_ai" ENIs into the private subnets. Deleting the Runtime reclaims those ENIs ASYNCHRONOUSLY, so a single delete-stack races the reaper and fails DELETE_FAILED on the subnets + RuntimeSG. The first live run stranded the stack exactly this way; a manual re-delete minutes later succeeded. The teardown now loops delete→wait→ re-check with linear backoff (60s..6m) and only fails the step if all attempts are exhausted.
…atus (aws-samples#317) aws-samples#317 asks the terminal-state assertions to cover task_id, user_id, status, timestamps, and approval metadata. The first cut asserted only `status`. Add a `presentString` matcher (Match.objectLike on { S: stringLikeRegexp('.+') }) and extend every poll: - COMPLETED: + task_id, user_id, created_at, updated_at present - FAILED: + the above, plus error_message present (records WHY it failed) - AWAITING_APPROVAL (both gates): + task_id, user_id, awaiting_approval_request_id - APPROVED row: + task_id, request_id, user_id, decided_at (proves the owning caller recorded the decision, not just a status flip) - DENIED row: + the above, plus deny_reason present Runtime-generated values (user_id = Cognito sub UUID, ISO timestamps) are matched for presence, not pinned. Validated with integ-runner --force --dry-run (synth + assertion serialization pass).
…ws-samples#317) First live run diagnosed via the assertion-provider Lambda logs (which survive teardown). All three failures were test-side, not platform bugs: 1. Scenario 1 → 400 VALIDATION_ERROR "Task description was blocked by content policy". The terse imperative prompt ("Reply with exactly the single word: done. Do not use any tools.") tripped the Bedrock guardrail. Replaced with a benign natural-language request. 2. Scenarios 2/3/4 → 422 REPO_NOT_ONBOARDED. The submit path enforces a RepoTable onboarding gate BEFORE clone/preflight, which the plan never accounted for. RepoTable lives in-stack (RemovalPolicy.DESTROY), so it is fresh every run — onboard via putItem assertions (minimal active row) rather than adding a Blueprint construct to the production stack: - onboardFailRepo: onboards the (GitHub-nonexistent) repo so scenario 2 now passes admission and fails at CLONE (the intended terminal-FAILED path), not at the gate. - onboardSandbox: onboards SANDBOX_REPO so the gate scenarios reach the agent. Reads the RepoTableName CfnOutput; sequences both onboarding steps before their submits in the .next() chain. Validated with integ-runner --force --dry-run.
…ws-samples#317) integ-runner defaults stackUpdateWorkflow=true, which deploys the snapshot then re-deploys the current version to verify in-place updates. The AgentCore Runtime takes minutes to reach READY and is partly immutable, so the second phase races the first (Runtime still CREATING) → 409 "agent is currently being modified" → integ-runner aborts mid-deploy and teardown strands a CREATING Runtime (DELETE_FAILED). Observed live: 334/335 resources deployed, only the Runtime was mid-create when the run aborted ~95s in. This test validates runtime behavior, not stack-update safety, so a single clean deploy is correct. Set stackUpdateWorkflow: false on the IntegTest.
…ws-samples#317) The AgentCore Runtime injects service-managed `agentic_ai` ENIs that AWS releases asynchronously, so `cdk destroy` reliably fails the subnet/SG/VPC deletes (DependencyViolation) and strands the stack. With the old fixed name (backgroundagent-integ-lifecycle) a stranded stack BLOCKED the next run (name conflict) and falsely failed the run on teardown alone. Two changes: - Unique per-commit stack name `int-<short-sha>`, sourced from the COMMIT_HASH env var (read directly via process.env — integ-runner synths in a subprocess that inherits the env but not our shell's CDK context). CI sets COMMIT_HASH from the resolved head SHA; mise //cdk:integ falls back to the local git SHA. A stranded stack now never blocks a later run; an out-of-band ephemeral sweeper reclaims int-* stacks once their ENIs detach. - destroy.expectError:true (scoped to the dependency-violation message) so the expected ENI teardown failure no longer marks the whole run FAILED — the result reflects the ASSERTIONS, not the teardown race. integ.yml: pass COMMIT_HASH to the run; rework the safety-net teardown to target the per-run int-<sha> stack, best-effort (no retry-until-deleted, no job failure) since unique names mean stranded stacks are harmless and swept out of band. Verified: integ-runner --force --dry-run synthesizes `int-abcdef12` from COMMIT_HASH (was `int-local` before the env-read fix).
…aws-samples#317) Live run proved the platform works: scenario-1 task reached COMPLETED in 3.4s with all fields correct. But the assertion polled it 25× and timed out anyway. Root cause: nested Match.* matchers (Match.objectLike / Match.stringLikeRegexp, the field-presence strengthening added earlier) do NOT survive serialization into the waitForAssertions Step Functions waiter. The provider serializes the Match object's internals — {name, partial, pattern} — into the EXPECTED pattern, and the waiter then treats those as literal required keys that never exist on the DynamoDB row. Observed in the waiter log: "Missing key 'name' / 'partial' / 'pattern'", failing every poll despite status=COMPLETED being present. Fix: every polled assertion (waitForAssertions) now uses ONLY a flat, exact scalar — `Item.status.S` (or the APPROVED/DENIED decision string) — which serializes cleanly. Removed the unused Match import. Field-presence assertions (task_id/user_id/timestamps/approval metadata, aws-samples#317) need a separate non-polled getItem + assertAtPath — re-noted as a follow-up on aws-samples#317.
…y 401 (aws-samples#317) Live run: gate tasks failed preflight with 401 GITHUB_UNREACHABLE despite a valid PAT, so they never reached AWAITING_APPROVAL and the gate assertions timed out. Root cause: resolveGitHubToken (context-hydration.ts) caches the secret value for 5 min keyed by ARN. Scenario 2 (coding/new-task-v1) runs GitHub preflight, and in the old order it ran BEFORE the seedGet/seedPut step — so it read and cached the stack's INITIAL EMPTY GitHubTokenSecret. Every later gate task reused that cached empty token → 401 → FAILED before the Cedar gate could fire. (Scenario 2 itself still "passed" because it targets a nonexistent repo and fails regardless, hiding the poisoning. Scenario 1 is repo-less so its preflight short-circuits and never touches the token.) Fix: move seedGet→seedPut to immediately after auth, before ANY submit, so the secret is populated before the first token read and no empty value is ever cached. Stays true to the design (QUICK_START §4: operator populates the secret before submitting tasks) — no agent.ts change, no import-by-ARN deviation. Onboarding steps moved ahead of submits too.
…mples#317) The header + gate-config comments implied the test is pinned to a specific E2E account. Reword to reflect reality: it deploys to whatever account/region the caller's creds resolve to, config is name-based (not ARN-pinned), and note the PAT needs Contents+PR write for the approve scenario's git push.
…s-samples#317) osv.dev recently widened GHSA-h67p-54hq-rp68 to cover js-yaml <4.2.0 (was <4.x), so the transitive js-yaml@3.14.2 pulled in by @istanbuljs/load-nyc-config (Jest coverage tooling) now fails the required security-pr.yml dependency scan — main itself is currently red on this. Add a js-yaml "^4.2.0" resolution (same pattern the repo already uses for esbuild/vite/etc.). 4.2.0 only consolidates the existing 3.14.2 + 4.1.1 trees onto one version; load-nyc-config calls js-yaml.load(), unchanged across 3→4. Dev/test-only dependency — no source imports js-yaml, nothing shipped changes. Verified: osv-scanner → No issues found; cdk compile clean; full jest suite 2039/2039 pass.
aa69eca to
3f82e39
Compare
Review:
|
…-driven sandbox (aws-samples#317) Review fixes for PR aws-samples#348: - Token-TTL race (blocker): the single idToken minted at auth was reused for the approve/deny POSTs ~48 min later, past Cognito's 60-min default validity → risk of 401 → decision never recorded → false timeout. Re-mint a fresh token right before each gate POST (reAuthApprove/reAuthDeny in the .next() chain). Test-side only — no idTokenValidity change to the production app client. - Approval-row determinism (nit): queryApprove/queryDeny now filter status=PENDING so Items[0] is the live row even if a task trips the gate more than once or carries already-decided rows. - Env-driven sandbox config (nit): SANDBOX_REPO / PRESEEDED_PAT_SECRET now read INTEG_SANDBOX_REPO / INTEG_PAT_SECRET_ID (the CI vars the cleanup step already uses), falling back to the literals, so the gate scenarios bind to the running account's sandbox instead of a hardcoded contributor repo. integ.yml passes the vars to the run step. Verified: compile, eslint, zizmor, and integ-runner --force --dry-run all pass.
|
Thanks for the thorough review, @isadeks — really appreciate the depth. Pushed 🚫 # 2 Token-TTL race — fixed. Re-mint a fresh Cognito token immediately before each gate POST ( 🔧 🔧 Hardcoded sandbox — fixed. 🔧 PAT scope — confirmed: the sandbox PAT is a fine-grained token scoped to only the sandbox repo (Contents+PR write), not a classic/broad token. Verified locally: compile, eslint, zizmor, and |
…on leaks (aws-samples#317) Addresses PR aws-samples#348 blocker #1: the teardown comments claimed an 'ephemeral sweeper' reclaims stranded int-* stacks, but none existed — so failed-teardown stacks (the expected AgentCore agentic_ai ENI race) accumulated silently, each leaking a VPC + NAT gateway + interface endpoints + runtime in the shared account. New scheduled workflow .github/workflows/integ-sweeper.yml (every 2h): - lists all int-* CloudFormation stacks and best-effort delete-stacks each (idempotent; the ENIs have had time to detach by the time it runs, unlike the in-run teardown which races them); - for any int-* stack still undeletable past ALARM_AGE_HOURS (6h — comfortably past the observed ~1-2h ENI-release window, so normal lag never false-alarms), FAILS THE JOB and opens a tracking issue (mirrors security.yml's pattern), so a genuine leak surfaces loudly instead of hiding. Runs under environment: integ with OIDC (id-token: write) to assume the same AWS_ROLE_TO_ASSUME integ.yml uses — keeps the secret in a dedicated environment (zizmor secrets-outside-env clean). Issue body passed via env, not inline interpolation (zizmor template-injection clean). Also corrects the integ.yml + integ.task-lifecycle.ts comments to point at the now-real sweeper instead of overclaiming. Hybrid teardown story: in-run delete stays best-effort/quiet (expectError), the sweeper does reliable async cleanup + the loud alarm. Verified: YAML parse, bash -n, zizmor (new file 0 findings), compile, eslint.
|
🚫 # 1 Billable leak / missing sweeper — fixed in Added
Net teardown story is now a coherent hybrid: the in-run delete stays best-effort/quiet ( Security notes: the sweeper runs under All five review points (both blockers + three nits) are now addressed across |
Re-review — both blockers resolved ✅ (recommend approve once residuals are noted)Re-reviewed at head Blockers — resolved
Verified sound (design / security)
Non-blocking items
NetBoth original blockers properly resolved. Remaining: one environment-config decision (#1, partly a maintainer action) and three nits (#2–#4). Good to merge once #1 is decided and the nits are folded in or filed as follow-ups. |
isadeks
left a comment
There was a problem hiding this comment.
Approving — both blockers from the earlier review are resolved and verified at a5a188f:
- Billable leak → the new
integ-sweeper.ymlreclaims strandedint-*stacks and fails loudly + files an issue past the 6h threshold. - Token-TTL race → fresh
reAuthApprove/reAuthDenytokens are minted right before the gate POSTs. Items.0→ now filtered tostatus = PENDING.
Also confirmed sound: agent egress isolation (443-only SG + DNS firewall), least-privilege sweeper permissions, and the status-only assertions (field-presence correctly deferred since nested Match.* doesn't survive the polling waiter).
Remaining items are non-blocking follow-ups, not merge gates:
- The
integenvironment having no required reviewers is a repo-settings decision (maintainer action, out of this diff) — but worth a follow-up since the sweeper now depends on it staying approval-free and theinteg.ymlcomments overstate the approval gate. - Sweeper issue-spam dedup, the
int-delete-prefix breadth, and the personal public-repo fallback default — all nice-to-harden, none blocking.
Nice work on the teardown design and the token-race fix.
… correct env-gate comments (aws-samples#317) Re-review residuals from PR aws-samples#348 (isadeks): - Sweeper issue-spam (nit #2): dedup on a stable `integ-leak` label — comment on the existing open issue instead of filing a duplicate each 2h cycle. Label is created idempotently before use. - Sweeper prefix breadth (nit aws-samples#3): only sweep `int-<8 hex>` (the CI stack-name shape) via an authoritative regex guard, not a bare `int-*` glob, so an unrelated stack sharing the 4-char prefix is never deleted in the shared account. - Personal-repo fallback (nit aws-samples#4): drop the hardcoded ayushtr-aws/abca-integ-sandbox / secret literals. Gate scenarios 3 & 4 now run ONLY when INTEG_SANDBOX_REPO + INTEG_PAT_SECRET_ID are both set; otherwise they cleanly skip (warn) instead of routing a write-PAT clone into a contributor's personal repo. The gate assertion calls are constructed only inside the enabled branch so nothing is registered when skipped. - integ env comments (in-PR part of #1): correct comments that overstated the `integ` environment approval as the fork-code gate. The enforced gate is the `safe-to-test` label; the environment adds an approval ONLY if required reviewers are configured (currently none). Note that the cron-triggered sweeper depends on the env staying approval-free. Whether to add required reviewers is a maintainer/admin action outside this PR. Validated: compile, eslint, zizmor (both workflows clean), and integ-runner --force --dry-run pass with gates both disabled and enabled.
|
Thanks again, @isadeks — pushed 🔧 #2 Sweeper issue-spam — fixed. The leak-issue step now dedups on a stable 🔧 #3 Sweeper 🔧 #4 Personal public-repo fallback — fixed. Dropped the hardcoded 🚫/🔧 #1
One note re: the rebase onto the latest base — #109 just landed Validated on the rebased head: |
Summary
Phase 1 of the deploy-then-verify integration-test effort (#236). Builds on Phase 0 (#295) to drive a real agent through the full task lifecycle on a live
AgentStack, asserting the four terminal paths from the Cedar HITL E2E matrix (docs/design/CEDAR_HITL_GATES.md§15.3):default/agent-v1)coding/new-task-v1, clone fails)write_env_filessoft-deny gate)Closes #317.
Validation
Run end-to-end against a live deployed
AgentStack(deploy → 4 real agent runs → destroy, ~54 min): all 4 assertions pass. The approve scenario completed the full happy path — the agent cloned the repo, tripped the Cedar gate on aconfig.envwrite, was approved, resumed, pushed a branch, and opened a real PR on the sandbox repo. The deny scenario blocked the write and recordedDENIED.What's here
cdk/test/integ/integ.task-lifecycle.ts— the Phase 1 test: realAgentStack, Cognito auth chain, the 4 scenarios,waitForAssertionspolling, forced teardown..github/workflows/integ.yml— passCOMMIT_HASH, best-effort teardown of the per-run stack, raised timeout, sandbox branch cleanup.cdk/mise.toml—//cdk:integexportsCOMMIT_HASHand runs--verbose..dockerignore— exclude integ-runner output from the agent build context (fixes a recursive-copyENAMETOOLONGat synth).docs/guides/ROADMAP.md(+ Starlight mirror) — mark Phase 1 landed.Key design decisions
int-<commit-hash>(fromCOMMIT_HASH) — the AgentCore Runtime injects service-managedagentic_aiENIs that AWS releases asynchronously, socdk destroyreliably fails the VPC/subnet/SG deletes. A fixed name would let a stranded stack block the next run; unique names make stranded stacks harmless (swept out-of-band as ENIs release).destroy.expectErrortolerates the known teardown failure so the run reflects the assertions, not the teardown race.stackUpdateWorkflow: false— the Runtime's slow/immutable create makes integ-runner's default two-phase deploy race itself (409); a single clean deploy is correct for runtime-behavior validation.Match.*matchers don't survive serialization into thewaitForAssertionsStep Functions waiter (they leak{name,partial,pattern}into the expected pattern and never match).RepoTableputItem (the documented operator flow, noagent.tschange).Follow-ups (noted on #317)
getItem+assertAtPath(incompatible with the polling waiter).TaskEventsTableevent-shape assertions.Run policy,
integenvironment gate, concurrency, andinteg-smokesemantics are unchanged from Phase 0.