Skip to content

perf(ci): shard the CDK jest suite across a job matrix (#363 item #2) #365

Description

@scottschreckengaust

Summary

Implement item #2 of the build-performance umbrella (#363): shard the CDK Jest suite across a CI job matrix so the now-dominant //cdk:test (~298s after #357/#359) runs in parallel slices and aggregates into the single required build check.

Design is documented in docs/design/CI_BUILD_PERFORMANCE.md (PR #364, §"Item #2 — sharding the CDK suite").

Parent: #363. Predecessor: #357 (done, PR #359 merged).

Approach (fan-out + aggregate gate)

build-shard (matrix: shard ∈ [1..4])   # parallel; NOT individually required
        │
        ▼
build (needs: build-shard)             # single required context on merge_group
  • build-shard runs jest --shard=${{ matrix.shard }}/4 for the CDK suite (~119 test files) + uploads per-shard coverage.
  • A single build job needs: [build-shard], fails if any shard failed, and remains the one required status context (stable name regardless of shard count).
  • Non-test tasks (//cdk:synth:quiet, //cli:*, //docs:build, //agent:quality, drift checks) and the deploy artifact (cdk-agentcore-out) produced exactly once, off the shard critical path.
  • Cross-shard coverage merge before threshold enforcement (each shard sees only its slice).
  • Self-mutation/drift check runs once (aggregate job), not per-shard.

Acceptance criteria

  • CDK Jest runs as a 4-way shard matrix; all 2041 tests execute exactly once across shards (sum of per-shard counts = 2041).
  • The required build check remains a single stable context that reports on merge_group (feat(ci): require secrets/deps/SAST on every PR — make the merge gate enforceable (incident #313 class) #327) and fails if any shard fails.
  • Coverage thresholds (branches 82 / funcs 94 / lines 91 / stmts 91) enforced once on the merged report; not per-shard.
  • Deploy artifact cdk-agentcore-out still produced exactly once; deploy.yml unaffected.
  • Self-mutation / "Fail build on mutation" check still runs (once).
  • Real 4-core CI before/after posted: target slowest-shard + aggregate wall-clock ≈ ~170s (vs ~298s single), build step toward ~5–6 min.
  • No new flakiness; mise run build semantics preserved for local (non-CI) runs.

Implementer notes (from the design doc)

  • Overhead dominates past ~4 shards (~95s fixed per-job checkout/install/cache). 4-way is the start; measure before going higher; maximize cache hit-rate first.
  • --shard partitions by file count, not runtime — watch for a heavy suite skewing one shard.
  • Do not mark individual shard jobs required (context-name fragility / queue deadlock risk).

References


🤖 Generated with Claude Code

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions