Skip to content

feat: generate operational runbooks in eject bundle#249

Draft
timtalbot wants to merge 3 commits into
mainfrom
eject-runbooks
Draft

feat: generate operational runbooks in eject bundle#249
timtalbot wants to merge 3 commits into
mainfrom
eject-runbooks

Conversation

@timtalbot
Copy link
Copy Markdown
Contributor

Summary

  • Adds two Go template-based runbook generators (day-to-day-ops.md and disaster-recovery.md) populated with workload-specific data (cloud, region, sites, cluster name)
  • Wires runbook generation into eject.Run() so ptd eject --dry-run writes runbooks to the output directory
  • Covers both AWS and Azure with cloud-conditional sections
  • Shows PTD CLI commands and direct infrastructure commands (kubectl, aws/az CLI) — no raw Pulumi commands

Closes #218

Test plan

  • go test ./eject/ -run Runbook — 19 tests pass (both clouds, all sections, no --auto-apply, no raw Pulumi)
  • ptd eject <target> --dry-run against a real workload — review generated runbooks for accuracy
  • Spot-check rendered markdown formatting

Add day-to-day-ops and disaster-recovery runbook templates that render
with workload-specific data (cloud, region, sites, cluster name). Wired
into eject.Run() so `ptd eject --dry-run` produces runbooks/ in the
output directory.
@claude
Copy link
Copy Markdown

claude Bot commented Apr 16, 2026

Claude finished @timtalbot's task —— View job


Review of PR #249: Operational Runbooks in Eject Bundle

  • Read review guidelines
  • Read changed files and PR diff
  • Check types and existing patterns
  • Analyze code and submit review

Overall this is solid work — clean templates, good cloud-conditional logic, and thorough test coverage (especially the no-pulumi and no-auto-apply guardrails). A few things worth addressing:


1. ClusterName is dead coderunbooks.go:22

RunbookData.ClusterName is defined in the struct and populated in test fixtures, but:

  • buildRunbookData() in eject.go never sets it
  • The templates use static <cluster-name> placeholders instead of {{.ClusterName}}
  • No tests assert on it

Either remove the field or wire it up so templates render the actual cluster name.

2. Azure resource group name skips sanitizationeject.go:99

data.ResourceGroup = fmt.Sprintf("rsg-ptd-%s", targetName)

The canonical implementation in lib/azure/target.go:ResourceGroupName() applies sanitizedName() (lowercase + strip non-alphanumeric chars). If targetName contains uppercase or special characters, this produces a wrong resource group name in the runbook. Consider reusing the sanitization logic or at minimum lowercasing.

3. Non-deterministic site orderingeject.go:93-94, eject.go:100-101

Iterating map[string]SiteConfig gives random ordering. The runbooks will list sites in a different order each run, making output harder to diff. Sort site names before appending:

names := maps.Keys(cfg.Sites)
slices.Sort(names)
for _, name := range names {
    site := cfg.Sites[name]
    data.Sites = append(data.Sites, SiteData{Name: name, Domain: site.Spec.Domain})
}

4. persistent_reprise step missing from step tablerunbooks.go:44-62

The step table covers bootstrap through sites but omits persistent_reprise (step 8 in the pipeline per the architecture docs). If it's relevant post-eject, include it; if not, a brief note explaining its omission would help operators.

5. RenderDayToDayOps / RenderDisasterRecovery unusedrunbooks.go:586-594

These exported functions are only called from tests. The actual code path uses GenerateRunbooks()renderTemplate(). If they're not intended as a public API for future callers, consider removing to keep the surface area small.


None of these are blocking — the runbook content and test coverage are well done.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review submitted via comment below due to tool permissions — no inline comments were attached to this review.

Address code review feedback on #249:
- Populate ClusterName from config (AWS: default_{name}-{release}-control-plane,
  Azure: {sanitized-name}-{release}) and render in kubeconfig commands
- Sort sites deterministically by name to produce stable output
- Apply Azure naming sanitization (lowercase, strip non-alphanumeric) to
  resource group name, matching lib/azure/target.go convention
- Remove false claims about S3/Azure blob versioning on state and data
  buckets — none of these have versioning enabled
- Remove false claims about Azure storage snapshots/soft-delete
- Keep accurate content: RDS 7-day backups, Azure PG default backups,
  FSx 30-day automatic backups
- Add "Prevention" notes suggesting customers enable versioning post-eject
- Remove redundant --dry-run + apply pattern from both runbooks — ptd
  ensure already shows a preview and prompts for confirmation
- Full rebuild uses bare `ptd ensure` instead of listing every step,
  which also avoids skipping custom steps
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Operational runbooks

1 participant